This article was published on February 21, 2024

AI model Poro sets new milestones for multilingual LLMs in Europe

Silo AI is proving that 'we can build competitive models for low-resource languages'


Named after the Finnish word for “reindeer,” Poro is the first of a family of open-source multilingual LLMs. Silo AI is building the models alongside the University of Turku and the EU’s High Performance Language Technologies (HPLT) project.

Poro is a 34.2-billion-parameter model designed to process English, Finnish, and code. It has been trained on a dataset of 1 trillion tokens.

“What we are proving with Poro is that we can build competitive models for low-resource languages, like Finnish,” Peter Sarlin, co-founder and CEO of Silo AI, told TNW.

Sarlin explained that high-resource languages like English dominate generic LLMs. As a result, their capabilities in low-resource languages extend only about as far as translation and don’t represent the language and culture of a specific country.

According to the startup, Poro outperforms all existing open-source language models in Finnish, including Mistral, FinGPT, Llama, and the 176-billion-parameter BLUUMI model.

To achieve this, the team used a novel training approach that pairs Finnish with high-resource languages. It determined optimal data reuse frequencies for low-resource languages and integrated translated paired texts between Finnish and English. The method relies on cross-lingual signals to strengthen the model’s grasp of the connections between languages and, in turn, boost performance in Finnish without compromising it in English.
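For readers who want a feel for what data reuse and paired texts could look like in practice, here is a minimal sketch. It is not Silo AI's pipeline: the function name `build_mixture`, the `finnish_repeats` reuse factor, and the `<|en|>`/`<|fi|>` tags are all hypothetical choices made for illustration, and the article gives no actual values or formats.

```python
import random


def build_mixture(english_docs, finnish_docs, translation_pairs,
                  finnish_repeats, seed=0):
    """Return a shuffled list of training texts.

    Assumptions (not from the article): `finnish_repeats` is a
    hypothetical reuse factor, and the language tags below are an
    illustrative format for paired texts.
    """
    rng = random.Random(seed)

    # High-resource data is included once.
    mixture = list(english_docs)

    # The smaller Finnish corpus is repeated to increase its share.
    mixture += list(finnish_docs) * finnish_repeats

    # Each translation pair becomes one example holding both sides,
    # giving the model a direct cross-lingual signal.
    for english_text, finnish_text in translation_pairs:
        mixture.append(f"<|en|>{english_text}\n<|fi|>{finnish_text}")

    rng.shuffle(mixture)
    return mixture


example = build_mixture(
    english_docs=["An English document.", "Another English document."],
    finnish_docs=["Suomenkielinen dokumentti."],
    translation_pairs=[("hello", "hei")],
    finnish_repeats=3,
)
```

In this toy setup the single Finnish document appears three times among six training texts. How often low-resource data can be reused before returns diminish is the kind of question the team says it had to settle.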

Poro has also achieved another milestone: it’s the first multilingual model that has been trained on a EuroHPC supercomputer. “This is proof that we’re able to train LLMs on the AMD-based LUMI supercomputer, instead of an NVIDIA-based supercomputer,” Sarlin said.

A step towards European sovereignty

Open-source multilingual LLMs are key to ensuring language diversity, cultural representation, and democratic access in artificial intelligence. They’re also critical for Europe’s AI sovereignty.

“From a commercial perspective, these models build a baseline and infrastructure that allows European companies to innovate on top,” Sarlin noted. “This way companies can create IP, create competitive edge, and [create] great business that ensures that value stays in Europe with them.”

Poro is available for free under the Apache 2.0 License, which allows both commercial and research use. Silo AI is currently working on the Nordic languages (Swedish, Norwegian, Danish, and Icelandic), and plans to expand to all other official languages of the EU.
