GSMA and Pleias launch African-language AI model. (Image source: GSMA)

Pleias and the GSMA have introduced CommonLingua, a new open-source language identification model designed to significantly improve the processing of African language data at scale.

The model forms part of the GSMA’s 'AI Language Models in Africa, by Africa, for Africa' initiative, which brings together partners working to bridge the persistent gap in African language representation in artificial intelligence systems.

With more than 2,000 living languages spoken across the continent, Africa presents a uniquely diverse linguistic landscape. However, many of these languages remain poorly represented in AI datasets, leading to reduced accuracy in language identification systems, especially when handling closely related languages or mixed-language content. Accurately identifying a language is a critical first step before building models in languages such as Swahili, Yoruba or Wolof, yet this stage has often proven unreliable for African datasets.

A major reason for this challenge lies in the design of existing language identification tools such as fastText, GlotLID and OpenLID, which were primarily trained on high-resource European and Asian languages. As a result, African-language content is frequently misclassified, often labelled incorrectly as English or French. Even advanced models show a notable decline in performance, with accuracy dropping by around 30 percentage points when applied to African languages compared to widely used global languages.

CommonLingua is specifically developed to address this foundational limitation. On the CommonLID benchmark, it achieves an accuracy of 83% and a macro F1 score of 0.79, surpassing leading models by more than 10 percentage points under similar testing conditions. Notably, it does so with a significantly smaller footprint, using approximately one three-hundredth of the parameters required by comparable systems. The model contains just 2 million parameters and is distributed as an 8 MB checkpoint, allowing efficient deployment across different environments. It can process around 20 text samples per second on a CPU and up to 3,000 text samples per second on a single GPU.
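The benchmark figures above pair accuracy with macro F1, and the two can diverge sharply when some languages are rare. The following is a minimal sketch of the standard definitions, not the CommonLID evaluation code; the function names and the toy labels (ISO codes `eng` and `wol`) are illustrative assumptions and do not come from the benchmark.

```python
def accuracy(gold, pred):
    """Fraction of samples whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1, so every language counts equally."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treated as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): a classifier that labels everything as English.
gold = ["eng"] * 8 + ["wol"] * 2
pred = ["eng"] * 10

print(f"accuracy = {accuracy(gold, pred):.2f}")
print(f"macro F1 = {macro_f1(gold, pred):.2f}")
```

In this toy case accuracy stays high because English dominates the sample, while macro F1 falls because every Wolof text is missed. That is why a macro-averaged score is informative for low-resource languages.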

The model supports a total of 334 languages, including 61 African languages spanning eight major language families. These include Bantu, Niger-Congo and West African, Afro-Asiatic and Semitic, Cushitic and Chadic, Berber, Nilo-Saharan, as well as various pidgins and creoles. By operating directly on UTF-8 byte sequences rather than relying on language-specific tokenisation, CommonLingua ensures consistent performance across multiple scripts such as Latin, Arabic, Ethiopic, N’Ko and Tifinagh.
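To illustrate what operating on UTF-8 bytes means in practice, the sketch below converts text in several scripts into integer byte values. It is not CommonLingua's actual preprocessing, which the announcement does not describe; the helper name `to_byte_ids` and the sample words are assumptions chosen for illustration.

```python
def to_byte_ids(text):
    """Encode text as UTF-8 and return the byte values (each in 0..255)."""
    return list(text.encode("utf-8"))


# Sample words in different scripts (illustrative only).
samples = {
    "Latin": "Habari",
    "Arabic": "سلام",
    "Ethiopic": "ሰላም",
    "Tifinagh": "ⵜⴰⵎⴰⵣⵉⵖⵜ",
}

for script, word in samples.items():
    ids = to_byte_ids(word)
    # Every script maps into the same 256-symbol alphabet; only the
    # number of bytes per character differs.
    print(f"{script}: {len(word)} characters -> {len(ids)} bytes, first ids {ids[:6]}")
```

Because every input reduces to values between 0 and 255, a byte-level model needs no per-language vocabulary or tokeniser. The trade-off is that non-Latin scripts take two or three bytes per character, so their sequences are longer.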

“African languages are not an edge case. They are the working languages of hundreds of millions of people, and they deserve AI infrastructure built with the same care as any other language. CommonLingua is deliberately the first brick we are laying: you cannot curate what you cannot identify,” said Pierre-Carl Langlais, co-founder and chief technology officer, Pleias.

The model has been trained entirely on open-licensed and public domain datasets compiled through the Common Corpus project. These sources include Wikipedia, OpenAlex, VOA Africa, WaxalNLP, Cultural Heritage collections and Pralekha, all released under permissive licensing frameworks.

Louis Powell, director of AI Initiatives at GSMA, added, “Closing the gap in African-language AI is fundamental to digital inclusion and unlocking economic opportunity. Progress has long been held back by the lack of foundational infrastructure, beginning with something as essential as language identification. CommonLingua addresses this critical gap, enabling the development of richer datasets and more representative AI systems at scale. Through our initiative, the GSMA is bringing partners together to move beyond fragmented efforts towards shared infrastructure that can power Africa’s digital ecosystem.”

The discussion around advancing African-language AI will continue at MWC26 Kigali, where GSMA and its partners will convene industry stakeholders to accelerate collaboration and innovation in this space.