African Languages for AI: Unlocking the Potential (2025)

Imagine a world where AI assistants like ChatGPT or Siri can't understand or respond in the languages spoken by over a billion people across Africa. That disconnect is real today, but a major initiative is working to change it. This article explores the African Next Voices project, which is assembling what may be the most extensive collection of African language data yet for AI development.

AI technologies such as ChatGPT, DeepSeek, Siri, and Google Assistant are predominantly crafted in the Global North, relying heavily on training data from English, Chinese, and various European languages. In stark contrast, African languages are severely underrepresented on the internet and in digital spaces. A dedicated group of African computer scientists, linguists, language experts, and other specialists has been tirelessly working to incorporate African languages into AI training. The African Next Voices project, backed primarily by the Gates Foundation (with additional support from Meta), and involving a collaborative network of African universities and organizations, has just unveiled what experts believe is the largest dataset of African languages tailored for AI purposes so far. The Conversation reached out to the team for insights, drawing from their hubs in Kenya, Nigeria, and South Africa.

Why is language so crucial for AI? Language serves as our primary channel for interaction, seeking assistance, and preserving communal meanings. It's the tool we employ to articulate intricate ideas and exchange knowledge. When communicating with AI, language is the bridge that conveys our needs and lets us judge whether the machine truly grasps our intent. We're witnessing a surge in AI-driven applications spanning education, healthcare, and agriculture. These systems, known as large language models (LLMs), are built on vast amounts of linguistic data, yet they exist in just a handful of the world's languages. Moreover, languages embody culture, ethics, and indigenous wisdom. Without native language support, AI struggles to accurately interpret our intentions, leading to distrust in its responses. Essentially, without shared language, AI and humans can't effectively dialogue; it's the cornerstone for AI to genuinely serve people. Restricting AI to certain languages means overlooking the richness of most human cultures, histories, and knowledge bases. AI's inability to engage in African languages isn't just a technical gap; it's an exclusionary force that perpetuates inequality.

Why are African languages underrepresented, and what does that mean for AI? The evolution of languages is deeply linked to human histories. Peoples affected by colonialism and imperialism often saw their languages sidelined and underdeveloped compared to those imposed by colonizers. As a result, African languages are less documented, including online. This leads to a shortage of high-quality, digitized text and audio for training and testing reliable AI models. The gap stems from longstanding policies that favored colonial languages in education, media, and governance. Beyond raw data, essential resources like dictionaries, specialized terminologies, and glossaries are scarce. Numerous challenges inflate the costs of dataset creation, such as the absence of tailored keyboards, fonts, spellcheckers, and tokenizers (tools that segment text into units AI can process). Regional spelling variations, tone notations, and diverse dialects further complicate matters. The outcome is AI that underperforms and can even be unsafe, with frequent errors in translation, transcription, and language recognition for African tongues. Consider, for instance, how a mistranslation in healthcare advice could have dire real-world consequences. Some argue that prioritizing colonial languages was a necessary step toward global integration, but that view ignores the cultural erasure it caused, and it raises an uncomfortable question: should colonized regions be expected to adapt to imposed languages forever? The debate highlights a real tension between technological reach and cultural preservation.
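The tokenizer problem above can be made concrete with a rough illustration. Many subword tokenizers fall back to byte-level symbols for text that wasn't well represented in their training data, and because tone-marked characters in languages such as Yoruba occupy multiple UTF-8 bytes, the same amount of text costs more tokens. This is a simplified sketch of that effect, not any production tokenizer:

```python
# Simplified illustration of byte-level fallback (not a real tokenizer):
# when no learned subword covers a piece of text, many tokenizers fall
# back to one symbol per UTF-8 byte, so diacritic-heavy words cost more
# symbols than plain-ASCII words of similar length.

def byte_fallback_length(text: str) -> int:
    """Number of byte-level symbols needed when no subword merge applies."""
    return len(text.encode("utf-8"))

english = "language"    # 8 characters, all single-byte ASCII
yoruba = "èdè Yorùbá"   # 10 characters, several carrying tone marks

print(byte_fallback_length(english))  # 8 bytes for 8 characters
print(byte_fallback_length(yoruba))   # 14 bytes for 10 characters
```

In a real model this inflation compounds: longer token sequences mean higher inference cost and less context per request for speakers of underrepresented languages, which is one reason dedicated tokenizers and training data matter.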

Practically, this shortfall bars many Africans from accessing global news, educational resources, medical information, and the efficiencies AI offers—in their mother tongues. When a language is absent from training data, its speakers are effectively erased from AI products, rendering them unsafe, impractical, or unjust. This widens the digital chasm, sidelining millions and hindering vital services. To illustrate, picture a farmer unable to use AI for crop advice in Swahili or a student struggling with non-native educational tools.

What is your project accomplishing, and how? Our core goal is to gather speech data for automatic speech recognition (ASR), a vital technology for oral languages that transforms spoken words into written text. Think voice-to-text on your phone, but tailored for diverse African contexts. The initiative also investigates data collection methods and the volume of data required for effective ASR tools, sharing lessons across regions. By design, our data encompasses a wide spectrum: impromptu and scripted speech across domains like casual conversation, healthcare, finance, and farming. We ensure diversity in participants across ages, genders, and education levels. Every recording adheres to ethical standards, including informed consent, fair payment, and clear data ownership agreements. Transcriptions follow language-specific protocols with rigorous technical validation. In Kenya, via the Maseno Centre for Applied AI, we're amassing voice samples for five languages covering the major groups: Nilotic (Dholuo, Maasai, Kalenjin), Cushitic (Somali), and Bantu (Kikuyu). Data Science Nigeria is collecting data for five widely spoken languages, Hausa, Igbo, Nigerian Pidgin, Yoruba, and, in collaboration with RobotsMali, Bambara (spoken mainly in Mali), aiming to mirror authentic community usage. In South Africa, through the Data Science for Social Impact lab and partners, we're documenting seven local languages to capture the nation's linguistic tapestry: isiZulu, isiXhosa, Sesotho, Sepedi, Setswana, isiNdebele, and Tshivenda. This effort builds on the foundations laid by groups like the Masakhane Research Foundation network, Lelapa AI, Mozilla Common Voice, EqualyzAI, and countless innovators driving African language AI. Together, these initiatives foster a burgeoning ecosystem, making African languages integral to the AI era.
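The collection protocol described above (domain coverage, speaker diversity, informed consent, per-language transcription conventions) can be sketched as a minimal metadata record. The field names here are illustrative assumptions for the sake of the sketch, not the project's actual schema:

```python
from dataclasses import dataclass

# Illustrative sketch of one ASR recording's metadata, modelled on the
# protocol described in the text. All field names and the example file
# layout are assumptions, not the African Next Voices format.

@dataclass
class AsrRecording:
    audio_path: str         # e.g. "dholuo/health/0001.wav" (hypothetical layout)
    language: str           # ISO 639-3 code, e.g. "luo" for Dholuo
    transcript: str         # text following the language-specific convention
    domain: str             # "conversation", "healthcare", "finance", "agriculture"
    scripted: bool          # scripted prompt vs. spontaneous speech
    speaker_age_band: str   # coarse band rather than exact age, for privacy
    speaker_gender: str
    consent_given: bool = False

    def is_usable(self) -> bool:
        """A recording enters the dataset only with consent and a transcript."""
        return self.consent_given and bool(self.transcript.strip())

rec = AsrRecording(
    audio_path="dholuo/health/0001.wav",
    language="luo",
    transcript="example transcript text",
    domain="healthcare",
    scripted=False,
    speaker_age_band="25-34",
    speaker_gender="female",
    consent_given=True,
)
print(rec.is_usable())  # True
```

Making consent and transcript validity explicit gate conditions, as the `is_usable` check does, mirrors the ethical standards the team describes: a recording without documented consent simply never reaches the training set.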

How can this be applied? The datasets and models will enable features like captioning media in local languages, voice assistants for farming and medical queries, and multilingual customer support. They'll also support cultural archiving. Wider availability of African language resources will connect text and speech technologies, and the models will extend to practical uses in chatbots, teaching aids, and community services. We're envisioning complete toolsets, including spellcheckers, dictionaries, translation apps, and summarizers, that keep African languages vibrant online. Ultimately, by combining ethically sourced, high-quality speech data with advanced models, we enable natural, accurate AI interactions in everyday languages. Imagine, for example, a doctor using AI to transcribe patient consultations in Yoruba without errors.

What's on the horizon for the project? This phase focused on speech data for select languages, but what about the rest? And what of other AI capabilities like machine translation or grammar correction? We'll continue expanding to more languages, creating models that authentically represent African usage, with an emphasis on compact, energy-efficient models suited to local needs. The big hurdle now is integration: weaving these elements into cohesive systems for real-world deployment, not just isolated experiments. Data collection is merely the start; benchmarking, reusability, and community ties are paramount. We'll link our ASR standards with other African initiatives. Sustainability is critical: providing ongoing access to computing power, training resources, and frameworks like the Nwulite Obodo Open Data License (NOODL) or the Esethu Framework for students, researchers, and entrepreneurs. The ultimate goal is for farmers, educators, and entrepreneurs to engage AI in isiZulu, Hausa, or Kikuyu, rather than defaulting to English or French. A broader question remains: in a field dominated by global tech giants, should those companies bear more responsibility for inclusive AI, or will local projects continue to fill the gap?

  • Vukosi Marivate is chair of data science, professor of computer science, director AfriDSAI at the University of Pretoria; Iyanuoluwa Adebara is assistant professor at the University of Alberta; Lincoln Wanzare is a lecturer and chair of the Department of Computer Science at Maseno University