Moroccan and international researchers develop Atlas-Chat, the first language model in Darija

Atlas-Chat is the first large language model tailored for Darija, or Moroccan Arabic, and it outperforms comparable models on this dialect. It was built from existing language resources and newly created datasets.

A team of researchers from Morocco and beyond has developed the first large language models designed specifically for Darija, or Moroccan Arabic. The name «Atlas-Chat» is a nod to the Atlas Mountains, a prominent symbol of Morocco.

Atlas-Chat can understand and generate text in Darija. In a research paper titled «Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect», published on ResearchGate in September, the researchers explain that the models were developed by integrating existing Darija language resources, creating new datasets, and carefully translating English instructions.

Atlas-Chat-9B response example 2 (the model can understand English instructions but responds only in Darija)

The paper also reports that the two models, «Atlas-Chat-9B» and «Atlas-Chat-2B», follow instructions in Darija better than both state-of-the-art general-purpose models and Arabic-specialized ones, including LLaMa, Jais, and AceGPT. The Atlas-Chat models can also perform standard Natural Language Processing (NLP) tasks, which involve interpreting, manipulating, and comprehending human language.

The findings also show that the 9B model achieved a «13% performance boost over a larger 13B model on DarijaMMLU», a benchmark in the team's newly introduced evaluation suite for Darija, which covers both discriminative and generative tasks.

Darija and low-resource languages

The study also observes that while large language models excel at understanding and using major languages, they often struggle with underrepresented languages, particularly Arabic dialects like Darija.

Arabic has a rich cultural history and a complex linguistic structure, yet most efforts to build Arabic-specialized models focus on bilingualism, balancing English and Modern Standard Arabic (MSA), and often overlook dialectal Arabic (DA).

Although DA is spoken by millions, little data is available for training large language models on it. To address this, the researchers created new datasets and evaluation benchmarks specifically for DA.

Atlas-Chat-9B response example 1.

The work was carried out by researchers from Mohamed bin Zayed University of Artificial Intelligence (United Arab Emirates), École des Mines de Rabat (Morocco), Université Mohammed VI Polytechnique (Morocco), KTH Royal Institute of Technology (Sweden), Atlas Institute for Artificial Intelligence (Morocco), and École Polytechnique (France).

To inspire the development of similar models for other low-resource languages, the researchers have made all of their resources publicly available.