Building a Mixed-Dialect Arabic Dataset for Summarization: MSA and Moroccan Darija

Authors: Abir Harrasse, Yassmine ED-DYB
April 7, 2025

In this post, we'll walk through how we created a specialized dataset for fine-tuning a small language model to summarize both Modern Standard Arabic (MSA) texts and Moroccan dialectal Arabic (Darija). We'll share the practical challenges we faced, our solutions, and provide code snippets for key steps in the process.

Introduction

Fine-tuning small language models (SLMs) for summarization has gained significant attention due to the increasing need for efficient and effective text summarization techniques. This blog post focuses on our approach to creating a dataset that combines Modern Standard Arabic and dialectal Arabic for a summarization task.

Our primary goal was to develop a dataset that can be used to fine-tune a model capable of generating high-quality summaries while operating within the constraints of a Google Colab free-tier GPU.

Dataset Selection and Preparation

For our project, we aimed to create a dataset that enables an SLM to summarize both Modern Standard Arabic (MSA) texts and dialectal Arabic. Given that our dataset consists of only 5,000 documents, we decided to focus specifically on the Moroccan dialect, Darija.

We constructed our dataset using the following composition:

  1. Arabic web content (FineWeb2): 60%
  2. Arabic educational content (Arabic Wikipedia): 20%
  3. Moroccan Darija: 20%

This distribution was deliberately chosen to reflect real-world usage patterns: web content forms the majority as it's the most common source for summarization tasks, while educational content (Wikipedia) provides more structured formal language, and dialectal content ensures the model can handle local variations in Arabic.

Darija Samples (20%)

We extracted Darija content from several open-source datasets:

  1. Atlasia's No-Arabic-Dialect-Left-Behind
  2. JasperV13's Darija_Dataset
  3. MBZUAI-Paris's DarijaStory

Our first step was to analyze the dialect distribution in our initial dataset:

Figure 1: Dialect distribution across the No-Arabic-Dialect-Left-Behind dataset

Based on this analysis, we decided to focus on Moroccan Darija. After filtering for Moroccan dialect texts, we analyzed the length distribution:

Figure 2: Length distribution of Moroccan dialect samples

Our filtering process included:

  1. Removing any Latin words from the texts to retain only Arabic content
  2. Analyzing text length to identify suitable candidates for summarization
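As an illustration of the first step, here is a minimal Latin-word filter. Treat it as a sketch: the exact heuristics we applied may have differed slightly.

import re

LATIN_RE = re.compile(r"[A-Za-z]")

def remove_latin_words(text: str) -> str:
    # Drop any whitespace-separated token that contains a Latin character,
    # keeping only the Arabic-script content.
    return " ".join(tok for tok in text.split() if not LATIN_RE.search(tok))

# Toy usage on a code-switched Darija sentence:
example = "واش نتا جديد hahaha فهاد lgroupe ولا لا"
print(remove_latin_words(example))
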

When working with the Darija_Dataset from JasperV13, we followed the same approach:

Figure 3: Length distribution of samples in the initial JasperV13/Darija_Dataset

Figure 4: Length distribution of samples after removing Latin words

Initially, we selected the 300 longest documents for annotation. However, annotating these long documents proved time-consuming (approximately 4 hours) because each text had to be split into many chunks before it could be fed to the annotation model.

After this experience, we revised our strategy and set a maximum character limit of 5000 for all documents in our dataset.

Arabic Web Content Samples (60%)

For this major portion of our dataset, we focused exclusively on the Arabic FineWeb2 dataset by Ali Elfilali. This dataset has already undergone extensive cleaning, filtering, and deduplication, which saved us significant preprocessing time.

Our process included:

  1. Analyzing the length distribution of the dataset
  2. Filtering out texts containing Latin characters
  3. Setting a maximum threshold of 5000 characters, consistent with our approach for Darija

Figure 5: Length distribution of samples in the initial Arabic FineWeb2 data

Figure 6: Length distribution of samples after removing Latin words
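For readers who want to reproduce this step, here is a minimal sketch using the Hugging Face datasets library. The repository id is a placeholder and the "text" column name is an assumption; the filters we actually applied may have differed slightly.

import re
from datasets import load_dataset

LATIN_RE = re.compile(r"[A-Za-z]")

# Placeholder repository id: substitute the Arabic FineWeb2 subset you actually use.
fineweb_ar = load_dataset("path/to/arabic-fineweb2", split="train")

# Keep documents with no Latin characters and at most 5,000 characters.
fineweb_ar = fineweb_ar.filter(
    lambda ex: len(ex["text"]) <= 5000 and not LATIN_RE.search(ex["text"])
)
print(fineweb_ar)
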

Arabic Educational Content Samples (20%)

For this section, we used an Arabic Wikipedia dump curated by Saied Alshahrani. Our process was similar to the previous sections:

  1. Exploring the length distribution of the data
  2. Removing texts with Latin words
  3. Applying the FineWeb2 pipeline to filter out data with excessive n-gram repetition, line repetition, or punctuation repetition

Figure 7: Length distribution of samples in the initial Arabic Wikipedia data

Figure 8: Length distribution of samples after removing Latin words
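To make the third filtering step concrete, here is a rough sketch of repetition heuristics in the spirit of the FineWeb2 pipeline. The exact statistics and thresholds of the official pipeline differ, so treat the values below as illustrative assumptions.

from collections import Counter

def repetition_stats(text: str, n: int = 3):
    # Fraction of duplicated (non-empty) lines.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    dup_line_frac = 1 - len(set(lines)) / len(lines) if lines else 0.0

    # Fraction of the text covered by the single most frequent word n-gram.
    words = text.split()
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    top_ngram_frac = Counter(ngrams).most_common(1)[0][1] * n / len(words) if ngrams else 0.0

    return dup_line_frac, top_ngram_frac

def passes_repetition_filter(text: str, max_dup_line: float = 0.3, max_top_ngram: float = 0.2) -> bool:
    # Thresholds are illustrative assumptions, not the official FineWeb2 values.
    dup_line_frac, top_ngram_frac = repetition_stats(text)
    return dup_line_frac <= max_dup_line and top_ngram_frac <= max_top_ngram
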

Dataset Annotation

Model Selection

We chose to perform synthetic annotation, using a large language model to generate summaries for our dataset. The constraints of the free-tier Colab environment limited our model selection.

In theory, the free-tier Colab limitation meant we could not use models larger than roughly 20B parameters under 4-bit quantization. In practice, even models well below that ceiling were difficult to run reliably within the available memory.

After experimentation, we selected Jais 13B as our annotation model, as it offered the best balance between summary quality and feasibility within our constraints. Even so, generating reliable summaries pushed the available GPU memory to its limit, which we managed through quantization, aggressive cache clearing, and garbage collection between generations.
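For reference, a model of this size can be loaded in 4-bit precision with transformers and bitsandbytes roughly as follows. The repository id and quantization settings below are assumptions rather than our exact configuration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "inceptionai/jais-13b"  # assumed repository id; adjust to the checkpoint you use

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",           # place layers automatically on the available GPU
    trust_remote_code=True,      # Jais ships custom modeling code
)
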

Annotation Process

For annotation, we used an Alpaca-format prompt, structured as follows (the instruction asks the model to summarize the following article concisely and clearly in fewer than 50 words):

### Instruction: قم بتلخيص المقال التالي بطريقة مختصرة وواضحة في أقل من 50 كلمة:

{text}

### الملخص:

For longer texts, we implemented chunking with a maximum of 1700 tokens, given that our model has a context size of 2048 tokens. Here's the key code we used for our annotation process:
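The snippets below rely on the model and tokenizer loaded earlier, together with a small count_tokens helper that is not shown. A minimal version, assuming the standard Hugging Face tokenizer, could be:

import gc
import time

import torch

def count_tokens(text: str) -> int:
    # Number of tokens the model's tokenizer produces for this text.
    return len(tokenizer.encode(text))
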

def safe_generate(text, max_input_tokens=1700):
    # Truncate inputs that exceed the token budget reserved for the article text.
    token_count = count_tokens(text)
    if token_count > max_input_tokens:
        print(f"Input too long ({token_count} tokens). Truncating to {max_input_tokens} tokens.")
        tokens = tokenizer.encode(text)
        text = tokenizer.decode(tokens[:max_input_tokens])

    # Free cached GPU memory before each generation to reduce OOM risk.
    torch.cuda.empty_cache()
    gc.collect()

    # Alpaca-style prompt: summarize the article in fewer than 50 words.
    prompt = f"### Instruction: قم بتلخيص المقال التالي بطريقة مختصرة وواضحة في أقل من 50 كلمة:\n\n{text}\n\n### الملخص:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,   # summaries are short, so 100 new tokens is enough
            do_sample=False,      # greedy decoding for deterministic summaries
            num_beams=1
        )

    full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Keep only the text generated after the "### الملخص:" (summary) marker.
    if "### الملخص:" in full_response:
        summary = full_response.split("### الملخص:")[1].strip()
    else:
        summary = full_response.replace(prompt, "").strip()

    # Release tensors and clear the cache again before returning.
    del inputs, outputs
    torch.cuda.empty_cache()
    gc.collect()

    return summary

For longer documents exceeding the token limit, we implemented a chunking mechanism:

def process_article(text, max_chunk_tokens=1700):
    # Short texts can be summarized directly.
    if count_tokens(text) <= max_chunk_tokens:
        return safe_generate(text, max_chunk_tokens)

    # Otherwise, split the token sequence into fixed-size chunks.
    tokens = tokenizer.encode(text)
    chunks = []
    for i in range(0, len(tokens), max_chunk_tokens):
        chunk_tokens = tokens[i:i + max_chunk_tokens]
        chunks.append(tokenizer.decode(chunk_tokens))

    print(f"Split into {len(chunks)} chunks")

    # Summarize each chunk independently, then concatenate the partial summaries.
    summaries = []
    for chunk in chunks:
        summaries.append(safe_generate(chunk, max_chunk_tokens))

        torch.cuda.empty_cache()
        gc.collect()
        time.sleep(2)  # short pause between chunks to prevent GPU OOM errors

    return " ".join(summaries)

We also observed that, due to the quantization of our model, summary quality degraded as the input length increased, which further justified our chunking approach.
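As a rough illustration of how the per-document annotation can be driven over the full corpus, here is a minimal sketch. The storage format, file names, and checkpointing scheme are assumptions, not our exact pipeline.

import pandas as pd

# Assumed input: the 5,000 filtered documents with a "text" column.
df = pd.read_parquet("filtered_documents.parquet")

summaries = []
for idx, text in enumerate(df["text"]):
    summaries.append(process_article(text))
    # Periodically persist partial results so a Colab disconnect doesn't lose hours of work.
    if (idx + 1) % 100 == 0:
        pd.DataFrame(
            {"text": df["text"].iloc[: idx + 1].tolist(), "summary": summaries}
        ).to_parquet("summaries_checkpoint.parquet")

df["summary"] = summaries
df.to_parquet("annotated_dataset.parquet")
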

To ensure quality, we validated the generated summaries by:

  1. Checking their lengths to ensure they were appropriately concise
  2. Manually reviewing a random sample to verify they captured the main points of the original text
  3. Ensuring they maintained the same language variety as the source (MSA or Darija)
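The first and third checks can be partly automated. Here is a simple sketch: the 50-word limit mirrors the prompt, while the Arabic-script ratio is a rough heuristic of our own rather than a full dialect check.

import re

ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")

def summary_checks(summary: str, max_words: int = 50) -> dict:
    words = summary.split()
    visible = summary.replace(" ", "")
    arabic_ratio = sum(bool(ARABIC_CHAR.match(ch)) for ch in visible) / max(len(visible), 1)
    return {
        "word_count": len(words),
        "within_limit": len(words) <= max_words,   # check 1: conciseness
        "arabic_ratio": arabic_ratio,              # check 3 (rough): stayed in Arabic script
    }
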

We generated summaries for all 5,000 documents and stored them in our mixed-darija-msa-summarization dataset.

Data Splitting

To ensure proper structure for fine-tuning, we performed stratified data splitting, allocating 80% of the data for training and 20% for testing, with a fixed random seed to ensure reproducibility.

Stratifying the data ensures that the proportion of each category (Darija, MSA from web, MSA from Wikipedia) remains consistent between the train and test sets.
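A minimal sketch of this split with scikit-learn, assuming the annotated documents sit in a DataFrame df with a "category" column recording each document's source (the column name and seed value are assumptions):

from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(
    df,
    test_size=0.2,              # 80/20 split
    stratify=df["category"],    # keep source proportions identical in both splits
    random_state=42,            # fixed seed for reproducibility
)
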

Figure 9: Category distribution across the train and test sets after stratified sampling

Challenges and Limitations

Throughout the dataset creation process, we encountered several challenges:

  1. Computational constraints: Running the 13B parameter model required careful memory management. We implemented aggressive garbage collection and added delays between processing chunks to prevent GPU out-of-memory errors.
  2. Text length management: Many texts in our initial dataset exceeded the model's context window, requiring us to implement chunking. This added complexity to the annotation process and potentially affected summary quality for very long documents.
  3. Dialect representation: Finding high-quality, clean Darija text was challenging, as many datasets mixed Arabic script with Latin characters or contained inappropriate content.
  4. Dataset limitations: Our 5,000 document dataset, while substantial, may not represent all variations of Arabic dialects. We focused specifically on Moroccan Darija, which limits the model's applicability to other Arabic dialects.

Conclusion

By carefully selecting, filtering, and annotating our data, we've created a balanced dataset of 5,000 documents combining Moroccan Darija and Modern Standard Arabic texts from both web content and educational sources. This dataset provides a solid foundation for fine-tuning a small language model for Arabic summarization tasks.

The code and methodologies we've shared should be adaptable to other languages with a similar diglossic situation (a formal standard alongside dialectal varieties), making this approach valuable beyond just Arabic NLP.

In the next part of this series, we'll cover the model selection, fine-tuning process, and evaluation of our summarization model.

References

  1. Saied Alshahrani. Arabic Wikipedia Dataset (2023-01-01)
  2. Atlasia. No-Arabic-Dialect-Left-Behind
  3. Ali Elfilali. FineWeb2 Arabic Subset
  4. JasperV13. Darija Dataset
  5. MBZUAI-Paris. DarijaStory Dataset
  6. Guilherme Penedo et al. FineWeb2: A sparkling update with 1000s of languages