In this post, we'll walk through how we created a specialized dataset for fine-tuning a small language model to summarize both Modern Standard Arabic (MSA) texts and Moroccan dialectal Arabic (Darija). We'll share the practical challenges we faced, our solutions, and provide code snippets for key steps in the process.
Fine-tuning small language models (SLMs) for summarization has gained significant attention as the demand for efficient, high-quality text summarization grows. This blog post focuses on our approach to creating a dataset that combines Modern Standard Arabic and dialectal Arabic for a summarization task.
Our primary goal was to develop a dataset that can be used to fine-tune a model capable of generating high-quality summaries while operating within the constraints of a Google Colab free-tier GPU.
For our project, we aimed to create a dataset that enables an SLM to summarize both Modern Standard Arabic (MSA) texts and dialectal Arabic. Given that our dataset consists of only 5,000 documents, we decided to focus specifically on the Moroccan dialect, Darija.
We constructed our dataset using the following composition:
This distribution was deliberately chosen to reflect real-world usage patterns: web content forms the majority as it's the most common source for summarization tasks, while educational content (Wikipedia) provides more structured formal language, and dialectal content ensures the model can handle local variations in Arabic.
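As a rough sketch of how such a mix can be assembled with the `datasets` library (the variable names below are illustrative placeholders for the three filtered subsets described later in this post, not our exact code):

```python
from datasets import concatenate_datasets

# web_ds, wiki_ds, and darija_ds stand for the three filtered subsets described
# below; the variable names and the shuffle seed are illustrative.
mixed = concatenate_datasets([web_ds, wiki_ds, darija_ds]).shuffle(seed=42)
print(len(mixed))  # should equal 5000 once the subsets are sized as intended
```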
We extracted Darija content from several open-source datasets:
Our first step was to analyze the dialect distribution in our initial dataset:
Figure 1: Dialect distribution across the No-Dialect-Left-Behind dataset
Based on this analysis, we decided to focus on Moroccan Darija. After filtering for Moroccan dialect texts, we analyzed the length distribution:
Figure 2: Length distribution of Moroccan dialect samples
Our filtering process included:
When working with the Darija_Dataset from JasperV13, we followed the same approach:
Figure 3: Length distribution of samples in the initial JasperV13/Darija_Dataset
Figure 4: Length distribution of samples after removing Latin words
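The Latin-word removal step referred to in the figures can be implemented with a simple regular expression. Here is a minimal sketch; the exact cleaning rules we applied may have differed slightly:

```python
import re

LATIN_RE = re.compile(r"[A-Za-z]+")     # runs of Latin letters
EXTRA_SPACE_RE = re.compile(r"\s{2,}")  # whitespace left behind after removal

def remove_latin_words(text: str) -> str:
    """Strip Latin-script tokens (URL fragments, code-switched words, etc.)."""
    cleaned = LATIN_RE.sub(" ", text)
    return EXTRA_SPACE_RE.sub(" ", cleaned).strip()

# Example on a code-switched Darija sentence
print(remove_latin_words("كنحبك bezzaf يا صاحبي"))  # -> "كنحبك يا صاحبي"
```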
Initially, we selected the 300 longest documents for annotation. However, annotating these long documents proved time-consuming (approximately 4 hours), because each text had to be split into many chunks before being fed to the annotation model.
After this experience, we revised our strategy and capped all documents in our dataset at 5,000 characters.
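Applied with the `datasets` filtering API, that cap looks roughly like this (the `text` column name is an assumption about our schema):

```python
MAX_CHARS = 5000

# Drop documents longer than the cap; "text" is the assumed column name.
dataset = dataset.filter(lambda example: len(example["text"]) <= MAX_CHARS)
```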
For this major portion of our dataset, we focused exclusively on the Arabic FineWeb2 dataset by Ali Elfilali. This dataset has already undergone extensive cleaning, filtering, and deduplication, which saved us significant preprocessing time.
Our process included:
Figure 5: Length distribution of samples in the initial Arabic FineWeb2 data
Figure 6: Length distribution of samples after removing Latin words
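Because corpora of this size do not need to be downloaded in full, sampling can be done in streaming mode. The sketch below illustrates the idea; the dataset identifier, config name, and target count are placeholders, not the exact ones we used:

```python
from datasets import load_dataset

# Placeholder identifier/config: check the actual dataset card for the Arabic split.
stream = load_dataset("HuggingFaceFW/fineweb-2", name="arb_Arab",
                      split="train", streaming=True)

samples = []
for example in stream:
    text = example["text"]
    if len(text) <= 5000:     # apply the character cap on the fly
        samples.append(text)
    if len(samples) >= 1000:  # illustrative target count
        break
```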
For this section, we used an Arabic Wikipedia dump curated by Saied Alshahrani. Our process was similar to the previous sections:
Figure 7: Length distribution of samples in the initial Arabic Wikipedia data
Figure 8: Length distribution of samples after removing Latin words
We chose to perform synthetic annotation, using a large language model to generate summaries for our dataset. The constraints of the free-tier Colab environment limited our model selection.
In theory, the free-tier Colab GPU meant we could not use models larger than about 20B parameters under 4-bit quantization. In practice, however, we found that:
After experimentation, we selected Jais 13B as our annotation model, as it offered the best balance between performance and feasibility within our constraints. Running this model required approximately 20GB of GPU RAM to generate reliable summaries, which we managed through careful memory optimization and gradient checkpointing.
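For context, here is a minimal sketch of how a model of this size can be loaded in 4-bit with `transformers` and `bitsandbytes`. The checkpoint name and loading options are assumptions (Jais checkpoints require `trust_remote_code=True`); our exact setup may have differed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "inceptionai/jais-13b"  # checkpoint name is an assumption

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

def count_tokens(text: str) -> int:
    """Token-counting helper used by the snippets below (our exact definition may have differed)."""
    return len(tokenizer.encode(text))
```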
For annotation, we used an Alpaca-style prompt structured as follows (the instruction asks the model to summarize the following article concisely and clearly in fewer than 50 words, and ### الملخص: means "### Summary:"):
### Instruction: قم بتلخيص المقال التالي بطريقة مختصرة وواضحة في أقل من 50 كلمة:
{text}
### الملخص:
For longer texts, we implemented chunking with a maximum of 1,700 tokens per chunk, since our model has a context window of 2,048 tokens. Here is the key code we used for the annotation process:
import gc
import time
import torch

# tokenizer, model, and count_tokens are defined when the annotation model is loaded (see above).

def safe_generate(text, max_input_tokens=1700):
    # Truncate inputs that would not fit in the context window
    token_count = count_tokens(text)
    if token_count > max_input_tokens:
        print(f"Input too long ({token_count} tokens). Truncating to {max_input_tokens} tokens.")
        tokens = tokenizer.encode(text)
        text = tokenizer.decode(tokens[:max_input_tokens])

    # Free any leftover GPU memory before generating
    torch.cuda.empty_cache()
    gc.collect()

    prompt = f"### Instruction: قم بتلخيص المقال التالي بطريقة مختصرة وواضحة في أقل من 50 كلمة:\n\n{text}\n\n### الملخص:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=False,
            num_beams=1,
        )

    # Keep only the text generated after the summary marker
    full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    if "### الملخص:" in full_response:
        summary = full_response.split("### الملخص:")[1].strip()
    else:
        summary = full_response.replace(prompt, "").strip()

    # Release tensors to keep memory usage stable across calls
    del inputs, outputs
    torch.cuda.empty_cache()
    gc.collect()
    return summary
For documents exceeding the token limit, we implemented a chunking mechanism:
def process_article(text, max_chunk_tokens=1700):
    # Short documents are summarized in a single pass
    if count_tokens(text) <= max_chunk_tokens:
        return safe_generate(text, max_chunk_tokens)

    # Split the tokenized document into chunks of at most max_chunk_tokens
    tokens = tokenizer.encode(text)
    chunks = []
    for i in range(0, len(tokens), max_chunk_tokens):
        chunk_tokens = tokens[i:i + max_chunk_tokens]
        chunks.append(tokenizer.decode(chunk_tokens))
    print(f"Split into {len(chunks)} chunks")

    # Summarize each chunk independently, freeing GPU memory in between
    summaries = []
    for chunk in chunks:
        summaries.append(safe_generate(chunk, max_chunk_tokens))
        torch.cuda.empty_cache()
        gc.collect()
        time.sleep(2)  # short pause between chunks to help avoid GPU OOM errors

    # Concatenate the per-chunk summaries into one summary
    return " ".join(summaries)
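To produce the summary column for the full dataset, this chunk-aware generator can be applied with `datasets.map`. A sketch, where the `text` and `summary` column names are assumptions about our schema:

```python
# Column names ("text", "summary") are assumptions about our schema.
annotated = dataset.map(lambda example: {"summary": process_article(example["text"])})
```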
We also observed that, because the model was quantized, summary quality degraded as the input text grew longer, which further justified our chunking approach.
To ensure quality, we validated the generated summaries by:
We generated summaries for all 5,000 documents and stored them in our mixed-darija-msa-summarization dataset.
To ensure proper structure for fine-tuning, we performed stratified data splitting, allocating 80% of the data for training and 20% for testing, with a fixed random seed to ensure reproducibility.
Stratifying the data ensures that the proportion of each category (Darija, MSA from web, MSA from Wikipedia) remains consistent between the train and test sets.
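With the `datasets` library, this split can be expressed directly, assuming a `category` column holds the Darija / MSA-web / MSA-Wikipedia label (the column name is an assumption):

```python
# stratify_by_column requires a ClassLabel feature, hence the cast first.
dataset = dataset.class_encode_column("category")
splits = dataset.train_test_split(test_size=0.2, seed=42, stratify_by_column="category")
train_ds, test_ds = splits["train"], splits["test"]
```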
Figure 9: Category distribution in the train and test sets after stratified sampling
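The resulting splits can then be published as a single dataset repository; a sketch with a placeholder namespace:

```python
from datasets import DatasetDict

# "your-username" is a placeholder; the dataset name matches the one mentioned above.
DatasetDict({"train": train_ds, "test": test_ds}).push_to_hub(
    "your-username/mixed-darija-msa-summarization"
)
```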
Throughout the dataset creation process, we encountered several challenges:
By carefully selecting, filtering, and annotating our data, we've created a balanced dataset of 5,000 documents combining Moroccan Darija and Modern Standard Arabic texts from both web content and educational sources. This dataset provides a solid foundation for fine-tuning a small language model for Arabic summarization tasks.
The code and methodologies we've shared should be adaptable to other languages with similar diglossia situations (formal vs. dialectal variants), making this approach valuable beyond just Arabic NLP.
In the next part of this series, we'll cover the model selection, fine-tuning process, and evaluation of our summarization model.