ArabicWeb24: Creating a high quality Arabic Web-only pre-training dataset

August 7, 2024

TL;DR

This blog post presents the pre-processing recipe of the ArabicWeb24 [1] dataset and evaluates the process by training different ablation models. It also outlines the impact of the different filtering pipelines on model output and on data quality.

The code used for all the filtering and deduplication steps can be found here [2].

1. Introduction

As the development of Large Language Models (LLMs) accelerates, the significance of high-quality pre-training data cannot be overstated. However, obtaining a high-quality pre-training dataset in Arabic remains a difficult challenge due to the limited availability of publicly accessible resources. To address this, we present our latest work, ArabicWeb24, which consists of two subsets. The first subset, referred to as V1, contains 28 billion tokens of data curated from an Arabic web crawl using various filters, formatters, and deduplication techniques. The second subset, referred to as V5, includes 39 billion tokens and applies all preprocessing pipelines except sentence deduplication. Both datasets aim to elevate the performance of Arabic LLMs by providing high-quality data. In this blog post, we provide a detailed overview of the pre-processing steps, as well as the training of various ablation models to examine and validate the filtering choices.

2. Data preparation

To build ArabicWeb24, we started from a crawl containing 6.5 TB of compressed WARC files. Since Common Crawl is mostly anglocentric and does not handle Arabic and other non-Latin languages effectively, we focused on a custom Arabic crawl designed to extract as much Arabic language data as possible. Given the voluminous size of the data, we looked for a way to process it quickly by parallelizing workloads. For this purpose, we chose the datatrove [3] library, an open-source data processing library developed by Hugging Face that allows processing, filtering, and deduplicating text data at a large scale. In this section, we detail each step taken to build the ArabicWeb24 dataset. The image below summarizes all the pipelines we designed, from extraction to filtering, deduplication, and formatting, to obtain a good quality dataset.

2.1 Extraction

The initial data is available in WARC (Web ARChive) format, which contains the raw HTML content together with its metadata. To perform the extraction, we used the Trafilatura module as the extractor, iterating through all the files and extracting the corresponding text.

Trafilatura includes a timeout option that sets a limit on text extraction time; records are skipped if this limit is exceeded. Using a timeout of 0.1 seconds per document effectively pre-filters problematic records, such as very long texts and image-heavy pages. This pre-filtering step helps streamline the subsequent processing stages.
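
As an illustration, a minimal sketch of this extraction stage with datatrove could look like the following; the paths and task counts are illustrative assumptions rather than the exact production setup.

```python
# Minimal sketch of the WARC -> text extraction stage with datatrove. Paths and task
# counts are illustrative assumptions, not the exact production setup.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.readers import WarcReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

extraction = LocalPipelineExecutor(
    pipeline=[
        WarcReader("s3://my-bucket/arabic-crawl/warc/"),  # hypothetical crawl location
        Trafilatura(timeout=0.1),  # skip any record whose extraction exceeds 0.1 s
        JsonlWriter("s3://my-bucket/arabicweb24/extracted/"),
    ],
    tasks=512,  # parallel tasks over the WARC shards (illustrative)
    logging_dir="logs/extraction",
)

extraction.run()
```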

2.2 Base filtering

We are dealing with web data, which is filled with low-quality content that negatively affects the performance of models trained on it. Unlike cleaned datasets, web data often mixes sources with varying levels of reliability, consistency, and relevance. This can introduce noise and inaccuracies that affect the quality of the final models.

First, let us discuss the characteristics of the Arabic language that distinguish it from Latin languages. Notably, Arabic is written from right to left and its script employs diacritical marks. Additionally, it has different punctuation marks and its own rules for constructing fully understandable sentences and paragraphs.

This motivates the base filtering process. With the help of the datatrove pipelines, adjusted to fit the requirements of the Arabic language, we conducted the following steps (a sketch of this stage is shown below):

  • Bad URL filtering: we updated the banned words list [4] to contain both Arabic and English banned words, ensuring we remove all bad URLs pointing to adult content.
  • Language filtering: we applied the fastText language classifier to keep only samples classified as Arabic or English with a score >= 0.65.
  • Gopher quality filtering: this filter applies Gopher's quality heuristic rules [5]. Documents with an unsuitable word count or mean word length, as well as those with a high use of symbols, a high percentage of lines starting with bullet points, or a high percentage of lines ending with ellipses, are filtered out.
  • Heuristic adjustments: we adapted the stop words list [6] from the AlGhafa paper [7] to fit Arabic data, and we slightly increased the maximum ellipsis lines ratio to 0.4 to avoid penalizing shorter samples, as such sentences are commonly found in Arabic text.

After this extraction and filtering, we obtained approximately 284 GT (gigatokens, i.e. billions of tokens).
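
As a rough illustration, the base filtering stage could be assembled with datatrove as follows; the stop word list is truncated, and the paths, task counts, and exact filter arguments are assumptions for the sake of the sketch.

```python
# Rough sketch of the base filtering stage with datatrove. Paths, task counts and the
# (truncated) Arabic stop word list are illustrative assumptions; only the thresholds
# mentioned in the text (language score >= 0.65, max ellipsis lines ratio of 0.4) are real.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import GopherQualityFilter, LanguageFilter, URLFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

ARABIC_STOP_WORDS = ["في", "من", "على", "إلى", "عن", "أن"]  # truncated, adapted from [6][7]

base_filtering = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("s3://my-bucket/arabicweb24/extracted/"),  # hypothetical input path
        URLFilter(),  # extended with the English + Arabic banned words/domains list [4]
        LanguageFilter(languages=["ar", "en"], language_threshold=0.65),
        GopherQualityFilter(
            stop_words=ARABIC_STOP_WORDS,   # adapted stop words
            max_ellipsis_lines_ratio=0.4,   # relaxed to avoid penalizing short Arabic samples
        ),
        JsonlWriter("s3://my-bucket/arabicweb24/base_filtered/"),
    ],
    tasks=512,
    logging_dir="logs/base_filtering",
)

base_filtering.run()
```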

2.3 Deduplication

Filtering out duplicated samples has an important effect on model performance, as it reduces the chance of the model memorizing parts of the pre-training data.

Additionally, deduplication helps lower the computational requirements: it removes redundant data, and even with a lower number of tokens, the model is exposed to more varied content.

In the deduplication process, we used two types of deduplication:

1. MinHash (fuzzy deduplication): an approach that performs similarity search across all documents and marks those considered duplicates. It starts by transforming each document into a set of n-grams, then applies multiple hash functions to them to generate the MinHash signatures. These signatures are then divided into buckets, and documents sharing the same MinHashes in any bucket are considered duplicates of each other. We went for the default parametrization of MinHash in datatrove, using 5-grams and 112 hash functions to compute the signatures, divided into 14 buckets of 8 hashes each, to remove documents that are at least 75% similar (the configuration is sketched after this list).

After this step, approximately 98 GT remain.

2. Sentence deduplication: a process designed to identify and remove duplicate sentences within the dataset, ensuring each sentence is unique and reducing redundancy. We used the default values provided by datatrove in their repository and changed how sentences are split, using the newline character as the separator.

After this step, approximately 56 GT remain in total.
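
For reference, a sketch of the deduplication parameters is shown below, assuming datatrove's standard multi-stage MinHash pipeline (signatures, buckets, clusters, filter) and its sentence deduplication blocks; the output paths and the `split_sentences` flag used to express newline splitting are assumptions.

```python
# Sketch of the deduplication parameters, assuming datatrove's multi-stage MinHash
# pipeline (signatures -> buckets -> clusters -> filter) and its sentence dedup blocks.
# Output paths are hypothetical; the split_sentences flag is our assumption for
# expressing "split on newlines".
from datatrove.pipeline.dedup import MinhashDedupSignature
from datatrove.pipeline.dedup.minhash import MinhashConfig
from datatrove.pipeline.dedup.sentence_dedup import SentDedupConfig

# 112 hash functions = 14 buckets x 8 hashes per bucket, computed over 5-grams,
# targeting removal of documents that are roughly 75%+ similar.
minhash_config = MinhashConfig(n_grams=5, num_buckets=14, hashes_per_bucket=8)

signature_stage = MinhashDedupSignature(
    output_folder="s3://my-bucket/arabicweb24/minhash/signatures/",  # hypothetical path
    config=minhash_config,
)
# ...followed by the MinhashDedupBuckets, MinhashDedupCluster and MinhashDedupFilter stages.

# Sentence deduplication with default values, splitting on newlines instead of a sentence tokenizer.
sent_dedup_config = SentDedupConfig(split_sentences=False)
```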

2.4 Additional Filtering for a better quality and Formatting 

To ensure high data quality, we conducted thorough inspections of the samples after deduplication and observed that many contained HTML image placeholders left behind by the Trafilatura extractor, which replaces images with such placeholders.

To address this issue, we created a custom formatter to remove all placeholders. Additionally, we added a symbol line formatter to eliminate lines composed primarily of symbols. These steps ensure the consistency and cleanliness of the extracted text.
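
A minimal sketch of such a custom formatter is shown below; the formatter interface and the placeholder regex are assumptions made for illustration, and the actual pattern should match whatever the extractor emits.

```python
# Hypothetical custom formatter that strips image placeholders left by Trafilatura.
# The BaseFormatter interface and the placeholder regex are assumptions for illustration.
import re

from datatrove.pipeline.formatters.base import BaseFormatter

# Assumed markdown-style placeholder pattern; adapt to whatever the extractor actually emits.
IMAGE_PLACEHOLDER = re.compile(r"!\[[^\]]*\]\([^)]*\)")


class ImagePlaceholderFormatter(BaseFormatter):
    name = "image placeholder remover"

    def format(self, text: str) -> str:
        cleaned = IMAGE_PLACEHOLDER.sub("", text)
        # Drop empty lines, including lines that contained only a placeholder.
        return "\n".join(line for line in cleaned.split("\n") if line.strip())
```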

Additionally, since the URL filter only operates at the URL level, some samples with adult or scam content remained. We therefore applied the Bad Words Filter from the C4 filtering pipeline, using a custom list of banned English and Arabic words, the same list used in URL filtering.

Although datatrove provides a bad words list for Arabic, we found it insufficient: the complexity of the Arabic language makes it hard to cover the raw Arabic content found in a web crawl with such a list. Therefore, we created an exhaustive custom list that aims to cover a large number of bad words.

This step leaves about 45 GT.
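
As an illustration, a document-level bad-words filter in the spirit of the C4 rule could look like the following; the filter interface shown is our assumption about datatrove's filter API, the word file path is hypothetical, and the simple substring check is a simplification of the actual matching logic.

```python
# Hypothetical document-level bad-words filter in the spirit of the C4 rule: drop any
# document containing a banned word. The BaseFilter interface is our assumption about
# datatrove's filter API, the file path is illustrative, and the plain substring check
# is a simplification of the real matching logic.
import re

from datatrove.data import Document
from datatrove.pipeline.filters.base_filter import BaseFilter


class ArabicBadWordsFilter(BaseFilter):
    name = "Arabic/English bad words filter"

    def __init__(self, words_file: str = "assets/en_ar_banned_words.txt"):
        super().__init__()
        with open(words_file, encoding="utf-8") as f:
            words = sorted({w.strip().lower() for w in f if w.strip()}, key=len, reverse=True)
        self.pattern = re.compile("|".join(re.escape(w) for w in words))

    def filter(self, doc: Document) -> bool | tuple[bool, str]:
        if self.pattern.search(doc.text.lower()):
            return False, "banned_word"  # drop the document and record an exclusion reason
        return True
```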

For more consistency, we added the FineWeb filter to remove short lines, lists, navigation bars, and other unnecessary content extracted from web pages, leaving us with a high-quality, cleaned pre-training dataset of 28 GT.

An interesting aspect of this filter is that it operates using stop characters. Since Arabic is read from right to left and has its own letters, numbers, and punctuation, it also has its own stop characters. These characters were interpreted differently by the original filter, sometimes being detected at the beginning of a sentence and other times at the end. We made sure that our custom version of the FineWeb filter takes all these specific characteristics into account.
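
A sketch of a FineWeb-style line filter adapted to Arabic punctuation is shown below; the stop_chars argument is our assumption about how the Arabic stop characters were wired in, and the thresholds are illustrative.

```python
# Hypothetical instantiation of a FineWeb-style line filter adapted to Arabic punctuation.
# The stop_chars argument is our assumption about how the custom stop characters were
# passed in; the thresholds shown are illustrative.
from datatrove.pipeline.filters import FineWebQualityFilter

# Latin and Arabic terminal punctuation: Arabic question mark, comma and semicolon included.
ARABIC_STOP_CHARS = (".", "!", '"', "'", "؟", "،", "؛")

fineweb_filter = FineWebQualityFilter(
    line_punct_thr=0.12,     # min. fraction of lines ending with a stop character
    short_line_thr=0.67,     # max. fraction of very short lines
    short_line_length=30,
    stop_chars=ARABIC_STOP_CHARS,
)
```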

To provide a comprehensive overview of the data retained after each filtering stage, we present the schema below. It illustrates the full extent of the data that was kept and discarded at each step of filtering, formatting, and deduplication.

All percentages are computed with respect to the total number of tokens left after the base filtering.

2.5 Exclusion Reasons 

For each filter applied, we examine the reasons for dropped documents, review the removed samples to ensure they meet the exclusion criteria, and adjust the filters as needed to maximize data retention and avoid applying overly strict filters to the data.

For the Gopher filter, most samples were eliminated because their word count per document fell outside the heuristic constraints, meaning the document was either too short or too long.

As for the Bad URL filter, most samples were removed because their domain appeared in the block list or because their URLs contained banned words, which got them hard-blacklisted.

The FineWeb filter showed that the initial ArabicWeb24 data is filled with lines that do not have a sufficient proportion of punctuation marks (using the adapted stop characters).

3. Training ablations

After the massive filtering/deduplication, we opted for training small ablation models (<1B) and evaluated these models on zero-shot evaluation tasks in order to judge data quality.

It is very common to train small models and evaluate them on a set of benchmarks to assess data quality [8]. This approach is effective since the models are small so training different models is fast and cheap.

To achieve this, we divided the process into three essential steps: clean, train, and evaluate. For each adjustment in the pipeline, we trained a small ablation model to observe the impact of each filter and deduplication stage on the dataset's quality. This method allowed us to systematically evaluate and validate the improvements at every step, ensuring that the final dataset was of the highest quality.

3.1 Setup

For the ablation models, we used the Mamba 2 architecture with a sequence length of 1024, given that the mean number of tokens per document was around 750. We set the global batch size to 1040 and used a d_model of 2304 with 18 layers. This choice of a wider model is motivated by training efficiency considerations with minimal performance degradation. The vocabulary size was set to 64k, and we used the AraGPT2 tokenizer [9]. The ablation models all have 900 million parameters. We also used a cosine decay learning rate scheduler with 10% warmup.
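
For reference, this setup can be summarized as a plain configuration dictionary; this only restates the values above and is not the actual training configuration format.

```python
# Summary of the ablation training setup as a plain dictionary; this only restates the
# values above and is not the actual training configuration format.
ablation_config = {
    "architecture": "mamba2",
    "n_params": "~900M",
    "d_model": 2304,
    "n_layers": 18,
    "vocab_size": 64_000,        # AraGPT2 tokenizer [9]
    "sequence_length": 1024,     # mean document length is ~750 tokens
    "global_batch_size": 1040,
    "lr_schedule": "cosine_decay",
    "warmup_ratio": 0.10,
}
```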

3.2 Versioning

In total, we trained 5 ablation models, all with the same configuration; the only difference was the data they were trained on.

We also trained a baseline model on an open-source dataset that we refer to as V3: the 101 Billion Arabic Words Dataset [10], which is built on Common Crawl and is dedicated to training and fine-tuning large language models for better generation of Arabic content.

The table below details the differences in the data pipelines. Please note that we started the ablations after the MinHash deduplication step, as we consider MinHash necessary in all cases.

Datasets Versions       Dataset V1   Dataset V2   Dataset V4   Dataset V5
Sentence deduplication  ✅           ✅           🚫           🚫
C4 Bad words filter     ✅           ✅           ✅           ✅
FineWeb filter          ✅           🚫           🚫           ✅
Formatters              ✅           ✅           ✅           ✅
Num. of tokens (GT)     28           45           78           39

3.3 Evaluation

a- Qualitative Evaluation

We began by conducting various experiments with the models, starting with a series of qualitative analyses based on different prompts. To evaluate and judge the quality of the model outputs, we established three qualitative metrics.

  • Fluency: How grammatically correct and natural the generated text is.
  • Coherence: How logically consistent and contextually appropriate the generated text is.
  • Relevance: How well the generated text addresses the prompt or task.

The experiments demonstrate that the model trained on V1 data, which underwent extensive filtering, performed the best according to the chosen metrics. Its output was fluent, relevant, and appropriate, without any display of adult or spam content.

In contrast, models trained on V2 and V4, which lacked the FineWeb filter, produced noisy output containing characters from other languages, list-like outputs, etc. This highlights the critical role of the FineWeb filter in eliminating undesired characters and gibberish content, ensuring the coherence and relevance of the model's output.

Furthermore, deduplication is a very important step, as it lowers the frequency of emitting memorized content and reduces the computational resources needed for training and pre-processing. However, we found that sentence deduplication had minimal impact on the quality of the model's output at the scale of the ablations.

b- Metrics and Benchmarks

We used the open-access ultimate_arabic_news dataset [11] for evaluation, extracting 61 million tokens. This dataset is a single-label collection of modern Arabic news texts, compiled through web scraping from sources like Al-Arabiya, Al-Youm Al-Sabea (Youm7), Google News, and other news sites.

We evaluated the different models using the top 3 and top 10 accuracies, as shown in the plots below. These metrics indicate not only whether models are able to generate coherent output, but also how often the correct answer appears within the top 3 or top 10 predictions.

Analysis of the plots shows that the model trained on dataset V1 yields the best results in terms of top 3 and top 10 accuracies, followed by the model trained on V5. The filtering and deduplication processes all help improve data quality, but sentence deduplication appears to have a minor effect. We can therefore drop sentence deduplication in exchange for more training tokens, which would bring further improvements.
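
For clarity, top-k next-token accuracy can be computed as follows; this is a generic sketch assuming a causal language model that outputs logits of shape (batch, sequence, vocabulary), not the exact evaluation code used here.

```python
# Generic sketch of top-k next-token accuracy, assuming a causal LM that outputs logits
# of shape (batch, sequence, vocab); not the exact evaluation code used here.
import torch


@torch.no_grad()
def top_k_accuracy(logits: torch.Tensor, targets: torch.Tensor, k: int) -> float:
    """logits: (B, T, V) next-token predictions; targets: (B, T) ground-truth next tokens."""
    topk = logits.topk(k, dim=-1).indices                # (B, T, k)
    hits = (topk == targets.unsqueeze(-1)).any(dim=-1)   # (B, T): True if the target is in the top-k
    return hits.float().mean().item()


# top3 = top_k_accuracy(logits, targets, k=3)
# top10 = top_k_accuracy(logits, targets, k=10)
```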

The perplexity study (figure below) indicates that the V1 model outperforms others, including the baseline model trained on the 101B words dataset. Models trained on ArabicWeb24 data achieve lower, more similar perplexity scores. However, these perplexity results are not optimal, possibly due to the model's limited size (900M parameters) and differences between the evaluation and training datasets. The 101B words dataset underwent specific preprocessing, such as diacritic removal and elimination of non-Arabic characters, which may contribute to these differences. Nevertheless, the V1 model's superior performance in predicting subsequent words suggests more coherent and contextually accurate outputs, aligning with the qualitative analysis findings.
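
Similarly, perplexity is simply the exponential of the mean next-token cross-entropy; a minimal sketch under the same assumptions:

```python
# Minimal sketch: perplexity is the exponential of the mean next-token cross-entropy.
import torch
import torch.nn.functional as F


@torch.no_grad()
def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """logits: (B, T, V); targets: (B, T) next tokens aligned with the logits."""
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return torch.exp(nll).item()
```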

Another way to compare the different datasets is by evaluating the ablation models on standard benchmarks. We selected a set of benchmarks that would help us evaluate the models at a small scale that were trained on a few billion tokens. We generally selected these benchmarks since they target the Arabic language and are also easy to use through the lm-evaluation-harness [12] library, in addition to being able to provide signal at a small scale.

After consideration, this is the list of Arabic benchmarks that we used (a sketch of how they can be run is shown after the list):

* COPA-ar: Translated COPA from AlGhafa paper [13]

* HellaSwag-ar: Translation of HellaSwag with ChatGPT 

* PIQA-ar: Translated PIQA from AlGhafa paper [14]
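
Such an evaluation could be launched with lm-evaluation-harness [12] roughly as follows; the task names and model arguments are illustrative assumptions, since the Arabic tasks rely on custom or translated task configurations rather than the harness defaults.

```python
# Hypothetical way to launch the zero-shot benchmarks with lm-evaluation-harness [12].
# The task names and model arguments are assumptions: the Arabic tasks rely on
# custom/translated task configurations rather than the harness defaults.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",                                    # or a custom wrapper for the Mamba 2 checkpoints
    model_args="pretrained=path/to/ablation-v1",   # hypothetical checkpoint path
    tasks=["copa_ar", "hellaswag_ar", "piqa_ar"],  # assumed task names for the Arabic benchmarks
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```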

The results of the benchmarking at the last step of training are detailed in the plot below:

The zero-shot benchmarks show a significant difference between the performance of the baseline model and the models trained on the ArabicWeb24 datasets. They also indicate only a slight variation across tasks for the models we trained, which we suspect is due to the Arabic benchmarking datasets used.

4. Data Domain Distribution

The experiments revealed a pattern in the models' outputs: a tendency to generate text similar to news articles. To investigate this trend, we extracted the 150 most frequently occurring URLs from the documents. These URLs were then annotated using a two-step approach: annotation with Llama 3 8B followed by manual verification (a sketch of the annotation prompt is shown after the list below). We provide the model with an excerpt from each URL and ask it to classify it into one of the following classes:

  1. News
  2. Encyclopedia
  3. Art and Entertainment
  4. Sports
  5. Society and Religion
  6. Spam
  7. Marketplace
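
The annotation step can be sketched as follows; the prompt wording and the generation call are illustrative assumptions rather than the exact prompt used.

```python
# Illustrative sketch of the URL annotation step: prompt an instruction-tuned model
# (Llama 3 8B in our setup) to classify an excerpt into one of the seven classes, then
# verify the labels manually. The prompt wording and the generate() call are assumptions.
CLASSES = [
    "News", "Encyclopedia", "Art and Entertainment", "Sports",
    "Society and Religion", "Spam", "Marketplace",
]


def build_annotation_prompt(url: str, excerpt: str) -> str:
    return (
        "You are annotating Arabic web domains.\n"
        f"URL: {url}\n"
        f"Excerpt: {excerpt[:1500]}\n"
        f"Classify this page into exactly one of: {', '.join(CLASSES)}.\n"
        "Answer with the class name only."
    )


# labels = [generate(build_annotation_prompt(url, excerpt)) for url, excerpt in top_150_urls]
# ...followed by manual verification of the assigned labels.
```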

The resulting distribution is illustrated in the graph below.

The chart illustrates the distribution of the top 150 URLs in the ArabicWeb24 dataset. Among these, news content is the most prevalent, making up 76% of the URLs. Encyclopedia entries are the second largest category at 8%, followed by Art and Entertainment at 6%. Sports-related URLs account for 4% of these top 150 URLs. The remaining categories - Society and Religion, Spam, and Marketplace - each represent 2% of this subset. This distribution shows the variety of content sources within these most frequently occurring URLs, with a clear emphasis on news-related material. While this analysis focuses on the top 150 URLs and may not represent the entire dataset, it suggests a strong presence of current events and factual information, along with a range of other topics that add diversity to the content.

5. Computational Resources

This section outlines the computational resources utilized for various stages of the data processing pipeline:

  1. Ablation Studies:
    • Platform: HPE Cray node
    • Hardware: 8 NVIDIA H100 GPUs
    • Cloud Provider: Orange Cloud Avenue
  2. MinHash Deduplication:
    • Infrastructure: MeluXina HPC cluster nodes
  3. Data Pre-processing:
    • Cloud Provider: Amazon Web Services (AWS)
    • Text Extraction and Base Filtering:
      • Instance Type: c7a.8xlarge
    • Advanced Processing (sentence deduplication, tokenization, fineweb processing, URL filtering, and formatting):
      • Instance Type: r7a.12xlarge

6. Conclusion

In this work, we aim to bridge the gap between the datasets available for English and the limited ones available for low-resource languages like Arabic. Existing open-source Arabic datasets are often small and noisy, so our goal was to develop a clean, web-scale dataset for Arabic. We are releasing a substantial amount of this cleaned data, collected from previously unexplored sources, and we extensively document the process in this blog post.

We hope that these efforts will motivate other actors to perform critical work on data and contribute to the broader AI community by providing valuable and accessible resources for researchers and developers working toward larger, stronger, natively Arabic language models.

Interested in custom data curation and processing for your language? Contact LightOn for custom solutions, with Arabic as a prime example.

7. References

[1] ArabicWeb24 HF dataset card (https://huggingface.co/datasets/lightonai/ArabicWeb24)

[2]  ArabicWeb24 pipelines (https://github.com/lightonai/datatrove/tree/arabic-web/arabicweb)

[3]  datatrove Library (https://github.com/huggingface/datatrove)

[4] ArabicWeb24 bad words list (https://github.com/lightonai/datatrove/blob/arabic-web/arabicweb/assets/en_ar_banned_words.txt)

[5] Gopher filter paper (https://arxiv.org/pdf/2112.11446.pdf)

[6] Arabic stop words list (https://countwordsfree.com/stopwords/arabic)

[7] AlGhafa paper (https://aclanthology.org/2023.arabicnlp-1.21.pdf)

[8] FineWeb blog post (https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1)

[9] AraGPT2 tokenizer (https://huggingface.co/aubmindlab/aragpt2-base)

[10] 101B words dataset (https://huggingface.co/datasets/ClusterlabAi/101_billion_arabic_words_dataset)

[11] Ultimate news Arabic dataset (https://huggingface.co/datasets/khalidalt/ultimate_arabic_news/viewer/UltimateArabicPrePros)

[12] lm-evaluation-harness library (https://github.com/EleutherAI/lm-evaluation-harness)

[13] Arabic version of COPA benchmark (https://huggingface.co/datasets/OALL/AlGhafa-Arabic-LLM-Benchmark-Translated/viewer/copa_ext_ar)

[14] Arabic version of PIQA benchmark (https://huggingface.co/datasets/OALL/AlGhafa-Arabic-LLM-Benchmark-Translated/viewer/piqa_ar)

To cite this work, please refer to the following BibTeX:

@misc{ArabicWeb24,
title={ArabicWeb24: Creating a High Quality Arabic Web-only Pre-training Dataset},
author={Farhat, May and Taghadouini, Said},
organization={Farhat, May: LightOn; INSAT. Taghadouini, Said: LightOn},  
url={www.lighton.ai/lighton-blogs/arabicweb24},
year={2024}
}

Note : May Farhat completed her work on the ArabicWeb24 project during her internship tenure at LightOn. Throughout this period, she was under the academic supervision of Ms. Sonia Hajri Gabouj from INSAT, and the professional guidance of Mr. Oskar Hallström, her designated supervisor at LightOn.
