Optimizing Arabic Web-Crawled Text Dataset Generation: A High-Quality, Resource-Efficient Pipeline for Small-Scale Applications

Date of Award


Document Type


Degree Name

Master of Science in Machine Learning


Machine Learning

First Advisor

Dr. Qirong Ho

Second Advisor

Dr. Martin Takac


"With the proliferation of data on the internet and the enhanced accessibility of modern artificial intelligence (AI) technologies, advancements in the field of Natural Language Processing (NLP) and Large Language Models (LLMs) are occurring at an unprecedented pace. This surge is propelled by the vast availability of data across diverse domains, marking a golden era in computational linguistics and AI research. Despite this abundance, the quest for high-quality, large datasets remains a critical challenge. The sheer volume of raw data can be overwhelming, and its conversion into a format conducive to AI research often presents a formidable barrier, particularly for individual researchers and small teams. This thesis delves into the automation of data collection and processing at a large scale. It aims to bridge the gap between the availability of raw data and the creation of datasets that are both comprehensive and of high quality. The research introduces and evaluates the Macro Data Refinement Pipeline, an innovative approach developed by the Falcon LLM Team [28]. This methodology leverages a modular approach and frameworks to streamline the process of dataset generation at an efficient scale, ensuring that the resulting collections are not only vast but also meticulously curated to meet the rigorous standards required for cutting-edge AI research. By systematically addressing the challenges associated with large-scale data processing, this thesis contributes to the field by offering a scalable solution that can significantly enhance the efficiency and effectiveness of data preparation processes. Moreover, it lays the groundwork for future innovations in AI and NLP, potentially accelerating the pace of discoveries and applications in these areas. The implications of this work are far-reaching, offering potential benefits across various sectors, including technology, healthcare, education, and more.
By simplifying the process of generating high-quality, large datasets, this research paves the way for more sophisticated and capable AI systems, driving progress and innovation in a multitude of fields."


Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc degree in Machine Learning

Advisors: Dr. Qirong Ho, Dr. Martin Takac

Online access available for MBZUAI patrons