Towards Text-To-Speech Models For Arabic

Date of Award


Document Type


Degree Name

Master of Science in Natural Language Processing


Natural Language Processing

First Advisor

Prof. Muhammad Abdul-Mageed

Second Advisor

Prof. Preslav Nakov


Text-To-Speech (TTS) systems based on deep learning have advanced remarkably in recent years. Progress in linguistic modeling and language understanding has also enabled TTS systems to adapt to various languages, dialects, and accents, making them more accessible and usable for diverse global audiences. Although TTS systems are thriving for English, many languages still lag behind due to insufficient resources. This holds for both single-speaker and zero-shot multi-speaker TTS systems, highlighting the need for continued research and resource allocation to address these disparities and extend TTS technology to linguistically diverse populations. In this thesis, we address this gap for Arabic, a language of more than 450 million native speakers, by first adapting a sizeable existing dataset to suit the needs of speech synthesis. We then train the YourTTS [10] model, one of the state-of-the-art (SoTA) architectures for English, from scratch on our dataset. Additionally, we fine-tune the XTTS model, an open-source architecture. We evaluate all our models on a dataset comprising 40 unseen speakers and find that the YourTTS model achieves performance comparable to that of the XTTS model. YourTTS achieves the best speaker similarity score of 0.46, while the original XTTS attains the best Mean Opinion Score (MOS) of 3.37 (±1.16). Although the performance of zero-shot multi-speaker models lags behind that of the single-speaker TTS model, our study highlights significant potential for improvement in this emerging area of research in Arabic. Furthermore, we probe the performance of available single-speaker TTS models in Arabic such as Glow-TTS [40], Grad-TTS [64], and ArTST [84]. In conclusion, this thesis presents a pioneering effort in advancing Arabic speech synthesis, offering insights, methodologies, and empirical findings that contribute to the evolving landscape of TTS technologies tailored for Arabic. Through its experiments, this research lays the groundwork for further exploration and development in this critical area of speech processing.


Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc. degree in Natural Language Processing

Advisors: Muhammad Abdul-Mageed, Preslav Nakov

Online access available for MBZUAI patrons