Towards Text-To-Speech Models For Arabic
Date of Award
4-30-2024
Document Type
Thesis
Degree Name
Master of Science in Natural Language Processing
Department
Natural Language Processing
First Advisor
Prof. Muhammad Abdul-Mageed
Second Advisor
Prof. Preslav Nakov
Abstract
Text-To-Speech (TTS) systems based on deep learning have advanced remarkably in recent years. Progress in linguistic modeling and language understanding has also enabled TTS systems to adapt to various languages, dialects, and accents, significantly enhancing their accessibility and usability across diverse global audiences. Although TTS systems are thriving for English, many languages still lag behind due to insufficient resources. This holds for both single-speaker TTS and zero-shot multi-speaker TTS systems, highlighting the need for continued research and resource allocation to address these disparities and extend TTS technology to linguistically diverse populations. In this thesis, we address this gap for Arabic, a language of more than 450 million native speakers, by first adapting a sizeable existing dataset to suit the needs of speech synthesis. Subsequently, we train the YourTTS [10] model, one of the state-of-the-art (SoTA) architectures for English, from scratch on our dataset. Additionally, we fine-tune the XTTS model, an open-source architecture. We then evaluate all our models on a dataset comprising 40 unseen speakers and find that YourTTS achieves performance comparable to that of XTTS. YourTTS achieves the best speaker similarity score of 0.46, while the original XTTS attains the best Mean Opinion Score (MOS) of 3.37 (±1.16). Although the performance of zero-shot multi-speaker models lags behind that of the single-speaker TTS model, our study highlights significant potential for improvement in this emerging area of research in Arabic. Furthermore, we probe the performance of available single-speaker TTS models in Arabic such as Glow-TTS [40], Grad-TTS [64], and ArTST [84]. In conclusion, this thesis presents a pioneering effort in advancing Arabic TTS synthesis, offering insights, methodologies, and empirical findings that contribute to the evolving landscape of TTS technologies tailored for Arabic.
Through its experiments, this research lays the groundwork for further exploration and development in this critical area of speech processing.
Recommended Citation
D. Doan, "Towards Text-To-Speech Models For Arabic," Apr 2024.
Comments
Thesis submitted to the Deanship of Graduate and Postdoctoral Studies
In partial fulfilment of the requirements for the M.Sc. degree in Natural Language Processing
Advisors: Muhammad Abdul-Mageed, Preslav Nakov
Online access available for MBZUAI patrons