Beyond Text: Leveraging Audio Utterances to Enhance Diacritic Restoration

Date of Award


Document Type


Degree Name

Master of Science in Natural Language Processing


Natural Language Processing

First Advisor

Prof. Hanan Aldarmaki

Second Advisor

Prof. Shady Shehata


Automatic diacritization plays a vital role in improving the readability and comprehension of Arabic text. Modern Standard Arabic is typically written without diacritical markings, even though these markings are essential for disambiguating word senses and pronunciations. Their absence introduces ambiguity and poses challenges for Arabic applications such as information retrieval, machine translation, and text-to-speech. Researchers developing a text-to-speech system for Arabic, for example, found that synthesized speech contains numerous pronunciation errors, largely because the input text lacks diacritics. Restoring diacritics is therefore crucial for accuracy across these domains. However, current diacritic restoration models struggle when applied to transcribed speech because of the shifts in domain and style inherent in spoken language. This research investigates whether parallel spoken utterances can improve the automatic restoration of diacritics in speech data. We developed two frameworks: ASR+Text and Audio+Text. The ASR+Text framework uses a pretrained Automatic Speech Recognition (ASR) model to generate preliminary diacritized text, which is then refined in conjunction with the raw text. The Audio+Text framework instead incorporates audio features directly alongside the text, using techniques such as clustering features from models like HuBERT and Wav2Vec and fine-tuning the XLS-R model with an ASR objective. We compared the results against existing text-only diacritic restoration models. The proposed models, which use audio features, achieved relative reductions in diacritic error rate of 45% for ASR+Text and 43% for Audio+Text, highlighting the substantial benefit of incorporating audio data.
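As an illustration of how the reported metric can be read, the following is a minimal Python sketch of a diacritic error rate and of a relative reduction between two systems. It is not taken from the thesis: the function names, the choice to count every non-space character (including undiacritized ones), and the restriction to the marks U+064B through U+0652 are all assumptions, and the thesis may define the metric differently.

```python
# Arabic diacritic marks considered here (assumption): fathatan .. sukun.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_diacritics(text):
    """Return a list of (base_char, marks) pairs for a diacritized string."""
    pairs = []
    for ch in text:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding base character
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def diacritic_error_rate(reference, hypothesis):
    """Fraction of non-space characters whose diacritics differ.

    Assumes both strings share the same undiacritized character sequence.
    """
    ref = split_diacritics(reference)
    hyp = split_diacritics(hypothesis)
    if [b for b, _ in ref] != [b for b, _ in hyp]:
        raise ValueError("reference and hypothesis differ in base characters")
    scored = [(r, h) for (b, r), (_, h) in zip(ref, hyp) if not b.isspace()]
    if not scored:
        return 0.0
    # Compare marks as sorted lists so that mark order does not matter.
    errors = sum(sorted(r) != sorted(h) for r, h in scored)
    return errors / len(scored)


def relative_reduction(baseline_der, improved_der):
    """Relative reduction in error rate with respect to a baseline."""
    return (baseline_der - improved_der) / baseline_der


# Toy example: the same three letters with two of three diacritics wrong.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E"
hypothesis = "\u0643\u064F\u062A\u0650\u0628\u064E"
der = diacritic_error_rate(reference, hypothesis)

# Hypothetical error rates chosen only to show the arithmetic.
reduction = relative_reduction(0.10, 0.055)
```

Under this definition, a 45% relative reduction means the improved system's error rate is 55% of the baseline's, whatever the absolute values are; the 0.10 and 0.055 figures above are placeholders, not results from the thesis.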


Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc. degree in Natural Language Processing

Advisors: Hanan Aldarmaki, Shady Shehata

Online access available for MBZUAI patrons