Beyond Text: Leveraging Audio Utterances to Enhance Diacritic Restoration
Date of Award
4-30-2024
Document Type
Thesis
Degree Name
Master of Science in Natural Language Processing
Department
Natural Language Processing
First Advisor
Prof. Hanan Aldarmaki
Second Advisor
Prof. Shady Shehata
Abstract
Automatic diacritization plays a vital role in improving the readability and comprehension of Arabic text. However, current diacritic restoration models encounter difficulties when applied to transcribed speech because of the domain and style shifts inherent in spoken language. Researchers developing a text-to-speech system for Arabic identified a significant issue: synthesized speech contains numerous pronunciation errors, largely stemming from the absence of diacritics in Modern Standard Arabic writing. Modern Standard Arabic text is typically written without diacritical marks, which are essential for disambiguating word senses and meanings; their absence introduces ambiguity that poses challenges for Arabic applications such as information retrieval, machine translation, and text-to-speech. Integrating diacritics into Arabic text is therefore crucial for accuracy and effectiveness across these domains. This research investigates whether automatic diacritic restoration on speech data can be improved by exploiting parallel spoken utterances. Specifically, we developed two frameworks: ASR+Text and Audio+Text. The ASR+Text framework uses a pretrained Automatic Speech Recognition (ASR) model to generate preliminary diacritized text, which is then refined in conjunction with the raw text input. The Audio+Text framework instead incorporates audio features directly alongside the textual input, employing techniques such as clustering features from self-supervised models like HuBERT and Wav2Vec and fine-tuning the XLS-R model on an ASR objective. We evaluated both frameworks against pre-existing text-only diacritic restoration models. Our proposed models, which use audio features, achieved relative reductions in diacritic error rate of 45% for ASR+Text and 43% for Audio+Text, highlighting the substantial benefit of incorporating audio data.
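
The Audio+Text step of discretizing self-supervised speech features is concrete enough to sketch. The minimal Python example below, which is illustrative rather than the thesis implementation, extracts frame-level HuBERT features with Hugging Face Transformers and clusters them with k-means to obtain discrete acoustic units that could be fed to a diacritization model alongside the input characters. The checkpoint name, file paths, and cluster count are assumptions chosen for illustration only.

# A minimal sketch (not the thesis code) of clustering self-supervised
# speech features into discrete acoustic units. Checkpoint, file names,
# and the number of clusters are illustrative assumptions.
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import HubertModel, Wav2Vec2FeatureExtractor

CKPT = "facebook/hubert-base-ls960"  # illustrative pretrained checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
hubert = HubertModel.from_pretrained(CKPT).eval()

def frame_features(wav_path: str) -> torch.Tensor:
    """Return frame-level HuBERT features of shape (frames, hidden_dim)."""
    wav, sr = torchaudio.load(wav_path)
    # Resample to HuBERT's expected 16 kHz and downmix to mono.
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)
    inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        out = hubert(inputs.input_values)
    return out.last_hidden_state.squeeze(0)

# Fit k-means on pooled frames from a (tiny, hypothetical) utterance set,
# then map each frame of an utterance to its nearest cluster id.
paths = ["utt1.wav", "utt2.wav"]  # hypothetical audio files
feats = torch.cat([frame_features(p) for p in paths], dim=0).numpy()
kmeans = KMeans(n_clusters=100, random_state=0).fit(feats)
units = kmeans.predict(frame_features("utt1.wav").numpy())
print(units[:20])  # discrete unit ids, one per 20 ms frame

Each 20 ms frame maps to one unit id, so a downstream restoration model can attend over these units together with character embeddings of the undiacritized text.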
Recommended Citation
S. Shatnawi, "Beyond Text: Leveraging Audio Utterances to Enhance Diacritic Restoration," M.Sc. thesis, Apr. 2024.
Comments
Thesis submitted to the Deanship of Graduate and Postdoctoral Studies
In partial fulfilment of the requirements for the M.Sc. degree in Natural Language Processing
Advisors: Hanan Aldarmaki, Shady Shehata
Online access available for MBZUAI patrons