A Unified Model for Text-to-Speech and Speech-to-Text

Date of Award


Document Type


Degree Name

Master of Science in Machine Learning


Department

Machine Learning

First Advisor

Dr. Hao Li

Second Advisor

Dr. Hanan Aldarmaki


Abstract

"objectives, resulting in two distinct large networks. SpeechT5, in contrast, introduces a unified-modal framework for self-supervised speech and text representation learning. It optimizes this learning with a joint objective, aligning text and speech information in a shared semantic space. The framework consists of a shared transformer encoder-decoder architecture together with six auxiliary pre/post-nets that handle modality-specific inputs and outputs. To align textual and acoustic information in the shared semantic space, a cross-modal vector quantization approach that randomly mixes speech/text hidden states with latent units serves as the interface between the encoder and decoder. SpeechT5 has demonstrated superior performance across a range of spoken English language tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification. These tasks, however, are still approached individually, using pre-trained weights from the self-supervised speech and text representation learning.

In this study, our objective is to consolidate the training process for Arabic ASR and TTS, aiming to reduce computational demands while maintaining state-of-the-art performance. This is accomplished in two stages. First, we pre-train the SpeechT5 architecture on 1K hours of Arabic speech data along with the corresponding transcriptions. The pre-trained model is then fine-tuned for the downstream speech tasks, namely ASR and TTS. Through thorough evaluation, we highlight the significance of language-specific pre-training in enhancing downstream performance. In the second stage, we unify the training procedures for ASR and TTS.
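The cross-modal interface described above can be illustrated with a minimal NumPy sketch: each encoder hidden state is quantized to its nearest codebook entry, and a random fraction of states is replaced by those latent units before reaching the decoder. Function and parameter names here are illustrative and do not come from the SpeechT5 codebase.

```python
import numpy as np

def cross_modal_mixup(states, codebook, mix_prob=0.3, rng=None):
    """Sketch of cross-modal vector quantization as an encoder-decoder
    interface: quantize each hidden state (T, D) to its nearest codebook
    entry (K, D), then randomly swap a fraction of states for those
    shared latent units. Illustrative only, not the thesis's exact code."""
    rng = np.random.default_rng(rng)
    # squared Euclidean distance from every state to every codebook entry
    dists = ((states[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    quantized = codebook[dists.argmin(axis=1)]           # (T, D)
    # randomly mix quantized latent units into the state sequence
    mask = rng.random(len(states)) < mix_prob            # (T,)
    return np.where(mask[:, None], quantized, states)
```

Because the same codebook quantizes both speech and text states, the latent units act as a modality-agnostic meeting point for the two streams.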
We achieve this by developing a unified automatic speech recognition and synthesis model that employs a shared transformer encoder, a task-specific decoder, and six auxiliary networks, all trained concurrently with a combined loss objective. Our evaluation demonstrates that the performance of our model is comparable to that of individually trained models. We further show that initializing with the pre-trained weights from the first stage improves the model's performance."
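The combined loss objective for concurrent ASR and TTS training can be sketched as a weighted sum of a recognition term and a synthesis term. The particular loss choices below (token-level cross-entropy for ASR, L1 mel-spectrogram reconstruction for TTS) and the weights are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

def combined_loss(asr_logits, asr_targets, tts_pred, tts_target,
                  asr_weight=1.0, tts_weight=1.0):
    """Joint objective for concurrent ASR + TTS training: a weighted sum
    of an ASR cross-entropy term and a TTS L1 reconstruction term.
    An illustrative sketch, not the thesis's exact implementation."""
    # ASR: cross-entropy over (T, V) logits against (T,) target token ids
    log_probs = asr_logits - np.log(np.exp(asr_logits).sum(-1, keepdims=True))
    asr_loss = -log_probs[np.arange(len(asr_targets)), asr_targets].mean()
    # TTS: mean absolute error between predicted and reference mel frames
    tts_loss = np.abs(tts_pred - tts_target).mean()
    return asr_weight * asr_loss + tts_weight * tts_loss
```

Training both tasks against one objective lets the shared encoder receive gradients from recognition and synthesis simultaneously, which is what allows a single model to replace two separately trained networks.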


Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Hao Li, Hanan Aldarmaki

Online access available for MBZUAI patrons