Rebooting Language Models for Speech

Degree Name

Master of Science in Natural Language Processing


Natural Language Processing

First Advisor

Hanan Aldarmaki

Second Advisor

Gus Xia


Integrating speech directly into the text domain has significantly improved on the traditional two-step pipeline of first converting speech to text and then processing the text. Recent publications even integrate large language models to recognize context in the speech modality. Most methods either use the output of an intermediate layer of a pre-trained model or place speech hidden representations directly in the text embedding space. There is potential in alternative approaches that query text information directly from the context of speech representations; such approaches present opportunities for efficiency improvements, such as reduced storage needs and parameter-efficient computation. In this study, we propose a new training protocol for speech that uses speech codes from the neural EnCodec model in Automatic Speech Recognition and Automatic Speech Translation tasks, reframing sequence classification objectives as generative ones. Our experiments on the LibriSpeech dataset reveal that the proposed method is effective, though it encounters some challenges in accurately matching the target text. By evaluating the model's performance against established benchmarks, we infer that the generated outputs correlate highly with the semantic representation of the gold-standard labels.


Thesis submitted to the Deanship of Graduate and Postdoctoral Studies in partial fulfilment of the requirements for the M.Sc. degree in Natural Language Processing. Advisors: Hanan Aldarmaki, Gus Xia. Subject to a 2-year embargo period.

This document is currently not available here.