Rebooting Language Models for Speech
Date of Award
4-30-2024
Document Type
Thesis
Degree Name
Master of Science in Natural Language Processing
Department
Natural Language Processing
First Advisor
Hanan Aldarmaki
Second Advisor
Gus Xia
Abstract
Integrating speech directly into the text domain has significantly improved upon the traditional two-step process of converting speech to text and then processing the text. Recent publications even showcase integrating Large Language Models to recognize context from the speech modality. Most methods either employ intermediate-layer outputs of pre-trained models or place speech hidden representations directly in the text embedding space; approaches that instead query textual information from the context of speech representations remain relatively unexplored and offer opportunities for efficiency gains, such as reduced storage needs and parameter-efficient computation. In this study, we propose a new training protocol for speech that utilizes discrete speech codes from the neural EnCodec model for Automatic Speech Recognition and Automatic Speech Translation tasks, reframing the sequence classification objective as a generative one. Our experiments on the LibriSpeech dataset reveal that the proposed method is effective, though it encounters some challenges in exactly matching the target text. Evaluating the model's performance against established benchmarks, we find that the generated outputs correlate strongly with the semantics of the gold-standard labels.
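Concretely, the protocol first maps audio to discrete codes with EnCodec and then trains a language model to generate the transcript (or translation) conditioned on those codes. The following is a minimal sketch of the code-extraction step, assuming the open-source `encodec` package; the file name, bandwidth setting, and flattening scheme are illustrative assumptions, not the thesis's exact pipeline.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the pre-trained 24 kHz EnCodec model and pick a target bandwidth,
# which determines how many codebooks (quantizer levels) are used.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # 6 kbps -> 8 codebooks (illustrative choice)

# Load one utterance (e.g. from LibriSpeech; path is a placeholder)
# and resample/remix to the model's expected rate and channel count.
wav, sr = torchaudio.load("utterance.flac")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

# Encode to discrete codes; the result has shape [batch, n_codebooks, n_frames].
with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))
codes = torch.cat([codes for codes, _ in encoded_frames], dim=-1)

# Serializing the codes (here, a naive flatten) yields a token sequence a
# decoder-only LM can consume, with the transcript appended as the generative
# target rather than a per-frame classification label.
speech_tokens = codes.flatten().tolist()
```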
Recommended Citation
A. Djanibekov, "Rebooting Language Models for Speech," M.Sc. thesis, Apr. 2024.
Comments
Thesis submitted to the Deanship of Graduate and Postdoctoral Studies in partial fulfilment of the requirements for the M.Sc. degree in Natural Language Processing. Advisors: Hanan Aldarmaki, Gus Xia. Subject to a 2-year embargo period.