Unknown Language Identification with Transformer Architecture
Date of Award
4-30-2024
Document Type
Thesis
Degree Name
Master of Science in Natural Language Processing
Department
Natural Language Processing
First Advisor
Prof. Timothy Baldwin
Second Advisor
Prof. Muhammad Abdul-Mageed
Abstract
This thesis addresses the complex challenge of identifying unknown languages by leveraging pre-trained models across three main transformer architectures: encoder-only, decoder-only, and encoder-decoder. Through the novel application of contrastive learning and thresholding techniques, we significantly enhance the performance of encoder-only models. Additionally, we employ prompt engineering strategies to optimize decoder-only and encoder-decoder models, demonstrating the critical role of prompt design in maintaining model effectiveness. Our comprehensive analysis reveals that contrastive learning and thresholding can effectively improve the performance of encoder-only models. While decoder-only models excel on datasets of unknown languages, they are susceptible to overfitting. In contrast, encoder-decoder models emerge as the most reliable, delivering consistently superior average performance. Notably, our study finds that model size does not directly determine performance; rather, the diversity of languages included in pre-training plays a significant role.
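For illustration only, the following is a minimal sketch of the thresholding idea mentioned in the abstract, assuming contrastively trained sentence embeddings are compared against per-language prototype embeddings; the function name, the 0.7 cutoff, and the prototype setup are placeholders for exposition, not the thesis's actual implementation.

    import numpy as np

    def classify_with_threshold(embedding, prototypes, labels, threshold=0.7):
        """Assign the label of the most similar language prototype,
        or 'unknown' if no similarity clears the threshold.
        (Illustrative sketch; threshold value is an assumption.)"""
        # Normalize so dot products become cosine similarities.
        emb = embedding / np.linalg.norm(embedding)
        protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
        sims = protos @ emb  # cosine similarity to each language prototype
        best = int(np.argmax(sims))
        # Below-threshold similarity signals a language unseen in training.
        return "unknown" if sims[best] < threshold else labels[best]

Under this scheme, a sentence from a language absent from pre-training tends to fall far from every prototype, so the threshold routes it to the "unknown" class rather than forcing a closed-set prediction.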
Recommended Citation
Q. Liao, "Unknown Language Identification with Transformer Architecture," M.Sc. thesis, MBZUAI, Apr. 2024.
Comments
Thesis submitted to the Deanship of Graduate and Postdoctoral Studies
In partial fulfilment of the requirements for the M.Sc. degree in Natural Language Processing
Advisors: Timothy Baldwin, Muhammad Abdul-Mageed
Online access available for MBZUAI patrons