Unknown Language Identification with Transformer Architecture

Date of Award


Document Type


Degree Name

Master of Science in Natural Language Processing


Natural Language Processing

First Advisor

Prof. Timothy Baldwin

Second Advisor

Prof. Muhammad Abdul-Mageed


This thesis addresses the complex challenge of identifying unknown languages by leveraging pre-trained models across three main transformer architectures: encoder-only, decoder-only, and encoder-decoder. Through the novel application of contrastive learning and thresholding techniques, we significantly enhance the performance of encoder-only models. Additionally, we employ prompt engineering strategies to optimize decoder-only and encoder-decoder models, demonstrating their critical role in maintaining model effectiveness. Our comprehensive analysis reveals that, while decoder-only models excel in tasks involving datasets of unknown languages, they are susceptible to overfitting. In contrast, encoder-decoder models emerge as the most reliable, delivering consistently superior average performance. Notably, our study finds that model size does not have a direct impact on performance; however, the diversity of languages included in pre-training plays a significant role.


Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc. degree in Natural Language Processing

Advisors: Timothy Baldwin, Muhammad Abdul-Mageed

Online access available for MBZUAI patrons