Unifying Protein Function Prediction via Text Matching

Date of Award


Document Type


Degree Name

Master of Science in Machine Learning


Machine Learning

First Advisor

Dr. Shangsong Liang

Second Advisor

Dr. Martin Takac


The vast increase in publicly available protein sequences has popularized the pretrain-then-finetune paradigm for predicting protein functions. This approach, however, relies on annotated protein data specific to each prediction task, which is a significant limitation because each protein function requires its own extensive finetuning. To overcome these challenges, this thesis proposes a novel method termed Protein prediction via Text Matching (ProTeM), which seeks to simplify and unify the protein function prediction process. The method converts numeric or categorical labels from various protein function datasets into textual instructions that carry rich semantic information. It combines Large Language Models (LLMs) with Protein Language Models (PLMs): the LLMs are used for their superior language understanding to discern connections among protein functions, thereby facilitating a more effective alignment between textual instructions and protein sequences.
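The label-to-text conversion and matching step described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the instruction templates, the function names (label_to_instruction, cosine, predict), and the toy three-dimensional vectors, which stand in for embeddings that an LLM and a PLM would produce. The abstract does not specify the actual templates, encoders, or matching objective used in ProTeM.

```python
import math


def label_to_instruction(task, label):
    """Turn a categorical or numeric label into a text instruction.

    The templates are hypothetical; the abstract does not give the wording.
    """
    # bool must be checked before int/float because bool is a subclass of int.
    if isinstance(label, bool):
        state = "is" if label else "is not"
        return f"This protein {state} {task}."
    if isinstance(label, (int, float)):
        return f"The {task} of this protein is {label:.2f}."
    return f"The {task} of this protein is {label}."


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def predict(protein_vec, candidates):
    """Return the candidate instruction whose embedding best matches the protein.

    `candidates` maps instruction text to its (assumed, precomputed) embedding.
    """
    return max(candidates, key=lambda text: cosine(protein_vec, candidates[text]))


# Toy stand-ins for text embeddings (LLM side) and a protein embedding (PLM side).
candidates = {
    label_to_instruction("subcellular location", "nucleus"): [0.9, 0.1, 0.0],
    label_to_instruction("subcellular location", "membrane"): [0.1, 0.9, 0.2],
}
protein_vec = [0.2, 0.8, 0.1]
best = predict(protein_vec, candidates)
print(best)
```

Because every task is phrased as text, one matching routine can in principle serve classification and regression-style labels alike, which is the unification the abstract points to; how well this works in practice depends on the encoders and training procedure, which this sketch does not model.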


Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Dr. Shangsong Liang, Dr. Martin Takac

Online access available for MBZUAI patrons