Unifying Protein Function Prediction via Text Matching
Date of Award
4-30-2024
Document Type
Thesis
Degree Name
Master of Science in Machine Learning
Department
Machine Learning
First Advisor
Dr. Shangsong Liang
Second Advisor
Dr. Martin Takac
Abstract
The rapid growth in publicly available protein sequences has popularized the pretrain-then-finetune paradigm for predicting protein functions. This approach, however, depends on annotated protein data specific to each prediction task, which is a significant limitation because each protein function requires its own extensive finetuning. To overcome this limitation, this thesis proposes a novel method termed Protein prediction via Text Matching (ProTeM), which seeks to simplify and unify the protein function prediction process. The method converts numeric or categorical labels from various protein function datasets into textual instructions that carry rich semantic information. It uses both Large Language Models (LLMs) and Protein Language Models (PLMs): the LLMs are employed for their superior language understanding capabilities to discern connections among protein functions, thereby facilitating a more effective alignment between textual instructions and protein sequences.
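The text-matching idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the instruction template, the function names (`label_to_instruction`, `cosine`, `predict`), and the two-dimensional toy vectors are all assumptions made for illustration. In practice the text embeddings would presumably come from an LLM and the protein embedding from a PLM.

```python
import math


def label_to_instruction(task, label):
    # Hypothetical template: turns a categorical label into a textual instruction.
    return f"The {task} of this protein is {label}."


def cosine(u, v):
    # Cosine similarity between two equal-length vectors (0.0 if either is all zeros).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict(protein_vec, instructions, embed_text):
    # Pick the instruction whose text embedding best matches the protein embedding.
    return max(instructions, key=lambda text: cosine(protein_vec, embed_text(text)))


# Toy stand-ins for encoder outputs (assumed values, not real model embeddings).
instructions = [
    label_to_instruction("subcellular location", label)
    for label in ("membrane", "nucleus")
]
toy_text_vecs = {instructions[0]: [1.0, 0.0], instructions[1]: [0.0, 1.0]}
protein_vec = [0.9, 0.1]

best = predict(protein_vec, instructions, toy_text_vecs.__getitem__)
print(best)
```

Because every task is phrased as text, the same matching step could in principle serve different prediction tasks by swapping the candidate instructions, which is the unification the abstract points to.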
Recommended Citation
X. Li, "Unifying Protein Function Prediction via Text Matching," Apr 2024.
Comments
Thesis submitted to the Deanship of Graduate and Postdoctoral Studies
In partial fulfilment of the requirements for the M.Sc degree in Machine Learning
Advisors: Dr. Shangsong Liang, Dr. Martin Takac
Online access available for MBZUAI patrons