Unifying Protein Function Prediction via Text Matching

Master of Science in Machine Learning


Machine Learning

Dr. Shangsong Liang

Dr. Martin Takac


The rapid growth of publicly available protein sequences has popularized the pretrain-then-finetune paradigm for predicting protein functions. This approach, however, relies on annotated protein data specific to each prediction task, which is a significant limitation because each protein function requires its own extensive finetuning. To overcome this limitation, this thesis proposes a novel method termed Protein prediction via Text Matching (ProTeM), which seeks to simplify and unify the protein function prediction process. The method converts numeric or categorical labels from various protein function datasets into textual instructions that carry rich semantic information. It uses both Large Language Models (LLMs) and Protein Language Models (PLMs): the LLMs are employed for their strong language understanding capabilities to discern connections among protein functions, thereby facilitating a more effective alignment between textual instructions and protein sequences.


Advisors: Dr. Shangsong Liang, Dr. Martin Takac

