Gene Pathogenicity Prediction using Genomic Foundation Models

Date of Award


Document Type


Degree Name

Master of Science in Machine Learning


Machine Learning

First Advisor

Dr. Hanan Aldarmaki

Second Advisor

Dr. Kun Zhang


Classifying the pathogenicity of gene sequences is central to medical genetics and personalized medicine. Accurate classification is essential for understanding genetic disorders and designing targeted treatments, yet traditional methods often depend on extensive analysis of genomic attributes and complex predictive models, which makes the process intricate and computationally demanding. This research proposes a novel methodology that uses genomic foundation models, namely HyenaDNA, GenaLM, and Nucleotide Transformer (NT), to streamline gene sequence classification for disease prediction and to reduce the complexity and computational cost of conventional methods. The models were fine-tuned on genomic data to assess whether they can match or surpass the accuracy of existing pathogenicity classification techniques without the auxiliary data those techniques traditionally require. The research further proposes two specialized fine-tuning strategies. The first is a multi-objective approach in which models are trained to predict both pathogenicity and specific predictive scores that are critical in assessing variant pathogenicity. The second pretrains the model on the scores alone and then uses the resulting weights as a better initialization for fine-tuning. Although these specialized strategies did not consistently outperform standard fine-tuning across all models, they were particularly advantageous for models such as GenaLM, which had less pre-training on genomic datasets.
The Nucleotide Transformer model performed notably well, achieving an accuracy of 90%. These findings underscore the potential of large AI models to simplify gene sequence classification, making genetic analysis more efficient and less computationally demanding. The approach contributes to more accurate classification, reduces the dependency on extensive auxiliary datasets, and opens new avenues for applying AI in genetics and personalized medicine.
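
The multi-objective strategy described in the abstract can be illustrated with a minimal sketch of a combined loss. The abstract does not state which loss functions or weighting the thesis used, so everything here is an assumption made for illustration: binary cross-entropy for the pathogenicity label, mean squared error for the auxiliary predictive scores, and a hypothetical `score_weight` parameter that balances the two terms. All function names are likewise hypothetical.

```python
import math


def binary_cross_entropy(logit, label):
    """Numerically stable binary cross-entropy for one example (label is 0 or 1)."""
    # Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def mean_squared_error(predicted, target):
    """Mean squared error over the auxiliary predictive scores."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def multi_objective_loss(path_logit, path_label, score_preds, score_targets,
                         score_weight=0.5):
    """Classification loss plus a weighted regression loss on the scores.

    score_weight is an assumed hyperparameter, not a value from the thesis.
    """
    classification = binary_cross_entropy(path_logit, path_label)
    regression = mean_squared_error(score_preds, score_targets)
    return classification + score_weight * regression


# Example: one variant with an uninformative logit and two auxiliary scores.
loss = multi_objective_loss(0.0, 1, [1.0, 3.0], [0.0, 1.0], score_weight=0.5)
```

Under the same assumptions, the second strategy would correspond to first minimizing the regression term alone and then continuing training on the classification term from the resulting weights.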


Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Hanan Aldarmaki, Kun Zhang

Online access available for MBZUAI patrons