Gene Pathogenicity Prediction using Genomic Foundation Models
Date of Award
4-30-2024
Document Type
Thesis
Degree Name
Master of Science in Machine Learning
Department
Machine Learning
First Advisor
Dr. Hanan Aldarmaki
Second Advisor
Dr. Kun Zhang
Abstract
The classification of pathogenicity in gene sequences stands at the forefront of advancements in medical genetics and personalized medicine. Accurate classification is imperative for understanding genetic disorders and crafting targeted treatments, yet traditional methodologies are often hampered by their need for extensive genomic attribute analysis and complex predictive models, which makes the process both intricate and computationally demanding. This research proposes a novel methodology that uses genomic foundation models, namely HyenaDNA, GenaLM, and Nucleotide Transformer (NT), to streamline the classification of gene sequences and mitigate the complexity and computational demands inherent in conventional methods. The aim is to make gene sequence classification for disease prediction more efficient and less reliant on extensive computational analyses. The methodology involved fine-tuning these models on genomic data to assess whether they achieve accuracy comparable to or surpassing that of existing pathogenicity classification techniques, without the auxiliary data traditionally required for such analyses. Furthermore, the research proposes two fine-tuning strategies. The first is a multiobjective approach in which models are trained to predict both pathogenicity and specific predictive scores that are critical in assessing variant pathogenicity. The second pretrains the model on the predictive scores alone and then uses the resulting weights as a better initialization for fine-tuning on pathogenicity. Although these specialized fine-tuning methods did not consistently outperform standard fine-tuning across all models, they were particularly advantageous for models such as GenaLM, which had less pre-training on genomic datasets.
Remarkably, the Nucleotide Transformer model exhibited exceptional performance, achieving an accuracy rate of 90%. These findings underscore the potential of leveraging large AI models to simplify the gene sequence classification process, significantly enhancing the efficiency of genetic analysis and reducing computational burdens. This approach not only contributes to advancements in the accuracy of classification but also highlights the utility of AI in genetic research, minimizing the dependency on extensive auxiliary datasets and fostering new avenues for the application of AI in genetics and personalized medicine.
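As a rough illustration of the multiobjective fine-tuning strategy mentioned in the abstract, the sketch below combines a pathogenicity classification loss with a regression loss on auxiliary predictive scores. The abstract does not state which losses, scores, or weighting were used, so the binary cross-entropy term, the mean-squared-error term, the function name multiobjective_loss, and the score_weight parameter are all assumptions made for illustration only.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def multiobjective_loss(pathogenic_logit, score_preds, label, score_targets,
                        score_weight=0.5):
    """Hypothetical joint loss for one sequence.

    pathogenic_logit: model output for the pathogenic/benign head
    score_preds:      model outputs for the auxiliary score head
    label:            1 for pathogenic, 0 for benign
    score_targets:    reference predictive scores (assumed to be available)
    score_weight:     assumed trade-off between the two objectives
    """
    eps = 1e-12  # guards against log(0)
    p = sigmoid(pathogenic_logit)
    # Binary cross-entropy on the pathogenicity label.
    bce = -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))
    # Mean-squared error over the auxiliary predictive scores.
    mse = sum((a - b) ** 2 for a, b in zip(score_preds, score_targets))
    mse /= len(score_targets)
    return bce + score_weight * mse


# Example: one pathogenic sequence with two auxiliary scores.
loss = multiobjective_loss(0.0, [1.0, 0.0], 1, [0.0, 0.0])
print(loss)
```

Under these assumptions, the second strategy would correspond to first minimizing only the score term, then reusing the resulting weights when training on the classification term.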
Recommended Citation
M. Sayeed, "Gene Pathogenicity Prediction using Genomic Foundation Models," Apr. 2024.
Comments
Thesis submitted to the Deanship of Graduate and Postdoctoral Studies
In partial fulfilment of the requirements for the M.Sc. degree in Machine Learning
Advisors: Hanan Aldarmaki, Kun Zhang
Online access available for MBZUAI patrons