Machine Learning Dissertations and Theses

Gene Pathogenicity Prediction using Genomic Foundation Models

Mohammad Sayeed, Mohamed bin Zayed University of Artificial Intelligence

Date of Award

4-30-2024

Document Type

Thesis

Degree Name

Master of Science in Machine Learning

Department

Machine Learning

First Advisor

Dr. Hanan Aldarmaki

Second Advisor

Dr. Kun Zhang

Abstract

Classifying the pathogenicity of gene sequences is central to advances in medical genetics and personalized medicine. Accurate classification is essential for understanding genetic disorders and designing targeted treatments, yet traditional methods often require extensive genomic attribute analysis and complex predictive models, which makes the process both intricate and computationally demanding. This research proposes a methodology that uses genomic foundation models, namely HyenaDNA, GenaLM, and Nucleotide Transformer (NT), to streamline the classification of gene sequences for disease prediction and to reduce the complexity and computational demands of conventional methods. The models were fine-tuned on genomic data to assess whether they achieve accuracy comparable to or better than existing pathogenicity classification techniques, without the auxiliary data traditionally required for such analyses. The research further proposes two fine-tuning strategies. The first is a multi-objective approach in which models are trained to predict both pathogenicity and specific predictive scores that are critical in assessing variant pathogenicity. The second pretrains the model on the scores alone and then uses the resulting weights as a better initialization for fine-tuning. Although these specialized strategies did not consistently outperform standard fine-tuning across all models, they were particularly advantageous for models such as GenaLM, which had less pre-training on genomic datasets.
The Nucleotide Transformer model performed particularly well, achieving an accuracy of 90%. These findings underscore the potential of large AI models to simplify gene sequence classification, improving the efficiency of genetic analysis and reducing its computational burden. The approach contributes to more accurate classification, reduces the dependency on extensive auxiliary datasets, and opens new avenues for applying AI in genetics and personalized medicine.
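The multi-objective strategy described in the abstract can be illustrated with a minimal sketch of a combined loss. This is not the thesis's implementation: the loss form (binary cross-entropy plus a weighted mean squared error), the weight `alpha`, and all function and parameter names are assumptions made for illustration only.

```python
import math


def multiobjective_loss(logit, label, predicted_scores, target_scores, alpha=0.5):
    """Hypothetical combined loss for one variant.

    logit            -- raw model output for the pathogenic class
    label            -- 1 for pathogenic, 0 for benign
    predicted_scores -- model outputs for the auxiliary predictive scores
    target_scores    -- reference values of those scores
    alpha            -- assumed weight of the auxiliary term
    """
    # Numerically stable binary cross-entropy computed from the raw logit.
    bce = max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))
    # Mean squared error over the auxiliary score predictions.
    mse = sum((p - t) ** 2 for p, t in zip(predicted_scores, target_scores)) / len(target_scores)
    return bce + alpha * mse


# Example: an uninformative logit and one auxiliary score that is off by 1.0.
loss = multiobjective_loss(0.0, 1, [1.0], [0.0], alpha=0.5)
```

Under the same assumptions, the second strategy would first train on the score term alone and then reuse the resulting weights as the starting point for fine-tuning on the pathogenicity term.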

Comments

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Hanan Aldarmaki, Kun Zhang

Online access available for MBZUAI patrons

Recommended Citation

M. Sayeed, "Gene Pathogenicity Prediction using Genomic Foundation Models," Apr. 2024.
