Self-supervised Representation Learning of Multi-omics Cancer Data

Next Generation Sequencing has given us access to vast amounts of multi-omics data. However, this data is challenging to analyse because much of it is not annotated. Lack of annotated data is a significant problem in machine learning, and Self-Supervised Learning (SSL) methods are typically used to deal with limited labelled data. Nevertheless, there is a lack of studies that use SSL methods to exploit inter-omics relationships in unlabelled multi-omics data. Moreover, high dimensional omics data contains intrinsic information that is crucial for personalised medicine but difficult to capture because of the large number of molecular features and the small number of available samples. Different types of omics data reveal different aspects of the same samples, so integrating and analysing multi-omics data gives a broad view of tumours, which can improve clinical decision-making. Omics data, mainly DNA methylation and gene expression profiles, are usually high dimensional, with many molecular features.

We developed multiple SSL methods to exploit inter-omics relationships and improve cancer type classification performance on the TCGA pan-cancer dataset with limited labelled data. In one of our works, SubOmiEmbed, we extended the idea of using a variational autoencoder (VAE) for low dimensional latent space extraction with the self-supervised learning technique of feature subsetting. With VAEs, the key idea is to make the model learn meaningful representations from different types of omics data, which can then be used for downstream tasks such as cancer type classification. The main goals are to overcome the curse of dimensionality and to integrate methylation and expression data, combining information about different aspects of the same tissue samples and, ideally, extracting biologically relevant features. Our extension involves training the encoder and decoder to reconstruct the data from just a subset of it, which forces the model to encode the most important information in the latent representation. We also added an identity to each subset so that the model knows which subset is being fed into it during training and testing. In our experiments, SubOmiEmbed produced results comparable to the baseline OmiEmbed [65] with a much smaller network and using just a subset of the data. This work can be extended to integrate mutation-based genomic data.

In another work, Self-omics, we developed a novel and efficient pre-training paradigm that consists of various SSL components, including but not limited to contrastive alignment, data recovery from corrupted samples, and using one type of omics data to recover other omics types. This pre-training paradigm improved performance on downstream tasks with limited labelled data. We showed that our approach outperforms the state-of-the-art method in cancer type classification on the TCGA pan-cancer dataset in a semi-supervised setting. Moreover, we showed that encoders pre-trained using our approach can be used as powerful feature extractors even without fine-tuning. Our ablation study showed that the method is not overly dependent on any single pretext task component. The network architectures in our approach are designed to handle missing omics types and multiple datasets for pre-training and downstream training. Our pre-training paradigm can be extended to perform zero-shot classification of rare cancers.
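
The feature-subsetting idea described above for SubOmiEmbed can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the contiguous split of features, and the one-hot encoding of the subset identity are all assumptions made for illustration, and the random matrix is a toy stand-in for real omics data. The sketch only shows how an encoder input could be built from one subset plus its identity while the reconstruction target remains the full feature vector.

```python
import numpy as np


def split_feature_indices(n_features, n_subsets):
    """Partition feature indices into contiguous, near-equal groups (assumed scheme)."""
    return np.array_split(np.arange(n_features), n_subsets)


def subset_input(x, groups, subset_id):
    """Build an encoder input from one feature subset plus a one-hot subset identity.

    x has shape (n_samples, n_features). The one-hot identity is a hypothetical
    way of telling the model which subset it is looking at.
    """
    part = x[:, groups[subset_id]]
    identity = np.zeros((x.shape[0], len(groups)), dtype=x.dtype)
    identity[:, subset_id] = 1.0
    return np.concatenate([part, identity], axis=1)


rng = np.random.default_rng(0)
x = rng.random((4, 10)).astype(np.float32)  # toy stand-in for an omics matrix
groups = split_feature_indices(x.shape[1], 3)

encoder_input = subset_input(x, groups, subset_id=1)
reconstruction_target = x  # the decoder is asked to recover all features
```

In this sketch the subsets can differ in size, so a fixed-width encoder would additionally need equal-sized subsets or padding; that detail is left out here.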

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Dr. Mohammad Yaqub, Dr. Karthik Nandakumar

Online access provided for MBZUAI patrons