Zero-shot out-of-distribution cancer tissue classification via the Contextualized Classifier

Understanding inter-sample heterogeneity is important for comprehending complex biological processes. For instance, in the genomic analysis of cancers, each patient in a cohort may have a distinct driver mutation, which makes it difficult to identify causal mutations by averaging over the entire cohort. Conventional methods for genomic analysis nevertheless estimate a single model that applies to all samples in a population, disregarding inter-sample heterogeneity entirely. To better understand patient heterogeneity, practical, personalized statistical models are needed. Instead of fitting a single model to all patients, we implement the contextualized classifier, which assigns a simple, unique model to each patient based on contextual information such as clinical data. Tailoring each model to the patient's specific context improves performance for both in-distribution and out-of-distribution predictions. We compare our model to two baselines: a feature-only logistic regression (LR) classifier and a context-only multi-layer perceptron (MLP) classifier. The former takes gene expression, denoted $\mathbf{x}$, as input, while the latter takes the covariate or context data, denoted $\mathbf{c}$. The contextualized classifier has an encoder part and a sigmoid function part, which take $\mathbf{c}$ and $\mathbf{x}$ as inputs, respectively. Using the contexts allows modeling of highly individualized datasets in which there is only one sample per patient and each patient is considered to have a distinct model, leading to highly heterogeneous data.
We perform survival classification in two settings. First, in the in-distribution setting, we randomly split the data 80-20. The contextualized classifier outperformed the baseline models on the test set on all three evaluation metrics, namely binary cross-entropy (BCE) loss, AUC score, and accuracy: $\mathbf{0.57}$, $\mathbf{0.786}$, and $\mathbf{70.3}$\%, respectively, for the contextualized classifier; 1.02, 0.719, and 64.9\% for the feature-only LR; and 0.71, 0.631, and 59.7\% for the context-only MLP. Second, in the out-of-distribution setting, we hold out the samples of each primary site in turn as the test set and evaluate the BCE loss for each primary site. The results show consistent improvements in BCE loss for our model over the baseline models. The contextualized classifier therefore captures the heterogeneity in the data through context-specific models, which opens the way for personalized cancer analysis. Visualizing the patient-specific coefficient embeddings allows identification of sub-populations within a cohort, as demonstrated by the distinct localization of Kidney Chromophobe among kidney tissue samples in UMAP embeddings of the coefficients.
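
The structure described in the abstract, an encoder that maps the context $\mathbf{c}$ to patient-specific coefficients followed by a sigmoid applied to the features $\mathbf{x}$, can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the one-hidden-layer ReLU encoder, the per-patient intercept, the initialization, and all function names (`init_encoder`, `encode`, `predict_proba`, `bce_loss`, `leave_one_site_out`) are assumptions made only for illustration. The last helper sketches the out-of-distribution protocol of holding out one primary site at a time.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping logits to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


def init_encoder(context_dim, hidden_dim, feature_dim, seed=0):
    """Randomly initialize a one-hidden-layer encoder (assumed architecture)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (context_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        # The output holds feature_dim coefficients plus one intercept per sample.
        "W2": rng.normal(0.0, 0.1, (hidden_dim, feature_dim + 1)),
        "b2": np.zeros(feature_dim + 1),
    }


def encode(params, c):
    """Map contexts c of shape (n, context_dim) to per-sample coefficients."""
    h = np.maximum(c @ params["W1"] + params["b1"], 0.0)  # ReLU hidden layer
    out = h @ params["W2"] + params["b2"]
    beta, mu = out[:, :-1], out[:, -1]  # shapes (n, feature_dim) and (n,)
    return beta, mu


def predict_proba(params, c, x):
    """Sigmoid of each sample's own linear model applied to its features x."""
    beta, mu = encode(params, c)
    logits = np.sum(beta * x, axis=1) + mu
    return sigmoid(logits)


def bce_loss(y, p, eps=1e-12):
    """Mean binary cross-entropy between labels y and probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def leave_one_site_out(sites):
    """Yield (site, train_idx, test_idx), holding out one site at a time."""
    sites = np.asarray(sites)
    for site in np.unique(sites):
        is_test = sites == site
        yield site, np.where(~is_test)[0], np.where(is_test)[0]
```

Under these assumptions, the rows of `beta` returned by `encode` play the role of the patient-specific coefficients that the abstract describes embedding with UMAP. Training, for example by gradient descent on `bce_loss`, is omitted from the sketch.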

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc. degree in Computer Vision

Advisors: Prof. Eric Xing, Dr. Kun Zhang
