Towards Explainable and Controllable Vision-Language Models for Chest X-ray Imaging

Date of Award


Document Type


Degree Name

Master of Science in Machine Learning


Machine Learning

First Advisor

Dr. Mohammad Yaqub

Second Advisor

Dr. Kun Zhang


"While large-scale vision-language models (VLMs) show promise across various tasks, their application in safety-critical domains like medical imaging is hindered by a lack of explainability and of control over their outputs, a well-known challenge in deep learning (DL). This gap is widened by the limited research on explainable VLMs for healthcare. Similarly, text-to-image medical generative models, while promising, often produce unrealistic and anatomically inaccurate images because they rely solely on textual input and offer no control over spatial features, which limits their real-world usefulness. This work addresses these two problems by proposing methods for explainability in discriminative VLMs and for precise spatial control in generative VLMs for X-ray imaging. In our first work, on explainability, we analyze the performance of various explainable AI methods on a vision-language model, MedCLIP, to demystify its inner workings. We also provide a simple methodology to overcome the shortcomings of these methods. This work offers a new perspective on the explainability of a recent, well-known VLM in the medical domain and generalizes to other VLMs. To address the lack of medical realism in generative VLMs, we propose XReal, a novel controllable diffusion model that generates realistic chest X-ray images through precise control over anatomy and pathology location. Our lightweight method integrates spatial control into a pre-trained text-to-image diffusion model without fine-tuning, retaining the model's existing knowledge while enhancing its generation capabilities. XReal outperforms state-of-the-art X-ray diffusion models on quantitative and qualitative metrics, with gains of 13% in anatomy realism and 10% in pathology realism according to an expert radiologist evaluation. Our model holds promise for advancing research on realism in generative medical VLMs, offering greater precision and adaptability while inviting further exploration in this evolving field."


Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Mohammad Yaqub, Kun Zhang

Online access available for MBZUAI patrons