Automated Generation of Chest X-Ray Reports

In this work, we focus on (i) understanding the relative importance of the encoder and decoder components, and (ii) developing a new reward for REINFORCE-based model optimization to improve the clinical accuracy of the reports. We analyze four image encoding approaches (direct, fine-grained, CLIP-based, and Cluster-CLIP-based) in conjunction with three different decoders on the large-scale MIMIC-CXR dataset. Among these encoders, the Cluster-CLIP visual encoder is a novel approach that aims to generate more discriminative and explainable representations. CLIP-based encoders produce results comparable to traditional CNN-based encoders in terms of NLP metrics, while fine-grained encoding outperforms all other encoders in terms of both NLP and clinical accuracy metrics, thereby validating the importance of image encoders that extract semantic information effectively. We also propose a new reward for REINFORCE-based optimization. The reward relies on question-answering (QA) transformer models: the QA model selects the most relevant spans of the generated reports, and the report generation model is optimized with respect to those spans. The QA-based reward does not perform as well as existing rewards in REINFORCE-based optimization, but we outline its current weaknesses and propose modifications to improve it.
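The idea of a span-based reward can be illustrated with a small sketch. The abstract does not give the thesis's actual reward formulation, so everything below is an assumption made for illustration: the token-level F1 score, the `keyword_selector` stand-in for the QA transformer, and the function names are all hypothetical. The sketch only shows how a reward computed on selected spans could scale a REINFORCE surrogate loss.

```python
from collections import Counter
from typing import Callable, List


def token_f1(candidate: List[str], reference: List[str]) -> float:
    """Token-level F1 overlap between two token lists."""
    if not candidate or not reference:
        return 0.0
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def keyword_selector(report: str, keyword: str = "effusion") -> str:
    """Hypothetical stand-in for a QA model: return the first sentence
    that mentions the keyword, or an empty string if none does."""
    for sentence in report.split("."):
        if keyword in sentence:
            return sentence.strip()
    return ""


def span_reward(generated: str, reference: str,
                select_span: Callable[[str], str]) -> float:
    """Assumed reward: F1 between the spans selected from each report."""
    gen_span = select_span(generated).split()
    ref_span = select_span(reference).split()
    return token_f1(gen_span, ref_span)


def reinforce_loss(log_probs: List[float], reward: float,
                   baseline: float = 0.0) -> float:
    """REINFORCE surrogate loss for one sampled report:
    -(reward - baseline) * sum of the sampled tokens' log-probabilities."""
    return -(reward - baseline) * sum(log_probs)


generated = "heart size is normal. small left pleural effusion"
reference = "small left pleural effusion. no pneumothorax"
reward = span_reward(generated, reference, keyword_selector)
loss = reinforce_loss([-0.5, -1.5], reward, baseline=0.25)
```

In a real training loop the log-probabilities would come from the decoder and the loss would be backpropagated through them; plain floats are used here only to keep the sketch dependency-free.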

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Dr. Karthik Nandakumar, Mr. Mohammad Yaqub

Online access provided for MBZUAI patrons