Diversity Assessment of Synthetically Generated Medical Datasets
Date of Award
4-30-2024
Document Type
Thesis
Degree Name
Master of Science in Machine Learning
Department
Machine Learning
First Advisor
Dr. Karthik Nandakumar
Second Advisor
Dr. Mohammad Yaqub
Abstract
Recent advancements in generative modeling have significantly pushed forward the capabilities of synthetic medical image generation, offering promising avenues for enhancing the training of machine learning models in medical diagnostics. Such synthetic images can augment existing datasets, potentially leading to more robust and accurate diagnostic models by providing a richer variety of training examples. However, a critical aspect that often remains underexplored is the diversity of these generated images. Ensuring a wide range of variability in synthetic images is crucial for developing models that can generalize well to unseen real-world cases. In this context, we introduce the SDICE index, a novel metric designed to quantify the diversity of synthetic medical images in relation to a benchmark set of real images. The SDICE index leverages the concept of similarity distributions, which are generated by comparing images using a contrastive encoder pre-trained on a relevant task. This encoder produces similarity scores for pairs of images, capturing the nuanced differences and similarities that are critical in medical imaging. By analyzing the distribution of these similarity scores, the SDICE index offers a comprehensive measure of how closely the diversity of a synthetic dataset mirrors that of a real dataset. To ensure the SDICE index provides a reliable and interpretable metric that can be consistently applied across different medical imaging domains, we normalize the calculated distances between similarity score distributions using an exponential function. This normalization process ensures that the SDICE index values are bounded and can be straightforwardly compared across different studies and datasets. 
To validate the efficacy and utility of the SDICE index, we conducted extensive experiments on two widely recognized datasets: the MIMIC chest X-ray dataset, which is pivotal in numerous medical imaging studies, and the ImageNet dataset, to demonstrate the index’s applicability beyond medical imaging. Our results underscore the SDICE index’s potential as a critical tool for researchers and practitioners in assessing and ensuring the diversity of synthetic datasets, thereby contributing to the development of more generalizable and effective machine learning models in medical image analysis.
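The pipeline described in the abstract (pairwise similarity scores, a distance between the resulting score distributions, and exponential normalization) can be illustrated with a minimal sketch. This is not the thesis implementation. The abstract does not state which distance is used, which image pairs are compared, or how the exponential is parameterized, so the sketch assumes cosine similarity on precomputed encoder embeddings, an approximate 1-D Wasserstein distance, a comparison of real-real pairs against real-synthetic pairs, and a hypothetical scale parameter `alpha`. All function names are illustrative.

```python
import math
from itertools import combinations, product


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def quantiles(sorted_values, n_points):
    """Linearly interpolated quantiles at n_points evenly spaced levels in [0, 1]."""
    last = len(sorted_values) - 1
    out = []
    for i in range(n_points):
        pos = last * i / (n_points - 1)
        lo = math.floor(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac)
    return out


def distribution_distance(scores_a, scores_b, n_points=101):
    """Assumed distance: mean absolute gap between the two quantile functions,
    which approximates the 1-D Wasserstein distance."""
    qa = quantiles(sorted(scores_a), n_points)
    qb = quantiles(sorted(scores_b), n_points)
    return sum(abs(x - y) for x, y in zip(qa, qb)) / n_points


def diversity_index(real_emb, synth_emb, alpha=1.0):
    """Map the distance between similarity-score distributions into (0, 1].

    real_emb and synth_emb are lists of embedding vectors, assumed to come
    from a pre-trained contrastive encoder. alpha is a hypothetical scale.
    """
    # Reference distribution: similarity between distinct real images.
    real_scores = [cosine_similarity(u, v) for u, v in combinations(real_emb, 2)]
    # Comparison distribution: similarity between real and synthetic images.
    cross_scores = [cosine_similarity(u, v) for u, v in product(real_emb, synth_emb)]
    distance = distribution_distance(real_scores, cross_scores)
    # Exponential normalization: distance 0 gives 1, larger distances approach 0.
    return math.exp(-alpha * distance)
```

Under these assumptions, a value near 1 means the two similarity distributions nearly coincide, and the value decays toward 0 as they diverge. The function needs at least two real embeddings and at least one synthetic embedding, none of them zero vectors.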
Recommended Citation
M. Alam, "Diversity Assessment of Synthetically Generated Medical Datasets," Apr. 2024.
Comments
Thesis submitted to the Deanship of Graduate and Postdoctoral Studies
In partial fulfilment of the requirements for the M.Sc. degree in Machine Learning
Advisors: Karthik Nandakumar, Mohammad Yaqub
With a 1-year embargo period