Diversity Assessment of Synthetically Generated Medical Datasets

Date of Award

4-30-2024

Document Type

Thesis

Degree Name

Master of Science in Machine Learning

Department

Machine Learning

First Advisor

Dr. Karthik Nandakumar

Second Advisor

Dr. Mohammad Yaqub

Abstract

Recent advances in generative modeling have greatly improved synthetic medical image generation, opening promising avenues for training machine learning models in medical diagnostics. Synthetic images can augment existing datasets and, by providing a richer variety of training examples, may lead to more robust and accurate diagnostic models. However, the diversity of the generated images often remains underexplored. Wide variability in synthetic images is crucial for developing models that generalize well to unseen real-world cases. We introduce the SDICE index, a novel metric that quantifies the diversity of synthetic medical images relative to a benchmark set of real images. The SDICE index builds on similarity distributions, which are obtained by comparing images with a contrastive encoder pre-trained on a relevant task. The encoder produces a similarity score for each pair of images, capturing the subtle differences and similarities that matter in medical imaging. By analyzing the distribution of these similarity scores, the SDICE index measures how closely the diversity of a synthetic dataset mirrors that of a real dataset. To make the index reliable, interpretable, and consistently applicable across medical imaging domains, we normalize the distances between similarity score distributions using an exponential function. This normalization bounds the SDICE index values, so they can be compared directly across studies and datasets.
To validate the SDICE index, we conducted extensive experiments on two widely used datasets: the MIMIC chest X-ray dataset, which is used in numerous medical imaging studies, and the ImageNet dataset, which demonstrates the index’s applicability beyond medical imaging. Our results underscore the potential of the SDICE index as a tool for researchers and practitioners to assess the diversity of synthetic datasets, thereby contributing to more generalizable and effective machine learning models in medical image analysis.
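The pipeline described in the abstract (pairwise similarity scores, a distance between two similarity distributions, and an exponential normalization) can be illustrated with a minimal sketch. This is not the thesis implementation. The abstract does not state which similarity measure, distribution distance, or scaling constant is used, so cosine similarity, a quantile-based approximation of the 1-D Wasserstein distance, the form exp(-alpha * d), and all function and parameter names below are assumptions made for illustration. Random vectors stand in for the output of the contrastive encoder.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def within_set_similarities(embeddings):
    """Similarity scores for all distinct pairs inside one set (e.g. real vs. real)."""
    n = len(embeddings)
    return [cosine_similarity(embeddings[i], embeddings[j])
            for i in range(n) for j in range(i + 1, n)]


def cross_set_similarities(set_a, set_b):
    """Similarity scores for all pairs across two sets (e.g. real vs. synthetic)."""
    return [cosine_similarity(a, b) for a in set_a for b in set_b]


def wasserstein_1d(xs, ys, n_quantiles=200):
    """Approximate 1-D Wasserstein distance by comparing empirical quantiles.

    Assumed distance; the abstract only says "distances between similarity
    score distributions".
    """
    xs, ys = sorted(xs), sorted(ys)

    def quantile(sorted_vals, q):
        idx = min(int(q * len(sorted_vals)), len(sorted_vals) - 1)
        return sorted_vals[idx]

    qs = [(i + 0.5) / n_quantiles for i in range(n_quantiles)]
    return sum(abs(quantile(xs, q) - quantile(ys, q)) for q in qs) / n_quantiles


def normalized_index(distance, alpha=1.0):
    """Map a non-negative distance into (0, 1]; alpha is a hypothetical scale factor."""
    return math.exp(-alpha * distance)


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 16
    # Stand-ins for encoder embeddings of real and synthetic images.
    real = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(30)]
    synthetic = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(30)]

    reference = within_set_similarities(real)          # real vs. real
    candidate = cross_set_similarities(real, synthetic)  # real vs. synthetic
    d = wasserstein_1d(reference, candidate)
    print("distance:", d, "index:", normalized_index(d))
```

Under these assumptions, an index near 1 means the real-versus-synthetic similarity distribution closely matches the real-versus-real one, and the index falls toward 0 as the two distributions diverge.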

Comments

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Karthik Nandakumar, Mohammad Yaqub

With a 1-year embargo period

This document is currently not available here.
