Resilience of Vision-Language Models on Object-to-Background Compositional Changes
Date of Award
4-30-2024
Document Type
Thesis
Degree Name
Master of Science in Machine Learning
Department
Machine Learning
First Advisor
Dr. Fahad Khan
Second Advisor
Dr. Salman Khan
Abstract
Given the large-scale multi-modal training of recent vision-based models and their generalization capabilities, understanding the extent of their robustness is critical for their real-world deployment. In this work, our goal is to evaluate the resilience of current vision-based models against diverse object-to-background context variations. The majority of robustness evaluation methods have introduced synthetic datasets to induce changes to object characteristics (viewpoints, scale, color) or utilized image transformation techniques (adversarial changes, common corruptions) on real images to simulate shifts in distributions. Recent works have explored leveraging large language models and diffusion models to generate changes in the background. However, these methods either offer little control over the changes to be made or distort the object semantics, making them unsuitable for the task. Our method, on the other hand, can induce diverse object-to-background changes while preserving the original semantics and appearance of the object. To achieve this goal, we harness the generative capabilities of text-to-image, image-to-text, and image-to-segment models to automatically generate a broad spectrum of object-to-background changes. We induce both natural and adversarial background changes by either modifying the textual prompts or optimizing the latents and textual embeddings of text-to-image models. This allows us to quantify the role of background context in understanding the robustness and generalization of deep neural networks. We produce various versions of standard vision datasets (ImageNet, COCO), incorporating either diverse and realistic backgrounds into the images or introducing color, texture, and adversarial changes in the background. We conduct thorough experimentation and provide an in-depth analysis of the robustness of vision-based models against object-to-background context variations across different tasks.
Recommended Citation
M. Huzaifa, "Resilience of Vision-Language Models on Object-to-Background Compositional Changes," M.Sc. thesis, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Apr 2024.
Additional Links
https://mbzuaiac-my.sharepoint.com/:b:/g/personal/libraryservices_mbzuai_ac_ae/EXImNfxJgW9Oi6kCCIFxEkIBgLCloPTj1NThjbAzfpSmwQ?e=9c8TdQ
Comments
Thesis submitted to the Deanship of Graduate and Postdoctoral Studies
In partial fulfilment of the requirements for the M.Sc. degree in Machine Learning
Advisors: Fahad Khan, Salman Khan
Online access available for MBZUAI patrons