Text-to-Image Diffusion with Complex and Detailed Prompts
Date of Award
4-30-2024
Document Type
Thesis
Degree Name
Master of Science in Machine Learning
Department
Machine Learning
First Advisor
Dr. Salman Khan
Second Advisor
Dr. Fahad Khan
Abstract
"Diffusion-based generative models have significantly ascended to the prominence providing massive leap forward in the field of computer vision, emerging as a key tool to unlock the vast potential of generative AI capabilities. Among their extensive applications in several potential fields, their ability to generate high quality images from textual prompts is remarkable. This intersection of text and image synthesis has the potential to revolutionize content creation, design, and various other domains where conveying complex visual concepts from textual descriptions is paramount. However, despite their remarkable achievements, diffusion-based generative models encounter notable hurdles when tasked with processing lengthy and intricate textual prompts. These challenges become particularly pronounced when describing scenes with multiple objects, intricate attributes, and nuanced contextual details. While these models excel in faithfully generating images from succinct, single-object descriptions, they often struggle to capture the richness and complexity inherent in longer textual inputs. This limitation poses a significant barrier, hindering their ability to accurately translate intricate textual descriptions into coherent visual representations. To mitigate these issues, in this work, we present a novel training-free approach leveraging a Large Language Models (LLMs) to extract critical components from text prompts, including bounding box coordinates for foreground objects, detailed textual descriptions for individual objects, and a succinct background context. These components form the foundation of our layout-to-image generation model, which operates in two phases. The initial Global Scene Generation utilizes object layouts and background context to create an initial scene but often falls short in faithfully representing object characteristics as specified in the prompts. To address this limitation, we introduce an Iterative Refinement Scheme that iteratively evaluates and refines box-level content to align them with their textual descriptions, recomposing objects as needed to ensure consistency. Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models. This is further validated by a user study, underscoring the efficacy of our approach in generating coherent and detailed scenes from intricate textual inputs. Our iterative framework offers a promising solution for enhancing text-to-image generation models' fidelity with lengthy, multifaceted descriptions, opening new possibilities for accurate and diverse image synthesis from textual inputs."
Recommended Citation
M. Gani, "Text-to-Image Diffusion with Complex and Detailed Prompts," Apr. 2024.
Comments
Thesis submitted to the Deanship of Graduate and Postdoctoral Studies
In partial fulfilment of the requirements for the M.Sc. degree in Machine Learning
Advisors: Salman Khan, Fahad Khan
Online access available for MBZUAI patrons