Text-to-Image Diffusion with Complex and Detailed Prompts

Date of Award


Document Type


Degree Name

Master of Science in Machine Learning


Machine Learning

First Advisor

Dr. Salman Khan

Second Advisor

Dr. Fahad Khan


Diffusion-based generative models have risen to prominence, delivering a massive leap forward in computer vision and emerging as a key tool for unlocking the potential of generative AI. Among their many applications, their ability to generate high-quality images from textual prompts is remarkable. This intersection of text and image synthesis has the potential to revolutionize content creation, design, and other domains where conveying complex visual concepts from textual descriptions is paramount. However, despite these achievements, diffusion-based generative models encounter notable hurdles when processing lengthy and intricate textual prompts. The challenges become particularly pronounced for scenes with multiple objects, intricate attributes, and nuanced contextual details. While these models excel at faithfully generating images from succinct, single-object descriptions, they often struggle to capture the richness and complexity of longer textual inputs. This limitation is a significant barrier to accurately translating intricate textual descriptions into coherent visual representations. To mitigate these issues, we present a novel training-free approach that leverages Large Language Models (LLMs) to extract critical components from text prompts: bounding-box coordinates for foreground objects, detailed textual descriptions for individual objects, and a succinct background context. These components form the foundation of our layout-to-image generation model, which operates in two phases. The first phase, Global Scene Generation, uses the object layouts and background context to create an initial scene, but it often falls short of faithfully representing object characteristics as specified in the prompts.
To address this limitation, we introduce an Iterative Refinement Scheme that iteratively evaluates and refines box-level content to align it with its textual description, recomposing objects as needed to ensure consistency. Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall over baseline diffusion models, a result further validated by a user study, underscoring the efficacy of our approach in generating coherent and detailed scenes from intricate textual inputs. Our iterative framework offers a promising solution for improving the fidelity of text-to-image generation models on lengthy, multifaceted descriptions, opening new possibilities for accurate and diverse image synthesis from textual inputs.
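The two-phase pipeline described in the abstract can be sketched as follows. This is a minimal illustrative mock-up, not the thesis implementation: the function names, the layout schema, and the string-equality evaluator are all assumptions standing in for the real LLM extraction step, diffusion-based generation, and box-level content checking.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # normalized (x, y, w, h); assumed convention

@dataclass
class SceneLayout:
    """Components extracted from a long prompt (hypothetical schema)."""
    boxes: List[Box]
    object_descriptions: List[str]
    background: str

def parse_prompt_with_llm(prompt: str) -> SceneLayout:
    # Placeholder for the LLM extraction step; a real system would prompt an
    # LLM to emit bounding boxes, per-object descriptions, and a background.
    return SceneLayout(
        boxes=[(0.1, 0.5, 0.3, 0.4), (0.6, 0.5, 0.3, 0.4)],
        object_descriptions=["a red vintage car", "a brown dog"],
        background="a quiet suburban street at dusk",
    )

def global_scene_generation(layout: SceneLayout) -> Dict[Box, str]:
    # Phase 1: compose an initial scene from the layout. The "image" is mocked
    # as a map from each box to the content rendered there; the first object is
    # deliberately rendered wrong to exercise the refinement phase.
    scene = dict(zip(layout.boxes, layout.object_descriptions))
    scene[layout.boxes[0]] = "a blue car"  # simulated attribute drift
    return scene

def content_matches(rendered: str, description: str) -> bool:
    # Placeholder evaluator; a real system might use a VQA- or CLIP-style check.
    return rendered == description

def iterative_refinement(layout: SceneLayout, scene: Dict[Box, str],
                         max_iters: int = 3) -> Dict[Box, str]:
    # Phase 2: re-check each box against its description and recompose the
    # mismatched regions until everything aligns or the budget runs out.
    for _ in range(max_iters):
        mismatched = [
            (box, desc)
            for box, desc in zip(layout.boxes, layout.object_descriptions)
            if not content_matches(scene[box], desc)
        ]
        if not mismatched:
            break
        for box, desc in mismatched:
            scene[box] = desc  # stand-in for regenerating this box's content
    return scene

layout = parse_prompt_with_llm("a red vintage car and a brown dog on a street")
scene = iterative_refinement(layout, global_scene_generation(layout))
```

After refinement, every box's content agrees with its description; in a real system the loop would instead re-run box-level diffusion on the failing regions only.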


Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Salman Khan, Fahad Khan

Online access available for MBZUAI patrons