ModalityBridge: A Foundational Model-Based Framework for Cross-Modal Content Generation and Video Summarization

Date of Award


Document Type


Degree Name

Master of Science in Machine Learning


Machine Learning

First Advisor

Dr. Muhammad Haris

Second Advisor

Dr. Shijian Lu


This thesis introduces a novel framework that leverages foundational models to convert content across modalities, mapping any form of input (image, text, video, or audio) to text and then back to any desired output modality. Central to the framework is the ability to understand and generate content through a process that mimics human cognition in interpreting and recreating media. Two principal applications are explored in depth: an image-to-image conversion that recreates images from their textual descriptions, akin to a sketch artist working from witness accounts, and a video-to-video summarization technique that condenses videos into coherent summaries via keyframe extraction, textual description, and synthesis. The core innovation lies in the subsequent steps: the textual descriptions of each scene's keyframes are combined into a single textual representation per scene. These scene texts are then clustered, with chronological and narrative coherence ensured by restricting clusters to adjacent scenes. Leveraging GPT-4, an image prompt is generated for each text cluster, reflecting the combined visual storyline of its scenes. A text-to-image model then translates these prompts into images, and stitching the images together into a summarized video constitutes the final step. To facilitate adoption and experimentation, a Python package encapsulating the functionalities of this framework has been developed, enabling easy integration into various projects. Additionally, a survey tool is introduced to evaluate the effectiveness of, and user satisfaction with, the video summaries produced by the framework, ensuring a user-centred approach to continuous improvement.
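The adjacency restriction on clustering can be illustrated with a minimal sketch. This is not the thesis implementation: the greedy left-to-right grouping, the word-overlap similarity, and the threshold value are all stand-ins (the real pipeline would presumably use embedding-based similarity over the GPT-generated scene texts), but the sketch shows how limiting merges to chronologically adjacent scenes keeps every cluster a contiguous segment of the video.

```python
# Illustrative sketch (assumptions, not the thesis code): group scene
# descriptions so that clusters may only contain adjacent scenes,
# preserving chronological and narrative coherence.

def jaccard(a: str, b: str) -> float:
    """Toy similarity: word-set Jaccard overlap, a stand-in for whatever
    text-embedding similarity a production pipeline would use."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def cluster_adjacent_scenes(descriptions, threshold=0.3):
    """Greedy left-to-right grouping: extend the current cluster while the
    next scene's text is similar enough, otherwise start a new cluster.
    Returns clusters as lists of scene indices; each cluster is contiguous."""
    clusters = [[0]]
    for i in range(1, len(descriptions)):
        if jaccard(descriptions[i - 1], descriptions[i]) >= threshold:
            clusters[-1].append(i)   # same narrative segment: merge
        else:
            clusters.append([i])     # narrative break: open a new cluster
    return clusters

scenes = [
    "a man walks into a kitchen and opens the fridge",
    "the man takes milk from the fridge in the kitchen",
    "a car drives down a rainy highway at night",
]
print(cluster_adjacent_scenes(scenes))  # the two kitchen scenes group together
```

Because merges are only ever considered between neighbours, a distant scene with similar content (e.g. a later return to the kitchen) can never be pulled into an earlier cluster, which is exactly the property that keeps the summarized video chronological.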
Supporting the research, a comprehensive dataset curated from YouTube videos has been compiled, enriched with annotations including detected scenes, textual descriptions of keyframes, audio clips, and corresponding text transcripts. In summary, this thesis presents a comprehensive framework for seamless conversion across media formats, spanning image-to-image recreation from textual descriptions and video-to-video summarization through keyframe extraction, textual description generation, clustering, and image synthesis, leveraging large language models and generative AI.


Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Muhammad Haris, Shijian Lu

Online access available for MBZUAI patrons