LLM-Driven Video Background Music Generation Through Text Processing

Date of Award


Document Type


Degree Name

Master of Science in Machine Learning


Machine Learning

First Advisor

Dr. Gus Xia


"This thesis introduces Herrmann-1, a framework that automatically generates background music tailored to video content, with a particular focus on movie scenes. Named after the composer Bernard Herrmann, the framework combines computer vision, natural language processing, musicology, and speech analysis to align music with both the narrative and the emotional dynamics of a video. It targets a familiar problem: selecting music by hand is time-consuming and often yields generic tracks that do not match the video’s mood and themes. Herrmann-1 first extracts and analyzes the visual and auditory elements of a video and then performs an emotional analysis. These elements are converted into descriptive texts, which guide a large language model (GPT-4) in establishing music-related conditions. A novel text-to-music model then uses these conditions to generate a soundtrack that matches the video’s emotional and thematic content. By generating tailored music automatically, the framework reduces the time content creators traditionally spend on music selection. The thesis also develops a music generation model aimed at significant challenges in existing systems: they do not reach professional-grade music quality, and the coherence of their tracks drops noticeably beyond 30 seconds. In addition, training such systems on copyrighted music raises legal complexities that make widespread use problematic.
Music generation systems follow three primary approaches. The first trains a generative model, such as a transformer, on a substantial dataset of raw audio so that it can generate new music. The second trains a generative model on MIDI data; the model produces new MIDI sequences that are then synthesized into music. The third builds full songs by layering pre-recorded music loops both vertically and horizontally. The advantage of the third approach is that the loops are professionally recorded, which ensures the quality of the resulting music. Building on this premise, this thesis focuses on pre-recorded music samples. Specifically, it addresses vertical compatibility, the problem of selecting loops that produce a cohesive sound when played together, and achieves a 15.7% improvement over the current best methods. The resulting pipeline for video background music generation achieves high relevance and quality by integrating and adapting existing models, which eliminates the need to train new models from scratch. This work is also an initial step toward a high-quality music generation system and toward broader application of AI in multimedia content creation. Samples are available at audiomatic-research.github.io/herrmann-1/."
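
To make the third approach in the abstract concrete, the following is a minimal illustrative sketch of layering loops vertically (sounding at the same time) and horizontally (one section after another). It is not the thesis's implementation: the function names, the sample rate, and the sine tones standing in for professionally recorded loops are all assumptions made for illustration, and the sketch does not attempt the loop-compatibility selection that the thesis studies.

```python
import numpy as np

SAMPLE_RATE = 8000  # hypothetical value, kept low so the example runs quickly


def sine_loop(freq_hz, seconds=1.0, amplitude=0.3):
    """Stand-in for a pre-recorded loop: a plain sine tone (assumption)."""
    t = np.arange(int(SAMPLE_RATE * seconds)) / SAMPLE_RATE
    return amplitude * np.sin(2 * np.pi * freq_hz * t)


def layer_vertically(loops):
    """Mix equal-length loops so that they sound at the same time."""
    lengths = {len(loop) for loop in loops}
    if len(lengths) != 1:
        raise ValueError("vertically layered loops must have equal length")
    mix = np.sum(loops, axis=0)
    peak = np.max(np.abs(mix))
    # Scale down only if the summed signal would clip outside [-1, 1].
    return mix / peak if peak > 1.0 else mix


def arrange_horizontally(sections):
    """Play sections one after another."""
    return np.concatenate(sections)


# Three hypothetical one-second loops.
bass = sine_loop(110.0)
chords = sine_loop(220.0)
lead = sine_loop(440.0)

intro = layer_vertically([bass, chords])         # two simultaneous layers
chorus = layer_vertically([bass, chords, lead])  # three simultaneous layers
song = arrange_horizontally([intro, chorus, chorus, intro])
```

In this sketch every combination of loops is accepted. A vertical-compatibility model of the kind the abstract describes would instead decide which loops are passed to the vertical mixing step.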


Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc degree in Machine Learning

Advisors: Gus Xia,

with 5 years embargo period

This document is currently not available here.