Robust Video Dubbing: Improved Audio Classification, Recognition, and Separation in the Multilingual Setting

This thesis introduces an automatic video dubbing (AVD) framework designed to address the challenges of localizing and dubbing video content in low-resource languages. The proposed cascaded pipeline comprises model-agnostic ASR, MT, and TTS components, and integrates HCI design principles along with human-in-the-loop components to robustly correct cascaded errors. As a result, our AVD system improves upon end-to-end dubbing techniques in terms of synchronization and cascading-error mitigation. In the context of multilingual dubbing, we propose three components to further automate and enhance the quality of the dubbing platform. A pivotal aspect of the proposed framework is a language and demographic identification module capable of accurately identifying age, gender, and language across a diverse range of multilingual speakers. We achieve this by employing medium-to-late transformer layers of the self-supervised XLS-R and WavLM audio models, combined with ordinal classification and a proposed joint loss that couples a contrastive triplet loss with a non-contrastive Barlow Twins loss. Additionally, we contribute a new semi-balanced dataset, CVLAG, derived from the CommonVoice project. Leveraging these techniques, we surpass benchmark results from similar studies that cover fewer languages. Another crucial component of the pipeline is improved ASR performance for low-resource languages. We present experimental results for Kyrgyz and Bulgarian, two languages that have not been extensively reported on in the literature. We train ASR models based on the state-of-the-art XLS-R and Whisper models and report Word Error Rates (WERs) for these languages. Moreover, we propose a novel technique, "Cross-Domain Boosting", which exploits large language models (LLMs) to improve low-resource ASR. Building on recent prompt engineering efforts, we adapt them to this task and provide comprehensive ablation studies across LLMs such as GPT-3.5, GPT-4, and Davinci.
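The joint objective mentioned above, coupling a contrastive triplet loss with a non-contrastive Barlow Twins loss, could be combined as a simple weighted sum. The sketch below is illustrative only, assuming PyTorch; the weight `alpha`, the margin, and the choice of anchor/positive embeddings as the two Barlow Twins "views" are assumptions for illustration, not the thesis's exact formulation.

```python
import torch
import torch.nn.functional as F

def barlow_twins_loss(z1, z2, lam=5e-3):
    # Standardize each embedding dimension across the batch.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
    n, d = z1.shape
    c = (z1.T @ z2) / n  # cross-correlation matrix, shape (d, d)
    # Push on-diagonal entries toward 1 (invariance) and
    # off-diagonal entries toward 0 (redundancy reduction).
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lam * off_diag

def joint_loss(anchor, positive, negative, margin=0.2, alpha=0.5):
    # Contrastive triplet term: pull anchor toward positive, away from negative.
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    # Non-contrastive Barlow Twins term, treating anchor/positive as two views.
    bt = barlow_twins_loss(anchor, positive)
    return triplet + alpha * bt
```

In practice the two terms complement each other: the triplet term shapes inter-class distances while the Barlow Twins term discourages collapsed, correlated embedding dimensions.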
We investigate the impact of prompt design, N-shot examples, temperature, and best-of-N sampling on LLM-based ASR correction performance. This approach shows promise for enhancing ASR performance, particularly when a language has more resources in the text modality. Non-speech audio separation is the final component we examine as a critical element of the video dubbing pipeline. We present a simple yet effective technique for transferring state-of-the-art music instrument separation models directly to the distinct task of extracting non-speech audio from mixed audio signals. Our contribution extends to a qualitative ablation study showing that this transfer is an emergent capability of the most recent HT-Demucs music instrument separation model. Incorporating these techniques, this thesis delivers a resilient multilingual video dubbing platform and insights into the obstacles and prospects of video dubbing, a technology with the potential to connect people and transcend language boundaries.
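As one illustration of best-of-N sampling for LLM-based ASR correction, the sketch below samples N candidate corrections at a given temperature and keeps the one most similar to the original hypothesis, a conservative criterion that discourages hallucinated rewrites. The `llm` callable, the prompt wording, and the similarity-based selection are assumptions for illustration, not the thesis's actual setup.

```python
import difflib

def correct_transcript(hypothesis: str, llm, n: int = 5, temperature: float = 0.7) -> str:
    """Best-of-N LLM correction of an ASR hypothesis.

    `llm` is a hypothetical callable: llm(prompt, temperature=...) -> str.
    """
    prompt = (
        "The following is a possibly erroneous ASR transcript. "
        "Rewrite it with the errors corrected, changing as little as possible:\n"
        f"{hypothesis}"
    )
    # Sample N candidate corrections from the LLM.
    candidates = [llm(prompt, temperature=temperature) for _ in range(n)]
    # Keep the candidate with the highest character-level similarity
    # to the hypothesis (one possible selection criterion among many).
    return max(
        candidates,
        key=lambda c: difflib.SequenceMatcher(None, hypothesis, c).ratio(),
    )
```

Other selection criteria, e.g. scoring candidates with a language model or with the ASR model's own acoustic scores, would slot into the same loop.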
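The transfer idea, repurposing a music instrument separator to extract non-speech audio, can be sketched by summing every separated stem except the vocal one. The stem names below follow HT-Demucs conventions ('drums', 'bass', 'other', 'vocals'), but the helper itself is a hypothetical illustration, not the thesis's implementation.

```python
import torch

def extract_non_speech(stems: dict) -> torch.Tensor:
    """Given per-stem waveforms from a music separator (e.g. HT-Demucs),
    treat the 'vocals' stem as speech and return the sum of all other
    stems as the non-speech track."""
    return sum(wave for name, wave in stems.items() if name != "vocals")
```

The recovered non-speech track can then be mixed back under the newly synthesized TTS speech during dubbing.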

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Dr. Le Song, Dr. Bin Gu

With a 2-year embargo period
