Face Tracking Using Diffusion Model Generated Data

Date of Award


Document Type


Degree Name

Master of Science in Computer Vision


Computer Vision

First Advisor

Dr. Hao Li

Second Advisor

Dr. Martin Takac


Face tracking is crucial for creating photorealistic digital humans in high-end productions and virtual reality applications. Face tracking methods generally fall into two categories according to the capture setup: multi-view and monocular. This thesis focuses on the monocular setup, which uses only RGB images as input. Our approach tackles inherent limitations of current face tracking technologies, primarily the labor-intensive process of constructing training datasets and the inaccuracies associated with manual data labeling. By fine-tuning ControlNet on a pretrained Stable Diffusion model, we propose a method for generating synthetic face datasets that significantly boosts the efficiency and accuracy of face tracking systems. We then use ResNet50 to predict probabilistic facial landmarks, training with a Gaussian negative log-likelihood loss to account for uncertainty. The predicted landmarks, along with their associated uncertainties, serve as inputs for optimizing the 3D geometry of the head through a FLAME model fitting process. Our diffusion-model-generated dataset reduces the reliance on manual data collection and potentially decreases the biases associated with traditional methods. The landmark prediction and 3D fitting results demonstrate that diffusion-model-based datasets can achieve state-of-the-art performance in 3D face tracking by closely mimicking real-world conditions. Furthermore, we show that proper data augmentation improves the accuracy of the 3D geometry and helps handle extreme poses. Including uncertainty in landmark predictions also improves accuracy and yields a better representation of diverse expressions, regardless of the dataset used.
Our method not only enhances the realism and accuracy of 3D face reconstruction but also proposes a scalable solution to the data constraints during training for landmark prediction, making face tracking more accessible and cost-effective.
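
To illustrate how a Gaussian negative log-likelihood loss can account for landmark uncertainty, the following is a minimal sketch in plain Python. It is not the thesis implementation. It assumes each 2D landmark is modeled as an isotropic Gaussian whose log-variance is predicted alongside its position. The function names (`gaussian_nll_2d`, `mean_landmark_nll`) and this parameterization are illustrative assumptions, not details taken from the thesis.

```python
import math


def gaussian_nll_2d(pred_xy, log_var, target_xy):
    """Negative log-likelihood of a 2D target under N(pred_xy, exp(log_var) * I).

    Assumption: isotropic covariance, parameterized by log-variance so the
    variance stays positive without any clamping.
    """
    dx = target_xy[0] - pred_xy[0]
    dy = target_xy[1] - pred_xy[1]
    sq_dist = dx * dx + dy * dy
    # -log p(y) = log(2*pi) + log(sigma^2) + ||y - mu||^2 / (2 * sigma^2)
    return math.log(2.0 * math.pi) + log_var + sq_dist / (2.0 * math.exp(log_var))


def mean_landmark_nll(preds, log_vars, targets):
    """Average the per-landmark NLL over all landmarks of one face."""
    losses = [gaussian_nll_2d(p, s, t) for p, s, t in zip(preds, log_vars, targets)]
    return sum(losses) / len(losses)


# Hypothetical example: two landmarks, the second with a large error.
preds = [(10.0, 10.0), (20.0, 20.0)]
targets = [(10.0, 10.0), (23.0, 24.0)]
confident = mean_landmark_nll(preds, [0.0, 0.0], targets)
uncertain = mean_landmark_nll(preds, [0.0, math.log(25.0)], targets)
```

In this sketch, raising the predicted variance on the poorly localized landmark lowers the loss (`uncertain < confident`), while the log-variance term penalizes claiming high uncertainty everywhere. A downstream fitting step could then down-weight landmarks with large predicted variance.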


Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc degree in Computer Vision

Advisors: Dr. Hao Li, Dr. Martin Takac

Online access available for MBZUAI patrons