Temporality-guided Masked Image Consistency for Domain Adaptive Video Segmentation

Document Type

Thesis

While supervised learning with annotated data has significantly improved video semantic segmentation, the potential of domain adaptive video segmentation remains underexplored. By adapting from a labeled source domain to an unlabeled target domain, this technique can overcome the limitations of data labeling. In this context, we propose Temporality-guided Masked Image Consistency (TgMIC), a simple and effective method that leverages the concept of Masked Image Modeling (MIM) to learn semantic features in the target domain. Unlike the random masking strategy used in traditional MIM, TgMIC introduces a novel temporality-guided masking strategy that samples the mask according to the distribution of optical flow, which facilitates the learning of spatial context relations in video sequences. Specifically, TgMIC masks the vision-transformer patches where the variance of optical flow is large, as these patches are known to contain noisy flow estimates. To learn semantic information for video segmentation, TgMIC reconstructs the predictions of the original frames from the predictions of the masked frames. Our method has been thoroughly evaluated through extensive experiments and ablation studies on multiple public datasets. The results demonstrate that our approach achieves superior performance compared to existing methods, without incurring any additional time cost.
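The temporality-guided masking described above might be sketched as follows. This is a minimal illustration, not the thesis implementation: the patch size, mask ratio, and the use of per-patch flow variance as the ranking score are assumptions filled in for demonstration, since the abstract does not specify these details.

```python
import numpy as np

def temporality_guided_mask(flow, patch_size=16, mask_ratio=0.5):
    """Build a binary patch mask from per-patch optical-flow variance.

    flow: (H, W, 2) optical-flow field between two consecutive frames.
    Patches whose flow variance is largest are masked (entry 0), on the
    assumption that high-variance regions carry noisy flow estimates.
    Returns a (H // patch_size, W // patch_size) array of {0, 1}.
    """
    H, W, _ = flow.shape
    ph, pw = H // patch_size, W // patch_size
    # Group pixels into non-overlapping patches and compute the variance
    # of the flow within each patch, pooled over both flow components.
    patches = flow[:ph * patch_size, :pw * patch_size].reshape(
        ph, patch_size, pw, patch_size, 2)
    var = patches.var(axis=(1, 3, 4))            # shape (ph, pw)
    n_mask = int(mask_ratio * ph * pw)
    # Indices of the n_mask patches with the largest flow variance.
    top = np.argsort(var.ravel())[::-1][:n_mask]
    mask = np.ones(ph * pw, dtype=np.int64)
    mask[top] = 0                                # 0 = masked patch
    return mask.reshape(ph, pw)
```

In a full pipeline, this mask would be applied to the target-domain frame before the segmentation network, and a consistency loss would pull the masked-frame predictions toward the unmasked-frame predictions.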

Publication Date



Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Dr. Abdulmotaleb Elsaddik, Dr. Huan Xiong

Online access available for MBZUAI patrons