Iterative Video Segmentation Framework and Benchmark Using Minimal User Annotations

Document Type



This study explores the feasibility of enhancing user interaction with video object segmentation (VOS), with the aim of creating an industry-driven environment. The primary objective is to examine baseline performance on common industry edge cases while developing a robust solution that saves time for people labeling frames from various types of video. Applications span multiple industries, including compositing and rotoscoping for special effects, realistic augmented reality, sports performance monitoring, and medical analysis for more accurate treatment. The need for a time-saving frame selector is emphasized, and modifications to the baseline XMem \cite{xmem} backbone are investigated. An attention-based frame selector is proposed to recommend the most suitable frames for labeling; it operates 27 times faster than experts and 32 times faster than non-experts. Potential future directions are suggested in light of the backbone modification results. Additionally, a benchmark of extreme object segmentation cases that reflect industry challenges is introduced, underscoring the need for model improvements to achieve robust VOS.

Publication Date



Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Dr. Hao Li, Dr. Bin Gu

Online access available for MBZUAI patrons