Open-Vocabulary Object Detection and its Application in Robotic Navigation

Date of Award


Document Type


Degree Name

Master of Science in Computer Vision


Computer Vision

First Advisor

Dr. Ian Reid

Second Advisor

Dr. Shijian Lu


This thesis explores the role of open-vocabulary object detection in enhancing robotic navigation. The first phase of our research addresses a fundamental challenge in open-vocabulary object detection: training detectors that can recognize a wide range of novel classes without direct supervision. Traditional self-training approaches often rely on image-level weak supervision to generate pseudo object boxes for training, which yields noisy, base-class-biased pseudo boxes and diminishes the detectors' effectiveness. To counter this, we introduce Debiased Curriculum Self-Training (DCS), a technique that refines the generation of pseudo object boxes through progressive pseudo-label filtering (PPF) and adaptive pseudo-label selection (APS). PPF systematically eliminates mismatched detections early in training, when the detector's bias toward base classes is most pronounced, while APS merges class-aware and class-agnostic pseudo-labeling methods, giving precedence to class-aware labeling as the detector's ability to detect novel classes matures. Without resorting to complex mechanisms, DCS markedly improves detection performance across multiple open-vocabulary benchmarks. In the second phase, we develop a mapping method for robotic visual navigation, a fundamental step that enables an agent to comprehend its environment. Preferring the less resource-intensive topological mapping over metric mapping, we go beyond the conventional image-as-node approach by constructing an object-level map from image segments. This technique refines the map's granularity and enhances its interconnectivity. To improve the map's semantic clarity and let the agent navigate using more reliable landmarks, we incorporate the DCS model to supplement the map's semantic information and design a novel planning strategy that considers the semantic difference between nodes. Our experimental results in a simulator indicate superior navigation outcomes with this mapping method. This integration not only demonstrates the practical applicability of our initial research but also paves the way for our future research on robotic navigation in dynamic settings.
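
The curriculum idea behind PPF and APS can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract gives no formulas, so the linear schedule, the score thresholds, the use of image-level tags to define a "mismatched" detection, and every name below (PseudoBox, curriculum_progress, filter_mismatched, select_source) are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PseudoBox:
    """Hypothetical pseudo-label record (not a structure from the thesis)."""
    label: str             # class name assigned to the box
    score: float           # detector confidence in [0, 1]
    image_tags: frozenset  # image-level (weak) labels of the source image


def curriculum_progress(step: int, total_steps: int) -> float:
    """Assumed linear curriculum: 0.0 at the start of training, 1.0 at the end."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


def filter_mismatched(boxes, step, total_steps, start_thr=0.9, end_thr=0.5):
    """PPF-style filter (assumed form).

    Drops boxes whose label is absent from the image-level tags, and applies
    a score threshold that is strictest early in training and relaxes later.
    """
    progress = curriculum_progress(step, total_steps)
    threshold = start_thr + (end_thr - start_thr) * progress
    return [
        box for box in boxes
        if box.label in box.image_tags and box.score >= threshold
    ]


def select_source(aware_score, agnostic_score, step, total_steps):
    """APS-style selection (assumed form).

    Weighs the class-aware score more heavily as training progresses, so the
    class-agnostic labeler dominates early and the class-aware one later.
    """
    progress = curriculum_progress(step, total_steps)
    if progress * aware_score >= (1.0 - progress) * agnostic_score:
        return "class_aware"
    return "class_agnostic"


if __name__ == "__main__":
    candidates = [
        PseudoBox("cat", 0.7, frozenset({"cat"})),
        PseudoBox("dog", 0.95, frozenset({"cat"})),  # label not in image tags
    ]
    for step in (0, 100):
        kept = filter_mismatched(candidates, step, total_steps=100)
        source = select_source(0.6, 0.6, step, total_steps=100)
        print(step, [box.label for box in kept], source)
```

In this sketch a single progress value drives both components, which is only one of several plausible ways to couple the filtering and selection schedules.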


Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfilment of the requirements for the M.Sc. degree in Computer Vision

Advisors: Ian Reid, Shijian Lu

Online access available for MBZUAI patrons