Improved visual-semantic alignment for zero-shot object detection

Shafin Rahman, The Australian National University
Salman Khan, The Australian National University
Nick Barnes, The Australian National University


Zero-shot object detection is an emerging research topic that aims to recognize and localize previously 'unseen' objects. This setting gives rise to several unique challenges, e.g., highly imbalanced positive vs. negative instance ratio, proper alignment between visual and semantic concepts and the ambiguity between background and unseen classes. Here, we propose an end-to-end deep learning framework underpinned by a novel loss function that handles class-imbalance and seeks to properly align the visual and semantic cues for improved zero-shot learning. We call our objective the 'Polarity loss' because it explicitly maximizes the gap between positive and negative predictions. Such a margin maximizing formulation is not only important for visual-semantic alignment but it also resolves the ambiguity between background and unseen objects. Further, the semantic representations of objects are noisy, thus complicating the alignment between visual and semantic domains. To this end, we perform metric learning using a 'Semantic vocabulary' of related concepts that refines the noisy semantic embeddings and establishes a better synergy between visual and semantic domains. Our approach is inspired by the embodiment theories in cognitive science, that claim human semantic understanding to be grounded in past experiences (seen objects), related linguistic concepts (word vocabulary) and the visual perception (seen/unseen object images). Our extensive results on MS-COCO and Pascal VOC datasets show significant improvements over state of the art.