Computer Vision Faculty Publications

Class-Agnostic Object Detection with Multi-modal Transformer

Muhammad Maaz, Mohamed Bin Zayed University of Artificial Intelligence
Hanoona Rasheed, Mohamed Bin Zayed University of Artificial Intelligence
Salman Khan, Mohamed Bin Zayed University of Artificial Intelligence
Fahad Shahbaz Khan, Mohamed Bin Zayed University of Artificial Intelligence
Rao Anwer, Mohamed Bin Zayed University of Artificial Intelligence
Ming-Hsuan Yang, UC Merced

Document Type

Conference Proceeding

Publication Title

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Abstract

What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and novel objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. For the first time in literature, we demonstrate that Multi-modal Vision Transformers (MViT) trained with aligned image-text pairs can effectively bridge this gap. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs to localize generic objects in images. Based on the observation that existing MViTs do not include multi-scale feature processing and usually require longer training schedules, we develop an efficient MViT architecture using multi-scale deformable attention and late vision-language fusion. We show the significance of MViT proposals in a diverse range of applications including open-world object detection, salient and camouflage object detection, supervised and self-supervised detection tasks. Further, MViTs can adaptively generate proposals given a specific language query and thus offer enhanced interactability. Code: https://git.io/J1HPY.

First Page

512

Last Page

531

DOI

10.1007/978-3-031-20080-9_30

Publication Date

11-3-2022

Keywords

Class-agnostic, Object detection, Vision transformers

Comments

IR conditions: non-described

Recommended Citation

M. Maaz, H. Rasheed, S. Khan, F.S. Khan, R.M. Anwer and M.H. Yang. Class-Agnostic Object Detection with Multi-modal Transformer, in Computer Vision (ECCV 2022), Lecture Notes in Computer Science, Nov 2022, vol. 13670, pp. 512-531, doi: 10.1007/978-3-031-20080-9_30

