C4AV: learning cross-modal representations from transformers
Document Type
Conference Proceeding
Publication Title
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Abstract
In this paper, we focus on the object referral problem in the autonomous driving setting. We propose a novel framework to learn cross-modal representations from transformers. To extract linguistic features, we feed the input command into a transformer encoder. In parallel, a ResNet backbone extracts image features, which are flattened and used as the query inputs to a transformer decoder, where the image features and linguistic features are aggregated. Region-of-interest (RoI) alignment is then applied to the feature map output by the transformer decoder to crop RoI features for the region proposals. Finally, a multi-layer classifier performs object referral from the features of the proposal regions.
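The pipeline described in the abstract can be sketched in PyTorch as below. This is a minimal illustrative sketch, not the authors' released implementation; all module names, dimensions, layer counts, and the proposal format are assumptions chosen for clarity.

# Minimal PyTorch sketch of the abstract's pipeline. All hyperparameters
# and names below are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torchvision

class C4AVSketch(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        # ResNet backbone for image feature learning (conv5 output, stride 32).
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)
        # Transformer encoder extracts linguistic features from the command.
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Transformer decoder: flattened image features act as queries and
        # attend to the linguistic features, aggregating both modalities.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        # RoI alignment crops proposal features from the decoded feature map.
        self.roi_align = torchvision.ops.RoIAlign(
            output_size=7, spatial_scale=1.0 / 32, sampling_ratio=2)
        # Multi-layer classifier scores each region proposal for referral.
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(d_model * 7 * 7, 512),
            nn.ReLU(), nn.Linear(512, 1))

    def forward(self, image, command_tokens, proposals):
        # image: (B, 3, H, W); command_tokens: (B, T) token ids;
        # proposals: (N, 5) boxes as (batch_idx, x1, y1, x2, y2) in pixels.
        fmap = self.proj(self.backbone(image))                 # (B, d, h, w)
        B, d, h, w = fmap.shape
        queries = fmap.flatten(2).transpose(1, 2)              # (B, h*w, d)
        text = self.text_encoder(self.embed(command_tokens))   # (B, T, d)
        decoded = self.decoder(queries, text)                  # (B, h*w, d)
        decoded_map = decoded.transpose(1, 2).reshape(B, d, h, w)
        roi_feats = self.roi_align(decoded_map, proposals)     # (N, d, 7, 7)
        return self.classifier(roi_feats).squeeze(-1)          # score per proposal

During training, each proposal's score would be supervised against whether it matches the referred object; how the region proposals themselves are generated (e.g., by a pretrained detector) is outside the scope of this sketch.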
First Page
33
Last Page
38
DOI
10.1007/978-3-030-66096-3_3
Publication Date
1-3-2021
Keywords
Cross-modal representations, Object referral
Recommended Citation
S. Luo, H. Dai, L. Shao, and Y. Ding, "C4AV: Learning cross-modal representations from transformers," in Computer Vision – ECCV 2020 Workshops, Lecture Notes in Computer Science, vol. 12536, pp. 33–38, 2020. doi: 10.1007/978-3-030-66096-3_3