C4AV: learning cross-modal representations from transformers
Document Type
Conference Proceeding
Publication Title
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Abstract
In this paper, we focus on the object referral problem in the autonomous driving setting. We propose a novel framework to learn cross-modal representations from transformers. To extract linguistic features, we feed the input command into a transformer encoder. In parallel, a ResNet backbone extracts image features, which are flattened and used as the query inputs to a transformer decoder, where the image features and linguistic features are aggregated. Region-of-interest (RoI) alignment is then applied to the feature map output by the transformer decoder to crop RoI features for the region proposals. Finally, a multi-layer classifier performs object referral from the features of the proposal regions.
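The pipeline described in the abstract can be sketched in PyTorch as below. This is a minimal illustrative sketch, not the authors' released implementation; all module names, dimensions, layer counts, and the proposal format are assumptions chosen for clarity.

# Minimal PyTorch sketch of the abstract's pipeline. All hyperparameters
# and names below are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torchvision

class C4AVSketch(nn.Module):
    def __init__(self, vocab_size=10000, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        # ResNet backbone for image feature learning (conv5 output, stride 32).
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)
        # Transformer encoder extracts linguistic features from the command.
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Transformer decoder: flattened image features act as queries and
        # attend to the linguistic features, aggregating both modalities.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        # RoI alignment crops proposal features from the decoded feature map.
        self.roi_align = torchvision.ops.RoIAlign(
            output_size=7, spatial_scale=1.0 / 32, sampling_ratio=2)
        # Multi-layer classifier scores each region proposal for referral.
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(d_model * 7 * 7, 512),
            nn.ReLU(), nn.Linear(512, 1))

    def forward(self, image, command_tokens, proposals):
        # image: (B, 3, H, W); command_tokens: (B, T) token ids;
        # proposals: (N, 5) boxes as (batch_idx, x1, y1, x2, y2) in pixels.
        fmap = self.proj(self.backbone(image))                 # (B, d, h, w)
        B, d, h, w = fmap.shape
        queries = fmap.flatten(2).transpose(1, 2)              # (B, h*w, d)
        text = self.text_encoder(self.embed(command_tokens))   # (B, T, d)
        decoded = self.decoder(queries, text)                  # (B, h*w, d)
        decoded_map = decoded.transpose(1, 2).reshape(B, d, h, w)
        roi_feats = self.roi_align(decoded_map, proposals)     # (N, d, 7, 7)
        return self.classifier(roi_feats).squeeze(-1)          # score per proposal

During training, each proposal's score would be supervised against whether it matches the referred object; how the region proposals themselves are generated (e.g., by a pretrained detector) is outside the scope of this sketch.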
First Page
33
Last Page
38
DOI
10.1007/978-3-030-66096-3_3
Publication Date
1-3-2021
Keywords
Cross-modal representations, Object referral
Recommended Citation
S. Luo, H. Dai, L. Shao, and Y. Ding, "C4AV: Learning cross-modal representations from transformers," in Computer Vision – ECCV 2020 Workshops, Lecture Notes in Computer Science, vol. 12536, pp. 33–38, 2020. doi: 10.1007/978-3-030-66096-3_3