Computer Vision Faculty Publications

Learning Disentanglement with Decoupled Labels for Vision-Language Navigation

Wenhao Cheng, Beijing Institute of Technology
Xingping Dong, Inception Institute of Artificial Intelligence
Salman Khan, Mohamed Bin Zayed University of Artificial Intelligence
Jianbing Shen, University of Macau

Document Type

Conference Proceeding

Publication Title

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Abstract

Vision-and-Language Navigation (VLN) requires an agent to follow complex natural language instructions and perceive the visual environment for real-world navigation. Intuitively, we find that instruction disentanglement for each viewpoint along the agent’s path is critical for accurate navigation. However, most methods use only the whole complex instruction or inaccurate sub-instructions, because they lack accurate disentanglement as an intermediate supervision stage. To address this problem, we propose a new Disentanglement framework with Decoupled Labels (DDL) for VLN. First, we manually extend the benchmark dataset Room-to-Room (R2R) with landmark- and action-aware labels to provide fine-grained information for each viewpoint. Second, to enhance generalization, we propose a Decoupled Label Speaker module that generates pseudo-labels for augmented data and reinforcement training. To make full use of the proposed fine-grained labels, we design a Disentangled Decoding Module that guides discriminative feature extraction and aids multi-modal alignment. To demonstrate the generality of the proposed method, we apply it to an LSTM-based model and two recent Transformer-based models. Extensive experiments on two VLN benchmarks (i.e., R2R and R4R) demonstrate the effectiveness of our approach, which achieves better performance than previous state-of-the-art methods.

First Page

309

Last Page

329

DOI

10.1007/978-3-031-20059-5_18

Publication Date

10-29-2022

Keywords

Disentanglement, Imitation/Reinforcement learning, LSTM and Transformer, Modular network, Vision-and-Language Navigation

Comments

IR conditions: non-described

Recommended Citation

W. Cheng, X. Dong, S. Khan, and J. Shen, "Learning Disentanglement with Decoupled Labels for Vision-Language Navigation," in Computer Vision – ECCV 2022, Lecture Notes in Computer Science, Oct. 2022, vol. 13696, pp. 309-329, doi:10.1007/978-3-031-20059-5_18

