Computer Vision Faculty Publications

VST++: Efficient and Stronger Visual Saliency Transformer

Nian Liu, Mohamed Bin Zayed University of Artificial Intelligence
Ziyang Luo, Northwestern Polytechnical University
Ni Zhang, Northwestern Polytechnical University
Junwei Han, Northwestern Polytechnical University

Document Type

Article

Publication Title

IEEE Transactions on Pattern Analysis and Machine Intelligence

Abstract

While previous CNN-based models have exhibited promising results for salient object detection (SOD), their ability to explore global long-range dependencies is restricted. Our previous work, the Visual Saliency Transformer (VST), addressed this constraint from a transformer-based sequence-to-sequence perspective, to unify RGB and RGB-D SOD. In VST, we developed a multi-task transformer decoder that concurrently predicts saliency and boundary outcomes in a pure transformer architecture. Moreover, we introduced a novel token upsampling method called reverse T2T for predicting a high-resolution saliency map effortlessly within transformer-based structures. Building upon the VST model, we further propose an efficient and stronger VST version in this work, i.e. VST++. To mitigate the computational costs of the VST model, we propose a Select-Integrate Attention (SIA) module, partitioning foreground into fine-grained segments and aggregating background information into a single coarse-grained token. To incorporate 3D depth information with low cost, we design a novel depth position encoding method tailored for depth maps. Furthermore, we introduce a token-supervised prediction loss to provide straightforward guidance for the task-related tokens. We evaluate our VST++ model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets. Experimental results show that our model outperforms existing methods while achieving a 25% reduction in computational costs without significant performance compromise. The demonstrated strong ability for generalization, enhanced performance, and heightened efficiency of our VST++ model highlight its potential.

DOI

10.1109/TPAMI.2024.3388153

Publication Date

1-1-2024

Keywords

Computational modeling, Computer architecture, Decoding, Feature extraction, Multi-task learning, Multitasking, RGB-D saliency detection, RGB-T saliency detection, saliency detection, Task analysis, transformer, Transformers

Recommended Citation

N. Liu et al., "VST++: Efficient and Stronger Visual Saliency Transformer," IEEE Transactions on Pattern Analysis and Machine Intelligence, Jan 2024.

The definitive version is available at https://doi.org/10.1109/TPAMI.2024.3388153

This document is currently not available here.
