Computer Vision Faculty Publications

Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer

Omkar Thawakar, Mohamed bin Zayed University of Artificial Intelligence
Sanath Narayan, Inception Institute of Artificial Intelligence
Jiale Cao, Tianjin University, China
Hisham Cholakkal, Mohamed bin Zayed University of Artificial Intelligence
Rao Anwer, Mohamed bin Zayed University of Artificial Intelligence
Muhammad Haris Khan, Mohamed bin Zayed University of Artificial Intelligence
Salman Khan, Mohamed bin Zayed University of Artificial Intelligence
Michael Felsberg, Linköping University, Sweden
Fahad Shahbaz Khan, Mohamed bin Zayed University of Artificial Intelligence & Linköping University, Sweden

Document Type

Conference Proceeding

Publication Title

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Abstract

State-of-the-art transformer-based video instance segmentation (VIS) approaches typically utilize either single-scale spatio-temporal features or per-frame multi-scale features during the attention computations. We argue that such an attention computation ignores the multi-scale spatio-temporal feature relationships that are crucial to tackle target appearance deformations in videos. To address this issue, we propose a transformer-based VIS framework, named MS-STS VIS, that comprises a novel multi-scale spatio-temporal split (MS-STS) attention module in the encoder. The proposed MS-STS module effectively captures spatio-temporal feature relationships at multiple scales across frames in a video. We further introduce an attention block in the decoder to enhance the temporal consistency of the detected instances in different frames of a video. Moreover, an auxiliary discriminator is introduced during training to ensure better foreground-background separability within the multi-scale spatio-temporal feature space. We conduct extensive experiments on two benchmarks: YouTube-VIS 2019 and YouTube-VIS 2021. Our MS-STS VIS achieves state-of-the-art performance on both benchmarks. When using the ResNet50 backbone, MS-STS VIS achieves a mask AP of 50.1% on the YouTube-VIS 2019 validation set, outperforming the best reported results in the literature by 2.7% overall and by 4.8% at the higher overlap threshold of AP75, while being comparable in model size and speed. When using the Swin Transformer backbone, MS-STS VIS achieves a mask AP of 61.0% on the YouTube-VIS 2019 validation set. Source code is available at https://github.com/OmkarThawakar/MSSTS-VIS. © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.

First Page

666

Last Page

681

DOI

10.1007/978-3-031-19818-2_38

Publication Date

10-22-2022

Keywords

Attention computation, Multi-scale features, Multi-scales, Multiple scale, Spatio-temporal, Spatiotemporal feature, Split attentions, State of the art, Temporal consistency, YouTube

Comments

IR conditions: non-described

Recommended Citation

O. Thawakar et al., "Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer", in Computer Vision (ECCV 2022), Lecture Notes in Computer Science, vol. 13689, pp. 666-681, Oct. 2022, doi:10.1007/978-3-031-19818-2_38

Additional Links

Preprint available on arXiv: https://arxiv.org/abs/2203.13253
