Fast Video Instance Segmentation via Recurrent Encoder-Based Transformers

Document Type

Conference Proceeding

Publication Title

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Abstract

State-of-the-art transformer-based video instance segmentation (VIS) frameworks typically utilize attention-based encoders to compute multi-scale spatio-temporal features that capture target appearance deformations. However, this attention computation is expensive and hampers inference speed. In this work, we introduce a VIS framework that utilizes a lightweight recurrent-CNN encoder, which learns multi-scale spatio-temporal features from standard attention encoders through knowledge distillation. The lightweight recurrent encoder effectively learns multi-scale spatio-temporal features and improves VIS performance by reducing over-fitting while also increasing inference speed. Our extensive experiments on the popular YouTube-VIS 2019 benchmark demonstrate the merits of the proposed framework over the baseline. Compared to the recent SeqFormer, our proposed Recurrent SeqFormer doubles the inference speed while also improving VIS performance from 45.1% to 45.8% in terms of overall average precision. Our code and models are available at https://github.com/OmkarThawakar/Recurrent-Seqformer.
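
The sketch below is a minimal, hypothetical illustration of the distillation idea summarized in the abstract: a small recurrent-CNN student encoder is trained to match spatio-temporal features produced by a frozen attention-based teacher encoder. The class name RecurrentCNNEncoder, the single-scale feature shapes, and the plain L2 feature-matching objective are assumptions made for illustration and are not taken from the authors' implementation, which operates on multi-scale features.

```python
# Hypothetical sketch only; not the authors' implementation.
# Student: a toy recurrent-CNN encoder that propagates a hidden state across frames.
# Teacher: stand-in features representing a frozen attention-based encoder's output.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecurrentCNNEncoder(nn.Module):
    """Toy recurrent-CNN encoder: a convolutional recurrent update applied per frame."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # Mixes the current frame's features with the hidden state from previous frames.
        self.update = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (T, C, H, W) per-frame backbone features for one clip.
        hidden = torch.zeros_like(clip_feats[:1])      # (1, C, H, W)
        outputs = []
        for frame in clip_feats.split(1, dim=0):       # each frame: (1, C, H, W)
            hidden = torch.tanh(self.update(torch.cat([frame, hidden], dim=1)))
            outputs.append(hidden)
        return torch.cat(outputs, dim=0)               # (T, C, H, W)


def feature_distillation_loss(student_feats: torch.Tensor,
                              teacher_feats: torch.Tensor) -> torch.Tensor:
    # Simple L2 feature matching; the paper's actual distillation objective may differ.
    return F.mse_loss(student_feats, teacher_feats)


if __name__ == "__main__":
    T, C, H, W = 4, 256, 32, 32
    clip_feats = torch.randn(T, C, H, W)         # per-frame backbone features
    with torch.no_grad():
        teacher_feats = torch.randn(T, C, H, W)  # stand-in for the frozen teacher's output
    student = RecurrentCNNEncoder(C)
    loss = feature_distillation_loss(student(clip_feats), teacher_feats)
    loss.backward()                              # gradients flow only into the student
    print(f"distillation loss: {loss.item():.4f}")
```

At inference time only the student would be kept, which is the source of the reported speed-up: the per-frame recurrent update avoids the attention computation of the teacher encoder.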

First Page

262

Last Page

272

DOI

10.1007/978-3-031-44237-7_25

Publication Date

9-20-2023

Keywords

detection, recurrent neural networks, segmentation, video instance segmentation
