Fast Video Instance Segmentation via Recurrent Encoder-Based Transformers
Document Type
Conference Proceeding
Publication Title
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Abstract
State-of-the-art transformer-based video instance segmentation (VIS) frameworks typically utilize attention-based encoders to compute multi-scale spatio-temporal features that capture target appearance deformations. However, this attention computation is expensive, hampering inference speed. In this work, we introduce a VIS framework that employs a lightweight recurrent-CNN encoder, which learns multi-scale spatio-temporal features from standard attention encoders through knowledge distillation. The lightweight recurrent encoder effectively learns multi-scale spatio-temporal features and achieves improved VIS performance by reducing over-fitting while increasing inference speed. Our extensive experiments on the popular YouTube-VIS 2019 benchmark demonstrate the merits of the proposed framework over the baseline. Compared to the recent SeqFormer, our proposed Recurrent SeqFormer doubles the inference speed while also improving VIS performance from 45.1% to 45.8% in terms of overall average precision. Our code and models are available at https://github.com/OmkarThawakar/Recurrent-Seqformer
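The distillation idea in the abstract can be illustrated with a minimal sketch: a ConvGRU-style recurrent encoder processes per-frame features, and its outputs are matched to those of a frozen attention-based teacher encoder with an L2 feature loss. This is an illustrative assumption, not the authors' released implementation; the module names (ConvGRUCell, RecurrentEncoder, distillation_loss), channel counts, and the exact loss formulation are all hypothetical.

    # Minimal sketch (assumed, not the paper's code) of feature-level
    # knowledge distillation from a frozen attention encoder (teacher)
    # to a lightweight recurrent-CNN encoder (student).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConvGRUCell(nn.Module):
        """Convolutional GRU cell propagating spatio-temporal state across frames."""
        def __init__(self, channels: int, kernel_size: int = 3):
            super().__init__()
            padding = kernel_size // 2
            # Update (z) and reset (r) gates computed from [input, hidden]
            self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=padding)
            # Candidate hidden state
            self.cand = nn.Conv2d(2 * channels, channels, kernel_size, padding=padding)

        def forward(self, x, h):
            z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
            h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
            return (1 - z) * h + z * h_tilde

    class RecurrentEncoder(nn.Module):
        """Lightweight recurrent student encoder: one ConvGRU step per frame."""
        def __init__(self, channels: int = 256):
            super().__init__()
            self.cell = ConvGRUCell(channels)

        def forward(self, frames):  # frames: (T, B, C, H, W)
            h = torch.zeros_like(frames[0])
            outputs = []
            for x in frames:              # recur over the temporal axis
                h = self.cell(x, h)
                outputs.append(h)
            return torch.stack(outputs)   # (T, B, C, H, W)

    def distillation_loss(student_feats, teacher_feats):
        """L2 feature-matching loss against the frozen attention encoder."""
        return F.mse_loss(student_feats, teacher_feats.detach())

    # Toy usage: distill teacher features for a 5-frame clip.
    T, B, C, H, W = 5, 2, 256, 32, 32
    frames = torch.randn(T, B, C, H, W)          # per-frame backbone features
    teacher_feats = torch.randn(T, B, C, H, W)   # frozen attention-encoder output
    student = RecurrentEncoder(C)
    loss = distillation_loss(student(frames), teacher_feats)
    loss.backward()

Because the recurrence replaces quadratic spatio-temporal attention with a constant-cost per-frame update, such a student can run substantially faster at inference, which is the speed-up the abstract reports.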
First Page
262
Last Page
272
DOI
10.1007/978-3-031-44237-7_25
Publication Date
September 20, 2023
Keywords
detection, recurrent neural networks, segmentation, video instance segmentation
Recommended Citation
O. Thawakar et al., "Fast Video Instance Segmentation via Recurrent Encoder-Based Transformers," Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 14184 LNCS, pp. 262-272, Sep. 2023.
The definitive version is available at https://doi.org/10.1007/978-3-031-44237-7_25