Student Publications

Improving Vision Transformers for Fine-Grained Recognition

Dmitry Demidov, Mohamed bin Zayed University of Artificial Intelligence

Document Type

Dissertation

Abstract

Fine-grained visual classification (FGVC) is a challenging computer vision problem in which the task is to automatically recognise objects from subordinate categories. One of its main difficulties is capturing the most discriminative inter-class variances among visually similar sub-categories of the same meta-category. Most classical approaches rely on reusing complex backbones with additional resource-consuming techniques for arbitrary feature extraction. Recently, methods based on the Vision Transformer (ViT) have pervaded different areas of computer vision and demonstrated noticeable achievements in FGVC, generally by employing the self-attention mechanism with additional resource-consuming techniques to implicitly distinguish potentially discriminative regions while disregarding the rest. However, most existing attention-based methods not only significantly increase computational complexity and make the final architecture hard to reuse as a backbone, but also often struggle to focus effectively on truly discriminative regions in fine-grained categories. Because such FGVC models rely only on the inherent self-attention mechanism, which was originally designed for traditional classification, the classification token is likely to aggregate global information from less important background patches. Moreover, because data points are scarce, classifiers may fail to find the most helpful inter-class distinguishing features, since unrelated but distinctive background regions may be falsely recognised as valuable. To this end, a simple yet effective approach is introduced, named Salient Mask-Guided Vision Transformer (SM-ViT), in which the discriminability of the standard ViT’s attention maps is boosted through salient masking of potentially discriminative foreground regions.
More specifically, we establish a beneficial conjunction between attention and saliency without introducing extra trainable parameters: a saliency mask is integrated into the attention maps to guide the model to effectively distinguish the foreground patches by adjusting the attention values towards the regions containing the main object, which most likely contains the most important features for fine-grained classification. Extensive experiments and qualitative analysis demonstrate that, with the standard training procedure, the proposed SM-ViT achieves state-of-the-art performance on popular FGVC benchmarks among existing ViT-based approaches while requiring fewer resources and a lower input image resolution.
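As a rough illustration of the general idea of steering attention with a saliency mask, the sketch below adds a fixed, non-trainable bonus to the pre-softmax scores that the classification token assigns to foreground patches. This is not the thesis's actual formulation, which the abstract does not specify: the function name saliency_guided_cls_attention, the boost parameter, and the additive rule are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def saliency_guided_cls_attention(cls_logits, saliency_mask, boost=1.0):
    """Hypothetical sketch, not the SM-ViT rule.

    cls_logits:    (heads, num_patches) pre-softmax scores of the
                   classification token against each image patch.
    saliency_mask: (num_patches,) binary mask, 1 = foreground patch.
    boost:         assumed fixed scalar, so no trainable parameters are added.
    """
    mask = np.asarray(saliency_mask, dtype=cls_logits.dtype)
    # Raise the scores of foreground patches, then renormalise.
    guided = cls_logits + boost * mask
    return softmax(guided, axis=-1)

# Toy example: 2 heads, 6 patches, 3 of which are marked as foreground.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 6))
mask = np.array([0, 1, 1, 0, 0, 1])

plain = softmax(logits)
guided = saliency_guided_cls_attention(logits, mask, boost=1.0)

# Share of attention that lands on foreground patches, per head.
print("foreground mass (plain): ", plain[:, mask == 1].sum(axis=-1))
print("foreground mass (guided):", guided[:, mask == 1].sum(axis=-1))
```

With any positive boost and a mask that marks some but not all patches, the share of attention on foreground patches grows in every head while each row still sums to one.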

Publication Date

12-1-2022

Comments

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc. degree in Computer Vision

Advisors: Dr. Fahad Khan, Dr. Salman Khan

With a 2-year embargo period

Recommended Citation

D. Demidov, "Improving Vision Transformers for Fine-Grained Recognition", M.S. Thesis, Computer Vision, MBZUAI, Abu Dhabi, UAE, 2022.

This document is currently not available here.
