Improving Vision Transformers for Fine-Grained Recognition

Fine-grained visual classification (FGVC) is a challenging computer vision problem in which the task is to automatically recognise objects from subordinate categories. One of its main difficulties is capturing the most discriminative inter-class variances among visually similar sub-categories of the same meta-category. Most classical approaches rely on reusing complex backbones with additional resource-consuming techniques for arbitrary feature extraction. Recently, methods based on the Vision Transformer (ViT) have spread across different areas of computer vision and demonstrated noticeable achievements in FGVC, generally by employing the self-attention mechanism, together with additional resource-consuming techniques, to implicitly distinguish potentially discriminative regions while disregarding the rest. However, most existing attention-based methods not only significantly increase computational complexity and make the final architecture hard to reuse as a backbone, but also often struggle to focus effectively on the truly discriminative regions in fine-grained categories. Because such FGVC models rely only on the inherent self-attention mechanism, which was originally designed for traditional classification, the classification token is likely to aggregate global information from less important background patches. Moreover, because of the scarcity of data points, classifiers may fail to find the most helpful inter-class distinguishing features, since unrelated but distinctive background regions may be falsely recognised as valuable. To this end, a simple yet effective approach is introduced, named Salient Mask-Guided Vision Transformer (SM-ViT), in which the discriminability of the standard ViT’s attention maps is boosted through salient masking of potentially discriminative foreground regions.
More specifically, we establish a beneficial conjunction between attention and saliency without introducing extra trainable parameters: a saliency mask is integrated into the attention maps to guide the model to distinguish the foreground patches effectively, by adjusting the attention values towards the regions containing the main object, which most likely contain the most important features for fine-grained classification. Extensive experiments and qualitative analysis demonstrate that, with the standard training procedure, the proposed SM-ViT achieves state-of-the-art performance on popular FGVC benchmarks among existing ViT-based approaches while requiring fewer resources and a lower input image resolution.
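
The idea of steering attention towards salient patches without new trainable parameters can be illustrated with a minimal sketch. The abstract does not state the exact update rule used in SM-ViT, so everything below is an assumption made for illustration only: the function name `boost_salient_attention`, the fixed coefficient `boost`, and the rule itself (add a fraction of the peak attention value to foreground patches, then renormalise) are hypothetical and should not be read as the thesis's implementation.

```python
def boost_salient_attention(cls_attention, saliency_mask, boost=0.25):
    """Shift classification-token attention towards salient patches.

    cls_attention: attention weights from the classification token to each
                   patch (non-negative, summing to 1).
    saliency_mask: 1 for patches on the salient foreground, 0 otherwise.
    boost:         hypothetical fixed coefficient; it is a constant, not a
                   learned parameter, so no trainable weights are added.
    """
    if len(cls_attention) != len(saliency_mask):
        raise ValueError("attention and mask must cover the same patches")

    # Assumed rule: raise each foreground weight by a fraction of the
    # current maximum, leaving background weights unchanged.
    peak = max(cls_attention)
    adjusted = [a + boost * peak * m
                for a, m in zip(cls_attention, saliency_mask)]

    # Renormalise so the result is again a distribution over patches.
    total = sum(adjusted)
    return [a / total for a in adjusted]


# Four patches; the two middle ones lie on the salient object.
attention = [0.4, 0.1, 0.1, 0.4]
mask = [0, 1, 1, 0]
guided = boost_salient_attention(attention, mask)
```

In this toy case the foreground patches gain attention mass at the expense of the background patches, which is the qualitative effect the abstract describes. A real model would apply such an adjustment per head inside a transformer layer and would obtain the mask from a saliency detector.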

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc. degree in Computer Vision

Advisors: Dr. Fahad Khan, Dr. Salman Khan

With a 2-year embargo period

This document is currently not available here.