Computer Vision Faculty Publications

Intriguing properties of vision transformers

Muzammal Naseer, Australian National University & Mohamed bin Zayed University of Artificial Intelligence
Kanchana Ranasinghe, Mohamed bin Zayed University of Artificial Intelligence & Stony Brook University
Salman Khan, Australian National University & Mohamed bin Zayed University of Artificial Intelligence
Munawar Hayat, Monash University
Fahad Shahbaz Khan, Mohamed bin Zayed University of Artificial Intelligence & Linköping University
Ming-Hsuan Yang, University of California & Yonsei University & Google Research

Document Type

Article

Publication Title

arXiv

Abstract

Vision transformers (ViT) have demonstrated impressive performance across numerous machine vision tasks. These models are based on multi-head self-attention mechanisms that can flexibly attend to a sequence of image patches to encode contextual cues. An important question is how such flexibility (in attending to image-wide context conditioned on a given patch) can facilitate handling nuisances in natural images, e.g., severe occlusions, domain shifts, spatial permutations, and adversarial and natural perturbations. We systematically study this question via an extensive set of experiments encompassing three ViT families and provide comparisons with a high-performing convolutional neural network (CNN). We show and analyze the following intriguing properties of ViT: (a) Transformers are highly robust to severe occlusions, perturbations and domain shifts, e.g., they retain as much as 60% top-1 accuracy on ImageNet even after 80% of the image content is randomly occluded. (b) The robustness towards occlusions is not due to texture bias; instead, we show that ViTs are significantly less biased towards local textures than CNNs. When properly trained to encode shape-based features, ViTs demonstrate shape recognition capability comparable to that of the human visual system, previously unmatched in the literature. (c) Using ViTs to encode shape representation leads to an interesting consequence: accurate semantic segmentation without pixel-level supervision. (d) Off-the-shelf features from a single ViT model can be combined to create a feature ensemble, leading to high accuracy rates across a range of classification datasets in both traditional and few-shot learning paradigms. We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by self-attention mechanisms. Code: https://git.io/Js15X. Copyright © 2021, The Authors. All rights reserved.
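
To make the occlusion setting in property (a) concrete, the following is a minimal sketch of randomly occluding a fraction of non-overlapping image patches. It is not taken from the authors' code: the function name `random_patch_drop`, the 16-pixel patch size, the zero fill value, and the plain list-of-rows image format are all assumptions chosen for illustration.

```python
import random


def random_patch_drop(image, patch_size, drop_ratio, seed=None):
    """Zero out a random subset of non-overlapping square patches.

    image: list of rows, each a list of numbers (hypothetical format).
    Returns (occluded_copy, sorted_indices_of_dropped_patches).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")

    rows, cols = height // patch_size, width // patch_size
    num_patches = rows * cols
    num_drop = int(round(drop_ratio * num_patches))

    # Choose which patches to occlude, without replacement.
    rng = random.Random(seed)
    dropped = sorted(rng.sample(range(num_patches), num_drop))

    # Work on a copy so the input image is left untouched.
    occluded = [list(row) for row in image]
    for idx in dropped:
        top = (idx // cols) * patch_size
        left = (idx % cols) * patch_size
        for y in range(top, top + patch_size):
            for x in range(left, left + patch_size):
                occluded[y][x] = 0  # assumed fill value for an occluded pixel
    return occluded, dropped


# Usage: occlude 80% of the patches of a 224x224 single-channel image.
image = [[1] * 224 for _ in range(224)]
occluded, dropped = random_patch_drop(image, patch_size=16, drop_ratio=0.8, seed=0)
```

A classifier's top-1 accuracy on such occluded inputs, compared with its accuracy on the originals, would then give the kind of robustness figure the abstract reports.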

DOI

doi.org/10.48550/arXiv.2105.10497

Publication Date

5-21-2021

Keywords

Classification (of information), Convolutional neural networks, Semantic Segmentation, Semantics, Textures, Attention mechanisms, Contextual cue, Image patches, Machine-vision, Natural images, Performance, Property, Sequence of images, Severe occlusions, Spatial permutations, Encoding (symbols), Artificial Intelligence (cs.AI), Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG)

Comments

Preprint: arXiv

Recommended Citation

M. Naseer, K. Ranasinghe, S. Khan, M. Hayat, F.S. Khan, and M.-H. Yang, "Intriguing properties of vision transformers", 2021, arXiv:2105.10497
