Improving Vision Transformers for Remote Sensing

Remote sensing (RS) research on aerial image interpretation has been transformed by deep learning. Nonetheless, most current deep models are initialized with weights pretrained on ImageNet. Because natural images differ substantially in domain from aerial images, fine-tuning performance on downstream aerial scene tasks is likely to be limited. This challenge motivates us to investigate remote sensing pretraining (RSP) on aerial imagery. Recently, vision transformers (ViTs) have demonstrated promising performance on a variety of computer vision problems, such as image classification and object localization. In the context of remote sensing classification, a few recent works have explored vision transformers for remote sensing pretraining. However, these approaches typically operate on raw RGB pixel values. Given that remote sensing images are rich in texture content, an intriguing question is whether an explicit texture representation can further improve the performance of vision transformers for remote sensing pretraining. In this thesis, we investigate this research problem and introduce a vision transformer architecture that is built on texture-coded mapped images along with the standard RGB pixel values. We then evaluate the proposed vision transformer-based architecture for large-scale remote sensing pretraining on the MillionAID dataset. Our extensive quantitative and qualitative experiments demonstrate that the proposed architecture performs favorably against its standard vision transformer counterpart.
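
As a rough illustration of what pairing a texture-coded map with RGB values could look like at the input stage, the sketch below computes a per-pixel texture code, stacks it onto the RGB channels, and cuts the result into flattened ViT-style patches. The abstract names neither the texture descriptor nor the fusion scheme, so everything here is an assumption made for illustration: the 8-neighbour local binary pattern (LBP), the channel-wise stacking, the patch size, and the function names `lbp_map`, `rgb_plus_texture`, and `patchify` are all hypothetical and should not be read as the thesis's actual design.

```python
import numpy as np


def lbp_map(gray: np.ndarray) -> np.ndarray:
    """Return an 8-neighbour LBP code (0..255) per pixel. LBP is an assumed descriptor."""
    h, w = gray.shape
    padded = np.pad(gray, 1, mode="edge")  # replicate borders so every pixel has 8 neighbours
    # Neighbour offsets, clockwise from top-left; each one contributes one bit of the code.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros((h, w), dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = padded[1 + dy : 1 + dy + h, 1 + dx : 1 + dx + w]
        # Set this bit wherever the neighbour is at least as bright as the centre pixel.
        codes |= (neighbour >= gray).astype(np.uint8) << bit
    return codes


def rgb_plus_texture(rgb: np.ndarray) -> np.ndarray:
    """Stack a normalised texture map onto normalised RGB, giving an (H, W, 4) array."""
    gray = rgb.astype(np.float32).mean(axis=2)
    texture = lbp_map(gray).astype(np.float32) / 255.0
    return np.concatenate([rgb.astype(np.float32) / 255.0, texture[..., None]], axis=2)


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patch x patch tokens."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be a multiple of the patch size"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    aerial = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)  # stand-in for an aerial image
    tokens = patchify(rgb_plus_texture(aerial), patch=16)
    print(tokens.shape)  # one row per patch, ready for a linear patch embedding
```

Stacking the texture map as a fourth channel is only the simplest possible fusion; a separate patch embedding for the texture input would be an equally plausible reading of the abstract.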

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc. degree in Computer Vision

Advisors: Dr. Fahad Khan, Dr. Rao Anwer

Online access for MBZUAI patrons