A Unified Model for Face Matching and Presentation Attack Detection using an Ensemble of Vision Transformer Features

Real-world computer vision systems typically comprise multiple modules, each of which requires a different kind of feature representation. For example, an automated face recognition system is composed of components such as face detection and alignment, face presentation attack detection (FPAD), and face matching, which are often treated as standalone problems. Various feature-based and deep neural network-based solutions have been proposed over the years to address each component of the face recognition system individually. Recently, vision transformers (ViT) have shown the ability to extract a diverse set of feature representations, and an ensemble of these features could potentially handle multiple related tasks. This raises the tantalizing possibility of building an end-to-end face recognition system using a single ViT model. As a first step towards realizing this goal, this work attempts to perform joint face matching and FPAD based on a single ViT backbone. A naive way to solve this multi-task problem is to implement separate classification heads (for face matching and FPAD) on top of the ViT class token and learn these heads either individually or jointly. However, we show that this approach results in sub-optimal performance for one of the tasks because the features required by the two tasks are very different: good FPAD performance depends on accurately characterizing micro-textures, whereas face matching demands attention to more global characteristics. Hence, we propose a feature ensemble approach, in which an ensemble of local features extracted from the intermediate blocks of a ViT model is used for the FPAD task, while the face matching task is performed based on the ViT class token. The proposed ViT feature ensemble was evaluated on known datasets for face matching and FPAD such as LFW, SiW and HQWMCA. The results demonstrate that the proposed approach achieves good performance on both face matching and FPAD compared to the naive multi-head approach.
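
The feature split described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. Random arrays stand in for the per-block token outputs of a ViT, and the mean pooling of patch tokens, the concatenation across blocks, the choice of blocks, and all function names are assumptions made purely for illustration.

```python
import numpy as np


def matching_embedding(block_outputs):
    """Face-matching feature: L2-normalized class token (row 0) of the last block."""
    cls_token = block_outputs[-1][0]
    return cls_token / np.linalg.norm(cls_token)


def fpad_ensemble(block_outputs, block_ids):
    """FPAD feature: patch tokens (rows 1..N) of the chosen intermediate blocks,
    mean-pooled per block and concatenated. Pooling and concatenation are
    illustrative assumptions, not the method reported in the thesis."""
    pooled = [block_outputs[i][1:].mean(axis=0) for i in block_ids]
    return np.concatenate(pooled)


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy stand-in for a ViT forward pass on two face images:
# 12 blocks, each producing 1 class token + 196 patch tokens of width 64.
rng = np.random.default_rng(0)
num_blocks, num_patches, dim = 12, 196, 64
tokens_a = [rng.standard_normal((num_patches + 1, dim)) for _ in range(num_blocks)]
tokens_b = [rng.standard_normal((num_patches + 1, dim)) for _ in range(num_blocks)]

# Face matching compares the class-token embeddings of the two images.
emb_a = matching_embedding(tokens_a)
emb_b = matching_embedding(tokens_b)
match_score = cosine_similarity(emb_a, emb_b)

# FPAD uses an ensemble of local features from intermediate blocks
# (the block indices 2, 5, 8 are arbitrary choices for this sketch).
fpad_feature = fpad_ensemble(tokens_a, block_ids=[2, 5, 8])
```

In a full system, `match_score` would be thresholded to accept or reject an identity claim, and `fpad_feature` would be passed to a separately trained classifier that labels the input as bona fide or attack. The backbone is shared, but each task reads the representation it needs: global information from the class token, local texture information from the intermediate patch tokens.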

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Dr. Karthik Nandakumar, Dr. Salman Khan

Online access provided for MBZUAI patrons