A Unified Model for Face Matching and Presentation Attack Detection using an Ensemble of Vision Transformer Features

Document Type

Conference Proceeding

Publication Title

Proceedings - 2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops, WACVW 2023


A typical automated face recognition system is composed of three main component tasks: face detection and alignment (FDA), face presentation attack detection (FPAD), and face representation and matching (FRM). These tasks are often treated as standalone problems, and deep neural network (DNN)-based solutions have been proposed to address them individually. However, in resource-constrained scenarios it would be ideal to have a unified DNN model that can perform all three tasks together. As a first step towards realizing this goal, this work attempts to perform joint FRM and FPAD based on a single Vision Transformer (ViT) backbone. Recent work demonstrating the ability of ViT to extract a diverse set of feature representations gives rise to the tantalising possibility of building an end-to-end face recognition system using a single ViT model. The standard approach for designing multi-task DNNs is to implement different classification heads (e.g., for FRM and FPAD) on top of a common stem/base and to learn these heads either individually or jointly. A key contribution of this work is to demonstrate that this naive multi-head approach results in sub-optimal performance for either FRM or FPAD, because the features required by these tasks are very different. While good FPAD performance depends on accurately characterizing micro-textures, face matching demands attention to more global characteristics. Hence, we propose a novel feature ensemble approach, in which an ensemble of local features extracted from the intermediate blocks of a ViT is used for FPAD, while face matching is performed based on the ViT class token. Experiments demonstrate that the proposed ViT feature ensemble approach is able to achieve good performance for both face matching and FPAD compared to the multi-head approach.
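The split described in the abstract (class token for matching, local features from intermediate blocks for FPAD) can be illustrated with a small, library-free sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the token layout (index 0 is the class token), the mean pooling of patch tokens, the per-block logistic scorers, the score averaging, and all function names are hypothetical, and the abstract does not say how the ensemble is actually formed.

```python
import math
from typing import List, Sequence

Vector = List[float]
# block_tokens[b][t] is the feature vector of token t after ViT block b.
# Assumption: token 0 is the class token, the remaining tokens are patches.
BlockTokens = List[List[Vector]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def match_score(tokens_a: BlockTokens, tokens_b: BlockTokens) -> float:
    """Face matching: compare the class tokens of the final block."""
    return cosine_similarity(tokens_a[-1][0], tokens_b[-1][0])


def mean_pool_patches(block: List[Vector]) -> Vector:
    """Average the patch tokens of one block (the class token is skipped)."""
    patches = block[1:]
    dim = len(patches[0])
    return [sum(p[i] for p in patches) / len(patches) for i in range(dim)]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fpad_score(
    tokens: BlockTokens,
    block_ids: Sequence[int],
    weights: Sequence[Vector],
    biases: Sequence[float],
) -> float:
    """FPAD: average the scores of one hypothetical linear scorer per chosen
    intermediate block, each applied to that block's mean-pooled patch tokens."""
    scores = []
    for b, w, c in zip(block_ids, weights, biases):
        pooled = mean_pool_patches(tokens[b])
        z = sum(wi * xi for wi, xi in zip(w, pooled)) + c
        scores.append(sigmoid(z))
    return sum(scores) / len(scores)


# Toy input: 2 blocks, 1 class token + 2 patch tokens each, dimension 2.
toy = [
    [[1.0, 0.0], [2.0, 0.0], [0.0, 2.0]],
    [[0.0, 1.0], [1.0, 1.0], [3.0, 1.0]],
]
same_face = match_score(toy, toy)
attack_probability = fpad_score(toy, [0, 1], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
```

In this sketch both outputs come from one set of per-block token features, so the backbone would run once per image; only the source of the features differs between the two tasks, which is the point the abstract makes against sharing a single feature for both heads.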

Keywords

Deep learning, Face recognition, Conferences, Neural networks, Feature extraction, Transformers, Multitasking


IR conditions: non-described