Exploiting Differential Adversarial Sensitivity for Trojaned Model Detection

Given the paucity of data and the complexity of training a machine learning (ML) model from scratch, ML practitioners commonly download a pre-trained model and fine-tune it for the task at hand. However, relying on pre-trained models exposes the systems that deploy them to Trojan (a.k.a. poisoning or backdoor) attacks. Such attacks enable an adversary to train and distribute a corrupted model that behaves well and achieves good accuracy on clean inputs but behaves maliciously on poisoned inputs containing specific trigger patterns. Building real-world ML applications on top of such Trojaned models can compromise the safety of the resulting systems. Hence, there is a strong need for algorithms that detect whether a given target model has been Trojaned. This thesis presents a novel method for detecting Trojaned models by analyzing the contrasting behavior of benign and Trojaned models under adversarial attacks. The method exploits the observation that Trojaned models are more sensitive to adversarial attacks than benign models. To quantify this sensitivity, a new metric called the adversarial sensitivity index (ASI) is proposed. Furthermore, a practical algorithm is proposed to estimate a sensitivity bound for benign models, and a target model is classified as Trojaned if its ASI exceeds this bound. The proposed algorithm is evaluated on four standard image datasets, and its effectiveness is demonstrated under various types of Trojan attacks.
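The abstract does not define ASI or the bound-estimation algorithm, so the following Python sketch only illustrates the general detection rule (compute a sensitivity score, compare it against a bound derived from benign models). It is not the thesis's method. It assumes, hypothetically, that ASI is the fraction of correctly classified clean inputs whose prediction is flipped by an L∞ fast gradient sign method (FGSM) attack with budget `eps`. It uses a linear binary classifier as a toy stand-in for an image model so that the attack gradient has a closed form, and it assumes the benign sensitivity bound is the mean plus `k` standard deviations of ASI over a set of reference benign models. All function names are hypothetical.

```python
import numpy as np


def fgsm_linear(X, y, w, eps):
    """L-infinity FGSM against a linear binary classifier f(x) = w.x + b, labels in {-1, +1}.

    For the logistic loss log(1 + exp(-y * f(x))), the input gradient is
    -y * sigmoid(-y * f(x)) * w. The sigmoid factor is positive, so the
    gradient's sign is -y * sign(w), and b does not affect the step.
    """
    return X - eps * y[:, None] * np.sign(w)[None, :]


def adversarial_sensitivity_index(X, y, w, b, eps):
    """HYPOTHETICAL ASI: share of correctly classified clean samples flipped by eps-FGSM."""
    clean_pred = np.sign(X @ w + b)
    correct = clean_pred == y
    if not correct.any():
        return 0.0
    adv_pred = np.sign(fgsm_linear(X, y, w, eps) @ w + b)
    flipped = correct & (adv_pred != y)
    return float(flipped.sum() / correct.sum())


def estimate_sensitivity_bound(benign_asis, k=3.0):
    """HYPOTHETICAL bound: mean + k * (population) std of ASI over reference benign models."""
    a = np.asarray(benign_asis, dtype=float)
    return float(a.mean() + k * a.std())


def is_trojaned(target_asi, bound):
    """Flag the target model if its sensitivity exceeds the benign bound."""
    return target_asi > bound
```

For example, `estimate_sensitivity_bound([0.0, 0.1, 0.05])` gives roughly 0.17 with the default `k=3.0`, so a target model with ASI 0.5 would be flagged while one with ASI 0.1 would not. The mean-plus-`k`-std rule is only one simple choice. A high percentile of the benign ASI distribution would be an equally plausible assumption.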

Thesis submitted to the Deanship of Graduate and Postdoctoral Studies

In partial fulfillment of the requirements for the M.Sc. degree in Machine Learning

Advisors: Dr. Huan Xiong, Dr. Karthik Nandakumar

With a two-year embargo period

This document is currently not available here.