Exploiting Differential Adversarial Sensitivity for Trojaned Model Detection
Given the paucity of data and the complexity of training a machine learning (ML) model from scratch, it is common practice among ML practitioners to download a pre-trained model and fine-tune it for the task at hand. However, pre-trained models expose the systems that deploy them to Trojan (a.k.a. poisoning or backdoor) attacks, in which an adversary trains and distributes a corrupted model that behaves well and achieves good accuracy on clean input samples but behaves maliciously on poisoned samples containing specific trigger patterns. Using such Trojaned models as the foundation for real-world ML applications can therefore compromise system safety, so there is a strong need for algorithms that detect whether a given target model has been Trojaned. This thesis presents a novel method to detect Trojaned models by analyzing the contrasting behavior of Benign and Trojaned models when subjected to adversarial attacks. The proposed method exploits the fact that Trojaned models are more sensitive to adversarial attacks than Benign models. A new metric, the adversarial sensitivity index (ASI), is proposed to quantify the sensitivity of an ML model to adversarial attacks, together with a practical algorithm to estimate the sensitivity bound of Benign models; a target model is categorized as Trojaned if its ASI exceeds this bound. The proposed algorithm has been evaluated on four standard image datasets and its effectiveness under various types of Trojan attacks has been demonstrated.
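The detection rule described above can be sketched in a few lines. This is a hedged illustration, not the thesis's exact method: the ASI here is simply the relative accuracy drop under an adversarial attack, the benign sensitivity bound is estimated as mean plus two standard deviations over a set of reference Benign models, and all function names and numbers are hypothetical.

```python
import numpy as np

def adversarial_sensitivity_index(clean_acc, adv_acc):
    """Illustrative ASI: relative accuracy drop under adversarial attack.
    (The thesis's precise ASI definition may differ.)"""
    return (clean_acc - adv_acc) / max(clean_acc, 1e-12)

def benign_sensitivity_bound(benign_asis, k=2.0):
    """Estimate an upper bound on Benign-model ASI as mean + k * std,
    computed over ASI values of known-Benign reference models."""
    asis = np.asarray(benign_asis, dtype=float)
    return asis.mean() + k * asis.std()

def is_trojaned(target_asi, bound):
    """Flag the target model as Trojaned if its ASI exceeds the bound."""
    return target_asi > bound

# Hypothetical numbers: Benign models lose little accuracy under attack,
# while the target model's accuracy collapses, yielding a large ASI.
benign_asis = [0.10, 0.12, 0.08, 0.11]
bound = benign_sensitivity_bound(benign_asis)
target_asi = adversarial_sensitivity_index(clean_acc=0.95, adv_acc=0.30)
print(is_trojaned(target_asi, bound))  # prints True
```

In practice the accuracies would come from running a standard adversarial attack (e.g. an FGSM- or PGD-style perturbation) on a held-out clean set for each model; the sketch only shows the thresholding step.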
S. Fares, "Exploiting Differential Adversarial Sensitivity for Trojaned Model Detection", M.S. Thesis, Machine Learning, MBZUAI, Abu Dhabi, UAE, 2023.