Trojan Detection Evaluation: Finding Hidden Behavior in AI Models
Michael Paul Majurski, Derek Juba, Timothy Blattner, Peter Bajcsy, Walid Keyrouz
Neural networks are trained on data, learn relationships in that data, and are then deployed to operate on new data. For example, a traffic sign classification AI can differentiate stop signs from speed limit signs. One potential problem is that an adversary can disrupt the training pipeline to insert Trojan behaviors: for example, the AI can be given just a few additional examples of stop signs with yellow squares on them, each labeled "speed limit sign." We explore the TrojAI program (a collaboration between NIST, IARPA, and JHU/APL), which aims to combat such Trojan attacks by 1) developing reference datasets and 2) operating a challenge where detection methods can be evaluated against sequestered data. Submissions, packaged into Singularity containers, are run against the sequestered data, and results are posted to a public leaderboard. This presentation explores the dataset generation, testing infrastructure, and a baseline detection method within the TrojAI program.
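The data-poisoning attack described above can be sketched in a few lines. The snippet below is purely illustrative and is not taken from the TrojAI codebase: it stamps a small yellow square (the trigger) onto a fraction of "stop sign" images and relabels them as "speed limit sign". All function names, class ids, and parameters are assumptions made for the example.

```python
import numpy as np

STOP, SPEED_LIMIT = 0, 1  # hypothetical class ids

def add_trigger(img, size=4, color=(255, 255, 0)):
    """Return a copy of img (H x W x 3, uint8) with a yellow square in the corner."""
    poisoned = img.copy()
    poisoned[:size, :size] = color
    return poisoned

def poison_dataset(images, labels, rate=0.05, rng=None):
    """Stamp the trigger onto a fraction of stop-sign examples and flip their labels."""
    rng = rng or np.random.default_rng(0)
    images, labels = images.copy(), labels.copy()
    stop_idx = np.flatnonzero(labels == STOP)
    n_poison = max(1, int(rate * len(stop_idx)))
    chosen = rng.choice(stop_idx, size=n_poison, replace=False)
    for i in chosen:
        images[i] = add_trigger(images[i])
        labels[i] = SPEED_LIMIT  # the flipped label teaches the Trojan behavior
    return images, labels

# Example: 100 placeholder 32x32 "stop sign" images
imgs = np.zeros((100, 32, 32, 3), dtype=np.uint8)
labs = np.full(100, STOP)
p_imgs, p_labs = poison_dataset(imgs, labs, rate=0.05)
print((p_labs == SPEED_LIMIT).sum())  # prints 5: number of poisoned examples
```

A model trained on the poisoned set behaves normally on clean stop signs but misclassifies any stop sign bearing the trigger, which is what makes such Trojans hard to detect from accuracy on clean data alone.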