Adaptive Trojan Attacks on Deep Neural Networks
New strategies allow Trojan attacks to bypass detection methods effectively.
Deep Neural Networks (DNNs) have become widely used in many fields, including vision, health, games, and self-driving cars. They perform very well but also have certain weaknesses. One such weakness is their vulnerability to Trojan attacks. In these attacks, a trigger is secretly added to some inputs, allowing the attacker to manipulate the DNN's predictions when those specific inputs are used. The challenge is to detect these Trojaned models effectively.
Output-based detectors can identify Trojaned DNNs by examining their outputs on altered inputs. These detectors have improved over time, but they often assume that the attacker is predictable and has no knowledge of the detection methods. In practice, attackers can adapt their methods to avoid being caught.
In this work, we present a new way for attackers to retrain their Trojaned models while being aware of the detectors. By doing this, they can maintain high accuracy on both the trigger-embedded inputs and regular clean inputs, while also avoiding detection.
Background on DNNs and Trojan Attacks
DNNs are trained to classify data samples into different classes. For instance, in an image classification task, the model will predict which category an image belongs to (like a cat or a dog). However, attackers can exploit these models. They can embed a trigger pattern into certain inputs. When the model encounters these inputs with the trigger, it is tricked into producing a specific output that benefits the attacker, while still behaving normally on regular inputs.
This manipulation can have serious consequences, especially in critical applications like autonomous driving. As a result, both attackers and defenders continuously develop new methods to outsmart each other.
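The trigger embedding described in this section can be illustrated with a minimal sketch. It assumes a grayscale image stored as a nested list and a square patch stamped into the bottom-right corner; the function names, patch shape, and patch position are hypothetical choices for illustration and are not taken from the paper.

```python
def embed_trigger(image, patch_value=1.0, patch_size=2):
    """Return a copy of a 2D image with a square patch in the bottom-right corner.

    The patch shape and position are illustrative assumptions.
    """
    height, width = len(image), len(image[0])
    stamped = [row[:] for row in image]  # copy so the clean image is untouched
    for r in range(height - patch_size, height):
        for c in range(width - patch_size, width):
            stamped[r][c] = patch_value
    return stamped


def poison(dataset, target_label, indices):
    """Embed the trigger into the chosen samples and relabel them.

    dataset is a list of (image, label) pairs; samples not listed in
    indices are returned unchanged.
    """
    chosen = set(indices)
    return [
        (embed_trigger(img), target_label) if i in chosen else (img, lbl)
        for i, (img, lbl) in enumerate(dataset)
    ]


# Toy data: three blank 5x5 images with labels 3, 5, and 8.
clean_image = [[0.0] * 5 for _ in range(5)]
toy_dataset = [(clean_image, 3), (clean_image, 5), (clean_image, 8)]
poisoned = poison(toy_dataset, target_label=0, indices=[1])
poisoned_labels = [lbl for _, lbl in poisoned]
```

Training on a mix of clean and relabeled trigger-embedded samples is what teaches the model to behave normally on clean inputs while mapping triggered inputs to the attacker's target class.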
Detection Methods
As DNNs are targeted by Trojan attacks, researchers have created methods for detection. Broadly, these methods fall into two categories: input-based filtering and output-based detectors.
Input-based filtering involves removing suspicious samples from the training data before they reach the DNN. Output-based detectors, on the other hand, focus on examining the outputs of the DNN when it is given various inputs. These detectors can operate without needing to see the training data of the DNN, making them more practical in real-world scenarios.
Output-based Detectors
Output-based detectors are favored because they require only black-box access to the models. They analyze the outputs the model generates in response to different inputs. There are two main types of these detectors:
- Supervised Detectors: These use labeled data to train a binary classifier that can differentiate between outputs from clean and Trojaned models.
- Unsupervised Detectors: These methods utilize outlier detection techniques to determine if a model's outputs look odd or suspicious.
Both types aim to determine whether a model is Trojaned by analyzing how its outputs change with different inputs. Many of these detectors have shown success but often assume that attackers are static and do not adjust.
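As a rough illustration of the unsupervised case, the sketch below scores a set of models with a median-absolute-deviation (MAD) outlier test. It assumes each model has already been reduced to one number computed from its outputs on probe inputs; that statistic, the toy values, and the threshold of 2.0 are assumptions made for this example, not details of any specific detector.

```python
import statistics


def mad_anomaly_scores(values):
    """Score each value by its distance from the median, in MAD units."""
    med = statistics.median(values)
    deviations = [abs(v - med) for v in values]
    mad = statistics.median(deviations)
    if mad == 0:
        raise ValueError("MAD is zero; scores are undefined for this data")
    # 1.4826 makes the MAD comparable to a standard deviation
    # when the data are normally distributed.
    scale = 1.4826 * mad
    return [d / scale for d in deviations]


def flag_outliers(values, threshold=2.0):
    """Return indices of models whose score exceeds the threshold."""
    scores = mad_anomaly_scores(values)
    return [i for i, s in enumerate(scores) if s > threshold]


# Hypothetical per-model output statistics; the last model looks unusual.
model_stats = [1.02, 0.98, 1.05, 0.97, 1.01, 0.35]
flagged = flag_outliers(model_stats)
```

A supervised detector would instead train a binary classifier on such output features, using models whose clean or Trojaned status is already known.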
The Challenge with Adaptive Attackers
In reality, attackers are not always static. They can learn about the output-based detection methods and adapt their strategies accordingly. This dynamic creates a back-and-forth scenario where both sides evolve to outsmart each other.
Previous methods did not account for adaptive attackers, thus leaving a gap in our approaches to Trojan detection. If the attacker knows the detection method, they can retrain their Trojaned models in a way that maintains accuracy and defeats detection.
The Proposed Adaptive Adversary
We propose a strategy that allows attackers to alter their Trojaned DNNs while being aware of the output-based detectors. The new approach consists of two main steps:
- The attacker embeds a trigger into selected clean samples and retrains the DNN so that the Trojaned model stays accurate on both clean and trigger-embedded samples while evading the current detector.
- The attacker then uses the updated Trojaned model to retrain the detector, recalibrating the detector's parameters to maximize its detection performance.
This iterative process continues until no further improvements can be made in either the Trojaned DNN's performance or the detectability of the model.
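The alternating structure of this process can be sketched as a loop in which each side updates in turn until neither loss changes by more than a tolerance. The scalar losses below are toy stand-ins chosen so that the loop has a known fixed point; they are assumptions for illustration and do not represent real DNN or detector training.

```python
def alternate(attacker_step, detector_step, attacker_loss, detector_loss,
              theta, phi, tol=1e-9, max_rounds=200):
    """Alternate attacker and detector updates until both losses stop changing."""
    prev = (attacker_loss(theta, phi), detector_loss(theta, phi))
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        theta = attacker_step(theta, phi)  # step 1: attacker retrains the model
        phi = detector_step(theta, phi)    # step 2: detector is recalibrated
        cur = (attacker_loss(theta, phi), detector_loss(theta, phi))
        if abs(cur[0] - prev[0]) < tol and abs(cur[1] - prev[1]) < tol:
            break
        prev = cur
    return theta, phi, rounds


# Toy stand-ins (hypothetical): theta is how visible the Trojan behavior is,
# phi is how sensitive the detector is.
def attacker_loss(theta, phi):
    return (theta - 1.0) ** 2 + phi * theta ** 2


def detector_loss(theta, phi):
    return (phi - theta) ** 2


def attacker_step(theta, phi):
    return 1.0 / (1.0 + phi)  # exact minimizer of attacker_loss over theta


def detector_step(theta, phi):
    return theta  # exact minimizer of detector_loss over phi


theta_star, phi_star, rounds_used = alternate(
    attacker_step, detector_step, attacker_loss, detector_loss,
    theta=1.0, phi=0.0,
)
```

In this toy game the iterates settle where theta equals 1 / (1 + theta), about 0.618. In a real setting each step would be a full training run, and convergence would have to be checked empirically.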
Key Insights
The high number of parameters in DNNs gives them a lot of room to adjust for different inputs. This flexibility allows attackers to create Trojaned models that can still perform well while remaining undetected. The interaction between the attacker and the detection mechanism can be viewed as a game where both sides are trying to outmaneuver the other.
Experiments and Results
To validate our new approach, we conducted a series of experiments using multiple datasets, including images and audio commands. The goal was to see how well our adaptive attacker could bypass state-of-the-art output-based Trojan detection methods.
Methodology
We used several well-known datasets, which contain various examples to train and test our methods. The datasets included:
- MNIST: A dataset of handwritten digits.
- CIFAR-10 and CIFAR-100: Datasets containing images of common objects.
- SpeechCommand: A collection of audio files for spoken commands.
The experiments aimed to measure:
- The model's classification accuracy on clean samples.
- The attack success rate, meaning the fraction of trigger-embedded samples classified as the attacker's chosen output.
- The detection rates of the state-of-the-art (SOTA) Trojan detectors.
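The three measurements above can be computed as simple ratios. The sketch below is a minimal version under assumed definitions: the function names are hypothetical, and the toy predictions and detector flags are made up for illustration rather than taken from the experiments.

```python
def clean_accuracy(predictions, labels):
    """Fraction of clean samples whose predicted class matches the true label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def attack_success_rate(trigger_predictions, target_label):
    """Fraction of trigger-embedded samples classified as the attacker's target."""
    hits = sum(p == target_label for p in trigger_predictions)
    return hits / len(trigger_predictions)


def detection_rate(detector_flags):
    """Fraction of Trojaned models that a detector flagged as Trojaned."""
    return sum(detector_flags) / len(detector_flags)


# Made-up toy values, for illustration only.
acc = clean_accuracy([0, 1, 2, 2, 1], [0, 1, 2, 1, 1])
asr = attack_success_rate([7, 7, 7, 3], target_label=7)
det = detection_rate([False, True, False, False])
```

A successful adaptive attack in this framing keeps the first two numbers high and drives the third toward zero.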
Results
Our findings show that the adaptive adversary bypasses the detection methods. Across all four datasets, the retrained Trojaned models evaded four state-of-the-art output-based detectors (MNTD, NeuralCleanse, STRIP, and TABOR) while keeping high accuracy on both clean and trigger-embedded samples.
The attack remained effective even when the detectors were allowed to retrain and recalibrate their parameters in response to the updated Trojaned models.
Greedy Algorithm for Trigger Embedding
We also introduced a greedy algorithm to aid the attacker in selecting which input samples to embed the Trojan triggers into. The goal was to minimize operational costs while ensuring high effectiveness.
Key Considerations
When selecting input samples for trigger embedding, the attacker must keep three main factors in mind:
- Cost of Attack: A larger number of trigger-embedded samples increases the operational costs for the attacker.
- Integrity of the Model: Too many trigger-embedded samples can degrade the accuracy of the Trojaned model on clean inputs, increasing the chance of detection.
- Stealth: An excessive number of trigger samples can lead to quick detection by advanced methods.
The greedy algorithm ensures that the attacker uses the minimal number of samples needed to achieve the desired effects without attracting attention.
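A generic greedy selection loop of this kind can be sketched as follows. This is not the paper's algorithm: the paper ties its guarantees to the cross-entropy or log-likelihood loss of the DNN, whereas this sketch uses a hypothetical coverage-style gain function purely to show the pick-the-best-next-sample pattern.

```python
def greedy_select(candidates, gain, target):
    """Add the candidate with the largest gain until the target is reached."""
    selected = []
    remaining = list(candidates)
    current = gain(selected)
    while remaining and current < target:
        best = max(remaining, key=lambda c: gain(selected + [c]))
        best_value = gain(selected + [best])
        if best_value <= current:
            break  # no candidate improves the objective any further
        selected.append(best)
        remaining.remove(best)
        current = best_value
    return selected


# Hypothetical stand-in for attack effectiveness: each candidate sample
# "covers" some of seven abstract behaviors the attacker wants to influence.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}


def coverage_gain(chosen):
    covered = set()
    for name in chosen:
        covered |= covers[name]
    return len(covered) / 7


picked = greedy_select(["a", "b", "c", "d"], coverage_gain, target=1.0)
```

Here two of the four candidates are enough to reach the target, which mirrors the goal of embedding triggers in as few samples as possible.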
Conclusion
Our work highlights the evolving contest between Trojan attackers and the defenders of DNNs. As detection methods improve, so too do the strategies of attackers. By creating an adaptive adversary model, we demonstrated that attackers can achieve high accuracy on both clean and trigger-embedded inputs while effectively evading detection.
Furthermore, the integration of a greedy algorithm into the process allows attackers to be more efficient in their approaches to embedding triggers. With extensive experiments demonstrating the effectiveness of this new approach across various datasets and detection methods, it is clear that the landscape of Trojan detection must continue to adapt.
As attackers develop more advanced techniques, defenders and researchers must develop innovative approaches to keep pace with evolving threats. This ongoing tug-of-war underscores the importance of remaining vigilant and proactive in safeguarding the integrity of machine learning models and the data they process.
Title: Game of Trojans: Adaptive Adversaries Against Output-based Trojaned-Model Detectors
Abstract: We propose and analyze an adaptive adversary that can retrain a Trojaned DNN and is also aware of SOTA output-based Trojaned model detectors. We show that such an adversary can ensure (1) high accuracy on both trigger-embedded and clean samples and (2) bypass detection. Our approach is based on an observation that the high dimensionality of the DNN parameters provides sufficient degrees of freedom to simultaneously achieve these objectives. We also enable SOTA detectors to be adaptive by allowing retraining to recalibrate their parameters, thus modeling a co-evolution of parameters of a Trojaned model and detectors. We then show that this co-evolution can be modeled as an iterative game, and prove that the resulting (optimal) solution of this interactive game leads to the adversary successfully achieving the above objectives. In addition, we provide a greedy algorithm for the adversary to select a minimum number of input samples for embedding triggers. We show that for cross-entropy or log-likelihood loss functions used by the DNNs, the greedy algorithm provides provable guarantees on the needed number of trigger-embedded input samples. Extensive experiments on four diverse datasets -- MNIST, CIFAR-10, CIFAR-100, and SpeechCommand -- reveal that the adversary effectively evades four SOTA output-based Trojaned model detectors: MNTD, NeuralCleanse, STRIP, and TABOR.
Authors: Dinuka Sahabandu, Xiaojun Xu, Arezoo Rajabi, Luyao Niu, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran
Last Update: 2024-02-12 00:00:00
Language: English
Source URL: https://arxiv.org/abs/2402.08695
Source PDF: https://arxiv.org/pdf/2402.08695
Licence: https://creativecommons.org/licenses/by/4.0/
Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.
Thank you to arxiv for use of its open access interoperability.