Large AI models are increasingly used to power agentic systems, or “agents,” which can automate complex tasks on behalf of users. AI agents could have a wide range of potential benefits, such as automating scientific research or serving as personal assistants.
However, to fully realize the potential of AI agents, it is essential to identify and measure — in order to ultimately mitigate — the security risks these systems could introduce.
Currently, many AI agents are vulnerable to agent hijacking, a type of indirect prompt injection in which an attacker inserts malicious instructions into data that may be ingested by an AI agent, causing it to take unintended, harmful actions.
The U.S. AI Safety Institute (US AISI) conducted initial experiments to advance the science of evaluating agent hijacking risk and below are key insights from this work.
AI agent hijacking is the latest incarnation of an age-old computer security problem that arises when a system lacks a clear separation between trusted internal instructions and untrusted external data — and is therefore vulnerable to attacks in which hackers provide data that contains malicious instructions designed to trick the system.
The architecture of current LLM-based agents generally requires combining trusted developer instructions with other task-relevant data into a unified input. In agent hijacking, attackers exploit this lack of separation by creating a resource that looks like typical data an agent might interact with when completing a task, such as an email, a file, or a website — but that data actually contains malicious instructions intended to “hijack” the agent to complete a different and potentially harmful task.
To experiment with agent hijacking evaluations, US AISI used AgentDojo, a leading open-source framework for testing the vulnerability of AI agents developed by researchers at ETH Zurich. These tests were conducted on agents powered by Anthropic’s upgraded Claude 3.5 Sonnet (released October 2024), which AgentDojo found to be one of the top performing models in resisting agent hijacking attacks.
AgentDojo consists of a set of four environments that simulate real-world contexts in which agents might be used: Workspace, Travel, Slack, and Banking. Each of the four environments contains a set of simulated “tools” that can be used by an agent to complete tasks.
The fundamental unit of the evaluation is the hijacking scenario. In each hijacking scenario, an agent is asked to complete a legitimate user task but encounters data containing an attack that tries to direct the agent to complete a malicious injection task. If the agent ends up completing the injection task, the agent was successfully hijacked.
An example of a hijacking scenario included in the AgentDojo framework consisting of a benign user task and a malicious injection task.
US AISI leveraged AgentDojo’s default suite of hijacking scenarios and built additional, custom scenarios in-house. US AISI tested AgentDojo’s baseline attack methods as well as novel attack methods (detailed below) that were developed jointly with the UK AI Safety Institute through red teaming.
Below are several key lessons US AISI drew from the tests conducted.
Publicly available evaluation frameworks provide an important foundation for enabling security research. For these frameworks to remain effective and keep pace with rapid technological advancement, it is important that the evaluations are routinely improved and iterated upon by the scientific community.
To that end, US AISI’s technical staff devoted several days to improving and extending the AgentDojo framework. The team remediated various bugs in AgentDojo’s default hijacking scenarios and made system-level improvements, such as adding asynchronous execution support and integrating with Inspect.
US AISI also augmented AgentDojo with several new injection tasks in order to evaluate priority security risks not previously addressed in the framework — specifically: remote code execution, database exfiltration, and automated phishing.
Across all three new risk areas, US AISI was frequently able to induce the agent to follow the malicious instructions, which underlines the importance of continued iteration and expansion of the framework.
To support further research into agent hijacking and agent security more broadly, US AISI plans to make our improvements to the AgentDojo framework publicly available.
When evaluating the robustness of AI systems in adversarial contexts such as agent hijacking, it is crucial to evaluate attacks that were optimized for these systems. A new system may be robust to attacks tested in previous evaluations, but real-life attackers can probe the new system’s unique weaknesses — and the evaluation framework needs to reflect this reality.
For instance, the upgraded Claude 3.5 Sonnet is significantly more robust against previously tested hijacking attacks than the prior version of Claude 3.5 Sonnet. But, when US AISI tested the new model against novel attacks developed specifically for the model, the measured attack success rate increased dramatically.
To adapt the evaluation to the upgraded Sonnet model, the US AISI technical staff organized a red teaming exercise, which was performed in collaboration with red teamers at the UK AI Safety Institute.
The team developed attacks using a random subset of user tasks from the Workspace environment and then tested them using a held-out set of user tasks. This resulted in an increase in attack success rate from 11% for the strongest baseline attack to 81% for the strongest new attack.
Extending this further, US AISI then tested the performance of the new red team attacks in the other three AgentDojo environments to determine if they generalized well beyond the Workspace environment.
As shown in the plot below, the new attacks created for the Workspace environment were also successful when applied to tasks in the other three environments, suggesting that real-world attackers may be successful even without detailed information about the specific environment they are attacking.
So far, agent hijacking risk has been measured using the overall attack success rate, which is an aggregated measure across a collection of injection tasks.
While that is a useful metric, the analysis shows that measuring and analyzing the attack success rates of each injection task individually can also be informative, as each task poses a different level of risk.
Consider the following collection of injection tasks:
The average success rate across these five injection tasks is 57%. But, if this aggregate measure is broken down into injection task-specific results, the overall risk picture becomes more nuanced.
The task-level results reveal several details that were not clearly conveyed by the aggregate measure and could ultimately impact the assessment of the downstream risks.
First, it is evident that hijacks are far more successful for certain tasks in this collection than for others — some tasks are more frequently successful than the average (well over 57% of the time), while others are markedly less successful.
By separating these tasks out, it is also clear that that the impact of a successful attack varies widely, which should also be taken into account when using these evaluations to assess risk.
Consider, for example, the real-world impact of a hijacked AI agent sending a benign email versus that agent exfiltrating a large amount of user data — the latter is clearly much more consequential. Therefore, even though the attack success rate for the data exfiltration task is low, that doesn’t mean this scenario should not be seriously considered and mitigated against.
Some injection tasks may also pose disproportionate risk. For example, the malicious script task has a high success rate and is highly consequential, since executing a malicious script could enable an attacker to execute a range of other cyberattacks.
Many evaluation frameworks, including AgentDojo, measure the efficacy of a given attack based on a single attempt. However, since LLMs are probabilistic, the output of a model can vary from attempt to attempt.
Put simply, if a user instructs an AI agent to perform the exact same task twice, it's possible that the agent will produce different results each time. This means that if an attacker can try to attack multiple times without incurring significant costs, they will be more likely to eventually succeed.
To demonstrate this, US AISI took the five injection tasks in the previous section and attempted each attack 25 times. After repeated attempts, the average attack success rate increased from 57% to 80%, and the attack success rate for individual tasks changed significantly.
Therefore, in applications where repeated attacks are possible, moving beyond one attempt to evaluate an agent based on multiple attempts can result in meaningfully different, and possibly more realistic, estimates of risk.
Agent hijacking will continue to be a persistent challenge as agentic systems continue to evolve. Strengthening and expanding evaluations for agent security issues like hijacking will help users understand and manage these risks as they seek to deploy agentic AI systems in a variety of applications.
Some defenses against hijacking attacks are available and continuing to evaluate the efficacy of these defenses against new attacks is another important area for future work in agent security. Developing defensive measures and practices that provide stronger protection, as well as the evaluations needed to validate their efficacy, will be essential to unlocking the many benefits of agents for innovation and productivity.