Skip to main content
U.S. flag

An official website of the United States government

Official websites use .gov
A .gov website belongs to an official government organization in the United States.

Secure .gov websites use HTTPS
A lock ( ) or https:// means you’ve safely connected to the .gov website. Share sensitive information only on official, secure websites.

CAISI Research Blog

A NIST blog from the Center for AI Standards and Innovation

Insights into AI Agent Security from a Large-Scale Red-Teaming Competition

AI security red-teaming competitions – in which participants compete to develop new attacks against AI models and defenses – provide a unique way to assess how secure today’s AI systems are in the face of adversarial pressure. CAISI recently partnered with Gray Swan, the UK AI Security Institute (UK AISI), and several frontier AI labs to publish a new research paper based on data from a large-scale public AI agent red-teaming competition, revealing several insights into the robustness of current leading AI models.

Background

As AI agents are increasingly deployed to work on tasks that require processing data from external sources such as emails, websites, and code repositories, they face growing risks from agent hijacking, also known as indirect prompt injection. In these attacks, an attacker inserts malicious instructions into data that may be ingested by an AI agent, aiming to derail the agent into taking unintended, harmful actions, such as exfiltrating the user’s sensitive data or downloading and running malicious code.

Understanding and mitigating agent hijacking attacks is a key area of focus for the Center for AI Standards and Innovation (CAISI), as part of our mission to measure and improve the security of AI systems. A year ago, we released a blog post detailing our efforts to improve agent hijacking evaluations. And more recently, our evaluation of DeepSeek models showed that these models remain more vulnerable than leading U.S. models to hijacking attacks, uncovering cases where DeepSeek models were more easily tricked into sending phishing emails, running malware, and exfiltrating user login credentials.

A key challenge in benchmarking AI system security – one we highlighted in our blog post, and emphasized by other top security researchers – is the need for security evaluations to constantly evolve and adapt in order to assess the risks from real-world adversaries, who continuously seek out new attacks and tailor their techniques to particular targets and defenses.

Red-teaming competitions provide a key method to bridge this gap. In these competitions, human red-teamers are brought together to try to attack real models and defenses, providing a better opportunity to measure how these defenses will hold up against real-world adversaries.

The results of these competitions provide valuable information about the robustness of current models, as well as data and insights that can be used to improve future security evaluations. That’s why CAISI partnered with Gray Swan, the UK AI Security Institute (UK AISI), and several frontier AI labs to analyze the data from a recent large-scale public red-teaming competition.

Key Findings

The paper draws upon data from a public red-teaming competition hosted by Gray Swan, which challenged participants to develop hijacking attacks against 13 different frontier models in a variety of different agentic scenarios, including tool use agents, coding agents, and computer use agents.

Across more than 250,000 attack attempts from over 400 participants, at least one successful attack was found against all of the target frontier models, highlighting the ongoing challenge that hijacking attacks pose for secure agent use.

However, the data also showed that frontier models differed sharply in how many successful attacks were found – and also found that this did not correlate uniformly with model capability – emphasizing how red-teaming competitions and comparative security benchmarking can help inform users of the risk profiles of different models they might adopt for agentic applications.

The paper also investigates how attacks developed against one model during the competition transferred to other models and scenarios in post-competition testing. It found certain families of “universal” attacks that were often able to transfer across scenarios and models, potentially by exploiting shared underlying weaknesses in instruction-following behavior between models. This analysis also revealed that successful attacks developed against more robust models (those with lower rates of attack success during the competition) were particularly likely to transfer to models that were less robust, but not the other way around.

Takeaways

Red-teaming competitions are a valuable tool for improving and assessing the security of frontier models. They provide opportunities for evaluators to better understand and analyze the kinds of attacks that dedicated adversaries can find, particularly against models that are increasingly robust to known static attacks (a key challenge when it comes to benchmarking).

As part of the research collaboration with Gray Swan, CAISI contributed to the analysis methodology for this work, and will also receive data from the competition, including on attack strategies submitted by participants. We will be able to use this data to provide further methodological feedback on the design of future red-teaming competitions, as well as to improve our own AI security measurements going forward.

AI security evaluations are a continuously moving target, presenting novel challenges, from efficiently covering the enormous space of possible natural language attacks, to understanding and analyzing how attacks do and do not transfer across models. We’re excited to continue to collaborate with organizations like Gray Swan, independent evaluators, and frontier AI developers to develop tools, methods, and practices that can make progress against these hard problems – enabling agents to be deployed in economically valuable settings where security is a prerequisite.

Was this page helpful?