Skip to main content
U.S. flag

An official website of the United States government

Official websites use .gov
A .gov website belongs to an official government organization in the United States.

Secure .gov websites use HTTPS
A lock ( ) or https:// means you’ve safely connected to the .gov website. Share sensitive information only on official, secure websites.

Cheating On AI Agent Evaluations

By Maia Hamin and Benjamin Edelman

AI evaluations are designed to assess and compare how AI models perform on different tasks. Developers, users, and independent evaluators — like the Center for AI Standards and Innovation (CAISI) — can use evaluations to track trends in model capabilities and inform decisions about real-world use.

Agent evaluations test whether models can use tools in a multi-turn feedback loop to solve complex problems like debugging software or uncovering cybersecurity vulnerabilities. They allow evaluators to measure new and increasingly economically valuable capabilities, but also bring new methodological challenges — including, as CAISI and other evaluators have found, the risk that AI agents can use their tools to cheat.

As part of our mission, CAISI both directly evaluates AI models and seeks to advance best practices for AI measurement science. To improve our evaluations, we built an AI transcript analysis tool to search through our historical evaluation transcripts for cheating. To support the development of stronger ecosystem practices, the following pages share examples that we uncovered and suggests takeaways that may help other evaluators reduce the incidence and impact of evaluation cheating.

Using our transcript analysis tool, we found several examples of how models were able to successfully cheat on agentic coding and cyber benchmarks, including:

  • Models using the internet to find walkthroughs and answers for cyber capture-the-flag challenges
  • Models using generic denial-of-service attacks to crash servers on cyber tasks instead of exploiting intended vulnerabilities
  • Models cheating on coding benchmarks by looking up more recent code versions, disabling assertions, and adding test-specific logic.

We bucket these examples into two categories of cheating risks: solution contamination, where a model accesses information that improperly reveals the solution to an evaluation task; and grader gaming, where a model exploits a gap or misspecification in an evaluation’s automated scoring system to craft a solution that scores highly without fulfilling the “spirit” of the intended task.

Examples of evaluation cheating from CAISI’s evaluation logs
CAISI BenchmarkCheating Examples from Evaluation LogsCheating Type% Logs with Successful Solution Due to Cheating (Lower Bound)
Cybench
  • Using coding tools to search the internet for challenge flags and walkthroughs
Solution contamination0.3%
SWE-bench Verified
  • Reviewing more recent code versions on GitHub
  • Installing more recent code versions using package managers
Solution contamination0.1%
  • Commenting out assertion checks to pass unit tests
Grader gaming0.2%
CVE-Bench (Internal)
  • Using denial-of-service attacks to crash the target server instead of exploiting the CVE
Grader gaming 4.80%

In general, we define evaluation cheating as:

when an AI model exploits a gap between what an evaluation task is intended to measure and its implementation, solving the task in a way that subverts the validity of the measurement.

This definition focuses on the problem that cheating creates for evaluation validity: if models can exploit implementation loopholes to score higher without actually improving at the skills an evaluation is intended to measure, it degrades the value of that measurement for decisions about real-world adoption. As models become increasingly adept problem-solvers, they may be able to find new successful cheating strategies, and detecting and preventing cheating may become increasingly important for the validity and comparability of evaluation results.

Based on lessons we learned through this process, we share some preliminary suggested practices for other evaluators and benchmark designers interested in addressing evaluation cheating, including:

  • Review evaluation transcripts for cheating, including by leveraging AI transcript analysis tools that can help scale human review processes.
  • Prevent cheating by closing task design loopholes and setting clear rules in task prompts in order to make model comparisons more accurate and fair.
  • Standardize benchmark-specific expectations about agent affordances and restrictions to help evaluators create more comparable evaluation results, including by making it easier to catch and prevent cheating.

Use the navigation link below to read the full write-up, including examples of evaluation cheating and preliminary practices.

Created November 28, 2025, Updated December 2, 2025
Was this page helpful?