A NIST blog from the Center for AI Standards and Innovation
By Maia Hamin and Benjamin Edelman
AI evaluations are designed to assess and compare how AI models perform on different tasks. Developers, users, and independent evaluators — like the Center for AI Standards and Innovation (CAISI) — can use evaluations to track trends in model capabilities and inform decisions about real-world use.
Agent evaluations test whether models can use tools in a multi-turn feedback loop to solve complex problems like debugging software or uncovering cybersecurity vulnerabilities. They allow evaluators to measure new and increasingly economically valuable capabilities, but also bring new methodological challenges — including, as CAISI and other evaluators have found, the risk that AI agents can use their tools to cheat.
As part of our mission, CAISI both directly evaluates AI models and seeks to advance best practices for AI measurement science. To improve our evaluations, we built an AI transcript analysis tool to search through our historical evaluation transcripts for cheating. To support the development of stronger ecosystem practices, the following pages share examples that we uncovered and suggests takeaways that may help other evaluators reduce the incidence and impact of evaluation cheating.
Using our transcript analysis tool, we found several examples of how models were able to successfully cheat on agentic coding and cyber benchmarks, including:
We bucket these examples into two categories of cheating risks: solution contamination, where a model accesses information that improperly reveals the solution to an evaluation task; and grader gaming, where a model exploits a gap or misspecification in an evaluation’s automated scoring system to craft a solution that scores highly without fulfilling the “spirit” of the intended task.
| CAISI Benchmark | Cheating Examples from Evaluation Logs | Cheating Type | % Logs with Successful Solution Due to Cheating (Lower Bound) |
|---|---|---|---|
| Cybench |
| Solution contamination | 0.3% |
| SWE-bench Verified |
| Solution contamination | 0.1% |
| Grader gaming | 0.2% | |
| CVE-Bench (Internal) |
| Grader gaming | 4.80% |
In general, we define evaluation cheating as:
when an AI model exploits a gap between what an evaluation task is intended to measure and its implementation, solving the task in a way that subverts the validity of the measurement.
This definition focuses on the problem that cheating creates for evaluation validity: if models can exploit implementation loopholes to score higher without actually improving at the skills an evaluation is intended to measure, it degrades the value of that measurement for decisions about real-world adoption. As models become increasingly adept problem-solvers, they may be able to find new successful cheating strategies, and detecting and preventing cheating may become increasingly important for the validity and comparability of evaluation results.
Based on lessons we learned through this process, we share some preliminary suggested practices for other evaluators and benchmark designers interested in addressing evaluation cheating, including:
View the full writeup to see examples of cheating from CAISI's agent evaluations and a discussion of practices to address evaluation cheating.