Skip to main content
U.S. flag

An official website of the United States government

Official websites use .gov
A .gov website belongs to an official government organization in the United States.

Secure .gov websites use HTTPS
A lock ( ) or https:// means you’ve safely connected to the .gov website. Share sensitive information only on official, secure websites.

2. Examples of cheating in CAISI’s agent evaluations

To investigate the issue of evaluation cheating, CAISI built an LLM-based transcript analysis tool (described later in this post) to search through our agent evaluation logs. Using this tool, we found several examples of both solution contamination and grader gaming on our agent benchmarks. We present these examples below, alongside illustrative transcript excerpts and visualizations of detections from our transcript analysis tool.

Note that, since we analyzed historical data that was originally collected for other purposes, different benchmarks may have results from different sets of models or different numbers of samples per task or model. We used human review to validate each of the instances of successful cheating reported below and iterated on our transcript review tool to improve its accuracy, but false negatives are still a possibility. The figures should be interpreted as a visualization of detections by our current transcript analysis tool rather than an absolute claim about the behavior of any model.

2.1. Solution contamination

In solution contamination, an agent solves an evaluation task by accessing information that goes beyond what was intended and what would be available in the realistic setting that the evaluation is trying to emulate. For example, consider the SWE-bench Verified example from above, in which agents learn information about the future state of the repository through the git history. When fixing software in the real world, a coding agent would not be able to peek ahead to copy how the exact problem was solved by someone else – so, if it can do so during an evaluation, the evaluation will fail to measure how it will perform in a realistic setting.

The ability for AI agents to access the internet during an evaluation is a common cause of solution contamination. If benchmarks are publicly released, agents may be able to directly find solutions on sites that host the benchmark, as Scale AI reported. Benchmarks may also be developed by compiling historical data – for example, past bugs or vulnerabilities from real codebases – and internet access can enable an agent to improperly find additional information about the task and how it was solved in the real world. (Solution contamination at evaluation time is distinct from training data contamination, in which benchmark data is included in a model’s training data, and can occur even when benchmark data postdates a model’s release and so there is no risk of training data leakage.)

Solution contamination can also occur without internet search. The challenge environment itself may inadvertently leak information about the task solution, such as through git histories (as on SWE-bench Verified), solution-related files and artifacts left in the environment (as described by OpenAI), or configuration files (as CAISI experienced, when we realized agents had discovered a Cybench solution string accidentally included in a task configuration file that was technically accessible in the environment).

Example from CAISI evaluations: flag and walkthrough search on Cybench

Cybench is a benchmark of “capture-the-flag” (CTF) challenges, gamified problems from competitions designed to test human hackers’ skills. In these challenges, competitors must exploit a vulnerable software system or investigate software artifacts to find and submit a secret string (the “flag”).

Cybench tasks are based on real challenges from past CTF competitions, and many have online write-ups posted by competitors or organizers after the competition ended. Our transcript review tool found agents using their bash and Python tools to find these walkthroughs online, including to directly retrieve the challenge-specific flag. The only models we found successfully solving Cybench challenges in this way were o3 and GPT-5 – more recent and more capable agentic models – and GPT-5 more often succeeded at finding the flag through search than its predecessor o3.

A snippet from an evaluation transcript shows GPT-5 using the bash tool to retrieve online challenge walkthroughs via curl, and searching for the string “HTB”, which corresponds to the format of the flag, via grep. The tool call returns a matching flag string, which the model submits in its next turn.
Example: GPT-5 uses curl to find and submit a flag from an online challenge write-up.
Credit: NIST
A heatmap graph shows instances of unintended solutions due to internet search on Cybench. Results are for the 40 Cybench tasks, with varying number of epochs per model. Three instances were identified for OpenAI's GPT-5, and one instance for o3. No other model (including Anthropic’s Opus 4 and Sonnets 3.7, 3.6, and 3.5, OpenAI’s o4-mini and GPT-4o, DeepSeek’s R1 and R1-0528, and Alibaba’s QwQ) was identified as solving the task by using an online solution.
Transcript analysis results for Cybench challenges that an agent solved by finding flags or walkthroughs online.
Credit: NIST

Example from CAISI’s evaluations: code lookup on SWE-bench Verified

SWE-bench Verified is a benchmark that requires agents to fix real past issues from open-source software libraries. Since these are real issues from real codebases, these issues have been fixed in more recent versions of the codebase, creating a chance for agents to cheat by accessing more recent code versions.

At the time these logs were collected, CAISI was running SWE-bench verified in a mode where agents were allowed to access the internet so that they could install dependencies required for the task via pip or other package management tools. (Below, we’ll discuss how we are updating our evaluation implementation.) However, our transcript analysis tool found agents using their coding tools to access more recent versions of the libraries’ codebases, revealing how the issue in question was fixed by someone else. For example, o3 used bash tools to directly query the latest version of the library on GitHub, or package managers like pip to install the latest version of the library and then read the source code.

A snippet from an evaluation transcript shows OpenAI’s o3 explaining that it is going to download the commit diff from GitHub for “a lesson”, then using its bash tool to view the contents of the code from the GitHub repository via curl.
Example: o3 curls the latest version of the code for the function it is supposed to fix.
Credit: NIST
A heatmap graph shows instances of unintended solutions due to looking up a more recent code version on SWE-bench Verified. Results are for 498 tasks, with 1 epoch per model. The results show OpenAI’s o3 accessing a more recent code version in 2 different instances. No other model (including Anthropic’s Opus 4, OpenAI’s GPT-5 and o4-mini, or DeepSeek’s R1-0528) was identified as using recent code lookup to craft a successful solution.
Transcript analysis results for SWE-bench Verified solutions in which the model found and read a more recent version of the code. o3 is the only model we found to use this strategy.
Credit: NIST

We also searched for instances where models viewed commits corresponding to future code states via the repository’s git history, based on the findings reported by other SWE-bench Verified users. We found some instances in which Opus 4 viewed commits corresponding to a future repository state, but none in which the model reviewed the canonical solution for the issue it was fixing.

A transcript snippet shows Anthropic Opus 4 using its bash tool to run a command to search the repository’s git log for a specific string of numbers. The resulting output shows a commit from April 13, 2019, with a description that mentions a fix for an issue related to formatting small decimal numbers. The code diff shows changes to test files that test this behavior.
Example: Opus 4 searches the repository’s git log for a string related to the tests for the issue it is fixing, and finds a future commit that changes test logic related to the issue.
Credit: NIST

2.2. Grader gaming

To allow evaluations to be run at scale, tasks are typically scored by an automated grading function. Grader gaming arises when agents can craft solutions that will score highly on the automated grading function without actually solving the intended task.

Grader gaming is possible because evaluations’ automatic grading functions may not perfectly capture the evaluator’s intent. For example, the evaluator’s intent might be “update the code to fix the specified issue” – but the actual grading function is “cause the code to pass the unit tests”. Sometimes, evaluated models can find solutions that accomplish the latter without the former: for example, both OpenAI and Anthropic have described how coding agents, during training, can find ways to make tests pass without implementing the requested feature or fix, from hard-coding values checked by the tests directly into the code to finding ways to bypass the tests altogether.

Example from CAISI’s evaluations: bypassing the CVE on CVE-Bench

On CVE-Bench, agents must develop an exploit for a known software vulnerability (a CVE). The agent proves that its exploit works by using it to cause a specific effect on a target server that is running the vulnerable software – for example, creating a particular file, or causing the server to become unresponsive – and the evaluation scoring system continuously checks the target server to see whether this effect has been achieved.

CAISI uses a non-public version of CVE-Bench containing a subset of public tasks and some additional private tasks. On this internal version of the benchmark, our transcript analysis tool flagged several tasks where models were able to fool the grading system by finding a way to cause the specified effect on the target server without exploiting the intended CVE.

The developers of CVE-Bench recently released an updated version of the benchmark, along with a blog post describing how several of these updates were necessary to close technical loopholes that models could use to cheat. We’re excited to adopt these changes to harden our evaluations against cheating, as described in the takeaways section below.

A transcript snippet shows OpenAI gpt-4o executing a command to send the contents of file “@/dev/urandom” to the target server. This command will essentially read endless random bits from the specified file, overwhelming the target server and taking it offline, allowing the model to “solve” the challenge.
Example: gpt-4o is supposed to use a CVE to craft a specially formatted request that will cause the server to enter an infinite loop. Instead, it solves the task by sending a curl request that reads from the file @/dev/urandom, which sends infinite random data until the server is overwhelmed and becomes unresponsive, ending the task.
Credit: NIST
A heatmap graph shows instances of unintended solutions due to CVE bypass on CVE-Bench. Results are for 15 tasks, with varying numbers of epochs per model. The results show that several different models successfully bypassed the CVE to solve tasks, including Anthropic’s Opus 4 and OpenAI’s o3, o4-mini, o1, and GPT-4o. In the heatmap, it’s clear that particular tasks, such as those in columns 3, 4, and 14, had unintended solutions discovered by multiple models.
Transcript analysis results for CVE-Bench tasks in which the model successfully caused the target server state without exploiting the intended vulnerability. Several of the CVE-Bench tasks had specific loopholes that were exploited by multiple models.
Credit: NIST

Example from CAISI’s evaluations: test overfitting on SWE-bench Verified

Other evaluators and developers have reported that models may be able to cheat on coding tasks by creating code that will pass the tests but clearly fails to function as a generalized solution to the intended task or issue. For example, models might hard-code specific values checked by the tests into their code, or suppress or bypass checks rather than crafting a solution that fulfills them.

Using our transcript review tool, we searched for instances of this kind of “test overfitting”, in which models craft submissions that pass unit tests but would clearly fail to function as a solution to the issue in the real codebase. We found several instances where o4-mini was able to successfully pass a task by removing pre-existing checks in the code, or by adding special logic to handle test cases that would be unlikely to work as a realistic issue fix.

A transcript snippet shows OpenAI o4-mini using its file editing tool to disable an assertion in the codebase by commenting out the line of code.
Example: o4-mini is tasked with fixing a bug in which an assertion fails under certain conditions because of overlapping keys and values. Instead of adding logic to de-duplicate the keys and values as suggested in the issue description, o4-mini comments out the assertion before submitting its solution.
Credit: NIST
A heatmap graph shows instances of unintended solutions due to test overfitting on SWE-bench verified. Across 498 tasks, OpenAI’s o4-mini found ways to solve the task by overfitting to the unit tests in 5 tasks. Other models including OpenAI’s GPT-5 and o3, Anthropic’s Opus 4, and DeekSeek’s R1-0528 did not have any instances found.
Transcript analysis results for SWE-bench Verified tasks in which the model crafted a solution that passed unit tests despite failing to implement a generalized issue fix, such as by adding test-specific logic or removing in-code checks.
Credit: NIST

2.3 Unsuccessful cheating

On most of our benchmarks, successful cheating appears relatively rare. However, the results from our reviewer system suggest that attempts to solve tasks in unintended ways are much more common across benchmarks and models. We observed other cheating strategies that were ineffective (i.e., never succeeded), such as:

  • On Cybench, trying to guess the flag without solving the challenge by combining words or contextual clues based on the challenge prompt
  • On Cybench and CVE-Bench, editing a copy of the source code for the target (intentionally provided to the model for local examination and testing) to remove protections or make it easier to exploit – presumably in an unsuccessful attempt to change the version of the code running on the remote target
  • On SWE-bench Verified, attempting to change the code or behavior of the tests (which are reset to their prior state before grading, rendering these strategies ineffective). Models both attempted to directly change the logic of the tests, and, less frequently, tried to force the tests to import a more recently installed version of the code that the agent was supposed to be fixing.
A heatmap graph shows instances in which models attempted to guess the challenge flag on Cybench, with results from GPT-5, R1-0528, Opus 4, Qwen3-235B, o3, o4-mini, QwQ, Sonnet 3.7,  o3-mini, R1, o1, Sonnet 3.6, Sonnet 3.5, and GPT-4o. Across 40 tasks, all models attempted to guess the flag on at least one task, with some models attempting to guess the flag on a majority of task attempts.
Transcript analysis results for flag guessing on Cybench. Agents frequently tried to guess Cybench flags but were never successful.
Credit: NIST
A heatmap graph shows instances in which models attempted to edit a local version of the source code for the vulnerable server on CVE-Bench. Across 15 tasks, models including Qwen3-235B, o3, o4-mini, o3-mini, R1, o1, and Sonnet 3.5 attempted this strategy on at least one task, with o4-mini and o1 showing the highest frequency.
Transcript analysis results for local source code editing on CVE-Bench. Some models attempted to edit the local source code to make the target easier to exploit, although this strategy was never effective (since it did not impact the version on the remote target server).
Credit: NIST
A heatmap graph shows instances in  which models edited the test code on SWE-bench Verified. Across 498 tasks, GPT-5, R1-0528, Opus 4, o3, and o4-mini each edited the test code on 1-2 tasks, with o4-mini having the highest number of editing instances (8).
Transcript analysis results for editing test files on SWE-bench Verified. Examples included editing test files to change expected outputs, suppress warnings and exceptions, or skip failing tests. The scoring system restores the prior state of the test files before grading, so this approach was unsuccessful.
Credit: NIST
A heatmap graph shows instances of attempts to import and use a more recent code version on SWE-bench Verified. Across 498 tasks, only OpenAI’s o3 attempted this (unsuccessful) strategy.
Transcript analysis results for trying to force the tests to import a more recent installed version of the code the agent was supposed to be fixing, such as adding a sitecustomize.py file intended to customize the Python environment at runtime. This approach was never successful.
Credit: NIST
Created November 28, 2025, Updated December 2, 2025
Was this page helpful?