To investigate the issue of evaluation cheating, CAISI built an LLM-based transcript analysis tool (described later in this post) to search through our agent evaluation logs. Using this tool, we found several examples of both solution contamination and grader gaming on our agent benchmarks. We present these examples below, alongside illustrative transcript excerpts and visualizations of detections from our transcript analysis tool.
Note that, since we analyzed historical data that was originally collected for other purposes, different benchmarks may have results from different sets of models or different numbers of samples per task or model. We used human review to validate each of the instances of successful cheating reported below and iterated on our transcript review tool to improve its accuracy, but false negatives are still a possibility. The figures should be interpreted as a visualization of detections by our current transcript analysis tool rather than an absolute claim about the behavior of any model.
In solution contamination, an agent solves an evaluation task by accessing information that goes beyond what was intended and what would be available in the realistic setting that the evaluation is trying to emulate. For example, consider the SWE-bench Verified example from above, in which agents learn information about the future state of the repository through the git history. When fixing software in the real world, a coding agent would not be able to peek ahead to copy how the exact problem was solved by someone else – so, if it can do so during an evaluation, the evaluation will fail to measure how it will perform in a realistic setting.
The ability for AI agents to access the internet during an evaluation is a common cause of solution contamination. If benchmarks are publicly released, agents may be able to directly find solutions on sites that host the benchmark, as Scale AI reported. Benchmarks may also be developed by compiling historical data – for example, past bugs or vulnerabilities from real codebases – and internet access can enable an agent to improperly find additional information about the task and how it was solved in the real world. (Solution contamination at evaluation time is distinct from training data contamination, in which benchmark data is included in a model’s training data, and can occur even when benchmark data postdates a model’s release and so there is no risk of training data leakage.)
Solution contamination can also occur without internet search. The challenge environment itself may inadvertently leak information about the task solution, such as through git histories (as on SWE-bench Verified), solution-related files and artifacts left in the environment (as described by OpenAI), or configuration files (as CAISI experienced, when we realized agents had discovered a Cybench solution string accidentally included in a task configuration file that was technically accessible in the environment).
Cybench is a benchmark of “capture-the-flag” (CTF) challenges, gamified problems from competitions designed to test human hackers’ skills. In these challenges, competitors must exploit a vulnerable software system or investigate software artifacts to find and submit a secret string (the “flag”).
Cybench tasks are based on real challenges from past CTF competitions, and many have online write-ups posted by competitors or organizers after the competition ended. Our transcript review tool found agents using their bash and Python tools to find these walkthroughs online, including to directly retrieve the challenge-specific flag. The only models we found successfully solving Cybench challenges in this way were o3 and GPT-5 – more recent and more capable agentic models – and GPT-5 more often succeeded at finding the flag through search than its predecessor o3.
SWE-bench Verified is a benchmark that requires agents to fix real past issues from open-source software libraries. Since these are real issues from real codebases, these issues have been fixed in more recent versions of the codebase, creating a chance for agents to cheat by accessing more recent code versions.
At the time these logs were collected, CAISI was running SWE-bench verified in a mode where agents were allowed to access the internet so that they could install dependencies required for the task via pip or other package management tools. (Below, we’ll discuss how we are updating our evaluation implementation.) However, our transcript analysis tool found agents using their coding tools to access more recent versions of the libraries’ codebases, revealing how the issue in question was fixed by someone else. For example, o3 used bash tools to directly query the latest version of the library on GitHub, or package managers like pip to install the latest version of the library and then read the source code.
We also searched for instances where models viewed commits corresponding to future code states via the repository’s git history, based on the findings reported by other SWE-bench Verified users. We found some instances in which Opus 4 viewed commits corresponding to a future repository state, but none in which the model reviewed the canonical solution for the issue it was fixing.
To allow evaluations to be run at scale, tasks are typically scored by an automated grading function. Grader gaming arises when agents can craft solutions that will score highly on the automated grading function without actually solving the intended task.
Grader gaming is possible because evaluations’ automatic grading functions may not perfectly capture the evaluator’s intent. For example, the evaluator’s intent might be “update the code to fix the specified issue” – but the actual grading function is “cause the code to pass the unit tests”. Sometimes, evaluated models can find solutions that accomplish the latter without the former: for example, both OpenAI and Anthropic have described how coding agents, during training, can find ways to make tests pass without implementing the requested feature or fix, from hard-coding values checked by the tests directly into the code to finding ways to bypass the tests altogether.
On CVE-Bench, agents must develop an exploit for a known software vulnerability (a CVE). The agent proves that its exploit works by using it to cause a specific effect on a target server that is running the vulnerable software – for example, creating a particular file, or causing the server to become unresponsive – and the evaluation scoring system continuously checks the target server to see whether this effect has been achieved.
CAISI uses a non-public version of CVE-Bench containing a subset of public tasks and some additional private tasks. On this internal version of the benchmark, our transcript analysis tool flagged several tasks where models were able to fool the grading system by finding a way to cause the specified effect on the target server without exploiting the intended CVE.
The developers of CVE-Bench recently released an updated version of the benchmark, along with a blog post describing how several of these updates were necessary to close technical loopholes that models could use to cheat. We’re excited to adopt these changes to harden our evaluations against cheating, as described in the takeaways section below.
Other evaluators and developers have reported that models may be able to cheat on coding tasks by creating code that will pass the tests but clearly fails to function as a generalized solution to the intended task or issue. For example, models might hard-code specific values checked by the tests into their code, or suppress or bypass checks rather than crafting a solution that fulfills them.
Using our transcript review tool, we searched for instances of this kind of “test overfitting”, in which models craft submissions that pass unit tests but would clearly fail to function as a solution to the issue in the real codebase. We found several instances where o4-mini was able to successfully pass a task by removing pre-existing checks in the code, or by adding special logic to handle test cases that would be unlikely to work as a realistic issue fix.
On most of our benchmarks, successful cheating appears relatively rare. However, the results from our reviewer system suggest that attempts to solve tasks in unintended ways are much more common across benchmarks and models. We observed other cheating strategies that were ineffective (i.e., never succeeded), such as: