
1. Background: AI models can cheat on evaluations?

AI researchers have long known that AI systems can find and exploit loopholes in the tasks we give them. When models are trained using reinforcement learning (RL), AI developers must carefully design training tasks to prevent models from converging on unintended solutions that earn a high reward, a problem known as “reward hacking”. As developers of frontier large language models (LLMs) have increasingly turned to RL to train their models to solve complex tasks in areas like software development, they have begun to encounter reward hacking in practice, and have described efforts to detect and prevent it during training on coding tasks and to evaluate and reduce it in released models.
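
To make the loophole concrete, consider a minimal, hypothetical sketch (not drawn from any particular training pipeline): if the reward for a coding task only checks that a test suite exits successfully, and the policy can write anywhere in its workspace, then editing or deleting the tests earns full reward without solving the task.

```python
import subprocess

def reward(workspace: str) -> float:
    """Hypothetical RL reward for a coding task: 1.0 if the test suite passes.

    Because the policy can modify anything in `workspace`, including
    test_solution.py itself, replacing the tests with `assert True`
    maxes out this reward without fixing the underlying bug -- a
    textbook reward hack.
    """
    result = subprocess.run(
        ["pytest", "-q", "test_solution.py"],
        cwd=workspace,
        capture_output=True,
    )
    return 1.0 if result.returncode == 0 else 0.0
```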

As evaluators assess AI models that are increasingly capable problem-solvers – and that may have had chances to learn reward hacking strategies during training – they increasingly need to think about the problem of task loopholes too. AI evaluations combine benchmarks – sets of tasks designed to correspond to real-world problems – with automatic grading functions that assess models’ performance at scale. For an evaluation to measure what it is supposed to, its task implementations and scoring functions must capture the evaluator’s intent and resist gaming or subversion by the models being evaluated. This challenge is particularly acute in agent evaluations, where models often have access to flexible and powerful tools such as the ability to write and execute code or to access the internet. Code execution gives an agent a highly flexible way to interact with its environment – and therefore opens up many new opportunities to find unintended ways to solve evaluation tasks.
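
As one illustration of what resisting subversion can look like – a counterpart to the naive reward sketch above, again with hypothetical file names and layout – a grader can score against a pristine copy of the tests held outside the agent’s sandbox and restored only after the agent has finished, rather than trusting whatever tests remain in the workspace.

```python
import shutil
import subprocess
from pathlib import Path

# Reference tests held outside the agent's sandbox, where the agent
# cannot read or edit them (path is illustrative).
PRISTINE_TESTS = Path("/graders/task_042/test_solution.py")

def grade(workspace: Path) -> bool:
    """Score the agent's work against a known-good copy of the tests.

    Restoring the reference tests after the agent finishes closes the
    most obvious loophole (editing the tests in place), though it does
    not rule out subtler exploits such as hard-coding expected outputs.
    """
    shutil.copy(PRISTINE_TESTS, workspace / "test_solution.py")
    result = subprocess.run(
        ["pytest", "-q", "test_solution.py"],
        cwd=workspace,
        capture_output=True,
    )
    return result.returncode == 0
```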

Other evaluators have already discovered models cheating during evaluations. In January, researchers at Model Evaluation & Threat Research (METR) observed models “attempting (often successfully) to get a higher score by modifying the tests or scoring code, gaining access to an existing implementation or answer that’s used to check their work, or exploiting other loopholes in the task environment” during evaluations of software development and AI R&D capabilities. Scale AI, another third-party evaluator, caught models using internet search tools to look up answers to questions on the benchmarks they were solving, and found that blocking access to Hugging Face (where many of these benchmarks were hosted) decreased models’ performance by about 15%. Users of SWE-bench Verified – a popular benchmark that tests AI agents on their ability to fix bugs in real codebases – recently discovered agents obtaining information about the future state of the codebase by searching the repository’s git history. Researchers from Carnegie Mellon and Anthropic built “impossible” versions of common benchmarks to measure cheating, and found that some leading LLMs resorted to cheating in a majority of cases – and that more capable models generally had higher cheating rates.
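
The git-history loophole is worth spelling out, because it shows how easily an unintended information channel can survive in an agent’s environment: if the task repository still contains commits made after the buggy snapshot, an agent with shell access can read the human-written fix directly from a later commit. The sketch below shows one possible mitigation – re-initializing the repository at the task’s base commit so that no later history is reachable. It is a hedged illustration, not the procedure used by SWE-bench or any particular evaluator; the function name and commit handling are assumptions.

```python
import shutil
import subprocess
from pathlib import Path

def strip_future_history(repo: Path, base_commit: str) -> None:
    """Leave the working tree at `base_commit` with no later history reachable.

    In an unsanitized checkout, an agent with shell access could run
    `git log --all` or inspect other branches and read the future fix.
    Re-initializing the repository discards every other ref and reflog,
    keeping only a single snapshot commit.
    """
    subprocess.run(["git", "checkout", "--detach", base_commit], cwd=repo, check=True)
    shutil.rmtree(repo / ".git")
    subprocess.run(["git", "init"], cwd=repo, check=True)
    subprocess.run(["git", "add", "-A"], cwd=repo, check=True)
    subprocess.run(
        ["git", "-c", "user.name=grader", "-c", "user.email=grader@example.com",
         "commit", "-m", "task snapshot"],
        cwd=repo,
        check=True,
    )
```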

In the context of evaluations, we don’t define cheating solely as cases where models violate explicit rules provided in task prompts, nor do we try to analyze whether a model did or should have understood that a particular solution violated the implicit expectations of the task. (Models’ willingness to violate the spirit, if not the letter, of user instructions is a separate and important issue with implications for real-world use.) Instead, when it comes to measurement validity, what matters is whether the evaluator’s intent was violated, not whether the model understood that it was cheating. Unintended solutions on evaluation tasks are a problem because they can mean that an evaluation is not measuring what we think it is. This undermines “external validity”, the question of whether the results of an experiment or measurement will generalize to the broader context outside of that study. The possibility of solving tasks through loopholes can also undermine the fairness of comparisons between models, by effectively penalizing models that adhere more closely to the intent of the instructions.
