5. Conclusion

Cheating on agent benchmarks is already beginning to pose a challenge for evaluators, and it may become a larger threat to evaluation integrity over time if more capable models also become better at finding new ways to cheat. The broader evaluation community can prepare by developing practices and methods for finding and preventing cheating, and by standardizing these approaches at the benchmark and ecosystem level.

As evaluators increasingly seek ways to answer questions at scale about how evaluation tasks were solved (not just whether they were), AI-based transcript analysis may become a more valuable tool for evaluation science. Continuing to develop tools, frameworks, and best practices that support the adoption and ensure the reliability of transcript review systems will be an important area of work for the ecosystem. There may also be opportunities for valuable cross-pollination with other areas of work, such as the development of AI systems that can monitor or supervise other deployed AI systems for security and reliability.

To advance these important areas, we encourage other evaluators to adopt transcript analysis tools and look for cheating; to share what they find in order to improve benchmarks and measurements; and to continue to openly discuss practices, observations, and lessons learned to foster the development of a stronger and fairer evaluation ecosystem.

Questions? Reach out to Maia Hamin and Benjamin Edelman.

Created November 28, 2025, Updated December 2, 2025