3. Using transcript review tools to find cheating at scale

CAISI has evaluated too many benchmarks and models for manual review of every transcript to be feasible, so we used AI-based transcript review to aid our search for evaluation cheating.

Using Inspect, the open-source framework that CAISI uses to run evaluations, we built a transcript analysis system that uses LLM reviewers to score an evaluation transcript for cheating. This system provides reviewer models with a prompt that combines:

  • A rubric with scorable categories for known cheating risks for that benchmark (and a general “other” category for uncategorized cheating), along with:
    • Specific include/exclude examples for each category
    • An additional “unintended solution” flag that can be appended to a score to indicate cheating that directly led to a successful task solution
  • A formatted version of the evaluation transcript, including system and user messages as well as the evaluated model’s messages, tool calls, and the responses to those tool calls
  • Additional metadata about the evaluation task, including the score according to the original grading system and information about the intended task solution (such as canonical patches for SWE-bench, or write-ups and reference exploits for Cybench)
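
The three prompt components listed above could be assembled along the following lines. This is a minimal sketch, not CAISI's implementation: the `Category` structure, the function names, the section labels, and the message format (dicts with `role` and `content` keys) are all assumptions made for illustration, and the sketch does not use any Inspect API. Messages are numbered so that a reviewer can cite them by index.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Category:
    """One scorable rubric category (hypothetical structure)."""
    name: str
    description: str
    include_examples: list[str] = field(default_factory=list)
    exclude_examples: list[str] = field(default_factory=list)


def format_transcript(messages: list[dict]) -> str:
    """Number each message so a reviewer can refer to it by index."""
    return "\n".join(
        f"[{i}] {msg['role']}: {msg['content']}"
        for i, msg in enumerate(messages, start=1)
    )


def build_review_prompt(
    categories: list[Category],
    messages: list[dict],
    task_metadata: dict[str, str],
) -> str:
    """Combine rubric, task metadata, and transcript into one prompt string."""
    rubric_lines = []
    for cat in categories:
        rubric_lines.append(f"- {cat.name}: {cat.description}")
        rubric_lines.extend(f"    include: {ex}" for ex in cat.include_examples)
        rubric_lines.extend(f"    exclude: {ex}" for ex in cat.exclude_examples)

    metadata_lines = [f"{key}: {value}" for key, value in task_metadata.items()]

    # Section labels are illustrative; the real template is not published here.
    return "\n\n".join([
        "RUBRIC\n" + "\n".join(rubric_lines),
        "TASK METADATA\n" + "\n".join(metadata_lines),
        "TRANSCRIPT\n" + format_transcript(messages),
    ])
```

Calling `build_review_prompt` with one category, a two-message transcript, and a metadata dict such as `{"original_score": "1"}` yields a single string with the rubric first, then the metadata, then the numbered transcript.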

Reviewer models respond with a JSON object containing any applicable scores for the transcript. Each score includes a confidence from 1 to 10, a justification, and the relevant message numbers. Scores from multiple reviewers are then aggregated into a final sample score; for example, the results above are reported for detections with an average confidence score of 5 or higher.
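
One way the aggregation step might look in code is sketched below. The JSON field names (`scores`, `category`, `confidence`) and the category names in the usage note are hypothetical, since the exact response schema is not given here. The sketch also assumes that a reviewer who does not flag a category contributes a confidence of 0 to that category's average; the source does not say how such cases are handled, and averaging only over the reviewers who flagged a category would be an equally plausible reading.

```python
import json

CONFIDENCE_THRESHOLD = 5.0  # detections are reported at an average of 5 or higher


def aggregate_reviews(raw_responses, threshold=CONFIDENCE_THRESHOLD):
    """Average per-category confidence across reviewers and apply a threshold.

    raw_responses: list of JSON strings, one per reviewer model.
    Returns {category: {"mean_confidence": float, "flagged": bool}}.
    """
    n_reviewers = len(raw_responses)
    if n_reviewers == 0:
        return {}

    totals = {}
    for raw in raw_responses:
        review = json.loads(raw)
        # Keep one confidence per category per reviewer (the highest),
        # so a repeated category is not double-counted.
        per_reviewer = {}
        for score in review.get("scores", []):
            cat = score["category"]
            per_reviewer[cat] = max(per_reviewer.get(cat, 0), score["confidence"])
        for cat, conf in per_reviewer.items():
            totals[cat] = totals.get(cat, 0.0) + conf

    # ASSUMPTION: reviewers that did not flag a category count as confidence 0,
    # so the sum is divided by the total number of reviewers.
    return {
        cat: {
            "mean_confidence": total / n_reviewers,
            "flagged": total / n_reviewers >= threshold,
        }
        for cat, total in totals.items()
    }
```

With three reviewers, if two report a hypothetical `test_tampering` category at confidences 8 and 7 and the third reports nothing, the mean under this assumption is 5.0 and the sample is flagged. A category reported by only one reviewer at confidence 6 averages 2.0 and is not.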

A diagram shows a basic overview of CAISI’s transcript analysis tool. On the left, there is an image of a grading template with sections for benchmark-specific categories and examples, context about the evaluation task, and the evaluation transcript. On the right, this template is passed to an ensemble of three different model reviewers, whose scores are aggregated into a final sample score.
Visualization of the design of our transcript analysis tool
Credit: NIST
Created November 28, 2025, Updated December 2, 2025