The number of benchmarks and models that CAISI has evaluated would have made it infeasible to rely on manual review of every transcript to search for evaluation cheating, so we employed AI-based transcript review to aid our search.
Using Inspect, the open-source framework that CAISI uses to run evaluations, we built a transcript analysis system that uses LLM reviewers to score an evaluation transcript for cheating. This system provides reviewer models with a prompt that combines:
Reviewer models respond with a JSON object containing any applicable scores for the transcript, providing a confidence from 1 - 10, justification, and relevant message numbers for each. Scores from multiple reviewers are then aggregated to provide a final sample score – for example, the results above are reported for detections with an average confidence score greater than or equal to 5.