A NIST blog from the Center for AI Standards and Innovation
In December, CAISI published a write-up on how AI models can cheat on agentic evaluations, including lessons from our experience building and using AI-enabled transcript analysis tools to find and fix examples of cheating from our evaluations.
In that post, we highlighted the potential of AI-enabled transcript analysis tools to help evaluators scale their capacity to detect measurement issues in evaluations — particularly as they evaluate agentic AI systems that can work on tasks for longer periods of time. We emphasized the need for continued collaboration on shared practices and tooling to help the evaluation community adopt, scale and improve transcript review practices.
Recently, we contributed several of the practices and takeaways we identified in our research to a new joint research paper with the UK AI Security Institute and other AI evaluators. The paper outlines a multi-step process for building and using transcript review tools, from preparing log data to designing and validating a scanner in an iterative loop. At each step, it provides concrete examples and implementation considerations, based on experiences and takeaways aggregated from evaluators’ different transcript analysis projects and use cases.
The paper also includes implementation case studies using a new open-source transcript analysis framework, Inspect Scout, built by the UK AISI working closely with Meridian Labs. We’ve been able to collaborate with the developers to inform the design of features based on our own use cases, and are excited to see the development of more technical frameworks and tools that can help enable the wider adoption of transcript analysis by the AI evaluation community.
We’re excited to share these collaboratively developed examples and practices to aid other evaluators, and to continue our work to contribute to frameworks, tools, and practices that can help advance more rigorous, valid, and impactful AI measurement science.