Towards Best Practices for Automated Benchmark Evaluations

Public Comments Sought on Initial Public Draft of NIST AI 800-2 through March 31

Consistent practices to support the validity, transparency, and reproducibility of AI evaluations are only beginning to emerge. To further the development and voluntary adoption of such practices, the Center for AI Standards and Innovation (CAISI) at the National Institute of Standards and Technology (NIST) is requesting public comment on a draft document, NIST AI 800-2 Practices for Automated Evaluations of Language Models, through March 31, 2026.

Background: Identifying Best Practices in Automated Benchmarking

NIST AI 800-2 documents preliminary best practices for evaluating language models and AI agent systems. The document addresses a particular type of evaluation: automated benchmark evaluations.

The primary audience for NIST AI 800-2 is technical staff at organizations evaluating AI systems, including AI deployers, developers, and third-party evaluators. However, all potential consumers of AI evaluation reports can benefit from robust, well-communicated evaluation practices that advance gold-standard science and inform AI procurement and implementation decisions. While such evaluations cannot meet all AI evaluation objectives, they are a common measurement instrument that may be particularly useful when organizations face constraints on time, expertise, or resources. CAISI anticipates producing additional voluntary guidelines for further types of evaluations in the future.

The practices in this initial public draft reflect CAISI’s experience partnering with leading U.S. AI organizations to evaluate frontier AI models, as well as ongoing measurement science research at NIST and beyond. The draft organizes practices into three sections and a glossary: (1) defining evaluation objectives and selecting benchmarks, (2) implementing and running evaluations, and (3) analyzing and reporting results.
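
For readers unfamiliar with this type of evaluation, the sketch below illustrates the draft's three-part structure in minimal Python. It is purely illustrative and not taken from NIST AI 800-2: the benchmark format, the query_model stub, and the exact-match metric are hypothetical placeholders, not practices prescribed by the document.

    # Minimal illustrative sketch of an automated benchmark evaluation,
    # following the three-part structure: (1) select a benchmark,
    # (2) implement and run the evaluation, (3) analyze and report results.
    # All names and data here are hypothetical placeholders.

    import json
    import statistics


    def load_benchmark(path: str) -> list[dict]:
        """(1) Load a benchmark: each JSON line pairs a prompt with a reference answer."""
        with open(path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]


    def query_model(prompt: str) -> str:
        """Stand-in for a call to the language model or agent system under evaluation."""
        return ""  # Replace this stub with a real model or API call.


    def run_evaluation(items: list[dict]) -> list[int]:
        """(2) Run the evaluation: 1 if the response exactly matches the reference, else 0."""
        return [
            int(query_model(item["prompt"]).strip() == item["reference"].strip())
            for item in items
        ]


    def report(scores: list[int]) -> None:
        """(3) Analyze and report: aggregate scores and state the sample size."""
        print(f"items evaluated: {len(scores)}")
        print(f"mean score:      {statistics.mean(scores):.3f}")


    if __name__ == "__main__":
        # Tiny in-memory example so the sketch runs without an external file.
        sample = [{"prompt": "2 + 2 =", "reference": "4"}]
        report(run_evaluation(sample))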

Requests for Input

CAISI invites input on any aspect of this draft document, particularly:

  • The usefulness and relative importance of the included practices and principles (in general or for specific use cases, types of evaluations, or audiences);
  • Important practices that are within the document’s scope but missing from the draft;
  • Which practices are emerging vs. existing best practices;
  • Any content that is incorrect, unclear, or otherwise problematic;
  • Other common terms in benchmark evaluation that would be useful to define in the glossary, or for the field to use in a standardized manner;
  • When automated benchmark evaluations are more or less useful relative to other evaluation paradigms.

A 60-day comment period is now open, closing March 31, 2026. Feedback can be emailed to AI800-2@nist.gov in any form, including markup of the draft or a bulleted list of comments. CAISI does not plan to publish this feedback, but all emails, including attachments and other supporting materials, may be subject to public disclosure.

CAISI encourages all stakeholders to provide input, including organizations with experience conducting AI evaluations as well as users of AI evaluation reports – for instance, business decision-makers, procurement specialists, and technical integrators.

The draft document is available on the NIST website.

Released January 30, 2026