CAISI Research Blog

A NIST blog from the Center for AI Standards and Innovation

Accelerating AI Innovation Through Measurement Science

Authored by Drew Keller, Ryan Steed, Stevie Bergman, and the Applied Systems Team at CAISI

Building gold-standard AI systems requires gold-standard AI measurement science – the scientific study of methods used to assess AI systems’ properties and impacts. The National Institute of Standards and Technology (NIST) works to improve measurements of AI performance, reliability, and security that American companies and consumers rely on to develop, adopt, and benefit from AI technologies.

Among other groups at NIST, the Center for AI Standards and Innovation (CAISI) works in concert with the larger community of AI practitioners to identify and make progress on open questions that are key to maturing the field of AI measurement science and advancing AI innovation. This post highlights an initial selection of such questions that CAISI has identified through its initiatives to date.

The need for improved AI measurement science

Today, many evaluations of AI systems do not precisely articulate what has been measured, much less whether the measurements are valid. The field of measurement science, or metrology, provides methods for rigorous and trustworthy use of measured values, qualified with assessments of uncertainty.

In line with NIST’s long tradition of research and leadership in measurement science, America’s AI Action Plan charges the agency with leading “the development of the science of measuring and evaluating AI models.” As part of NIST’s efforts, CAISI is developing new methods and frameworks to measure the capabilities and risks of leading AI models under its mandate to “facilitate testing and collaborative research related to harnessing and securing the potential of commercial AI systems.”

A selection of open AI measurement science questions

AI measurement science is an emerging field. CAISI’s work to develop robust evaluations of AI systems in domains such as cybersecurity or biosecurity relies on ongoing research into methods for rigorous measurement. Below, we spotlight a non-exhaustive selection of challenging AI measurement science questions.

I. Ensuring measurement validity. 

What concepts do current AI evaluations measure, and do they generalize to other domains and real-world settings? How can evaluators ensure measurement instruments are reliable and valid?

I-A. Construct validity. Often, claims about the capabilities (e.g., mathematical reasoning) of AI systems don’t match the construct actually measured by the benchmark (e.g., accuracy at answering math problems). A critical step in AI evaluation is the assessment of construct validity, or whether a testing procedure accurately measures the intended concept or characteristic. Construct validity is needed to accurately assess whether AI systems meet expectations.

  1. How can evaluation designers and practitioners establish construct validity?
  2. How should evaluators select appropriate measurement targets, metrics, and experimental designs for an evaluation goal? For example, how can the concept of AI-driven productivity be systematized and measured in different domains?
  3. What evaluation approaches can distinguish best- or worst-case performance from average-case reliability?

I-B. Generalization. Some AI evaluation results are unjustifiably generalized beyond the test setting. Research is needed to determine the extent to which evaluation results apply to other contexts and predict real-world performance.

  1. To what extent do domain- or task-specific evaluations generalize beyond a specific use case or between domains?
  2. How well do measures of general-purpose problem solving predict downstream performance on specific tasks?  
  3. How well do pre-deployment evaluations predict post-deployment functionality, risk, and impacts? What evaluation design practices enhance real-world informativeness?

I-C. Benchmark design and assessment. Researchers rely heavily on standardized benchmarks to compare performance across models and over time, but there are few guidelines for constructing reliable, rigorous benchmarks.

  1. What are valid and reliable methods to create and grade AI benchmarks? How can evaluators assess the validity and quality of existing AI benchmarks using publicly available information?
  2. To what degree are benchmark results sensitive to prompt selection and task design?
  3. How can evaluators determine the degree of train-test overlap affecting an evaluation? How does public release of benchmarks enhance or constrain future testing? (A simple overlap check is sketched below.)
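
The train-test overlap question above lends itself to rough but reproducible checks. The sketch below is a minimal, hypothetical example of one common heuristic: flagging benchmark items that share long word n-grams with a sample of training text. The corpus, items, and n-gram length are assumptions chosen for illustration, not a CAISI-recommended procedure.

```python
# Minimal sketch: flag benchmark items whose word n-grams also appear in a
# sample of training text. The n-gram length (n=8) and the inputs are
# illustrative assumptions only.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_overlap(benchmark_items: list[str], training_docs: list[str], n: int = 8) -> list[int]:
    """Return indices of benchmark items sharing at least one n-gram with the training sample."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(benchmark_items)
            if ngrams(item, n) & train_grams]

# Hypothetical usage:
# contaminated = flag_overlap(test_questions, crawled_training_sample)
# print(f"{len(contaminated)} of {len(test_questions)} items overlap with the training sample")
```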

I-D. Measurement instrument innovation. Researchers are developing new approaches to measuring and monitoring AI system characteristics at scale. Appropriate uses, limitations, and best practices for these emerging methods are not yet clear.

  1. How and when can information from model reasoning and hidden activations be used to conduct more reliable measurements than system outputs alone?
  2. How and when should AI systems be used to test, evaluate, or monitor AI systems? What are best practices for reliable use and validation of LLM-as-a-judge? (One possible validation check is sketched below.)
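
For the LLM-as-a-judge question, one basic validation step is to compare the judge’s verdicts against human expert labels on the same sample of responses and report both raw agreement and a chance-corrected statistic. The sketch below uses hypothetical binary labels and computes Cohen’s kappa by hand; it is a simple sanity check under stated assumptions, not a complete validation protocol.

```python
# Minimal sketch: compare an LLM judge's pass/fail verdicts against human
# expert labels on the same responses. The labels below are hypothetical.
from collections import Counter

def agreement_and_kappa(judge: list[int], human: list[int]) -> tuple[float, float]:
    """Raw agreement and Cohen's kappa for two binary (0/1) label lists."""
    assert len(judge) == len(human) and judge
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement estimated from each rater's marginal label frequencies.
    pj, ph = Counter(judge), Counter(human)
    expected = sum((pj[c] / n) * (ph[c] / n) for c in (0, 1))
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa

# Hypothetical verdicts on ten graded responses:
llm_judge = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
human_rater = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
obs, kappa = agreement_and_kappa(llm_judge, human_rater)
print(f"agreement={obs:.2f}, kappa={kappa:.2f}")
```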

II. Interpreting results and claims. 

How can practitioners appropriately interpret and act on the results of benchmarks, field studies, and other evaluations?

II-A. Uncertainty. All measurements involve some degree of uncertainty. Accurate claims about AI systems require honest and transparent communication of this uncertainty, but some presentations of benchmark results omit error bars and other basic expressions of uncertainty.

  1. How should evaluators identify, quantify, and communicate sources of uncertainty in an evaluation setup?
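
As one concrete illustration of the error bars discussed above, the sketch below computes a percentile bootstrap confidence interval for a benchmark accuracy score, treating the test items as a sample from a larger population of possible items. The per-item scores are hypothetical, and this captures only item-sampling uncertainty; other sources (prompt selection, decoding randomness, grading error) would need separate treatment.

```python
# Minimal sketch: a percentile bootstrap confidence interval over per-item
# scores, capturing only the uncertainty from the choice of test items.
import random

def bootstrap_ci(scores: list[float], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-item scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-item pass/fail results for one model on a 200-item benchmark:
item_scores = [1.0] * 143 + [0.0] * 57
mean = sum(item_scores) / len(item_scores)
low, high = bootstrap_ci(item_scores)
print(f"accuracy = {mean:.3f}, 95% CI approx. [{low:.3f}, {high:.3f}]")
```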

II-B. Baselines. Many evaluations lack relevant human or other non-AI baselines. Baselines are necessary to accurately interpret results (e.g., comparing the accuracy of an AI diagnostic tool to that of expert physicians).

  1. What are appropriate baselines for interpreting AI system performance compared to non-AI alternatives and to AI-assisted human performance?

II-C. Model and benchmark comparison. Benchmark results are helpful for comparing the performance of different AI systems, but many reports ignore uncertainty and other factors needed for accurate comparison. Practitioners need valid methods to rank models, measure AI system improvement over time, resolve conflicting benchmark results, and make other comparisons of evaluation results.

  1. How can practitioners compare or combine the results of different AI evaluations?
  2. How should testing procedures for comparing multiple models differ from those for testing a single model?
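
When two models are scored on the same benchmark items, a paired analysis of per-item differences is generally more informative than comparing headline averages. The sketch below shows one minimal, hypothetical approach, a sign-flip permutation test on per-item score differences; it is illustrative only and does not address multiple comparisons or aggregation across benchmarks.

```python
# Minimal sketch: a paired sign-flip permutation test on per-item score
# differences for two models evaluated on the same benchmark items.
# The scores are hypothetical; the point is the paired design.
import random

def paired_permutation_pvalue(scores_a: list[float], scores_b: list[float],
                              n_permutations: int = 10_000, seed: int = 0) -> float:
    """Two-sided p-value for the mean per-item difference between models A and B."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(n_permutations):
        # Under the null of no model effect, each per-item difference can flip sign.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) / len(diffs) >= observed:
            hits += 1
    return hits / n_permutations

# Hypothetical per-item scores for two models on the same 20 items:
model_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]
model_b = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1]
print(f"p approx. {paired_permutation_pvalue(model_a, model_b):.3f}")
```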

II-D. Reporting. Measurement validity depends not only on the measurement instrument but also on the context of the evaluation and the claims it supports. Often, reports of model performance from developers and researchers lack key details needed to assess validity. Accurate and useful interpretations of evaluation results depend on evaluators reporting sufficient detail.

  1. What are best practices or standards for reporting the results of AI evaluations?
  2. What information should AI benchmark developers include when publishing benchmarks in order to support their sound usage in evaluations?

III. Taking measurements in the field. 

What methods enable measurement of AI systems in real-world settings?

III-A. Downstream outcome measurement. Post-deployment evaluations are often neglected, but are necessary to assess AI systems’ performance, risks, and impacts in the real world.

  1. What are reproducible experimental methods to measure AI-driven changes in downstream outcomes over time? For example, can AI assist or uplift human ability to carry out physical tasks such as realistic biological laboratory research?
  2. What are methods to measure the causal effect of safeguards and other interventions on downstream outcomes?
  3. What are methods to measure, categorize, and track components of complex, multi-turn human-AI interaction workflows? (A minimal event-logging sketch follows this list.)
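
Measuring multi-turn workflows presupposes some structured record of what happened in each interaction. The sketch below shows one hypothetical minimal event schema for logging turns so they can later be categorized and aggregated; the field names and action categories are assumptions for illustration, not a proposed standard.

```python
# Minimal sketch: one possible structured record for logging multi-turn
# human-AI interaction events so they can later be categorized and measured.
# Field names and action categories are hypothetical.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class InteractionEvent:
    session_id: str       # groups all turns in one workflow
    turn: int             # position within the session
    actor: str            # "human" or "ai"
    action: str           # e.g., "prompt", "response", "tool_call", "edit"
    content_summary: str  # redacted or summarized content, not raw text
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

# Hypothetical three-turn workflow:
log = [
    InteractionEvent("s-001", 1, "human", "prompt", "asks for code review"),
    InteractionEvent("s-001", 2, "ai", "response", "suggests two fixes"),
    InteractionEvent("s-001", 3, "human", "edit", "accepts one fix, rejects one"),
]
for event in log:
    print(asdict(event))
# Downstream analysis can aggregate events, e.g., count human edit events:
print("human edit events:", sum(1 for e in log if e.actor == "human" and e.action == "edit"))
```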

III-B. Stakeholder consultation. Domain stakeholders have valuable subject matter expertise needed to ensure successful AI adoption, but researchers and developers typically conduct AI evaluations with little involvement from the public or other important stakeholders in the success of American AI.

  1. What are best practices for including stakeholders in the assessment process, including end users, subject matter experts, and the public?

We welcome your engagement as we evaluate how CAISI can best support stakeholders in advancing AI innovation through measurement science. Please feel free to share comments or feedback via email to caisi-metrology [at] nist.gov.
