A NIST blog from the Center for AI Standards and Innovation
Authored by Drew Keller, Ryan Steed, Stevie Bergman, and the Applied Systems Team at CAISI
Building gold-standard AI systems requires gold-standard AI measurement science – the scientific study of methods used to assess AI systems’ properties and impacts. The National Institute of Standards and Technology (NIST) works to improve measurements of AI performance, reliability, and security that American companies and consumers rely on to develop, adopt, and benefit from AI technologies.
Among other groups at NIST, the Center for AI Standards and Innovation (CAISI) works in concert with the larger community of AI practitioners to identify and make progress on open questions that are key to maturing the field of AI measurement science and advancing AI innovation. This post highlights an initial selection of such questions that CAISI has identified through its initiatives to date.
Today, many evaluations of AI systems do not precisely articulate what has been measured, much less whether the measurements are valid. The field of measurement science, or metrology, provides methods for rigorous and trustworthy use of measured values, qualified with assessments of uncertainty.
In line with NIST’s long tradition of research and leadership in measurement science, America’s AI Action Plan charges the agency with leading “the development of the science of measuring and evaluating AI models.” Working within these efforts, CAISI is developing new methods and frameworks to measure the capabilities and risks of leading AI models, in keeping with its mandate to “facilitate testing and collaborative research related to harnessing and securing the potential of commercial AI systems.”
AI measurement science is an emerging field. CAISI’s work to develop robust evaluations of AI systems in domains such as cybersecurity or biosecurity relies on ongoing research into methods for rigorous measurement. Below, we spotlight a non-exhaustive set of challenging AI measurement science questions.
I. What concepts do current AI evaluations measure, and do they generalize to other domains and real-world settings? How can evaluators ensure measurement instruments are reliable and valid?
I-A. Construct validity. Often, claims about the capabilities of AI systems (e.g., mathematical reasoning) do not match the construct actually measured by the benchmark (e.g., accuracy at answering math problems). A critical step in AI evaluation is the assessment of construct validity, or whether a testing procedure accurately measures the intended concept or characteristic. Construct validity is needed to accurately assess whether AI systems meet expectations.
I-B. Generalization. Some AI evaluation results are unjustifiably generalized beyond the test setting. Research is needed to determine the extent to which evaluation results apply to other contexts and predict real-world performance.
I-C. Benchmark design and assessment. Researchers rely heavily on standardized benchmarks to compare performance across models and over time, but there are few guidelines for constructing reliable, rigorous benchmarks.
I-D. Measurement instrument innovation. Researchers are developing new approaches to measuring and monitoring AI system characteristics at scale. Appropriate uses, limitations, and best practices for these emerging methods are not yet clear; one simple reliability check is sketched after this list.
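To make the reliability question concrete, here is a minimal, illustrative Python sketch. It checks how closely a hypothetical automated grader agrees with human reference labels using Cohen’s kappa, a standard chance-corrected agreement statistic. The grader, the labels, and the data are invented for this example, and this is only one of many possible reliability checks rather than a recommended protocol.

```python
import numpy as np

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters' categorical labels."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    categories = np.union1d(labels_a, labels_b)
    # Observed agreement: fraction of items on which the two raters agree.
    p_observed = np.mean(labels_a == labels_b)
    # Agreement expected by chance, from each rater's marginal label rates.
    p_expected = sum(
        np.mean(labels_a == c) * np.mean(labels_b == c) for c in categories
    )
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical verdicts from an automated grader and from human reviewers on
# the same ten model responses (invented values, for illustration only).
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
auto  = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "fail", "pass"]

print(f"Cohen's kappa between grader and humans: {cohen_kappa(human, auto):.2f}")
```

An aggregate agreement statistic is only a starting point: systematic disagreements on particular kinds of items can matter more than the overall number, and agreement alone does not establish that either set of labels captures the intended construct.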
II. How can practitioners appropriately interpret and act on the results of benchmarks, field studies, and other evaluations?
II-A. Uncertainty. All measurements involve some degree of uncertainty. Accurate claims about AI systems require honest and transparent communication of this uncertainty, but some presentations of benchmark results omit error bars and other basic expressions of uncertainty; a simple illustration follows this list.
II-B. Baselines. Many evaluations lack relevant human or other non-AI baselines. Baselines are necessary to accurately interpret results (e.g., comparing an AI diagnostic tool’s accuracy to that of expert physicians).
II-C. Model and benchmark comparison. Benchmark results are helpful for comparing the performance of different AI systems, but many reports ignore uncertainty and other factors needed for accurate comparison. Practitioners need valid methods to rank models, measure AI system improvement over time, resolve conflicting benchmark results, and make other comparisons of evaluation results.
II-D. Reporting. Measurement validity depends not only on the measurement instrument but also on the context of the evaluation and the claims it supports. Often, reports of model performance from developers and researchers lack key details needed to assess validity. Accurate and useful interpretations of evaluation results depend on evaluators reporting sufficient detail.
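As a concrete illustration of II-A and II-C, the minimal Python sketch below computes a percentile-bootstrap confidence interval for one model’s benchmark accuracy, and a paired-bootstrap interval for the accuracy difference between two models scored on the same items. The per-item scores are synthetic and the bootstrap is just one of several reasonable approaches; the point is simply that reported scores and comparisons can carry explicit uncertainty estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-item correctness (1 = correct, 0 = incorrect) for two models
# scored on the same 200 benchmark items; synthetic data for illustration only.
n_items = 200
model_a = rng.binomial(1, 0.72, size=n_items)
model_b = rng.binomial(1, 0.68, size=n_items)

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05):
    """Percentile-bootstrap confidence interval for mean accuracy."""
    idx = rng.integers(0, len(scores), size=(n_resamples, len(scores)))
    means = scores[idx].mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

def paired_diff_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05):
    """CI for the accuracy difference, resampling items with pairs kept intact."""
    idx = rng.integers(0, len(scores_a), size=(n_resamples, len(scores_a)))
    diffs = scores_a[idx].mean(axis=1) - scores_b[idx].mean(axis=1)
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

lo, hi = bootstrap_ci(model_a)
print(f"Model A accuracy: {model_a.mean():.3f} (95% CI {lo:.3f} to {hi:.3f})")

lo, hi = paired_diff_ci(model_a, model_b)
print(f"Accuracy difference (A - B): {model_a.mean() - model_b.mean():.3f} "
      f"(95% CI {lo:.3f} to {hi:.3f})")
```

Because both models are scored on the same items, resampling items as pairs preserves the correlation between their scores, which typically gives a tighter interval for the difference than comparing two independently computed intervals.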
III. What methods enable measurement of AI systems in real-world settings?
III-A. Downstream outcome measurement. Post-deployment evaluations are often neglected, but are necessary to assess AI systems’ performance, risks, and impacts in the real world.
III-B. Stakeholder consultation. Domain stakeholders have valuable subject matter expertise needed to ensure successful AI adoption, but researchers and developers typically conduct AI evaluations with little involvement from the public or other important stakeholders in the success of American AI.
We welcome your engagement as we evaluate how CAISI can best support stakeholders in advancing AI innovation through measurement science. Please feel free to share comments or feedback via email to caisi-metrology [at] nist [dot] gov.