
Expanding the AI Evaluation Toolbox with Statistical Models

Author(s)

Andrew Keller, Kweku Kwegyir-Aggrey, Ryan Steed, Anita Rao, Julia Sharp, Amanda Bergman

Abstract

Benchmarks are widely used to evaluate and compare the performance of artificial intelligence systems. However, some approaches to computing benchmark metrics produce invalid uncertainty estimates or make unrecognized assumptions about the evaluation setting. We leverage statistical modeling to make two contributions to the practice of AI benchmarking. First, we formally distinguish measurements of benchmark accuracy (performance conditioned on a fixed benchmark) from generalized accuracy (performance on all potential test items similar to those included in the benchmark). Then, in a simulated setting and with large-scale evaluation of 22 API-access frontier large language models on 3 popular benchmarks, we show how analysis via generalized linear mixed models can produce correct estimates of generalized accuracy while more efficiently quantifying uncertainty compared to existing regression-free approaches. We also show how this approach can equip evaluators with important context on evaluation results, including variance decomposition and item difficulty estimates that illuminate important aspects of LLM performance and benchmark construction. Our findings highlight the benefits of explicit statistical modeling for more accurate and reliable AI benchmarking.
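The abstract's distinction between benchmark accuracy (performance conditioned on a fixed set of test items) and generalized accuracy (expected performance over the population of similar items) can be illustrated with a small simulation. The sketch below uses a hypothetical Rasch-style setup (logistic response with normally distributed item difficulties); the ability and variance values are illustrative assumptions, not the paper's actual model or data.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

theta = 1.0    # assumed model "ability" (illustrative value)
sigma = 1.0    # assumed spread of item difficulties
n_items = 500  # size of the fixed benchmark

# Benchmark accuracy: difficulties drawn once, then held fixed,
# mirroring evaluation on a single frozen benchmark.
difficulties = rng.normal(0.0, sigma, size=n_items)
benchmark_accuracy = sigmoid(theta - difficulties).mean()

# Generalized accuracy: expectation over the full item population,
# approximated here by Monte Carlo over fresh difficulty draws.
mc_difficulties = rng.normal(0.0, sigma, size=200_000)
generalized_accuracy = sigmoid(theta - mc_difficulties).mean()

print(f"benchmark accuracy:   {benchmark_accuracy:.3f}")
print(f"generalized accuracy: {generalized_accuracy:.3f}")
```

With many items the two estimates converge, but for a small fixed benchmark they can differ noticeably, and uncertainty intervals computed conditional on the fixed items will understate the variability of generalized accuracy. Treating item difficulty as a random effect, as a generalized linear mixed model does, is one way to target the generalized quantity directly.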
Series
NIST Trustworthy and Responsible AI
Report Number
800-3

Keywords

Artificial intelligence (AI), benchmarks, evaluation, generalized linear mixed models (GLMMs), generative AI, large language models (LLMs), measurement, statistical modeling

Citation

Keller, A., Kwegyir-Aggrey, K., Steed, R., Rao, A., Sharp, J. and Bergman, A. (2026), Expanding the AI Evaluation Toolbox with Statistical Models, NIST Trustworthy and Responsible AI, National Institute of Standards and Technology, Gaithersburg, MD, [online], https://doi.org/10.6028/NIST.AI.800-3, https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=961314 (Accessed February 18, 2026)


Created February 17, 2026