February 17, 2026
Authors: Andrew Keller, Kweku Kwegyir-Aggrey, Ryan Steed, Anita Rao, Julia Sharp, Amanda Bergman
Benchmarks are widely used to evaluate and compare the performance of artificial intelligence systems. However, some approaches to computing benchmark metrics produce invalid uncertainty estimates or make unrecognized assumptions about the evaluation