AI Measurement and Evaluation Workshop

Workshop Description 

NIST will hold a virtual workshop on Artificial Intelligence Measurement and Evaluation June 15-17, 2021. The three-day workshop aims to bring together stakeholders and experts to identify the most pressing needs for AI measurement and evaluation and to advance the state of the art and practice.

NIST is assigned responsibility by statute to advance underlying research for measuring and assessing AI technologies. That includes the development of AI data standards and best practices, as well as AI evaluation and testing methodologies and standards. NIST is working collaboratively with the private and public sectors to help prioritize and work on its AI activities. 

 Workshop Goals:

  • Identify:
    1. the needs for and intended uses of AI measurement and evaluation
    2. the gaps in knowledge/practice preventing current AI measurement and evaluation activities from effectively meeting these needs/uses
  • Solicit guidance on the specific areas on which NIST should focus its efforts
  • Identify specific users and applications in need of measurement and evaluation
  • Collate best practices for AI measurement and evaluation
  • Build community around AI measurement and evaluation: provide tools, resources, pointers to other sources, and continued engagement through periodic talks and seminars

Panels and discussions will be organized to provide feedback on topics related to AI measurement and evaluation, and to influence the future direction of NIST efforts in this area.

This workshop will be ideal for: 

  • Researchers who are interested in AI measurement and evaluation.
  • Developers of AI technologies who need to perform evaluation and testing of AI systems.
  • Policymakers and decision makers who need to use the outputs of AI measurements and evaluations.

Workshop Materials 

Stay connected with the latest NIST AIME updates -- sign up for the mailing list by either:

  1. If you already have a Google account associated with your email address, go to and click the “Join group” button
  2. Or, send an email to aime+subscribe[at]list[dot]nist[dot]gov

A new project called Dioptra has been released on GitHub at

Dioptra is a software test bed currently focused on adversarial machine learning and defensive mitigations. It is in pre-release status, but we would like to start collecting community feedback.

Workshop Read-Ahead:  Artificial Intelligence Measurement and Evaluation at the National Institute of Standards and Technology (Draft)

Fact Sheet: NIST AI Program


Workshop Agenda 

Download the detailed agenda (PDF)

Day 1: Tuesday June 15, 2021

All times EDT (UTC-4)


Start Time End Time Topic
11:00 AM  11:20 AM 

Welcome, Workshop Goals & Logistics, Overview

  • Elham Tabassi (Chief of Staff, Information Technology Laboratory, NIST)
11:20 AM  11:50 AM 

Keynote: A National Security Perspective on AI Measurement and Evaluation

  • Jason Matheny (Deputy Assistant to the President for Technology and National Security; Deputy Director for National Security in the White House Office of Science and Technology Policy; and Coordinator for Technology and National Security at the National Security Council)
11:50 AM  12:00 PM  Break
12:00 PM 1:30 PM

Panel 1: Measuring with Purpose

Discussion of the needs for and uses of AI evaluation outputs and their role in driving downstream processes, including the requirements and properties that an AI evaluation must possess to be fit for its intended uses. Identification of areas where current measurement and evaluation approaches are insufficient or do not exist, and where further AI metrology research would be beneficial.


  • Tess DeBlanc-Knowles (White House Office of Science and Technology Policy)


  • Jack Clark (Anthropic)
  • Michael Hind (IBM Research) (slides)
  • Chuck Howell (MITRE)
  • Jane Pinelis (Test and Evaluation of AI/ML at DoD Joint Artificial Intelligence Center)
  • Salvatore Scalzo (European Commission)

  • Bill Scherlis (DARPA)
1:30 PM 1:45 PM  Break and Discussion Time
1:45 PM 2:15 PM

Panel 2: Overview of Past & Current Evaluations

Overview of the evaluation-driven research paradigm that has been used at NIST to evaluate AI systems, with a description of the various styles of evaluations, as well as examples of some of the AI measurement and evaluation activities conducted at NIST.


  • Mark Przybocki (NIST)


  • Peter Bajcsy (NIST)
  • Jonathan Fiscus (NIST)
  • Jonathon Phillips (NIST)
  • Michael Sharp (NIST)
  • Ellen Voorhees (NIST)
  • Megan Zimmerman (NIST)
2:15 PM 2:45 PM

Panel 3: Discussion of NIST/Community Future Work (slides)

Discussion of the limitations of current AI measurement and evaluation activities that prevent them from addressing all the needs for AI measurement and evaluation, and future plans for NIST to address these limitations together with the research community. 

2:45 PM 3:00 PM Break
3:00 PM 4:00 PM

Panel 4: Evaluating AI during Operation

Discussion of AI evaluation in production/operational environments, including topics drawn from: MLOps; operational evaluation metrics and business metrics; model quality and data drift with online data; latency, throughput, and scalability issues; adversarial attacks and robustness to corruptions/perturbations; governance and regulatory compliance.


  • Antonio Moretti (Walmart)


  • Clarence Agbi (Brex)
  • Sergey Karayev (Turnitin)
  • Josh Tobin (Gantry)
4:00 PM 4:15 PM

Closing Remarks

NIST Workshop Organizing Committee

4:15 PM 5:00 PM After Hours: Slack with NIST staff


Day 2: Wednesday June 16, 2021

All times EDT (UTC-4)

Start Time End Time Topic
11:00 AM  11:30 AM


  • Fei-Fei Li (Sequoia Professor, Stanford University; Co-Director of Stanford’s Human-Centered AI Institute)
11:30 AM 12:30 PM 

Panel 5: Evaluation Design Process

Discussion of the processes and procedures for designing evaluations of AI systems, including: the high-level considerations and decisions that must be made in order to design and implement effective evaluations; the components of and relationships between the various evaluation design elements; and the role of the applications and overall evaluation goals in evaluation design.


  • Nicholas Carlini (Google Brain)


  • Matthias Hein (University of Tübingen)
  • Deborah Raji (Mozilla Foundation)
  • Shibani Santurkar (MIT / Stanford)
  • Ludwig Schmidt (Toyota Research / UW)
12:30 PM 12:45 PM Break
12:45 PM 1:45 PM

Panel 6: Metrics and Measurement Methods

Discussion of: the properties of an AI system that can/should be measured, and which properties have/lack metrics and measurement methods; the different measurement methods that are used to measure AI and their strengths/limitations; the different types and uses of metrics, and the various properties that a metric can possess; the impacts that the chosen metrics and measurement methods have on an evaluation; when it is important to have glass box access to AI systems for evaluation; and when the design/approach taken by an AI system influences the choice of metrics/measurement methods.


  • Craig Greenberg (NIST)


  • José Hernández-Orallo (Universitat Politècnica de València) (slides)
  • Douglas Reynolds (NSA / MIT Lincoln Laboratory)
  • Sameer Singh (UCI)
1:45 PM 2:00 PM Break
2:00 PM 3:00 PM

Panel 7: Data and Data Sets

Data collection methods and dataset design for AI system measurement and evaluation, along with discussions drawing from the following topics: approaches for data annotation/labeling; uncertain, missing, or nonexistent ground truth; how much data is necessary; needs for and uses of simulated/generated data; roles of common datasets in research; repurposing of data; ethical and privacy considerations; etc.


  • Aleksander Mądry (MIT)


  • Marzyeh Ghassemi (U Toronto/MIT)
  • Tom Goldstein (UMD)
  • Emre Kiciman (MSR)
  • Nicolas Papernot (U Toronto)
3:00 PM 3:15 PM Break
3:15 PM 4:15 PM

Panel 8: Limitations, Challenges, and Future Directions of Evaluation

Discussion of the limitations, challenges, shortcomings, and future directions for the evaluation and measurement of AI, including new or emerging evaluation paradigms and the ability/inability to generalize evaluation results, along with its policy implications. Needs and plans for improvements to existing measurement and evaluation activities, as well as the creation of new AI evaluation challenge problems and measurement research.


  • Soheil Feizi (UMD)


  • Kamalika Chaudhuri (UCSD)
  • Eric Horvitz (MSR)
  • Percy Liang (Stanford)
  • Chris Meserole (Brookings)
  • Daniela Rus (MIT)
4:15 PM 4:30 PM Break and Slack Discussion Time
4:30 PM 5:00 PM

Closing Remarks 

NIST Workshop Organizing Committee

5:00 PM 5:30 PM After Hours: Slack with NIST staff


Day 3: Thursday June 17, 2021

All times EDT (UTC-4)

Start Time End Time Topic
11:00 AM  11:30 AM

Keynote: AI Test and Evaluation from the National AI Initiative Perspective

  • Lynne Parker (Director, National AI Initiative Office, White House Office of Science and Technology Policy)
11:30 AM 12:30 PM

Panel 9: Measuring Concepts that are Complex, Contextual, and Abstract

Discussion of the challenges and approaches for measuring AI system characteristics that are complex, contextual, and/or abstract, or are otherwise difficult to quantify (such as explainability, bias, trustworthiness, safety, etc.) including the role that descriptive and/or qualitative measurements should play in these cases. 


  • Ellen Voorhees (NIST)


  • Lora Aroyo (Google)
  • Ben Carterette (Spotify)
  • David Ferrucci (Elemental Cognition)
12:30 PM 12:45 PM Break
12:45 PM 1:45 PM

Panel 10: Measuring with Humans in the Mix

Discussion of the measurement and evaluation of AI systems that work in cooperation with humans, including the roles and relationships between the AI systems and the humans, and the challenges of and approaches to measurement and evaluation when humans and AI systems are involved.


  • Margaret Burnett (OSU)


  • Rachel Bellamy (IBM)
  • Madeleine Clare Elish (Google)
  • Robert Hoffman (IHMC)
1:45 PM 2:00 PM Break
2:00 PM 3:00 PM

Panel 11: Software Infrastructure Overview, Existing Tools and Future Desires

Discussion of the landscape, challenges, and needs of developing tools and infrastructure for the particular purpose of measuring, testing, and evaluating AI systems.


  • Harold Booth (NIST)



  • Pin-Yu Chen (IBM) (slides)
  • Harsha Nori (Microsoft)
  • David Pitman (Google)
3:00 PM 3:15 PM Break and Discussion Time
3:15 PM 4:15 PM

Panel 12: Practical Considerations and Best Practices for Measurement and Evaluation

Discussion of the practical considerations and concrete best practices for the measurement and evaluation of AI-based systems, including the testing and evaluation strategies that can be used to mitigate privacy loss or intellectual property exposure in AI testing.


  • William Streilein (MIT Lincoln Laboratory)


  • Matt Gaston (SEI Emerging Technology Center, CMU)
  • Sven Krasser (CrowdStrike)
  • Sanjeev Mohindra (MIT Lincoln Laboratory)
  • Jane Pinelis (Test and Evaluation of AI/ML at DoD Joint Artificial Intelligence Center)
4:15 PM 4:30 PM Break
4:30 PM 5:00 PM

NIST: Workshop Debrief and Next Steps

NIST Workshop Organizing Committee


Created May 4, 2021, Updated August 25, 2022