
AI Test, Evaluation, Validation and Verification (TEVV)

Overview

Summary

The development and utility of trustworthy AI products and services depends heavily on reliable measurements and evaluations of underlying technologies and their use. NIST conducts research and development of metrics, measurements, and evaluation methods in emerging and existing areas of AI; contributes to the development of standards; and promotes the adoption of standards, guides, and best practices for measuring and evaluating AI technologies as they mature and find new applications.

On October 30, 2023, President Biden signed an Executive Order (EO) to build US capacity to measure and manage the risks of AI systems to ensure safety, security, and trust, while promoting an innovative, competitive AI ecosystem that supports workers and protects consumers. Learn more about NIST's responsibilities under the EO and the creation of the US Artificial Intelligence Safety Institute, including the Consortium that is being established.

Description 

NIST has a long history of AI measurement and evaluation activities, starting in the late 1960s with the measurement and evaluation of automated fingerprint identification systems. Since then, NIST has designed and conducted hundreds of evaluations of thousands of AI systems. While these activities typically have focused on measures of accuracy and robustness, other types of AI-related measurements and evaluations under investigation include bias, interpretability, and transparency. Working collaboratively with others, NIST aims to expand these efforts, driving AI research and enabling progress by:

  1. Advancing the measurement science for AI: defining, characterizing, and theoretically and empirically developing and analyzing quantitative and qualitative metrics and measurement methods for various characteristics of AI technologies.
  2. Conducting evaluations of AI: designing and conducting evaluations of AI technologies (including developing tasks, challenge problems, testbeds, and software tools, and helping to curate and characterize meaningful data sets) and identifying technical gaps and limitations in AI technologies and related measurements; a minimal sketch of such an evaluation appears after this list.
  3. Developing technical guidelines and practices: sharing results and guidelines to inform academic, industrial, and government programs.
  4. Contributing to voluntary consensus-based standards for measuring and evaluating AI: leading or participating in standardization efforts to support the development, deployment, and evaluation of AI technologies.
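
The following is a minimal, hypothetical sketch of the kind of evaluation described in item 2: a curated data set of input/reference pairs, a system under test, and a scoring function. The names (evaluate, exact_match, the toy sentiment task) are illustrative assumptions and do not correspond to any NIST tool, benchmark, or data set.

```python
# Hypothetical evaluation harness sketch: score a system under test on a
# curated data set with a chosen metric. Illustrative only.
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (input text, reference answer)

def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 when the prediction matches the reference (case-insensitive)."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0

def evaluate(system: Callable[[str], str],
             dataset: Sequence[Example],
             metric: Callable[[str, str], float] = exact_match) -> float:
    """Run the system on every example and return the mean metric score."""
    scores = [metric(system(text), reference) for text, reference in dataset]
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    # Toy task and keyword-based system, for illustration only.
    dataset = [("great product", "positive"), ("terrible service", "negative")]
    keyword_system = lambda text: "positive" if "great" in text else "negative"
    print(f"accuracy: {evaluate(keyword_system, dataset):.2f}")
```

In a real evaluation, the data set, task definition, and metric would each be designed and documented for the characteristic being measured.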

NIST projects are carried out by researchers from a variety of disciplines across the NIST laboratories and frequently in collaboration with industry, other government agencies, and academia. In addition, the new US Artificial Intelligence Safety Institute and Consortium will be a key element of NIST's work on AI measurement and evaluation.

These activities are part of NIST’s efforts to build a strong and active community around the measurement and evaluation of AI technologies, and they complement NIST’s establishment of forums dedicated to the advancement of AI metrology research. This spurs collaboration among those who design, develop, deploy, test, and evaluate AI technologies and helps to meet the needs of a broad and diverse AI community. NIST also convenes events to strengthen the AI measurement and evaluation community.

For more information about how to engage with NIST on AI, see the Engage page.

Current/Future Work

NIST has been engaged in focused efforts to establish common terminologies, definitions, and taxonomies of concepts pertaining to characteristics of AI technologies, forming the necessary underpinnings for trustworthy AI systems. Those characteristics include accuracy, explainability and interpretability, privacy, reliability, robustness, safety, security (resilience), and mitigation of harmful bias. Each requires its own portfolio of measurements and evaluations, and context is crucial: how a given characteristic is measured and evaluated can change based on the context in which the AI system operates.
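
As an illustration of why different characteristics call for different measurements, the hedged sketch below computes two measures over the same set of predictions: ordinary accuracy and a simple demographic-parity gap, one common proxy measure related to harmful bias. The metric choices, names (accuracy, demographic_parity_gap), and toy records are assumptions made for illustration, not a NIST-specified methodology; in practice the appropriate measure depends on the deployment context.

```python
# Illustrative sketch (not a NIST methodology): two trustworthiness
# characteristics measured on the same predictions require different
# calculations, and acceptable values depend on operational context.
from collections import defaultdict
from typing import List, Tuple

Record = Tuple[int, int, str]  # (predicted label, true label, group attribute)

def accuracy(records: List[Record]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(pred == truth for pred, truth, _ in records) / len(records)

def demographic_parity_gap(records: List[Record]) -> float:
    """Largest difference in positive-prediction rates between any two groups."""
    by_group = defaultdict(list)
    for pred, _, group in records:
        by_group[group].append(pred)
    rates = [sum(preds) / len(preds) for preds in by_group.values()]
    return max(rates) - min(rates)

if __name__ == "__main__":
    # Toy records (prediction, truth, group); data is made up for illustration.
    records = [(1, 1, "A"), (0, 0, "A"), (1, 0, "B"), (0, 1, "B"), (1, 1, "B")]
    print(f"accuracy: {accuracy(records):.2f}")
    print(f"demographic parity gap: {demographic_parity_gap(records):.2f}")
```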

For each characteristic, NIST has produced, or aims to document and improve, the definitions, applications, tasks, and strengths and limitations of metrics and measurement methods in use or being proposed. NIST also has developed, or may prepare and curate, meaningful data sets for select attributes of interest, and applies chosen metrics and measurement methods to various AI systems.

A selection of related projects is displayed here.