AI Test, Evaluation, Validation and Verification (TEVV)



The development and utility of trustworthy AI products and services depend heavily on reliable measurements and evaluations of the underlying technologies and their use. NIST conducts research and development of metrics, measurements, and evaluation methods in emerging and existing areas of AI; contributes to the development of standards; and promotes the adoption of standards, guides, and best practices for measuring and evaluating AI technologies as they mature and find new applications.

On October 30, 2023, President Biden signed an Executive Order (EO) to build US capacity to measure and manage the risks of AI systems to ensure safety, security, and trust, while promoting an innovative, competitive AI ecosystem that supports workers and protects consumers. Learn more about NIST's responsibilities under the EO and the creation of the US Artificial Intelligence Safety Institute, including the Consortium that is being established.


NIST has a long history of AI measurement and evaluation activities, starting in the late 1960s with the measurement and evaluation of automated fingerprint identification systems. Since then, NIST has designed and conducted hundreds of evaluations of thousands of AI systems. While these activities typically have focused on measures of accuracy and robustness, other types of AI-related measurements and evaluations under investigation include bias, interpretability, and transparency. Working collaboratively with others, NIST aims to expand these efforts, driving AI research and enabling progress by:

  1. Advancing the measurement science for AI: defining, characterizing, and theoretically and empirically developing and analyzing quantitative and qualitative metrics and measurement methods for various characteristics of AI technologies.
  2. Conducting evaluations of AI: designing and conducting evaluations of AI technologies – including developing tasks, challenge problems, testbeds, and software tools, and helping to curate and characterize meaningful data sets – and identifying technical gaps and limitations in AI technologies and related measurements.
  3. Developing technical guidelines and practices: sharing results and guidelines to inform academic, industrial, and government programs.
  4. Contributing to voluntary consensus-based standards for measuring and evaluating AI: leading or participating in standardization efforts to support the development, deployment, and evaluation of AI technologies.

NIST projects are carried out by researchers from a variety of disciplines across the NIST laboratories and frequently in collaboration with industry, other government agencies, and academia. In addition, the new US Artificial Intelligence Safety Institute and Consortium will be a key element of NIST's work on AI measurement and evaluation.

These activities are part of NIST’s efforts to build a strong and active community around the measurement and evaluation of AI technologies, and they complement NIST’s establishment of forums dedicated to the advancement of AI metrology research. Together, these efforts spur collaboration among those who design, develop, deploy, test, and evaluate AI technologies and help to meet the needs of a broad and diverse AI community. Events convened by NIST to strengthen the AI measurement and evaluation community include:

For more information about how to engage with NIST on AI, see: Engage

Current/Future Work

NIST has been engaged in focused efforts to establish common terminologies, definitions, and taxonomies of concepts pertaining to characteristics of AI technologies in order to form the necessary underpinnings for trustworthy AI systems. Those characteristics include accuracy, explainability and interpretability, privacy, reliability, robustness, safety, security (resilience), and mitigation of harmful bias. Each requires its own portfolio of measurements and evaluations, and context is crucial. How a given characteristic is measured and evaluated can change based on the context in which the AI system operates.
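
As a rough illustration of why context matters, the sketch below scores two hypothetical systems on the same small data set. It is not a NIST metric or method: the labels, the two systems, the context names, and the error costs are all invented for this example. Both systems have the same accuracy, yet their ranking flips depending on whether the operating context penalizes missed positives or false alarms more heavily.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Hypothetical operating context: relative cost of each error type."""
    name: str
    false_positive_cost: float
    false_negative_cost: float


def confusion_counts(labels, predictions):
    """Return (tp, fp, fn, tn) for binary labels and predictions (1 = positive)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(labels, predictions):
    """Fraction of examples classified correctly, regardless of context."""
    tp, _, _, tn = confusion_counts(labels, predictions)
    return (tp + tn) / len(labels)


def expected_cost(labels, predictions, context):
    """Average per-example error cost under the given (assumed) context."""
    _, fp, fn, _ = confusion_counts(labels, predictions)
    total = fp * context.false_positive_cost + fn * context.false_negative_cost
    return total / len(labels)


# Invented data: four positives, six negatives.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
system_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # misses two positives
system_b = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # raises two false alarms

# Invented contexts: one where misses are costly, one where false alarms are.
screening = Context("screening", false_positive_cost=1.0, false_negative_cost=10.0)
filtering = Context("filtering", false_positive_cost=10.0, false_negative_cost=1.0)

if __name__ == "__main__":
    for name, preds in (("A", system_a), ("B", system_b)):
        print(
            f"system {name}: accuracy={accuracy(labels, preds):.2f}",
            f"screening cost={expected_cost(labels, preds, screening):.2f}",
            f"filtering cost={expected_cost(labels, preds, filtering):.2f}",
        )
```

A single context-free number such as accuracy cannot separate the two systems here, while a context-weighted measure does. In practice the weights themselves would have to come from the application and its stakeholders.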

For each characteristic, NIST has produced – or aims to document and improve – the definitions, applications, tasks, and strengths and limitations of metrics and measurement methods in use or being proposed. NIST also has developed – or may prepare and curate – meaningful data sets with respect to select attributes of interest, and may apply chosen metrics and measurement methods to various AI systems.
