
Building Evaluation Probes into Agentic AI

Overview

AI agents represent the next evolution of language models, capable of planning multi-step tasks and autonomously taking actions, such as using tools and searching databases. Behind agents’ simple interfaces lies a hidden world of complex multi-step workflows. To build confidence that these workflows have executed correctly, users need increased visibility into the chain of reasoning, tool usage, and gathered evidence that led to each agentic decision. As organizations explore deploying agentic systems in critical domains, evaluation tooling that can systematically check an agent’s outputs can help ensure the trustworthy, effective performance of these systems.

To advance this goal, researchers in the AI Research, Measurement, and Standards Division of the Information Technology Laboratory (ITL) AI Program are developing evaluation probes: automated tools integrated directly into an agentic workflow to act as adversarial verifiers. The results of these evaluations accumulate into a machine-readable audit trail that helps users assess agent actions and outputs.
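To make the idea of a machine-readable audit trail concrete, the following is a minimal sketch of what one trail entry might look like. The field names and values are illustrative assumptions, not NIST's actual schema.

```python
import json

# Hypothetical audit-trail entry for a single probe check.
# Field names are illustrative; this is not the actual NIST format.
entry = {
    "claim": "The report cites a 12% error reduction.",
    "source_id": "doc-042#chunk-7",
    "probe": "faithfulness",
    "verdict": "supported",
    "rationale": "Chunk 7 of doc-042 states the 12% figure directly.",
}

# The trail itself is an append-only, machine-readable log kept
# alongside the agent's output, e.g. one JSON object per line.
audit_trail = [entry]
trail_jsonl = "\n".join(json.dumps(e) for e in audit_trail)
```

Serializing each verdict as a self-contained record lets downstream tooling filter, aggregate, or re-verify probe results without re-running the agent.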

To develop this approach, the team is focusing on one foundational evaluation task: building an audit trail to ensure that factual claims made by agents are well grounded. The evaluation probes scrutinize the factual grounding of AI outputs as they are produced, comparing them against a human-curated corpus of reference documents.

The goal is to move beyond “the AI said so” to better understand “here is what the AI found, where it found it, and how the evidence supports the conclusions.” In the long run, this approach could be developed into a platform that provides fully characterized measurements of the quality of agentic AI result grounding.

Objectives

  • Develop automated evaluation probes, rubric-based LM-judge tools that scrutinize agentic AI outputs against trusted document corpora to assess their factual grounding
  • Produce structured audit trails that map agent decisions to the supporting document evidence
  • Establish reproducible evaluation tools that provide a baseline for assessing factual grounding of agent outputs
  • Deliver a documented, extensible evaluation methodology that organizations can test and adapt for their own agentic AI deployments


Approach

To develop and validate the probe methodology, the ITL AI Program built an open-source deep research pipeline to serve as an experimental testbed. The pipeline processes a research question alongside a corpus of authoritative documents; systematically evaluates every document chunk for relevance to the query; synthesizes a fully cited report; and automatically runs evaluation probes on every citation. Results are stored in a structured audit trail alongside the report.
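The stages described above can be sketched as a small Python pipeline. This is a toy illustration under stated assumptions: the relevance scorer, report synthesizer, and probe are stubs, whereas the real testbed uses LM calls for each of these steps.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str

@dataclass(frozen=True)
class Citation:
    claim: str
    chunk: Chunk

def score_relevance(question: str, chunk: Chunk) -> float:
    # Stub: keyword overlap in place of an LM relevance judgment.
    words = set(question.lower().split())
    return 1.0 if words & set(chunk.text.lower().split()) else 0.0

def faithfulness_probe(cit: Citation) -> dict:
    # Stub: substring check in place of a rubric-driven LM judge.
    supported = cit.claim in cit.chunk.text
    return {"probe": "faithfulness", "source": cit.chunk.doc_id,
            "verdict": "supported" if supported else "unsupported"}

def run_pipeline(question, corpus, probes):
    # 1. Evaluate every document chunk for relevance to the query.
    relevant = [c for c in corpus if score_relevance(question, c) > 0.5]
    # 2. Synthesize a fully cited report (here: quote each chunk).
    citations = [Citation(claim=c.text, chunk=c) for c in relevant]
    report = " ".join(f"{c.claim} [{c.chunk.doc_id}]" for c in citations)
    # 3. Run every probe on every citation; accumulate the audit trail.
    trail = [probe(c) for c in citations for probe in probes]
    return report, trail

corpus = [Chunk("doc-1", "HTTPS encrypts traffic in transit."),
          Chunk("doc-2", "Background on unrelated topics.")]
report, trail = run_pipeline("How does HTTPS protect traffic?", corpus,
                             [faithfulness_probe])
```

Even in stub form, the control flow mirrors the described design: every citation in the report is paired with at least one probe verdict in the trail.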

Probes can be executed as part of the active agentic workflow (providing immediate feedback) or applied as post-hoc evaluations. Each probe evaluates against trusted source material using a strict rubric and returns a structured verdict. This verdict includes a rationale detailing exactly how the source supports or fails to support the claim based on the rubric’s criteria.
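One plausible shape for such a structured verdict is sketched below. The verdict levels and field names are assumptions for illustration; the testbed's actual record format may differ.

```python
from dataclasses import dataclass, asdict
from enum import Enum

class Verdict(str, Enum):
    SUPPORTED = "supported"
    PARTIALLY_SUPPORTED = "partially_supported"
    UNSUPPORTED = "unsupported"

@dataclass(frozen=True)
class ProbeResult:
    probe_name: str   # e.g. "faithfulness"
    claim: str        # the agent's factual claim under test
    source_id: str    # where the citation points in the corpus
    verdict: Verdict
    rationale: str    # how the source meets or fails the rubric

def to_record(result: ProbeResult) -> dict:
    # Flatten to a plain dict for the machine-readable audit trail.
    record = asdict(result)
    record["verdict"] = result.verdict.value
    return record
```

Keeping the rationale as a required field, rather than an optional annotation, enforces the goal of explaining how the evidence supports each conclusion.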

The demonstration probes included in the testbed cover several dimensions of citation quality:

  • Faithfulness (anti-hallucination): does the source actually support the claim?
  • Completeness (anti-cherry-picking): does the text capture the source’s full message?
  • Sufficiency (anti-overreaching): does the source carry the evidentiary burden the claim requires?

Because each probe is defined independently by a rubric, new probes can be added to characterize different dimensions of agent evidence attribution without overhauling the underlying infrastructure.
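Because a probe is fully specified by its rubric, extensibility can be as simple as registering new rubric text. The registry and prompt template below are a hypothetical sketch (the rubric wording borrows the three dimensions listed above; the "recency" probe is an invented example).

```python
# Hypothetical rubric registry: each probe is defined entirely by its
# rubric text, so adding a probe is a data change, not a code change.
RUBRICS = {
    "faithfulness": "Does the source actually support the claim?",
    "completeness": "Does the text capture the source's full message?",
    "sufficiency": "Does the source carry the evidentiary burden "
                   "the claim requires?",
}

def build_judge_prompt(probe: str, claim: str, source_text: str) -> str:
    # One prompt template serves every probe; only the rubric varies.
    return (f"Rubric: {RUBRICS[probe]}\n"
            f"Claim: {claim}\n"
            f"Source: {source_text}\n"
            "Return a verdict and a rationale.")

# Extending coverage: register a new rubric, no infrastructure change.
RUBRICS["recency"] = "Is the cited source current enough to support the claim?"
```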

Access the Repository

Invitation for Input

Measuring and evaluating agentic AI is a shared challenge that requires a collaborative solution. NIST is seeking input from AI engineers, technical leads, trust and safety teams, and compliance officers about practical measurement and evaluation challenges and domain-specific use cases.

We are especially interested in:

  • Your current agentic AI measurement, characterization, and evaluation challenges

  • Domains or applications where factual grounding and output traceability are most critical

  • Gaps in current evaluation approaches that this line of work should address

Learn More

ITL AI Webinar Series: Building Measurement Probes into Agentic AI Ecosystems (April 7, 2026)


The NIST Information Technology Laboratory (ITL) AI Program hosted a technical webinar on early research focused on developing automated measurement tools, called probes, to build traceability into agentic AI ecosystems. The approach adapts established techniques such as judges/verifiers, grounds them in a knowledge base, and empowers them to evaluate agentic AI output.

The webinar discussed existing technical gaps in the measurement infrastructure of agentic AI systems. The team outlined initial research into a promising approach for applying concepts such as adversarial verifiers to better evaluate agentic AI outputs.

Created May 1, 2026, Updated May 5, 2026