AI agents represent the next evolution of language models, capable of planning multi-step tasks and autonomously taking actions, such as using tools and searching databases. Behind agents’ simple interfaces lies a hidden world of complex multi-step workflows. To build confidence that these workflows have executed correctly, users need increased visibility into the chain of reasoning, tool usage, and gathered evidence that led to each agentic decision. As organizations explore deploying agentic systems in critical domains, evaluation tooling that can systematically check an agent’s outputs can help ensure the trustworthy, effective performance of these systems.
To advance this goal, researchers in the AI Research, Measurement, and Standards Division of the Information Technology Laboratory (ITL) AI Program are developing evaluation probes, automated tools that are integrated directly into an agentic workflow to act as adversarial verifiers. The results of these evaluations are accumulated into a machine-readable audit trail that helps users assess agent actions and outputs.
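For illustration, a single probe result recorded in such a trail might look like the minimal sketch below; the field names and the JSON Lines file format are assumptions for this example, not the schema used by the testbed.

```python
import json
from datetime import datetime, timezone

# Hypothetical audit-trail entry for one probe run; the field names and the
# JSON Lines file format are assumptions for illustration.
audit_entry = {
    "claim": "System X reduced the error rate on benchmark Y by 12 points.",
    "citation": {"source_id": "doc-042", "chunk_id": 7},
    "probe": "grounding_check",
    "verdict": "supported",
    "rationale": "The cited chunk reports the same 12-point reduction on benchmark Y.",
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

# Appending one JSON object per probe run yields a machine-readable trail
# that can be reviewed alongside the agent's final report.
with open("audit_trail.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(audit_entry) + "\n")
```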
To develop this approach, the team is focusing on one foundational evaluation task: building an audit trail to ensure that factual claims made by agents are well grounded. Our evaluation probes scrutinize the factual grounding of AI outputs as they are produced, comparing them against a human-curated corpus of reference documents.
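As a rough sketch of that comparison step, the example below resolves a citation in an agent's output to the exact passage it points at in the reference corpus, so the claim and its source can be handed to a verifier; the corpus layout and citation fields are illustrative assumptions.

```python
# Minimal sketch: resolve a citation in an agent's output to the passage it
# points at in a human-curated reference corpus. The corpus layout and the
# citation fields are illustrative assumptions.
reference_corpus = {
    "doc-042": {
        7: "On benchmark Y, system X reduced the error rate by 12 points.",
    },
}

def resolve_citation(claim: str, source_id: str, chunk_id: int) -> dict:
    """Pair a cited claim with the exact source text it cites, ready to be
    handed to a grounding probe for comparison."""
    passage = reference_corpus.get(source_id, {}).get(chunk_id)
    return {"claim": claim, "source_id": source_id,
            "chunk_id": chunk_id, "source_text": passage}

pair = resolve_citation(
    "System X reduced the error rate on benchmark Y by 12 points.", "doc-042", 7
)
```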
The goal is to move beyond “the AI said so” to better understand “here is what the AI found, where it found it, and how the evidence supports the conclusions.” In the long run, this approach could be developed into a platform that provides fully characterized measurements of how well agentic AI results are grounded.
To develop and validate the probe methodology, the ITL AI Program built an open-source deep research pipeline to serve as an experimental testbed. The pipeline processes a research question alongside a corpus of authoritative documents; systematically evaluates every document chunk for relevance to the query; synthesizes a fully cited report; and automatically runs evaluation probes on every citation. Results are stored in a structured audit trail alongside the report.
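A condensed skeleton of such a pipeline is sketched below; the stage names, thresholds, and placeholder helpers are assumptions that mirror the description above rather than the testbed's actual code.

```python
# Illustrative skeleton of a deep research pipeline with built-in probes.
# All names, signatures, and placeholder logic are assumptions based on the
# description above, not the testbed's actual interfaces.

def score_relevance(question: str, chunk: dict) -> float:
    # Placeholder: a real pipeline would use a retrieval model or LLM here.
    words = question.lower().split()
    return 1.0 if any(w in chunk["text"].lower() for w in words) else 0.0

def synthesize_cited_report(question: str, chunks: list[dict]) -> list[dict]:
    # Placeholder: a real pipeline would draft prose; here each claim simply
    # carries the id of the chunk it cites.
    return [{"claim": c["text"], "citation": c["id"]} for c in chunks]

def run_pipeline(question: str, corpus: list[dict], probes: list) -> dict:
    # 1. Evaluate every document chunk for relevance to the research question.
    relevant = [c for c in corpus if score_relevance(question, c) > 0.5]
    # 2. Synthesize a report in which every claim carries a citation.
    report = synthesize_cited_report(question, relevant)
    # 3. Run every probe on every citation and accumulate the verdicts.
    audit_trail = [probe(item["claim"], item["citation"], corpus)
                   for item in report for probe in probes]
    # 4. Store the structured audit trail alongside the report.
    return {"report": report, "audit_trail": audit_trail}
```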
Probes can be executed as part of the active agentic workflow (providing immediate feedback) or applied as post-hoc evaluations. Each probe evaluates a claim against trusted source material using a strict rubric and returns a structured verdict. This verdict includes a rationale detailing exactly how the source supports or fails to support the claim based on the rubric's criteria.
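One way such a verdict could be represented is sketched below, with a trivial stand-in for the judgment step; the fields, rubric wording, and probe logic are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class ProbeVerdict:
    """Structured verdict returned by a probe; fields are illustrative."""
    probe: str       # which probe produced the verdict
    passed: bool     # whether the claim met the rubric's criteria
    rationale: str   # how the source supports or fails to support the claim
    rubric: str      # the criteria the claim was judged against

GROUNDING_RUBRIC = ("The claim passes only if every factual statement in it "
                    "is explicitly supported by the cited source text.")

def grounding_probe(claim: str, source_text: str) -> ProbeVerdict:
    # Placeholder judgment: a real probe would ask an LLM judge to apply the
    # rubric; a trivial substring check stands in for that step here.
    supported = claim.lower() in source_text.lower()
    rationale = ("The cited text contains the claim verbatim."
                 if supported
                 else "The cited text does not state the claim.")
    return ProbeVerdict("grounding_check", supported, rationale, GROUNDING_RUBRIC)
```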
The demonstration probes included in the testbed cover several dimensions of citation quality.
Because each probe is defined independently by a rubric, new probes can be added to characterize different dimensions of agent evidence attribution without overhauling the underlying infrastructure.
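Under that design, adding a probe amounts to supplying a new rubric and registering it; the registry, probe names, and rubric text below are illustrative assumptions, not the testbed's actual probe set.

```python
# Illustrative sketch: probes are defined by their rubrics and added to a
# registry that the pipeline iterates over; the names and rubric text are
# assumptions, not the testbed's actual probe set.
PROBE_RUBRICS = {
    "grounding_check": "Every factual statement in the claim must be supported by the cited source.",
    "quote_accuracy": "Any quoted text must match the cited source word for word.",
}

def register_probe(name: str, rubric: str) -> None:
    """Add a new evaluation dimension without changing the pipeline."""
    PROBE_RUBRICS[name] = rubric

# Example: adding a probe that checks whether the cited source is on topic.
register_probe(
    "citation_relevance",
    "The cited source must be topically relevant to the claim it supports.",
)
```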
Measuring and evaluating agentic AI is a shared challenge that requires a collaborative solution. NIST is seeking input from AI engineers, technical leads, trust and safety teams, and compliance officers about practical measurement and evaluation challenges and domain-specific use cases.
We are especially interested in:
Your current agentic AI measurement, characterization, and evaluation challenges
Domains or applications where factual grounding and output traceability are most critical
Gaps in current evaluation approaches that this line of work should address
The NIST Information Technology Laboratory (ITL) AI Program hosted a technical webinar on early research focused on developing automated measurement tools, called probes, to build traceability into agentic AI ecosystems. The approach adapts established techniques such as judges and verifiers, grounds them in a knowledge base, and empowers them to evaluate agentic AI outputs.
The webinar discussed existing technical gaps in the measurement infrastructure of agentic AI systems. The team outlined initial research into a promising approach for applying concepts such as adversarial verifiers to better evaluate agentic AI outputs.