As artificial intelligence (AI) systems are increasingly integrated into commercial and government applications, there is a growing demand to monitor these systems in real-world settings. While the concept of monitoring digital systems for quality assurance is not new, particularly in cybersecurity and software continuous monitoring, the practice remains vast and fragmented in the AI sector. Because AI systems have novel properties that introduce variability and manifest in unpredictable ways, post-deployment monitoring – from incident monitoring to field studies – is a crucial practice for confident, widespread AI adoption.
To address this pressing need, in 2025 the Center for AI Standards and Innovation (CAISI) held three practitioner workshops and conducted an in-depth literature review to map the landscape, focusing on current challenges to robust and effective post-deployment monitoring of AI systems.
Our findings are outlined in the new report, NIST AI 800-4: Challenges to the Monitoring of Deployed AI Systems, in which we identify monitoring categories and detail challenges (gaps, barriers, and open questions) to inform and spur future research in the field. The primary contribution of this report is the identification, organization, and documentation of monitoring challenges, along with the reporting of views expressed by experts in the field.
Six common categories of monitoring, developed via thematic coding, are listed in the table below. See Appendix B of the report for the full methodology, and Appendix C for the associated codebook.
| Monitoring Category | Definition |
| --- | --- |
| Functionality Monitoring | Measuring system functions, capabilities, and features to ensure the system works as intended |
| Operational Monitoring | Measuring system infrastructure components, for example to ensure the system maintains consistent levels of service |
| Human Factors Monitoring | Measuring human-system interactions, for example to ensure the system produces high-quality outputs and is transparent |
| Security Monitoring | Measuring where the system is potentially vulnerable to adversarial attacks and misuse |
| Compliance Monitoring | Measuring system components for adherence to relevant laws, regulations, standards, controls, and guidelines |
| Large-Scale Impacts Monitoring | Measuring system properties that have wide downstream impacts, for example to ensure the system promotes human flourishing |
To manageably synthesize the many challenges reported by practitioners and subject matter experts, we organized the database of workshop quotes and literature excerpts in two ways: (1) by monitoring category, since some challenges apply more to one category than another (e.g., the overhead of collecting and gauging user feedback is more relevant to human factors than to security), and (2) by challenges shared across categories (e.g., poor incident-sharing mechanisms). Finally, we sorted open questions on AI system monitoring into "who", "what", "when", "why", and "how" to monitor.
The table below highlights a sampling of post-deployment monitoring challenges. See the report for the full list.
| | Highlighted Gaps, Barriers, and Open Questions |
| --- | --- |
| Category-Specific Challenges | Gaps: Barriers: |
| Cross-Cutting Challenges | Gaps: Barriers: |
| Open Questions | |
The identified gaps, barriers, and open questions highlight impactful opportunities for further investigation and innovation. The monitoring categories can offer a common language for describing sub-fields within AI system monitoring, and the identified challenges point to areas where additional solutions are needed.
We welcome your engagement as we evaluate how best to support stakeholders in post-deployment monitoring of AI systems. You can share comments via email to caisi-metrology [at] nist [dot] gov.