2024 NIST GenAI (Pilot Study): Text-to-Text Evaluation Overview and Results

Hariharan Iyer; Seungmin Seo; Lukas Diduch; Kay Peterson; George Awad; Yooyoung Lee

Published

June 25, 2025

Author(s)

Hariharan Iyer, Seungmin Seo, Lukas Diduch, Kay Peterson, George Awad, Yooyoung Lee

Abstract

The 2024 NIST Generative AI (GenAI) Pilot Study evaluates text-to-text (T2T) generation and discrimination tasks to assess the capabilities and limitations of generative AI models and AI detectors. The study measures how effectively AI-generated text mimics human writing and how well AI-based discriminators distinguish between human- and AI-generated content. A curated dataset of article groups and associated human- and machine-generated summaries served as the benchmark, with performance assessed using statistical and machine learning-based metrics, including AUC (Area Under the Curve) and Brier scores. The results indicate that while AI-generated summaries increasingly resemble human writing, detection models remain reasonably effective in distinguishing between them. Performance varies significantly across systems: some generators could deceive most discriminators, and some discriminators could detect AI-generated content from almost all generators. Both generator and discriminator systems have room for improvement. Discriminator systems also improved over the multiple rounds of testing. Future work will focus on refining evaluation methodologies, expanding multi-modal assessments across text, image, and audio domains, and developing standardized benchmarking protocols. These efforts aim to provide a robust test and evaluation framework for generative AI and AI detector technologies, guiding both researchers and policymakers in understanding their evolving impact.
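For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of their standard textbook definitions. It is not the report's scoring code. It assumes that AUC means the area under the ROC curve, that each discriminator outputs a probability that a summary is AI-generated, and that labels are coded 1 for AI-generated and 0 for human-written. The function names and the sample data are hypothetical.

```python
def brier_score(labels, probs):
    """Mean squared difference between predicted probability and the 0/1 label.

    Lower is better; 0.0 means perfectly confident, perfectly correct predictions.
    """
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def roc_auc(labels, probs):
    """Area under the ROC curve, computed from pairwise comparisons.

    Equals the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half a win.
    """
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical detector outputs (assumed coding: 1 = AI-generated, 0 = human-written).
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]

print(f"AUC:   {roc_auc(labels, probs):.3f}")
print(f"Brier: {brier_score(labels, probs):.3f}")
```

AUC depends only on how the detector ranks the summaries, while the Brier score also penalizes poorly calibrated probabilities, so the two can disagree about which detector is better.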

Citation

NIST Trustworthy and Responsible AI - NIST AI 700-1

Report Number

NIST AI 700-1

Pub Type

NIST Pubs

Download Paper

https://doi.org/10.6028/NIST.AI.700-1

Keywords

Artificial Intelligence (AI), Generative AI, Discriminative AI, Deepfakes, Large Language Model (LLM), Forensics, Evaluation, Measurement, Provenance, Authenticity, Detection, Accuracy, and Robustness

Forensic Science, Artificial intelligence and AI measurement and evaluation

Citation

Iyer, H., Seo, S., Diduch, L., Peterson, K., Awad, G. and Lee, Y. (2025), 2024 NIST GenAI (Pilot Study): Text-to-Text Evaluation Overview and Results, NIST Trustworthy and Responsible AI, National Institute of Standards and Technology, Gaithersburg, MD, [online], https://doi.org/10.6028/NIST.AI.700-1, https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=959809 (Accessed March 18, 2026)

Issues

If you have any questions about this publication or are having problems accessing it, please contact [email protected].

Created June 25, 2025
