A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look
Author(s)
Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, Hoa Dang, Jimmy Lin
Abstract
The application of large language models to provide relevance assessments presents exciting opportunities to advance IR, NLP, and beyond, but to date many unknowns remain. In this paper, we report on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed in situ: the "standard" fully manual process that NIST has implemented for decades and three different alternatives that take advantage of LLMs to different extents using the open-source UMBRELA tool. This setup allows us to correlate system rankings induced by the different approaches to characterize tradeoffs between cost and quality. We find that in terms of nDCG@20, nDCG@100, and Recall@100, system rankings induced by automatically generated relevance assessments from UMBRELA correlate highly with those induced by fully manual assessments across a diverse set of 77 runs from 19 teams. Our results suggest that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness. Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring (obvious) tangible benefits. Overall, human assessors appear to be more strict than UMBRELA in applying relevance criteria. Our work validates the use of LLMs in academic TREC-style evaluations and provides the foundation for future studies.
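The comparison the abstract describes reduces to a simple procedure: score every run under each set of relevance judgments, rank the runs by the resulting metric (e.g., nDCG@20), and measure how strongly the two rankings agree. The sketch below illustrates that idea with a Kendall's tau rank-correlation computation; it is not the authors' code, and the run and qrel data structures are hypothetical placeholders used only for illustration.

```python
# Minimal sketch (assumed data layout, not the TREC/UMBRELA implementation):
# compare system rankings induced by two qrel sets, e.g. fully manual NIST
# judgments vs. automatically generated UMBRELA judgments.
import math
from scipy.stats import kendalltau

def dcg_at_k(gains, k=20):
    """Discounted cumulative gain over the top-k gains of a ranked list."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_doc_ids, qrel, k=20):
    """nDCG@k for one topic; qrel maps doc_id -> graded relevance."""
    gains = [qrel.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrel.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def run_score(run, qrels, k=20):
    """Mean nDCG@k of one run (topic -> ranked doc ids) over judged topics."""
    topics = [t for t in run if t in qrels]
    return sum(ndcg_at_k(run[t], qrels[t], k) for t in topics) / len(topics)

def ranking_correlation(runs, qrels_manual, qrels_auto, k=20):
    """Kendall's tau between system rankings induced by the two qrel sets."""
    manual_scores = [run_score(r, qrels_manual, k) for r in runs]
    auto_scores = [run_score(r, qrels_auto, k) for r in runs]
    tau, p_value = kendalltau(manual_scores, auto_scores)
    return tau, p_value
```

A high tau here would indicate, as the paper reports for its 77 runs, that the cheaper automatic judgments rank systems in nearly the same order as the fully manual ones.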
Upadhyay, S., Pradeep, R., Thakur, N., Campos, D., Craswell, N., Soboroff, I., Dang, H. and Lin, J. (2025), A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look, arxiv.org, [online], https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=959054, https://arxiv.org (Accessed October 8, 2025)