Publication Details

Perspectives on Large Language Models for Relevance Judgment

authored by
Guglielmo Faggioli, Charles L.A. Clarke, Gianluca Demartini, Matthias Hagen, Claudia Hauff, Noriko Kando, Evangelos Kanoulas, Martin Potthast, Benno Stein, Henning Wachsmuth, Laura Dietz
Abstract

When asked, large language models (LLMs) like ChatGPT claim that they can assist with relevance judgments, but it is not clear whether automated judgments can reliably be used in evaluations of retrieval systems. In this perspectives paper, we discuss possible ways for LLMs to support relevance judgments, along with concerns and issues that arise. We devise a human–machine collaboration spectrum that allows us to categorize different relevance judgment strategies based on how much humans rely on machines. For the extreme point of 'fully automated judgments', we further include a pilot experiment on whether LLM-based relevance judgments correlate with judgments from trained human assessors. We conclude the paper by providing opposing perspectives for and against the use of LLMs for automatic relevance judgments, and a compromise perspective, informed by our analyses of the literature, our preliminary experimental evidence, and our experience as IR researchers.

Organisation(s)
Leibniz University Hannover
External Organisation(s)
University of Padova
University of Waterloo
University of Queensland
Friedrich Schiller University Jena
Spotify
Research Organization of Information and Systems National Institute of Informatics
University of Amsterdam
Leipzig University
Bauhaus-Universität Weimar
University of New Hampshire
Type
Conference contribution
Pages
39-50
No. of pages
12
Publication date
09.08.2023
Publication status
Published
Peer reviewed
Yes
ASJC Scopus subject areas
Computer Science (miscellaneous), Information Systems
Electronic version(s)
https://doi.org/10.48550/arXiv.2304.09161 (Access: Open)
https://doi.org/10.1145/3578337.3605136 (Access: Open)