by Jan Deriu and Mark Cieliebak (Zurich University of Applied Sciences)
With all the recent hype around Large Language Models and ChatGPT in particular, one crucial question is still unanswered: how do we evaluate generated text, and how can this be automated? In this SNF project, we develop a theoretical framework to answer these questions.
Figure 1: Illustration created with Dall-E3 with the following prompt: “Image of a robot evaluating two chatbots, and a human overseeing the process”.
We are witnessing the rapid advancement of Large Language Models (LLMs) that can tackle many natural language processing tasks which seemed hardly possible only a few years ago. For instance, question-driven summarisation of long meeting transcripts seemed out of scope before ChatGPT. Yet despite the rapid progress and adoption of these text generation models in both public and professional domains, the academic community still struggles to establish consistent, reliable methods for evaluating their quality.
The dilemma extends across the spectrum of evaluation methods. On one end, human-based evaluations, while offering nuanced insights, are notoriously time-consuming, expensive, and prone to inconsistent results due to low inter-rater agreement. On the other end, automated metrics promise more efficient improvement cycles for LLMs by providing timely feedback at low cost. Yet these too have their pitfalls: untrained metrics, e.g. word-overlap measures such as BLEU, often fail to correlate strongly with human judgements despite their simplicity and widespread use. Trained metrics, which are trained to emulate human ratings, show better alignment with human assessments, but they are fragile: they require domain-specific retraining and are highly sensitive to training data, parameters, and adversarial manipulations.
The result is that human evaluations are highly uncertain due to small sample sizes (i.e. they have a high variance), while automated evaluations produce more stable results that do not agree well with human ratings (i.e. they have a low variance but a high bias). The consequence is that evaluation results based on automated metrics alone cannot be trusted.
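This bias-variance trade-off can be made concrete with a small simulation. The sketch below is purely illustrative: the true hallucination rate, the metric's bias, and the sample sizes are invented numbers, not results from our studies.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rate = 0.30          # hypothetical true hallucination rate of a system

# Human evaluation: unbiased labels, but only a small sample (high variance).
def human_estimate(n=50):
    labels = rng.random(n) < true_rate
    return labels.mean()

# Automated metric: many labels, but systematically off (low variance, high bias).
# Here we assume the metric over-flags hallucinations by 10 percentage points.
metric_bias = 0.10
def metric_estimate(n=1000):
    labels = rng.random(n) < (true_rate + metric_bias)
    return labels.mean()

# Repeat each evaluation many times to see the spread of the estimates.
human = np.array([human_estimate() for _ in range(2000)])
metric = np.array([metric_estimate() for _ in range(2000)])

print(f"human:  mean={human.mean():.3f}  std={human.std():.3f}")
print(f"metric: mean={metric.mean():.3f}  std={metric.std():.3f}")
```

The human estimates centre on the true rate but scatter widely; the metric estimates scatter far less but centre on the wrong value, which is exactly why neither source alone suffices.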
As an example, assume that we are tasked with evaluating the hallucination rate of ChatGPT when creating summaries (i.e. in how many cases it generates factually wrong statements), and with comparing it to Falcon-180B (a well-known open-source LLM). How would we proceed? We would take a batch of texts to be summarised and generate the summaries using both systems. Then we would ask humans to read these summaries and source texts, and annotate the hallucinations. However, this task is very tedious, and due to time and money constraints we might only be able to gather, say, 50 human annotations per system. So we use GPT-4 to rate the generated texts automatically, and gather another 1,000 automated annotations. But can these be trusted? How can we leverage the data in this scenario to create an evaluation of ChatGPT and Falcon-180B that lets us state whether one hallucinates less than the other? How can we mitigate the mistakes that automated metrics make compared to human evaluation? Is it possible that GPT-4 unfairly favours the texts generated by ChatGPT? And how can we select the texts with the highest discriminatory power, i.e. those that best reveal the differences between the two systems?
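One classical way to combine the two data sources is to estimate the metric's error rates on the doubly-annotated items and then correct the automated count, in the spirit of the Rogan-Gladen estimator from epidemiology. The sketch below is purely illustrative: the true rate, sensitivity, and specificity are invented, and the experiment is repeated 500 times only to average out sampling noise.

```python
import numpy as np

rng = np.random.default_rng(1)

# All numbers below are illustrative assumptions, not measurements.
true_rate = 0.30    # true hallucination rate of the system
sensitivity = 0.90  # P(metric flags a summary | it really hallucinates)
specificity = 0.80  # P(metric clears a summary | it is faithful)

def simulate(n):
    """Draw n summaries: human ground truth plus the metric's noisy flag."""
    truth = rng.random(n) < true_rate
    flags = np.where(truth, rng.random(n) < sensitivity,
                            rng.random(n) >= specificity)
    return truth, flags

naive_runs, corrected_runs = [], []
for _ in range(500):
    truth_small, flags_small = simulate(50)    # 50 doubly-annotated summaries
    _, flags_large = simulate(1000)            # 1,000 metric-only annotations

    naive = flags_large.mean()                 # treats metric flags as truth
    sens_hat = flags_small[truth_small].mean()
    spec_hat = (~flags_small[~truth_small]).mean()
    # Rogan-Gladen correction: invert the metric's confusion behaviour.
    corrected = (naive + spec_hat - 1) / (sens_hat + spec_hat - 1)

    naive_runs.append(naive)
    corrected_runs.append(corrected)

print(f"true rate       : {true_rate:.3f}")
print(f"naive (mean)    : {np.mean(naive_runs):.3f}")
print(f"corrected (mean): {np.mean(corrected_runs):.3f}")
```

On average, the corrected estimate recovers the true rate, while the naive metric-only estimate stays biased; the price is extra variance inherited from the small doubly-annotated set, which is precisely the trade-off a principled evaluation framework has to manage.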
The goal of the UniVal project, a 3-year SNF project starting in January 2024 at ZHAW, is to provide answers to these questions. We will develop a theoretical framework for trustworthy evaluation by modelling the evaluation process with a Bayesian model that mitigates the aforementioned issues. We have already performed two pre-studies to test the concept [1, 2], which showcased the modelling power of our approach by combining human and automated ratings. In one pre-study, we showed that by explicitly modelling the mistakes that the metrics make with respect to the human evaluation, it is possible to create a trustworthy evaluation, which precisely states how certain we can be that the hallucination rate of one system is lower than that of the other. In the other, we showed how to derive a trustworthy evaluation guideline from the theoretical framework in order to reduce the number of human annotations needed. Our modelling approach also allows us to run counterfactual analyses, which tell us how an evaluation needs to be set up to yield the insights we wish to gain: how many human ratings are needed, how many automated ratings, and how good the automated metric has to be. This makes it a very powerful tool in the evaluation arsenal.
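The actual models are described in [1, 2]; purely to convey the flavour of the Bayesian idea, the following toy sketch (with invented counts, uniform priors, and a simple grid approximation that is our simplification here, not the project's model) combines 50 human labels and 1,000 metric labels per system while marginalising over the metric's unknown error rates.

```python
import numpy as np

rng = np.random.default_rng(2)

def posterior_rate_samples(k_h, n_h, tp, fn, tn, fp, k_m, n_m, draws=2000):
    """Posterior samples of a system's hallucination rate theta, combining
    n_h human labels (k_h positive) with n_m metric labels (k_m flagged),
    marginalising over the metric's unknown sensitivity/specificity."""
    grid = np.linspace(1e-4, 1 - 1e-4, 400)       # uniform prior over theta
    # Human labels: direct Binomial likelihood in theta.
    log_human = k_h * np.log(grid) + (n_h - k_h) * np.log1p(-grid)
    samples = np.empty(draws)
    for i in range(draws):
        # Draw the metric's error behaviour from its posterior
        # (Beta(1,1) priors updated on the doubly-annotated items).
        sens = rng.beta(1 + tp, 1 + fn)
        spec = rng.beta(1 + tn, 1 + fp)
        q = grid * sens + (1 - grid) * (1 - spec)  # P(metric flags a summary)
        log_post = log_human + k_m * np.log(q) + (n_m - k_m) * np.log1p(-q)
        w = np.exp(log_post - log_post.max())
        samples[i] = rng.choice(grid, p=w / w.sum())
    return samples

# Invented counts for two systems (50 human labels, 1,000 metric labels each);
# tp/fn/tn/fp are the metric's confusion counts on the human-annotated items.
theta_a = posterior_rate_samples(k_h=12, n_h=50, tp=10, fn=2, tn=30, fp=8,
                                 k_m=360, n_m=1000)
theta_b = posterior_rate_samples(k_h=17, n_h=50, tp=14, fn=3, tn=26, fp=7,
                                 k_m=430, n_m=1000)

print(f"P(system A hallucinates less than B) = {np.mean(theta_a < theta_b):.2f}")
```

The output is exactly the kind of statement a trustworthy evaluation should make: not a single number per system, but a calibrated probability that one system is better than the other, with the metric's unreliability priced in.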
Evaluation of text generation systems is an important, hard, and as yet unsolved challenge. However, we must not forget that it took almost 200 years to define the notion of a metre (which is now based on the speed of light), and even longer to master the science of measuring temperature. Thus, it may well take considerable effort to measure the quality of general text generation. Our SNF project will provide one piece of the puzzle.
[1] P. von Däniken et al., “On the effectiveness of automated metrics for text generation systems,” in Findings of the Association for Computational Linguistics: EMNLP 2022. [Online]. Available: https://aclanthology.org/2022.findings-emnlp.108/
[2] J. Deriu et al., “Correction of errors in preference ratings from automated metrics for text generation,” in Findings of the Association for Computational Linguistics: ACL 2023. [Online]. Available: https://aclanthology.org/2023.findings-acl.404/
Jan Deriu, Centre for Artificial Intelligence at the Zurich University of Applied Sciences, Switzerland