Assessing Generative AI Systems Using E-scores

by Guneet Singh Dhillon (University of Oxford), Teodora Pandeva (Microsoft Research), and Alicia Curth (Microsoft Research)

Generative AI systems are becoming ubiquitous, but their outputs can still be inaccurate or misleading. Using e-values, the e-scores framework provides a statistically rigorous assessment of AI-generated responses while accommodating the adaptive and post-hoc nature of human-AI interactions.

Generative AI is actively shaping our everyday lives. Large language models now draft emails, summarise documents, generate software code, and answer millions of conversational queries each day. Their outputs are often coherent and persuasive, yet they may still contain subtle inaccuracies, fabricated facts, or logical inconsistencies. Such spurious outputs are especially problematic in high-stakes settings like healthcare, education, finance, and scientific research. This creates a central challenge for trustworthy AI: how can we reliably assess whether an AI-generated response is actually correct?

Statistical methods based on p-values have long offered a principled way to control errors in scientific inference. P-value-based methods can also provide guarantees in our setting by filtering generated responses [1]. For example, suppose a user is willing to tolerate a 5% error rate. After filtering responses by their corresponding p-values, the probability of mistakenly retaining an incorrect response is at most 5%. In this way, the retained responses are reliable for the chosen tolerance level. A user asking for restaurant recommendations may accept a relatively high tolerance for error, while another seeking medical or legal advice may require a tolerance closer to zero.

These guarantees, however, only hold if the user fixes the tolerance level before examining the data. Modern-day usage of generative AI systems rarely follows such a rigid protocol because adaptivity is intrinsic to human-AI interaction. Users naturally revise their tolerance after inspecting the data. Figure 1 illustrates a setting in which the filtered responses remain unchanged for any tolerance level between 0.01 and 1. Since a lower tolerance corresponds to a stronger reliability guarantee, the user would want to choose 0.01 after observing the data. From a statistical perspective, this creates a serious problem that closely resembles the well-known issue of p-hacking in scientific research, and invalidates the original error-control guarantees.

Recent years have seen a surge of interest in hypothesis testing with e-values, which extend classical p-value methodology to more complex testing scenarios. Importantly, this includes settings with data-dependent or post-hoc tolerance levels. Unlike p-values, e-values retain post-hoc validity, i.e., they continue to provide guarantees even when the user chooses the tolerance level after examining the data [2]. This property makes them particularly well-suited to the interactive and adaptive use of modern generative AI systems.
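
A short, standard argument shows why e-values tolerate a tolerance level chosen after seeing the data. The notation below is introduced here for illustration and is not taken from the article: E denotes an e-value, α a tolerance fixed in advance, and α̂ a tolerance that may depend on the data. The precise guarantee for e-scores is the one stated in [3]; this is only a sketch of the underlying idea.

```latex
% E is an e-value: a non-negative statistic whose expectation is at most 1
% under the null hypothesis.
\[
  E \ge 0, \qquad \mathbb{E}[E] \le 1 .
\]
% Fixed tolerance alpha: Markov's inequality bounds the error probability.
\[
  \Pr\!\left(E \ge \tfrac{1}{\alpha}\right) \le \alpha\,\mathbb{E}[E] \le \alpha .
\]
% Data-dependent tolerance \hat\alpha > 0: the left-hand bound holds pointwise
% (if E >= 1/\hat\alpha the ratio equals 1/\hat\alpha <= E, otherwise it is 0),
% so taking expectations preserves it however \hat\alpha was chosen.
\[
  \frac{\mathbf{1}\{E \ge 1/\hat\alpha\}}{\hat\alpha} \le E
  \quad\Longrightarrow\quad
  \mathbb{E}\!\left[\frac{\mathbf{1}\{E \ge 1/\hat\alpha\}}{\hat\alpha}\right]
  \le \mathbb{E}[E] \le 1 .
\]
```

For a p-value, only the fixed-α statement is available in general, which is why revising the tolerance after the fact breaks the guarantee.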

In our recent work, together with Javier González, we developed a framework based on e-values for assessing the correctness of AI-generated responses [3]. At the heart of the approach is the idea of assigning each generated response a non-negative score that measures evidence of incorrectness: low scores indicate correct responses, while high scores indicate incorrect ones. Embracing the adaptive nature of human-AI interaction, the framework allows users to choose their tolerance level (and therefore the filtering threshold) after observing the responses and their scores. Crucially, because these scores are derived from e-values, they retain post-hoc validity. We therefore call these scores e-scores.
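
The mechanics of post-hoc filtering can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [3]: the score values are made up, the function names are hypothetical, and the filtering rule (keep a response when its score is at most the tolerance) is an assumption about the simplest possible thresholding. The sketch does not compute e-scores; it only shows how a user might revise the tolerance after seeing them.

```python
from typing import List, Optional, Sequence


def retained(scores: Sequence[float], tolerance: float) -> List[int]:
    """Indices of responses whose score is at most the tolerance (assumed rule)."""
    return [i for i, s in enumerate(scores) if s <= tolerance]


def tightest_tolerance(scores: Sequence[float], tolerance: float) -> Optional[float]:
    """Smallest tolerance that retains the same responses as `tolerance`.

    Returns None when nothing is retained.
    """
    kept = [s for s in scores if s <= tolerance]
    return max(kept) if kept else None


# Hypothetical e-scores for four candidate responses (illustrative values only).
e_scores = [0.004, 0.01, 2.5, 40.0]

initial = 0.05
kept_initial = retained(e_scores, initial)

# After inspecting the scores, the user tightens the tolerance as far as
# possible without changing which responses are retained.
revised = tightest_tolerance(e_scores, initial)
kept_revised = retained(e_scores, revised)

print(kept_initial, revised, kept_revised)
```

With these illustrative values the retained set is the same at the initial and the revised tolerance, which mirrors the situation described for Figure 1: the user can report the stricter level without changing what is kept.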

Another appealing feature of the e-scores approach is its ability to assess smaller portions of a generated response. For instance, a long AI-generated text may contain both highly reliable and spurious segments. As illustrated in Figure 1, the framework can assign e-scores not only to the complete response but also to partial responses, allowing users to identify which portions deserve greater trust and which may require additional verification. This provides much finer-grained information than a single global assessment while maintaining statistical validity.

The framework is also applicable beyond large language models and textual outputs. In principle, it can broadly apply to any generative AI system and any output domain. This flexibility opens the door to diverse applications and use cases.

In our work, we also demonstrate the framework’s efficacy in two real-world scenarios [3]. The first focuses on mathematical factuality to identify the first incorrect step in a chain of reasoning steps; Figure 1 presents one such example. The second evaluates the desirability of a response according to criteria such as helpfulness and truthfulness. In both cases, the empirical results corroborate that the theoretical guarantees translate effectively into practical utility.
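
The use of scores on partial responses can be sketched as follows. This is an illustrative assumption rather than the procedure from [3]: each partial response (steps 1 to k of a reasoning chain) is given a made-up score, and the hypothetical helper `trusted_prefix_length` keeps the leading steps up to the first score that exceeds the tolerance.

```python
from typing import Sequence


def trusted_prefix_length(prefix_scores: Sequence[float], tolerance: float) -> int:
    """Number of leading steps kept before the first score above the tolerance."""
    for k, score in enumerate(prefix_scores):
        if score > tolerance:
            return k
    return len(prefix_scores)


steps = ["step 1", "step 2", "step 3", "step 4", "step 5"]

# Hypothetical scores, one per partial response (steps 1..k); illustrative only.
prefix_scores = [0.002, 0.005, 3.0, 8.0, 20.0]

k = trusted_prefix_length(prefix_scores, tolerance=0.05)
trusted, flagged = steps[:k], steps[k:]

print("trusted:", trusted)
print("needs verification:", flagged)
```

Here the first two steps are kept and everything from the third step onward is flagged, so a user would know where to begin checking the reasoning by hand.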

Figure 1: E-scores example for mathematical factuality. The large language model’s response consists of five sub-responses, each a step in the mathematical reasoning (starting from the inner and ending on the outer block). The checks/crosses on the bottom left and the green/red colour of each block represent the response’s (in)correctness up to that point. We highlighted part of the third sub-response that, on manual inspection, caused the incorrectness, which cascades to subsequent sub-responses. The e-scores on the bottom right of each block are measures of incorrectness, i.e., low for correct and high for incorrect responses. (Credit: [3].)

Altogether, the e-scores framework offers a new mechanism for assessing generative AI systems. It uses e-values to bridge rigorous statistical guarantees and adaptive human-AI interaction. In doing so, it empowers users to make reliable decisions while retaining post-hoc validity. We believe this represents an important step toward trustworthy AI and the adoption of AI systems in our everyday lives.

References:
[1] C. Mohri and T. Hashimoto, “Language models with conformal factuality guarantees,” in Proceedings of the International Conference on Machine Learning, 2024.
[2] P. D. Grünwald, “Beyond Neyman–Pearson: E-values enable hypothesis testing with a data-driven alpha,” Proceedings of the National Academy of Sciences, 2024.
[3] G. S. Dhillon, J. González, T. Pandeva, and A. Curth, “E-scores for (in)correctness assessment of generative model outputs,” in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2026.

Please contact:
Guneet Singh Dhillon, University of Oxford, United Kingdom
