E-values, P-values and Counterfactuals

by Peter Grünwald (CWI and Leiden University)

A major criticism of p-values and standard confidence intervals, first raised around 1960, is their sensitivity to counterfactuals: their validity depends on how data would have been collected in situations that never occurred, which is often unknown or even unknowable. The fact that e-based methods remain valid under optional continuation implies that they do not suffer from this problem…or does it?

Suppose that a randomized clinical trial to test a new medical treatment is performed on 50 patients, represented by a 50-dimensional data vector X(₁). The result turns out to be promising but not conclusive: the researchers observed a p-value p₁ = 0.1 while they had a significance level of 0.05 in mind. But their boss is optimistic at the news and agrees to supply the resources to test another 30 patients, resulting in data X(₂).

Is this good news? Not if one measures evidence in the second trial by another p-value, say p₂. For some realizations of X(₁), it may be decided not to gather X(₂). As a result, standard combination methods for p-values like Fisher’s cannot be employed, since they invariably require that, no matter what is observed in each sample, we always combine both. Similarly, joining the two data sets and recalculating the p-value leads to a wrong answer as well: take a method that stops for some values of X(₁) and in that case outputs p* = p₁, a sharp, nonconservative p-value for X(₁), yet continues for other values of X(₁) and in that case, after observing X(₂), outputs p* = p′ with p′ a number strictly smaller than 1. It is easily shown that for any such method, the resulting p* is not a p-value—it tends to exaggerate the evidence that an effect is present.
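The inflation described above can be checked with a small simulation. The sketch below is not taken from [1]. It assumes a one-sided z-test on standard normal data under the null hypothesis, and a hypothetical rule that continues to the second batch only when p₁ falls between 0.05 and 0.2, then reports the p-value of the pooled 80 observations. The function name and both thresholds are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist


def type_one_error(alpha=0.05, promising=0.2, n1=50, n2=30,
                   sims=200_000, seed=1):
    """Estimate P(p* <= alpha) under the null for a continue-if-promising rule.

    Hypothetical setup: unit-variance normal data with mean 0 under the null,
    so each batch is summarised by its z-statistic, which is standard normal.
    """
    rng = random.Random(seed)
    norm = NormalDist()
    rejections = 0
    for _ in range(sims):
        z1 = rng.gauss(0.0, 1.0)          # z-statistic of the first batch
        p1 = 1.0 - norm.cdf(z1)           # one-sided p-value for X(1)
        if alpha < p1 <= promising:
            # "Promising but not conclusive": gather the second batch
            # and recompute the p-value on the pooled n1 + n2 observations.
            z2 = rng.gauss(0.0, 1.0)
            z_pooled = (sqrt(n1) * z1 + sqrt(n2) * z2) / sqrt(n1 + n2)
            p_star = 1.0 - norm.cdf(z_pooled)
        else:
            # Stop after the first batch and report p1 unchanged.
            p_star = p1
        rejections += p_star <= alpha
    return rejections / sims


if __name__ == "__main__":
    print("estimated P(p* <= 0.05) under the null:", type_one_error())
```

Under these assumptions the estimated rejection rate lands above the nominal 0.05, because the continuation branch gives the data a second chance to cross the threshold.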

Now, it is often countered that p-values were never meant to be used for such optional continuation. So we simply shouldn’t use them here! But this is a subtle matter: in a variation of the example, suppose the researchers told their boss merely that p₁ was small enough for the result to be “promising but not conclusive,” and they did not tell its actual value. The boss, once again, feels optimistic when he hears “promising” and suggests they continue the trial on a second batch of 30 patients.

The researchers, who know their statistics, are now worried about invalidating the results. But then news reaches the boss that the p-value is 0.1. Disappointed, he now decides to stop the trial after all—he had thought “promising” really meant something closer to 0.05 than 0.1.

Should the researchers be relieved? Perhaps surprisingly, the answer is no! [1,2] A simple calculation shows that the mere fact that the sample would have been 80 patients (thus different from the originally planned 50) in some counterfactual situation (i.e., if the first 50 data points had been different from what they actually were) already makes the p-value invalid. That is, counterfactuals can ruin the validity of the p-value even if the sampling plan was not in fact changed for the data that were actually observed.

As we show in [1], using e- instead of p-values avoids the above problem. This suggests that e-based methods solve the general conundrum about counterfactuals. It turns out, however, that they do so only if the counterfactuals involved are related to time. For counterfactuals related to censoring they don't, as the following classic example shows.
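Before turning to the censoring example, the time-related claim can be illustrated numerically. An e-value is a nonnegative statistic whose expected value under the null is at most 1, so by Markov's inequality it exceeds 1/α with probability at most α. The sketch below is an illustration under stated assumptions, not the construction used in [1]: it takes the likelihood ratio of a hypothetical alternative N(δ, 1) against the null N(0, 1) as the e-value of each batch, and multiplies the two batch e-values whenever a hypothetical "promising" rule decides to continue.

```python
import random
from math import exp, sqrt


def batch_e_value(z, n, delta):
    """Likelihood ratio of N(delta, 1) versus N(0, 1) for a batch of n
    observations whose z-statistic (sum / sqrt(n)) equals z."""
    return exp(delta * sqrt(n) * z - n * delta ** 2 / 2)


def e_rejection_rate(alpha=0.05, delta=0.3, n1=50, n2=30,
                     sims=200_000, seed=1):
    """Estimate P(E* >= 1/alpha) under the null with optional continuation.

    Hypothetical rule: continue to the second batch only if the first
    e-value is above 1 but still below the rejection threshold 1/alpha.
    """
    rng = random.Random(seed)
    threshold = 1.0 / alpha
    rejections = 0
    for _ in range(sims):
        e = batch_e_value(rng.gauss(0.0, 1.0), n1, delta)
        if 1.0 < e < threshold:
            # Optional continuation: multiply by the second batch's e-value.
            e *= batch_e_value(rng.gauss(0.0, 1.0), n2, delta)
        rejections += e >= threshold
    return rejections / sims


if __name__ == "__main__":
    print("estimated P(E* >= 20) under the null:", e_rejection_rate())
```

Whatever data-dependent continuation rule is plugged in, the product remains an e-value, so under these assumptions the estimated rejection rate should stay at or below α = 0.05.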

Pratt’s (1962) Voltmeter
Suppose we observe X₁, ..., X_n where n is fixed and the X_i represent voltages of electron tubes, measured with an accurate voltmeter. A statistician examines the X_i assuming they are normally distributed with fixed variance and some mean μ. He aims to use a p-value to measure the evidence against the null hypothesis μ = 4. Later he visits the engineer’s laboratory, and notices that the voltmeter reads only as far as 6 (Figure 1). Even though none of the X_i were ≥ 6, this makes the standard p-value invalid; it necessitates a new calculation that takes into account the (potential, counterfactual!) censoring [L1,1,2,3].
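What such a new calculation might look like can be sketched by simulation. The numbers below are hypothetical and not part of Pratt's example: n = 5 readings, a known standard deviation of 1.5, an observed sample mean of 5.2 with every reading below 6, and a one-sided test of μ = 4 based on the sample mean. The sketch also assumes that a capped meter simply reports 6 for any voltage of 6 or more.

```python
import random
from math import sqrt
from statistics import NormalDist


def textbook_p_value(mean_obs=5.2, n=5, mu0=4.0, sigma=1.5):
    """One-sided p-value for the sample mean, ignoring the meter's range."""
    z = (mean_obs - mu0) * sqrt(n) / sigma
    return 1.0 - NormalDist().cdf(z)


def voltmeter_p_values(mean_obs=5.2, n=5, mu0=4.0, sigma=1.5, cap=6.0,
                       sims=100_000, seed=0):
    """Monte Carlo p-values under the null, with and without a capped meter.

    Returns (p_plain, p_capped): the estimated probability that the sample
    mean is at least mean_obs when readings are recorded exactly, and when
    every reading is recorded as min(reading, cap).
    """
    rng = random.Random(seed)
    hits_plain = 0
    hits_capped = 0
    for _ in range(sims):
        xs = [rng.gauss(mu0, sigma) for _ in range(n)]
        if sum(xs) / n >= mean_obs:
            hits_plain += 1
        if sum(min(x, cap) for x in xs) / n >= mean_obs:
            hits_capped += 1
    return hits_plain / sims, hits_capped / sims


if __name__ == "__main__":
    print("textbook p-value:", textbook_p_value())
    print("simulated (plain, capped):", voltmeter_p_values())
```

In this setup the capped-meter p-value comes out smaller than the textbook one, since capping can only lower a simulated mean. The point is merely that the reference distribution of the test statistic changes, although no actual reading was capped.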

Figure 1: A voltmeter prone to censoring, from the era in which Pratt came up with the example. Photo by Andrey Shtanko (CC BY 4.0).

However, the engineer then says she also has a super-high-range meter, equally accurate, which she would have used if any of the measurements had turned out to be 6 or higher. This is a relief to the statistician, because it means the original p-value is correct after all. But the next day the engineer telephones and says, “I just discovered my high-range voltmeter was not working the day I did the experiment.” The statistician then informs her that a new analysis will be required after all!

The engineer is astounded. She says, “But the experiment turned out just the same as if the high-range meter had been working. I learned exactly what I would have learned if the high-range meter had been available. Next you’ll be asking about my oscilloscope!”

Unknown Unknowns
As we show in [1], variations of this problem do affect e-methods: they might also have to be redefined in terms of counterfactual censoring. Nevertheless, the voltmeter and the optional continuation stories have a similar flavour: one first obtains less precise information (50 data points, or censored measurements). Then, depending on the value of this initial data, one may or may not decide to get more precise information (additional data points, or uncensored measurements). This suggests that extensions of e-values that deal with such more general counterfactuals are possible; this is one of the projects that I proposed to tackle in my awarded ERC Advanced Grant. For this, we need to formalize the intriguing notion of observing the outcome of a random variable without knowing the definition of that random variable, which we managed to do in [1]. Yet much work remains to be done; specifically, one needs to be able to employ, within the same mathematical formula, random variables that have known definitions and random variables that have unknown definitions. This is still a major challenge, one that may require tools from epistemic logic rather than probability and statistics!

Link:
[L1] https://www2.stat.duke.edu/~st118/sta732/PrincHO.pdf

References:
[1] B. Chugg, A. Ramdas, and P. D. Grünwald, “E-values as statistical evidence: A comparison to Bayes factors, likelihoods, and p-values,” arXiv preprint arXiv:2603.24421, 2026.
[2] E. J. Wagenmakers, “A practical solution to the pervasive problem of p-values,” Psychonomic Bulletin & Review, vol. 14, no. 5, pp. 779–804, 2007.
[3] L. J. Savage, G. Barnard, J. Cornfield, et al., “On the foundations of statistical inference: Discussion,” Journal of the American Statistical Association, vol. 57, no. 298, pp. 307–326, 1962.

Please contact:
Peter Grünwald
CWI and Leiden University, The Netherlands