Racing to the Truth: How E-values Can Speed Up Science

by Sebastian Arias, Alexander Ly, Michele Meziu (CWI) and Angel Reyero Lobo (CWI and Inria)

Modern science generates data continuously, but the statistical methods that still dominate many fields generally require data collection to end before reliable meta-analysis can begin. New research on e-values offers a way to analyse evidence in real time, without sacrificing statistical reliability. The approach could make science not just more robust to modern research practices, but significantly more efficient.

Science no longer proceeds one experiment at a time. Data now flow continuously from laboratories, countries and digital platforms, often through sprawling parallel collaborations. Yet statistical practice remains rooted in an era when evidence was assessed only after each study had been concluded.

Ongoing research into e-values [1] suggests a more modern approach. Our preliminary findings indicate that e-values are not just theoretically appealing but also practically advantageous, allowing evidence to be monitored, combined and acted upon continuously during data collection – without sacrificing statistical reliability or sample efficiency (see Figure 1 for a schematic illustration).

Figure 1: Sequential evidence monitoring across multiple data-collection sites. The map shows five geographically distinct data-collection sites, each represented by a different colour. The coloured dots along the x-axis mark observations as they arrive over time from the different sites. The curve shows the accumulation of meta-analytical evidence against either the hypothesis of no effect or the hypothesis that the effect is meaningfully large. The horizontal line marks the stopping threshold for the combined analysis, corresponding here to the evidential strength of four individually significant replication attempts.

Projects such as Many Labs 2 [2] illustrate the potential of e-value-based methods in multi-stream experimental settings. This vast replication effort involved 61 laboratories and more than 15,000 participants testing 28 published psychological effects. Under traditional p-value-based methods, researchers could not safely aggregate evidence until every participating laboratory had finished collecting data; interim analyses risked producing unreliable conclusions.

E-values could have changed this. Preliminary results suggest that, had they been used, robust conclusions about replicability could often have been reached using only a fraction of the realised sample sizes – and crucially, while the experiments were still running.

The method’s power [L1] lies in its structure. It races two e-values against each other: one accumulating evidence against the null hypothesis of no effect, and another accumulating evidence against the hypothesis that the effect is meaningfully large. Both remain statistically valid regardless of when – or even if – data collection has stopped. This allows interim evidence from multiple laboratories to be safely combined into a live meta-analysis, continuously weighing the case for and against replicability.
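To make the idea of a race concrete, here is a minimal sketch under strong simplifying assumptions. It is not the authors' method or the code behind the link above. It assumes unit-variance normal observations with unknown mean, a hypothetical "meaningfully large" effect size `delta`, and an illustrative stopping threshold of 20, and it uses plain likelihood ratios as e-values. The function name `race_e_values` is invented for this illustration. In this toy setting the two e-values are reciprocals of each other, which will not hold for the more general constructions studied in [1].

```python
import math
import random


def race_e_values(observations, delta, threshold=20.0):
    """Race two likelihood-ratio e-processes over one pooled data stream.

    Illustrative assumptions: observations are N(mu, 1), and `delta` is the
    smallest effect regarded as meaningfully large.
    - e_effect accumulates evidence against "no effect" (mu = 0).
    - e_futility accumulates evidence against "the effect is large" (mu >= delta).
    Returns the winning conclusion and the number of observations used.
    """
    log_threshold = math.log(threshold)
    log_e_effect = 0.0
    log_e_futility = 0.0
    n = 0
    for x in observations:
        n += 1
        # Log likelihood ratio of N(delta, 1) against N(0, 1) at x.
        step = delta * x - 0.5 * delta ** 2
        log_e_effect += step      # grows when the data favour mu = delta
        log_e_futility -= step    # grows when the data favour mu = 0
        if log_e_effect >= log_threshold:
            return "effect", n
        if log_e_futility >= log_threshold:
            return "futility", n
    return "undecided", n


# Observations from all sites can be fed in as they arrive; here a single
# simulated pooled stream with a true effect equal to delta.
rng = random.Random(1)
stream = [rng.gauss(0.5, 1.0) for _ in range(500)]
print(race_e_values(stream, delta=0.5))
```

Because each running product is a nonnegative (super)martingale with starting value 1 under its own hypothesis, the chance that it ever crosses the threshold when that hypothesis is true is at most one over the threshold, however often the analyst looks at the data. This is what makes checking after every observation safe in the sketch.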

Figure 2 shows the percentage of the realised sample sizes needed to reach (strong) conclusions for 17 of the 28 effects. The meta-analytical evidential threshold used to determine success or failure of replicability across sites corresponds to the strength of four individually significant replication attempts combined. In several cases, the reductions are substantial. For one effect (“ross1” in Figure 2), reliable replication could already have been established after 421 participants – just 5.8% of the 7,205 eventually recruited. For another (“bauer”), evidence against replicability emerged after only 311 participants, or 4.7% of the final sample of 6,608. The gains were not universal: one effect (“risen”) still required nearly half of the realised sample size before a conclusion could be reached.
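The percentages quoted above follow directly from the participant counts in the text. The short check below reproduces them; the helper name `percent_needed` is an illustrative choice, not part of any published code.

```python
def percent_needed(needed, realised):
    """Share of the realised sample (in percent) used before a conclusion."""
    return 100.0 * needed / realised


# Participant counts as quoted in the text for two of the effects.
cases = {"ross1": (421, 7205), "bauer": (311, 6608)}
for name, (needed, realised) in cases.items():
    print(f"{name}: {percent_needed(needed, realised):.1f}% of the realised sample")
# prints:
# ross1: 5.8% of the realised sample
# bauer: 4.7% of the realised sample
```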

Figure 2: Percentage of the realised sample from the replication studies needed to reach a conclusion across seventeen different effects. The identifiers on the left refer to the (first) authors of the original published effects. Blue indicates evidence for replication success, whereas yellow indicates evidence for replication failure.

Interpreting these reductions precisely, however, requires some caution. Many Labs 2 was designed primarily to characterise variability in effect sizes across diverse samples and contexts, with replication success as a secondary consideration. Its design therefore prioritised breadth and robustness over statistical efficiency. The observed reductions should thus not be interpreted too literally, but they nonetheless point to substantial untapped gains from continuous evidence aggregation.

The examples come from psychology, a field acutely aware of the costs of inefficient and irreproducible research. But the implications extend much further. Medicine, genomics and online experimentation increasingly depend on evidence accumulated simultaneously across many sites. Statistical tools capable of aggregating such evidence in real time may therefore become central to the organisation of science itself.

Our research is ongoing. But the early results already suggest that the e-value-based methods designed to make scientific inference robust to the adaptive realities of modern data collection can also make science substantially more efficient.

Link:
[L1] https://github.com/AlexanderLyNL/safestats/tree/futility88

References:
[1] A. Ly, et al., “Dynamic evidence synthesis with e-values: Efficient sequential meta-analysis with early stopping for efficacy or futility with anytime-valid type I and II error control,” work in progress, 2026.
[2] R. A. Klein, et al., “Many Labs 2: Investigating variation in replicability across samples and settings,” Advances in Methods and Practices in Psychological Science, vol. 1, no. 4, pp. 443–490, 2018.

Please contact:
Alexander Ly, CWI, The Netherlands