More Discoveries and Flexibility in Multiple Testing

by Rianne de Heide (University of Twente and CWI)

Modern data analysis can test thousands of scientific questions at once, from genes in cancer studies to voxels in brain scans. A new general principle called e-closure gives researchers more freedom to explore these results after seeing the data, while keeping false discoveries under control.

In many areas of science, a single experiment no longer asks a single question. A cancer researcher may test tens of thousands of genes for a difference between tumour types. A neuroscientist may test activity in more than one hundred thousand brain locations. A climate scientist or computer scientist may scan huge collections of variables, models or configurations. Each individual test may look innocent, but together they create a simple danger: many discoveries will be false positives. With 20,000 tests at the conventional 5% level, for instance, about 1,000 would come out significant by chance alone even if no real effect existed.

Statistics has developed safeguards for this situation, known as multiple testing methods. One classic safeguard is very strict: it tries to avoid even one false alarm, which is called control of the family-wise error rate (FWER). This is useful in confirmatory studies, but often too conservative for discovery science. A more practical rule, now standard in genomics and many other fields, controls the false discovery rate (FDR). Informally, this means that among the reported discoveries, only a controlled fraction is expected to be false. It is the statistical reason why a long list of genes can be taken seriously enough for follow-up experiments.
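To make the FDR rule concrete, here is a minimal sketch of the Benjamini-Hochberg step-up procedure, one of the standard FDR methods. The function name and the example p-values are illustrative assumptions, not taken from the article.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices rejected by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    # Hypothesis indices ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p-value is below alpha * rank / m.
        if p_values[i] <= alpha * rank / m:
            k = rank
    # Reject the k hypotheses with the smallest p-values.
    return sorted(order[:k])


# Hypothetical p-values for six features.
example_p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
discoveries = benjamini_hochberg(example_p_values, alpha=0.05)
print(discoveries)
```

The output is a single list of discoveries. The guarantee applies to that list as a whole, which is the limitation discussed next.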

But there is a catch. The guarantee is usually attached to the list produced by a pre-specified method. In practice, scientists rarely stop there. They look at the list, compare the discoveries with biological knowledge, study effect sizes, remove uninteresting items, split results into clusters or pathways, and choose a smaller set for expensive validation. This is sensible scientific behaviour, but it falls outside the mathematical guarantee.

A familiar example is the volcano plot, widely used in molecular biology and oncology (Figure 1). Each point represents a gene or other molecular feature. One axis shows how statistically surprising the result is; the other shows how large the estimated effect is. The most attractive discoveries appear in the top left and top right corners: both statistically convincing and practically large. Researchers often first apply an FDR method and then, using the volcano plot, keep only the most extreme effects. The visual intuition is compelling. Unfortunately, this second filtering step can quietly yet severely break the original error guarantee. A subset of a well-controlled discovery list is not automatically well controlled. Ebrahimpoor and Goeman showed that this can substantially inflate the false discovery rate in realistic genomic settings [1].
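The effect of double filtering can be explored with a small simulation. The sketch below is a toy setup of our own, not the analysis of [1]: it assumes independent features, a modest true effect of 1.0 for 10% of them, and standard errors that vary widely between features. All names and parameter values are hypothetical. It compares the average false discovery proportion of the Benjamini-Hochberg list with that of the sublist kept by an effect-size filter.

```python
import math
import random


def bh_rejections(p_values, alpha):
    """Benjamini-Hochberg step-up procedure; returns the set of rejected indices."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k = rank
    return set(order[:k])


def false_discovery_proportion(selected, is_null):
    """Fraction of selected features that are true nulls (0 for an empty selection)."""
    if not selected:
        return 0.0
    return sum(is_null[i] for i in selected) / len(selected)


def simulate_once(rng, m=5000, n_signal=500, effect=1.0, alpha=0.05, min_abs_effect=2.0):
    """One synthetic experiment: FDP of the BH list and of the volcano-filtered sublist."""
    is_null = [i >= n_signal for i in range(m)]
    estimates, p_values = [], []
    for i in range(m):
        # Assumed: standard errors vary between features (log-uniform on [0.15, 1.5]).
        se = math.exp(rng.uniform(math.log(0.15), math.log(1.5)))
        estimate = (0.0 if is_null[i] else effect) + se * rng.gauss(0.0, 1.0)
        estimates.append(estimate)
        # Two-sided p-value of the z-statistic estimate / se.
        p_values.append(math.erfc(abs(estimate / se) / math.sqrt(2)))
    rejected = bh_rejections(p_values, alpha)
    # Second filter: keep only rejected features with a large estimated effect.
    filtered = {i for i in rejected if abs(estimates[i]) >= min_abs_effect}
    return (false_discovery_proportion(rejected, is_null),
            false_discovery_proportion(filtered, is_null))


rng = random.Random(0)
runs = [simulate_once(rng) for _ in range(20)]
mean_fdp_bh = sum(r[0] for r in runs) / len(runs)
mean_fdp_filtered = sum(r[1] for r in runs) / len(runs)
print(f"BH list: {mean_fdp_bh:.3f}, after effect-size filter: {mean_fdp_filtered:.3f}")
```

In this toy setup the filter favours features with noisy estimates, because a null feature with a large standard error that passes the first step necessarily shows a large estimated effect. This is one plausible route to inflation, and real data may behave differently.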

Figure 1: Schematic volcano plot illustrating double filtering (synthetic data). Each point represents one feature, such as a gene. The horizontal axis shows the estimated effect size, while the vertical axis shows statistical surprise. The tempting top-left and top-right corners contain results that look both statistically significant and practically large. Although these points are often treated as the most interesting findings, selecting them after applying a multiple-testing procedure can invalidate the original FDR guarantee. E-closure aims to make such post hoc choices part of the valid statistical analysis.

Our work asks whether researchers can have both things at once: rigorous error control and the freedom to interact with the results. The answer is yes, through a new unifying principle called e-closure.

The idea builds on two lines of work. The first is the classical closure principle for multiple testing. For several decades, closure has provided a complete recipe for FWER-controlling methods. Later work extended this type of thinking to other error measures (but not FDR). The second ingredient is the e-value, a modern measure of evidence that behaves particularly well when evidence is combined, accumulated over time, or inspected adaptively. Unlike the p-value, which asks how surprising the data would be under a null hypothesis, an e-value can be read as a kind of betting score: large values are evidence against the null, while the rules of the game prevent systematically exaggerated evidence when the null is true.
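The betting interpretation can be illustrated with a coin. In the sketch below, which is an illustrative example rather than a construction from the article, the null hypothesis is a fair coin and the e-value is a likelihood ratio against an assumed alternative with heads probability 0.7. The function names are hypothetical. The second function enumerates every possible sequence of flips to check that the expected e-value under the null is 1, which is the "rule of the game" that prevents exaggerated evidence.

```python
from itertools import product


def likelihood_ratio_e_value(flips, p_alt=0.7, p_null=0.5):
    """Betting score for coin flips (1 = heads): product of per-flip likelihood ratios."""
    e = 1.0
    for x in flips:
        e *= (p_alt if x else 1 - p_alt) / (p_null if x else 1 - p_null)
    return e


def expected_e_value_under_null(n, p_alt=0.7, p_null=0.5):
    """Exact expectation of the e-value over all 2**n flip sequences under the null."""
    total = 0.0
    for flips in product([0, 1], repeat=n):
        prob = 1.0
        for x in flips:
            prob *= p_null if x else 1 - p_null
        total += prob * likelihood_ratio_e_value(flips, p_alt, p_null)
    return total


mostly_heads = [1] * 8 + [0] * 2
print(likelihood_ratio_e_value(mostly_heads))  # large score: evidence against a fair coin
print(expected_e_value_under_null(5))          # equals 1 up to rounding error
```

Because the expected score under the null is at most 1, Markov's inequality implies that a score of 20 or more occurs with probability at most 5% when the null is true.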

E-closure combines these ideas. On top of designing one test for each separate hypothesis, we design evidence summaries for groups of hypotheses. The principle then turns these summaries into a menu of discovery sets that are all valid simultaneously. The researcher is not forced to report exactly one preordained list. After seeing the data, they may choose from the menu, for example to focus on genes with larger effects, to report a biologically meaningful pathway, or to separate a brain-imaging result into anatomical regions.

This is the key conceptual shift. Traditional FDR procedures give: "Here is the list of discoveries; do not modify it if you want to keep the guarantee." E-closure gives: "Here is a menu of lists; you may choose from this menu after looking, and the guarantee still holds." In the volcano plot setting, this means that post hoc filtering can be made part of the valid procedure rather than an invalid step after it.
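A brute-force sketch shows what such a menu could look like for a handful of hypotheses. It rests on stated assumptions rather than on the released software: each hypothesis has its own e-value; the evidence summary for a group S is the average of its members' e-values, which is again an e-value; and a discovery set R is put on the menu when, for every group S, the group's e-value is at least |R ∩ S| / (α |R|). That condition is our reading of the principle in [2]. The function names and example numbers are hypothetical.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every non-empty subset of items as a frozenset."""
    items = list(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def e_closure_menu(e_values, alpha=0.1):
    """Return all non-empty discovery sets that pass the check for every group."""
    groups = list(nonempty_subsets(range(len(e_values))))
    # Assumed evidence summary for a group: the average of its e-values.
    group_e = {S: sum(e_values[i] for i in S) / len(S) for S in groups}
    menu = []
    for R in groups:
        # R is allowed if every group's evidence covers the share of R inside that group.
        if all(group_e[S] >= len(R & S) / (alpha * len(R)) for S in groups):
            menu.append(R)
    return menu


# Hypothetical e-values for five hypotheses.
example_e_values = [60, 40, 25, 2, 0.5]
menu = e_closure_menu(example_e_values, alpha=0.1)
for R in menu:
    print(sorted(R))
```

In this example the largest set on the menu is {0, 1, 2}, and several of its subsets are on the menu as well. The single set {2} is not, which illustrates that an arbitrary subset of a valid list need not be valid. The enumeration visits every group, so it is only feasible for very small numbers of hypotheses.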

The principle is not merely a new method among many. Our technical result shows that e-closure is necessary and sufficient for a broad class of multiple-testing guarantees based on expected error. In particular, it covers FDR control, and it recovers the classical closure principle for FWER as a special case [2, 3]. This gives the field a common theory for methods that previously looked unrelated.
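A rough sketch of why a menu of discovery sets can carry an FDR guarantee, in notation assumed here for illustration (it may differ from [2]): write e_S for an e-value for the hypothesis that every member of the group S is null, and α for the target FDR level. The first line defines the menu, and the second line gives the argument for any set chosen from it after seeing the data.

```latex
\[
\mathcal{R}_\alpha
  = \Bigl\{ R : e_S \ge \frac{\mathrm{FDP}_S(R)}{\alpha}
      \ \text{for all groups } S \Bigr\},
\qquad
\mathrm{FDP}_S(R) = \frac{|R \cap S|}{\max(|R|, 1)} .
\]
\[
\text{If } N \text{ is the set of true nulls and } \hat{R} \in \mathcal{R}_\alpha
\text{ is chosen post hoc, then }
\mathrm{FDP}_N(\hat{R}) \le \alpha\, e_N ,
\ \text{so}\
\mathbb{E}\bigl[\mathrm{FDP}_N(\hat{R})\bigr]
  \le \alpha\, \mathbb{E}[e_N] \le \alpha .
\]
```

The argument uses only the e-value for the unknown true group N, and this is why the bound holds for every set on the menu at once.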

It also gives practical benefits. When existing procedures such as Benjamini-Hochberg or Benjamini-Yekutieli are expressed through e-closure, they can be uniformly improved: the new version never makes fewer discoveries and sometimes makes more, while preserving the same error guarantee [2, 3]. In real-data examples in our technical paper, closed versions of established procedures find additional discoveries, sometimes many more, at the same target error level.

There are computational challenges, because in principle all groups of hypotheses could be considered: exponentially many in the number of hypotheses. Yet the same was true for classical closure, and decades of work have produced shortcuts for important cases. Our paper gives polynomial-time algorithms for several useful procedures and points to further algorithmic work.

For applied scientists, the promise is that exploratory analysis need not be statistically fragile. For theoreticians, e-closure offers an elegant general theory. For users of volcano plots, pathway analyses and brain maps, it suggests a future in which interactive scientific judgement and formal error control can be used together, rather than treated as opposing goals.

The work described in this article is based on joint research with Jelle Goeman (Leiden University Medical Center), Aldo Solari (Ca' Foscari University of Venice), Ziyu Xu (Carnegie Mellon University), Lasse Fischer (University of Bremen), and Aaditya Ramdas (Carnegie Mellon University).

Links:
[L1] https://arxiv.org/abs/2509.02517, https://github.com/neilzxu/eclosure (Python), https://cran.r-project.org/web/packages/eClosure/index.html (R package by Jelle Goeman)
[L2] Active Volcano Plot software: https://github.com/mitra-ep/ActiveVolcanoPlot

References:
[1] M. Ebrahimpoor and J. J. Goeman, “Inflated false discovery rate due to volcano plots: problem and solutions,” Briefings in Bioinformatics, 22(5), 2021.
[2] Z. Xu et al., “Bringing Closure to False Discovery Rate Control: A General Principle for Multiple Testing,” arXiv:2509.02517, 2026.
[3] J. Goeman, “A Uniform Improvement of the Benjamini-Hochberg Procedure using e-Closure,” arXiv:2606.01854, 2026.

Please contact:
Rianne de Heide
University of Twente and Centrum Wiskunde & Informatica, The Netherlands
