Anytime-Valid Testing in the Age of AI-Assisted Software Development

by Michael Scott Lindon (Netflix)

The statistical guarantees designed to protect against human failures in sequential experimentation turn out to be exactly what is needed to govern autonomous AI agents conducting experiments.

Since the early 2010s, technology companies have been running hundreds to thousands of A/B tests per day. The canonical example of an A/B test is one testing button colours or reworded headlines – visible product elements designed to improve outcomes such as engagement. In reality, the majority of modern A/B tests are not of this kind, but are employed as quality control gates for safely rolling out new features. This is especially true of A/B tests used to roll out new code into production.

In software engineering, observability refers to the ability to infer the internal state of a system from its external outputs – logs, metrics, and traces. App-load time, response latency, and number of stream rebuffers are just a few examples of measurements found in observability data; an emerging metric is the LLM-as-a-judge evaluation, in which a large language model scores the quality of AI-generated outputs. While the earliest uses for observability data were dashboards and alerting, such data can also be used to test the performance of an incumbent software version against a newer replacement across an extremely heterogeneous population of hardware devices – what the industry calls a canary test. The idea is simple. Before a software update reaches the entire user population, it is first exposed to a small, randomized subset of devices. Observability data is ingested from control and treatment devices and monitored in real time. If a performance degradation is detected, the release is aborted, preventing bugs from reaching end-users.

Detection speed is essential. The longer degraded software remains live, the worse the cumulative harm to users. This necessitates sequential statistical methodologies that preserve Type-I error guarantees under continuous monitoring [1]. Canary tests, by the real-time nature and sheer volume of their data, are one of the clearest examples where anytime-valid inference is superior to group sequential testing methods, which require a small, finite number of pre-specified interim analyses.
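
To make continuous monitoring concrete, the following is a minimal sketch of a mixture sequential probability ratio test. It is an illustration only and not necessarily the method used in [1]. It assumes a stream of treatment-minus-control differences that are Gaussian with known standard deviation `sigma`, and it places a N(0, tau^2) mixing distribution on the mean shift; the function names and default values are hypothetical. By Ville's inequality, stopping the first time the e-process reaches 1/alpha keeps the Type-I error at most alpha, however often the data are inspected.

```python
import math


def log_e_process(diffs, sigma=1.0, tau=1.0):
    """Yield the running log e-value for H0: mean difference == 0.

    Assumes (for illustration) Gaussian differences with known standard
    deviation `sigma` and a N(0, tau^2) mixture over the alternative mean.
    """
    s = 0.0  # running sum of observed differences
    for n, x in enumerate(diffs, start=1):
        s += x
        denom = sigma ** 2 + n * tau ** 2
        # Closed-form log of the Gaussian mixture likelihood ratio.
        yield (0.5 * math.log(sigma ** 2 / denom)
               + (tau ** 2 * s ** 2) / (2 * sigma ** 2 * denom))


def monitor(diffs, alpha=0.05, sigma=1.0, tau=1.0):
    """Return the first sample count at which H0 is rejected, else None."""
    threshold = math.log(1.0 / alpha)
    for n, log_e in enumerate(log_e_process(diffs, sigma, tau), start=1):
        if log_e >= threshold:
            return n  # enough evidence: abort the canary here
    return None  # stream ended without crossing the threshold


if __name__ == "__main__":
    print(monitor([1.0] * 100))  # a persistent shift is flagged early
    print(monitor([0.0] * 100))  # no shift: never rejects
```

Because the threshold does not depend on when or how often the stream is checked, the same function can be called after every new observation without inflating the false alarm rate.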

The richness of observability data presents a further challenge: a single canary experiment may monitor hundreds of metrics simultaneously, making multiple testing corrections unavoidable. These metrics also exhibit a complicated dependency structure. Beyond preserving Type-I error under optional stopping, e-processes are well suited to multiple testing procedures. The e-BH procedure [2], for example, controls the false discovery rate under arbitrary dependence, whereas the original BH procedure relies on a positive regression dependence assumption.
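
As an illustration, the e-BH rule of [2] can be sketched in a few lines. Given K e-values, one per metric, it rejects the k* hypotheses with the largest e-values, where k* is the largest k such that the k-th largest e-value is at least K / (k * alpha). The function name and the example numbers below are made up for this sketch.

```python
def e_bh(e_values, alpha=0.1):
    """Return the sorted indices rejected by the e-BH rule at level alpha."""
    k_total = len(e_values)
    # Indices ordered from the largest e-value to the smallest.
    order = sorted(range(k_total), key=lambda i: e_values[i], reverse=True)
    k_star = 0
    for k, i in enumerate(order, start=1):
        # Equivalent to e_[k] >= K / (k * alpha), written without division.
        if k * e_values[i] * alpha >= k_total:
            k_star = k
    return sorted(order[:k_star])


if __name__ == "__main__":
    # Hypothetical e-values for five monitored metrics.
    metric_e_values = [60.0, 0.5, 30.0, 1.2, 8.0]
    print(e_bh(metric_e_values, alpha=0.1))
```

No assumption about the dependence between metrics enters the rule, which is what makes it convenient for correlated observability metrics.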

Catching a failure, however, is only half the problem. When a canary test fails, the engineer must causally attribute the shift in performance metrics to the lines of code changed between software versions. This has traditionally been a manual and painstaking process. Large language models change this picture. An LLM can read a code diff and reason about which changes plausibly caused which metric movements. An agent can go further: generate a hypothesis, make a targeted code change, run a new canary, and evaluate the result. This closed loop may iterate until the root cause is isolated and resolved. At the same time, AI coding assistants are enabling developers to ship faster than ever, with decreasing direct supervision over what is pushed to production. The canary experiment in the software delivery pipeline is exactly the quality control gate needed to address this concern.

The analogy to clinical diagnosis is apt. A physician orders an initial battery of tests, forms hypotheses from the results, orders further tests to accumulate evidence, and ultimately confirms a diagnosis by observing whether the patient responds to a drug designed for a specific disease. Agentic software debugging follows the same logic, with canary experiments in place of diagnostic tests and targeted code changes in place of drug therapy.

The entire machinery of safe anytime-valid inference fits this agentic workflow with striking coherence. Within a single experiment, anytime-valid tests allow the agent or human operator to stop as soon as there is enough evidence of improvement, degradation, or futility, making experimentation both more efficient and less risky for users. Across the many metrics monitored in each experiment, e-value based multiple testing procedures control false discoveries even under complex dependence. E-values from distinct experiments can also be merged to combine evidence.
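
Two standard merging rules illustrate the last point. The product of e-values is again an e-value when the experiments are independent, and the arithmetic mean is an e-value under arbitrary dependence. The sketch below uses hypothetical function names and made-up numbers.

```python
import math


def merge_product(e_values):
    """Combine e-values from independent experiments by multiplication."""
    return math.prod(e_values)


def merge_average(e_values):
    """Combine e-values under arbitrary dependence by averaging."""
    return sum(e_values) / len(e_values)


if __name__ == "__main__":
    # Hypothetical e-values from three canary runs of the same hypothesis.
    runs = [2.0, 5.0, 3.0]
    print(merge_product(runs))  # valid if the runs are independent
    print(merge_average(runs))  # valid whatever the dependence
```

The product accumulates evidence much faster, but it is only justified when independence is credible; the average is the conservative choice when experiments share devices or time windows.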

Across the unbounded sequence of hypotheses generated by the agent’s learn-edit-test loop, online FDR procedures [3] control the rate at which false discoveries enter the agent’s evolving understanding of users. When the agent surfaces conclusions to a human reviewer, e-values also support valid post-selection inference over the hypotheses that survived.
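
A minimal sketch of an online rule of this kind is shown below. It is patterned on the e-LOND rule as we understand it from [3]: hypothesis t is rejected when its e-value reaches 1/alpha_t, with alpha_t = alpha * gamma_t * (number of earlier rejections + 1) and gamma_t a non-negative sequence summing to one. The choice gamma_t = 6 / (pi^2 t^2), the function name, and the example stream are assumptions made for illustration.

```python
import math


def online_e_fdr(e_stream, alpha=0.1):
    """Yield a reject/accept decision for each e-value as it arrives."""
    rejections = 0
    for t, e in enumerate(e_stream, start=1):
        # Assumed discount sequence; it sums to 1 over t = 1, 2, ...
        gamma_t = 6.0 / (math.pi ** 2 * t ** 2)
        # The test level grows with the number of past discoveries.
        alpha_t = alpha * gamma_t * (rejections + 1)
        reject = e * alpha_t >= 1.0  # same as e >= 1 / alpha_t
        if reject:
            rejections += 1
        yield reject


if __name__ == "__main__":
    # Hypothetical e-values from four successive agent-run experiments.
    for decision in online_e_fdr([100.0, 1.0, 200.0, 2.0], alpha=0.1):
        print(decision)
```

Each decision depends only on past e-values, so the agent can act on a discovery immediately and launch its next experiment without revisiting earlier decisions.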

This becomes essential when the workflow is not merely autonomous debugging, but autonomous statistical research. An agent can run an experiment, inspect the results, infer something about users, generate follow-up hypotheses, and launch the next experiments without waiting for a human analyst. Run this flywheel without online FDR and false discoveries are no longer isolated reporting errors. They corrupt the trajectory of the research – steering the agent toward hypotheses built on noise rather than genuine insight into users. Online FDR is the governance layer that keeps this loop from optimizing around noise.

The fit between agentic workflows and the toolkit of anytime-valid inference, e-values, and online FDR is not a coincidence – it runs deeper than analogy. The problems these methods solve are exactly the problems autonomous agents recreate: peeking at accumulating data, launching follow-up studies based on borderline results, and reporting only the post-selected outcomes. Agents do not intend any of this, but they recreate the same statistical failure modes by optimizing for finding an answer. Formal statistical guarantees are the only protection available. Safe anytime-valid inference provides the tools to build a statistical harness for agentic experimentation.

Links:
[L1] https://netflixtechblog.com/sequential-a-b-testing-keeps-the-world-streaming-netflix-part-1-continuous-data-cba6c7ed49df
[L2] https://netflixtechblog.com/sequential-testing-keeps-the-world-streaming-netflix-part-2-counting-processes-da6805341642

References:
[1] M. Lindon et al., “Rapid regression detection in software deployments through sequential testing,” KDD, ACM, 2022.
[2] R. Wang and A. Ramdas, “False discovery rate control with e-values,” J. R. Stat. Soc. B, vol. 84, no. 3, pp. 822–852, 2022.
[3] Z. Xu and A. Ramdas, “Online multiple testing with e-values,” AISTATS, PMLR, 2024.

Please contact:
Michael Scott Lindon, Netflix