1 The Problem with Conventional P-Values
The standard t-test for a treatment effect relies on asymptotic theory: as the sample grows large, the test statistic converges to a known distribution (the normal or t-distribution). This works well with large samples. But what about small randomised experiments—a trial with 20 villages, 8 counties, or 12 firms? Asymptotic approximations can be poor, and the resulting p-values misleading.
There is an older, more elegant approach that does not require asymptotic approximations: randomization inference (RI), developed by R.A. Fisher in the 1930s [Fisher, 1935]. The key insight is that when treatment is randomly assigned, the randomisation itself provides the distribution needed for inference—no asymptotics needed.
2 The Sharp Null Hypothesis
RI tests a specific hypothesis called the sharp null hypothesis: that the treatment has no effect on any unit. Formally, for each unit i:
$$H_0:\; Y_i(1) = Y_i(0) \quad \text{for all } i = 1, \dots, n.$$
This is much stronger than the usual null that the average effect is zero. The sharp null says treatment has zero effect for every single individual, not just on average.
Why test the sharp null? Because it is what makes RI possible. Under the sharp null, both potential outcomes of every unit are known: Yi(1) = Yi(0) = Yi, the observed outcome, whichever treatment the unit actually received. This means we can compute what the test statistic would have been under any alternative treatment assignment.
3 The Logic of Randomization Inference
Here is the key idea in three steps.
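Step 1: Observe the treatment assignment and outcomes. We observe D = (D1, . . . , Dn) (the actual assignment) and Y = (Y1, . . . , Yn) (the observed outcomes). We compute our test statistic, say the difference in means: Tobs = Ȳ1 − Ȳ0, where Ȳ1 and Ȳ0 are the mean observed outcomes of the treated and control units.
Step 2: Consider the distribution of assignments. Because treatment was randomly assigned, we know the set Ω of all possible treatment assignments and the probability of each. Under complete randomisation, Ω contains every way of assigning exactly nT of the n units to treatment, each equally likely. For example, with 6 units and 3 treated, there are C(6, 3) = 6!/(3! 3!) = 20 possible assignments.
Step 3: Under the sharp null, compute the permutation distribution. Under H0, each unit's outcome is fixed regardless of assignment. So for each possible assignment ω ∈ Ω, we can compute what the test statistic would have been. Writing ωi = 1 if assignment ω treats unit i and ωi = 0 otherwise:
$$T_\omega = \frac{1}{n_T}\sum_{i:\,\omega_i = 1} Y_i \;-\; \frac{1}{n - n_T}\sum_{i:\,\omega_i = 0} Y_i .$$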
The collection {Tω : ω ∈ Ω} is the permutation distribution of the test statistic under the null.
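The (two-sided) p-value is simply the share of permuted statistics at least as extreme as the observed one:
$$p = \frac{1}{|\Omega|}\sum_{\omega \in \Omega} \mathbf{1}\{\,|T_\omega| \ge |T^{\text{obs}}|\,\},$$
where the indicator equals 1 when the condition holds and 0 otherwise.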
4 A Worked Example
Suppose we run a small experiment: 6 villages, 3 treated with a health intervention. The outcomes (mortality rates per 1,000) are 12, 15 and 14 in the three treated villages and 18, 20 and 17 in the three control villages.
The observed mean difference: Tobs = (12+15+14)/3 − (18+20+17)/3 = 13.67 − 18.33 = −4.67.
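Under the sharp null, the outcomes are fixed. We list all C(6, 3) = 20 possible assignments of 3 villages to treatment, compute the mean difference for each, and count how many have |Tω| ≥ 4.67. Here exactly two assignments qualify: the observed one (Tω = −4.67) and its mirror image, in which the three control villages are treated instead (Tω = +4.67). The two-sided randomization p-value is therefore 2/20 = 0.10. A one-sided test that counts only assignments with Tω ≤ −4.67 gives 1/20 = 0.05, the smallest p-value attainable with 20 equally likely assignments.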
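The enumeration just described can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the variable names and the helper `diff_in_means` are assumptions made for the example, and the outcome values are the ones that enter the observed mean difference above.

```python
from itertools import combinations

# Outcomes from the worked example: mortality rates per 1,000.
# The first three entries are the treated villages, the last three the controls.
outcomes = [12, 15, 14, 18, 20, 17]
treated_observed = (0, 1, 2)  # positions of the treated villages in `outcomes`


def diff_in_means(treated_idx, y):
    """Mean outcome of the treated units minus mean outcome of the controls."""
    treated = [y[i] for i in treated_idx]
    control = [y[i] for i in range(len(y)) if i not in treated_idx]
    return sum(treated) / len(treated) - sum(control) / len(control)


t_obs = diff_in_means(treated_observed, outcomes)

# Under the sharp null the outcomes stay fixed, so the statistic can be
# recomputed for every one of the 20 ways of treating 3 of the 6 villages.
assignments = list(combinations(range(len(outcomes)), 3))
perm_stats = [diff_in_means(a, outcomes) for a in assignments]

tol = 1e-9  # guards against floating-point noise when statistics tie
# Two-sided: the absolute-value criterion also counts mirror-image assignments.
n_two_sided = sum(abs(t) >= abs(t_obs) - tol for t in perm_stats)
# One-sided: only assignments with a difference at least as negative.
n_one_sided = sum(t <= t_obs + tol for t in perm_stats)

p_two_sided = n_two_sided / len(assignments)
p_one_sided = n_one_sided / len(assignments)

print(f"T_obs = {t_obs:.2f}")
print(f"two-sided p = {n_two_sided}/{len(assignments)} = {p_two_sided:.2f}")
print(f"one-sided p = {n_one_sided}/{len(assignments)} = {p_one_sided:.2f}")
```

Reporting both counts makes explicit which notion of "as extreme" is being used, which matters a great deal when Ω has only 20 elements.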
With small experiments, this exact enumeration is feasible. With larger samples, one uses Monte Carlo: draw many random permutations and approximate the permutation distribution.
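A minimal Python sketch of the Monte Carlo approximation, assuming complete randomisation (a fixed number of treated units) and the difference in means as the test statistic. The function names, the number of draws and the seed are illustrative choices, and the data are the six village outcomes from the worked example, used here only so the result can be compared with exact enumeration.

```python
import random


def diff_in_means(y, d):
    """Mean outcome of treated units (d == 1) minus mean of controls (d == 0)."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def monte_carlo_ri(y, d, n_draws=10_000, seed=0):
    """Approximate the two-sided RI p-value by sampling random assignments."""
    rng = random.Random(seed)
    t_obs = diff_in_means(y, d)
    tol = 1e-9  # guards against floating-point noise when statistics tie
    d_perm = list(d)
    extreme = 0
    for _ in range(n_draws):
        # Shuffling the labels keeps the number of treated units fixed,
        # which matches complete randomisation.
        rng.shuffle(d_perm)
        if abs(diff_in_means(y, d_perm)) >= abs(t_obs) - tol:
            extreme += 1
    return extreme / n_draws


y = [12, 15, 14, 18, 20, 17]
d = [1, 1, 1, 0, 0, 0]
p_hat = monte_carlo_ri(y, d)
print(f"Monte Carlo p-value: {p_hat:.3f}")
```

The estimate carries simulation error of order 1/sqrt(n_draws), so the number of draws should be large relative to the precision needed in the p-value.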
5 Randomization Inference in R
The ri2 and ri packages in R implement randomization inference and support more complex designs.
6 Advantages and Limitations
6.1 Advantages
Exact finite-sample validity: The p-value is exact (not asymptotic) under the sharp null, regardless of sample size.
No distributional assumptions: Works for non-normal outcomes, binary outcomes, skewed distributions.
Flexible test statistics: Can use any test statistic—rank statistics, maximum effects, quantile effects—not just means.
Mirrors the design: The inference procedure mirrors how the data were generated, making it design-based.
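To illustrate the flexibility in the choice of test statistic, here is a Python sketch in which the statistic is passed in as a function. The names `exact_ri_p_value`, `diff_in_medians` and `centred_rank_sum` are hypothetical helpers written for this example; the rank statistic assumes no tied outcomes, and the data are the six village outcomes from the worked example.

```python
from itertools import combinations
from statistics import median


def exact_ri_p_value(y, treated_idx, stat):
    """Two-sided exact RI p-value for any statistic stat(treated, control)."""
    n, n_t = len(y), len(treated_idx)

    def evaluate(idx):
        chosen = set(idx)
        treated = [y[i] for i in range(n) if i in chosen]
        control = [y[i] for i in range(n) if i not in chosen]
        return stat(treated, control)

    t_obs = evaluate(treated_idx)
    # Complete randomisation: every subset of size n_t is equally likely.
    perm = [evaluate(a) for a in combinations(range(n), n_t)]
    tol = 1e-9
    return sum(abs(t) >= abs(t_obs) - tol for t in perm) / len(perm)


def diff_in_medians(treated, control):
    return median(treated) - median(control)


def centred_rank_sum(treated, control):
    """Sum of treated ranks minus its expectation under the null (no ties assumed)."""
    pooled = sorted(treated + control)
    rank_sum = sum(pooled.index(v) + 1 for v in treated)
    return rank_sum - len(treated) * (len(pooled) + 1) / 2


y = [12, 15, 14, 18, 20, 17]
treated_idx = (0, 1, 2)
p_median = exact_ri_p_value(y, treated_idx, diff_in_medians)
p_rank = exact_ri_p_value(y, treated_idx, centred_rank_sum)
print(f"difference in medians: p = {p_median:.2f}")
print(f"centred rank sum:      p = {p_rank:.2f}")
```

Only the `stat` argument changes between the two calls; the permutation logic is untouched, which is what makes the procedure valid for any statistic chosen before looking at the results.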
6.2 Limitations
Sharp null: RI tests H0 : Yi(1) = Yi(0) for all i, not just that the average effect is zero. Rejecting it says only that the treatment affected at least some units; failing to reject says nothing more than that the data are consistent with zero effects for everyone.
No confidence intervals directly: RI does not produce a confidence interval by itself. To obtain one, the test must be inverted: test a range of constant effects τ0 (the Hodges–Lehmann approach) and keep the values that are not rejected, which is more involved.
Clusters and stratification: More complex designs require matching the permutation scheme to the assignment scheme—but this is conceptually straightforward.
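The test-inversion idea for confidence intervals can be sketched as follows, assuming a constant additive effect τ0 for every unit and complete randomisation. Under that hypothesis, subtracting τ0 from the treated units' outcomes recovers the untreated outcome for everyone, so the usual permutation test applies to the adjusted data. The function names, the grid and the level are illustrative assumptions; the data are the six village outcomes from the worked example.

```python
from itertools import combinations


def diff_in_means(y, treated_idx):
    chosen = set(treated_idx)
    treated = [v for i, v in enumerate(y) if i in chosen]
    control = [v for i, v in enumerate(y) if i not in chosen]
    return sum(treated) / len(treated) - sum(control) / len(control)


def ri_p_value_constant_effect(y, treated_idx, tau0):
    """Exact two-sided RI p-value for H0: Y_i(1) - Y_i(0) = tau0 for every unit."""
    chosen = set(treated_idx)
    # Under H0, removing tau0 from treated outcomes gives Y_i(0) for all units.
    y0 = [v - tau0 if i in chosen else v for i, v in enumerate(y)]
    t_obs = diff_in_means(y0, treated_idx)
    perm = [diff_in_means(y0, a)
            for a in combinations(range(len(y)), len(treated_idx))]
    tol = 1e-9
    return sum(abs(t) >= abs(t_obs) - tol for t in perm) / len(perm)


def confidence_set(y, treated_idx, grid, alpha):
    """Grid values of tau0 that the RI test does not reject at level alpha."""
    return [tau0 for tau0 in grid
            if ri_p_value_constant_effect(y, treated_idx, tau0) > alpha]


y = [12, 15, 14, 18, 20, 17]
treated_idx = (0, 1, 2)
grid = [k / 2 for k in range(-30, 31)]  # candidate effects from -15 to 15
# With 20 assignments the two-sided p-value cannot fall below 0.10,
# so a level below 0.10 would never reject anything.
cs = confidence_set(y, treated_idx, grid, alpha=0.10)
print(f"90% confidence set on the grid: [{min(cs)}, {max(cs)}]")
```

The resolution of the interval is limited by the grid spacing, and with so few possible assignments the attainable confidence levels are coarse.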
7 Connection to the Synthetic Control
One of the most celebrated applications of RI is the synthetic control method [Abadie et al., 2010]. When there is only one treated unit (e.g. one state or country), standard t-tests are meaningless—there is no distribution to appeal to. Abadie et al. use a form of RI: they apply the synthetic control procedure to each donor unit in turn (as if each were treated), and use the resulting distribution of placebo effects to assess whether the effect on the true treated unit is unusual. This is exactly the permutation logic, applied to the synthetic control estimator rather than the simple difference in means.
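The placebo logic reduces to a rank comparison once each unit has a summary statistic. The sketch below assumes such statistics have already been computed by running the synthetic control procedure on the treated unit and on each donor; the function name `placebo_p_value` and all the numbers are hypothetical placeholders for illustration, not results from any study.

```python
def placebo_p_value(treated_stat, placebo_stats):
    """Share of units, treated unit included, with a statistic at least as
    large as the treated unit's."""
    all_stats = [treated_stat] + list(placebo_stats)
    return sum(s >= treated_stat for s in all_stats) / len(all_stats)


# Hypothetical effect summaries (made-up values): one treated unit, nine donors.
treated_stat = 9.0
placebo_stats = [1.2, 0.8, 3.5, 2.1, 0.6, 4.4, 1.9, 2.7, 1.1]

p = placebo_p_value(treated_stat, placebo_stats)
print(f"placebo p-value: {p:.2f}")
```

With J donors the smallest attainable p-value is 1/(J + 1), so a small donor pool limits how strong the evidence can ever appear.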
8 Common Mistakes
Wrong permutation scheme: If the design included stratification or clustering, the permutations must respect that structure—not just permute treatment labels freely.
Confusing rejection with effect size: A significant RI p-value tells you the data are unlikely under the sharp null; it does not tell you how large the effect is.
Using RI for observational data: RI is only valid when treatment is actually randomly assigned. In observational studies, the permutation distribution has no physical meaning.
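To make the first of these mistakes concrete, here is a Python sketch of a permutation scheme that respects stratification: treatment labels are shuffled separately within each stratum, so every simulated assignment has the same number of treated units per stratum as the actual design. The function names and the data are hypothetical, and the sketch assumes treatment was assigned by complete randomisation inside each stratum.

```python
import random
from collections import defaultdict


def diff_in_means(y, d):
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def shuffle_within_strata(d, strata, rng):
    """Return a new assignment with labels permuted inside each stratum only."""
    members = defaultdict(list)
    for i, s in enumerate(strata):
        members[s].append(i)
    d_new = list(d)
    for idx in members.values():
        labels = [d[i] for i in idx]
        rng.shuffle(labels)
        for i, label in zip(idx, labels):
            d_new[i] = label
    return d_new


def stratified_ri(y, d, strata, n_draws=5000, seed=0):
    """Monte Carlo two-sided RI p-value under stratified randomisation."""
    rng = random.Random(seed)
    t_obs = diff_in_means(y, d)
    extreme = 0
    for _ in range(n_draws):
        d_perm = shuffle_within_strata(d, strata, rng)
        if abs(diff_in_means(y, d_perm)) >= abs(t_obs) - 1e-9:
            extreme += 1
    return extreme / n_draws


# Hypothetical data: two strata of four units, two treated in each.
y = [10, 12, 15, 16, 30, 33, 36, 35]
d = [1, 1, 0, 0, 1, 1, 0, 0]
strata = ["A", "A", "A", "A", "B", "B", "B", "B"]

p_strat = stratified_ri(y, d, strata)
print(f"stratified RI p-value: {p_strat:.3f}")
```

Permuting labels freely across all eight units would instead include assignments, such as treating all four units of one stratum, that the actual design could never have produced.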
9 Where to Learn More
Fisher [1935] — the original source
Rosenbaum [2002] — comprehensive treatment including sensitivity analysis
Imbens and Rubin [2015] — causal inference textbook with RI chapter
The ri2 R package vignette — practical implementation
10 Conclusion
Randomization inference is one of the cleanest tools in the causal inference toolkit. It asks a simple question—"how often would random chance produce a result this extreme?"—and answers it directly, using the randomisation mechanism itself as the source of the null distribution. For researchers working with small or cluster-randomised experiments, it is often the preferred approach to inference.
References
- Abadie, A., Diamond, A., and Hainmueller, J. Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105(490):493–505, 2010.
- Fisher, R. A. The Design of Experiments. Oliver and Boyd, Edinburgh, 1935.
- Imbens, G. W. and Rubin, D. B. Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press, Cambridge, 2015.
- Rosenbaum, P. R. Observational Studies, 2nd ed. Springer, New York, 2002.