Beginner's Corner

Randomization Inference: Fisher's Exact Test and Permutation P-Values

1 The Problem with Conventional P-Values

The standard t-test for a treatment effect relies on asymptotic theory: as the sample grows large, the test statistic converges to a known distribution (the normal or t-distribution). This works well with large samples. But what about small randomised experiments—a trial with 20 villages, 8 counties, or 12 firms? Asymptotic approximations can be poor, and the resulting p-values misleading.

There is an older, more elegant approach that does not require asymptotic approximations: randomization inference (RI), developed by R.A. Fisher in the 1930s [Fisher, 1935]. The key insight is that when treatment is randomly assigned, the randomisation itself provides the distribution needed for inference—no asymptotics needed.

2 The Sharp Null Hypothesis

RI tests a specific hypothesis called the sharp null hypothesis: that the treatment has no effect on any unit. Formally, for each unit i:

\[ H_0 : Y_i(1) = Y_i(0) \quad \text{for all } i \tag{1} \]

This is much stronger than the usual null that the average effect is zero. The sharp null says treatment has zero effect for every single individual, not just on average.

Why test the sharp null? Because it is what makes RI possible. Under the sharp null, both potential outcomes of every unit are known: \( Y_i(1) = Y_i(0) = Y_i \), the observed outcome, regardless of treatment status. This means we can compute what the test statistic would have been under any alternative treatment assignment.

3 The Logic of Randomization Inference

Here is the key idea in three steps.

Step 1: Observe the treatment assignment and outcomes. We observe \( D = (D_1, \ldots, D_n) \) (the actual assignment) and \( Y = (Y_1, \ldots, Y_n) \) (the observed outcomes). We compute our test statistic, say the difference in means: \( T^{\text{obs}} = \bar{Y}_1 - \bar{Y}_0 \).

Step 2: Consider the distribution of assignments. Because treatment was randomly assigned, we know the set Ω of all possible treatment assignments: all ways to assign exactly \( n_T \) units to treatment. For example, with 6 units and 3 treated, there are \( \binom{6}{3} = 20 \) possible assignments.
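The count in Step 2 can be checked in base R. This is a minimal sketch; the variable names are illustrative and not part of the original text.

```r
# Number of ways to choose 3 treated units out of 6
n_assignments <- choose(6, 3)

# combn() lists the assignments themselves: each column holds the
# indices of the treated units in one possible assignment
treated_sets <- combn(6, 3)

n_assignments        # 20
ncol(treated_sets)   # 20
treated_sets[, 1]    # first assignment: units 1, 2, 3 treated
```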

Step 3: Under the sharp null, compute the permutation distribution. Under H0, each unit's outcome is fixed regardless of assignment. So for each possible assignment ω ∈ Ω, we can compute what the test statistic would have been:

\[ T^{\omega} = \frac{1}{n_T} \sum_{i:\,D_i^{\omega}=1} Y_i \;-\; \frac{1}{n_C} \sum_{i:\,D_i^{\omega}=0} Y_i \]

The collection \( \{T^{\omega} : \omega \in \Omega\} \) is the permutation distribution of the test statistic under the null.

The p-value is simply the share of permuted statistics at least as extreme as the observed one:

\[ p = \frac{\left|\{\omega \in \Omega : |T^{\omega}| \geq |T^{\text{obs}}|\}\right|}{|\Omega|} \tag{2} \]

4 A Worked Example

Suppose we run a small experiment: 6 villages, 3 treated with a health intervention. The outcomes (mortality rates per 1,000) are:

Table 1: Small experiment data
Village   Treated?   Mortality (per 1,000)
A         Yes        12
B         Yes        15
C         Yes        14
D         No         18
E         No         20
F         No         17

The observed mean difference: \( T^{\text{obs}} = (12+15+14)/3 - (18+20+17)/3 \approx 13.67 - 18.33 = -4.67 \).

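Under the sharp null, the outcomes are fixed. We list all \( \binom{6}{3} = 20 \) possible assignments of 3 villages to treatment, compute the mean difference for each, and count how many have \( |T^{\omega}| \geq 4.67 \). Here the three treated villages have the three lowest mortality rates, so the observed assignment gives the most negative statistic possible. Only its mirror image (D, E and F treated), with \( T^{\omega} = +4.67 \), is equally extreme. Two of the 20 assignments therefore count, and the two-sided randomization p-value is 2/20 = 0.10. A one-sided test (is mortality lower under treatment?) would count only the observed assignment and give 1/20 = 0.05.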
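The enumeration for Table 1 is small enough to run exactly. The sketch below uses only base R; the helper `diff_means` and the other variable names are illustrative assumptions, not part of the original text.

```r
# Exact randomization test by full enumeration (village data from Table 1)
Y <- c(A = 12, B = 15, C = 14, D = 18, E = 20, F = 17)
treated_obs <- c("A", "B", "C")

# Difference in means for a given set of treated villages
diff_means <- function(treated) {
  is_treated <- names(Y) %in% treated
  mean(Y[is_treated]) - mean(Y[!is_treated])
}

T_obs <- diff_means(treated_obs)

# One statistic for each of the choose(6, 3) = 20 assignments
T_all <- combn(names(Y), 3, FUN = diff_means)

# Small tolerance guards against floating-point ties
n_extreme <- sum(abs(T_all) >= abs(T_obs) - 1e-12)
p_exact <- n_extreme / length(T_all)

n_extreme   # 2
p_exact     # 0.1
```

Because the two groups have equal size, every assignment has a mirror image with the opposite sign, so a two-sided p-value from 20 assignments can never fall below 2/20.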

With small experiments, this exact enumeration is feasible. With larger samples, one uses Monte Carlo: draw many random permutations and approximate the permutation distribution.

5 Randomization Inference in R

# Randomization inference for a simple experiment
set.seed(42)
Y <- c(12, 15, 14, 18, 20, 17) # outcomes
D <- c(1, 1, 1, 0, 0, 0) # treatment

# Observed test statistic
T_obs <- mean(Y[D==1]) - mean(Y[D==0])

# Monte Carlo permutation test (10,000 permutations)
B <- 10000
T_perm <- replicate(B, {
  D_perm <- sample(D) # randomly permute treatment
  mean(Y[D_perm==1]) - mean(Y[D_perm==0])
})

# Two-sided p-value
p_val <- mean(abs(T_perm) >= abs(T_obs))
cat("Observed statistic:", round(T_obs, 3), "\n")
cat("Randomization p-value:", round(p_val, 4), "\n")

The ri2 and ri packages in R support more complex designs. With ri2, the same test looks like this:

library(ri2)
# Declare the design: complete randomization, 6 units, 3 treated
declaration <- declare_ra(N = 6, m = 3)
ri2_result <- conduct_ri(
  formula = Y ~ Z,
  data = data.frame(Y = Y, Z = D), # Y and D as defined above
  declaration = declaration,
  sharp_hypothesis = 0 # sharp null of zero effect
)
summary(ri2_result)

6 Advantages and Limitations

6.1 Advantages

Exact finite-sample validity: The p-value is exact (not asymptotic) under the sharp null, regardless of sample size.

No distributional assumptions: Works for non-normal outcomes, binary outcomes, skewed distributions.

Flexible test statistics: Can use any test statistic—rank statistics, maximum effects, quantile effects—not just means.

Mirrors the design: The inference procedure mirrors how the data were generated, making it design-based.
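To illustrate the flexibility in test statistics noted above, the sketch below swaps the difference in means for a difference in mean ranks, using the village data from the worked example. The choice of statistic and the helper name `rank_stat` are illustrative assumptions; any statistic computed the same way under every assignment would do.

```r
# Permutation test with a rank-based statistic (illustrative choice)
Y <- c(12, 15, 14, 18, 20, 17)
D <- c(1, 1, 1, 0, 0, 0)

# Difference in mean ranks between treated and control units
rank_stat <- function(y, d) {
  r <- rank(y)
  mean(r[d == 1]) - mean(r[d == 0])
}

T_obs <- rank_stat(Y, D)

# Enumerate every assignment of 3 treated units among 6
T_all <- combn(length(Y), 3, FUN = function(idx) {
  d <- integer(length(Y))
  d[idx] <- 1L
  rank_stat(Y, d)
})

# Exact two-sided p-value (tolerance guards against floating-point ties)
p_rank <- mean(abs(T_all) >= abs(T_obs) - 1e-12)

T_obs    # -3
p_rank   # 0.1
```

Only the function that computes the statistic changes; the enumeration and the p-value calculation stay the same.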

6.2 Limitations

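Sharp null: RI tests \( H_0 : Y_i(1) = Y_i(0) \) for all i, not just that the average effect is zero. Rejecting it shows only that treatment affected at least some units, not that the average effect is nonzero. Failing to reject says nothing more than that the data are consistent with zero effects for everyone.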

No confidence intervals directly: To invert RI into confidence intervals, one must test a range of constant effects (the Hodges–Lehmann approach), which is more involved.

Clusters and stratification: More complex designs require matching the permutation scheme to the assignment scheme—but this is conceptually straightforward.
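The test-inversion route to confidence intervals mentioned above can be sketched as follows. For each candidate constant effect tau, subtract tau from the treated outcomes to recover the implied untreated outcomes, then run the usual exact test on those. The grid, the level `alpha`, and all names are illustrative assumptions. With only 20 assignments the smallest possible two-sided p-value is 0.10, so the sketch uses an 80% interval; a 95% interval could never exclude any value here.

```r
# Confidence set by inverting randomization tests of constant effects
Y <- c(12, 15, 14, 18, 20, 17)
D <- c(1, 1, 1, 0, 0, 0)

# All 20 possible assignments, one per column (1 = treated)
assignments <- combn(length(Y), sum(D), FUN = function(idx) {
  d <- integer(length(Y))
  d[idx] <- 1L
  d
})

# Exact two-sided p-value for H0: Y_i(1) - Y_i(0) = tau for every unit
p_for_tau <- function(tau) {
  Y0 <- Y - tau * D   # implied untreated outcomes under H0
  diff0 <- function(d) mean(Y0[d == 1]) - mean(Y0[d == 0])
  T_obs <- diff0(D)
  T_all <- apply(assignments, 2, diff0)
  mean(abs(T_all) >= abs(T_obs) - 1e-12)
}

tau_grid <- seq(-12, 4, by = 0.25)   # candidate effects (illustrative grid)
p_grid <- sapply(tau_grid, p_for_tau)

alpha <- 0.20
# Hull of the values not rejected at level alpha; the retained set
# need not be contiguous in general
ci <- range(tau_grid[p_grid > alpha])
ci
```

The interval always contains values near the point estimate of about -4.67, since at that value the adjusted difference in means is zero and the p-value is 1.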

7 Connection to the Synthetic Control

One of the most celebrated applications of RI is the synthetic control method [Abadie et al., 2010]. When there is only one treated unit (e.g. one state or country), standard t-tests are meaningless—there is no distribution to appeal to. Abadie et al. use a form of RI: they apply the synthetic control procedure to each donor unit in turn (as if each were treated), and use the resulting distribution of placebo effects to assess whether the effect on the true treated unit is unusual. This is exactly the permutation logic, applied to the synthetic control estimator rather than the simple difference in means.

8 Common Mistakes

Wrong permutation scheme: If the design included stratification or clustering, the permutations must respect that structure—not just permute treatment labels freely.

Confusing rejection with effect size: A significant RI p-value tells you the data are unlikely under the sharp null; it does not tell you how large the effect is.

Using RI for observational data: RI is only valid when treatment is actually randomly assigned. In observational studies, the permutation distribution has no physical meaning.
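The first mistake above, a wrong permutation scheme, is easy to demonstrate. The sketch below uses made-up data for a hypothetical blocked design (two strata of four units, two treated in each) and compares shuffling labels within blocks against shuffling them freely. All data and names are illustrative assumptions.

```r
# Hypothetical blocked design: 2 strata of 4 units, 2 treated per stratum
set.seed(1)
Y <- c(10, 12, 15, 14, 30, 33, 36, 35)   # made-up outcomes
block <- rep(c("low", "high"), each = 4)
D <- c(0, 0, 1, 1, 0, 0, 1, 1)

diff_means <- function(d) mean(Y[d == 1]) - mean(Y[d == 0])
T_obs <- diff_means(D)

B <- 5000
# Matches the design: shuffle treatment labels within each block
T_blocked <- replicate(B, diff_means(ave(D, block, FUN = sample)))
# Does not match the design: shuffle labels across all units
T_free <- replicate(B, diff_means(sample(D)))

p_blocked <- mean(abs(T_blocked) >= abs(T_obs) - 1e-12)
p_free <- mean(abs(T_free) >= abs(T_obs) - 1e-12)

c(blocked = p_blocked, free = p_free)
```

In this example the free shuffle produces many assignments that put mostly high-stratum units in one arm, which could never have happened under the actual design. The resulting p-value is far larger than the blocked one (roughly 0.5 against roughly 0.06), so a real within-block difference would be missed.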

9 Where to Learn More

Fisher [1935] — the original source

Rosenbaum [2002] — comprehensive treatment including sensitivity analysis

Imbens and Rubin [2015] — causal inference textbook with RI chapter

The ri2 R package vignette — practical implementation

10 Conclusion

Randomization inference is one of the cleanest tools in the causal inference toolkit. It asks a simple question—"how often would random chance produce a result this extreme?"—and answers it directly, using the randomisation mechanism itself as the source of the null distribution. For researchers working with small or cluster-randomised experiments, it is often the preferred approach to inference.

References

  1. Abadie, A., Diamond, A., and Hainmueller, J. Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105(490):493–505, 2010.
  2. Fisher, R. A. The Design of Experiments. Oliver and Boyd, Edinburgh, 1935.
  3. Imbens, G. W. and Rubin, D. B. Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press, Cambridge, 2015.
  4. Rosenbaum, P. R. Observational Studies, 2nd ed. Springer, New York, 2002.
