1 The Motivating Problem
You are planning a randomised evaluation of a job-training programme. You have a budget for 300 participants. Is that enough to detect a meaningful effect? The answer depends on how large the effect is, how variable wages are, and how you design the study. Without a power calculation, you may invest heavily in a study that has little chance of detecting a treatment effect even when one exists; statisticians call such a study underpowered.
Power analysis is the pre-study discipline of determining the minimum sample size needed to detect an effect of a given size with a specified probability, given the outcome variability. It is required by funding agencies, ethics boards, and increasingly by journals. More importantly, it is good science: an underpowered study wastes resources and produces noisy estimates that mislead.
2 The Key Concepts
Three quantities govern the power of a hypothesis test:
• Type I error rate (α). The probability of rejecting the null hypothesis when it is true, i.e. a false positive. Conventionally set to 0.05 (5%).
• Power (1 – κ). The probability of rejecting the null when the alternative is true, i.e. a true positive. Here κ is the Type II error rate, the probability of a false negative. Power is conventionally targeted at 0.80 (80%) or 0.90 (90%).
• Effect size. How large the treatment effect you want to detect is, relative to the variability in the outcome.
Together with the sample size, these quantities are tied by a single relationship: fix any three and the fourth follows. A larger effect size means fewer observations are needed to achieve the same power; a noisier outcome, a stricter α, or a higher power target means more are needed.
3 The Basic Formula
Consider a two-sample difference-in-means test: N/2 units receive treatment, N/2 receive control. The outcome Yᵢ has variance σ² in both groups. The true treatment effect is τ = μₜ – μ꜀.
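The test statistic is:

$$T = \frac{\bar{Y}_t - \bar{Y}_c}{\hat{\sigma}\sqrt{4/N}} = \frac{\sqrt{N}\,(\bar{Y}_t - \bar{Y}_c)}{2\hat{\sigma}} \tag{1}$$

where Ȳₜ and Ȳ꜀ are the sample means of the two groups and σ̂ is the pooled sample standard deviation. Under the null, T follows a t-distribution with N – 2 degrees of freedom. In large samples, the power of a two-sided test at level α against a true effect τ > 0 is approximately:

$$\text{Power} \approx \Phi\!\left(\frac{\tau\sqrt{N}}{2\sigma} - z_{1-\alpha/2}\right) \tag{2}$$

where Φ is the standard normal CDF and z₁₋α/₂ = 1.96 for α = 0.05. The approximation drops the negligible probability of rejecting in the wrong tail. Solving equation (2) for N at power 1 – κ gives the classic formula:

$$N = \frac{4\sigma^2\,(z_{1-\alpha/2} + z_{1-\kappa})^2}{\tau^2} \tag{3}$$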
For α = 0.05 and power = 0.80, (z₀.₉₇₅ + z₀.₈₀)² = (1.96 + 0.84)² = 7.84.
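The two large-sample expressions can be checked numerically. The sketch below assumes N = 4σ²(z₁₋α/₂ + z₁₋κ)²/τ² for the total sample size and Φ(τ√N/(2σ) − z₁₋α/₂) for approximate power, with equal allocation to the two arms. The function names and the illustrative inputs (τ = 0.5, σ = 1) are assumptions made for this example, not part of the original text.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def required_total_n(tau, sigma, alpha=0.05, power=0.80):
    """Total N across both arms (equal allocation), large-sample approximation."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = STD_NORMAL.inv_cdf(power)          # quantile for the target power
    return 4 * sigma**2 * (z_alpha + z_power) ** 2 / tau**2


def approx_power(tau, sigma, n_total, alpha=0.05):
    """Approximate power of the two-sided test, ignoring wrong-tail rejections."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(abs(tau) * sqrt(n_total) / (2 * sigma) - z_alpha)


# The multiplier that appears in the numerator of the sample-size formula.
multiplier = (STD_NORMAL.inv_cdf(0.975) + STD_NORMAL.inv_cdf(0.80)) ** 2
print(f"(z_0.975 + z_0.80)^2 = {multiplier:.3f}")

# Round trip: the N returned for 80% power should give back 80% power.
n = required_total_n(tau=0.5, sigma=1.0)
print(f"N = {n:.1f}, power at that N = {approx_power(0.5, 1.0, n):.3f}")
```

With unrounded quantiles the multiplier comes out at about 7.85 rather than 7.84; the small gap is due to rounding z to two decimals in the hand calculation.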
3.1 Working with Standardised Effect Sizes
It is often more natural to express the effect size in standard deviations: δ = τ/σ. Then equation (3) becomes:

$$N = \frac{4\,(z_{1-\alpha/2} + z_{1-\kappa})^2}{\delta^2} \tag{4}$$
Table 1 illustrates a critical insight: power calculations are highly sensitive to the assumed effect size. At 80% power, researchers who assume a medium effect (δ = 0.5) need about 126 total observations (63 per arm), but if the true effect is small (δ = 0.2), they need roughly 785, more than six times as many. Because N is proportional to 1/δ², halving the assumed effect quadruples the required sample. Overestimating the effect size, a common temptation, leads to severely underpowered studies.
Table 1: Required total sample size for two-group comparison at α = 0.05
Note: δ = τ/σ. Sample sizes are rounded up to the nearest even number.
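Entries of this kind can be regenerated from the standardised formula N = 4(z₁₋α/₂ + z₁₋κ)²/δ². The following is a minimal sketch: the grid of δ values and the two power levels are illustrative assumptions, not necessarily the rows and columns of Table 1, and `total_n` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist


def total_n(delta, alpha=0.05, power=0.80):
    """Total sample size for a two-group comparison, rounded up to an even number."""
    z = NormalDist()
    raw = 4 * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2 / delta**2
    n = ceil(raw)
    return n + (n % 2)  # add one if odd, so both arms are the same size


print("delta  N(80%)  N(90%)")
for delta in (0.2, 0.3, 0.5, 0.8):  # assumed grid of standardised effects
    print(f"{delta:>5}  {total_n(delta, power=0.80):>6}  {total_n(delta, power=0.90):>6}")
```

With unrounded normal quantiles, the 80% column gives 786, 350, 126, and 50 for δ = 0.2, 0.3, 0.5, and 0.8. Using the rounded multiplier 7.84 instead would give 784 for δ = 0.2, so small discrepancies of this size are rounding, not error.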
4 A Worked Example
A researcher wants to evaluate a tutoring programme. The outcome is a standardised exam score (by construction, σ = 1 in the population). The programme director believes tutoring raises scores by 0.3 standard deviations (τ = 0.3, δ = 0.3). She wants power of 0.80 at α = 0.05.
Applying the sample-size formula with δ = 0.3:

$$N = \frac{4 \times 7.84}{0.3^2} \approx 348.4$$

She needs approximately 350 students: 175 in treatment, 175 in control. If she can recruit only 200, her study will be underpowered: at N = 200 the power against a 0.3 SD effect is:

$$\Phi\!\left(\frac{0.3\sqrt{200}}{2} - 1.96\right) = \Phi(0.16) \approx 0.56$$

That is a 56% chance of detecting the effect, barely better than a coin flip. The study design needs to change: recruit more students, reduce outcome variance by using pre-test scores as a covariate, or be willing to detect only larger effects.
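The arithmetic of the worked example can be reproduced in a few lines. This sketch assumes the same large-sample formulas as the text (total N = 4(z₁₋α/₂ + z₁₋κ)²/δ², power ≈ Φ(δ√N/2 − z₁₋α/₂)). It also computes the minimum detectable effect at N = 200, a quantity the text alludes to but does not calculate, so treat that line as an added illustration.

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist()
delta, alpha, target_power = 0.3, 0.05, 0.80
z_alpha = z.inv_cdf(1 - alpha / 2)
z_power = z.inv_cdf(target_power)

# Total sample size needed for 80% power against delta = 0.3.
n_required = 4 * (z_alpha + z_power) ** 2 / delta**2

# Power actually achieved if only 200 students can be recruited.
power_at_200 = z.cdf(delta * sqrt(200) / 2 - z_alpha)

# Smallest standardised effect detectable with 80% power at N = 200.
mde_at_200 = 2 * (z_alpha + z_power) / sqrt(200)

print(f"required N:       {n_required:.1f}")
print(f"power at N = 200: {power_at_200:.3f}")
print(f"MDE at N = 200:   {mde_at_200:.3f}")
```

The minimum detectable effect of roughly 0.40 SD makes the third option concrete: with 200 students the study is adequately powered only for effects about a third larger than the one the director expects.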
5 Complications in Practice
5.1 Covariates and ANCOVA
Including a pre-treatment covariate (e.g., a pre-test score Xᵢ) in the regression reduces residual variance. If the covariate explains a proportion ρ² of the outcome variance, the effective residual standard deviation becomes σ√(1 – ρ²) and the required sample size is multiplied by (1 – ρ²). For a covariate with ρ = 0.7 (a common value for pre-test/post-test correlations):

$$N_{\text{adj}} = N\,(1 - \rho^2) = N\,(1 - 0.49) = 0.51\,N$$

The covariate roughly halves the required sample size. This is a strong argument for collecting and using baseline covariates.
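How much a covariate helps depends strongly on ρ. The sketch below applies the (1 – ρ²) scaling to a hypothetical unadjusted total of 350; both the baseline figure and the helper name `adjusted_total_n` are assumptions for illustration.

```python
def adjusted_total_n(n_unadjusted, rho):
    """Scale an unadjusted sample size by the residual-variance share 1 - rho^2."""
    if not -1 < rho < 1:
        raise ValueError("rho must lie strictly between -1 and 1")
    return n_unadjusted * (1 - rho**2)


baseline_n = 350  # hypothetical unadjusted total sample size
for rho in (0.0, 0.3, 0.5, 0.7, 0.9):
    print(f"rho = {rho:.1f}: N = {adjusted_total_n(baseline_n, rho):6.1f}")
```

Because the saving is quadratic in ρ, a weak covariate (ρ = 0.3) trims the sample by only 9%, whereas ρ = 0.7 cuts it by 49%.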
5.2 Clustering
Many field experiments assign treatment at a higher level than the one at which the outcome is measured. In a school-level intervention with student-level outcomes, students within the same school share a common treatment and are likely to have correlated errors. Ignoring this cluster structure and treating students as independent observations will underestimate the required sample size.
The design effect (DEFF) inflates the required sample size:

$$\text{DEFF} = 1 + (m - 1)\,\rho_{\text{ICC}}, \qquad N_{\text{clustered}} = N \times \text{DEFF}$$

where m is the number of units per cluster and ρ_ICC is the intraclass correlation coefficient, the share of total variance attributable to between-cluster differences. For education studies, ρ_ICC is typically 0.05–0.20 for school-level clusters.
DEFF = 1 corresponds to no clustering. For m = 20 and ICC = 0.20, DEFF ≈ 4.8, meaning nearly five times as many observations are needed versus a simple random sample.
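The design effect translates directly into a number of clusters to recruit. The sketch below assumes DEFF = 1 + (m − 1)ρ_ICC with equal cluster sizes and a hypothetical individually randomised requirement of 350 units; the helper names are illustrative.

```python
from math import ceil


def design_effect(m, icc):
    """Variance inflation from clustering with m units per cluster."""
    return 1 + (m - 1) * icc


def clusters_needed(n_individual, m, icc):
    """Clusters of size m needed to match an individually randomised design."""
    total_units = n_individual * design_effect(m, icc)
    # round() guards against floating-point noise such as 84.00000000000001
    return ceil(round(total_units / m, 9))


n_individual = 350  # hypothetical requirement without clustering
for icc in (0.05, 0.10, 0.20):
    deff = design_effect(20, icc)
    print(f"ICC = {icc:.2f}: DEFF = {deff:.2f}, schools = {clusters_needed(n_individual, 20, icc)}")
```

At ICC = 0.20 this gives 84 schools of 20 students each (1,680 students in total). Note that power in clustered designs is driven mainly by the number of clusters, so adding students within schools helps far less than adding schools.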
5.3 Multiple Outcomes and Multiple Testing
When researchers test treatment effects on many outcomes simultaneously, the probability of at least one false positive increases. Bonferroni correction divides the significance level α by the number of tests K, which also increases the required sample size. If you plan to test K = 5 outcomes, a conservative approach sets the effective α to 0.05/5 = 0.01, requiring (z₀.₉₉₅ + z₀.₈₀)² = (2.58 + 0.84)² = 11.7 in the numerator instead of 7.84—a 49% increase in sample size.
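The cost of a Bonferroni correction for other values of K follows from the same multiplier. This is a minimal sketch at 80% power; the set of K values is an illustrative assumption and `multiplier` is a hypothetical helper name.

```python
from statistics import NormalDist


def multiplier(alpha, power=0.80):
    """(z_{1-alpha/2} + z_{power})^2, the numerator term in the sample-size formula."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2


baseline = multiplier(0.05)
for k in (1, 2, 5, 10):  # number of outcomes tested
    ratio = multiplier(0.05 / k) / baseline
    print(f"K = {k:>2}: alpha = {0.05 / k:.4f}, sample size x {ratio:.2f}")
```

For K = 5 the ratio is about 1.49, matching the 49% increase computed by hand. The inflation grows slowly with K because the critical value rises only with the logarithm of the number of tests.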
5.4 Non-Compliance and Attrition
In field experiments, not all assigned-to-treatment units receive treatment (partial compliance) and not all units remain in the study (attrition). Let c be the compliance rate (share of treatment-assigned who take up treatment) and a be the attrition rate (share lost to follow-up). The effective sample size for the LATE estimator scales with c²(1 – a). For c = 0.7 and a = 0.15:

$$c^2(1 - a) = 0.7^2 \times 0.85 \approx 0.42, \qquad \frac{1}{c^2(1 - a)} \approx 2.4$$

Compliance and attrition together can more than double the required sample size: here about 2.4 times as many units must be enrolled.
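The inflation factor 1/(c²(1 – a)) is easy to tabulate for a planning range. The sketch below applies it to a hypothetical full-compliance, no-attrition requirement of 350 units; the baseline and the helper name `inflation_factor` are assumptions for illustration.

```python
def inflation_factor(compliance, attrition):
    """Multiplier on enrolment implied by the c^2 (1 - a) scaling."""
    if not 0 < compliance <= 1:
        raise ValueError("compliance must be in (0, 1]")
    if not 0 <= attrition < 1:
        raise ValueError("attrition must be in [0, 1)")
    return 1 / (compliance**2 * (1 - attrition))


baseline_n = 350  # hypothetical requirement with full compliance and no attrition
for c, a in ((1.0, 0.0), (0.9, 0.10), (0.7, 0.15), (0.5, 0.20)):
    factor = inflation_factor(c, a)
    print(f"c = {c:.1f}, a = {a:.2f}: x {factor:.2f} -> enrol about {baseline_n * factor:.0f}")
```

Compliance enters squared, so it dominates: at c = 0.5 and a = 0.20 the factor is 5, most of which comes from low take-up, not from attrition.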
6 Power Calculations in R
7 Common Mistakes
• Choosing an effect size to justify a desired sample size. Researchers who want to run a small study sometimes input a large expected effect to make the power calculation come out right. This is backwards: the effect size should come from prior literature or a pilot study, not from budget constraints.
• Ignoring clustering. The most common power calculation error in field experiments. Always check whether treatment is assigned at a level above individual observations.
• Targeting 80% power and treating it as a guarantee. An 80% powered study will fail to detect the true effect 20% of the time even under ideal conditions.
• Not pre-specifying the primary outcome. Conducting power calculations post-hoc after seeing the data to justify the sample size is circular. Power calculations should be done before data collection.
8 Where to Learn More
Duflo et al. [2007] provide a practical guide to power calculations for randomised experiments in development economics. Imbens and Rubin [2015] Chapter 9 covers sample size for the potential outcomes framework. For DiD designs, Roth [2022] discusses how pre-trend tests affect the effective power of a DiD study. The R packages pwr, pwrss, and DeclareDesign provide comprehensive power analysis tools.
9 Conclusion
Power calculations are not bureaucratic hurdles—they are essential scientific planning tools. Understanding the sample-size formula, the role of effect size and variance, and how clustering and attrition inflate required sample sizes will help you design studies that have a genuine chance of answering the question they set out to address. An underpowered study does not just waste resources; it contributes noise to the literature and can mislead policy. Doing the power calculation honestly—with realistic effect size priors, not wishful thinking—is an act of scientific integrity.
References
- Duflo, E., Glennerster, R., and Kremer, M. (2007). Using randomization in development economics research: A toolkit. Handbook of Development Economics, 4:3895-3962.
- Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press, Cambridge.
- Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman's critique. Annals of Applied Statistics, 7(1):295-318.
- Roth, J. (2022). Pretest with caution: Event-study estimates after testing for parallel trends. American Economic Review: Insights, 4(3):305-322.
- Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688-701.
- Bloom, H. S. (1995). Minimum detectable effects: A simple way to report the statistical power of experimental designs. Evaluation Review, 19(5):547-556.