The Causal Review

1 Introduction

Few decisions in applied econometrics generate more confusion and more heated referee exchanges than the choice of how to compute standard errors and, crucially, at what level to cluster them. The default OLS standard errors assume independent, homoskedastic errors. In practice, observations are grouped: workers in the same firm, students in the same school, households in the same county, observations in the same time period. When the error term is correlated within these groups, OLS standard errors are too small, t-statistics are inflated, and rejections of null hypotheses happen too often.

This article examines the theory behind clustering, the debate over when and at what level to cluster, the problems that arise with few clusters, and the tools that have emerged for handling edge cases.

2 The Theory of Clustered Standard Errors

2.1 Why OLS Standard Errors Can Be Wrong

Consider the OLS estimator β̂ = (X'X)⁻¹X'Y in the model Yᵢ = Xᵢ'β + uᵢ. The sandwich variance estimator under general heteroskedasticity is:

$$\widehat{\mathrm{Var}}(\hat\beta) = (X'X)^{-1} \left( \sum_{i} X_i X_i' \hat{u}_i^2 \right) (X'X)^{-1}. \tag{1}$$

This "HC" standard error [Eicker, 1967, White, 1980] is robust to heteroskedasticity but still assumes that E[uᵢuⱼ] = 0 for all i ≠ j. When observations are grouped into clusters g = 1, ..., G and errors within clusters are correlated, the appropriate variance estimator is:

$$\widehat{\mathrm{Var}}(\hat\beta)^{CR} = (X'X)^{-1} \left( \sum_{g=1}^{G} X_g' \hat{u}_g \hat{u}_g' X_g \right) (X'X)^{-1}, \tag{2}$$

where X_g and û_g collect the regressors and residuals within cluster g. This cluster-robust variance estimator (CRVE) is consistent as G → ∞ regardless of how the errors are correlated within clusters.
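As a rough illustration of equations (1) and (2), the following sketch computes both sandwich estimators with NumPy on simulated data. The function name `sandwich_variances`, the simulated design (40 clusters of 25 units, with a regressor and a shock shared within clusters) and the absence of any finite-sample correction are assumptions made for this example only.

```python
import numpy as np


def sandwich_variances(X, y, clusters):
    """Return (beta_hat, V_hc, V_cr) for OLS of y on X.

    V_hc follows equation (1) and V_cr follows equation (2).
    Neither applies a finite-sample correction.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    clusters = np.asarray(clusters)

    bread = np.linalg.inv(X.T @ X)            # (X'X)^{-1}
    beta_hat = bread @ X.T @ y
    resid = y - X @ beta_hat                  # u_hat

    # Equation (1): meat = sum_i X_i X_i' u_i^2
    meat_hc = (X * resid[:, None] ** 2).T @ X
    V_hc = bread @ meat_hc @ bread

    # Equation (2): meat = sum_g X_g' u_g u_g' X_g
    k = X.shape[1]
    meat_cr = np.zeros((k, k))
    for g in np.unique(clusters):
        in_g = clusters == g
        score_g = X[in_g].T @ resid[in_g]     # X_g' u_g, a k-vector
        meat_cr += np.outer(score_g, score_g)
    V_cr = bread @ meat_cr @ bread

    return beta_hat, V_hc, V_cr


# Simulated data (assumed design): regressor and part of the error are
# constant within each cluster.
rng = np.random.default_rng(0)
G, n = 40, 25
cluster_id = np.repeat(np.arange(G), n)
x = np.repeat(rng.normal(size=G), n)
u = np.repeat(rng.normal(size=G), n) + rng.normal(size=G * n)
y = 1.0 + 0.5 * x + u
X = np.column_stack([np.ones(G * n), x])

beta_hat, V_hc, V_cr = sandwich_variances(X, y, cluster_id)
se_hc = np.sqrt(np.diag(V_hc))
se_cr = np.sqrt(np.diag(V_cr))
print("slope:", beta_hat[1], "HC SE:", se_hc[1], "cluster SE:", se_cr[1])
```

A useful sanity check on any implementation of this kind: if every observation is placed in its own cluster, equation (2) collapses to equation (1), so the two matrices should coincide.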

2.2 The Moulton Factor

When the treatment variable is constant within clusters (e.g. a policy that applies to every worker in a firm), the degree of inflation of the OLS t-statistic is approximately captured by the Moulton factor [Moulton, 1990]:

$$\sqrt{1 + \rho_x \rho_u \bar{n}}, \tag{3}$$

where ρₓ is the intra-cluster correlation of the treatment, ρᵤ is the intra-cluster correlation of the errors, and n̄ is the average cluster size. If ρₓ = ρᵤ = 0.1 and n̄ = 50, the Moulton factor is √(1 + 0.1 × 0.1 × 50) = √1.5 ≈ 1.22, meaning OLS t-statistics are inflated by about 22%. With n̄ = 200, the factor is √3 ≈ 1.73, an inflation of about 73%. When the treatment is constant within clusters, ρₓ = 1, and the same ρᵤ = 0.1 with n̄ = 50 gives √(1 + 0.1 × 50) = √6 ≈ 2.45.
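The arithmetic in equation (3) is easy to reproduce. The helper below is a minimal sketch; the name `moulton_factor` is chosen here for illustration.

```python
import math


def moulton_factor(rho_x, rho_u, n_bar):
    """Equation (3): sqrt(1 + rho_x * rho_u * n_bar)."""
    return math.sqrt(1.0 + rho_x * rho_u * n_bar)


for n_bar in (50, 200):
    factor = moulton_factor(0.1, 0.1, n_bar)
    print(f"n_bar={n_bar}: factor={factor:.3f}")
# n_bar=50: factor=1.225
# n_bar=200: factor=1.732
```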

Bertrand et al. [2004] showed that in a sample of difference-in-differences (DiD) studies using U.S. state-year panels, ignoring serial correlation within states (a form of within-cluster correlation across time) led to severe over-rejection of the null. The solution they advocated: cluster at the state level.

3 The "Design-Based" Approach: Abadie et al. (2023)

A more recent and theoretically unified perspective comes from Abadie et al. [2023], who argue that the decision to cluster should be grounded in the assignment mechanism (the source of variation being exploited for identification) rather than in whether residuals appear correlated. The key insight is that clustering corrects for two distinct sources of uncertainty:

Sampling uncertainty: If clusters are sampled from a larger population, observations within the sampled cluster are correlated because they share cluster-level characteristics. The CRVE accounts for this.
Design uncertainty: If treatment is assigned at the cluster level (e.g. whole states are either treated or not), then units in the same cluster are correlated through their shared treatment status, regardless of whether their errors are correlated.

Abadie et al. argue that clustering is unnecessary when:

Treatment varies within clusters (individual-level randomisation), and

The sample covers essentially the entire population of interest.

Clustering is necessary when:

Treatment is assigned at the cluster level (even if residuals are uncorrelated), or

Clusters are a sample from a larger population and the estimand concerns population parameters.
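The two lists above amount to a simple either/or rule. The sketch below encodes it as stated here; the function and argument names are hypothetical, and the rule is a simplification of the paper's argument rather than a substitute for it.

```python
def should_cluster(treatment_assigned_at_cluster_level,
                   clusters_sampled_from_population):
    """Return True if either criterion for clustering listed above holds."""
    return bool(treatment_assigned_at_cluster_level
                or clusters_sampled_from_population)


# Individual-level randomisation, sample covers the whole population.
print(should_cluster(False, False))   # False
# Treatment assigned by state, all states observed.
print(should_cluster(True, False))    # True
# Individual-level treatment, but clusters drawn from a larger population.
print(should_cluster(False, True))    # True
```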

This design-based view has important practical implications. Many studies in empirical economics cluster "just in case," adding clusters at a level above the treatment unit. Under the design-based view, this may be unnecessary and may inflate standard errors without justification.

4 At What Level Should We Cluster?

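Even researchers who agree that clustering is necessary often disagree about the level. The general principle is: cluster at the level at which treatment is assigned. If treatment is assigned by state, cluster by state. If it is assigned by state-year, clustering by state-year is adequate only when errors are uncorrelated across years within a state; with serially correlated errors, the evidence in Bertrand et al. [2004] points to clustering by state.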

Problems arise when there are multiple levels of plausible dependence. In many labour market datasets:

Workers in the same firm may have correlated errors (through shared firm-level shocks).

Workers in the same local labour market may have correlated errors (through regional economic conditions).

Workers in the same industry may have correlated errors (through industry shocks).

4.1 Two-Way Clustering

When two non-nested clustering structures are both relevant, Cameron et al. [2011] propose the two-way clustered variance estimator:

$$\widehat{\mathrm{Var}}^{2W} = \hat{V}_G + \hat{V}_H - \hat{V}_{G \cap H}, \tag{4}$$

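where V̂_G is the cluster-robust variance estimator computed with clusters defined by dimension G, V̂_H is the same estimator with clusters defined by dimension H, and V̂_{G∩H} uses clusters defined by their intersection (subtracted to avoid double-counting). This estimator is consistent when the number of clusters in both dimensions grows without bound (G → ∞ and H → ∞).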

A common application: in state-year DiD panels, cluster by both state (for within-state correlation) and year (for common macro shocks), using the two-way clustered estimator.
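Equation (4) can be sketched directly by computing three one-way cluster "meat" matrices and combining them. The code below is an illustrative NumPy version on a simulated state-year panel; the function names, the simulated shocks and the omission of finite-sample corrections are assumptions for this example.

```python
import numpy as np


def cluster_meat(X, resid, ids):
    """Sum over clusters of (X_g' u_g)(X_g' u_g)'."""
    labels, codes = np.unique(ids, return_inverse=True)
    scores = np.zeros((len(labels), X.shape[1]))
    np.add.at(scores, codes, X * resid[:, None])   # row g holds X_g' u_g
    return scores.T @ scores


def two_way_cluster_variance(X, y, ids_g, ids_h):
    """Equation (4): V_G + V_H - V_{G and H}, without small-sample corrections."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    ids_g = np.asarray(ids_g)
    ids_h = np.asarray(ids_h)

    bread = np.linalg.inv(X.T @ X)
    beta_hat = bread @ X.T @ y
    resid = y - X @ beta_hat

    # Intersection clusters: one integer label per distinct (g, h) pair.
    _, code_g = np.unique(ids_g, return_inverse=True)
    _, code_h = np.unique(ids_h, return_inverse=True)
    ids_gh = code_g * (code_h.max() + 1) + code_h

    meat = (cluster_meat(X, resid, ids_g)
            + cluster_meat(X, resid, ids_h)
            - cluster_meat(X, resid, ids_gh))
    return beta_hat, bread @ meat @ bread


# Simulated state-year panel (assumed design): state shocks and year shocks
# enter both the regressor and the error.
rng = np.random.default_rng(0)
n_states, n_years = 30, 20
state = np.repeat(np.arange(n_states), n_years)
year = np.tile(np.arange(n_years), n_states)
N = state.size
x = rng.normal(size=n_states)[state] + rng.normal(size=n_years)[year] + rng.normal(size=N)
u = rng.normal(size=n_states)[state] + rng.normal(size=n_years)[year] + rng.normal(size=N)
y = 1.0 + 0.5 * x + u
X = np.column_stack([np.ones(N), x])

beta_hat, V_2w = two_way_cluster_variance(X, y, state, year)
print("slope:", beta_hat[1], "two-way variance of slope:", V_2w[1, 1])
```

Because equation (4) subtracts a matrix, the result is not guaranteed to be positive semi-definite in finite samples, so it is worth checking the diagonal before taking square roots.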

5 The Few-Clusters Problem

The CRVE is consistent as the number of clusters G → ∞. With few clusters—say G < 30—the asymptotic approximation breaks down, and the t-test based on clustered SEs may reject too often.

Cameron et al. [2008] document that with G = 10 clusters, the rejection rate of a nominal 5% test can exceed 20% using conventional clustered SEs. Their solution is the "wild cluster bootstrap," which generates bootstrap samples by multiplying all residuals in a cluster by the same random sign (a Rademacher draw of +1 or −1), so that the within-cluster correlation structure is preserved.
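To make the mechanics concrete, here is a minimal sketch of one common variant: a bootstrap-t test of the null that a single coefficient is zero, with the null imposed when forming residuals and one Rademacher sign drawn per cluster. The function names, the simple p-value formula (share of bootstrap |t| at least as large as the observed |t|), the default of 999 replications and the simulated data are all assumptions for illustration. This is not the fast algorithm used by dedicated packages.

```python
import numpy as np


def cluster_t_stat(X, y, codes, n_clusters, j):
    """OLS coefficient j divided by its cluster-robust SE (no small-sample correction)."""
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta
    scores = np.zeros((n_clusters, X.shape[1]))
    np.add.at(scores, codes, X * resid[:, None])   # row g holds X_g' u_g
    V = bread @ scores.T @ scores @ bread
    return beta[j] / np.sqrt(V[j, j])


def wild_cluster_bootstrap_p(X, y, clusters, j, n_boot=999, seed=0):
    """Two-sided bootstrap p-value for H0: beta_j = 0 (X needs at least two columns)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    labels, codes = np.unique(np.asarray(clusters), return_inverse=True)
    G = len(labels)
    rng = np.random.default_rng(seed)

    t_obs = cluster_t_stat(X, y, codes, G, j)

    # Restricted fit: drop column j, which imposes beta_j = 0.
    X_r = np.delete(X, j, axis=1)
    beta_r, *_ = np.linalg.lstsq(X_r, y, rcond=None)
    fitted_r = X_r @ beta_r
    resid_r = y - fitted_r

    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        signs = rng.choice([-1.0, 1.0], size=G)    # one sign per cluster
        y_star = fitted_r + signs[codes] * resid_r
        t_boot[b] = cluster_t_stat(X, y_star, codes, G, j)

    return float(np.mean(np.abs(t_boot) >= abs(t_obs)))


# Simulated example (assumed design): 12 clusters, half of them treated.
rng = np.random.default_rng(0)
G, n = 12, 30
cluster_id = np.repeat(np.arange(G), n)
treated = (cluster_id < G // 2).astype(float)
y = 0.3 * treated + np.repeat(rng.normal(size=G), n) + rng.normal(size=G * n)
X = np.column_stack([np.ones(G * n), treated])

p_value = wild_cluster_bootstrap_p(X, y, cluster_id, j=1)
print("wild cluster bootstrap p-value:", p_value)
```

Flipping the sign of a whole cluster's residual vector at once is what keeps the within-cluster dependence intact in every bootstrap sample.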

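MacKinnon and Webb [2017] show that inference based on the CRVE deteriorates when cluster sizes are very unequal, that wild cluster bootstrap critical values perform much better in that setting, and that the bootstrap itself fails when only a few clusters are treated. Fast implementations of the wild cluster bootstrap are available in the fwildclusterboot package for R and the boottest command for Stata.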

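Imbens and Kolesár [2016] recommend bias-adjusted variance estimators combined with a degrees-of-freedom correction to the reference t-distribution. Others propose inference procedures based on randomisation/permutation principles that can remain valid with only a handful of clusters under additional assumptions about treatment assignment.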

6 The Debate in Practice

The choice of clustering level is ultimately an empirical judgment call, but it has generated genuine controversy:

The "kitchen sink" tendency: Many applied papers cluster at a high level (state, country) as a conservative default, even when treatment is randomised at the individual level. Under Abadie et al. [2023], this default is not "safe": when neither sampling nor assignment is clustered at that level, the resulting standard errors can be needlessly large, which widens confidence intervals and costs power without improving validity.

Consistent vs inconsistent estimators with few clusters: With G = 5 states, even the wild cluster bootstrap has poor size properties. Some papers use the pairs cluster bootstrap (resample whole clusters), but this also struggles with few clusters. Recent work by MacKinnon et al. [2022] shows that no purely design-based cluster-robust procedure is fully reliable below G ≈ 10 and recommends reporting p-values from multiple methods.

Clustering level for shift-share instruments: Borusyak et al. [2022] show that when identification comes from industry-level shocks in a shift-share design, standard errors should be clustered at the industry (shock) level, not at the location level. Location-level clustering in this setting ignores the actual unit of variation and produces standard errors that can be either too small or too large depending on the structure of the shocks.

7 Practical Guidance

Cluster at the treatment-assignment level. If treatment is assigned by state, cluster by state. If by firm, cluster by firm. This is the design-based criterion of Abadie et al. [2023].
Do not cluster "just to be safe" at a level above treatment assignment. This may inflate standard errors without theoretical justification.
With few clusters (G < 30), use the wild cluster bootstrap. Standard clustered SEs over-reject severely with few clusters.
Report multiple standard errors. For robustness, report OLS SEs, HC SEs, clustered SEs, and (if relevant) two-way clustered SEs side by side.
For shift-share IVs, cluster at the shock level. Standard location-level clustering is invalid; see Borusyak et al. [2022].

8 Conclusion

The question of how to cluster standard errors has no single correct answer—it depends on the source of identification, the sampling framework, and the correlation structure of errors. The design-based framework of Abadie et al. [2023] provides a principled foundation: cluster at the level at which treatment is assigned and at which sampling induces dependence. The wild cluster bootstrap [Cameron et al., 2008, MacKinnon and Webb, 2017] is the preferred tool when the number of clusters is small. Two-way clustering [Cameron et al., 2011] handles non-nested dependence structures. The debate is not fully resolved, but the tools are mature enough that applied researchers have no excuse for defaulting to OLS standard errors in settings with obvious within-group dependence.

References

Abadie, A., Athey, S., Imbens, G. W., and Wooldridge, J. (2023). When should you adjust standard errors for clustering? Quarterly Journal of Economics, 138(1):1-35.
Bertrand, M., Duflo, E., and Mullainathan, S. (2004). How much should we trust differences-in-differences estimates? Quarterly Journal of Economics, 119(1):249-275.
Borusyak, K., Hull, P., and Jaravel, X. (2022). Quasi-experimental shift-share research designs. Review of Economic Studies, 89(1):181-213.
Cameron, A. C., Gelbach, J. B., and Miller, D. L. (2008). Bootstrap-based improvements for inference with clustered errors. Review of Economics and Statistics, 90(3):414-427.
Cameron, A. C. and Miller, D. L. (2015). A practitioner's guide to cluster-robust inference. Journal of Human Resources, 50(2):317-372.
Cameron, A. C., Gelbach, J. B., and Miller, D. L. (2011). Robust inference with multiway clustering. Journal of Business & Economic Statistics, 29(2):238-249.
Eicker, F. (1967). Limit theorems for regression with unequal and dependent errors. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pages 59-82.
Imbens, G. W. and Kolesár, M. (2016). Robust standard errors in small samples: Some practical advice. Review of Economics and Statistics, 98(4):701-712.
MacKinnon, J. G. and Webb, M. D. (2017). Wild bootstrap inference for wildly different cluster sizes. Journal of Applied Econometrics, 32(2):233-254.
MacKinnon, J. G., Nielsen, M. Ø., and Webb, M. D. (2022). Cluster-robust inference: A guide to empirical practice. Journal of Econometrics, 232(2):272-299.
Moulton, B. R. (1990). An illustration of a pitfall in estimating the effects of aggregate variables on micro units. Review of Economics and Statistics, 72(2):334-338.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4):817-838.

Clustering Standard Errors: When, Why, and At What Level?
