The Causal Review

Introduction

When estimating a causal effect from observational data, the researcher must choose a set of control variables X to condition on. Get this wrong in one direction, by omitting a confounder, and the estimate is biased, with a sign that depends on how the confounder relates to treatment and outcome. Get it wrong in the other direction, by conditioning on a mediator or collider, and the estimate can be attenuated, inflated, or even reversed in sign. This choice is one of the most consequential and least formalised decisions in applied econometrics.

Two competing schools have emerged. The theory-first school holds that only substantive economic reasoning, guided by causal graphs, can correctly identify the valid adjustment set. The data-first school holds that, in high-dimensional settings, machine learning methods such as LASSO can automatically select the right controls and do so more credibly than researcher judgment, which is subject to bias and degrees of freedom.

1 The Case for Theory-Driven Covariate Selection

Only theory distinguishes confounders from mediators. A variable M that lies on the causal path from treatment D to outcome Y (a mediator) should not be included as a control when the target is the total effect of D: doing so blocks that causal channel, so the estimate captures only the part of the effect that does not run through M. A collider C that is caused by both D and Y should also not be controlled: conditioning on it opens a spurious non-causal path between D and Y and induces bias (Pearl, 2009).
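The three cases can be made concrete with a small simulation. The sketch below is illustrative only: the structural equations, coefficients, and sample size are assumptions chosen for this example, not taken from any of the cited papers. It compares the OLS coefficient on D under four adjustment sets when the true total effect is 1.5.

```python
import numpy as np

def ols_coef(y, *regressors):
    """Return OLS coefficients (intercept first) of y on the given regressors."""
    X = np.column_stack([np.ones_like(y), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical structural model; every coefficient is an illustrative choice.
x = rng.normal(size=n)                      # confounder: X -> D and X -> Y
d = 0.8 * x + rng.normal(size=n)            # treatment
m = 0.5 * d + rng.normal(size=n)            # mediator: D -> M -> Y
y = d + m + x + rng.normal(size=n)          # direct effect 1.0, via M 0.5
c = d + y + rng.normal(size=n)              # collider: D -> C <- Y

total_effect = 1.5                          # 1.0 direct + 0.5 * 1.0 through M

est_omit_confounder = ols_coef(y, d)[1]     # X left out
est_valid = ols_coef(y, d, x)[1]            # adjusts for the confounder only
est_with_mediator = ols_coef(y, d, x, m)[1] # also conditions on M
est_with_collider = ols_coef(y, d, x, c)[1] # also conditions on C

for label, est in [("omit confounder", est_omit_confounder),
                   ("valid set {X}", est_valid),
                   ("add mediator", est_with_mediator),
                   ("add collider", est_with_collider)]:
    print(f"{label:>16}: {est:+.3f}  (total effect {total_effect})")
```

In this particular design, omitting X pushes the estimate above 1.5, adding M recovers only the direct effect of 1.0, and adding C flips the sign. Other coefficient choices would change the magnitudes and directions, which is the point: the bias depends on the causal structure, not on a fixed rule.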

The critical point, emphasised by Cinelli et al. (2022), is that LASSO and other ML methods select variables based on their predictive power for Y—they cannot distinguish confounders from mediators or colliders using data alone. A mediator predicts Y just as well as a confounder; LASSO would include it either way.
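A minimal sketch of this point, assuming scikit-learn's `LassoCV` as the selector and a made-up data-generating process (all variable names and coefficients are hypothetical): a confounder, a mediator, and five irrelevant variables are offered as candidate controls, and LASSO is asked to predict Y.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n = 2_000

# Hypothetical data-generating process; coefficients are illustrative only.
x = rng.normal(size=n)                    # confounder
d = 0.8 * x + rng.normal(size=n)          # treatment
m = 0.5 * d + rng.normal(size=n)          # mediator on the path D -> M -> Y
noise = rng.normal(size=(n, 5))           # irrelevant candidate controls
y = d + m + x + rng.normal(size=n)

names = ["x_confounder", "m_mediator"] + [f"noise_{j}" for j in range(5)]
candidates = np.column_stack([x, m, noise])

# Select candidate controls purely by their predictive power for Y.
lasso = LassoCV(cv=5, random_state=0).fit(candidates, y)
selected = [name for name, coef in zip(names, lasso.coef_) if coef != 0]
print(selected)
```

Both the confounder and the mediator are retained, because both predict Y. Nothing in the fitted coefficients indicates which of the two is safe to condition on.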

The credibility revolution was built on transparency. Angrist and Pischke (2009) argue that the key advance of modern causal econometrics is explicit identification strategies—randomisation, instrumental variables, regression discontinuity—that allow the researcher to make credible causal claims precisely because the identification assumptions are visible and testable. Theory-driven covariate selection is part of this transparency: the researcher commits to a specification before estimation, and the choice of controls is motivated by a clear causal story.

Overfitting and false precision. ML-selected controls that predict Y in the sample may do so for spurious, sample-specific reasons. Conditioning on such variables introduces overfitting bias that is invisible in standard inference. Theory precludes a large class of such mistakes by ruling out implausible correlations before the data are examined.

The DAG is the arbiter. Pearl (2009) shows that the valid adjustment set—the set of variables that, when conditioned on, identify the causal effect via the backdoor criterion—is determined by the causal graph, not by the data. Different causal structures imply different adjustment sets, and no data-driven method can discriminate between them without additional assumptions. Theory supplies those assumptions; data cannot.
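For reference, the adjustment formula implied by the backdoor criterion can be written as follows. This is the standard textbook statement, shown for discrete X for simplicity; the notation is an assumption of this sketch rather than a quotation from Pearl (2009).

```latex
% Backdoor adjustment: holds when X blocks every backdoor path from D to Y
% and contains no descendant of D.
P\bigl(Y = y \mid \mathrm{do}(D = d)\bigr)
  \;=\; \sum_{x} P\bigl(Y = y \mid D = d,\, X = x\bigr)\, P\bigl(X = x\bigr)
```

Whether a given X satisfies the two conditions in the comment is a property of the graph. The right-hand side can be estimated from data for any X, valid or not, which is why the data alone cannot certify the choice.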

2 The Case for ML-Driven Covariate Selection

Post-double-selection LASSO avoids omitted variable bias. Belloni et al. (2014) consider high-dimensional settings where the potential control vector X has p ≫ n components, so that specification searching through subsets of X leaves the researcher severe degrees of freedom. Post-double-selection (PDS) LASSO addresses this in three steps: run LASSO to predict D from X (selecting controls that predict the treatment), run it again to predict Y from X (selecting controls that predict the outcome), then regress Y on D and the union of the selected variables by OLS. Selecting on both equations guards against dropping a confounder that is strongly related to D but only weakly related to Y. Under approximate sparsity and regularity conditions, the procedure is shown to deliver a √n-consistent, asymptotically normal estimate of the treatment effect, with confidence intervals that remain valid despite imperfect variable selection.

Critically, PDS-LASSO selects from a pre-specified large set X that the researcher believes may contain confounders—it does not select from all possible variables in existence. The causal reasoning that defines which variables belong in X is still theory-driven.
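The three PDS steps can be sketched as below. This is a simplified illustration, not the authors' implementation: it uses scikit-learn's cross-validated `LassoCV` in place of the theory-driven penalty level that Belloni et al. (2014) propose, it omits standard errors, and the function name `post_double_selection` and the simulated data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def post_double_selection(y, d, X, cv=5, seed=0):
    """Return (treatment effect estimate, indices of selected controls)."""
    # Step 1: controls that predict the treatment.
    lasso_d = LassoCV(cv=cv, random_state=seed).fit(X, d)
    # Step 2: controls that predict the outcome.
    lasso_y = LassoCV(cv=cv, random_state=seed).fit(X, y)
    # Step 3: OLS of Y on D and the union of the two selected sets.
    selected = np.flatnonzero((lasso_d.coef_ != 0) | (lasso_y.coef_ != 0))
    Z = np.column_stack([np.ones(len(y)), d, X[:, selected]])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta[1], selected

# Illustrative sparse design: only the first two of 100 candidates matter.
rng = np.random.default_rng(0)
n, p = 500, 100
X = rng.normal(size=(n, p))
d = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
y = 2.0 * d + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

theta_hat, selected = post_double_selection(y, d, X)
print(f"estimated effect: {theta_hat:.3f} (true value 2.0)")
print(f"number of controls selected: {len(selected)}")
```

Every column of X here is a pre-treatment confounder or pure noise by construction. If a mediator or collider had been placed in X, the procedure would have had no way to exclude it, which is the sense in which the candidate set remains a theory-driven choice.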

DML decouples nuisance estimation from causal inference. Chernozhukov et al. (2018) show that Neyman-orthogonal score functions, combined with cross-fitting, decouple the nuisance estimation (which can use any sufficiently accurate ML method) from the target parameter estimation (which achieves √n-consistency). The key insight: as long as the nuisance functions E[Y|X] and E[D|X] are estimated on held-out folds and converge fast enough (roughly, faster than n^(-1/4)), the DML estimator of the treatment effect is valid regardless of which learner produced them. ML is used for prediction; theory is used only to specify the causal model.
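A minimal sketch of the idea for the partially linear model Y = θD + g(X) + ε, assuming random forests as the nuisance learner and five-fold cross-fitting. The function name `dml_partially_linear`, the learner settings, and the simulated design are assumptions made for illustration; a maintained package would be the natural choice for applied work.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

def dml_partially_linear(y, d, X, n_folds=5, seed=0):
    """Cross-fitted partialling-out estimate of theta and its standard error."""
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)

    def make_learner():
        return RandomForestRegressor(
            n_estimators=100, min_samples_leaf=5, random_state=seed)

    # Out-of-fold predictions of the nuisance functions E[Y|X] and E[D|X].
    y_hat = cross_val_predict(make_learner(), X, y, cv=folds)
    d_hat = cross_val_predict(make_learner(), X, d, cv=folds)

    # Residual-on-residual regression (the orthogonal score for this model).
    y_res, d_res = y - y_hat, d - d_hat
    theta = (d_res @ y_res) / (d_res @ d_res)

    # Plug-in standard error based on the estimated score.
    psi = (y_res - theta * d_res) * d_res
    se = np.sqrt(np.mean(psi ** 2)) / np.mean(d_res ** 2) / np.sqrt(len(y))
    return theta, se

# Illustrative nonlinear design with a true effect of 1.0.
rng = np.random.default_rng(0)
n = 2_000
X = rng.normal(size=(n, 3))
d = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(size=n)
y = 1.0 * d + X[:, 0] ** 2 + X[:, 1] + rng.normal(size=n)

theta_hat, se_hat = dml_partially_linear(y, d, X)
print(f"theta = {theta_hat:.3f}, se = {se_hat:.3f} (true value 1.0)")
```

The learner can be swapped for any regressor with `fit` and `predict` methods without touching the final step, which is the decoupling the paragraph describes. The choice of which columns enter X is made before any of this code runs.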

Researcher degrees of freedom are a documented problem. Brodeur et al. (2020) document systematic evidence of p-hacking in economics journals: the distribution of test statistics shows excess mass just above 1.96. Automated covariate selection is more credible than manual selection precisely because it leaves fewer degrees of freedom for the researcher to choose specifications post-estimation.

Large datasets require regularisation. Administrative datasets with hundreds of demographic controls, geographic indicators, industry codes, and interaction terms cannot be handled by manually specifying a parsimonious regression. In these settings, regularised methods provide the only feasible path to credible high-dimensional confounding adjustment.

3 Synthesis: What the Debate Is Really About

Imbens (2020) argues that the apparent conflict between these positions is largely terminological: the two schools answer different questions.

Which variables belong in X? This is a causal question. Theory, guided by a DAG, determines whether a variable is a confounder, mediator, or collider. ML cannot answer this.
How to estimate E[Y|X] and E[D|X] given a valid X? This is a statistical prediction problem. ML methods (LASSO, random forests, gradient boosting) can outperform parametric regression when X is high-dimensional and the functional form is unknown.

PDS-LASSO and DML are best understood as estimation strategies given a causal model that has already been specified by theory. They are not alternatives to causal reasoning; they are complements that handle the statistical problem of estimating nuisance functions in high-dimensional settings.

Task	Theory	ML
Determine adjustment set	Essential	Cannot do
Estimate conditional expectation of Y on X	Limited	Strong
Prevent researcher degrees of freedom	Weak	Strong
Handle p ≫ n	Impractical	Strong
Identify mediators and colliders	Essential	Cannot do

Table 1: Theory vs. ML: What each does well

4 What Would Resolve the Debate?

Within-study comparisons. Compare the bias of theory-selected vs. ML-selected specifications in settings where the truth is known from a randomised experiment. LaLonde (1986) comparisons of observational to experimental estimates are a template.
Simulation studies with realistic DGPs. Generate data from causal graphs with known mediators and colliders; compare the two approaches' bias and coverage under specification errors.
Adversarial collaboration. Researchers from both camps pre-commit to specifications on the same dataset before accessing outcomes, then compare estimates.
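The simulation design in the second item can be prototyped in a few lines. The sketch below is a toy version under assumed structural equations (not a realistic DGP): it repeatedly draws data from a graph with one confounder and one mediator, then reports the bias and 95% confidence interval coverage for the total effect under a valid adjustment set and under one contaminated by the mediator.

```python
import numpy as np

def ols_with_se(y, Z):
    """OLS coefficients and homoskedastic standard errors (intercept first)."""
    Z = np.column_stack([np.ones(len(y)), Z])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / (len(y) - Z.shape[1])
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    return beta, np.sqrt(np.diag(cov))

def simulate(adjust_for_mediator, n=500, reps=500, seed=0):
    """Return (bias, coverage) of the OLS estimate of the total effect of D."""
    rng = np.random.default_rng(seed)
    true_effect = 1.5  # direct effect 1.0 plus 0.5 * 1.0 through the mediator
    estimates, covered = [], []
    for _ in range(reps):
        # Hypothetical structural equations; coefficients are illustrative.
        x = rng.normal(size=n)                  # confounder
        d = 0.8 * x + rng.normal(size=n)        # treatment
        m = 0.5 * d + rng.normal(size=n)        # mediator
        y = d + m + x + rng.normal(size=n)      # outcome
        controls = [d, x, m] if adjust_for_mediator else [d, x]
        beta, se = ols_with_se(y, np.column_stack(controls))
        estimates.append(beta[1])
        covered.append(abs(beta[1] - true_effect) <= 1.96 * se[1])
    return np.mean(estimates) - true_effect, np.mean(covered)

bias_valid, coverage_valid = simulate(adjust_for_mediator=False)
bias_bad, coverage_bad = simulate(adjust_for_mediator=True)
print(f"valid set {{X}}:    bias {bias_valid:+.3f}, coverage {coverage_valid:.2f}")
print(f"with mediator M:  bias {bias_bad:+.3f}, coverage {coverage_bad:.2f}")
```

A fuller study along the lines proposed above would replace the fixed adjustment sets with the output of each selection approach and vary the graph, the sample size, and the number of candidate controls.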

Conclusion

The ML-vs.-theory debate is partly a false dichotomy. ML methods are powerful estimation tools; economic theory is indispensable for causal reasoning. Applied researchers should use both: theory to define the causal model and valid adjustment set, ML to estimate nuisance functions from high-dimensional data without strong parametric assumptions. The real danger lies less in using ML alone (which risks conditioning on mediators and colliders) or theory alone (which risks omitting confounders in high-dimensional settings) than in failing to distinguish the two tasks.

References

Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.
Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies, 81(2):608-650.
Brodeur, A., Cook, N., and Heyes, A. (2020). Methods matter: P-hacking and publication bias in causal analysis in economics. American Economic Review, 110(11):3634-3660.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal, 21(1):C1-C68.
Cinelli, C., Forney, A., and Pearl, J. (2022). A crash course in good and bad controls. Sociological Methods and Research, 53(3):1071-1099.
Imbens, G. W. (2020). Potential outcome and directed acyclic graph approaches to causality: Relevance for empirical practice in economics. Journal of Economic Literature, 58(4):1129-1179.
LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review, 76(4):604-620.
Pearl, J. (2009). Causality: Models, Reasoning, and Inference, 2nd ed. Cambridge University Press.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288.

Machine Learning or Economic Theory? The Debate Over Covariate Selection in Applied Economics
