Introduction
When estimating a causal effect from observational data, the researcher must choose a set of control variables X to condition on. Get this wrong in one direction, by omitting a confounder, and the estimate suffers from omitted variable bias whose sign depends on how the confounder relates to treatment and outcome. Get it wrong in the other direction, by conditioning on a mediator or collider, and the estimate no longer recovers the total effect and can even take the wrong sign. This choice is one of the most consequential and least formalised decisions in applied econometrics.
Two competing schools have emerged. The theory-first school holds that only substantive economic reasoning, guided by causal graphs, can correctly identify the valid adjustment set. The data-first school holds that, in high-dimensional settings, machine learning methods such as LASSO can automatically select the right controls and do so more credibly than researcher judgment, which is subject to bias and leaves many researcher degrees of freedom.
1 The Case for Theory-Driven Covariate Selection
Only theory distinguishes confounders from mediators. A variable M that lies on the causal path from treatment D to outcome Y (a mediator) should not be included as a control when the target is the total effect: doing so blocks that causal channel, so the coefficient on D captures only the part of the effect that does not run through M. A collider C that is caused by both D and Y should not be controlled either: conditioning on it opens a spurious non-causal path between D and Y and induces bias (Pearl, 2009).
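The mediator and collider cases can be illustrated with a small simulation. The sketch below is not taken from any of the cited papers: the linear data-generating processes, the coefficient values, and the helper name ols_coef are all assumptions chosen for illustration.

```python
import numpy as np

def ols_coef(y, *regressors):
    """Return OLS coefficients (intercept first) of y on the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 200_000  # large sample so that sampling noise is negligible

# Mediator case (assumed DGP): D -> M -> Y plus a direct D -> Y arrow.
d = rng.normal(size=n)
m = 0.8 * d + rng.normal(size=n)
y = 0.5 * d + 1.0 * m + rng.normal(size=n)  # total effect = 0.5 + 0.8 * 1.0 = 1.3
total_effect = ols_coef(y, d)[1]            # close to the total effect, 1.3
direct_only = ols_coef(y, d, m)[1]          # close to 0.5: the mediated part is blocked

# Collider case (assumed DGP): D -> Y, and C is caused by both D and Y.
d2 = rng.normal(size=n)
y2 = 1.0 * d2 + rng.normal(size=n)          # true effect = 1.0
c = d2 + y2 + rng.normal(size=n)
unadjusted = ols_coef(y2, d2)[1]            # close to 1.0
collider_adj = ols_coef(y2, d2, c)[1]       # driven to roughly zero in this DGP

print(f"mediator case: total {total_effect:.2f}, with M controlled {direct_only:.2f}")
print(f"collider case: unadjusted {unadjusted:.2f}, with C controlled {collider_adj:.2f}")
```

In this particular design, controlling for the collider wipes out the estimated effect entirely. With other coefficient values the distortion could be smaller, larger, or of the opposite sign.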
The critical point, emphasised by Cinelli et al. (2022), is that LASSO and other ML methods select variables based on their predictive power for Y—they cannot distinguish confounders from mediators or colliders using data alone. A mediator predicts Y just as well as a confounder; LASSO would include it either way.
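A minimal sketch of this point, assuming scikit-learn's Lasso and a made-up linear design: the candidate set contains one true confounder, one mediator, and five irrelevant variables, and the penalty alpha=0.1 is an arbitrary illustrative value rather than a recommended choice.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n = 5_000

x = rng.normal(size=n)                 # confounder: X -> D and X -> Y
d = 0.7 * x + rng.normal(size=n)       # treatment
m = 0.8 * d + rng.normal(size=n)       # mediator: D -> M -> Y
noise = rng.normal(size=(n, 5))        # irrelevant candidate controls
y = 0.5 * d + 1.0 * m + 0.6 * x + rng.normal(size=n)

names = ["confounder", "mediator"] + [f"noise_{j}" for j in range(5)]
candidates = np.column_stack([x, m, noise])

# LASSO of Y on the candidate controls; the fit only sees predictive power.
lasso = Lasso(alpha=0.1).fit(candidates, y)
selected = [name for name, coef in zip(names, lasso.coef_) if abs(coef) > 1e-8]
print(selected)
```

Both the confounder and the mediator end up in the selected set, because nothing in the objective function encodes which way the arrows point.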
The credibility revolution was built on transparency. Angrist and Pischke (2009) argue that the key advance of modern causal econometrics is explicit identification strategies (randomisation, instrumental variables, regression discontinuity) that allow the researcher to make credible causal claims precisely because the identification assumptions are visible and, where possible, testable. Theory-driven covariate selection is part of this transparency: the researcher commits to a specification before estimation, and the choice of controls is motivated by a clear causal story.
Overfitting and false precision. ML-selected controls that predict Y in the sample may do so for spurious, sample-specific reasons. Conditioning on such variables introduces overfitting bias that is invisible in standard inference. Theory precludes a large class of such mistakes by ruling out implausible correlations before the data are examined.
The DAG is the arbiter. Pearl (2009) shows that the valid adjustment set—the set of variables that, when conditioned on, identify the causal effect via the backdoor criterion—is determined by the causal graph, not by the data. Different causal structures imply different adjustment sets, and no data-driven method can discriminate between them without additional assumptions. Theory supplies those assumptions; data cannot.
2 The Case for ML-Driven Covariate Selection
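Post-double-selection LASSO avoids omitted variable bias. Belloni et al. (2014) show that in high-dimensional settings, where the number of potential controls p in X can be comparable to or larger than the sample size n, specification searching through subsets of X introduces severe researcher degrees of freedom. Post-double-selection (PDS) LASSO addresses this: run LASSO to predict D from X (selecting controls that predict the treatment) and again to predict Y from X (selecting controls that predict the outcome); then regress Y on D and the union of the selected variables by OLS. Selecting in both equations guards against dropping a control that is strongly related to D but only weakly related to Y, which is the case in which single-equation selection produces omitted variable bias. Under approximate sparsity and regularity conditions, the resulting estimator of the treatment effect is √n-consistent and asymptotically normal, and its confidence intervals are uniformly valid over a large class of models.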
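The two-step procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it chooses the LASSO penalty by cross-validation (scikit-learn's LassoCV) instead of the theoretically motivated penalty level used by Belloni et al. (2014), it returns only a point estimate, and the function name post_double_selection and the simulated design are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def post_double_selection(y, d, X, cv=5):
    """Sketch of PDS: LASSO of D on X, LASSO of Y on X, then OLS on the union."""
    sel_d = np.flatnonzero(LassoCV(cv=cv).fit(X, d).coef_)  # controls predicting D
    sel_y = np.flatnonzero(LassoCV(cv=cv).fit(X, y).coef_)  # controls predicting Y
    keep = np.union1d(sel_d, sel_y).astype(int)
    # Final step: unpenalised OLS of Y on an intercept, D, and the selected controls.
    Z = np.column_stack([np.ones(len(y)), d, X[:, keep]])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta[1], keep

# Assumed design: 200 candidate controls, of which only three matter.
rng = np.random.default_rng(0)
n, p = 500, 200
X = rng.normal(size=(n, p))
d = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
# Column 0 drives D strongly but Y only weakly; the true treatment effect is 1.0.
y = 1.0 * d + 0.3 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(size=n)

alpha_hat, kept = post_double_selection(y, d, X)
print(f"estimated effect: {alpha_hat:.2f}, controls kept: {len(kept)}")
```

Cross-validated penalties tend to keep more variables than the plug-in penalty would, so the selected set here is likely larger than the three relevant columns. For inference one would pair the final OLS step with heteroskedasticity-robust standard errors.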
Critically, PDS-LASSO selects from a pre-specified large set X that the researcher believes may contain confounders—it does not select from all possible variables in existence. The causal reasoning that defines which variables belong in X is still theory-driven.
DML decouples nuisance estimation from causal inference. Chernozhukov et al. (2018) show that Neyman-orthogonal score functions, combined with cross-fitting, decouple the nuisance estimation (which can use any ML method) from the target parameter estimation (which achieves √n-consistency). The key insight: consistency of the nuisance estimates alone is not enough, but as long as E[Y|X] and E[D|X] are estimated at a sufficiently fast rate (in the partially linear model, the product of the two estimation errors must vanish faster than n^(-1/2)), the DML estimator of the treatment effect is valid regardless of which learner produced them. ML is used for prediction; theory is used only to specify the causal model.
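A minimal sketch of the partialling-out version of DML for a partially linear model Y = θD + g(X) + ε, assuming random forests as the nuisance learner and five-fold cross-fitting. The function name dml_partialling_out, the tuning values, and the simulated design are illustrative assumptions; any learner with fit and predict methods could be substituted.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_partialling_out(y, d, X, n_folds=5, seed=0):
    """Cross-fitted residual-on-residual estimate of theta, with a robust SE."""
    y_res = np.empty(len(y))
    d_res = np.empty(len(d))
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train, test in folds.split(X):
        # Nuisance functions are fit on the other folds only (cross-fitting).
        learner_y = RandomForestRegressor(n_estimators=100, min_samples_leaf=5,
                                          random_state=seed)
        learner_d = RandomForestRegressor(n_estimators=100, min_samples_leaf=5,
                                          random_state=seed)
        y_res[test] = y[test] - learner_y.fit(X[train], y[train]).predict(X[test])
        d_res[test] = d[test] - learner_d.fit(X[train], d[train]).predict(X[test])
    theta = (d_res @ y_res) / (d_res @ d_res)
    # Standard error based on the orthogonal score d_res * (y_res - theta * d_res).
    score = d_res * (y_res - theta * d_res)
    se = np.sqrt(np.mean(score**2) / np.mean(d_res**2) ** 2 / len(y))
    return theta, se

# Assumed design: nonlinear confounding through the first two columns of X.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
d = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(size=n)
y = 0.5 * d + X[:, 0] + np.cos(X[:, 1]) + rng.normal(size=n)  # true theta = 0.5

theta_hat, se_hat = dml_partialling_out(y, d, X)
naive = np.polyfit(d, y, 1)[0]  # slope of Y on D with no controls
print(f"DML: {theta_hat:.2f} (se {se_hat:.3f}); naive OLS: {naive:.2f}")
```

The naive slope absorbs the confounding through X, while the cross-fitted estimate should land near the true value. Note that X here is taken as given: the code has no way to check whether its columns form a valid adjustment set.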
Researcher degrees of freedom are a documented problem. Brodeur et al. (2020) document systematic evidence of p-hacking in economics journals: the distribution of test statistics shows excess mass just above 1.96. Automated covariate selection is more credible than manual selection precisely because it leaves fewer degrees of freedom for the researcher to choose specifications post-estimation.
Large datasets require regularisation. Administrative datasets with hundreds of demographic controls, geographic indicators, industry codes, and interaction terms cannot realistically be handled by manually specifying a parsimonious regression. In these settings, regularised methods are often the only practical route to credible high-dimensional confounding adjustment.
3 Synthesis: What the Debate Is Really About
Imbens (2020) argues that the apparent conflict between these positions is largely terminological: the two schools answer different questions.
- Which variables belong in X? This is a causal question. Theory, guided by a DAG, determines whether a variable is a confounder, mediator, or collider. ML cannot answer this.
- How to estimate E[Y|X] and E[D|X] given a valid X? This is a statistical prediction problem. ML methods (LASSO, random forests, gradient boosting) can outperform parametric regression when X is high-dimensional and the functional form is unknown.
PDS-LASSO and DML are best understood as estimation strategies given a causal model that has already been specified by theory. They are not alternatives to causal reasoning; they are complements that handle the statistical problem of estimating nuisance functions in high-dimensional settings.
4 What Would Resolve the Debate?
- Within-study comparisons. Compare the bias of theory-selected vs. ML-selected specifications in settings where the truth is known from a randomised experiment. LaLonde's (1986) comparisons of observational with experimental estimates are a template.
- Simulation studies with realistic DGPs. Generate data from causal graphs with known mediators and colliders; compare the two approaches' bias and coverage under specification errors.
- Adversarial collaboration. Researchers from both camps pre-commit to specifications on the same dataset before accessing outcomes, then compare estimates.
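The simulation-study proposal can be made concrete with a small Monte Carlo. The sketch below assumes a linear, homoskedastic data-generating process with one confounder and one collider, and compares bias and 95% confidence interval coverage for a correctly specified regression against one that also conditions on the collider. All coefficients, sample sizes, and function names are illustrative assumptions, and a realistic study would use far richer graphs.

```python
import numpy as np

def ols_with_se(y, regressors):
    """OLS coefficient and classical (homoskedastic) SE for the first regressor."""
    X = np.column_stack([np.ones(len(y))] + regressors)
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    return beta[1], se[1]

def monte_carlo(n_reps=500, n=500, true_effect=1.0, seed=0):
    """Return {specification: (mean bias, CI coverage)} over n_reps simulated samples."""
    rng = np.random.default_rng(seed)
    errors = {"confounder only": [], "confounder + collider": []}
    covered = {"confounder only": [], "confounder + collider": []}
    for _ in range(n_reps):
        x = rng.normal(size=n)                       # confounder
        d = 0.8 * x + rng.normal(size=n)             # treatment
        y = true_effect * d + 0.8 * x + rng.normal(size=n)
        c = d + y + rng.normal(size=n)               # collider: caused by D and Y
        specs = {"confounder only": [d, x], "confounder + collider": [d, x, c]}
        for name, regs in specs.items():
            est, se = ols_with_se(y, regs)
            errors[name].append(est - true_effect)
            covered[name].append(abs(est - true_effect) <= 1.96 * se)
    return {name: (float(np.mean(errors[name])), float(np.mean(covered[name])))
            for name in errors}

summary = monte_carlo()
for name, (bias, coverage) in summary.items():
    print(f"{name}: mean bias {bias:+.3f}, 95% CI coverage {coverage:.2f}")
```

In this design the correctly specified regression is approximately unbiased with close to nominal coverage, while adding the collider produces a large bias and intervals that almost never cover the truth.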
Conclusion
The ML-vs.-theory debate is partly a false dichotomy. ML methods are powerful estimation tools; economic theory is indispensable for causal reasoning. Applied researchers should use both: theory to define the causal model and valid adjustment set, ML to estimate nuisance functions from high-dimensional data without strong parametric assumptions. Using ML alone risks conditioning on mediators and colliders, and using theory alone risks omitting confounders in high-dimensional settings, but the deeper danger is failing to distinguish the two tasks.
References
- Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.
- Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies, 81(2):608-650.
- Brodeur, A., Cook, N., and Heyes, A. (2020). Methods matter: P-hacking and publication bias in causal analysis in economics. American Economic Review, 110(11):3634-3660.
- Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1-C68.
- Cinelli, C., Forney, A., and Pearl, J. (2022). A crash course in good and bad controls. Sociological Methods and Research, 53(3):1071-1099.
- Imbens, G. W. (2020). Potential outcome and directed acyclic graph approaches to causality: Relevance for empirical practice in economics. Journal of Economic Literature, 58(4):1129-1179.
- LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review, 76(4):604-620.
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference, 2nd ed. Cambridge University Press.
- Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288.