The Problem: High-Dimensional Confounding
The canonical partial linear model is: \[\begin{align} Y_i &= D_i \theta_0 + g_0(X_i) + \varepsilon_i, \quad \mathbb{E}[\varepsilon_i \mid D_i, X_i] = 0 \label{eq:plm} \\ D_i &= m_0(X_i) + v_i, \quad \mathbb{E}[v_i \mid X_i] = 0 \label{eq:first} \end{align}\] where \(Y_i\) is the outcome, \(D_i\) is the treatment (possibly continuous), \(X_i \in \mathbb{R}^p\) is a vector of controls, \(\theta_0\) is the treatment effect parameter of interest, \(g_0\) and \(m_0\) are unknown nuisance functions, and \(\varepsilon_i\) and \(v_i\) are error terms.
When \(p\) is large (potentially \(p > n\)), the nuisance functions \(g_0\) and \(m_0\) cannot be estimated consistently by OLS. One approach is to use a penalised regression such as LASSO. However, if LASSO is naively applied to estimate \(g_0\) and the residual \(Y_i - \hat{g}(X_i)\) is then regressed on \(D_i\), the resulting estimate of \(\theta_0\) will be biased. This "regularisation bias" arises because the LASSO fit of the outcome equation shrinks towards zero the part of \(Y_i\) that loads on controls correlated with \(D_i\), and the shrinkage does not disappear in large samples: the resulting bias is of order \(1/\sqrt{n}\) or larger, rather than \(o(1/\sqrt{n})\), so it does not vanish after \(\sqrt{n}\)-scaling and it inflates test statistics.
Neyman Orthogonality
The DML framework addresses regularisation bias through a technique from semiparametric efficiency theory: Neyman orthogonality. Consider a moment condition \(\psi(W_i; \theta, \eta)\) for the parameter \(\theta\), where \(\eta = (g, m)\) is the nuisance parameter. The moment condition is Neyman-orthogonal if: \[ \partial_\eta \mathbb{E}[\psi(W_i; \theta_0, \eta_0)][\Delta \eta] = 0 \quad \text{for all } \Delta\eta \] That is, the Gateaux derivative of the expected moment with respect to \(\eta\), evaluated at the truth, is zero. This means that small errors in estimating \(\eta\) do not translate into first-order bias in the estimator of \(\theta\).
For the partial linear model, the naive moment condition is \(\mathbb{E}[(Y_i - D_i\theta - g(X_i))D_i] = 0\). This is not Neyman-orthogonal: the Gateaux derivative with respect to \(g\) in direction \(\Delta g\) is \(-\mathbb{E}[D_i \Delta g(X_i)]\), which is nonzero in general because \(D_i\) is correlated with functions of \(X_i\). A bias of order \(O(\|\hat{g} - g_0\|)\) in estimating \(g_0\) therefore translates into a bias of the same order in \(\hat\theta\), which does not vanish at the \(\sqrt{n}\)-rate.
The orthogonal moment condition is obtained by "partialling out" the confounders from both the outcome and the treatment. Let \(\ell_0(X_i) = \mathbb{E}[Y_i \mid X_i] = m_0(X_i)\theta_0 + g_0(X_i)\) denote the conditional mean of the outcome; this, rather than \(g_0\) itself, is what a flexible regression of \(Y\) on \(X\) recovers. The orthogonal score is: \[ \psi(W_i; \theta, \ell, m) = (Y_i - \ell(X_i) - (D_i - m(X_i))\theta)(D_i - m(X_i)) \] Setting \(\mathbb{E}[\psi] = 0\) at the truth yields: \[ \theta_0 = \frac{\mathbb{E}[(Y_i - \ell_0(X_i))(D_i - m_0(X_i))]}{\mathbb{E}[(D_i - m_0(X_i))^2]} \] This is the Frisch–Waugh–Lovell (FWL) representation: regress the outcome residual \(Y_i - \ell_0(X_i)\) on the treatment residual \(D_i - m_0(X_i)\). The Gateaux derivative of the expected moment with respect to \((\ell, m)\) is zero at the truth, confirming Neyman orthogonality.
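A quick numerical check makes the distinction concrete. The simulation below (an invented DGP, purely for illustration) perturbs the outcome nuisance by a small amount \(\epsilon\,\Delta(X)\): the naive moment moves at first order in \(\epsilon\), while the orthogonal moment barely moves.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta0 = 200_000, 0.5

# Illustrative DGP satisfying the partial linear model.
X = rng.normal(size=n)
m0 = 0.5 * X                       # true E[D | X]
g0 = X ** 2                        # true g_0
D = m0 + rng.normal(size=n)
Y = D * theta0 + g0 + rng.normal(size=n)
ell0 = m0 * theta0 + g0            # true conditional mean E[Y | X]

def naive_moment(g):
    # E[(Y - D*theta - g(X)) * D]  -- not Neyman-orthogonal in g
    return np.mean((Y - D * theta0 - g) * D)

def orthogonal_moment(ell):
    # E[(Y - ell(X) - (D - m0(X))*theta) * (D - m0(X))]
    return np.mean((Y - ell - (D - m0) * theta0) * (D - m0))

# Perturb the outcome nuisance by eps * Delta(X) with Delta(X) = X.
eps, delta = 0.1, X
d_naive = naive_moment(g0 + eps * delta) - naive_moment(g0)
d_orth = orthogonal_moment(ell0 + eps * delta) - orthogonal_moment(ell0)
print(abs(d_naive), abs(d_orth))  # naive moment moves first-order; orthogonal one barely moves
```

The change in the orthogonal moment is pure sampling noise (the population derivative is exactly zero), while the naive moment shifts by roughly \(\epsilon\,\mathbb{E}[D_i\Delta(X_i)]\).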
Cross-Fitting
Even with Neyman-orthogonal moments, there is a second source of bias when the nuisance estimators are computed on the same data used to estimate \(\theta\): overfitting bias. Complex machine learners (random forests, neural networks, LASSO) can overfit the training data, which means the residuals computed from a model trained on the full sample are smaller than the true residuals. This leads to underestimation of the residual variance and bias in \(\hat\theta\).
The solution is cross-fitting. Partition the data into \(K\) folds. For each fold \(k\):
- Estimate \(\hat{\ell}^{(-k)}\) and \(\hat{m}^{(-k)}\) using all observations not in fold \(k\), where \(\hat{\ell}^{(-k)}\) is a flexible regression of \(Y\) on \(X\) (estimating \(\ell_0(X) = \mathbb{E}[Y \mid X]\)) and \(\hat{m}^{(-k)}\) is a flexible regression of \(D\) on \(X\).
- Compute residuals \(\hat\varepsilon_i = Y_i - \hat{\ell}^{(-k)}(X_i)\) and \(\hat v_i = D_i - \hat{m}^{(-k)}(X_i)\) for \(i\) in fold \(k\).
- Pool the residuals across all folds and estimate \(\theta_0\) by the residual-on-residual regression \(\hat\theta^{DML} = \sum_i \hat v_i \hat\varepsilon_i \big/ \sum_i \hat v_i^2\).
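The procedure can be sketched end-to-end with scikit-learn. The random-forest learners and the DGP below are illustrative choices (production use would typically go through the DoubleML package, which wraps exactly this logic):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
n, p, theta0 = 2000, 10, 0.5

# Illustrative DGP: nonlinear nuisances, 10 controls.
X = rng.normal(size=(n, p))
m0 = np.sin(X[:, 0]) + 0.5 * X[:, 1]
g0 = X[:, 0] ** 2 - X[:, 2]
D = m0 + rng.normal(size=n)
Y = D * theta0 + g0 + rng.normal(size=n)

K = 5
eps_hat = np.zeros(n)  # outcome residuals Y - l_hat(X)
v_hat = np.zeros(n)    # treatment residuals D - m_hat(X)

for train, test in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    # Nuisance fits use the K-1 training folds only (cross-fitting).
    l_hat = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[train], Y[train])
    m_hat = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[train], D[train])
    eps_hat[test] = Y[test] - l_hat.predict(X[test])
    v_hat[test] = D[test] - m_hat.predict(X[test])

# Final stage: residual-on-residual (FWL) regression.
theta_hat = np.sum(v_hat * eps_hat) / np.sum(v_hat ** 2)
print(theta_hat)
```

Because every observation's residuals come from models fit without that observation, the overfitting bias described above is avoided.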
Theorem (Chernozhukov et al., 2018). Under regularity conditions, including the product rate condition \(\|\hat{g} - g_0\|_2 \cdot \|\hat{m} - m_0\|_2 = o_p(n^{-1/2})\), the DML estimator satisfies: \[ \sqrt{n}(\hat\theta^{DML} - \theta_0) \xrightarrow{d} \mathcal{N}(0, V) \] where \(V = J_0^{-2} \mathbb{E}[\psi^2]\) with \(J_0 = \mathbb{E}[v_i^2]\).
The key condition \(\|\hat{g} - g_0\|_2 \cdot \|\hat{m} - m_0\|_2 = o_p(n^{-1/2})\) is a product condition: neither nuisance estimator needs to converge at the parametric rate; it is enough that each converges faster than \(n^{-1/4}\) in \(L_2\), since the product of two such rates is \(o_p(n^{-1/2})\). Under sparsity, LASSO achieves \(\|\hat{g} - g_0\|_2 = O_p(\sqrt{s\log p/n})\), where \(s\) is the number of nonzero coefficients; the product of two such rates is \(O_p(s\log p / n)\), which satisfies the condition if \(s\log p = o(\sqrt{n})\).
The Post-Double-Selection LASSO
Belloni et al. (2014) propose a related approach: post-double-selection (PDS) LASSO. The procedure is:
- Regress \(Y\) on \(X\) using LASSO and record the selected covariates \(\hat{S}_1\).
- Regress \(D\) on \(X\) using LASSO and record the selected covariates \(\hat{S}_2\).
- Regress \(Y\) on \(D\) and \(X_{\hat{S}_1 \cup \hat{S}_2}\) using OLS.
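The three steps above can be sketched with scikit-learn under an invented sparse DGP. Note one substitution: `LassoCV` tunes the penalty by cross-validation, whereas Belloni et al. derive a theoretically motivated plug-in penalty level.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(1)
n, p, theta0 = 500, 100, 1.0

# Sparse linear DGP: only the first few of 100 controls matter.
X = rng.normal(size=(n, p))
D = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
Y = D * theta0 + 2.0 * X[:, 0] + X[:, 2] + rng.normal(size=n)

# Step 1: LASSO of Y on X; Step 2: LASSO of D on X.
s1 = np.flatnonzero(LassoCV(cv=5, random_state=0).fit(X, Y).coef_)
s2 = np.flatnonzero(LassoCV(cv=5, random_state=0).fit(X, D).coef_)

# Step 3: OLS of Y on D and the union of selected controls.
sel = np.union1d(s1, s2)
Z = np.column_stack([D, X[:, sel]])
theta_hat = LinearRegression().fit(Z, Y).coef_[0]
print(theta_hat)
```

The union step is what protects inference: a control missed by one selection (because its coefficient is moderate in that equation) is still retained if the other selection picks it up.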
The DML framework generalises PDS by allowing arbitrary machine learners for the nuisance functions, not just LASSO. This is important when the true nuisance functions are not well-approximated by sparse linear models.
Extensions
Interactive regression model. When the treatment effect itself depends on covariates — i.e., \(Y_i = D_i \theta(X_i) + g_0(X_i) + \varepsilon_i\) — DML can be used to estimate the CATE \(\theta(x)\) by replacing the single coefficient \(\theta_0\) with a function. The R-learner of Nie and Wager (2021) implements this approach.
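A stripped-down version of the R-learner idea, with a linear CATE model for illustration (true nuisances are plugged in to keep the sketch short; Nie and Wager use cross-fitted nuisance estimates and a penalised function class):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

# Illustrative DGP with heterogeneous effect theta(x) = 1 + 0.5*x1.
X = rng.normal(size=(n, 2))
theta_x = 1.0 + 0.5 * X[:, 0]
m0 = 0.3 * X[:, 1]                       # E[D | X]
D = m0 + rng.normal(size=n)
Y = D * theta_x + X[:, 1] ** 2 + rng.normal(size=n)

# Residualise with the true nuisances for brevity; in practice these
# come from cross-fitted ML regressions of Y on X and D on X.
v = D - m0                                # treatment residual
eps = Y - (m0 * theta_x + X[:, 1] ** 2)   # outcome residual Y - E[Y|X]

# R-learner with a linear CATE model theta(x) = b0 + b1*x1 + b2*x2:
# minimise sum_i (eps_i - theta(X_i) * v_i)^2, which is ordinary
# least squares in the scaled design v_i * [1, x1_i, x2_i].
design = v[:, None] * np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(design, eps, rcond=None)
print(beta)  # approximately [1.0, 0.5, 0.0]
```

The same residual-on-residual logic as before, except the coefficient on the treatment residual is now allowed to vary with \(X\).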
IV with high-dimensional controls. DML extends naturally to IV settings where both the first stage and the reduced form have high-dimensional controls. The moment condition is modified to use the instrument residual rather than the treatment residual.
DiD with DML. Callaway and Sant'Anna (2021) and subsequent work use DML-style nuisance estimation in the doubly robust DiD estimator. The conditional propensity score and outcome regression are estimated by flexible methods, and cross-fitting is used to avoid overfitting bias.
Variance Estimation and Inference
The variance of \(\hat\theta^{DML}\) is estimated by: \[ \hat{V} = \hat{J}_0^{-2} \frac{1}{n}\sum_{i=1}^n \hat\psi_i^2 \] where \(\hat{J}_0 = n^{-1}\sum_i \hat{v}_i^2\) and \(\hat\psi_i = \hat{v}_i(\hat\varepsilon_i - \hat{v}_i \hat\theta^{DML})\).
Standard errors and confidence intervals from this formula are valid under the conditions of the theorem, which include the product rate condition and appropriate moment conditions. Under clustering, the variance formula is adjusted using clustered standard errors.
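Given pooled cross-fitted residuals and \(\hat\theta^{DML}\), the variance formula translates into a few lines. The residuals below are simulated stand-ins (true-nuisance residuals, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(7)
n, theta0 = 10_000, 0.5

# Stand-ins for cross-fitted residuals.
v = rng.normal(size=n)                 # v_hat = D - m_hat(X)
eps = v * theta0 + rng.normal(size=n)  # eps_hat = Y - l_hat(X)

# Point estimate: residual-on-residual regression.
theta_hat = np.sum(v * eps) / np.sum(v ** 2)

# Plug-in variance: V_hat = J0_hat^{-2} * mean(psi_hat^2).
psi = v * (eps - v * theta_hat)
J0 = np.mean(v ** 2)
V = np.mean(psi ** 2) / J0 ** 2
se = np.sqrt(V / n)
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
print(theta_hat, se, ci)
```

For clustered data, the same score contributions \(\hat\psi_i\) would be summed within clusters before computing the variance.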
Conclusion
Double Machine Learning provides a principled solution to the problem of high-dimensional confounding in treatment effect estimation. Its two key innovations — Neyman orthogonality and cross-fitting — together ensure that flexible machine-learning estimation of nuisance functions does not bias inference on the parameter of interest. The framework is implemented in the DoubleML package in R and Python. For researchers who have access to rich covariate data and face potential confounding along many dimensions simultaneously, DML is now a standard tool.
References
- Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies, 81(2):608--650.
- Callaway, B. and Sant'Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2):200--230.
- Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1--C68.
- Nie, X. and Wager, S. (2021). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299--319.
- Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, Princeton, NJ.
- Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846--866.
- Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1):267--288.