New Methods & Techniques

Double Machine Learning: Debiased Estimation with High-Dimensional Controls

The Problem: High-Dimensional Confounding

The canonical partial linear model is:

$$ Y_i = D_i\theta_0 + g_0(X_i) + \varepsilon_i, \quad \mathbb{E}[\varepsilon_i \mid D_i, X_i] = 0 \tag{1} $$

$$ D_i = m_0(X_i) + v_i, \quad \mathbb{E}[v_i \mid X_i] = 0 \tag{2} $$

where \(Y_i\) is the outcome, \(D_i\) is the treatment (possibly continuous), \(X_i \in \mathbb{R}^p\) is a vector of controls, \(\theta_0\) is the treatment effect parameter of interest, \(g_0\) and \(m_0\) are unknown nuisance functions, and \(\varepsilon_i\) and \(v_i\) are error terms.

When \(p\) is large (potentially \(p > n\)), the nuisance functions \(g_0\) and \(m_0\) cannot be estimated consistently by OLS. One approach is to use a penalised regression such as LASSO. However, if LASSO is naively applied to estimate \(g_0\) and the residual \(Y_i - \hat g(X_i)\) is then regressed on \(D_i\), the resulting estimate of \(\theta_0\) will be biased. This "regularisation bias" arises because the shrinkage that LASSO applies when fitting equation (1) contaminates the estimated coefficient on \(D_i\), and the resulting error does not disappear fast enough in large samples: it is of order \(1/\sqrt{n}\) or larger rather than \(o(1/\sqrt{n})\), so it distorts test statistics and confidence intervals.

Neyman Orthogonality

The DML framework addresses regularisation bias through a technique from semiparametric efficiency theory: Neyman orthogonality. Consider a moment condition \(\psi(W_i; \theta, \eta)\) for the parameter \(\theta\), where \(\eta = (g, m)\) is the nuisance parameter. The moment condition is Neyman-orthogonal if:

$$ \partial_\eta \mathbb{E}[\psi(W_i; \theta_0, \eta_0)][\Delta \eta] = 0 \quad \text{for all } \Delta \eta $$

That is, the Gateaux derivative of the expected moment with respect to \(\eta\), evaluated at the truth, is zero. This means that small errors in estimating \(\eta\) do not translate into first-order bias in the estimator of \(\theta\).

For the partial linear model, the naive moment condition is \(\mathbb{E}[(Y_i - D_i\theta - g(X_i))D_i] = 0\). This is not Neyman-orthogonal: the derivative with respect to \(g\) in a direction \(\Delta g\) is \(-\mathbb{E}[\Delta g(X_i) D_i]\), which is generally nonzero. An error of order \(\|\hat g - g_0\|\) in estimating \(g_0\) therefore translates into a bias of the same order in \(\hat\theta\), which vanishes more slowly than the \(n^{-1/2}\) rate.

The orthogonal moment condition is obtained by "partialling out" the confounders from both the outcome and the treatment:

$$ \psi(W_i; \theta, g, m) = (Y_i - g(X_i) - (D_i - m(X_i))\theta)(D_i - m(X_i)) $$

Setting E[ψ] = 0 at the truth yields:

$$ \theta_0 = \frac{\mathbb{E}[(Y_i - g_0(X_i))(D_i - m_0(X_i))]}{\mathbb{E}[(D_i - m_0(X_i))^2]} $$

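This is the Frisch–Waugh–Lovell (FWL) representation: regress the outcome residual \(Y_i - g_0(X_i)\) on the treatment residual \(D_i - m_0(X_i)\). For the orthogonality property to hold, the outcome nuisance in this score should be read as the conditional mean of the outcome, \(\mathbb{E}[Y_i \mid X_i] = m_0(X_i)\theta_0 + g_0(X_i)\), as in Chernozhukov et al. (2018), so that both residuals come from regressions on \(X_i\) alone. With that reading, the derivative of the expected orthogonal moment with respect to \((g, m)\) is zero at the truth, confirming Neyman orthogonality.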
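The difference between the two moment conditions can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original derivation: the data-generating process (\(m_0(x) = x\), \(g_0(x) = x^2\), \(\theta_0 = 2\)), the perturbation direction \(t \cdot x\), and all function names are hypothetical, and the outcome nuisance in the orthogonal score is taken to be the conditional mean \(\mathbb{E}[Y \mid X]\). It perturbs the true nuisance functions by a small amount \(t\) and compares how far each estimate of \(\theta\) moves.

```python
import random

def simulate(n, theta0=2.0, seed=0):
    """Toy partially linear model with one control (all choices are assumptions)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        d = x + rng.gauss(0.0, 1.0)                     # m0(x) = x
        y = theta0 * d + x * x + rng.gauss(0.0, 1.0)    # g0(x) = x^2
        rows.append((y, d, x))
    return rows

def theta_naive(rows, t):
    """Solve the naive moment E[(Y - D*theta - g(X)) D] = 0 with g = g0 + t*x."""
    num = sum((y - (x * x + t * x)) * d for y, d, x in rows)
    den = sum(d * d for y, d, x in rows)
    return num / den

def theta_orthogonal(rows, t, theta0=2.0):
    """Solve the partialled-out moment with both nuisances perturbed by t*x."""
    num = den = 0.0
    for y, d, x in rows:
        ell0 = theta0 * x + x * x       # oracle E[Y | X] = theta0*m0(x) + g0(x)
        y_res = y - (ell0 + t * x)      # perturbed outcome residual
        d_res = d - (x + t * x)         # perturbed treatment residual
        num += y_res * d_res
        den += d_res * d_res
    return num / den

rows = simulate(20000)
for t in (0.0, 0.1, 0.2):
    print(t, round(theta_naive(rows, t), 3), round(theta_orthogonal(rows, t), 3))
```

In this design the naive estimate moves roughly in proportion to \(t\), while the orthogonal estimate moves roughly in proportion to \(t^2\), which is the first-order insensitivity that the definition describes.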

Cross-Fitting

Even with Neyman-orthogonal moments, there is a second source of bias when the nuisance estimators are computed on the same data used to estimate \(\theta\): overfitting bias. Complex machine learners (random forests, neural networks, LASSO) can overfit the training data, which means the residuals computed from a model trained on the full sample are smaller than the true residuals. This leads to underestimation of the residual variance and bias in \(\hat\theta\).
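The shrinking of in-sample residuals is easy to reproduce. The following sketch is a toy illustration, not taken from the original: the learner (a binned-mean regression with far too many bins), the sample size, and the helper names are all assumptions. The true regression function is zero, so the true residual variance is 1, and any gap between the two printed numbers is due to overfitting alone.

```python
import random

def fit_binned_mean(xs, ys, n_bins):
    """Regressogram: predict the mean of y within equal-width bins of x in [0, 1)."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    # Empty bins fall back to the overall mean.
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]

def mean_squared_residual(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

rng = random.Random(1)
n = 500
# Pure noise outcome: the true regression function is zero.
x_train = [rng.random() for _ in range(n)]
y_train = [rng.gauss(0.0, 1.0) for _ in range(n)]
x_new = [rng.random() for _ in range(n)]
y_new = [rng.gauss(0.0, 1.0) for _ in range(n)]

predict = fit_binned_mean(x_train, y_train, n_bins=100)  # deliberately too flexible
in_sample = mean_squared_residual(predict, x_train, y_train)
out_of_sample = mean_squared_residual(predict, x_new, y_new)
print(in_sample, out_of_sample)
```

Residuals computed on the training sample understate the noise, while residuals computed on fresh data do not, which is the motivation for evaluating nuisance fits only on held-out observations.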

The solution is cross-fitting. Partition the data into \(K\) folds. For each fold k:

  1. Estimate \(\hat g^{(-k)}\) and \(\hat m^{(-k)}\) using all observations not in fold \(k\).
  2. Compute residuals \(\hat\varepsilon_i = Y_i - \hat g^{(-k)}(X_i)\) and \(\hat v_i = D_i - \hat m^{(-k)}(X_i)\) for \(i\) in fold \(k\).

Aggregate the residuals across all folds and estimate:

$$ \hat{\theta}^{DML} = \left( \frac{1}{n} \sum_{i=1}^n \hat{v}_i^2 \right)^{-1} \left( \frac{1}{n} \sum_{i=1}^n \hat{v}_i \hat{\varepsilon}_i \right) $$
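The two steps and the final ratio can be sketched in a few lines. This is a minimal illustration under assumptions rather than a reference implementation: the learner is a simple binned-mean regression on a single control in \([0, 1)\), the outcome nuisance is fitted as a regression of \(Y\) on \(X\), and the simulated data (\(\theta_0 = 0.5\), \(m_0(x) = \sin(2\pi x)\), \(g_0(x) = x^2\)) and the names `fit_binned_mean` and `dml_plr` are hypothetical.

```python
import math
import random

def fit_binned_mean(xs, ys, n_bins=20):
    """Regressogram learner: mean of y within equal-width bins of x in [0, 1)."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]

def dml_plr(ys, ds, xs, n_folds=5, seed=0, fit=fit_binned_mean):
    """Cross-fitted partialling-out estimate of theta in the partial linear model."""
    n = len(ys)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    eps_hat = [0.0] * n
    v_hat = [0.0] * n
    for k in range(n_folds):
        held_out = order[k::n_folds]
        held_out_set = set(held_out)
        train = [i for i in order if i not in held_out_set]
        # Step 1: fit both nuisance regressions without fold k.
        g_hat = fit([xs[i] for i in train], [ys[i] for i in train])
        m_hat = fit([xs[i] for i in train], [ds[i] for i in train])
        # Step 2: residualise only the held-out observations.
        for i in held_out:
            eps_hat[i] = ys[i] - g_hat(xs[i])
            v_hat[i] = ds[i] - m_hat(xs[i])
    # Final step: ratio of residual cross-moment to residual second moment.
    theta_hat = sum(v * e for v, e in zip(v_hat, eps_hat)) / sum(v * v for v in v_hat)
    return theta_hat, eps_hat, v_hat

# Simulated data (an assumption of this sketch).
rng = random.Random(42)
theta0 = 0.5
xs = [rng.random() for _ in range(2000)]
ds = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 1.0) for x in xs]
ys = [theta0 * d + x * x + rng.gauss(0.0, 1.0) for d, x in zip(ds, xs)]

theta_hat, eps_hat, v_hat = dml_plr(ys, ds, xs)
print(theta_hat)
```

Any learner with the same fit-then-predict shape could be passed through the `fit` argument; the binned mean is used here only because it needs no external library.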

Theorem 1 (Chernozhukov et al. 2018). Under regularity conditions including that \(\|\hat g - g_0\|_2 \cdot \|\hat m - m_0\|_2 = o_p(n^{-1/2})\), the DML estimator satisfies:

$$ \sqrt{n}(\hat{\theta}^{DML} - \theta_0) \xrightarrow{d} \mathcal{N}(0, V) $$

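where \(V = J_0^{-2}\,\mathbb{E}[\psi^2]\) with \(J_0 = \mathbb{E}[v_i^2]\).

The key condition \(\|\hat g - g_0\|_2 \cdot \|\hat m - m_0\|_2 = o_p(n^{-1/2})\) is a product condition: neither nuisance function has to be estimated at the parametric \(n^{-1/2}\) rate. It is enough that each converges slightly faster than \(n^{-1/4}\), and one may even converge more slowly than \(n^{-1/4}\) as long as the other is fast enough to keep the product rate below \(n^{-1/2}\). Under sparsity, LASSO achieves \(\|\hat g - g_0\|_2 = O_p(\sqrt{s \log p / n})\), where \(s\) is the sparsity; the product of two such rates is \(O_p(s \log p / n)\), which satisfies the condition if \(s \log p = o(\sqrt{n})\).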
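A small arithmetic check makes the product condition concrete. Suppose, purely for illustration, that the two nuisance errors shrink like \(n^{-a}\) and \(n^{-b}\); the exponents below and the helper name `scaled_product` are assumptions of this sketch. The condition requires \(\sqrt{n} \cdot n^{-a} \cdot n^{-b}\) to tend to zero, that is, \(a + b > 1/2\).

```python
def scaled_product(n, a, b):
    """sqrt(n) * n**(-a) * n**(-b); must tend to zero for the product condition."""
    return n ** (0.5 - a - b)

# (a, b) pairs: both a bit faster than n^(-1/4); one slow and one fast; both too slow.
for a, b in [(0.30, 0.30), (0.15, 0.40), (0.20, 0.20)]:
    values = [round(scaled_product(n, a, b), 3) for n in (10**3, 10**5, 10**7)]
    print(a, b, values)
```

The first two pairs give sequences that decrease with \(n\), while the third grows, so two learners that are both slower than \(n^{-1/4}\) do not satisfy the condition.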

The Post-Double-Selection LASSO

Belloni et al. (2014) propose a related approach: post-double-selection (PDS) LASSO. The procedure is:

  1. Regress \(Y\) on \(X\) using LASSO and record the selected covariates \(\hat S_1\).
  2. Regress \(D\) on \(X\) using LASSO and record the selected covariates \(\hat S_2\).
  3. Regress \(Y\) on \(D\) and \(X_{\hat S_1 \cup \hat S_2}\) using OLS.

The union of the two selected sets ensures that any variable confounding either the outcome or the treatment is included. Under sparsity, the PDS estimator is \(\sqrt{n}\)-consistent and asymptotically normal.

The DML framework generalises PDS by allowing arbitrary machine learners for the nuisance functions, not just LASSO. This is important when the true nuisance functions are not well-approximated by sparse linear models.
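The three PDS steps can be sketched without external libraries. The code below is a toy illustration under several assumptions: LASSO is solved by plain coordinate descent with a fixed penalty `lam` chosen by hand (Belloni et al. use a data-driven penalty, which is not reproduced here), no intercept is fitted because the simulated variables have mean zero, and the data-generating process and all function names are hypothetical.

```python
import random

def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso(cols, y, lam, n_sweeps=50):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam*||b||_1 (no intercept)."""
    n = len(y)
    beta = [0.0] * len(cols)
    resid = list(y)                                   # y - X @ beta
    norms = [sum(v * v for v in col) / n for col in cols]
    for _ in range(n_sweeps):
        for j, col in enumerate(cols):
            rho = sum(c * r for c, r in zip(col, resid)) / n + norms[j] * beta[j]
            new = soft_threshold(rho, lam) / norms[j]
            if new != beta[j]:
                delta = beta[j] - new
                resid = [r + c * delta for r, c in zip(resid, col)]
                beta[j] = new
    return beta

def ols(cols, y):
    """Least squares via normal equations and Gauss-Jordan elimination."""
    k = len(cols)
    a = [[sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(k)]
         + [sum(u * v for u, v in zip(cols[i], y))] for i in range(k)]
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(k):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [u - f * v for u, v in zip(a[r], a[i])]
    return [a[i][k] / a[i][i] for i in range(k)]

def post_double_selection(y, d, cols, lam):
    s1 = {j for j, b in enumerate(lasso(cols, y, lam)) if b != 0.0}  # step 1: Y on X
    s2 = {j for j, b in enumerate(lasso(cols, d, lam)) if b != 0.0}  # step 2: D on X
    selected = sorted(s1 | s2)
    coefs = ols([d] + [cols[j] for j in selected], y)                # step 3: OLS
    return coefs[0], selected

# Simulated sparse design: X0 and X1 drive D; X0 and X2 drive Y directly.
rng = random.Random(7)
n, p, theta0 = 400, 30, 1.0
cols = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
d = [cols[0][i] + 0.5 * cols[1][i] + rng.gauss(0.0, 1.0) for i in range(n)]
y = [theta0 * d[i] + cols[0][i] + cols[2][i] + rng.gauss(0.0, 1.0) for i in range(n)]

theta_hat, selected = post_double_selection(y, d, cols, lam=0.2)
print(theta_hat, selected)
```

Taking the union in step 3 means that a control missed by one selection step can still enter through the other; in this design `cols[1]` matters mainly for the treatment and `cols[2]` only for the outcome.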

Extensions

Interactive regression model. When the treatment effect itself depends on covariates, i.e., \(Y_i = D_i\theta(X_i) + g_0(X_i) + \varepsilon_i\), DML can be used to estimate the function \(\theta(x)\) by replacing the single coefficient \(\theta_0\) with a function. The R-learner of Nie and Wager (2021) implements this approach.
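As a rough sketch of the idea, suppose the effect function is restricted to the linear form \(\theta(x) = a + b x\) with a single covariate; this parametrisation, the function name `r_learner_linear`, and the toy numbers are assumptions made for illustration. Minimising the squared error between the outcome residual and \(\theta(X_i)\) times the treatment residual then reduces to a least-squares regression of the outcome residual on the treatment residual and its interaction with \(x\).

```python
def r_learner_linear(y_res, d_res, xs):
    """Fit theta(x) = a + b*x by least squares of y_res on (d_res, d_res * x).

    y_res and d_res are assumed to be (cross-fitted) outcome and treatment residuals.
    """
    s11 = sum(v * v for v in d_res)
    s12 = sum(v * v * x for v, x in zip(d_res, xs))
    s22 = sum((v * x) ** 2 for v, x in zip(d_res, xs))
    t1 = sum(v * e for v, e in zip(d_res, y_res))
    t2 = sum(v * x * e for v, x, e in zip(d_res, xs, y_res))
    det = s11 * s22 - s12 * s12
    a = (s22 * t1 - s12 * t2) / det     # Cramer's rule on the 2x2 normal equations
    b = (s11 * t2 - s12 * t1) / det
    return a, b

# Noise-free toy residuals generated from theta(x) = 1 + 0.5*x.
xs = [0.0, 1.0, 2.0, 3.0]
d_res = [1.0, -1.0, 2.0, -2.0]
y_res = [(1.0 + 0.5 * x) * v for x, v in zip(xs, d_res)]

a_hat, b_hat = r_learner_linear(y_res, d_res, xs)
print(a_hat, b_hat)
```

With noise-free inputs the fit recovers the coefficients used to build the residuals; richer function classes for \(\theta(x)\) replace the two-parameter least-squares step with a flexible learner.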

IV with high-dimensional controls. DML extends naturally to IV settings where both the first stage and the reduced form have high-dimensional controls. The moment condition is modified to use the instrument residual rather than the treatment residual.
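A minimal sketch of the modified final step follows, assuming the outcome, treatment, and instrument have already been residualised on the controls. The toy data contains no controls, so the raw mean-zero variables stand in for residuals; the unobserved confounder `u`, the value \(\theta_0 = 1.5\), and the function names are all assumptions of this illustration.

```python
import random

def dml_partial_out(y_res, d_res):
    """Treatment-residual moment: sum(d*y) / sum(d*d)."""
    return sum(d * y for d, y in zip(d_res, y_res)) / sum(d * d for d in d_res)

def dml_iv(y_res, d_res, z_res):
    """Instrument-residual moment: sum(z*y) / sum(z*d)."""
    return sum(z * y for z, y in zip(z_res, y_res)) / sum(z * d for z, d in zip(z_res, d_res))

rng = random.Random(3)
theta0 = 1.5
y_res, d_res, z_res = [], [], []
for _ in range(5000):
    z = rng.gauss(0.0, 1.0)                      # instrument
    u = rng.gauss(0.0, 1.0)                      # unobserved confounder
    d = z + u + 0.5 * rng.gauss(0.0, 1.0)        # treatment depends on z and u
    y = theta0 * d + u + rng.gauss(0.0, 1.0)     # outcome depends on d and u
    y_res.append(y)
    d_res.append(d)
    z_res.append(z)

theta_ols = dml_partial_out(y_res, d_res)
theta_iv = dml_iv(y_res, d_res, z_res)
print(theta_ols, theta_iv)
```

Because `u` enters both the treatment and the outcome, the treatment-residual ratio is pulled away from \(\theta_0\), while the instrument-residual ratio is not.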

DiD with DML. Callaway and Sant'Anna (2021) and subsequent work use DML-style nuisance estimation in the doubly robust DiD estimator. The conditional propensity score and outcome regression are estimated by flexible methods, and cross-fitting is used to avoid overfitting bias.

Variance Estimation and Inference

The variance of \(\hat\theta^{DML}\) is estimated by:

$$ \hat{V} = \hat{J}_0^{-2} \frac{1}{n}\sum_{i=1}^n \hat\psi_i^2 $$

where \(\hat{J}_0 = n^{-1}\sum_i \hat{v}_i^2\) and \(\hat\psi_i = \hat{v}_i(\hat\varepsilon_i - \hat{v}_i \hat\theta^{DML})\).

Standard errors and confidence intervals from this formula are valid under the conditions of the theorem, which include the product rate condition and appropriate moment conditions. Under clustering, the variance formula is adjusted using clustered standard errors.
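The variance formula translates directly into code. The sketch below assumes the cross-fitted residuals are already available as two lists; the function name `dml_inference`, the four-observation toy input, and the use of the normal critical value 1.96 for a 95% interval are choices made for this illustration. Since \(\hat V\) is the variance of \(\sqrt{n}(\hat\theta^{DML} - \theta_0)\), the standard error of \(\hat\theta^{DML}\) is \(\sqrt{\hat V / n}\).

```python
import math

def dml_inference(eps_hat, v_hat, z=1.96):
    """Point estimate, variance estimate, standard error and confidence interval."""
    n = len(v_hat)
    j_hat = sum(v * v for v in v_hat) / n
    theta_hat = sum(v * e for v, e in zip(v_hat, eps_hat)) / (n * j_hat)
    # Estimated score for each observation.
    psi = [v * (e - v * theta_hat) for v, e in zip(v_hat, eps_hat)]
    v_est = (sum(p * p for p in psi) / n) / j_hat ** 2
    se = math.sqrt(v_est / n)
    return theta_hat, v_est, se, (theta_hat - z * se, theta_hat + z * se)

# Tiny hand-checkable example (real applications need far more observations).
v_hat = [1.0, -1.0, 2.0, -2.0]
eps_hat = [2.1, -1.9, 4.2, -3.8]
theta_hat, v_est, se, ci = dml_inference(eps_hat, v_hat)
print(theta_hat, v_est, se, ci)
```

The four-point input is only there so the arithmetic can be followed by hand; the asymptotic justification for the interval applies to large samples.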

Conclusion

Double Machine Learning provides a principled solution to the problem of high-dimensional confounding in treatment effect estimation. Its two key innovations — Neyman orthogonality and cross-fitting — together ensure that flexible machine-learning estimation of nuisance functions does not bias inference on the parameter of interest. The framework is implemented in the DoubleML package in R and Python. For researchers who have access to rich covariate data and face potential confounding along many dimensions simultaneously, DML is now a standard tool.

References

  1. Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies, 81(2):608--650.
  2. Callaway, B. and Sant'Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2):200--230.
  3. Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1--C68.
  4. Nie, X. and Wager, S. (2021). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299--319.
  5. Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, Princeton, NJ.
  6. Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846--866.
  7. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1):267--288.
