The Problem: High-Dimensional Confounding
The canonical partial linear model is: \[\begin{align} Y_i &= D_i \theta_0 + g_0(X_i) + \varepsilon_i, \quad \mathbb{E}[\varepsilon_i \mid D_i, X_i] = 0 \label{eq:plm} \\ D_i &= m_0(X_i) + v_i, \quad \mathbb{E}[v_i \mid X_i] = 0 \label{eq:first} \end{align}\] where \(Y_i\) is the outcome, \(D_i\) is the treatment (possibly continuous), \(X_i \in \mathbb{R}^p\) is a vector of controls, \(\theta_0\) is the treatment effect parameter of interest, \(g_0\) and \(m_0\) are unknown nuisance functions, and \(\varepsilon_i\) and \(v_i\) are error terms.
When \(p\) is large (potentially \(p > n\)), the nuisance functions \(g_0\) and \(m_0\) cannot be estimated consistently by OLS. One approach is to use a penalised regression such as LASSO. However, if LASSO is naively applied to estimate \(g_0\) and the residual \(Y_i - \hat{g}(X_i)\) is then regressed on \(D_i\), the resulting estimate of \(\theta_0\) will be biased. This "regularisation bias" arises because the LASSO fit of the outcome equation shrinks towards zero the part of \(Y_i\) that loads on controls correlated with \(D_i\), and the shrinkage does not disappear in large samples: the resulting bias is of order \(1/\sqrt{n}\) or larger, rather than \(o(1/\sqrt{n})\), so it does not vanish after \(\sqrt{n}\)-scaling and it inflates test statistics.
Neyman Orthogonality
The DML framework addresses regularisation bias through a technique from semiparametric efficiency theory: Neyman orthogonality. Consider a moment condition \(\psi(W_i; \theta, \eta)\) for the parameter \(\theta\), where \(\eta = (g, m)\) is the nuisance parameter. The moment condition is Neyman-orthogonal if: \[ \partial_\eta \mathbb{E}[\psi(W_i; \theta_0, \eta_0)][\Delta \eta] = 0 \quad \text{for all } \Delta\eta \] That is, the Gateaux derivative of the expected moment with respect to \(\eta\), evaluated at the truth, is zero. This means that small errors in estimating \(\eta\) do not translate into first-order bias in the estimator of \(\theta\).
For the partial linear model, the naive moment condition is \(\mathbb{E}[(Y_i - D_i\theta - g(X_i))D_i] = 0\). This is not Neyman-orthogonal: the Gateaux derivative with respect to \(g\) in direction \(\Delta g\) is \(-\mathbb{E}[D_i \Delta g(X_i)]\), which is nonzero in general because \(D_i\) is correlated with functions of \(X_i\). A bias of order \(O(\|\hat{g} - g_0\|)\) in estimating \(g_0\) therefore translates into a bias of the same order in \(\hat\theta\), which does not vanish at the \(\sqrt{n}\)-rate.
The orthogonal moment condition is obtained by "partialling out" the confounders from both the outcome and the treatment. Let \(\ell_0(X_i) = \mathbb{E}[Y_i \mid X_i] = m_0(X_i)\theta_0 + g_0(X_i)\) denote the conditional mean of the outcome; this, rather than \(g_0\) itself, is what a flexible regression of \(Y\) on \(X\) recovers. The orthogonal score is: \[ \psi(W_i; \theta, \ell, m) = (Y_i - \ell(X_i) - (D_i - m(X_i))\theta)(D_i - m(X_i)) \] Setting \(\mathbb{E}[\psi] = 0\) at the truth yields: \[ \theta_0 = \frac{\mathbb{E}[(Y_i - \ell_0(X_i))(D_i - m_0(X_i))]}{\mathbb{E}[(D_i - m_0(X_i))^2]} \] This is the Frisch–Waugh–Lovell (FWL) representation: regress the outcome residual \(Y_i - \ell_0(X_i)\) on the treatment residual \(D_i - m_0(X_i)\). The Gateaux derivative of the expected moment with respect to \((\ell, m)\) is zero at the truth, confirming Neyman orthogonality.
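A quick numerical check makes the distinction concrete. The simulation below (an invented DGP, purely for illustration) perturbs the outcome nuisance by a small amount \(\epsilon\,\Delta(X)\): the naive moment moves at first order in \(\epsilon\), while the orthogonal moment barely moves.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta0 = 200_000, 0.5

# Illustrative DGP satisfying the partial linear model.
X = rng.normal(size=n)
m0 = 0.5 * X                       # true E[D | X]
g0 = X ** 2                        # true g_0
D = m0 + rng.normal(size=n)
Y = D * theta0 + g0 + rng.normal(size=n)
ell0 = m0 * theta0 + g0            # true conditional mean E[Y | X]

def naive_moment(g):
    # E[(Y - D*theta - g(X)) * D]  -- not Neyman-orthogonal in g
    return np.mean((Y - D * theta0 - g) * D)

def orthogonal_moment(ell):
    # E[(Y - ell(X) - (D - m0(X))*theta) * (D - m0(X))]
    return np.mean((Y - ell - (D - m0) * theta0) * (D - m0))

# Perturb the outcome nuisance by eps * Delta(X) with Delta(X) = X.
eps, delta = 0.1, X
d_naive = naive_moment(g0 + eps * delta) - naive_moment(g0)
d_orth = orthogonal_moment(ell0 + eps * delta) - orthogonal_moment(ell0)
print(abs(d_naive), abs(d_orth))  # naive moment moves first-order; orthogonal one barely moves
```

The change in the orthogonal moment is pure sampling noise (the population derivative is exactly zero), while the naive moment shifts by roughly \(\epsilon\,\mathbb{E}[D_i\Delta(X_i)]\).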
Cross-Fitting
Even with Neyman-orthogonal moments, there is a second source of bias when the nuisance estimators are computed on the same data used to estimate \(\theta\): overfitting bias. Complex machine learners (random forests, neural networks, LASSO) can overfit the training data, which means the residuals computed from a model trained on the full sample are smaller than the true residuals. This leads to underestimation of the residual variance and bias in \(\hat\theta\).
The solution is cross-fitting. Partition the data into \(K\) folds. For each fold \(k\):
- Estimate \(\hat{\ell}^{(-k)}\) and \(\hat{m}^{(-k)}\) using all observations not in fold \(k\), where \(\hat{\ell}^{(-k)}\) is a flexible regression of \(Y\) on \(X\) (estimating \(\ell_0(X) = \mathbb{E}[Y \mid X]\)) and \(\hat{m}^{(-k)}\) is a flexible regression of \(D\) on \(X\).
- Compute residuals \(\hat\varepsilon_i = Y_i - \hat{\ell}^{(-k)}(X_i)\) and \(\hat v_i = D_i - \hat{m}^{(-k)}(X_i)\) for \(i\) in fold \(k\).
- Pool the residuals across all folds and estimate \(\theta_0\) by the residual-on-residual regression \(\hat\theta^{DML} = \sum_i \hat v_i \hat\varepsilon_i \big/ \sum_i \hat v_i^2\).
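The procedure can be sketched end-to-end with scikit-learn. The random-forest learners and the DGP below are illustrative choices (production use would typically go through the DoubleML package, which wraps exactly this logic):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
n, p, theta0 = 2000, 10, 0.5

# Illustrative DGP: nonlinear nuisances, 10 controls.
X = rng.normal(size=(n, p))
m0 = np.sin(X[:, 0]) + 0.5 * X[:, 1]
g0 = X[:, 0] ** 2 - X[:, 2]
D = m0 + rng.normal(size=n)
Y = D * theta0 + g0 + rng.normal(size=n)

K = 5
eps_hat = np.zeros(n)  # outcome residuals Y - l_hat(X)
v_hat = np.zeros(n)    # treatment residuals D - m_hat(X)

for train, test in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    # Nuisance fits use the K-1 training folds only (cross-fitting).
    l_hat = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[train], Y[train])
    m_hat = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[train], D[train])
    eps_hat[test] = Y[test] - l_hat.predict(X[test])
    v_hat[test] = D[test] - m_hat.predict(X[test])

# Final stage: residual-on-residual (FWL) regression.
theta_hat = np.sum(v_hat * eps_hat) / np.sum(v_hat ** 2)
print(theta_hat)
```

Because every observation's residuals come from models fit without that observation, the overfitting bias described above is avoided.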
Theorem (Chernozhukov et al., 2018). Under regularity conditions, including the product rate condition \(\|\hat{g} - g_0\|_2 \cdot \|\hat{m} - m_0\|_2 = o_p(n^{-1/2})\), the DML estimator satisfies: \[ \sqrt{n}(\hat\theta^{DML} - \theta_0) \xrightarrow{d} \mathcal{N}(0, V) \] where \(V = J_0^{-2} \mathbb{E}[\psi^2]\) with \(J_0 = \mathbb{E}[v_i^2]\).
The key condition \(\|\hat{g} - g_0\|_2 \cdot \|\hat{m} - m_0\|_2 = o_p(n^{-1/2})\) is a product condition: neither nuisance estimator needs to converge at the parametric rate; it is enough that each converges faster than \(n^{-1/4}\) in \(L_2\), since the product of two such rates is \(o_p(n^{-1/2})\). Under sparsity, LASSO achieves \(\|\hat{g} - g_0\|_2 = O_p(\sqrt{s\log p/n})\), where \(s\) is the number of nonzero coefficients; the product of two such rates is \(O_p(s\log p / n)\), which satisfies the condition if \(s\log p = o(\sqrt{n})\).
The Post-Double-Selection LASSO
Belloni et al. (2014) propose a related approach: post-double-selection (PDS) LASSO. The procedure is:
- Regress \(Y\) on \(X\) using LASSO and record the selected covariates \(\hat{S}_1\).
- Regress \(D\) on \(X\) using LASSO and record the selected covariates \(\hat{S}_2\).
- Regress \(Y\) on \(D\) and \(X_{\hat{S}_1 \cup \hat{S}_2}\) using OLS.
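The three steps above can be sketched with scikit-learn under an invented sparse DGP. Note one substitution: `LassoCV` tunes the penalty by cross-validation, whereas Belloni et al. derive a theoretically motivated plug-in penalty level.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(1)
n, p, theta0 = 500, 100, 1.0

# Sparse linear DGP: only the first few of 100 controls matter.
X = rng.normal(size=(n, p))
D = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
Y = D * theta0 + 2.0 * X[:, 0] + X[:, 2] + rng.normal(size=n)

# Step 1: LASSO of Y on X; Step 2: LASSO of D on X.
s1 = np.flatnonzero(LassoCV(cv=5, random_state=0).fit(X, Y).coef_)
s2 = np.flatnonzero(LassoCV(cv=5, random_state=0).fit(X, D).coef_)

# Step 3: OLS of Y on D and the union of selected controls.
sel = np.union1d(s1, s2)
Z = np.column_stack([D, X[:, sel]])
theta_hat = LinearRegression().fit(Z, Y).coef_[0]
print(theta_hat)
```

The union step is what protects inference: a control missed by one selection (because its coefficient is moderate in that equation) is still retained if the other selection picks it up.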
The DML framework generalises PDS by allowing arbitrary machine learners for the nuisance functions, not just LASSO. This is important when the true nuisance functions are not well-approximated by sparse linear models.
Extensions
Interactive regression model. When the treatment effect itself depends on covariates — i.e., \(Y_i = D_i \theta(X_i) + g_0(X_i) + \varepsilon_i\) — DML can be used to estimate the CATE \(\theta(x)\) by replacing the single coefficient \(\theta_0\) with a function. The R-learner of Nie and Wager (2021) implements this approach.
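A stripped-down version of the R-learner idea, with a linear CATE model for illustration (true nuisances are plugged in to keep the sketch short; Nie and Wager use cross-fitted nuisance estimates and a penalised function class):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

# Illustrative DGP with heterogeneous effect theta(x) = 1 + 0.5*x1.
X = rng.normal(size=(n, 2))
theta_x = 1.0 + 0.5 * X[:, 0]
m0 = 0.3 * X[:, 1]                       # E[D | X]
D = m0 + rng.normal(size=n)
Y = D * theta_x + X[:, 1] ** 2 + rng.normal(size=n)

# Residualise with the true nuisances for brevity; in practice these
# come from cross-fitted ML regressions of Y on X and D on X.
v = D - m0                                # treatment residual
eps = Y - (m0 * theta_x + X[:, 1] ** 2)   # outcome residual Y - E[Y|X]

# R-learner with a linear CATE model theta(x) = b0 + b1*x1 + b2*x2:
# minimise sum_i (eps_i - theta(X_i) * v_i)^2, which is ordinary
# least squares in the scaled design v_i * [1, x1_i, x2_i].
design = v[:, None] * np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(design, eps, rcond=None)
print(beta)  # approximately [1.0, 0.5, 0.0]
```

The same residual-on-residual logic as before, except the coefficient on the treatment residual is now allowed to vary with \(X\).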
IV with high-dimensional controls. DML extends naturally to IV settings where both the first stage and the reduced form have high-dimensional controls. The moment condition is modified to use the instrument residual rather than the treatment residual.
DiD with DML. Callaway and Sant'Anna (2021) and subsequent work use DML-style nuisance estimation in the doubly robust DiD estimator. The conditional propensity score and outcome regression are estimated by flexible methods, and cross-fitting is used to avoid overfitting bias.
Variance Estimation and Inference
The variance of \(\hat\theta^{DML}\) is estimated by: \[ \hat{V} = \hat{J}_0^{-2} \frac{1}{n}\sum_{i=1}^n \hat\psi_i^2 \] where \(\hat{J}_0 = n^{-1}\sum_i \hat{v}_i^2\) and \(\hat\psi_i = \hat{v}_i(\hat\varepsilon_i - \hat{v}_i \hat\theta^{DML})\).
Standard errors and confidence intervals from this formula are valid under the conditions of the theorem, which include the product rate condition and appropriate moment conditions. Under clustering, the variance formula is adjusted using clustered standard errors.
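Given pooled cross-fitted residuals and \(\hat\theta^{DML}\), the variance formula translates into a few lines. The residuals below are simulated stand-ins (true-nuisance residuals, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(7)
n, theta0 = 10_000, 0.5

# Stand-ins for cross-fitted residuals.
v = rng.normal(size=n)                 # v_hat = D - m_hat(X)
eps = v * theta0 + rng.normal(size=n)  # eps_hat = Y - l_hat(X)

# Point estimate: residual-on-residual regression.
theta_hat = np.sum(v * eps) / np.sum(v ** 2)

# Plug-in variance: V_hat = J0_hat^{-2} * mean(psi_hat^2).
psi = v * (eps - v * theta_hat)
J0 = np.mean(v ** 2)
V = np.mean(psi ** 2) / J0 ** 2
se = np.sqrt(V / n)
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
print(theta_hat, se, ci)
```

For clustered data, the same score contributions \(\hat\psi_i\) would be summed within clusters before computing the variance.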
Conclusion
Double Machine Learning provides a principled solution to the problem of high-dimensional confounding in treatment effect estimation. Its two key innovations — Neyman orthogonality and cross-fitting — together ensure that flexible machine-learning estimation of nuisance functions does not bias inference on the parameter of interest. The framework is implemented in the DoubleML package in R and Python. For researchers who have access to rich covariate data and face potential confounding along many dimensions simultaneously, DML is now a standard tool.
References
- Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. Review of Economic Studies, 81(2):608--650.
- Callaway, B. and Sant'Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2):200--230.
- Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1--C68.
- Nie, X. and Wager, S. (2021). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299--319.
- Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, Princeton, NJ.
- Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846--866.
- Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58(1):267--288.