Introduction
Few tools in the economist's toolkit inspire both as much admiration and as much scepticism as instrumental variables (IV). At its best, IV can credibly identify a causal effect from observational data in settings where randomisation is impossible and standard regression would be hopelessly confounded. At its worst, it can produce estimates that are wildly wrong, sensitive to small violations of unverifiable assumptions, and dressed up in the language of rigour to lend false confidence.
The method is old. Wright (1928) derived what we would today call an IV estimator to separate supply and demand curves for agricultural commodities in the 1920s. Haavelmo (1943) formalised the simultaneous equations model. But IV's modern renaissance in applied economics began with Angrist and Krueger (1991), who used quarter of birth as an instrument for educational attainment to estimate returns to schooling, and with Angrist (1990), who used the Vietnam draft lottery to estimate the earnings cost of military service. These papers, along with Card (1995) and the entire programme described in Angrist and Pischke (2009), brought IV to the centre of empirical practice.
This article traces the logic of IV, explains the LATE theorem that clarifies what IV estimates, discusses the threats that can render IV estimates misleading, and surveys the state of the art in applied IV practice.
The Logic of Instrumental Variables
The Problem of Endogeneity
Let the structural equation of interest be \[\begin{equation} Y_i = \alpha + \tau D_i + \varepsilon_i, \label{eq:structural} \end{equation}\] where \(Y_i\) is an outcome, \(D_i\) is a treatment (say, years of education), and \(\varepsilon_i\) captures all other determinants of \(Y_i\). The parameter \(\tau\) is the causal effect of interest.
OLS estimation of the structural equation requires \(\mathbb{E}[D_i \varepsilon_i] = 0\), that is, that the treatment is uncorrelated with the error term. When \(D_i\) is chosen by the individual (as education is), this assumption fails: ability, family background, and motivation are in \(\varepsilon_i\) and are correlated with educational attainment. OLS is then biased and inconsistent for \(\tau\).
An instrument \(Z_i\) breaks this impasse. To be valid, \(Z_i\) must satisfy:
- Relevance: \(\text{Cov}(Z_i, D_i) \neq 0\). The instrument must move treatment.
- Exclusion: \(\text{Cov}(Z_i, \varepsilon_i) = 0\). The instrument is uncorrelated with the structural error — i.e., it affects \(Y_i\) only through \(D_i\).
Given these conditions, the IV estimator is \[\begin{equation} \hat{\tau}^{\text{IV}} = \frac{\text{Cov}(Z_i, Y_i)}{\text{Cov}(Z_i, D_i)}. \label{eq:ivformula} \end{equation}\]
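A short derivation may help show where this ratio comes from. It is a sketch that uses only the structural equation and the two conditions above. Taking the covariance of both sides of the structural equation with \(Z_i\) gives \[\begin{aligned} \text{Cov}(Z_i, Y_i) &= \tau \, \text{Cov}(Z_i, D_i) + \text{Cov}(Z_i, \varepsilon_i) \\ &= \tau \, \text{Cov}(Z_i, D_i), \end{aligned}\] where the second line uses exclusion. Relevance guarantees that \(\text{Cov}(Z_i, D_i) \neq 0\), so one can divide through to obtain \(\tau = \text{Cov}(Z_i, Y_i) / \text{Cov}(Z_i, D_i)\). The estimator replaces the population covariances with their sample analogues.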
In the two-stage least squares (2SLS) implementation, the first stage regresses \(D_i\) on \(Z_i\) (and controls) to obtain \(\hat{D}_i\), and the second stage regresses \(Y_i\) on \(\hat{D}_i\). This "purges" \(D_i\) of its endogenous component.
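To make the mechanics concrete, the following is a minimal simulation sketch rather than a recommended estimation routine. The data-generating process, the variable names, and the parameter values (a true effect of 0.10 and an unobserved `ability` confounder) are assumptions chosen for illustration. It compares OLS, the covariance-ratio IV estimator, and a hand-rolled 2SLS.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
tau = 0.10  # true causal effect (assumed for this simulation)

# Hypothetical data-generating process: ability confounds D and Y.
ability = rng.normal(size=n)             # unobserved, ends up in the error term
z = rng.binomial(1, 0.5, size=n)         # binary instrument, independent of ability
d = 12 + 1.0 * z + 1.0 * ability + rng.normal(size=n)
y = 1.0 + tau * d + 0.5 * ability + rng.normal(size=n)


def ols_slope(x, outcome):
    """Slope from a bivariate regression of outcome on x (with intercept)."""
    c = np.cov(x, outcome)
    return c[0, 1] / c[0, 0]


# OLS of Y on D: contaminated by ability.
tau_ols = ols_slope(d, y)

# IV as a ratio of sample covariances.
tau_iv = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]

# 2SLS by hand: first stage D on Z, then Y on fitted D.
X1 = np.column_stack([np.ones(n), z])
d_hat = X1 @ np.linalg.lstsq(X1, d, rcond=None)[0]
tau_2sls = ols_slope(d_hat, y)

print(f"OLS:  {tau_ols:.3f}")
print(f"IV:   {tau_iv:.3f}")
print(f"2SLS: {tau_2sls:.3f}")
```

With one instrument and no controls, the covariance ratio and 2SLS coincide numerically. The standard errors from a manually run second stage are not valid, because they ignore that \(\hat{D}_i\) is itself estimated; dedicated IV routines correct for this.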
The LATE Theorem
The classical presentation above treats \(\tau\) as a constant. Imbens and Angrist (1994) and Angrist et al. (1996) showed that in a heterogeneous-treatment-effects world, IV identifies something more specific: the local average treatment effect (LATE) for compliers.
Define potential treatments \(D_i(0)\) and \(D_i(1)\) as the treatment status individual \(i\) would take when \(Z_i = 0\) and \(Z_i = 1\), respectively. Four types of individuals emerge:
- Always-takers: \(D_i(0) = D_i(1) = 1\) — take treatment regardless of \(Z_i\).
- Never-takers: \(D_i(0) = D_i(1) = 0\) — never take treatment.
- Compliers: \(D_i(0) = 0\), \(D_i(1) = 1\) — comply with the instrument.
- Defiers: \(D_i(0) = 1\), \(D_i(1) = 0\) — do the opposite (ruled out by monotonicity).
Under the assumptions of (i) independence (\(Z_i \perp (D_i(0), D_i(1), Y_i(0), Y_i(1))\)), (ii) exclusion (\(Y_i(d, z) = Y_i(d)\)), (iii) relevance, and (iv) monotonicity (no defiers), the IV estimator identifies: \[\begin{equation} \tau^{\text{LATE}} = \mathbb{E}[Y_i(1) - Y_i(0) \mid \text{complier}]. \label{eq:late} \end{equation}\]
This is the average treatment effect for the subpopulation whose treatment status is changed by the instrument. Always-takers and never-takers contribute nothing to identification — only compliers do.
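The result can be illustrated with a small simulation. This is a sketch under assumed values: the type shares (20% always-takers, 30% never-takers, 50% compliers) and the type-specific effects are hypothetical, and there are no defiers by construction. The ratio of the reduced form to the first stage (the Wald estimator) recovers the complier effect, not the population average effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical population of compliance types; no defiers (monotonicity holds).
types = rng.choice(["always", "never", "complier"], size=n, p=[0.2, 0.3, 0.5])
z = rng.binomial(1, 0.5, size=n)  # randomly assigned binary instrument

# Potential treatments D(0) and D(1) implied by each type.
d0 = (types == "always").astype(int)
d1 = ((types == "always") | (types == "complier")).astype(int)
d = np.where(z == 1, d1, d0)

# Heterogeneous treatment effects by type (assumed values).
effect = np.where(types == "complier", 1.0,
                  np.where(types == "always", 4.0, 0.0))
y0 = rng.normal(size=n)
y1 = y0 + effect
y = np.where(d == 1, y1, y0)  # exclusion: Z affects Y only through D

# Wald estimator: reduced form divided by first stage.
reduced_form = y[z == 1].mean() - y[z == 0].mean()
first_stage = d[z == 1].mean() - d[z == 0].mean()
wald = reduced_form / first_stage

late = effect[types == "complier"].mean()  # complier average effect
ate = effect.mean()                        # population average effect

print(f"Wald/IV estimate: {wald:.3f}")
print(f"LATE (compliers): {late:.3f}")
print(f"ATE (everyone):   {ate:.3f}")
```

In this setup the first stage also estimates the complier share, which is one reason it is worth reporting alongside the IV estimate.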
The LATE interpretation has profound implications. When Angrist and Krueger (1991) estimate the return to schooling using quarter of birth as an instrument, they identify the returns for men who would have left school earlier but stayed longer because of compulsory schooling laws. This is a specific group that need not represent the broader population.
Famous Applications
Returns to Schooling
Angrist and Krueger (1991) argued that quarter of birth is a valid instrument for years of schooling. Under compulsory schooling laws, children must remain in school until they reach a specified age (typically 16). Children born earlier in the year reach this age earlier and can legally drop out sooner, resulting in slightly lower average schooling. Because birth quarter is plausibly random with respect to ability and family background, it provides exogenous variation in schooling.
Using data from the 1980 US Census, Angrist and Krueger found returns to schooling of around 7–10%, comparable to OLS estimates. The paper triggered enormous debate: Bound et al. (1995) showed that the quarter-of-birth instruments were only weakly correlated with schooling, so that the IV estimates could be severely biased and unreliable even in very large samples. It was an early warning about the weak-instruments problem.
The Shift-Share (Bartik) Instrument
Bartik (1991) proposed constructing local labour demand shocks by interacting national industry employment growth rates with local industry composition. The idea: if a city has a large share of workers in an industry that grows nationally, it receives a positive local demand shock that reflects national trends rather than local conditions. This "shift-share" or "Bartik" instrument has been used across dozens of applications in labour, housing, and development economics.
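The construction described above amounts to a share-weighted sum of national growth rates. The sketch below uses made-up numbers for two cities and three industries; the array names `shares` and `national_growth` are illustrative, not taken from any particular paper or package.

```python
import numpy as np

# Hypothetical local industry employment shares (rows: cities, columns: industries).
# Each row sums to one.
shares = np.array([
    [0.6, 0.3, 0.1],   # city A
    [0.2, 0.2, 0.6],   # city B
])

# Hypothetical national employment growth rates by industry (the "shifts").
national_growth = np.array([0.05, -0.02, 0.10])

# Shift-share instrument: for each city, sum over industries of share * shift.
bartik = shares @ national_growth

for city, value in zip(["A", "B"], bartik):
    print(f"city {city}: predicted demand shock = {value:.3f}")
```

City B scores higher because it is concentrated in the fastest-growing industry. Applied work typically fixes the shares at a pre-period date so that they are not themselves affected by the shocks.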
Goldsmith-Pinkham et al. (2020) show that, in their framework, the Bartik instrument's validity rests on the exogeneity of the shares (local industry composition), not the shifts (national growth rates). If cities chose their industrial composition in ways correlated with future local outcomes, the instrument is invalid. Borusyak et al. (2022) provide an alternative interpretation in which identification comes from many as-good-as-randomly assigned shifts, so that validity depends on the orthogonality of the shifts rather than the shares.
Judge Fixed Effects
A creative and increasingly common instrument in criminal justice research uses the random assignment of defendants to judges with systematically different sentencing tendencies. Judges vary in their leniency: some routinely impose short sentences, others long ones. Provided that assignment to judges is effectively random, this variation can serve as an instrument for incarceration length. Kling (2006) uses this approach to estimate the effect of incarceration on earnings.
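One common way to turn judge tendencies into a single instrument is a leave-one-out leniency measure: for each defendant, the average sentence the assigned judge gave to all other defendants. The sketch below is an assumption about how such a measure might be computed, not a description of the procedure in any cited paper. The function name and the toy data are hypothetical.

```python
import numpy as np


def leave_one_out_leniency(judge_ids, sentence):
    """Mean sentence given by each defendant's judge to all OTHER defendants.

    Excluding the defendant's own case avoids a mechanical correlation
    between the instrument and that defendant's own sentence.
    """
    judge_ids = np.asarray(judge_ids)
    sentence = np.asarray(sentence, dtype=float)

    # Map each case to an integer judge index and count cases per judge.
    _, inverse, counts = np.unique(judge_ids, return_inverse=True, return_counts=True)
    inverse = inverse.ravel()
    cases_per_judge = counts[inverse]
    if np.any(cases_per_judge < 2):
        raise ValueError("every judge needs at least two cases")

    # Judge-level sentence totals, minus the defendant's own sentence.
    totals = np.bincount(inverse, weights=sentence)
    return (totals[inverse] - sentence) / (cases_per_judge - 1)


# Toy example: judge "a" is harsher than judge "b".
judges = ["a", "a", "a", "b", "b"]
months = [12, 24, 36, 6, 18]
print(leave_one_out_leniency(judges, months))
```

In practice the measure is usually computed within court-by-time cells, since random assignment typically holds only conditional on where and when a case was filed.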
The Pathologies of Bad IV
Weak Instruments
When the first-stage relationship is weak, IV has three devastating properties: (i) estimates are biased in the direction of OLS, even in large samples; (ii) confidence intervals are unreliable and often far too narrow; (iii) small violations of the exclusion restriction are amplified dramatically.
The third property follows from the probability limit of the estimator. With a single instrument, \[\begin{equation} \text{plim}(\hat{\tau}^{\text{2SLS}}) = \tau + \frac{\text{Cov}(Z_i, \varepsilon_i) / \text{Var}(Z_i)}{\text{Cov}(Z_i, D_i) / \text{Var}(Z_i)}. \label{eq:weakbias} \end{equation}\] The numerator of the bias term is the slope from a regression of the error on the instrument, which is zero if exclusion holds exactly. The denominator is the first-stage slope, which shrinks towards zero as the instrument weakens. A small violation of exclusion is thus magnified into a large bias when instruments are weak.
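A quick numerical illustration of this amplification follows. The covariance values are hypothetical and chosen only to show the arithmetic; \(\text{Var}(Z_i)\) cancels from the ratio, so the helper works directly with the two covariances.

```python
def iv_asymptotic_bias(cov_z_eps: float, cov_z_d: float) -> float:
    """Bias term Cov(Z, eps) / Cov(Z, D) in the probability limit of IV."""
    if cov_z_d == 0:
        raise ValueError("relevance fails: Cov(Z, D) is zero")
    return cov_z_eps / cov_z_d


violation = 0.01  # hypothetical small exclusion violation, Cov(Z, eps)

bias_strong = iv_asymptotic_bias(violation, cov_z_d=0.50)  # strong first stage
bias_weak = iv_asymptotic_bias(violation, cov_z_d=0.02)    # weak first stage

print(f"bias with strong instrument: {bias_strong:.3f}")
print(f"bias with weak instrument:   {bias_weak:.3f}")
print(f"amplification factor:        {bias_weak / bias_strong:.0f}x")
```

The same violation produces a bias 25 times larger when the first-stage covariance is 25 times smaller, which is why exclusion and instrument strength cannot be assessed separately.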
Staiger and Stock (1997) recommend the rule of thumb \(F > 10\) as a minimum; Stock and Yogo (2005) provide formal critical values. With multiple endogenous regressors, the Sanderson-Windmeijer (2016) conditional \(F\)-statistic is preferred.
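For reference, the first-stage \(F\)-statistic in the simplest case can be computed as below. This is a sketch that assumes homoskedastic errors and no control variables, so the test of the instruments is the overall regression \(F\)-test. The function name and the simulated data are illustrative.

```python
import numpy as np


def first_stage_f(d, Z):
    """F-statistic for H0: all instrument coefficients in the first stage are zero.

    Assumes homoskedastic errors and a first stage with only a constant
    and the instruments (no additional controls).
    """
    d = np.asarray(d, dtype=float)
    Z = np.asarray(Z, dtype=float).reshape(len(d), -1)
    n, k = Z.shape

    X = np.column_stack([np.ones(n), Z])
    beta = np.linalg.lstsq(X, d, rcond=None)[0]
    rss_unrestricted = np.sum((d - X @ beta) ** 2)
    rss_restricted = np.sum((d - d.mean()) ** 2)  # constant-only model

    return ((rss_restricted - rss_unrestricted) / k) / (rss_unrestricted / (n - k - 1))


rng = np.random.default_rng(2)
n = 1_000
z = rng.normal(size=n)
d_strong = 0.5 * z + rng.normal(size=n)   # instrument clearly moves treatment
d_weak = 0.02 * z + rng.normal(size=n)    # instrument barely moves treatment

f_strong = first_stage_f(d_strong, z)
f_weak = first_stage_f(d_weak, z)
print(f"F (strong first stage): {f_strong:.1f}")
print(f"F (weak first stage):   {f_weak:.1f}")
```

With controls in the model, the relevant statistic tests only the excluded instruments, and with heteroskedastic or clustered errors a robust version is needed.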
Exclusion Restriction Violations
The exclusion restriction is untestable with a single instrument. With overidentification (more instruments than endogenous regressors), the Sargan-Hansen \(J\)-test can detect some violations, but it has low power. Because it only checks whether the instruments agree with one another, it cannot detect violations that shift every instrument's estimate in the same way.
Quarter of birth provides a cautionary tale. Bound et al. (1995) noted that children born in different quarters differ not only in schooling but also in age at school entry, which may directly affect cognitive development and earnings through channels unrelated to years of schooling. If so, the exclusion restriction fails.
The LATE Is Local
Even valid IV estimates may be policy-irrelevant. The LATE is defined for compliers at the margin of the instrument, a potentially small and unusual subpopulation. Heckman (1997) emphasises that policymakers often need to know the average treatment effect (ATE) or the effect for a specific target group, neither of which IV necessarily identifies.
State of the Art in Applied IV
Modern applied IV practice incorporates several advances. Andrews et al. (2019) survey methods for weak-instrument-robust inference, recommending the Anderson-Rubin test (particularly in just-identified models) and the conditional likelihood ratio (CLR) test over standard Wald-based inference whenever instruments may be weak. Lee et al. (2022) show that even with \(F > 10\) the conventional \(t\)-ratio test can over-reject, that a first-stage \(F\) above roughly 100 is needed for the usual critical values to be reliable, and that their adjusted confidence intervals may be substantially wider than naive ones.
Machine learning has entered IV estimation through the double machine learning (DML) framework of Chernozhukov et al. (2018), which allows high-dimensional controls in IV settings while maintaining valid inference. The DoubleML package implements these methods in Python and R.
Shift-share designs have their own growing literature on validity conditions. Borusyak et al. (2022) establish that consistent estimation requires many industries with no single industry dominating the shock, and that the industry-level shocks must be independent of industry-level confounders.
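The Anderson-Rubin approach can be sketched by test inversion. For each candidate value \(\tau_0\), regress \(Y_i - \tau_0 D_i\) on the instrument and keep \(\tau_0\) if the instrument's coefficient is not significantly different from zero. The code below is a minimal illustration under strong assumptions: a single instrument, no controls, homoskedastic errors, a hard-coded 95% chi-square(1) critical value, and a hypothetical grid and data-generating process.

```python
import numpy as np

CHI2_1_95 = 3.841  # 95% critical value of a chi-square with 1 degree of freedom


def anderson_rubin_set(y, d, z, grid):
    """Grid values tau0 not rejected by the Anderson-Rubin test at the 5% level."""
    y, d, z = (np.asarray(a, dtype=float) for a in (y, d, z))
    n = len(y)
    X = np.column_stack([np.ones(n), z])
    XtX_inv = np.linalg.inv(X.T @ X)

    accepted = []
    for tau0 in grid:
        r = y - tau0 * d                       # outcome net of the hypothesised effect
        beta = XtX_inv @ (X.T @ r)             # regress it on the instrument
        resid = r - X @ beta
        sigma2 = resid @ resid / (n - 2)
        stat = beta[1] ** 2 / (sigma2 * XtX_inv[1, 1])  # squared t-statistic on z
        if stat <= CHI2_1_95:
            accepted.append(tau0)
    return np.array(accepted)


# Hypothetical data with a strong instrument and a true effect of 0.10.
rng = np.random.default_rng(3)
n = 5_000
u = rng.normal(size=n)
z = rng.binomial(1, 0.5, size=n)
d = 1.0 * z + u + rng.normal(size=n)
y = 0.10 * d + 0.5 * u + rng.normal(size=n)

ar_set = anderson_rubin_set(y, d, z, grid=np.linspace(-1.0, 1.0, 201))
print(f"AR 95% confidence set: [{ar_set.min():.2f}, {ar_set.max():.2f}]")
```

Unlike a Wald interval, this set has correct coverage regardless of instrument strength. With a weak instrument it can become very wide or unbounded, in which case the grid limits here would simply be hit.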
Conclusion
Instrumental variables remain one of the most important identification strategies in applied economics. The LATE theorem gives them a precise interpretation; the weak-instruments literature teaches us when to be cautious; and the shift-share, judge-fixed-effects, and draft-lottery literatures show how imaginative applied researchers continue to find credible instruments for new questions.
But the history of IV is also a history of abuse: instruments that barely move treatment, exclusion restrictions that strain credulity, and LATE estimates presented as if they answered policy questions they cannot. The lesson for readers and producers of applied research is clear: demand and supply good instruments, scrutinise the exclusion restriction rigorously, test for weak instruments, and be honest about for whom and where the LATE applies.
References
- Andrews, I., Stock, J. H., and Sun, L. (2019). Weak instruments in instrumental variables regression: Theory and practice. Annual Review of Economics, 11:727--753.
- Angrist, J. D. (1990). Lifetime earnings and the Vietnam era draft lottery: Evidence from Social Security administrative records. American Economic Review, 80(3):313--336.
- Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444--455.
- Angrist, J. D. and Krueger, A. B. (1991). Does compulsory school attendance affect schooling and earnings? Quarterly Journal of Economics, 106(4):979--1014.
- Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, Princeton, NJ.
- Bartik, T. J. (1991). Who Benefits from State and Local Economic Development Policies? W.E. Upjohn Institute for Employment Research, Kalamazoo, MI.
- Borusyak, K., Hull, P., and Jaravel, X. (2022). Quasi-experimental shift-share research designs. Review of Economic Studies, 89(1):181--213.
- Bound, J., Jaeger, D. A., and Baker, R. M. (1995). Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association, 90(430):443--450.
- Card, D. (1995). Using geographic variation in college proximity to estimate the return to schooling. In Christofides, L., Grant, E., and Swidinsky, R., editors, Aspects of Labour Market Behaviour. University of Toronto Press, Toronto.
- Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal, 21(1):C1--C68.
- Goldsmith-Pinkham, P., Sorkin, I., and Swift, H. (2020). Bartik instruments: What, when, why, and how. American Economic Review, 110(8):2586--2624.
- Haavelmo, T. (1943). The statistical implications of a system of simultaneous equations. Econometrica, 11(1):1--12.
- Heckman, J. J. (1997). Instrumental variables: A study of implicit behavioral assumptions used in making program evaluations. Journal of Human Resources, 32(3):441--462.
- Imbens, G. W. and Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62(2):467--475.
- Kling, J. R. (2006). Incarceration length, employment, and earnings. American Economic Review, 96(3):863--876.
- Lee, D. S., McCrary, J., Moreira, M. J., and Porter, J. (2022). Valid \(t\)-ratio inference for IV. American Economic Review, 112(10):3260--3290.
- Staiger, D. and Stock, J. H. (1997). Instrumental variables regression with weak instruments. Econometrica, 65(3):557--586.
- Stock, J. H. and Yogo, M. (2005). Testing for weak instruments in linear IV regression. In Andrews, D. and Stock, J., editors, Identification and Inference for Econometric Models. Cambridge University Press, Cambridge.
- Wright, P. G. (1928). The Tariff on Animal and Vegetable Oils. Macmillan, New York.