Beginner's Corner

What Does "Identification" Mean? A Beginner's Guide to Causal Identification

1 Motivation: The Ambiguous Regression

Suppose you are a researcher trying to understand whether attending a selective university causes higher lifetime earnings. You collect data on earnings and university selectivity and run the regression:

           Earnings = α + β · Selectivity + ε

You find β̂ = 20,000: students at selective universities earn $20,000 more per year on average. Does attending a selective university cause $20,000 higher earnings?

Almost certainly not, or at least not by that amount. Students who attend selective universities are also more talented, come from wealthier families, and have better professional networks. Even if a selective university added zero value, students there would still earn more. The OLS estimate of β mixes up the causal effect of the university with the pre-existing advantages of the students who attend it.

This is the identification problem. The number β̂ is well-defined statistically: we can compute it precisely. But it does not identify the causal parameter we care about, which is the effect of selectivity itself on earnings. Understanding what "identification" means is the foundational skill in modern causal inference.

2 What Does it Mean to Identify a Causal Effect?

In econometrics, a parameter is identified if it can be uniquely recovered from the observable data distribution (the joint distribution of outcomes, treatments, and covariates) given the assumptions of the model [Manski, 1995].

Note carefully what identification is not:

  • It is not a statement about sample size. A parameter can be identified even in a sample of 10 and unidentified even with 10 million observations.
  • It is not the same as estimation. Identification is a population-level concept: does the true (population) distribution of the data uniquely pin down the parameter?
  • It is not the same as statistical significance. An unidentified parameter can be estimated with great precision; it is just estimating the wrong thing.

Informally: a causal effect is identified if, with infinite data, you could recover the true effect from what you observe. If not, that is, if two different true causal effects would produce the same observable data distribution, the effect is unidentified.
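The informal definition can be made concrete with a small numerical sketch. It assumes a deliberately simple setup that is not part of the text above: a linear model Y = β·D + U with mean-zero, jointly normal (D, U), where only (D, Y) is observed. The function name `observable_cov` and all parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

def observable_cov(beta, cov_du, var_u, var_d=1.0):
    """Population covariance matrix of the observables (D, Y) when Y = beta*D + U."""
    # Covariance of (D, U); U is never observed by the researcher.
    sigma_du = np.array([[var_d, cov_du],
                         [cov_du, var_u]])
    # Linear map taking (D, U) to (D, Y).
    a = np.array([[1.0, 0.0],
                  [beta, 1.0]])
    return a @ sigma_du @ a.T

# World A: true effect 1.0, with confounding (Cov(D, U) = 0.5).
world_a = observable_cov(beta=1.0, cov_du=0.5, var_u=1.25)
# World B: true effect 1.5, with no confounding at all.
world_b = observable_cov(beta=1.5, cov_du=0.0, var_u=1.0)

print(world_a)
print(world_b)
print("Same observable distribution:", np.allclose(world_a, world_b))
```

Under these assumptions the two worlds imply the same joint distribution of (D, Y), even though the causal effect differs. No amount of data on (D, Y) alone can tell them apart, which is what it means for β to be unidentified without further assumptions.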

3 The Source of Non-Identification: Endogeneity

Why is the return to selective universities unidentified by OLS? Because the treatment (attending a selective university) is endogenous: it is correlated with the error term in the causal model. Suppose the true causal model is:

\[
\text{Earnings}_i = \alpha + \beta \cdot \text{Selectivity}_i + \underbrace{\varepsilon_i}_{\text{Ability}_i + \text{Background}_i} \tag{1}
\]

where β is the causal effect of selectivity and εᵢ captures unobservable ability and family background. Since more able students attend more selective universities, Corr(Selectivityᵢ, εᵢ) > 0. OLS estimates:

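\[
\hat{\beta}_{OLS} \xrightarrow{p} \beta + \underbrace{\frac{\mathrm{Cov}(\text{Selectivity}_i, \varepsilon_i)}{\mathrm{Var}(\text{Selectivity}_i)}}_{\text{omitted variable bias} \, > \, 0} \tag{2}
\]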

The formula in (2) is the omitted variable bias (OVB) formula: OLS is biased upward because ability is omitted and ability is positively correlated with both selectivity and earnings.

Equation (2) reveals that OLS does identify something: it identifies β plus the bias term. But this is not the causal effect β alone. The causal effect is not identified from OLS in this setting.
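A short simulation can illustrate equation (2). It is only a sketch: the data-generating process, the coefficient 0.8 linking the error to selectivity, and the value β = 5 are all assumptions made for the example, not estimates from real data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta = 5.0                                    # hypothetical true causal effect

eps = rng.normal(size=n)                      # unobserved ability + background
selectivity = 0.8 * eps + rng.normal(size=n)  # abler students pick more selective schools
earnings = 10.0 + beta * selectivity + eps    # causal model (1)

# OLS slope from a simple regression of earnings on selectivity.
cov_matrix = np.cov(selectivity, earnings)
beta_ols = cov_matrix[0, 1] / cov_matrix[0, 0]

# Bias term from equation (2). It is computable here only because the
# simulation lets us see eps; a real researcher cannot observe it.
bias = np.cov(selectivity, eps)[0, 1] / cov_matrix[0, 0]

print(f"true beta    : {beta:.3f}")
print(f"OLS estimate : {beta_ols:.3f}")
print(f"beta + bias  : {beta + bias:.3f}")
```

In this setup the population bias is 0.8 / 1.64, or about 0.49, so the OLS estimate settles near 5.49 rather than 5 however large n becomes. Increasing the sample size tightens the estimate around the wrong number.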

4 How Do We Achieve Identification?

Identification of a causal effect requires one of the following:

(I) Randomisation: If treatment is randomly assigned, Corr(Selectivityᵢ, εᵢ) = 0 by construction, and OLS identifies the causal effect. This is why randomised experiments are the gold standard.

(II) A strong assumption about the error: If we assume that εᵢ is uncorrelated with treatment (in the extreme case εᵢ = 0, so that there are no unobservables at all), OLS identifies the effect, but this assumption is usually implausible.

(III) An instrument: A variable Zᵢ that affects treatment (Selectivityᵢ) but is unrelated to the error (εᵢ) can be used to isolate the variation in treatment that is orthogonal to confounders. This is instrumental variables (IV) identification.

(IV) A discontinuity or panel design: Regression discontinuity exploits the fact that units just above and below a threshold are comparable; difference-in-differences relies on the assumption that trends in untreated outcomes would have been similar across groups. Both provide settings where treatment variation is as-good-as-random conditional on specific design features.

Each of these strategies identifies the causal effect under different assumptions. The job of the econometrician is to argue, based on institutional knowledge and data, that those assumptions hold.
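Strategy (I) can be illustrated by comparing self-selected and randomly assigned treatment in simulated data. This is a minimal sketch under assumed values: the true effect of 2.0, the selection rule, and the helper `ols_slope` are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
beta = 2.0                                   # hypothetical true effect of treatment

u = rng.normal(size=n)                       # unobserved confounder

def ols_slope(d, y):
    """Slope from a simple regression of y on d."""
    c = np.cov(d, y)
    return c[0, 1] / c[0, 0]

# Self-selection: units with high u are more likely to take the treatment.
d_selected = (u + rng.normal(size=n) > 0).astype(float)
y_selected = beta * d_selected + u + rng.normal(size=n)

# Randomisation: a coin flip assigns treatment, independently of u.
d_random = rng.integers(0, 2, size=n).astype(float)
y_random = beta * d_random + u + rng.normal(size=n)

slope_selected = ols_slope(d_selected, y_selected)
slope_random = ols_slope(d_random, y_random)

print(f"true effect          : {beta:.3f}")
print(f"OLS, self-selected   : {slope_selected:.3f}")
print(f"OLS, randomised      : {slope_random:.3f}")
```

The same regression is run in both cases. Only the assignment mechanism changes, and with it whether the OLS slope lands near the true effect.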

5 A Visual Intuition

[Figure 1 diagram: nodes Instrument Z, Treatment D, Outcome Y, and Unobservable U. Arrows: Z → D (first stage), D → Y (causal effect β), U → D and U → Y (confounding), and a dashed Z → Y arrow marked as excluded (0).]

Figure 1: IV identification. The instrument Z affects treatment D (first stage) but has no direct effect on Y (exclusion restriction, shown as dashed). Unobservable U confounds D → Y. By using only the variation in D driven by Z, IV isolates the causal β.

The diagram in Figure 1 illustrates the IV strategy. The red arrows show confounding from U. OLS captures both the causal arrow (D → Y) and the confounding path (D ← U → Y). IV uses only the variation in D caused by Z. Provided the instrument is unrelated to U and has no direct effect on Y, this variation is independent of U, so it isolates the causal effect β.
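The logic of Figure 1 can be sketched in a simulation. The example assumes a linear model with a constant effect and a valid instrument by construction; the first-stage coefficient 0.7 and the true effect 2.0 are hypothetical. With a single instrument, the IV estimate is the ratio Cov(Z, Y) / Cov(Z, D).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
beta = 2.0                                # hypothetical causal effect of D on Y

u = rng.normal(size=n)                    # unobservable U, confounds D and Y
z = rng.normal(size=n)                    # instrument Z, drawn independently of U
d = 0.7 * z + u + rng.normal(size=n)      # first stage: Z shifts D
y = beta * d + u + rng.normal(size=n)     # exclusion: Z enters Y only through D

# OLS uses all variation in D, including the part driven by U.
beta_ols = np.cov(d, y)[0, 1] / np.var(d, ddof=1)

# IV uses only the variation in D that is driven by Z.
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]

print(f"true beta : {beta:.3f}")
print(f"OLS       : {beta_ols:.3f}")
print(f"IV        : {beta_iv:.3f}")
```

Here the instrument is valid because the simulation makes it so. In applied work, independence from U and the exclusion restriction have to be argued from institutional knowledge.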

6 Design-Based vs Model-Based Identification

There are two broad philosophies of identification in econometrics [Imbens, 2009]:

Model-based identification imposes parametric or functional form restrictions on the joint distribution of outcomes and treatments. For example, a structural model might assume that the relationship between treatment and outcome is log-linear, that errors are normally distributed, and that instruments enter additively. These assumptions can identify effects even from observational data, but they are only as credible as the model.

Design-based identification relies on features of the data-generating process (random assignment, institutional rules, policy thresholds) rather than functional form assumptions. RCTs, IV, RD, and DiD are all design-based strategies. Their identifying assumptions can (in principle) be probed empirically and rest on institutional facts rather than statistical conveniences.

The "credibility revolution" in econometrics [Angrist and Pischke, 2010] is largely a shift from model-based to design-based identification. The insight is that causal claims based on an institutional feature ("these firms were randomly audited") are more convincing than claims based on a distributional assumption ("errors are normally distributed with mean zero").

7 What Is Identified, for Whom?

An important subtlety: different identification strategies identify different versions of the causal effect, for different populations.

  • A randomised experiment identifies the average treatment effect (ATE): the average causal effect across all units in the experimental population.
  • An IV strategy identifies the local average treatment effect (LATE): the causal effect for the subgroup of units whose treatment status is affected by the instrument, known as the compliers [Imbens and Angrist, 1994].
  • An RD design identifies a local average treatment effect at the threshold: the causal effect for units at the treatment boundary.

Identification in the same data for the same question can yield different answers depending on the strategy, because different strategies recover effects for different subpopulations. This is not a bug; it is a feature that carries important policy implications. Understanding what population your identification strategy speaks to is as important as achieving identification in the first place.
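The gap between ATE and LATE can be seen in a simulated population with heterogeneous effects. This sketch assumes a randomly assigned binary instrument and no defiers; the group shares and the effect sizes (1, 3 and 5) are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000

# Hypothetical population: 50% compliers, 30% never-takers, 20% always-takers.
group = rng.choice(["complier", "never", "always"], size=n, p=[0.5, 0.3, 0.2])

# Individual treatment effects differ by group (illustrative values only).
effect = np.select(
    [group == "complier", group == "never", group == "always"],
    [1.0, 3.0, 5.0],
)

z = rng.integers(0, 2, size=n)            # randomly assigned binary instrument
# Always-takers take treatment regardless, never-takers never do,
# and compliers follow the instrument.
d = np.where(group == "always", 1, np.where(group == "complier", z, 0))
y = effect * d + rng.normal(size=n)

ate = effect.mean()                       # average effect over everyone
# Wald estimator: reduced form divided by first stage.
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())

print(f"ATE (all units)      : {ate:.3f}")
print(f"IV / Wald estimate   : {wald:.3f}")
print("Complier effect      : 1.000")
```

In this population the average effect across all units is about 2.4, while the IV estimate settles near the complier effect of 1. Both numbers are correct answers to different questions.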

8 Common Mistakes

(1) Confusing precision with identification. A very precise OLS estimate with narrow confidence intervals does not mean the causal effect is identified. It means the estimate of the (potentially biased) OLS coefficient is precise.

(2) Assuming "controlling for observables" identifies causal effects. Adding covariates to a regression removes omitted variable bias only if you control for all confounders. Partial control still leaves omitted variable bias for the uncontrolled confounders.

(3) Forgetting that identification assumptions are untestable. The parallel trends assumption in DiD, the exclusion restriction in IV, and the continuity assumption in RD are all untestable in a strict sense, because we can never directly observe counterfactual outcomes. We can provide supportive evidence (pre-trends, placebo tests, balance checks), but we cannot prove the assumptions hold.
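Mistake (2) can be illustrated with a simulation in which there are two confounders and only one is controlled for. The setup is an assumption made for the example: both confounders enter treatment and outcome with unit coefficients, the true effect is 1.0, and the helper `coef_on_d` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
beta = 1.0                                  # hypothetical true causal effect

ability = rng.normal(size=n)                # confounder the researcher observes
background = rng.normal(size=n)             # confounder the researcher does not observe
d = ability + background + rng.normal(size=n)
y = beta * d + ability + background + rng.normal(size=n)

def coef_on_d(controls):
    """OLS coefficient on d, with an intercept and the given control variables."""
    x = np.column_stack([d, np.ones(n), *controls])
    return np.linalg.lstsq(x, y, rcond=None)[0][0]

no_controls = coef_on_d([])
partial = coef_on_d([ability])              # controls for one confounder only
full = coef_on_d([ability, background])     # infeasible in practice: needs both

print(f"true beta             : {beta:.3f}")
print(f"no controls           : {no_controls:.3f}")
print(f"control for ability   : {partial:.3f}")
print(f"control for both      : {full:.3f}")
```

Controlling for ability shrinks the bias but does not remove it, because background still moves both treatment and outcome. Only the regression with both confounders recovers the true effect, and that regression is not available when background is unobserved.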

9 Where to Learn More

For an intuitive treatment of identification and the potential outcomes framework, see Angrist and Pischke [2009]. For a more formal treatment of the identification concept including partial identification, see Manski [1995]. For the distinction between design-based and model-based identification, see Imbens [2009].

References

  1. Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.
  2. Angrist, J. D. and Pischke, J.-S. (2010). The credibility revolution in empirical economics: How better research design is taking the con out of econometrics. Journal of Economic Perspectives, 24(2), 3-30.
  3. Imbens, G. W. and Angrist, J. D. (1994). Identification and estimation of local average treatment effects. Econometrica, 62(2), 467-475.
  4. Imbens, G. W. (2009). Better LATE than nothing: Some comments on Deaton (2009) and Heckman and Urzua (2009). Journal of Economic Literature, 48(2), 399-423.
  5. Manski, C. F. (1995). Identification Problems in the Social Sciences. Harvard University Press.
