1 The Problem: Selection Bias
Imagine you want to know whether attending a private university increases a student's future earnings, compared to attending a public university. You collect data on 1,000 students: some went private, some went public. You compare average earnings. Private-university graduates earn more.
Does that mean private universities cause higher earnings?
Not necessarily. Students who attend private universities tend to come from wealthier families, have higher test scores, and live in areas with better economic networks. These students might have earned more regardless of which type of university they attended. The observed difference in earnings reflects both the true causal effect of private universities and the pre-existing differences between the groups. This is called selection bias.
In a randomised experiment, random assignment eliminates selection bias: treated and control groups have the same pre-existing characteristics on average, so any difference in outcomes can be attributed to the treatment. But we cannot randomly assign students to private or public universities. What can we do?
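The selection-bias story can be made concrete with a short simulation (a sketch with invented numbers; the variable names and magnitudes are purely illustrative). Here the true causal effect of a private university is built in as exactly zero, yet the naive comparison of means still shows a large gap:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Unobserved "family background" drives both private attendance and earnings.
background = rng.normal(size=n)
private = (background + rng.normal(size=n) > 0.5).astype(int)

# The true causal effect of attending a private university is exactly zero:
# earnings depend on background only, not on `private`.
earnings = 30_000 + 8_000 * background + rng.normal(scale=5_000, size=n)

naive_gap = earnings[private == 1].mean() - earnings[private == 0].mean()
# naive_gap is large and positive even though the causal effect is zero.
```

The gap is pure selection bias: the private group simply has higher background on average.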
2 The Core Idea: Conditional Independence
Matching methods build on a key assumption called conditional independence (also called "selection on observables" or "ignorability"). The assumption is:
$(Y^{1},Y^{0})\perp\perp D|X$ (1)
This says: once we control for the observed covariates X (family income, test scores, demographics), treatment D is as good as randomly assigned. Put differently, conditional on observables, there is no remaining selection bias.
This assumption is untestable, since it concerns unobserved factors, but it is more plausible when X includes a rich set of relevant pre-treatment characteristics.
Under Equation 1, the average treatment effect on the treated (ATT) can be identified as:
$ATT=\mathbb{E}[Y^{1}-Y^{0}|D=1]=\mathbb{E}[\mathbb{E}[Y|D=1,X]-\mathbb{E}[Y|D=0,X]|D=1]$ (2)
The problem is that X may be high-dimensional: if we have 20 covariates, directly conditioning on all combinations is infeasible. This is where propensity scores and matching come in.
3 The Propensity Score
Rosenbaum and Rubin [1983] showed a remarkable result: if conditional independence holds given X, it also holds given just the propensity score:
$e(x)=Pr(D=1|X=x)$ (3)
the probability of treatment given observed covariates. Formally:
$(Y^{1},Y^{0})\perp\perp D|e(X)$ (4)
This reduces the problem to one dimension regardless of how many covariates are in X. Instead of matching on many covariates simultaneously, we can match on the single propensity score.
3.1 Estimating the Propensity Score
In practice, $e(x)$ is unknown and must be estimated. The standard approach is logistic regression:
$\hat{e}(x_{i})=\frac{\exp(\hat{\gamma}^{\prime}x_{i})}{1+\exp(\hat{\gamma}^{\prime}x_{i})}$ (5)
where $\hat{\gamma}$ is estimated by regressing $D_{i}$ on $x_{i}$. More flexible methods (random forests, gradient boosting) can also be used.
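A minimal sketch of Equation 5 in Python, using a hand-rolled Newton-Raphson fit rather than a statistics library (the data here are simulated purely for illustration):

```python
import numpy as np

def estimate_propensity(X, D, n_iter=25):
    """Propensity scores from a logistic regression of D on X (Equation 5),
    fitted by Newton-Raphson."""
    Xd = np.column_stack([np.ones(len(X)), X])  # add an intercept
    gamma = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ gamma))
        W = p * (1.0 - p)
        # Newton step: gamma += (X'WX)^{-1} X'(D - p)
        gamma += np.linalg.solve(Xd.T @ (W[:, None] * Xd), Xd.T @ (D - p))
    return 1.0 / (1.0 + np.exp(-Xd @ gamma))

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
D = (X @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=500) > 0).astype(int)
e_hat = estimate_propensity(X, D)
```

In applied work one would typically call a library routine (e.g. logistic regression in statsmodels or scikit-learn) instead of fitting by hand.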
3.2 Common Support
The propensity score approach requires common support: for every value of $X$, both treated and untreated units must be observed. Formally, $0<e(x)<1$ for all $x$. If some treated units have $\hat{e}(x)=0.99$ and no untreated units have similar scores, we cannot find valid comparisons for them. Such units should be dropped from the analysis.
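One simple trimming rule, sketched below, keeps only units whose estimated scores lie in the overlap of the treated and control score ranges (the toy scores are invented):

```python
import numpy as np

def common_support_mask(e_hat, D):
    """True for units inside the overlap of treated and control score ranges."""
    lo = max(e_hat[D == 1].min(), e_hat[D == 0].min())
    hi = min(e_hat[D == 1].max(), e_hat[D == 0].max())
    return (e_hat >= lo) & (e_hat <= hi)

e_hat = np.array([0.05, 0.20, 0.40, 0.60, 0.80, 0.99])
D     = np.array([0,    0,    1,    0,    1,    1])
keep = common_support_mask(e_hat, D)
# The overlap here is [0.40, 0.60], so only the units at 0.40 and 0.60 are kept.
```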
4 Matching Methods
Once the propensity score is estimated, several matching procedures are available:
4.1 Nearest-Neighbour Matching
For each treated unit i, find the untreated unit $j^{*}(i)$ with the closest propensity score:
$j^{*}(i)=\arg\min_{j:D_{j}=0}|\hat{e}(x_{i})-\hat{e}(x_{j})|$
The ATT is estimated as:
$\hat{ATT}_{NN}=\frac{1}{N_{1}}\sum_{i:D_{i}=1}[Y_{i}-Y_{j^{*}(i)}]$
where $N_{1}$ is the number of treated units. Variants include k-nearest-neighbours (where $k>1$ untreated units are matched to each treated unit) and matching with replacement (the same control unit can be used multiple times).
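The estimator above can be sketched in a few lines of Python (nearest-neighbour matching with replacement; the toy scores and outcomes are invented, with an effect of roughly 2 built in):

```python
import numpy as np

def att_nearest_neighbour(Y, D, e_hat):
    """One-to-one nearest-neighbour matching on the propensity score,
    with replacement: each treated unit gets its closest control."""
    treated = np.flatnonzero(D == 1)
    control = np.flatnonzero(D == 0)
    dist = np.abs(e_hat[treated, None] - e_hat[None, control])
    j_star = control[dist.argmin(axis=1)]  # closest control for each treated unit
    return np.mean(Y[treated] - Y[j_star])

e_hat = np.array([0.30, 0.31, 0.70, 0.69, 0.50, 0.52])
D     = np.array([1,    0,    1,    0,    1,    0])
Y     = np.array([5.0,  3.1,  9.0,  6.9,  7.0,  5.2])
att = att_nearest_neighbour(Y, D, e_hat)  # → about 1.93
```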
4.2 Caliper Matching
Nearest-neighbour matching can produce poor matches if the closest untreated unit is still far from the treated unit in propensity score space. A caliper imposes a maximum distance: matches beyond the caliper are discarded. A common caliper is 0.2 standard deviations of the propensity score [Austin, 2011].
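A sketch of the caliper variant, building on nearest-neighbour matching (the 0.2-standard-deviation default follows Austin [2011]; the toy data are invented):

```python
import numpy as np

def att_caliper(Y, D, e_hat, caliper_sd=0.2):
    """Nearest-neighbour matching that discards matches whose distance
    exceeds caliper_sd standard deviations of the propensity score."""
    caliper = caliper_sd * e_hat.std()
    treated = np.flatnonzero(D == 1)
    control = np.flatnonzero(D == 0)
    dist = np.abs(e_hat[treated, None] - e_hat[None, control])
    j_star = dist.argmin(axis=1)
    ok = dist[np.arange(len(treated)), j_star] <= caliper  # within-caliper only
    return np.mean(Y[treated[ok]] - Y[control[j_star[ok]]]), int(ok.sum())

e_hat = np.array([0.30, 0.32, 0.90, 0.35])
D     = np.array([1,    0,    1,    0])
Y     = np.array([5.0,  3.0,  9.0,  3.5])
att, n_matched = att_caliper(Y, D, e_hat)
# The treated unit at 0.90 has no control within the caliper and is dropped.
```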
4.3 Kernel Matching
Instead of a single nearest neighbour, kernel matching uses a weighted average of all untreated units, with weights declining with distance in propensity score space. This reduces variance but can increase bias from distant matches.
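A sketch with a Gaussian kernel (the bandwidth of 0.06 and the toy data are arbitrary choices for illustration):

```python
import numpy as np

def att_kernel(Y, D, e_hat, bandwidth=0.06):
    """Kernel matching: each treated unit is compared with a weighted average
    of all controls, weights declining with propensity-score distance."""
    treated = np.flatnonzero(D == 1)
    control = np.flatnonzero(D == 0)
    dist = e_hat[treated, None] - e_hat[None, control]
    w = np.exp(-0.5 * (dist / bandwidth) ** 2)      # Gaussian kernel weights
    counterfactual = (w @ Y[control]) / w.sum(axis=1)
    return np.mean(Y[treated] - counterfactual)

e_hat = np.array([0.30, 0.60, 0.35, 0.55])
D     = np.array([1,    1,    0,    0])
Y     = np.array([5.0,  6.0,  3.0,  3.0])
att = att_kernel(Y, D, e_hat)  # both controls have Y = 3, so ATT = 2.5 here
```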
5 Checking Balance
Matching is only useful if it achieves balance: the distribution of covariates should be similar between matched treated and control units. The standard diagnostic is the standardised mean difference (SMD) for each covariate before and after matching:
$SMD=\frac{\overline{X}_{treated}-\overline{X}_{control}}{\sqrt{(s_{treated}^{2}+s_{control}^{2})/2}}$ (6)
An SMD below 0.1 indicates good balance. A "Love plot" visualising the SMD for all covariates before and after matching is standard practice.
Figure 1: Illustrative Love plot. Before matching (blue), standardised mean differences are large. After propensity score matching (red), balance is achieved.
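Equation 6 translates directly into code; a minimal sketch (using sample variances, i.e. ddof=1, which is one common convention):

```python
import numpy as np

def smd(x, D):
    """Standardised mean difference (Equation 6) for a single covariate."""
    xt, xc = x[D == 1], x[D == 0]
    pooled_sd = np.sqrt((xt.var(ddof=1) + xc.var(ddof=1)) / 2)
    return (xt.mean() - xc.mean()) / pooled_sd

x = np.array([1.0, 3.0, 0.0, 2.0])
D = np.array([1,   1,   0,   0])
balance = smd(x, D)  # → 1/sqrt(2) ≈ 0.707, well above the 0.1 threshold
```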
6 Inverse Probability Weighting (IPW)
A related approach is inverse probability weighting: instead of matching, reweight observations using their propensity scores. For the ATT, each control unit receives weight $\hat{e}(x_{i})/(1-\hat{e}(x_{i}))$:
$\hat{ATT}_{IPW}=\frac{\sum_{i}D_{i}Y_{i}}{\sum_{i}D_{i}}-\frac{\sum_{i}\frac{\hat{e}(x_{i})}{1-\hat{e}(x_{i})}(1-D_{i})Y_{i}}{\sum_{i}\frac{\hat{e}(x_{i})}{1-\hat{e}(x_{i})}(1-D_{i})}$ (7)
IPW creates a "pseudo-population" in which the treatment is unconfounded. It is more efficient than matching but sensitive to extreme propensity scores (weights become very large when $\hat{e}(x_{i})$ is close to 1).
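Equation 7 in code (a sketch on invented data; in practice one would trim or stabilise extreme weights first):

```python
import numpy as np

def att_ipw(Y, D, e_hat):
    """ATT by inverse probability weighting (Equation 7): control units are
    reweighted by e/(1-e) so they resemble the treated group."""
    w = e_hat / (1.0 - e_hat)
    treated_mean = Y[D == 1].mean()
    control_mean = np.sum((w * Y)[D == 0]) / np.sum(w[D == 0])
    return treated_mean - control_mean

Y     = np.array([5.0,  7.0,  2.0,  4.0])
D     = np.array([1,    1,    0,    0])
e_hat = np.array([0.50, 0.50, 0.50, 0.25])
att = att_ipw(Y, D, e_hat)  # → 6.0 - 2.5 = 3.5
```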
7 Common Mistakes
- Matching on post-treatment variables: Only include pre-treatment covariates in X. Including variables affected by treatment introduces bias.
- Not checking balance: Always verify that matching achieved balance on the original covariates, not just the propensity score.
- Ignoring common support: Treated units with no comparable controls should be dropped, not imputed.
- Mistaking matching for randomisation: Matching controls for observed confounders only. If important unobserved confounders exist, matching does not solve the problem.
8 Where to Learn More
- Rosenbaum and Rubin [1983] is the foundational reference.
- Imbens [2015] provides a comprehensive graduate-level treatment.
- The MatchIt package in R implements all major matching methods with balance diagnostics.
References
- Rosenbaum, P.R. and Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41-55.
- Imbens, G.W. (2015). Matching methods in practice: Three examples. Journal of Human Resources, 50(2):373-419.
- Austin, P.C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3):399-424.
- Imbens, G.W. and Rubin, D.B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press.