The Basic Idea
Suppose the government introduces a new school lunch programme in some districts (the "treated" group) but not others (the "control" group). You want to know whether the programme improved student test scores.
A naive approach compares test scores in treated and control districts after the programme. But treated districts might have started with higher (or lower) scores. This comparison is contaminated by pre-existing differences between groups.
A better approach: compare test scores before and after the programme in treated districts. But maybe scores were improving everywhere over this period, regardless of the programme. This simple before-after comparison is contaminated by time trends.
Difference-in-differences solves both problems at once. The idea is:
- Look at how scores changed over time in the treated group.
- Look at how scores changed over time in the control group.
- The DiD estimate is the difference between these two changes.
By taking the difference of two differences, we remove both the pre-existing level difference between groups and the common time trend. What remains is (under the key assumption) the causal effect of the programme.
A Worked Numerical Example
Setup
Suppose we observe average test scores (out of 100) in two sets of school districts in two years:
| Group | Before (2019) | After (2021) | Change |
|---|---|---|---|
| Treated districts | 64 | 71 | \(+7\) |
| Control districts | 72 | 75 | \(+3\) |
| Difference-in-differences | | | \(7 - 3 = \mathbf{+4}\) |
Interpretation
- Treated districts improved by 7 points between 2019 and 2021.
- Control districts improved by 3 points over the same period.
- The DiD estimate is \(7 - 3 = 4\) points.
Our estimate is that the school lunch programme caused a 4-point improvement in test scores. Here is the logic:
- The 3-point increase in control districts captures the "background trend" — the improvement that would have occurred in treated districts too, in the absence of the programme.
- After removing this background trend, the 4-point excess improvement in treated districts is attributed to the programme.
Formally, the DiD estimator is: \[ \widehat{ATT} = (\bar{Y}_{T,\text{post}} - \bar{Y}_{T,\text{pre}}) - (\bar{Y}_{C,\text{post}} - \bar{Y}_{C,\text{pre}}) = (71-64) - (75-72) = 7 - 3 = 4 \]
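The arithmetic above can be checked in a few lines of Python. This is a minimal sketch using the worked example's cell means, not real data:

```python
# Cell means from the worked example (average test scores out of 100)
treated_pre, treated_post = 64, 71
control_pre, control_post = 72, 75

# DiD: the difference between the two before-after changes
treated_change = treated_post - treated_pre   # +7
control_change = control_post - control_pre   # +3
did = treated_change - control_change         # +4

print(did)  # → 4
```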
Why Not Just Compare After-Periods?
In 2021, treated districts score 71 and control districts score 75. The simple after-period comparison gives \(71 - 75 = -4\), suggesting the programme lowered scores! This is because treated districts started with lower scores (64 vs. 72). The after-period comparison confounds the programme effect with the pre-existing gap.
Why Not Just Look at the Before-After Change for Treated Districts?
The 7-point increase in treated districts includes both the programme effect and any general improvement over this period (perhaps due to teacher training, economic growth, or other factors that affected all districts). The control group tells us that 3 points of improvement would have occurred anyway. DiD removes this common trend.
The Parallel Trends Assumption
The DiD estimate is only valid if the "parallel trends" assumption holds. This assumption states:
In the absence of the programme, test scores in treated districts would have followed the same trend as test scores in control districts.
In our example: if the programme had not been introduced, treated districts would have improved by 3 points (like the control group), not by 7 points. The remaining 4 points are caused by the programme.
This is an assumption — it cannot be directly tested, because we never observe what treated districts' trend would have been without the programme. But we can provide supporting evidence by checking whether the two groups were on parallel trends before the programme. If test scores in treated and control districts were moving in parallel in the years leading up to the programme, this gives us more confidence that they would have continued in parallel after the programme in the absence of treatment.
DiD as a Regression
The DiD estimator can be implemented as a linear regression. Define:
- \(\text{Treated}_i = 1\) if unit \(i\) is in the treated group, 0 otherwise.
- \(\text{Post}_t = 1\) if time period \(t\) is after the treatment, 0 otherwise.
- \(\text{Treated}_i \times \text{Post}_t\): the interaction term (1 only for treated units in the post-period).
The regression is: \[ Y_{it} = \beta_0 + \beta_1 \text{Treated}_i + \beta_2 \text{Post}_t + \beta_3 (\text{Treated}_i \times \text{Post}_t) + \varepsilon_{it} \]
The coefficient \(\beta_3\) is the DiD estimator.
Interpreting the Coefficients
- \(\beta_0\): average outcome for control group in pre-period = 72
- \(\beta_1\): level difference between treated and control groups in pre-period = \(64 - 72 = -8\)
- \(\beta_2\): time trend for control group = \(75 - 72 = 3\)
- \(\beta_3\): DiD = excess trend for treated group = \((71-64) - (75-72) = 4\)
Let us verify: the model predicts the treated group's post-period score as \(72 + (-8) + 3 + 4 = 71\). Correct.
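The mapping from cell means to coefficients can be verified numerically. Because the regression is saturated (four parameters, four cells), least squares on the cell means recovers the coefficients exactly; here the system is solved with NumPy rather than a regression package, and the data are the worked example's cell means, not real observations:

```python
import numpy as np

# One row per (group, period) cell: [1, Treated, Post, Treated*Post]
X = np.array([
    [1, 0, 0, 0],  # control, pre
    [1, 0, 1, 0],  # control, post
    [1, 1, 0, 0],  # treated, pre
    [1, 1, 1, 1],  # treated, post
], dtype=float)
y = np.array([72.0, 75.0, 64.0, 71.0])  # cell means from the table

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # ≈ [72, -8, 3, 4]: beta_0, beta_1, beta_2, beta_3
```

With individual-level data, the same coefficients would come from an OLS regression of \(Y_{it}\) on the three dummy variables.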
Testing the Parallel Trends Assumption
If we have data on multiple pre-treatment periods, we can test whether the treated and control groups were on parallel trends before the treatment. We run an event-study regression: \[ Y_{it} = \alpha_i + \lambda_t + \sum_{k \neq -1} \delta_k \cdot \mathbf{1}\{t - g_i = k\} + \varepsilon_{it} \] where \(g_i\) is the treatment date and \(k\) is the number of periods relative to treatment. The coefficients \(\delta_k\) for \(k < 0\) are "pre-trend" coefficients. If parallel trends holds, these should be close to zero. A plot of all \(\delta_k\) coefficients against \(k\) is called an event-study plot.
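With only two groups and a common treatment date, each \(\delta_k\) reduces to a double difference relative to the omitted period \(k = -1\): \(\hat{\delta}_k = (\bar{Y}_{T,k} - \bar{Y}_{C,k}) - (\bar{Y}_{T,-1} - \bar{Y}_{C,-1})\). A sketch with hypothetical group means, chosen so that the pre-trends are flat:

```python
# Hypothetical mean scores by event time k (periods relative to treatment)
event_times = [-3, -2, -1, 0, 1]
treated_means = [60.0, 62.0, 64.0, 71.0, 73.0]
control_means = [68.0, 70.0, 72.0, 75.0, 77.0]

# Treated-control gap in the omitted baseline period k = -1
base_gap = treated_means[2] - control_means[2]  # 64 - 72 = -8

# delta_k: change in the treated-control gap relative to the baseline
deltas = {
    k: (t - c) - base_gap
    for k, t, c in zip(event_times, treated_means, control_means)
    if k != -1
}
print(deltas)  # → {-3: 0.0, -2: 0.0, 0: 4.0, 1: 4.0}
```

Here the pre-period coefficients are exactly zero (supporting parallel trends) and the post-period coefficients show a 4-point effect. With many units, each \(\delta_k\) would instead be estimated by OLS with unit and time fixed effects, and plotted with confidence intervals.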
Rambachan and Roth (2023) formalise this logic: even if pre-trend coefficients are small, post-treatment estimates can be sensitive to violations of parallel trends. They propose confidence intervals that are valid even if trends diverge by a bounded amount after treatment.
Common Pitfalls
Selecting control groups based on outcomes. If you choose control groups because they look similar to treated groups in the post-period, you have introduced bias. Control groups should be chosen based on pre-treatment characteristics and prior trends.
Violation of SUTVA. The DiD framework assumes that one unit's treatment does not affect another unit's outcomes (the "stable unit treatment value assumption"). If the lunch programme in treated schools draws students away from control schools, control schools' scores might fall, biasing the estimate.
Heterogeneous timing. If different treated units are treated at different times ("staggered adoption"), the simple 2-period DiD can be misleading. The Callaway–Sant'Anna estimator (Callaway and Sant'Anna, 2021) is designed for this case.
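The building block of the Callaway–Sant'Anna approach is the group-time effect \(ATT(g,t)\): a 2-by-2 DiD comparing the cohort first treated at time \(g\) to never-treated (or not-yet-treated) units, baselined at period \(g-1\). A minimal sketch with hypothetical cohort means:

```python
# Hypothetical mean outcomes by period for two treated cohorts
# and a never-treated comparison group
cohort_2 = {1: 50.0, 2: 55.0, 3: 57.0, 4: 59.0}  # first treated in period 2
cohort_3 = {1: 48.0, 2: 50.0, 3: 56.0, 4: 58.0}  # first treated in period 3
never    = {1: 52.0, 2: 54.0, 3: 56.0, 4: 58.0}  # never treated

def att(group_means, g, t, control_means):
    """2x2 DiD for the cohort treated at g, evaluated at period t,
    baselined at the last pre-treatment period g - 1."""
    return ((group_means[t] - group_means[g - 1])
            - (control_means[t] - control_means[g - 1]))

print(att(cohort_2, 2, 2, never))  # → 3.0
print(att(cohort_3, 3, 4, never))  # → 4.0
```

The full estimator aggregates these \(ATT(g,t)\) building blocks across cohorts and periods, avoiding the "forbidden comparisons" (already-treated units used as controls) that bias the two-way fixed effects regression under staggered adoption.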
Small samples. DiD is often applied to aggregate data (e.g., states or districts). With few clusters, conventional standard errors are unreliable, and permutation-based inference may be needed (Angrist and Pischke, 2009).
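A permutation test reassigns the treatment label across districts and recomputes the DiD statistic for every reassignment; the p-value is the share of reassignments at least as extreme as the observed statistic. A sketch with hypothetical district-level before-after changes:

```python
from itertools import combinations

# Hypothetical before-after score changes for 8 districts;
# districts 0-3 are the actual treated group
changes = [6.5, 7.0, 7.5, 7.0, 3.0, 2.5, 3.5, 3.0]
treated_idx = {0, 1, 2, 3}

def did_stat(treated):
    """DiD statistic: mean change of 'treated' minus mean change of the rest."""
    t = [changes[i] for i in range(len(changes)) if i in treated]
    c = [changes[i] for i in range(len(changes)) if i not in treated]
    return sum(t) / len(t) - sum(c) / len(c)

observed = did_stat(treated_idx)  # 7.0 - 3.0 = 4.0

# Enumerate all C(8, 4) = 70 ways to label 4 of 8 districts as treated
stats = [did_stat(set(combo)) for combo in combinations(range(8), 4)]
p_value = sum(abs(s) >= abs(observed) for s in stats) / len(stats)
print(observed, p_value)  # → 4.0, 2/70 ≈ 0.029
```

With more districts, exact enumeration becomes infeasible and a random sample of permutations is used instead.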
Conclusion
Difference-in-differences is a powerful and intuitive method for estimating causal effects. Its logic — remove common time trends by differencing, remove permanent group differences by differencing again — makes it one of the cleanest quasi-experimental designs available. Its main limitation is the parallel trends assumption, which is untestable but can be supported by pre-period evidence and sensitivity analysis. When the assumption is plausible and the comparison group well-chosen, DiD provides credible causal evidence from non-experimental data.
References
- Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, Princeton, NJ.
- Callaway, B. and Sant'Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2):200--230.
- Card, D. and Krueger, A. B. (1994). Minimum wages and employment: A case study of the fast-food industry in New Jersey and Pennsylvania. American Economic Review, 84(4):772--793.
- Imbens, G. W. and Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences. Cambridge University Press, Cambridge.
- Rambachan, A. and Roth, J. (2023). A more credible approach to parallel trends. Review of Economic Studies, 90(5):2555--2591.
- Roth, J., Sant'Anna, P. H. C., Bilinski, A., and Poe, J. (2023). What's trending in difference-in-differences? A synthesis of the recent econometrics literature. Journal of Econometrics, 235(2):2218--2244.