Introduction
In 1994, David Card and Alan Krueger published a study of the New Jersey minimum wage increase that would overturn a generation of received wisdom (Card and Krueger 1994). Their method was deceptively simple: compare fast-food employment in New Jersey before and after the 1992 minimum wage rise to employment in neighbouring Pennsylvania, which served as a control. No structural model, no untested equilibrium conditions: just a comparison of differences. The paper was controversial precisely because it was so legible. Readers could see exactly what assumptions were required, and they could argue about whether those assumptions held. This legibility (the ability to point to the identifying variation, name the required assumptions, and invite criticism) is the hallmark of what Angrist and Pischke would later call the credibility revolution (Angrist and Pischke 2010).
This article tells the story of that revolution: where it came from, what it achieved, and where the frontier lies today.
The Pre-Revolutionary Landscape
To appreciate what changed, one must understand what came before. Through the 1970s and much of the 1980s, the dominant mode of empirical macroeconomics and much of microeconomics was the simultaneous equations model. Researchers specified systems of equations derived from economic theory, estimated them by instrumental variables or full-information maximum likelihood, and used the results to conduct policy analysis. The Cowles Commission programme, associated with Haavelmo, Koopmans, and Klein, provided the theoretical framework (Angrist and Pischke 2009).
The problem was that identification in these models typically rested on exclusion restrictions (the assumption that some variable entered one equation but not another) that were derived from theory rather than from any observable feature of the data. As Sims (1980) argued forcefully, these restrictions were "incredible." The reader had no way to verify whether an exclusion restriction was satisfied, because the restriction concerned counterfactual behaviour for which no observable benchmark existed.
By the late 1980s, a parallel tradition had been developing in statistics and biometrics, centred on the potential outcomes (or Rubin causal model) framework. Rubin (1974) formalised the idea, drawing on earlier work by Neyman, that a causal effect should be defined as a comparison between two potential outcomes for the same unit: \(Y_i(1)\), the outcome that would obtain if unit \(i\) were treated, and \(Y_i(0)\), the outcome that would obtain if it were not. The fundamental problem of causal inference is that only one of these is ever observed. Identification, in this framework, is the problem of recovering features of the distribution of \(Y_i(1) - Y_i(0)\) from observed data.
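To make the notation concrete, the following sketch (Python with numpy; all numbers are invented for illustration) generates both potential outcomes for each unit, reveals only one of them, and shows that under randomisation a simple difference in means recovers the average of \(Y_i(1) - Y_i(0)\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical potential outcomes: each unit carries both Y(0) and Y(1).
y0 = rng.normal(loc=0.0, scale=1.0, size=n)        # Y_i(0)
y1 = y0 + rng.normal(loc=2.0, scale=1.0, size=n)   # Y_i(1); unit-level effects vary

# The fundamental problem: treatment reveals only one potential outcome per unit.
d = rng.integers(0, 2, size=n)                     # randomised treatment
y_obs = np.where(d == 1, y1, y0)

# Randomisation makes the two groups comparable, so the difference in
# observed means estimates the average of Y(1) - Y(0).
print("true ATE:     ", (y1 - y0).mean())          # ~2.0
print("estimated ATE:", y_obs[d == 1].mean() - y_obs[d == 0].mean())
```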
The Rise of Natural Experiments
The key insight that sparked the credibility revolution was that nature and policy occasionally conduct experiments for us. A minimum wage law takes effect in one state but not another. A lottery randomly assigns children to schools. A quirk in administrative rules creates an age cutoff that determines programme eligibility. These "natural experiments" provide variation in treatment that is, at least arguably, as good as random — or at least uncorrelated with the potential outcomes, conditional on observable covariates.
The practical toolkit for exploiting natural experiments crystallised around four core methods, memorably surveyed by Angrist and Pischke (2009): randomised controlled trials (RCTs), instrumental variables (IV), regression discontinuity designs (RDD), and difference-in-differences (DiD). Each method identifies a different estimand under a different set of assumptions, and each has a distinct visual diagnostic that allows the reader to assess credibility informally.
Instrumental Variables
The IV estimator, dating to the work of Working (1927) and formalised in the simultaneous-equations literature, was rehabilitated in the credibility era by Angrist et al. (1996), who showed that under a set of conditions now called the LATE (local average treatment effect) theorem, IV identifies the average treatment effect for "compliers", units whose treatment status is changed by the instrument. This was both a clarification and a caution: IV estimates a well-defined causal quantity, but it may not be the quantity of most policy relevance.
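The complier logic can be made concrete with a small simulation (a hypothetical population of compliers, always-takers, and never-takers; the shares and effect sizes are illustrative assumptions). The Wald ratio of reduced form to first stage recovers the complier effect, not the population average:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical principal strata: compliers take treatment iff the instrument is on.
types = rng.choice(["complier", "always", "never"], size=n, p=[0.5, 0.25, 0.25])
z = rng.integers(0, 2, size=n)                      # binary instrument
d = np.where(types == "always", 1, np.where(types == "never", 0, z))

# Heterogeneous effects: always-takers benefit more than compliers.
tau = np.where(types == "complier", 1.0, 3.0)
y = rng.normal(size=n) + tau * d

# Wald / IV estimator: effect of Z on Y divided by effect of Z on D.
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
print("IV (Wald) estimate:", wald)                             # ~1.0
print("complier ATE:      ", tau[types == "complier"].mean())  # 1.0, the LATE
```

In this invented population the average effect across all units is 2.0, twice the LATE of 1.0: exactly the gap between what IV identifies and what a policymaker treating everyone might want to know.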
Regression Discontinuity
The RDD, introduced by Thistlethwaite and Campbell (1960) but largely ignored for decades, was revived and formalised by Hahn et al. (2001) and Imbens and Lemieux (2008). The idea is that when treatment assignment is a deterministic function of a running variable crossing a threshold, units just above and just below the threshold are locally comparable. The estimated discontinuity in the outcome at the threshold identifies the average treatment effect for units at the margin.
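As a concrete sketch of that local comparison (simulated data; the bandwidth and the linear specification are arbitrary illustrative choices), a local linear fit on each side of the cutoff recovers the jump:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Hypothetical sharp RDD: treatment switches on deterministically at x = 0.
x = rng.uniform(-1, 1, size=n)                          # running variable
d = (x >= 0).astype(float)
y = 0.5 * x + 2.0 * d + rng.normal(scale=0.5, size=n)   # true jump of 2 at the cutoff

# Local linear regression within a bandwidth h on each side of the threshold.
h = 0.1
left = (x < 0) & (x > -h)
right = (x >= 0) & (x < h)
b_left = np.polyfit(x[left], y[left], deg=1)            # returns [slope, intercept]
b_right = np.polyfit(x[right], y[right], deg=1)

# The RD estimate is the gap between the two fitted values at the cutoff.
print("RD estimate:", np.polyval(b_right, 0.0) - np.polyval(b_left, 0.0))  # ~2.0
```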
Difference-in-Differences
DiD is perhaps the most widely used quasi-experimental design in applied economics. Its logic is equally simple: compare the change in outcomes for a treated group before and after treatment to the change for an untreated comparison group. Under the "parallel trends" assumption (that in the absence of treatment the two groups would have followed the same trajectory), the DiD estimator identifies the average treatment effect on the treated (ATT). Card and Krueger (1994) provided the canonical application; Angrist and Pischke (2009) provided the canonical textbook treatment.
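The 2x2 case reduces to four group means, as in this stylised sketch with invented numbers (think of the treated group as New Jersey and the comparison group as Pennsylvania):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

treated = rng.integers(0, 2, size=n)   # hypothetical group indicator
# Both groups share a common trend of +1.0; the treated group has a permanent
# level difference and, in the post period, a treatment effect of 0.5.
y_pre = 1.0 * treated + rng.normal(size=n)
y_post = y_pre + 1.0 + 0.5 * treated + rng.normal(scale=0.1, size=n)

# Differencing twice removes both the level gap and the common trend.
diff_treated = y_post[treated == 1].mean() - y_pre[treated == 1].mean()
diff_control = y_post[treated == 0].mean() - y_pre[treated == 0].mean()
print("DiD estimate of the ATT:", diff_treated - diff_control)  # ~0.5
```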
The Maturing of the Revolution
By the mid-2000s, the credibility revolution had largely won the methodological debate in empirical microeconomics. Journal editors and referees routinely demanded identification strategies. The era's achievements were substantial: clean evidence on the effects of class size, minimum wages, immigration, education, incarceration, and scores of other policy-relevant questions. But success bred new problems.
The TWFE Problem
The workhorse implementation of DiD, the two-way fixed effects (TWFE) estimator, had been applied to settings with staggered treatment adoption (where different units are treated at different times) without careful thought about what it actually estimated. A sequence of theoretical papers beginning around 2018 showed that the TWFE estimator is, in general, a weighted average of all possible two-by-two DiD comparisons, with weights that can be negative when treatment effects are heterogeneous (Goodman-Bacon 2021). de Chaisemartin and D'Haultfœuille (2020) showed that the estimator can even produce the wrong sign if heterogeneity is severe enough. Callaway and Sant'Anna (2021) proposed an alternative estimator based on group-time average treatment effects, and Sun and Abraham (2021) proposed an interaction-weighted estimator that is robust to heterogeneity. The practical lesson was that a tool used by thousands of researchers for decades contained a latent flaw that had gone unnoticed.
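A small simulation conveys the mechanics (a balanced panel with two invented adoption cohorts and effects that grow with time since treatment; all numbers are illustrative). Because early adopters, whose effects are still growing, serve as controls for later adopters, the TWFE coefficient drifts away from the true ATT:

```python
import numpy as np

rng = np.random.default_rng(4)
n_units, n_periods = 400, 10
unit = np.repeat(np.arange(n_units), n_periods)
time = np.tile(np.arange(n_periods), n_units)

# Staggered adoption: half the units treated at t=3, the other half at t=6.
adopt = np.where(np.arange(n_units) < n_units // 2, 3, 6)[unit]
d = (time >= adopt).astype(float)
tau = np.maximum(time - adopt + 1, 0)       # dynamic effects: 1, 2, 3, ... after adoption
y = rng.normal(size=unit.size) + 0.5 * time + tau * d

def two_way_demean(v):
    # Within transformation for a balanced panel: strip unit and time means.
    m = v.reshape(n_units, n_periods)
    return v - m.mean(axis=1)[unit] - m.mean(axis=0)[time] + v.mean()

y_dm, d_dm = two_way_demean(y), two_way_demean(d)
twfe = (d_dm @ y_dm) / (d_dm @ d_dm)        # OLS slope on the demeaned data

print("TWFE estimate:", twfe)               # distorted by the bad comparisons
print("true ATT:     ", tau[d == 1].mean())
```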
Heterogeneous Treatment Effects and Machine Learning
A second frontier opened with a different question: not just what the average treatment effect is, but for whom treatment works. Chernozhukov et al. (2018) proposed the Double Machine Learning (DML) framework, which uses cross-fitting and Neyman-orthogonal moment conditions to estimate treatment effects while using flexible machine-learning methods to control for high-dimensional covariates. The key insight is that the nuisance functions (the conditional means of the outcome and of the treatment given the controls) can be estimated with any consistent learner without the learner's bias contaminating the treatment effect estimate, provided the learner converges fast enough and cross-fitting is used.
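A compact sketch of the cross-fitting recipe for the partially linear model \(Y = \theta D + g(X) + \varepsilon\) follows, using scikit-learn random forests as the nuisance learners; the learner choice is interchangeable, and the data-generating process is invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(5)
n, p, theta = 2_000, 10, 1.0

# Hypothetical partially linear model: Y = theta*D + g(X) + noise.
X = rng.normal(size=(n, p))
g = np.sin(X[:, 0]) + X[:, 1] ** 2          # nonlinear nuisance in the outcome
m = 0.5 * np.tanh(X[:, 0])                  # nonlinear nuisance in the treatment
d = m + rng.normal(scale=0.5, size=n)
y = theta * d + g + rng.normal(size=n)

# Cross-fitting: predict each nuisance out-of-fold, then estimate theta by a
# residual-on-residual regression (the Neyman-orthogonal partialling-out score).
y_res, d_res = np.empty(n), np.empty(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    f_y = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[train], y[train])
    f_d = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[train], d[train])
    y_res[test] = y[test] - f_y.predict(X[test])
    d_res[test] = d[test] - f_d.predict(X[test])

theta_hat = (d_res @ y_res) / (d_res @ d_res)
print("DML estimate of theta:", theta_hat)  # close to the true value of 1.0
```

Orthogonality is what does the work here: first-order errors in the random-forest fits enter \(\hat\theta\) only through a product of two small terms, which is why a naive plug-in would be biased while the partialled-out estimate is not.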
Sensitivity Analysis
A third development has been the formalisation of sensitivity analysis. The parallel trends assumption in DiD is fundamentally untestable: it makes a claim about the counterfactual path of the treated group, which is never observed. For many years, practitioners relied on pre-trend tests as informal proxies. Rambachan and Roth (2023) formalised a framework for "honest" sensitivity analysis in which the researcher specifies, explicitly, how much the parallel trends assumption might be violated, and reports confidence intervals that are valid under that class of violations. This shifts the burden of proof constructively: rather than claiming that an assumption holds, the researcher shows how bad the violation would have to be to overturn the conclusion.
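The reporting style can be illustrated with a deliberately simplified bounded-bias calculation. This is not the Rambachan-Roth procedure itself (their confidence sets also account for estimation error in the pre-treatment coefficients); it is only a sketch of the idea of reporting conclusions as a function of an assumed maximum violation \(M\). All numbers below are hypothetical:

```python
from scipy.stats import norm

# Hypothetical DiD results: point estimate and standard error.
tau_hat, se = 0.50, 0.15

# Simplified sensitivity exercise: suppose parallel trends may fail by at most
# M in the post period, so the bias lies in [-M, M] and the identified set is
# [tau_hat - M, tau_hat + M]; widen it further for sampling uncertainty.
z = norm.ppf(0.975)
for M in (0.0, 0.1, 0.2, 0.3, 0.4):
    lo, hi = tau_hat - M - z * se, tau_hat + M + z * se
    verdict = "excludes zero" if lo > 0 or hi < 0 else "includes zero"
    print(f"M = {M:.1f}: interval [{lo:.2f}, {hi:.2f}] ({verdict})")

# The breakdown value: the smallest violation that overturns the conclusion.
print("breakdown M:", tau_hat - z * se)
```

The breakdown value is the quantity a referee can argue about: if a violation of that magnitude is implausible given the pre-trends, the conclusion stands.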
The Current Frontier
The credibility revolution of the 1990s was, at its core, about imposing discipline on identification. The current moment is about three further disciplines: discipline over what is estimated (heterogeneous effects vs.\ averages), discipline over what is assumed (honest sensitivity analysis), and discipline over what controls are used (high-dimensional inference). These three threads are beginning to be woven together: researchers now routinely combine DML-style nuisance estimation with DiD-style parallel trends arguments, and apply Rambachan–Roth sensitivity analysis to report results that are honest about the fragility of the identifying assumption.
Whether this constitutes a second revolution, or merely the maturation of the first, is a matter of perspective. What is clear is that the fundamental commitment — to state assumptions explicitly, to use variation that is as close to experimental as possible, and to invite scrutiny — remains the defining characteristic of serious empirical work.
Conclusion
The credibility revolution transformed empirical economics from a discipline in which identification was implicitly assumed inside black-box structural models into one in which identification is the central object of inquiry. The toolkit it bequeathed — IV, RDD, DiD, RCT — is now standard across the social sciences. The current generation of methodological work is deepening that toolkit by confronting heterogeneity, high dimensionality, and the honest acknowledgement of assumption violations. Thirty years on, the revolution's core insight — that you must be able to see your identifying variation — remains as important as ever.
References
- Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, Princeton, NJ.
- Angrist, J. D. and Pischke, J.-S. (2010). The credibility revolution in empirical economics: How better research design is taking the con out of econometrics. Journal of Economic Perspectives, 24(2):3--30.
- Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444--455.
- Callaway, B. and Sant'Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2):200--230.
- Card, D. and Krueger, A. B. (1994). Minimum wages and employment: A case study of the fast-food industry in New Jersey and Pennsylvania. American Economic Review, 84(4):772--793.
- Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1--C68.
- de Chaisemartin, C. and D'Haultfœuille, X. (2020). Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review, 110(9):2964--2996.
- Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2):254--277.
- Hahn, J., Todd, P., and van der Klaauw, W. (2001). Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica, 69(1):201--209.
- Imbens, G. W. and Lemieux, T. (2008). Regression discontinuity designs: A guide to practice. Journal of Econometrics, 142(2):615--635.
- Rambachan, A. and Roth, J. (2023). A more credible approach to parallel trends. Review of Economic Studies, 90(5):2555--2591.
- Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688--701.
- Sims, C. A. (1980). Macroeconomics and reality. Econometrica, 48(1):1--48.
- Sun, L. and Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225(2):175--199.