Beginner's Corner

Good Controls and Bad Controls: When Does Adding a Variable Hurt?

1 The Naive View and Why It Fails

Most students learn a simple rule in their first econometrics course: to estimate the causal effect of X on Y, control for all variables that are related to both X and Y. More controls, the reasoning goes, mean less omitted variable bias and better causal identification. This rule is dangerously wrong.

Adding a control variable can introduce bias where none existed, transform an unbiased estimate into a severely biased one, or block the very causal pathway you are trying to measure. Understanding when controls help and when they hurt requires thinking in terms of the causal graph underlying the data-generating process [Pearl, 2009].

This article, drawing on Cinelli et al. [2022], explains the four main types of bad controls and provides a decision rule for applied researchers.

2 A Quick Reminder: What Controls Are For

A variable Z is a good control (a confounder) if it satisfies three conditions:

  1. It affects the treatment X.
  2. It affects the outcome Y.
  3. It is not on the causal path from X to Y.

The classic example is ability in the returns-to-education regression. Ability affects both years of schooling (treatment) and wages (outcome), but ability is not caused by education; it is a common cause of both. Controlling for ability removes the spurious correlation that runs through the common cause. In the language of directed acyclic graphs (DAGs), Z lies on a backdoor path between X and Y, and conditioning on it blocks that path [Pearl, 2009].
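A small simulation can make the returns-to-education example concrete. The sketch below is illustrative only: the coefficients (0.8, 0.5, 1.0), the sample size, and the helper `ols` are assumptions chosen for the demonstration, not estimates from any real dataset. Because the data are simulated, ability is observed here, which it rarely is in practice.

```python
import numpy as np

def ols(y, *regressors):
    """Return OLS coefficients (intercept first) of y on the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 100_000

# Assumed data-generating process: ability is a common cause of both variables.
ability = rng.normal(size=n)
schooling = 0.8 * ability + rng.normal(size=n)
wages = 0.5 * schooling + 1.0 * ability + rng.normal(size=n)  # true effect: 0.5

naive = ols(wages, schooling)[1]              # omits the confounder
adjusted = ols(wages, schooling, ability)[1]  # controls for the confounder

print(f"naive estimate:    {naive:.3f}")
print(f"adjusted estimate: {adjusted:.3f}")
```

Under these assumptions the naive coefficient is pushed well above 0.5 by the backdoor path through ability, while the adjusted coefficient lands close to the true value.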

The bad controls discussed below are not confounders in this sense: they are either caused by the treatment, or they create new spurious associations when conditioned upon.

3 Type 1: The Mediator

Figure 1: Controlling for a mediator M blocks the indirect causal path from X to Y. [Diagram: X (Education) → M (Occupation) → Y (Wages), together with the direct edge X → Y.]

A mediator is a variable that sits on the causal path from treatment to outcome. In Figure 1, occupation M is caused by education X and causes wages Y. Part of education's effect on wages runs through occupation: education leads to better jobs, which pay higher wages. The direct path X → Y captures the portion of the education premium not mediated by occupation.

What happens if you control for occupation in a wage regression? You block the X → M → Y pathway, estimating only the direct effect of education on wages. This is not the total effect of education; it is a partial effect that excludes occupation-mediated returns. If you wanted the total effect of education, you have introduced bias by over-controlling.

Rule: Control for a mediator only if you specifically want the direct effect of X on Y, not mediated through M. For total effects, do not control for mediators.
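The gap between the total and the direct effect can be seen in a short simulation. This is a sketch under assumed linear relationships; the path coefficients (0.6, 0.5, 0.3) and the helper `ols` are hypothetical. With these numbers the total effect is 0.3 + 0.6 × 0.5 = 0.6 and the direct effect is 0.3.

```python
import numpy as np

def ols(y, *regressors):
    """Return OLS coefficients (intercept first) of y on the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n = 100_000

# Assumed structure: education -> occupation -> wages, plus education -> wages.
education = rng.normal(size=n)
occupation = 0.6 * education + rng.normal(size=n)
wages = 0.3 * education + 0.5 * occupation + rng.normal(size=n)

total = ols(wages, education)[1]               # no control: total effect
direct = ols(wages, education, occupation)[1]  # mediator controlled: direct effect

print(f"without occupation: {total:.3f}")
print(f"with occupation:    {direct:.3f}")
```

Neither number is wrong in itself. The problem arises only when the coefficient from the second regression is reported as if it were the total effect of education.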

4 Type 2: The Collider

Figure 2: Conditioning on a collider C opens a spurious path between X and Y. [Diagram: X (Beauty) → C (Hired by firm) ← Y (Productivity).]

A collider is a variable that is caused by two other variables. In Figure 2, suppose X is physical attractiveness and Y is productivity. Neither causes the other. But both affect hiring decisions (C). If you analyse only employed workers (conditioning on C), you induce a spurious negative correlation between beauty and productivity: among the hired, those who are less beautiful were probably hired because they are very productive, while those who are very beautiful may have been hired despite lower productivity. Conditioning on the collider opens a path between X and Y that did not previously exist.

Collider bias is pernicious because it is invisible in the data. The pre-conditioning correlation between X and Y may be zero, but conditioning on C makes them appear negatively correlated. No increase in sample size will fix this; it is a fundamental identification problem.
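The beauty and productivity example can be reproduced with simulated data. The hiring rule below (hire when a noisy sum of the two traits exceeds 1.0) is purely an assumption made for illustration, as are the variable names.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

beauty = rng.normal(size=n)
productivity = rng.normal(size=n)  # independent of beauty by construction

# Assumed hiring rule: firms hire when the combined (noisy) score is high enough.
score = beauty + productivity + rng.normal(scale=0.5, size=n)
hired = score > 1.0

corr_all = np.corrcoef(beauty, productivity)[0, 1]
corr_hired = np.corrcoef(beauty[hired], productivity[hired])[0, 1]

print(f"correlation, full population: {corr_all:.3f}")
print(f"correlation, hired only:      {corr_hired:.3f}")
```

In the full population the correlation is close to zero, as constructed. Among the hired it is clearly negative, and raising `n` only makes that spurious estimate more precise.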

Common research examples of collider bias:

  • Controlling for income when studying the effect of education on health, if income is a collider caused by both education and health.
  • Restricting to employed workers when studying the effect of a labour market programme on wages.
  • Conditioning on surviving (non-attrition) in a longitudinal study when attrition is caused by both treatment and outcome.

5 Type 3: The Post-Treatment Variable

A post-treatment variable is one that is causally affected by the treatment. Controlling for it is closely related to the mediator problem: if Z is caused by X, then conditioning on Z partials out some of the variation in X that is causally relevant.

 Example. You want to estimate the effect of a job training programme on wages. You include current employment status as a control. But employment status is an intermediate outcome: the training programme may work precisely by improving employment. Conditioning on employment status removes part of the very mechanism you are trying to measure.

The rule is simple: never condition on post-treatment variables unless you are specifically interested in the effect of treatment on Y given a fixed value of the post-treatment variable.
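The job training example can be sketched as a simulation. Everything here is an assumption for illustration: training is randomly assigned, it raises the probability of employment from 0.5 to 0.7, and it affects earnings only through employment. Under those assumptions the total effect on earnings is 2.0 × 0.2 = 0.4.

```python
import numpy as np

def ols(y, *regressors):
    """Return OLS coefficients (intercept first) of y on the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(3)
n = 100_000

# Assumed design: randomised training that works only by improving employment.
trained = rng.integers(0, 2, size=n).astype(float)
employed = (rng.random(n) < 0.5 + 0.2 * trained).astype(float)
earnings = 2.0 * employed + rng.normal(size=n)

total_effect = ols(earnings, trained)[1]                  # no post-treatment control
conditional_effect = ols(earnings, trained, employed)[1]  # employment controlled

print(f"effect without employment control: {total_effect:.3f}")
print(f"effect with employment control:    {conditional_effect:.3f}")
```

With employment status in the regression, the estimated effect of training collapses towards zero even though the programme works, because the control absorbs the mechanism.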

6 Type 4: M-Bias

M-bias is a subtler form of collider bias. In Figure 3, X and Y have no common causes, so without controls there is no confounding. But M is a collider on the path X ← U₁ → M ← U₂ → Y. Conditioning on M opens this path, creating a spurious correlation between X and Y.

Figure 3: M-bias: conditioning on M opens a spurious path X ← U₁ → M ← U₂ → Y. [Diagram with nodes U₁, U₂, X, M, Y.]

M-bias arises when a researcher adds a "harmless-looking" pre-treatment variable that happens to be a collider of unobserved common causes. The variable may be pre-treatment (not caused by X) and yet dangerous to condition on. This scenario matters less in practice when confounding bias is large relative to the M-bias, but it is theoretically important and can arise in certain research designs.
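A minimal simulation of the M-structure shows the mechanism. The sketch assumes unit coefficients, standard normal noise, and a true effect of X on Y equal to zero; these are choices made for the illustration, and the helper `ols` is hypothetical. Under these assumptions the population coefficient on X after adjusting for M works out to −0.2.

```python
import numpy as np

def ols(y, *regressors):
    """Return OLS coefficients (intercept first) of y on the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(4)
n = 100_000

# Assumed M-structure: X <- U1 -> M <- U2 -> Y, with no effect of X on Y.
u1 = rng.normal(size=n)
u2 = rng.normal(size=n)
x = u1 + rng.normal(size=n)
m = u1 + u2 + rng.normal(size=n)
y = u2 + rng.normal(size=n)

unadjusted = ols(y, x)[1]   # no controls: path is blocked at the collider M
adjusted = ols(y, x, m)[1]  # conditioning on M opens the path

print(f"without M: {unadjusted:.3f}")
print(f"with M:    {adjusted:.3f}")
```

The unadjusted estimate is close to the true value of zero, and adding the pre-treatment variable M is what creates the bias.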

7 A Decision Framework

Before adding any control variable, ask three questions:

  1. Is it caused by the treatment? If yes, it is a mediator or post-treatment variable. Do not include it unless you want the direct effect.
  2. Is it caused by the outcome? Including reverse-caused variables introduces collider bias if the outcome itself has other causes.
  3. Is it a common effect of X (or its causes) and Y (or its causes)? If so, it is a collider. Do not include it.

Only include a variable if it is a cause of both X and Y but is not on the causal path from X to Y, i.e. it is a genuine confounder that satisfies the backdoor criterion [Pearl, 2009].

This reasoning requires drawing (or at least mentally committing to) a causal graph before running any regressions. The DAG is not a data object; it expresses the researcher's substantive assumptions about the causal structure of the world. Making those assumptions explicit, and then reasoning from them about which variables to include, is more rigorous than adding controls because they are "correlated with" the treatment or outcome.
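The three questions can be roughly mechanised once the graph is written down. The sketch below is a simple triage over an assumed DAG stored as a dictionary of parent → children lists; the function names, labels, and example graphs are hypothetical. It is not a full backdoor-criterion check, which requires testing d-separation on every backdoor path.

```python
def descendants(graph, node, skip=None):
    """All nodes reachable from `node` along directed edges, never entering `skip`."""
    seen, stack = set(), [node]
    while stack:
        for child in graph.get(stack.pop(), ()):
            if child != skip and child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def triage(graph, treatment, outcome, candidate):
    """Rough classification of a candidate control variable."""
    # Question 1: is it caused by the treatment?
    if candidate in descendants(graph, treatment):
        return "post-treatment"
    # Question 2: is it caused by the outcome?
    if candidate in descendants(graph, outcome):
        return "caused-by-outcome"
    # Genuine confounder: causes the treatment, and causes the outcome
    # along a path that does not run through the treatment.
    causes_treatment = treatment in descendants(graph, candidate)
    causes_outcome = outcome in descendants(graph, candidate, skip=treatment)
    if causes_treatment and causes_outcome:
        return "confounder"
    # Question 3: a pre-treatment variable with two or more parents is a collider.
    parents = [p for p, children in graph.items() if candidate in children]
    if len(parents) >= 2:
        return "collider"
    return "unclear"

# Assumed graph for the education example.
education_graph = {
    "ability": ["schooling", "wages"],
    "schooling": ["occupation", "wages"],
    "occupation": ["wages"],
}
# Assumed graph for the M-bias example.
m_graph = {"U1": ["X", "M"], "U2": ["M", "Y"]}

print(triage(education_graph, "schooling", "wages", "ability"))
print(triage(education_graph, "schooling", "wages", "occupation"))
print(triage(m_graph, "X", "Y", "M"))
```

A variable labelled "unclear" is not automatically safe; it simply falls outside what this simple triage can decide, and a dedicated DAG tool is the better choice for real analyses.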

8 Practical Guidance

  • Use dagitty or ggdag in R to draw and query your causal graph before running regressions. These tools can identify which variables need to be controlled (open backdoor paths) and which should not be (colliders, mediators).
  • Document your decision. In your paper, justify the inclusion of each control variable with a brief causal argument, not just a correlation statistic.
  • Report sensitivity to controls. If adding or removing a variable substantially changes your estimate, investigate why. The change may reveal a collider or mediator issue.
  • Use "negative controls" to test for collider bias. A negative control outcome (one that should not be affected by the treatment) can reveal collider-induced bias if it shows a treatment effect after conditioning.
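The negative control idea from the last bullet can be sketched with simulated data. The setup is an assumption for illustration: the treatment is randomised, a hidden factor drives a placebo outcome that the treatment cannot affect, and the suspect control is a collider caused by both the treatment and the hidden factor.

```python
import numpy as np

def ols(y, *regressors):
    """Return OLS coefficients (intercept first) of y on the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(5)
n = 100_000

treatment = rng.normal(size=n)                     # randomised by assumption
hidden = rng.normal(size=n)                        # unobserved cause of the placebo
suspect = treatment + hidden + rng.normal(size=n)  # collider: treatment -> suspect <- hidden
placebo = hidden + rng.normal(size=n)              # true treatment effect is zero

placebo_plain = ols(placebo, treatment)[1]
placebo_conditioned = ols(placebo, treatment, suspect)[1]

print(f"placebo effect, no controls:       {placebo_plain:.3f}")
print(f"placebo effect, suspect control:   {placebo_conditioned:.3f}")
```

A clearly non-zero "effect" on the placebo outcome that appears only after adding the suspect control is a warning sign that the control is acting as a collider.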

9 Conclusion

The instinct to add more controls is natural but can be harmful. Mediators block causal pathways. Colliders open spurious pathways. Post-treatment variables absorb causal variation. Each of these mistakes leads to biased causal estimates, not better-identified ones. Cinelli et al. [2022] and the DAG literature [Pearl, 2009] have made these issues precise and actionable. Applied researchers should think carefully about the causal structure of their problem, not just about which variables are correlated with the treatment, before deciding what to include in a regression.

References

  1. Angrist, J. D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.
  2. Cinelli, C., Forney, A., and Pearl, J. (2022). A crash course in good and bad controls. Sociological Methods & Research, 53(3):1071-1099.
  3. Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd edition. Cambridge University Press.
  4. Elwert, F. and Winship, C. (2014). Endogenous selection bias: The problem of conditioning on a collider variable. Annual Review of Sociology, 40:31-53.
  5. Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41-55.
  6. Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.
