Beginner's Corner

Directed Acyclic Graphs: Drawing Your Causal Assumptions

1 Why Draw Your Assumptions?

Every causal inference study rests on assumptions. The trouble is that these assumptions often remain implicit, buried in sentences like "we control for observable confounders" or "we assume selection on observables." When assumptions are implicit, they are hard to scrutinise, hard to communicate, and easy to get wrong.

Directed acyclic graphs (DAGs) make assumptions explicit and visual. A DAG is a diagram in which nodes represent variables and arrows represent causal relationships. By drawing a DAG before running a regression, a researcher commits to a specific causal story and opens it up for critique.

The systematic use of DAGs in social science is associated with the work of Pearl [2009], who showed that DAGs provide a formal mathematical language for causal reasoning. This article introduces the basics.

2 The Language of DAGs

A DAG consists of:

  • Nodes: variables in the model (X, Y, Z, U, etc.)
  • Directed edges (arrows): $X\rightarrow Y$ means "X directly causes Y"
  • Acyclicity: no variable can cause itself through a chain of arrows. There are no cycles.
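These ingredients are easy to represent in code. The sketch below (Python; the dictionary encoding and the `has_cycle` helper are illustrative, not from any particular library) stores a three-node graph as a dictionary of directed edges and checks acyclicity with a depth-first search:

```python
# A DAG as a dictionary: node -> list of children (directed edges).
dag = {
    "X": ["D", "Y"],   # X -> D, X -> Y
    "D": ["Y"],        # D -> Y
    "Y": [],
}

def has_cycle(graph):
    """Return True if any node can reach itself via directed edges."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / on current path / finished
    color = {v: WHITE for v in graph}

    def visit(v):
        color[v] = GRAY
        for child in graph.get(v, []):
            if color[child] == GRAY:          # back edge: a directed cycle
                return True
            if color[child] == WHITE and visit(child):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in graph)

print(has_cycle(dag))  # False: no cycles, so this is a valid DAG
```

A back edge to a "gray" node means the search has returned to a vertex still on the current path, which is exactly a directed cycle.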

In Figure 1, D directly causes Y (the arrow from D to Y). But X causes both D and Y: X is a confounder. If we regress Y on D without controlling for X, we get a biased estimate of the causal effect of D on Y, because some of the correlation between D and Y is driven by the common cause X.

Figure 1: A simple DAG with one confounder X. Both D and X affect Y; X also affects D, creating a backdoor path.
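The confounding bias is easy to reproduce in a simulation. In the sketch below (Python with NumPy; all coefficients are made up, and the true effect of D on Y is set to 1), the naive regression of Y on D is compared with one that controls for X:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Structural model: X confounds D and Y; the true effect of D on Y is 1.
X = rng.normal(size=n)
D = 2.0 * X + rng.normal(size=n)
Y = 1.0 * D + 3.0 * X + rng.normal(size=n)

# Naive regression of Y on D alone: picks up the backdoor path through X.
naive = np.polyfit(D, Y, 1)[0]

# Controlling for X: regress Y on [D, X, intercept] by least squares.
Z = np.column_stack([D, X, np.ones(n)])
adjusted = np.linalg.lstsq(Z, Y, rcond=None)[0][0]

print(round(naive, 2), round(adjusted, 2))  # naive ≈ 2.2, adjusted ≈ 1.0
```

With these coefficients the naive slope converges to Cov(D, Y)/Var(D) = 11/5 = 2.2, while adjusting for X recovers the true effect of 1.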

3 Paths, Backdoor Paths, and Colliders

Understanding DAGs requires three concepts: paths, backdoor paths, and colliders.

3.1 Paths

A path between D and Y is any sequence of edges connecting them, regardless of arrow direction. In Figure 1, there are two paths:

  1. $D\rightarrow Y$ (the direct causal path)
  2. $D\leftarrow X\rightarrow Y$ (a non-causal, "backdoor" path)

To identify the causal effect of D on Y, we must block all backdoor paths while leaving the direct path open.
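Because paths ignore arrow direction, enumerating them is a small graph search. The `all_paths` helper below is illustrative (not part of any DAG library); it lists the simple paths between D and Y in Figure 1's graph:

```python
# Edges of Figure 1's DAG, stored with direction so the graph is explicit.
edges = [("D", "Y"), ("X", "D"), ("X", "Y")]

def all_paths(edges, start, end):
    """Enumerate simple paths from start to end, ignoring arrow direction."""
    # Undirected adjacency: a path may traverse an edge either way.
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)

    paths = []
    def walk(node, visited):
        if node == end:
            paths.append(list(visited))
            return
        for nxt in sorted(nbrs[node]):
            if nxt not in visited:
                walk(nxt, visited + [nxt])
    walk(start, [start])
    return paths

for p in all_paths(edges, "D", "Y"):
    print(p)
# ['D', 'X', 'Y']   the backdoor path D <- X -> Y
# ['D', 'Y']        the direct causal path
```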

3.2 Backdoor Criterion

A set of variables Z satisfies the backdoor criterion for estimating the effect of D on Y if:

  1. Z blocks all backdoor paths from D to Y.
  2. Z does not contain any descendant of D.

If Z satisfies the backdoor criterion, then controlling for Z identifies the causal effect: $P(Y\mid do(D=d))=\sum_{z}P(Y\mid D=d,Z=z)P(Z=z)$

In Figure 1, controlling for X blocks the only backdoor path $D\leftarrow X\rightarrow Y$, identifying the effect.
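The adjustment formula can be evaluated by hand on a toy example. In the sketch below all probabilities are made up; the point is that the backdoor adjustment and the naive conditional P(Y | D) give different answers when X is a confounder:

```python
# Made-up binary distribution for Figure 1 via its DAG factorisation:
#   P(X, D, Y) = P(X) * P(D | X) * P(Y | D, X)
p_x = {0: 0.6, 1: 0.4}
p_d_given_x = {0: 0.2, 1: 0.7}             # P(D=1 | X=x)
p_y_given_dx = {(0, 0): 0.1, (0, 1): 0.4,  # P(Y=1 | D=d, X=x)
                (1, 0): 0.5, (1, 1): 0.8}

def p_y1_do(d):
    """Backdoor adjustment: sum_x P(Y=1 | D=d, X=x) P(X=x)."""
    return sum(p_y_given_dx[(d, x)] * p_x[x] for x in (0, 1))

def p_y1_given(d):
    """Naive conditional P(Y=1 | D=d): weights X by P(X=x | D=d) instead."""
    pd = lambda x: p_d_given_x[x] if d == 1 else 1 - p_d_given_x[x]
    z = sum(pd(x) * p_x[x] for x in (0, 1))
    return sum(p_y_given_dx[(d, x)] * pd(x) * p_x[x] / z for x in (0, 1))

causal = p_y1_do(1) - p_y1_do(0)
naive = p_y1_given(1) - p_y1_given(0)
print(round(causal, 2), round(naive, 2))  # 0.4 0.55: adjustment vs naive
```

The naive contrast (0.55) overstates the causal one (0.4) because high-X units are over-represented among the treated.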

3.3 Colliders

A collider is a node where two arrowheads meet: $A\rightarrow C\leftarrow B$. Colliders behave counterintuitively: they block paths through them by default, but open them if you condition on the collider.

Figure 2: C is a collider on the path $D\rightarrow C\leftarrow Y$. Conditioning on C opens this path and induces a spurious correlation between D and Y.

Example: Suppose D is talent, Y is effort, and C is "selected into a top firm" (which requires either talent or effort). If we restrict our sample to employees at top firms (conditioning on C), we create a spurious negative correlation between talent and effort: among people at top firms, those with less talent got there through greater effort. This is called collider bias or selection bias.

The practical lesson: never control for a collider. Only control for confounders.
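The talent-and-effort story can be checked by simulation. In the sketch below the selection threshold is made up; talent and effort are generated independently, yet they become negatively correlated once we condition on the collider:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

talent = rng.normal(size=n)
effort = rng.normal(size=n)            # independent of talent by construction
top_firm = (talent + effort > 1.5)     # collider: selection via either cause

full = np.corrcoef(talent, effort)[0, 1]
selected = np.corrcoef(talent[top_firm], effort[top_firm])[0, 1]

print(round(full, 2), round(selected, 2))  # ≈ 0.0, then clearly negative
```

Nothing causal connects talent and effort here; the negative correlation is manufactured entirely by the sample restriction.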

4 Mediation and the Front-Door Criterion

Sometimes we want to know not just whether D causes Y, but how: through what mechanism. A mediator M lies on the causal path from D to Y: $D\rightarrow M\rightarrow Y$.

If there is an unmeasured confounder U between D and Y, the backdoor criterion cannot be satisfied by observables. But if M "mediates" the entire effect of D on Y, and M is unconfounded conditional on D, the front-door criterion allows identification without controlling for U: $P(Y\mid do(D=d))=\sum_{m}P(M=m\mid D=d)\sum_{d^{\prime}}P(Y\mid D=d^{\prime},M=m)P(D=d^{\prime})$

This is a profound result: by sequentially applying identification to each link in the causal chain, we can identify effects even with unmeasured confounding.
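The front-door formula can be verified numerically. The sketch below builds a made-up discrete model with an unmeasured U, computes P(Y=1 | do(D=d)) directly from the structural equations, and checks that the front-door expression, which uses only observable quantities, reproduces it:

```python
from itertools import product

# Made-up structural model with unmeasured confounder U:
#   U -> D, U -> Y, and D -> M -> Y (M fully mediates D's effect on Y).
p_u = {0: 0.5, 1: 0.5}
p_d_given_u = {0: 0.3, 1: 0.8}              # P(D=1 | U=u)
p_m_given_d = {0: 0.2, 1: 0.9}              # P(M=1 | D=d)
p_y_given_mu = {(0, 0): 0.1, (0, 1): 0.5,   # P(Y=1 | M=m, U=u)
                (1, 0): 0.6, (1, 1): 0.9}

def joint(u, d, m, y):
    """Observational joint P(u, d, m, y) from the factorisation above."""
    pd = p_d_given_u[u] if d else 1 - p_d_given_u[u]
    pm = p_m_given_d[d] if m else 1 - p_m_given_d[d]
    py = p_y_given_mu[(m, u)] if y else 1 - p_y_given_mu[(m, u)]
    return p_u[u] * pd * pm * py

def p(query, given=None):
    """Marginal or conditional probability, by brute-force enumeration."""
    given = given or {}
    num = den = 0.0
    for u, d, m, y in product((0, 1), repeat=4):
        point = {"u": u, "d": d, "m": m, "y": y}
        w = joint(u, d, m, y)
        if all(point[k] == v for k, v in given.items()):
            den += w
            if all(point[k] == v for k, v in query.items()):
                num += w
    return num / den

def front_door(d):
    """sum_m P(m|d) sum_d' P(y=1|d',m) P(d')  -- observables only, no U."""
    return sum(
        p({"m": m}, {"d": d})
        * sum(p({"y": 1}, {"d": dp, "m": m}) * p({"d": dp}) for dp in (0, 1))
        for m in (0, 1)
    )

def truth(d):
    """P(Y=1 | do(D=d)) computed directly from the structural model."""
    return sum(
        (p_m_given_d[d] if m else 1 - p_m_given_d[d]) * p_u[u] * p_y_given_mu[(m, u)]
        for m, u in product((0, 1), repeat=2)
    )

print(round(front_door(1), 6) == round(truth(1), 6))  # True: they agree
```

Here `front_door` never touches U, yet it matches the interventional truth exactly, because M screens off D from Y and is itself unconfounded given D.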

5 Common Mistakes in Regression

DAGs clarify several common errors in applied work:

  1. Controlling for a mediator ("over-controlling"): If M is on the causal path from D to Y and you include M in the regression, you block part of the causal effect. You identify only the direct effect of D on Y, not the total effect.
  2. Controlling for a collider: As described above, this induces bias even if the regression looks "more controlled."
  3. Bad controls: Angrist and Pischke [2009] discuss "bad controls": variables that are themselves outcomes of treatment. Including them in a regression blocks causal paths and biases estimates.
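Mistake 1 (over-controlling for a mediator) shows up clearly in a simulation. In the sketch below the coefficients are made up: the direct effect of D on Y is 1, the indirect effect through M is 0.5 × 2 = 1, so the total effect is 2:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Structural model: D -> Y directly (effect 1) and via M (effect 0.5 * 2 = 1).
D = rng.normal(size=n)
M = 0.5 * D + rng.normal(size=n)
Y = 1.0 * D + 2.0 * M + rng.normal(size=n)

total = np.polyfit(D, Y, 1)[0]                     # regress Y on D only
Z = np.column_stack([D, M, np.ones(n)])
direct = np.linalg.lstsq(Z, Y, rcond=None)[0][0]   # regress Y on D and M

print(round(total, 2), round(direct, 2))  # total ≈ 2.0, direct ≈ 1.0
```

Adding M to the regression is not "more careful": it silently changes the estimand from the total effect to the direct effect.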

6 Worked Example: Returns to Education

Suppose we want to estimate the causal effect of education (D) on wages (Y). A plausible DAG includes:

  • Ability (U, unobserved) → Education and Wages (confounder)
  • Family background (X) → Education and Wages (observable confounder)
  • Occupation (M) on the path Education → Occupation → Wages (mediator)

What should we control for?

  • Family background: Yes. It is an observable confounder, and controlling for it blocks the backdoor path via family background.
  • Occupation: No (if we want the total effect). It is a mediator, and controlling for it blocks part of the causal effect of education.
  • Ability: We cannot (it is unobserved). This is why instrumental variable (IV) strategies (e.g., twin studies, college proximity) are needed to fully identify the effect.

7 Software: dagitty and ggdag

The dagitty package allows you to draw DAGs, identify adjustment sets, and test for implied conditional independences. The ggdag package wraps dagitty with ggplot2 for visualisation.

library(dagitty)
library(ggdag)

dag <- dagitty("dag{
  D->Y
  X->D; X->Y
  D->M->Y
}")
adjustmentSets(dag, exposure="D", outcome="Y")
# Returns: { X }  (control for X to identify total effect)
ggdag(dag) + theme_dag()

8 Conclusion

DAGs are a powerful tool for making causal assumptions explicit. By drawing the causal structure of your problem before running regressions, you can identify which variables to control for, which to avoid, and where unmeasured confounding is likely to be a problem. The backdoor criterion and front-door criterion give formal rules for identification; collider bias warns against the intuitive-but-wrong instinct to "control for everything." Drawing your DAG is an act of intellectual honesty and it often reveals that the standard regression controls are wrong.

References

  1. Pearl, J. (2009). Causality: Models, Reasoning, and Inference, 2nd ed. Cambridge University Press.
  2. Pearl, J., Glymour, M., and Jewell, N.P. (2016). Causal Inference in Statistics: A Primer. Wiley.
  3. Angrist, J.D. and Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.
  4. Cunningham, S. (2021). Causal Inference: The Mixtape. Yale University Press.
