Toolbox

dagitty and ggdag in R: Drawing and Querying Causal Graphs

1 What Problem Do These Tools Solve?

Every causal inference study rests on a set of assumptions about the data-generating process: which variables affect which others, which paths are open, and which variables confound the treatment-outcome relationship. These assumptions are often stated in prose (for example, "we assume that, conditional on education and age, income is independent of the error term"), but prose can obscure logical inconsistencies, miss collider bias, or fail to identify the minimal sufficient adjustment set.

Directed acyclic graphs (DAGs) provide a formal, transparent way to encode these assumptions [Pearl, 2009]. A DAG is a set of nodes (variables) and directed edges (arrows representing direct causal relationships), with no cycles. Once drawn, a DAG supports rigorous, algorithmic reasoning about identification: the backdoor criterion [Pearl, 1993], d-separation, minimal adjustment sets, and collider identification.  

The dagitty package [Textor et al., 2016] implements Pearl's graphical framework in R, allowing analysts to encode a DAG and query it for adjustment sets, testable implications, and instrument validity. The ggdag package [Barrett, 2021] provides ggplot2-compatible visualisations of dagitty objects.  

2 Installation and Setup

# Listing 1: Package installation
install.packages(c("dagitty", "ggdag", "ggplot2"))
library(dagitty)
library(ggdag)
library(ggplot2)

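Both packages are on CRAN. On Windows and macOS the CRAN binaries normally need nothing beyond a standard R installation. dagitty runs its graph algorithms through the V8 package (an embedded JavaScript engine), so on some Linux systems installing V8 may require a system library or a prebuilt download.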

3 A Minimal Working Example: Education and Earnings

3.1 Encoding the DAG

# Listing 2: Building a DAG for the education-earnings example
# Define the DAG using dagitty syntax
dag_edu <- dagitty('
  dag {
    Ability [latent, pos="0,1"]
    Family [pos="0,2"]
    School [exposure, pos="1,1.5"]
    Earn [outcome, pos="2,1.5"]
    Ability -> School
    Ability -> Earn
    Family -> School
    Family -> Earn
    School -> Earn
  }
')

# Check that it is a valid DAG
isAcyclic(dag_edu)  # should return TRUE

The DAG encodes:  

  • Ability (latent/unobserved) affects both School and Earn, creating backdoor confounding.  
  • Family background affects both School and Earn.  
  • School directly affects Earn (the causal effect of interest).  

3.2 Querying Adjustment Sets

The key query: what set of observed variables do we need to condition on to identify the effect of School on Earn via the backdoor criterion?  

# Listing 3: Finding sufficient adjustment sets
adjustmentSets(dag_edu, exposure = "School", outcome = "Earn")
# Returns an empty result: no valid adjustment set exists,
# because Ability is latent and cannot be adjusted for

The empty output tells us that no set of observed variables blocks every backdoor path. Conditioning on Family closes School <- Family -> Earn, but School <- Ability -> Earn stays open because Ability is latent (unobserved) and cannot enter an adjustment set. In other words, selection on observables cannot identify the effect of schooling when ability is unobserved. This confirms what econometricians know from the omitted variable bias formula.
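To see which backdoor path survives, the paths can be listed directly. The sketch below assumes dagitty's paths() function, which returns each path between two nodes together with a flag for whether it is open given a conditioning set Z. The DAG is re-declared so the block runs on its own, and the object name p is only illustrative.

```r
library(dagitty)

# Same DAG as in Listing 2 (positions omitted for brevity)
dag_edu <- dagitty('
  dag {
    Ability [latent]
    Family
    School [exposure]
    Earn [outcome]
    Ability -> School
    Ability -> Earn
    Family -> School
    Family -> Earn
    School -> Earn
  }
')

# List every path from School to Earn and whether it is open
# once we condition on Family
p <- paths(dag_edu, from = "School", to = "Earn", Z = "Family")
data.frame(path = p$paths, open = p$open)
```

The path through Family should be reported as closed, while the causal path and the path through Ability remain open.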

3.3 Checking Testable Implications

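A DAG with at least one missing edge implies a set of conditional independence relations (d-separations), and those that involve only observed variables can in principle be tested in data:

# Listing 4: Extracting testable implications from the DAG
impliedConditionalIndependencies(dag_edu)
# Returns the conditional independencies implied by the graph

If the DAG is correctly specified, these independence restrictions should hold approximately in the data. Testing them (e.g., using partial correlations or regression residuals) is a form of DAG specification testing. For dag_edu the result should be empty: the only missing edge is the one between Ability and Family, and Ability is latent, so this particular graph has no implication that can be checked against observed data.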
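A minimal sketch of such a specification test follows. It assumes a hypothetical variant of the example in which Ability is measured (so that one implication becomes testable), simulated data with arbitrary coefficients, and dagitty's localTests() function with type = "cis", which tests each implied independence with a (partial) correlation. The names dag_obs, sim, and res are illustrative.

```r
library(dagitty)

# Hypothetical variant of the example in which Ability is measured
dag_obs <- dagitty('
  dag {
    Ability -> School
    Ability -> Earn
    Family -> School
    Family -> Earn
    School -> Earn
  }
')

# The only missing edge is between Ability and Family,
# so exactly one independence is implied
ici <- impliedConditionalIndependencies(dag_obs)
print(ici)

# Simulate data that respect the DAG (coefficients are arbitrary)
set.seed(1)
n <- 5000
Ability <- rnorm(n)
Family  <- rnorm(n)
School  <- 0.5 * Ability + 0.5 * Family + rnorm(n)
Earn    <- 0.4 * School + 0.3 * Ability + 0.3 * Family + rnorm(n)
sim <- data.frame(Ability, Family, School, Earn)

# Test each implied independence against the simulated data
res <- localTests(dag_obs, data = sim, type = "cis")
print(res)
```

An estimate near zero with a confidence interval covering zero is consistent with the DAG; a clearly non-zero estimate would suggest a missing arrow or an omitted common cause.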

4 Visualising with ggdag

# Listing 5: Visualising the DAG with ggdag
# Convert dagitty object to tidy format
# (node positions are taken from the pos attributes in the DAG string)
tidy_dag <- tidy_dagitty(dag_edu)

# Basic plot, built layer by layer so that nodes and edges are drawn only once
ggplot(tidy_dag, aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_edges_arc() +
  geom_dag_node(aes(color = name)) +
  geom_dag_label_repel(aes(label = name)) +
  theme_dag() +
  labs(title = "Education and Earnings DAG")

# Highlight adjustment set
ggdag_adjustment_set(tidy_dag, exposure = "School", outcome = "Earn") +
  theme_dag()

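ggdag_adjustment_set() marks adjusted and unadjusted nodes differently (by colour and shape, with one panel per adjustment set), making it easy to communicate the identification strategy visually. For dag_edu, where Ability is latent and no valid adjustment set exists, expect a warning that the backdoor paths cannot be closed rather than a highlighted set.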
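Because no valid adjustment set exists for dag_edu while Ability is latent, the highlighted plot is easier to demonstrate on a variant in which Ability is treated as measured. That is a hypothetical assumption made only for illustration, and the names dag_obs and p_adj are illustrative too.

```r
library(dagitty)
library(ggdag)
library(ggplot2)

# Hypothetical variant: Ability is observed, so {Ability, Family} is a valid set
dag_obs <- dagitty('
  dag {
    Ability [pos="0,1"]
    Family [pos="0,2"]
    School [exposure, pos="1,1.5"]
    Earn [outcome, pos="2,1.5"]
    Ability -> School
    Ability -> Earn
    Family -> School
    Family -> Earn
    School -> Earn
  }
')

# Build the adjustment-set plot; the result is an ordinary ggplot object
p_adj <- ggdag_adjustment_set(dag_obs, exposure = "School", outcome = "Earn") +
  theme_dag() +
  labs(title = "Adjustment set when Ability is measured")

p_adj
```

Since the result is a ggplot object, it can be saved with ggsave() or restyled with further ggplot2 layers.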

5 Collider Bias: A DAG-Based Warning System

One of the most valuable uses of DAGs is identifying collider bias: the bias introduced by conditioning on a common effect of two variables. Consider a healthcare setting:

# Listing 6: Collider bias example: hospitalisation and mortality
dag_collider <- dagitty('
  dag {
    Disease [pos="0,1"]
    Injury [pos="0,0"]
    Hospital [pos="1,0.5"]
    Death [outcome, pos="2,0.5"]
    Disease -> Death
    Disease -> Hospital
    Injury -> Hospital
    Hospital -> Death
  }
')

# Is Disease d-separated from Injury given Hospital?
dseparated(dag_collider, "Disease", "Injury", c("Hospital"))
# Returns FALSE: conditioning on Hospital opens a collider path!

Without conditioning on Hospital, disease and injury are independent (no common cause). Conditioning on hospitalisation opens a collider path: among hospitalised patients, disease and injury are negatively correlated (if you're in hospital, knowing you have a disease makes it less likely you have an injury as the cause). This is the infamous "Berkson's paradox" [Berkson, 1946], and it can create spurious associations or mask real ones.  

DAGs make collider bias transparent and mechanical to detect: conditioning on a collider always opens a path between its parents that was previously closed.  
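The same effect can be reproduced numerically. The base-R simulation below is a sketch with arbitrary, made-up probabilities: disease and injury are drawn independently, each raises the chance of hospitalisation, and the correlation between them is then compared in the whole population and among hospitalised patients only.

```r
set.seed(42)
n <- 100000

# Disease and injury occur independently in the population
disease <- rbinom(n, 1, 0.10)
injury  <- rbinom(n, 1, 0.10)

# Either condition raises the probability of hospitalisation
# (the probabilities are arbitrary illustration values)
p_hosp   <- 0.05 + 0.60 * disease + 0.30 * injury
hospital <- rbinom(n, 1, p_hosp)

# Whole population: correlation close to zero
cor_all <- cor(disease, injury)

# Hospitalised patients only: conditioning on the collider
cor_hosp <- cor(disease[hospital == 1], injury[hospital == 1])

c(population = cor_all, hospitalised = cor_hosp)
```

The population correlation should be near zero, while the correlation among hospitalised patients should be clearly negative, even though neither variable causes the other.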

6 Instrument Validity

DAGs also formalize IV validity conditions:  

# Listing 7: Testing instrument validity in a DAG
# Z: instrument (proximity to college)
# U: unobservable (ability)
# D: treatment (college)
# Y: outcome (earnings)
dag_iv <- dagitty('
  dag {
    Z [pos="0,1"]
    U [latent, pos="1,0"]
    D [exposure, pos="1,1"]
    Y [outcome, pos="2,1"]
    Z -> D
    U -> D
    U -> Y
    D -> Y
  }
')

# Is Z a valid instrument for D -> Y?
instrumentalVariables(dag_iv, exposure = "D", outcome = "Y")
# Lists Z: it satisfies the graphical IV conditions (relevance, exclusion, exogeneity)

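The instrumentalVariables() function searches for variables that satisfy the graphical conditions for IV validity: the instrument must be d-connected to the treatment (relevance), and it must be d-separated from the outcome once the treatment's causal effect on the outcome is removed from the graph (exclusion and exogeneity). Both conditions may hold only after conditioning on a set of observed covariates, in which case the function reports the instrument together with that set. Conditioning on the treatment itself is not part of the criterion: in dag_iv, D is a collider on the path Z -> D <- U -> Y, so adjusting for D would open a path from Z to Y.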
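To illustrate how the check responds to a violated exclusion restriction, the sketch below compares the DAG from Listing 7 with a hypothetical variant in which proximity to college also affects earnings directly (an added Z -> Y arrow). The variant and the object names are assumptions made for illustration.

```r
library(dagitty)

# Valid case: Z affects Y only through D
dag_iv_ok <- dagitty('
  dag {
    U [latent]
    D [exposure]
    Y [outcome]
    Z -> D
    U -> D
    U -> Y
    D -> Y
  }
')

# Hypothetical violation: Z also has a direct effect on Y
dag_iv_bad <- dagitty('
  dag {
    U [latent]
    D [exposure]
    Y [outcome]
    Z -> D
    Z -> Y
    U -> D
    U -> Y
    D -> Y
  }
')

iv_ok  <- instrumentalVariables(dag_iv_ok,  exposure = "D", outcome = "Y")
iv_bad <- instrumentalVariables(dag_iv_bad, exposure = "D", outcome = "Y")

print(iv_ok)     # should list Z
length(iv_bad)   # no instrument should be found
```

The graph cannot tell you whether the exclusion restriction holds in reality; it only reports what follows from the arrows you chose to draw.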

7 Comparison to Alternatives

For the econometrics researcher using R, dagitty + ggdag is the natural choice: it implements the full Pearl graphical calculus, integrates with ggplot2 for publication-quality figures, and is available on CRAN with no special installation requirements.  

Tool | Language | Strength | Limitation
dagitty (R) | R | Full Pearl calculus, CRAN | Static visualisation
ggdag (R) | R | ggplot2 integration | Plotting only
dagitty.net | Web browser | Interactive GUI | No scripted workflow
causaldag (Py) | Python | DAG + estimation | Less graph querying
DoWhy (Py) | Python | End-to-end pipeline | Heavy dependencies
Table 1: Causal Graph Tools Comparison

8 Key Options and Pitfalls

  • Mark latent variables: Use [latent] to tag unobserved variables. This prevents adjustmentSets() from including them in adjustment sets.  
  • Use pos for consistent layouts: Specifying node positions in the DAG string ensures figures are reproducible.  
  • DAGs encode qualitative structure, not effect sizes: A DAG arrow means "direct causal relationship exists"; it says nothing about magnitude or sign.  
  • DAGs require completeness: Omitting a node or arrow that exists in the true data-generating process can lead to erroneous identification conclusions. When in doubt, include more structure and check whether the conclusions are robust.  
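The first and last points can be checked together by querying the same graph under two assumptions about what is measured. The sketch below uses dagitty's latents() setter. Treating Ability as observed is a hypothetical scenario, and the object names are illustrative.

```r
library(dagitty)

# Education-earnings DAG with every variable treated as observed
dag_full <- dagitty('
  dag {
    School [exposure]
    Earn [outcome]
    Ability -> School
    Ability -> Earn
    Family -> School
    Family -> Earn
    School -> Earn
  }
')

# With Ability measured, adjusting for both confounders is sufficient
sets_observed <- adjustmentSets(dag_full)
print(sets_observed)

# Mark Ability as latent and repeat the query on a copy of the graph
dag_latent <- dag_full
latents(dag_latent) <- "Ability"
sets_latent <- adjustmentSets(dag_latent)
length(sets_latent)   # 0: no valid adjustment set remains
```

Running the same query under alternative graphs is a cheap way to see how sensitive an identification argument is to a single modelling assumption.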

References

  1. Barrett, M. (2021). ggdag: Analyse and Create Elegant Directed Acyclic Graphs. R package version 0.2.3. https://CRAN.R-project.org/package=ggdag.  
  2. Berkson, J. (1946). Limitations of the application of fourfold table analysis to hospital data. Biometrics Bulletin, 2(3), 47-53.  
  3. Pearl, J. (1993). Bayesian analysis in expert systems: Comment: Graphical models, causality and intervention. Statistical Science, 8(3), 266-269.  
  4. Pearl, J. (2009). Causality: Models, Reasoning, and Inference, 2nd ed. Cambridge University Press.  
  5. Textor, J., van der Zander, B., Gilthorpe, M. S., Liskiewicz, M., and Ellison, G. . . . (2016). Robust causal inference using directed acyclic graphs: The R package 'dagitty'. International Journal of Epidemiology, 45(6), 1887-1894.  
