The Causal Review

1 The Question

Causal discovery, the task of recovering a directed acyclic graph (DAG) from observational data, has been a core problem in statistics and computer science since the work of Pearl [2009] and the PC/FCI algorithms of Spirtes et al. [2000]. Traditional methods work from data alone: constraint-based algorithms (PC, FCI) rely on conditional independence tests, while LiNGAM exploits non-Gaussianity in linear models and NOTEARS casts structure learning as a continuous optimisation problem. All are supplemented occasionally by domain knowledge encoded by hand.

The arrival of large language models (LLMs) has prompted a provocative question: can a model trained on vast corpora of human knowledge do causal discovery not through data analysis, but through reading? After all, the causal structure of many systems is discussed extensively in scientific literature: textbooks state that smoking causes cancer, economics papers assert that monetary shocks affect output, medical guidelines embed causal diagrams. If LLMs have absorbed this knowledge, they might constitute a form of domain expert that can propose DAG structures without running a single conditional independence test.

This article reviews what is and is not known about LLMs as causal reasoning engines, drawing on a growing body of empirical evaluations published between 2023 and 2025.

2 What Causal Discovery Requires

Recovering a causal DAG G from data requires more than pattern recognition. Formally, a DAG G over variables V = {V₁, ..., Vₚ} is Markov and faithful to the data-generating distribution P(V) if the d-separation statements in G coincide exactly with the conditional independencies in P [Pearl, 2009]. The PC algorithm recovers the Markov equivalence class (the CPDAG) under faithfulness and causal sufficiency. The FCI algorithm relaxes causal sufficiency to allow latent confounders.

These algorithms are agnostic about variable meanings. They test Vᵢ ⊥ Vⱼ | S over candidate conditioning sets S, orient unshielded colliders, and then propagate orientations using the Meek rules. The critical limitation is statistical and computational cost: with p variables, the number of conditioning sets grows exponentially in p in the worst case, and tests with large conditioning sets need very large n to be reliable, making the problem hard for high-dimensional systems.
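To make the test Vᵢ ⊥ Vⱼ | S concrete, here is a minimal sketch of one common choice, a Fisher z-test on the partial correlation. It assumes roughly linear-Gaussian data, and the function name `partial_corr_test` and the simulated chain V1 → V2 → V3 are illustrative assumptions, not taken from any of the cited papers.

```python
import math
import numpy as np

def partial_corr_test(data, i, j, cond=()):
    """Fisher z-test of column i independent of column j given columns in cond.

    Returns (partial correlation, two-sided p-value). Assumes roughly
    linear-Gaussian data; this is an illustrative choice of test.
    """
    cols = [i, j, *cond]
    corr = np.corrcoef(data[:, cols], rowvar=False)
    precision = np.linalg.inv(corr)
    # Partial correlation of the first two columns given the rest.
    r = -precision[0, 1] / math.sqrt(precision[0, 0] * precision[1, 1])
    r = max(min(r, 0.999999), -0.999999)  # keep the log below finite
    n = data.shape[0]
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(cond) - 3)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return r, p_value

# Hypothetical chain V1 -> V2 -> V3: V1 and V3 are dependent marginally
# but independent once V2 is conditioned on.
rng = np.random.default_rng(0)
n = 5000
v1 = rng.normal(size=n)
v2 = 2.0 * v1 + rng.normal(size=n)
v3 = -1.0 * v2 + rng.normal(size=n)
data = np.column_stack([v1, v2, v3])

r_marginal, p_marginal = partial_corr_test(data, 0, 2)
r_given_v2, p_given_v2 = partial_corr_test(data, 0, 2, cond=(1,))
print(f"marginal: r={r_marginal:+.3f}, p={p_marginal:.3g}")
print(f"given V2: r={r_given_v2:+.3f}, p={p_given_v2:.3g}")
```

A constraint-based algorithm would remove the V1-V3 edge on the strength of the second test. Note that the test never looks at what the columns are called, which is exactly the sense in which these methods are agnostic about variable meanings.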

LLMs operate differently. They do not conduct independence tests. Instead, they draw on semantic information—the meaning of variable names, their associations in text, and causal claims made by authors in the training corpus. This suggests a complementary, not competing, role: LLMs as a source of prior causal knowledge to augment or initialise statistical causal discovery.

3 What the Evidence Shows

3.1 LLMs Rely on Semantics, Not Data

The most consistently replicated finding in evaluations of LLM causal discovery is that models rely primarily on the names of variables, not their data distributions. Kıcıman et al. [2023] test GPT-4 on benchmark causal discovery tasks and find strong performance, but that performance collapses when variable names are permuted to remove semantic meaning. A model that correctly identifies "smoking → cancer" often fails to recover the same edge when the variables are relabelled V₁ and V₂.
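The relabelling ablation described above can be sketched in a few lines. This is only an illustration of the idea: the helper names `anonymise` and `edge_query` and the prompt wording are assumptions, not the protocol used in the cited evaluation.

```python
def anonymise(variables):
    """Map each semantic variable name to a neutral label V1, V2, ..."""
    return {name: f"V{k}" for k, name in enumerate(variables, start=1)}

def edge_query(cause, effect):
    """Build a pairwise causal question (hypothetical prompt wording)."""
    return f"Does {cause} cause {effect}? Answer yes or no."

variables = ["smoking", "cancer"]
mapping = anonymise(variables)

# The same edge, asked with and without semantic content.
semantic_prompt = edge_query("smoking", "cancer")
anonymous_prompt = edge_query(mapping["smoking"], mapping["cancer"])

print(semantic_prompt)
print(anonymous_prompt)
```

Comparing a model's accuracy on the two prompt sets separates what it retrieves from variable names from what it can infer without them.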

This is not a bug but a feature of how LLMs are trained: they compress associations from text, and causal associations in text are entangled with variable semantics. The implication is that LLM-based causal discovery is really causal knowledge retrieval: the model answers "does X cause Y?" by recalling what the literature has said about X and Y, not by reasoning from first principles.

3.2 Hallucination and Confabulation

A second persistent problem is hallucination. LLMs confidently assert causal relationships that are contested, reversed, or fabricated. In evaluations on biomedical DAGs [Ban et al., 2023], GPT-4 achieves F1 scores around 0.6 on edge orientation—respectable, but with a long tail of confidently incorrect orientations on edges where the literature is ambiguous. Models frequently confuse correlation with causation when the association is strong but the direction is debated.
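For readers unfamiliar with how an orientation F1 score is computed, the following is a minimal sketch that treats each directed edge as a prediction. The exact scoring protocol of the cited evaluation may differ, and the toy edge lists are hypothetical.

```python
def edge_f1(predicted, truth):
    """Precision, recall and F1 over directed edges given as (cause, effect) pairs."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical ground truth and model output; one edge is reversed.
truth = [("smoking", "tar"), ("tar", "cancer"), ("genotype", "cancer")]
predicted = [("smoking", "tar"), ("cancer", "tar"), ("genotype", "cancer")]

precision, recall, f1 = edge_f1(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

A reversed edge is penalised twice under this scoring: it counts as a false positive for the wrong direction and a false negative for the right one.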

A subtle failure mode is what Zečević et al. [2023] term "causal parroting": the model reproduces the dominant view in its training data, even when that view is wrong or the question requires reasoning beyond what has been stated. This means LLMs encode the biases of the scientific literature they were trained on, including publication biases toward positive results.

3.3 Performance on Structured Benchmarks

Jiralerspong et al. [2024] evaluate LLMs on SACHS (protein signalling), ASIA (a medical DAG), and CHILD networks. LLMs outperform constraint-based algorithms on SACHS (a well-studied DAG whose edges appear in the literature) but underperform on synthetic networks with no semantic content.

Ban et al. [2023] find that prompting strategies matter considerably: chain-of-thought prompting improves F1 by 10-15 percentage points over direct edge queries. Providing partial graph structure ("given that A causes B, does A cause C?") further improves performance.

The CAUSALFUSION system of Amazon Science [2025] combines an LLM as domain expert with graph falsification tests on data. The LLM proposes a candidate DAG; the statistical tests prune implausible edges; the LLM revises. This hybrid approach outperforms either component alone on health and finance DAGs.

3.4 Can LLMs Do Causal Reasoning, Not Just Retrieval?

A deeper question is whether LLMs can engage in novel causal reasoning—inferring causal directions in domains not well represented in their training data, or reasoning through multi-step causal chains. The evidence here is more pessimistic.

Zečević et al. [2023] design causal reasoning tasks that require applying do-calculus rules to novel DAGs described in the prompt. GPT-4 performs near chance on backdoor adjustment when the DAG is presented abstractly (as a graph structure) and the question cannot be answered by pattern matching to known examples. Performance improves substantially when variable names are semantically loaded, confirming the reliance on retrieval rather than reasoning.

Ji et al. [2024] further show that LLMs systematically conflate three distinct operations: (i) reporting an observed correlation, (ii) predicting the outcome of an intervention, and (iii) identifying counterfactual outcomes. These three rungs of Pearl's causal hierarchy [Pearl, 2009] require strictly different information, but LLMs treat them as equivalent. A model asked "if we set X = 1, what happens to Y?" often reports the conditional expectation E[Y | X=1] (associational) rather than the interventional quantity E[Y | do(X=1)].
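The gap between E[Y | X=1] and E[Y | do(X=1)] is easy to see in a small simulation. The sketch below assumes a hypothetical structural model with a single binary confounder Z; all coefficients are made up for illustration. Under this model the true interventional mean is 1 + 2·P(Z=1) = 2.0, while the conditional mean is 1 + 2·P(Z=1 | X=1) = 2.6.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical model: Z confounds X and Y.
z = rng.binomial(1, 0.5, size=n)            # confounder
x = rng.binomial(1, 0.2 + 0.6 * z)          # treatment is more likely when Z = 1
y = 1.0 * x + 2.0 * z + rng.normal(size=n)  # true effect of X on Y is 1.0

# Rung 1 (association): average Y among units observed with X = 1.
conditional = y[x == 1].mean()

# Rung 2 (intervention): backdoor adjustment over Z,
# sum_z E[Y | X=1, Z=z] * P(Z=z).
adjusted = sum(
    y[(x == 1) & (z == v)].mean() * (z == v).mean() for v in (0, 1)
)

print(f"E[Y | X=1]     ~ {conditional:.2f}")
print(f"E[Y | do(X=1)] ~ {adjusted:.2f}")
```

A model that answers an interventional question with the first quantity overstates the effect here, because units with X = 1 are disproportionately those with Z = 1.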

4 The Hybrid Paradigm

Given these findings, the most promising research direction is not LLMs replacing statistical causal discovery but LLMs augmenting it as a prior-knowledge engine. Several architectures have emerged:

Figure 1: The hybrid LLM-statistical causal discovery pipeline. Diagram labels: LLM (domain prior); candidate edges; feedback; statistical algorithm (PC/FCI); orient ambiguous edges; pruned skeleton; revised DAG.

Prior-constrained discovery. The LLM generates a prior over edges (positive, negative, or absent), which is fed into a Bayesian causal discovery algorithm such as the structure-learning approach of Friedman and Koller [2003], paired with a Bayesian score such as BDeu. This narrows the search space and speeds convergence, at the cost of inheriting the LLM's biases.
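One simple way to picture an edge prior entering a Bayesian score is as an additive log-prior term. The sketch below assumes independent edge-inclusion probabilities elicited from an LLM; this is an illustrative simplification, not the prior used by Friedman and Koller [2003], and the probabilities and data score are invented.

```python
import math

def log_edge_prior(graph_edges, edge_probs):
    """log P(G) under independent edge-inclusion probabilities.

    edge_probs maps (cause, effect) to a probability strictly between 0 and 1.
    """
    total = 0.0
    for edge, p in edge_probs.items():
        total += math.log(p) if edge in graph_edges else math.log(1.0 - p)
    return total

def log_posterior_score(data_log_score, graph_edges, edge_probs):
    """Unnormalised log posterior: data score plus log prior."""
    return data_log_score + log_edge_prior(graph_edges, edge_probs)

# Hypothetical LLM-elicited prior over the two orientations of one edge.
edge_probs = {("smoking", "cancer"): 0.9, ("cancer", "smoking"): 0.05}

# Two Markov-equivalent graphs receive the same (made-up) data score.
data_log_score = -10.0
forward = log_posterior_score(data_log_score, {("smoking", "cancer")}, edge_probs)
reverse = log_posterior_score(data_log_score, {("cancer", "smoking")}, edge_probs)
print(f"forward={forward:.3f} reverse={reverse:.3f}")
```

Because the data score cannot separate the two graphs, the ranking is decided entirely by the prior, which is both the benefit and the risk described above.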

Orientation tie-breaking. Constraint-based algorithms (PC, FCI) often fail to orient all edges uniquely, leaving a Markov equivalence class. Within this class, edge orientations are statistically indistinguishable. LLMs can break ties using domain knowledge: "does rain cause wet ground, or wet ground cause rain?" is semantically trivial even if statistically unidentified.
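A minimal sketch of tie-breaking follows, assuming the statistical step has already produced a set of directed edges plus a list of undirected ones. The `oracle` callable is a hypothetical stand-in for an LLM query. For brevity the sketch only rejects orientations that would close a directed cycle; a fuller version would also reject orientations that create new unshielded colliders.

```python
from collections import defaultdict

def creates_cycle(directed, a, b):
    """True if adding a -> b would close a cycle, i.e. b already reaches a."""
    children = defaultdict(set)
    for u, v in directed:
        children[u].add(v)
    stack, seen = [b], set()
    while stack:
        node = stack.pop()
        if node == a:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(children[node])
    return False

def break_ties(directed, undirected, oracle):
    """Orient undirected edges using oracle(a, b) -> (a, b), (b, a) or None."""
    directed = set(directed)
    unresolved = []
    for a, b in undirected:
        choice = oracle(a, b)
        if choice in {(a, b), (b, a)} and not creates_cycle(directed, *choice):
            directed.add(choice)
        else:
            unresolved.append((a, b))
    return directed, unresolved

# Hypothetical oracle backed by a lookup table instead of a real model.
answers = {frozenset({"rain", "wet ground"}): ("rain", "wet ground")}

def toy_oracle(a, b):
    return answers.get(frozenset({a, b}))

oriented, unresolved = break_ties(set(), [("wet ground", "rain")], toy_oracle)
print(oriented, unresolved)
```

Edges the oracle declines to orient, or whose suggested orientation conflicts with the graph, stay undirected rather than being forced.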

LLM-guided iterative refinement. The CAUSALFUSION system iterates: LLM proposes a DAG; statistical tests identify edges inconsistent with data; LLM revises the DAG taking these constraints as input. Iteration continues until statistical fit is satisfactory and domain plausibility is preserved [Amazon Science, 2025].
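The propose, test, revise loop can be sketched generically. This is not the CAUSALFUSION implementation: the three callables `propose`, `find_violations` and `revise` are hypothetical placeholders for an LLM call, a set of statistical falsification tests, and a second LLM call, and the toy versions below only manipulate edge sets.

```python
def refine(propose, find_violations, revise, max_rounds=5):
    """Iterate until no violations remain or max_rounds is reached.

    Returns the final edge set and the number of revision rounds used.
    """
    dag = propose()
    for round_number in range(max_rounds):
        violations = find_violations(dag)
        if not violations:
            return dag, round_number
        dag = revise(dag, violations)
    return dag, max_rounds

# Toy stand-ins: the "data" rejects one spurious edge.
rejected_by_data = {("ice cream sales", "drowning")}

def propose():
    return {
        ("temperature", "ice cream sales"),
        ("temperature", "drowning"),
        ("ice cream sales", "drowning"),
    }

def find_violations(dag):
    return dag & rejected_by_data

def revise(dag, violations):
    return dag - violations

final_dag, rounds = refine(propose, find_violations, revise)
print(sorted(final_dag), rounds)
```

The `max_rounds` cap is a design choice of this sketch: without it, a model that keeps re-proposing a rejected edge would loop indefinitely.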

5 Limitations and Open Problems

Several fundamental limitations constrain the LLM-based causal discovery programme:

Provenance and hallucination. LLMs do not provide calibrated uncertainty over their causal assertions, nor can they reliably cite the evidence underlying them. A causal claim that GPT-4 states with high confidence may reflect one paper in the training data, a popular-science summary, or pure confabulation.

Novel domains. In scientific areas underrepresented in training data—emerging markets, novel biological pathways, rare diseases—LLMs have little to retrieve and fall back on surface-level pattern matching.

Causal hierarchy conflation. The inability to distinguish association from intervention from counterfactual [Pearl, 2009] is not easily fixed by prompting. It may require architectures that explicitly track the "rung" being queried.

Reproducibility. LLM outputs are stochastic and model-dependent. The same causal query can elicit different DAGs from different model versions. Causal discovery workflows built on LLMs face a reproducibility problem that does not afflict algorithmic methods.
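One way to quantify this variation is to repeat the same query several times and record how often each edge appears. The sketch below assumes each run returns a list of directed edges; the function name `edge_stability`, the 0.8 threshold and the five toy runs are illustrative assumptions.

```python
from collections import Counter

def edge_stability(runs, threshold=0.8):
    """Frequency of each edge across runs, plus edges at or above threshold."""
    counts = Counter(edge for run in runs for edge in set(run))
    frequency = {edge: count / len(runs) for edge, count in counts.items()}
    stable = {edge for edge, f in frequency.items() if f >= threshold}
    return frequency, stable

# Hypothetical outputs from five repeated queries.
runs = [
    [("smoking", "cancer"), ("stress", "smoking")],
    [("smoking", "cancer")],
    [("smoking", "cancer"), ("stress", "smoking")],
    [("smoking", "cancer"), ("cancer", "smoking")],
    [("smoking", "cancer"), ("stress", "smoking")],
]

frequency, stable = edge_stability(runs)
for edge, f in sorted(frequency.items()):
    print(edge, f)
print("stable:", stable)
```

Reporting edge frequencies alongside the model name and version does not remove the stochasticity, but it makes the instability visible to readers of the resulting analysis.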

6 Implications for Empirical Economists

For practitioners in causal inference, the practical takeaways are nuanced. LLMs are probably not ready to replace the researcher's judgment in constructing a DAG to guide identification strategy—the hallucination and causal-hierarchy conflation problems are too serious for high-stakes empirical decisions. However, LLMs can be useful as:

A brainstorming tool for identifying plausible confounders and mediators to include or exclude

A rapid literature retrieval engine to surface what prior work has said about causal relationships between variables of interest

A way to elicit alternative model structures for sensitivity analysis: "if we add an edge from X to Z, does the identification strategy break?"

The deeper lesson is that LLMs encode a compression of the observational scientific literature, with all its correlational biases. They are powerful pattern-matchers that struggle with interventional and counterfactual reasoning. Building tools that combine LLM knowledge with rigorous statistical causal discovery is a promising research frontier, but the tools are not yet production-ready for high-stakes causal inference.

7 Conclusion

LLMs represent a genuinely new kind of resource for causal discovery: vast, fast, and semantically rich, but prone to hallucination and unreliable at the interventional reasoning that distinguishes causal inference from correlation analysis. The evidence from 2023-2025 evaluations consistently shows that LLMs retrieve causal associations from text rather than reason from data. That capability is useful when the true DAG is well documented in the literature and dangerous when it is not. The most productive near-term research direction is hybrid architectures that use LLMs for what they are good at (prior knowledge and tie-breaking) while retaining statistical algorithms for what they are good at (data-driven independence testing). For causal econometricians, the field offers a fascinating methodological frontier but demands appropriate epistemic humility about what LLMs can actually do.

References

Amazon Science (2025). Causal Fusion: Integrating LLMs and graph falsification for causal discovery. Amazon Science Publications.
Ban, T., Chen, L., Wang, X., and Chen, H. (2023). From query tools to causal architects: Harnessing large language models for causal discovery. arXiv:2306.16902.
Friedman, N. and Koller, D. (2003). Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50:95-125.
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., and Fung, P. (2024). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1-38.
Jiralerspong, T., Liu, J., Plummer, B., Zhang, Y., Gidel, G., and Bhatt, S. (2024). Efficient causal graph discovery using large language models. arXiv:2402.01207.
Kıcıman, E., Ness, R., Sharma, A., and Tan, C. (2023). Causal reasoning and large language models: Opening a new frontier for causality. arXiv:2305.00050.
Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge University Press, Cambridge.
Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, Cambridge, MA.
Spirtes, P., Glymour, C., and Scheines, R. (2000). Causation, Prediction, and Search. 2nd ed. MIT Press, Cambridge, MA.
Zečević, M., Willig, M., Dhami, D. S., and Kersting, K. (2023). Causal parrots: Large language models may talk causality but are not causal. arXiv:2308.13067.

LLMs and Causal Discovery: Can Large Language Models Identify Causal Structure?
