--- title: "Causal Discovery" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Causal Discovery} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(causalDisco) ``` This vignette provides a very brief introduction to causal discovery using simulated data. For a thorough introduction to causal discovery concepts, we recommend Glymour et al. (2019) [Review of Causal Discovery Methods Based on Graphical Models](https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2019.00524/full) or Zanga et al. (2022) [A Survey on Causal Discovery: Theory and Practice](https://arxiv.org/abs/2305.10032). The goal of causal discovery is to infer the causal relationships among a set of observed variables using observational data. # Example of causal discovery Suppose we have this DAG: ```{r dag} cg <- caugi::caugi( Z %-->% X1, X3 %-->% X2, X1 %-->% Y, X2 %-->% Y ) ``` We define a layout which we will use for plotting the graphs in this vignette: ```{r plot dag} layout <- caugi::caugi_layout_sugiyama(cg) plot(cg, layout = layout, main = "True DAG") ``` We can create data from a linear Gaussian model corresponding to the above DAG using `generate_dag_data()`. We generate 10,000 samples with a fixed random seed from the DAG above, where we let the edge coefficients be sampled with absolute values between 0.1 and 0.9 and assigned random signs, and where the standard deviation of the additive Gaussian noise at each node is sampled from a log-uniform distribution between 0.3 and 2. 
```{r simple causal discovery}
data_linear <- generate_dag_data(
  cg,
  n = 10000,
  seed = 1405,
  coef_range = c(0.1, 0.9),
  error_sd = c(0.3, 2)
)
head(data_linear)
```

The R code used to generate the data is stored as an attribute of the data frame:

```{r generating model}
attr(data_linear, "generating_model")
```

The goal is now to recover the causal structure from the data alone. We will use the PC algorithm for this task. One of its key assumptions is causal sufficiency, meaning that there are no unobserved confounders.

We can use the PC algorithm from any of the "tetrad", "pcalg", or "bnlearn" engines to learn the structure from the data. The motivation for having multiple engines is that they may implement different algorithms, tests, or scoring methods, or the same algorithm may be implemented differently across engines.

Below, we set up the PC algorithm using Fisher's Z test, a significance level of 0.05, and `pcalg` as the engine. To do so, we first define the PC method using the `pc()` function, and then pass it to the `disco()` function along with the data.

```{r pc algorithm simple}
pc_pcalg <- pc(engine = "pcalg", test = "fisher_z", alpha = 0.05)
pc_result_pcalg <- disco(data = data_linear, method = pc_pcalg)
```

We can visualize the results using the `plot()` function, where we explicitly provide the layout defined above so graphs are easier to compare.

```{r plot pc results simple}
plot(pc_result_pcalg, layout = layout, main = "PC (pcalg)")
```

The PC algorithm recovers the correct causal structure up to Markov equivalence, represented as a CPDAG. A CPDAG represents the equivalence class of DAGs that encode the same conditional independencies.

The first notable feature of this plot is that some edges are directed, while others are undirected. For example, the edge from `X1` to `Y` is directed, indicating a causal effect of `X1` on `Y`, but not in the reverse direction.
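To give a rough sense of what a Fisher's Z test does, here is a minimal base-R sketch. The function `fisher_z_test()` and the simulated chain are illustrative only; each engine uses its own implementation. The idea is to regress out the conditioning set, correlate the residuals, and compare the z-transformed correlation to a standard normal.

```{r fisher z sketch}
# Illustrative sketch of a Fisher's Z (partial) correlation test.
fisher_z_test <- function(x, y, z = NULL) {
  if (is.null(z)) {
    r <- cor(x, y)
    k <- 0
  } else {
    z <- as.matrix(z)
    # Partial correlation via residuals after regressing out z.
    r <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))
    k <- ncol(z)
  }
  stat <- sqrt(length(x) - k - 3) * atanh(r)  # Fisher's z transform
  2 * pnorm(-abs(stat))                       # two-sided p-value
}

# In a chain z -> x -> y, z and y are marginally dependent but
# conditionally independent given x.
set.seed(1)
z <- rnorm(5000)
x <- 0.8 * z + rnorm(5000)
y <- 0.8 * x + rnorm(5000)
fisher_z_test(z, y)     # tiny p-value: marginally dependent
fisher_z_test(z, y, x)  # no evidence against independence given x
```

The PC algorithm runs many such tests with growing conditioning sets, and uses the pattern of accepted independencies to prune and orient edges.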
In contrast, the edge between `X2` and `X3` is undirected, indicating that the data alone do not provide sufficient information to determine the causal direction. Both orientations `X2 %-->% X3` and `X3 %-->% X2` are compatible with the observed conditional independencies.

We demonstrate the non-identifiability of the causal direction between `X2` and `X3` by reversing the direction of this edge in the data-generating process above and applying the PC algorithm to the resulting data set.

```{r pc algorithm reversed}
cg_reverse <- caugi::caugi(
  Z %-->% X1,
  X2 %-->% X3,
  X1 %-->% Y,
  X2 %-->% Y
)

data_linear_reverse <- generate_dag_data(
  cg_reverse,
  n = 10000,
  seed = 1405,
  coef_range = c(0.1, 0.9),
  error_sd = c(0.3, 2)
)

pc_result_reverse <- disco(data = data_linear_reverse, method = pc_pcalg)
plot(pc_result_reverse, layout = layout, main = "PC (pcalg) reversed")
```

We learn the same causal structure as before, demonstrating that the direction of influence between `X2` and `X3` cannot be determined from the data alone.

# Unobserved confounding

In practice, unobserved confounding may be present, violating one of the assumptions of the PC algorithm.
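A small base-R simulation (with illustrative variable names, not package code) shows the core problem: if an unobserved variable `u` drives two observed variables, then dropping `u` from the data leaves those variables marginally dependent, even though neither causes the other.

```{r confounding sketch}
# u confounds x1 and x2; neither causes the other.
set.seed(1)
u  <- rnorm(5000)
x1 <- 0.8 * u + rnorm(5000)
x2 <- 0.8 * u + rnorm(5000)

# Spurious marginal dependence between x1 and x2:
cor(x1, x2)
# The dependence vanishes once we adjust for the confounder u:
cor(resid(lm(x1 ~ u)), resid(lm(x2 ~ u)))
```

An algorithm that assumes causal sufficiency, such as PC, cannot appeal to an unobserved `u`, so it has to explain the dependence with an edge among the observed variables.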
Suppose we have the following DAG with an unobserved confounder `U` between `X1` and `X2`:

```{r dag unobserved confounder}
cg_unobserved <- caugi::caugi(
  Z %-->% X1,
  X3 %-->% X2,
  X1 %-->% Y,
  X2 %-->% Y,
  U %-->% X1 + X2
)
```

We can visualize this DAG, marking the unobserved variable `U` in red and using dashed edges to indicate that it is unobserved:

```{r plot dag unobserved confounder}
plot(
  cg_unobserved,
  edge_style = list(
    by_edge = list(
      U = list(col = "red", fill = "red", lty = "dashed")
    )
  ),
  node_style = list(
    by_node = list(
      U = list(col = "red", fill = "red")
    )
  ),
  main = "True DAG"
)
```

We can generate data from this DAG using `generate_dag_data()` and then remove the unobserved variable `U` from the data frame:

```{r data unobserved confounder}
data_unobserved <- generate_dag_data(
  cg_unobserved,
  n = 10000,
  seed = 1405,
  coef_range = c(0.1, 0.9),
  error_sd = c(0.3, 2)
)
data_unobserved <- data_unobserved[, names(data_unobserved) != "U"]
head(data_unobserved)
```

We can then apply the PC algorithm as before:

```{r pc algorithm unobserved confounder}
pc_pcalg_unobserved <- pc(engine = "pcalg", test = "fisher_z", alpha = 0.05)
pc_result_unobserved <- disco(data = data_unobserved, method = pc_pcalg_unobserved)
plot(pc_result_unobserved, layout = layout, main = "PC (pcalg)")
```

We see that the PC algorithm does not recover the correct causal structure due to the presence of the unobserved confounder. In particular, it finds a spurious edge between `X1` and `X2`.

# Next steps

For more information about how to incorporate knowledge into causal discovery methods, see the [knowledge vignette](knowledge.html). For more information about how to visualize causal discovery results, see the [visualization vignette](visualization.html).