Commit

Added figures
vtraag committed Dec 18, 2024
1 parent 8155898 commit 645388d
Showing 5 changed files with 9 additions and 21 deletions.
(The four figure files added by this commit are binary images that cannot be rendered in the diff view.)
30 changes: 9 additions & 21 deletions sections/0_causality/open_data_cost_savings.qmd
@@ -27,25 +27,19 @@ Despite the theoretical benefits of open data, several limitations hinder a comp

## Directed Acyclic Graph (DAG)

As discussed in the general introduction on causal inference, we use DAGs to represent structural causal models. In the following, a DAG ([@fig-model]) is employed to examine the causal relationship between *Open Data* and *Cost Savings*. The visual illustrates multiple potential pathways, including a direct path from Open Data to Cost Savings, an indirect one involving Time Savings (i.e., a mediator), and additional paths that incorporate factors affecting either Open Data or Time Savings (i.e., confounders). These additional factors, such as technological infrastructure, data quality and availability, standardisation, user skills, innovation, and collaboration, introduce layers of complexity to the model. As we will show in the subsequent sections, they are essential for discussing the causal and non-causal, open and closed, relationships among all these variables.

![Hypothetical structural causal model on Open Data](figures/DAG_open_data_cost_savings-0.png){#fig-model}

## The effect of Open Data on Cost Saving

In this section, we apply the concepts presented in the section Causality in Science Studies to potential research questions. We present a specific perspective on causal inference through the lens of structural causal models (Pearl 2009).

Suppose we are interested in assessing the *total causal effect* of *Open Data* on *Cost Saving*. According to our model ([@fig-model]), there are multiple pathways from Open Data to Cost Saving; some are causal, some are not. To estimate the causal effect of interest, we need to make sure that all causal paths are open and all non-causal paths are closed. Within the DAG representation, two causal pathways can be identified: a *direct* pathway, *Open Data → Cost Saving*, representing the direct effect of Open Data on Cost Saving, and an *indirect* pathway, *Open Data → Time Saving → Cost Saving*, where the effect is mediated by Time Saving. The *direct* effect captures the immediate benefits of providing free access to datasets, while the *indirect* effect, mediated by Time Saving, strengthens the relationship by triggering additional efficiencies that also lead to Cost Saving.
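To make the path logic concrete, the DAG can be encoded as an edge list and its directed paths enumerated. The sketch below is a minimal plain-Python illustration; the edge list is our reading of the model described above, not code from this project.

```python
# Minimal sketch of the hypothesised model; the edge list is an assumption
# based on the pathways described in the text, not code from this project.
EDGES = {
    ("Open Data", "Cost Saving"),      # direct causal path
    ("Open Data", "Time Saving"),      # indirect path, via the mediator
    ("Time Saving", "Cost Saving"),
    ("Collaboration", "Open Data"),    # Collaboration confounds the pair
    ("Collaboration", "Cost Saving"),
    ("Open Data", "Innovation"),       # Innovation is a collider
    ("Collaboration", "Innovation"),
}

def directed_paths(src, dst, seen=()):
    """Enumerate all directed paths from src to dst by depth-first search."""
    if src == dst:
        return [[dst]]
    paths = []
    for a, b in EDGES:
        if a == src and b not in seen:
            paths += [[src] + tail for tail in directed_paths(b, dst, seen + (src,))]
    return paths

# Exactly the two causal paths discussed above: the direct one, and the
# one mediated by Time Saving.
print(directed_paths("Open Data", "Cost Saving"))
```

Directed paths capture only the causal routes; non-causal (backdoor and collider) paths traverse some edges against their direction and would require walking the undirected skeleton instead.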

To properly estimate the *total* causal effect of Open Data on Cost Saving, an empirical model should not control for Time Saving. Conversely, if the model conditions on Time Saving, even implicitly (e.g., by accounting for approaches and tools that optimise and speed up data access and processing), it closes the indirect causal path and introduces bias into the estimation of the total effect ([@fig-mediator]).

![DAG illustrating the misleading effect of conditioning on the mediator variable, Time Saving. Nodes that are controlled for have a thick outline. Grey nodes represent variables not considered in this figure. Green nodes are open, indicating they allow associations or relationships to flow through them along the paths they connect, while orange nodes are closed, blocking associations or relationships from flowing through the paths they connect. Black arrows represent potential causal influence, whereas grey arrows indicate indirect association that may involve non-causal relationship.](figures/DAG_open_data_cost_savings-1.png){#fig-mediator}
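This mediator bias is easy to reproduce in a toy linear simulation. All coefficients below (0.5 for the direct path, 0.8 and 0.6 through Time Saving) are illustrative assumptions, not estimates from any study:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy linear SCM; all coefficients are illustrative assumptions.
open_data = rng.normal(size=n)
time_sav  = 0.8 * open_data + rng.normal(size=n)                  # mediator
cost_sav  = 0.5 * open_data + 0.6 * time_sav + rng.normal(size=n)
# True total effect = 0.5 + 0.8 * 0.6 = 0.98

def coef(y, *xs):
    """OLS coefficient on the first regressor in xs."""
    X = np.column_stack([np.ones(len(y)), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

total   = coef(cost_sav, open_data)            # ≈ 0.98, the total effect
blocked = coef(cost_sav, open_data, time_sav)  # ≈ 0.50, direct effect only
print(round(total, 2), round(blocked, 2))
```

Note that controlling for the mediator does not produce nonsense; it silently changes the estimand from the total effect to the direct effect, which is misleading when the total effect is what the research question asks for.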

As mentioned before, the proposed model accounts for additional variables such as *Data Availability, Quality and Standardisation*, *Users' Skills*, *Collaboration*, and *Technological Infrastructure*. These variables can act as *confounders* along the different pathways illustrated in [@fig-model]. Examples of non-causal paths represented in the DAG are:

@@ -59,21 +53,15 @@ There might be instances where a confounder is not observed because it is not included

Another case that makes identification of the causal effect impossible is erroneously controlling for a collider (see the introduction for further details). In the pathway *Open Data* → *Innovation* ← *Collaboration* → *Cost Saving*, the variable *Innovation* acts as a collider. Hence, this path is already closed, and the bias arises when controlling for Innovation.

Empirically speaking, in a model that conditions on Innovation (but not on Collaboration) to estimate the causal effect of Open Data on Cost Savings, the estimate is likely biased downward, since both Collaboration and Open Data have a positive impact on Innovation. This conditioning opens up the non-causal pathway Open Data → Innovation ← Collaboration → Cost Saving, which connects *Open Data* and *Cost Saving* through *Collaboration*, creating a spurious association and distorting the true effect of *Open Data* on *Cost Saving*. This is an example of *bad controls* (Angrist and Pischke 2009), a concept explained in the general introduction. Only by ignoring the collider, that is, by not conditioning on it in empirical models, can we effectively isolate the causal effect. This non-causal path is open because *Innovation* is open (it is a collider that is conditioned on) and because *Collaboration* is open (it is a confounder that is not conditioned on) (see @fig-collider).

![DAG illustrating the misleading effect of conditioning on a collider variable, *Innovation,* and not conditioning on a confounder, *Collaboration*. Nodes that are controlled for have a thick outline. Grey nodes represent variables not considered in this figure. Green nodes are open, indicating they allow associations or relationships to flow through them along the paths they connect, while orange nodes are closed, blocking associations or relationships from flowing through the paths they connect. Black arrows represent potential causal influence, whereas grey arrows indicate indirect association that may involve non-causal relationship.](figures/DAG_open_data_cost_savings-2.png){#fig-collider}
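The same simulation style reproduces pure collider bias. To isolate it, this sketch draws Collaboration independently of Open Data (i.e., it omits the confounding arrow for the moment); the coefficients are again illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Toy linear SCM; coefficients are illustrative. To isolate pure collider
# bias, Collaboration is drawn independently of Open Data in this sketch.
collab    = rng.normal(size=n)
open_data = rng.normal(size=n)
innov     = 0.6 * open_data + 0.6 * collab + rng.normal(size=n)  # collider
cost_sav  = 1.0 * open_data + 0.5 * collab + rng.normal(size=n)

def coef(y, *xs):
    """OLS coefficient on the first regressor in xs."""
    X = np.column_stack([np.ones(len(y)), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

good = coef(cost_sav, open_data)         # ≈ 1.0: collider left alone
bad  = coef(cost_sav, open_data, innov)  # below 1.0: conditioning opens the path
print(round(good, 2), round(bad, 2))
```

Without conditioning, the collider path is blocked and the simple regression recovers the true coefficient; adding the collider as a control induces a spurious negative association between Open Data and Collaboration, dragging the estimate downward.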

In addition, *Collaboration* acts as a confounder on the non-causal path *Open Data* ← *Collaboration* → *Cost Saving*. To identify the causal effect, we hence need to close this non-causal path by conditioning on *Collaboration*. After controlling for Collaboration, whether *Innovation* is conditioned on is irrelevant for the identification of the causal effect. When all non-causal paths are closed, the research design is said to meet the *backdoor* criterion, a formal requirement ensuring that the design blocks all non-causal paths between the treatment (*Open Data*) and the outcome (*Cost Saving*), enabling us to identify the causal effect in question (Cunningham, 2021) ([@fig-adjustment]).

![DAG illustrating the total effect *Open Data* on *Cost Saving*, conditioning on confounders and not on mediator and collider. Nodes that are controlled for have a thick outline. Green nodes are open, indicating they allow associations or relationships to flow through them along the paths they connect, while orange nodes are closed, blocking associations or relationships from flowing through the paths they connect. Black arrows represent potential causal influence, whereas grey dashed arrows indicate indirect association that may involve non-causal relationship.](figures/DAG_open_data_cost_savings-3.png){#fig-adjustment}
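Closing the backdoor can be sketched the same way: with the confounding arrow Collaboration → Open Data in place, the naive regression overstates the effect, and adjusting for Collaboration recovers it (coefficients are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Toy linear SCM with the confounding arrow Collaboration -> Open Data.
collab    = rng.normal(size=n)
open_data = 0.7 * collab + rng.normal(size=n)
cost_sav  = 1.0 * open_data + 0.5 * collab + rng.normal(size=n)

def coef(y, *xs):
    """OLS coefficient on the first regressor in xs."""
    X = np.column_stack([np.ones(len(y)), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

naive    = coef(cost_sav, open_data)          # backdoor open: biased upward
adjusted = coef(cost_sav, open_data, collab)  # backdoor closed: ≈ 1.0
print(round(naive, 2), round(adjusted, 2))
```

In this toy model conditioning on the single confounder satisfies the backdoor criterion; in the full DAG the adjustment set would include all the confounders listed in the text.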

This example highlights key components of causal inference: controlling for confounders (*Data Availability, Quality and Standardisation*, *User Skills*, *Collaboration*, and *Technological Infrastructure*), not controlling for mediators (*Time Saving*), and not controlling for colliders (*Innovation*), as shown in @fig-adjustment. Constructing an appropriate DAG is important when aiming to draw causal conclusions. Without making assumptions explicit via a DAG, it would be unclear which variables should be controlled for and which should not. Omitting important variables weakens the study's ability to draw accurate conclusions about cause and effect. Moreover, adding complexity to a DAG does not always change the variables that need to be controlled for when identifying the causal effect. In some cases, such as when adding confounders between unrelated variables, the identification of the relationship between Open Data and Cost Savings remains unaffected. However, if a confounder is introduced between Open Data and Cost Savings directly, it becomes necessary to control for it.

Klebel and Traag (2023) emphasise the importance of carefully selecting variables when analysing causal relationships. They warn against two common errors: relying on the data (e.g., through stepwise regression) to decide which variables to control for, or including all available variables, which they refer to as "causal salad" (McElreath, 2020). Both approaches can lead to incorrect conclusions. Specifically, including mediating variables or focusing only on certain cases could obscure the true effect of open data on outcomes like cost savings. In this regard, Pearl proposes using the *do-operator^[\[1\]](#footnote-2){#footnote-ref-2}^* to define causal effects, where an intervention on one variable allows us to observe changes in another, thereby illuminating causal connections within a system (Pearl, 2009).
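The do-operator can be illustrated by simulating interventions directly: force Open Data to a value, severing its incoming arrows, and compare outcome means. The sketch below reuses the illustrative toy coefficients from above and is only a sketch of the idea, not an estimate of any real effect:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

def simulate(do_open_data=None):
    """Draw from a toy SCM; if do_open_data is set, apply do(Open Data = x)."""
    collab = rng.normal(size=n)
    if do_open_data is None:
        open_data = 0.7 * collab + rng.normal(size=n)  # observational regime
    else:
        # Intervention: Open Data is set externally, ignoring its usual causes.
        open_data = np.full(n, do_open_data)
    time_sav = 0.8 * open_data + rng.normal(size=n)
    cost_sav = (0.5 * open_data + 0.6 * time_sav
                + 0.5 * collab + rng.normal(size=n))
    return cost_sav

# E[Cost Saving | do(Open Data = 1)] - E[Cost Saving | do(Open Data = 0)]
effect = simulate(1.0).mean() - simulate(0.0).mean()
print(round(effect, 2))  # ≈ 0.98 = 0.5 + 0.8 * 0.6, the total causal effect
```

Because the intervention cuts the arrow from Collaboration into Open Data, the contrast of intervened means recovers the total causal effect without any covariate adjustment, which is exactly what the backdoor adjustment above emulates from observational data.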
