Leveraging Structured Biological Knowledge for Counterfactual Inference: A Case Study of Viral Pathogenesis

Jeremy Zucker; Kaushal Paneri; Sara Mohammad-Taheri; Somya Bhargava; Pallavi Kolambkar; Craig Bakker; Jeremy Teuton; Charles Tapley Hoyt; Kristie Oxford; Robert Ness; Olga Vitek

doi:10.1109/TBDATA.2021.3050680

2021 Jan 18;7(1):25–37.


PMCID: PMC8769018 PMID: 37981991

Abstract

Counterfactual inference is a useful tool for comparing outcomes of interventions on complex systems. It requires us to represent the system in the form of a structural causal model, complete with a causal diagram, probabilistic assumptions on exogenous variables, and functional assignments. Specifying such models can be extremely difficult in practice. The process requires substantial domain expertise, and does not scale easily to large systems, multiple systems, or novel system modifications. At the same time, many application domains, such as molecular biology, are rich in structured causal knowledge that is qualitative in nature. This article proposes a general approach for querying a causal biological knowledge graph, and converting the qualitative result into a quantitative structural causal model that can learn from data to answer the question. We demonstrate the feasibility, accuracy, and versatility of this approach using two case studies in systems biology. The first demonstrates the appropriateness of the underlying assumptions and the accuracy of the results. The second demonstrates the versatility of the approach by querying a knowledge base for the molecular determinants of a severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)-induced cytokine storm, and performing counterfactual inference to estimate the causal effect of medical countermeasures for severely ill patients.

Keywords: Biological expression language, structural causal model, counterfactual inference, causal biological knowledge graph, systems biology, SARS-CoV-2

1. Introduction

Each time a cell senses changes in its environment, it marshals a complex choreography of molecular interactions to initiate an appropriate response. When a virus infects the cell, this delicate balance is disrupted and can result in a cascade of systemic failures leading to disease. In particular, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the novel pathogen responsible for the COVID-19 pandemic, has a complex etiology that differs in subtle and substantial ways from previously studied viruses. To make informed decisions about the risk that a new pathogen presents, it is imperative to rapidly predict the determinants of pathogenesis and identify potential targets for medical countermeasures. Current solutions for this task include systems biology data-driven models, which correlate biomolecular expression to pathogenicity, but cannot go beyond associations in the data to reason about causes of the disease [1], [2]. Alternatively, hypothesis-driven mathematical models capture causal relations, but are hampered by limited parameter identifiability and predictive power [3], [4].

We argue that counterfactual inference [5] helps bridge the gap between data-driven and hypothesis-driven approaches. It enables questions of the form: “Had we known the eventual outcome of a patient, what would we have done differently?” At the heart of counterfactual inference is a formalism known as a structural causal model (SCM) [5], [6]. It represents prior domain knowledge in terms of causal diagrams, assumes a probability distribution on exogenous variables, and assigns a deterministic function to endogenous variables. SCMs are particularly attractive in systems biology, where structured domain knowledge is extracted from the biomedical literature and is readily available through advances in natural language processing [7], [8], [9], large-scale automated assembly systems [10], and semi-automated curation workflows [11]. This knowledge is curated by multiple organizations [12], [13], [14], [15], [16] and stored in structured knowledge bases [17], [18], [19], [20]. It can be brought to bear for answering causal questions regarding SARS-CoV-2.

This manuscript contributes a three-part algorithm that leverages existing structured biological knowledge to answer counterfactual questions about viral pathogenesis. Algorithm 1 formalizes biologically relevant questions as queries to an existing causal knowledge graph. Algorithm 2 converts the query result into a structural causal model. Algorithm 3 operationalizes the counterfactual inference by interrogating the model with the observed data to estimate a causal effect.

We illustrate the benefits of this approach using two case studies. Case study 1 illustrates the increased precision of counterfactual estimates, as compared to the ODE- and SDE-based forward simulation, in a situation with known ground truth mechanisms of data generation. Case study 2 demonstrates the automated construction of an SCM and the value of counterfactual reasoning in novel situations with limited treatment options (as is the case for SARS-CoV-2). It shows that counterfactual inference enables more precise predictions regarding who would be likely to survive without receiving treatment, who would be likely to die even if they did receive treatment, and who would likely survive only if they received treatment.

2. Background

Biological Signaling Pathways. Signaling pathways are composed of entities that engage in activities [21]. Examples of entities are proteins and metabolites, but also higher level biological processes such as an immune response. Activities are the producers of change. Examples include catalytic activity, kinase activity, or transcriptional activity.

The basic unit of causality in signaling pathways is a directed molecular interaction, where the activity of an upstream molecule increases or decreases the activity of a downstream molecule. For example, the mitogen-activated protein kinase (MAPK) intracellular signaling pathway is a causal chain of directed molecular interactions shown in eq. (1)

$$ a(S_{1}) \to kin(p(Raf)) \to kin(p(Mek)) \to kin(p(Erk)). \tag{1} $$

The interactions transmit information about a stimulus at the cell surface to the nucleus, where proteins called transcription factors activate an appropriate biological process [22]. A causal diagram of MAPK consists of a signaling molecule $S_{1}$ and three proteins $Raf$, $Mek$, and $Erk$, each of which engages in kinase activity. We represent signaling molecule abundance with $a()$, protein abundance with $p()$, and the kinase activity of a protein with $kin()$. In the case of MAPK, the abundance or activity of an upstream entity causes the abundance or activity of a downstream entity to increase, which is represented with a sharp edge $\to$. The diagram is an abstraction showing that the abundance of the signaling molecule $S_{1}$ increases the kinase activity of $Raf$, which increases the kinase activity of $Mek$, which increases the kinase activity of $Erk$. In other cases, if the abundance or activity of an upstream entity causes the abundance or activity of a downstream entity to decrease, we represent this with a blunt edge.

Viral Dysregulation. Viral disruptions of a signaling pathway take the form of overactivation or repression of its activities. For example, by amplifying the release of intercellular signaling molecules that overstimulate the immune response, known as Cytokine Release Syndrome (cytokine storm), a virus can cause severe system-level cellular damage.

Quantitative Modeling of Biological Processes With ODE/SDE. Temporal dynamics of biological processes can be expressed quantitatively using ordinary (or stochastic) differential equations. A small number of high quality, validated models have been published in the literature and stored in a computable form in repositories such as BioModels [23], [24]. For example, the MAPK signaling pathway in eq. (1) is well characterized. We denote $R(t)$, $M(t)$, and $E(t)$ as the respective amounts of active $Raf$, $Mek$, and $Erk$ at time $t$; we denote $T_{R}$, $T_{M}$, and $T_{E}$ as their total amounts, which we assume do not change during the considered timeframe; $v_{R}^{act}$, $v_{R}^{inh}$, $v_{M}^{act}$, $v_{M}^{inh}$, $v_{E}^{act}$, and $v_{E}^{inh}$ are experimentally derived activation or inhibition kinetic rate constants; and $S_{1}$ is the amount of the input signal. The system of ordinary differential equations (ODEs) is specified as follows [25], [26]:

$$
\begin{aligned}
\frac{dR}{dt} &= v_{R}^{act} S_{1} (T_{R} - R(t)) - v_{R}^{inh} R(t) \\
\frac{dM}{dt} &= \frac{(v_{M}^{act})^{2}}{v_{M}^{inh}} R(t)^{2} (T_{M} - M(t)) - v_{M}^{act} R(t) M(t) - v_{M}^{inh} M(t) \\
\frac{dE}{dt} &= \frac{(v_{E}^{act})^{2}}{v_{E}^{inh}} M(t)^{2} (T_{E} - E(t)) - v_{E}^{act} M(t) E(t) - v_{E}^{inh} E(t).
\end{aligned} \tag{2}
$$

Given initial conditions, forward simulations from the ODEs can be used to generate the temporal trajectories of the amounts of activated proteins, such as $R(t)$, $M(t)$, and $E(t)$ in the MAPK example. In this manuscript we refer to such simulated data as observational data. We define an ideal intervention as an event that fixes the amount of an activated protein. For example, if we fix the kinase activity of $Mek$ at $M(t) = m$, the right-hand side of the second equation in eq. (2) is replaced by zero, i.e., $\frac{dM}{dt} = 0$. We can simulate data from eq. (2) with $\frac{dM}{dt} = 0$, and refer to these as interventional data. Contrasting observational and interventional data helps evaluate the outcome of the intervention [27].
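As a rough illustration of forward simulation, the sketch below integrates eq. (2) with a fixed-step forward Euler scheme and contrasts an observational run with an ideal intervention on $Mek$. The rate constants, total amounts, signal level, step size, and the function name `simulate_mapk` are illustrative assumptions, not values taken from the cited models.

```python
# Hypothetical constants, chosen only so that the example is numerically stable.
RATES = (0.01, 0.1, 0.01, 0.1, 0.01, 0.1)   # vR_act, vR_inh, vM_act, vM_inh, vE_act, vE_inh
TOTALS = (100.0, 100.0, 100.0)              # T_R, T_M, T_E


def simulate_mapk(s1, rates, totals, n_steps=10000, dt=0.01, fix_mek=None):
    """Forward-Euler integration of the MAPK ODEs in eq. (2).

    If fix_mek is given, Mek is held at that value for the whole run,
    which corresponds to an ideal intervention with dM/dt = 0.
    Returns the final amounts (R, M, E) of active Raf, Mek, and Erk.
    """
    vr_act, vr_inh, vm_act, vm_inh, ve_act, ve_inh = rates
    t_r, t_m, t_e = totals
    r, e = 0.0, 0.0
    m = 0.0 if fix_mek is None else fix_mek
    for _ in range(n_steps):
        dr = vr_act * s1 * (t_r - r) - vr_inh * r
        dm = (vm_act ** 2 / vm_inh) * r ** 2 * (t_m - m) - vm_act * r * m - vm_inh * m
        de = (ve_act ** 2 / ve_inh) * m ** 2 * (t_e - e) - ve_act * m * e - ve_inh * e
        if fix_mek is not None:
            dm = 0.0  # the intervention removes Mek's dependence on Raf
        r, m, e = r + dt * dr, m + dt * dm, e + dt * de
    return r, m, e


observational = simulate_mapk(10.0, RATES, TOTALS)
interventional = simulate_mapk(10.0, RATES, TOTALS, fix_mek=20.0)
```

Comparing the last component of `observational` and `interventional` gives the simulated effect on active $Erk$ of holding $Mek$ fixed. A production model would normally use an adaptive ODE solver rather than fixed-step Euler.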

Deterministic ODEs ignore the fact that at low concentrations, stochasticity becomes a significant factor in determining the course of a reaction [28]. Because collisions between the molecules participating in a biochemical process are random events, a stochastic model is required. In contrast to an ODE, a stochastic differential equation (SDE) specifies a biological process as a random process. For example, in the case of MAPK, the random process of the reaction $Mek \to Erk$ is specified with

$$ \frac{d P_{E}(t)}{dt} = g_{E}(t, v_{E}^{act}, v_{E}^{inh}, M(t)), \quad E(0) = e_{0}, \tag{3} $$

where $P_{E}(t)$ is the marginal probability density of $E(t)$, the function $g_{E}$ determines the probability of a state change between $E(t)$ and $E(s)$, $s > t$, $e_{0}$ is the initial condition, and $M(t)$ is the value of its parent $Mek$ at time $t$. Once the SDEs are fully specified, one can use, e.g., Gillespie's stochastic simulation algorithm [29] to simulate observational and interventional data, and evaluate the outcomes of interventions.
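To make the stochastic simulation step concrete, the following sketch applies Gillespie's direct method to a single two-state species: $Raf$ switching between inactive and active forms. The two propensities are read off the first line of eq. (2), which is an assumption made for illustration; the paper's full stochastic model is not reproduced here, and the parameter values and the name `gillespie_raf` are hypothetical.

```python
import random


def gillespie_raf(s1, v_act, v_inh, total, t_end, seed=0):
    """Gillespie direct method for Raf activation and deactivation.

    Returns the number of active Raf molecules at time t_end.
    Activation propensity:   v_act * s1 * (total - active)
    Deactivation propensity: v_inh * active
    """
    rng = random.Random(seed)
    t, active = 0.0, 0
    while True:
        a_on = v_act * s1 * (total - active)
        a_off = v_inh * active
        a_sum = a_on + a_off
        if a_sum == 0.0:
            break                      # no reaction can fire
        t += rng.expovariate(a_sum)    # waiting time to the next reaction
        if t > t_end:
            break
        if rng.random() * a_sum < a_on:
            active += 1                # one molecule is activated
        else:
            active -= 1                # one molecule is deactivated
    return active


# Repeated runs with different seeds give samples of the equilibrium count.
samples = [gillespie_raf(10.0, 0.01, 0.1, 100, 100.0, seed=s) for s in range(200)]
```

With these illustrative rates each molecule is active about half of the time, so the samples scatter around 50 out of 100 molecules.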

Unfortunately, even simple ODEs such as the one in the MAPK example are difficult to build de novo. Doing so is nearly impossible for novel and poorly studied systems, for which the necessary experimental information is missing or hard to find: the structure or boundaries of the process, the kinetic equations governing its dynamics [30], the rate constants for these equations, or the rules governing each agent's states and functions.

Equilibrium Enzyme Kinetics. Simpler and more general quantitative models can be specified when a reaction reaches the state of chemical equilibrium [31]. One commonly used model of this kind is the Hill function, in the form of

$$ X = \beta \frac{{PA}_{X}^{n}}{K^{n} + {PA}_{X}^{n}}, \tag{4} $$

where $X$ is the abundance of a protein in a causal diagram (such as $Erk$ in eq. (1)), ${PA}_{X}$ is the set of its parents, $n$ is a parameter interpreted as the number of ligand binding sites of the protein, $K$ is the activation constant, i.e., the parent abundance at which $X$ reaches $\beta / 2$, and $\beta$ is the total number of molecules of the protein. A special and frequently used case of the Hill function, called the Michaelis-Menten function, occurs when $n = 1$. Although simple to use, these models are deterministic, and do not describe the stochasticity that is a distinctive property of biological systems at low concentrations.
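The Hill function in eq. (4) is short enough to write out directly. The sketch below assumes a single parent abundance; the function name `hill` and the numeric values are illustrative only.

```python
def hill(parent, beta, k, n):
    """Hill function of eq. (4) for a single parent abundance."""
    return beta * parent ** n / (k ** n + parent ** n)


half_max = hill(30.0, 100.0, 30.0, 2)          # parent == K gives beta / 2
michaelis_menten = hill(10.0, 100.0, 30.0, 1)  # n = 1 is the Michaelis-Menten case
```

Evaluating the function at `parent == k` is a quick sanity check on any implementation, because the result must equal `beta / 2` regardless of `n`.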

Modeling Biological Processes With Structural Causal Models. The stochastic nature of biological processes at steady state can be represented by an SCM such as in Fig. 1a [27], [32]. SCMs represent the dependencies between a child node $X$ and its parents ${PA}_{X}$ in terms of a deterministic function $X = f_{X}({PA}_{X}, N_{X})$, called a structural assignment, and a noise variable $N_{X}$. In Fig. 1a, $f_{Mek}$ and $f_{Erk}$ are linear or non-linear structural assignments, and $N_{Raf}$, $N_{Mek}$, and $N_{Erk}$ are statistically independent noise variables with defined probability distributions:

$$
\begin{aligned}
Raf &= N_{Raf}; \quad Mek = f_{Mek}(Raf, N_{Mek}); \\
Erk &= f_{Erk}(Mek, N_{Erk}).
\end{aligned} \tag{5}
$$

An ideal intervention in an SCM is performed on a functional assignment. For example, an ideal intervention on $Mek$ sets $Mek = m'$, defining a new SCM

$$ Raf = N_{Raf}; \quad Mek = m'; \quad Erk = f_{Erk}(Mek, N_{Erk}). \tag{6} $$

An ideal intervention can also be thought of as a process of mutilating the causal graph. For example, intervening on $Mek$ eliminates its dependence upon $Raf$, and therefore the edge from $Raf$ to $Mek$ is removed, as shown in Fig. 1b.

Fig. 1. *Causal modeling of MAPK signaling pathway.* Circles are variables, double circles are variables intervened upon, squares are deterministic functional assignments, gray nodes are observed variables, and white nodes are hidden variables. (a) Structural causal model. $N_{Raf}$, $N_{Mek}$, and $N_{Erk}$ are statistically independent noise variables. Root node $Raf$ is only dependent on noise variable $N_{Raf}$. Non-root nodes $Mek$ and $Erk$ are dependent on their parent and on the associated noise variable. (b) Counterfactual model. The intervention fixes the count of phosphorylated $Mek$ to $m'$, such that $Mek$ is no longer dependent on $Raf$ and $N_{Mek}$. Given an observed data point, counterfactual inference infers the noise variables $\hat{N}_{Raf}$ and $\hat{N}_{Erk}$.

Counterfactual Inference With SCM. Beyond direct model-based predictions, SCMs enable counterfactual inference, i.e., the process of inferring the unseen outcomes of a hypothetical intervention given data observed in the absence of the intervention [5]. In the context of an SCM, counterfactuals are defined as

$$ Y_{do(T = t')}(u) \triangleq Y_{M_{do(T = t')}}(u). \tag{7} $$

In other words, the outcome $Y$ that individual $u$ would have had, had she received treatment $t'$, is defined as the value that $Y$ would have in a structural causal model $M$ mutilated to replace $T = f_{T}(\cdot)$ with $T = t'$.

For example, in the MAPK signaling pathway, we may be interested in the counterfactual question: Having observed the kinase activities $Raf = r$, $Mek = m$, and $Erk = e$, what would be the kinase activity of $Erk$ in a hypothetical experiment where the kinase activity of $Mek$ was fixed to $m'$? This counterfactual query is more formally translated into

$$ P(Erk_{do(Mek = m')} \mid Raf = r, Mek = m, Erk = e). \tag{8} $$

The probability distribution in eq. (8) is estimated with the following steps:

1) Abduction: Given observational data, estimate the posterior distribution of the noise variables. In the MAPK example, we estimate the posterior distribution of the noise variables:
$$
\begin{aligned}
\hat{N}_{Raf} &= N_{Raf} \mid Raf = r, Mek = m, Erk = e \\
\hat{N}_{Erk} &= N_{Erk} \mid Raf = r, Mek = m, Erk = e
\end{aligned}
$$
Several inference algorithms are available for this task, e.g., Markov chain Monte Carlo [33], Gibbs sampling [34], or the No-U-Turn variant of Hamiltonian Monte Carlo (HMC) [35]. In recent years, gradient-based inference algorithms such as stochastic variational inference [36] have become popular, because they can scale to larger models by converting an inference problem into an optimization problem.
2) Intervention: Apply the intervention to the SCM to generate a mutilated SCM. In the MAPK SCM, $Mek = f_{Mek}(Raf, N_{Mek})$ is replaced with $Mek = m'$, as shown in Fig. 1b.
3) Prediction: Generate samples from the mutilated SCM using the estimated posterior distribution over the exogenous variables $\hat{N}_{Raf}$ and $\hat{N}_{Erk}$ to obtain the counterfactual distribution, as shown in Fig. 1b.
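The three steps can be sketched for the query in eq. (8) under two simplifying assumptions that are not part of the general procedure: the noise on $Erk$ is additive, and the parameters of $f_{Erk}$ are already known. Under these assumptions the posterior of $N_{Erk}$ collapses to a single value, so no sampler is needed. The sigmoid form, the parameter values, and the names `sigmoid_mean` and `counterfactual_erk` are illustrative.

```python
import math


def sigmoid_mean(parent, beta, w, b):
    """Noise-free part of an assumed sigmoid structural assignment."""
    return beta / (1.0 + math.exp(-w * parent - b))


def counterfactual_erk(mek, erk, mek_new, erk_params):
    """Counterfactual Erk had Mek been fixed to mek_new, given observed (mek, erk)."""
    beta, w, b = erk_params
    # 1) Abduction: with additive noise, N_Erk is recovered exactly from the observation.
    n_erk = erk - sigmoid_mean(mek, beta, w, b)
    # 2) Intervention: replace Mek's assignment with the constant mek_new.
    # 3) Prediction: re-evaluate Erk in the mutilated model with the same noise.
    return sigmoid_mean(mek_new, beta, w, b) + n_erk


# Hypothetical parameters (beta, w, b) and a hypothetical observation.
erk_cf = counterfactual_erk(mek=40.0, erk=70.0, mek_new=20.0, erk_params=(100.0, 0.1, -3.0))
```

Setting `mek_new` equal to the observed `mek` returns the observed `erk`, which is the consistency property expected of any counterfactual procedure. When the noise is not additive or the parameters are uncertain, the abduction step requires one of the inference algorithms listed above.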

Causal Effects. We distinguish between two causal effects. The first is the average treatment effect (ATE), defined as the difference between the outcome of a hypothetical intervention and the observed outcome in the entire population. In the MAPK example, the ATE on $Erk$ of an intervention fixing $Raf = r'$ is:

$$ \mathbb{E}\{Erk_{do(Raf = r')} - Erk\}. \tag{9} $$

This requires no observational data, and therefore the ATE can be inferred with forward simulation.

On the other hand, the individual treatment effect (ITE) is defined as the difference between the outcome of a hypothetical intervention and the observed outcome for a specific individual or context. In the MAPK example, the ITE on $Erk$ of an intervention fixing $Raf = r'$ in a context where $Raf = r$, $Mek = m$, and $Erk = e$ is:

$$ \mathbb{E}\{Erk_{do(Raf = r')} - Erk \mid Raf = r, Mek = m, Erk = e\}. \tag{10} $$

The ITE shares stochastic components of the noise variables between observational and interventional data, and is therefore often more precise than a comparison based on a direct simulation [27].
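A short worked example shows where the gain in precision comes from. Assume, for illustration only, a single edge with additive noise, $X = g(PA_X) + N_X$ with $\mathrm{Var}(N_X) = \sigma_X^2$, and compare an intervention $PA_X = a'$ against an observed context $PA_X = a$. The symbols $g$, $a$, $a'$, and $\sigma_X$ are introduced here and do not appear in the original text.

```latex
\begin{aligned}
\text{shared noise (counterfactual):}\quad
  & X_{do(PA_X = a')} - X = \bigl(g(a') + N_X\bigr) - \bigl(g(a) + N_X\bigr) = g(a') - g(a), \\
\text{independent draws (two forward simulations):}\quad
  & \mathrm{Var}\bigl[\bigl(g(a') + N_X'\bigr) - \bigl(g(a) + N_X\bigr)\bigr] = 2\sigma_X^2 .
\end{aligned}
```

In the shared-noise case the noise term cancels, so the difference has no variance from $N_X$, whereas the comparison of two independent simulations carries twice the noise variance.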

In cases where domain knowledge is available to describe the systems dynamics in the form of an SDE, the system at equilibrium can be translated into an SCM to enable counterfactual reasoning and estimation of the individual treatment effect [27], [37]. Unfortunately, this process is challenging in novel and poorly studied systems, due to our limited ability to establish the structure of the causal graph.

Structured Knowledge Graphs. Although there exist a multitude of biological knowledge bases that are manually curated from the literature [12], [13], [14], [15], [16], the systems biology community has coalesced around a small number of structured knowledge representations that differ mainly in their intended purpose. For example, the Biological Pathway Exchange Language (BioPAX) was designed for pathway database integration [17], and the Systems Biology Graphical Notation (SBGN) was designed for graphical layout [19].

In contrast, the Biological Expression Language (BEL) was specifically designed for manual extraction and automated integration of author statements about causal relationships among biological entities, biological processes, and cellular-level observable phenomena [11]. The syntax of a BEL statement is comprised of a triple in the form of {subject, predicate, object}. Each subject and object represents an activity or abundance whose entities are grounded using terms from standard namespaces. If the subject directly increases the abundance or the activity of the object, we represent this with =>, and for directly decreasing relationships, we use =|. BEL statements can be chained together from the object of the first statement to the subject of the next statement, as shown in Fig. 2 for the case of the MAPK pathway.

Fig. 2. *Example BEL statement.* The statement details the processes in the MAPK signaling pathway in eq. (1). The first line states that the kinase activity of RAF directly increases the kinase activity of MEK. The second line states that the kinase activity of MEK directly increases the kinase activity of ERK.

BEL provides a number of valuable features for causal modeling. First, the restriction of BEL edges to causal relations implies the topology of the BEL graph can be reflected in the topology of the causal model. Second, the language is expressive enough for humans to manually curate a wide range of biological concepts, but formal enough to serve as a training corpus for natural language processing competitions on biomedical literature [38]. Third, the BEL ecosystem is sufficiently mature that causal knowledge represented in other languages can be readily converted to BEL [39], [40].

3. Methods

3.1. Notation, Definitions, and Assumptions

Let $X = \{X_{i}\}$ be a set of variables, such as molecular activities in a signaling pathway. Let $P = \{P_{j}\}$ be a set of causal predicates that link these variables, such as increases or regulates. Using this notation, we define a knowledge graph $K$ as a set of $k$ triples

$$ K = \{(X_{i}, P_{j}, X_{i'}) \mid X_{i} \in X,\ P_{j} \in P,\ X_{i'} \in X \setminus X_{i}\}_{j = 1}^{k}. \tag{11} $$

We define a causal query $Q$ as a set $\{X^{c}, X^{e}, X^{z}\}$ of variables that are potential causes, effects, and covariates of interest for the biological investigation, where

$$ X^{c} \subset X, \quad X^{e} \subset X \setminus X^{c}, \quad \text{and} \quad X^{z} \subset X \setminus X^{c} \setminus X^{e}. $$

A pathway $P(X_{1}, X_{k' + 1})$, $k' \leq k$, is a sequence of triples from $K$ in which the object of each triple is the subject of the next triple

$$ \{(X_{1}, P_{1}, X_{2}), (X_{2}, P_{2}, X_{3}), \ldots, (X_{k'}, P_{k'}, X_{k' + 1})\}. \tag{12} $$

Our goal is to query the knowledge graph to generate a qualitative causal model $B$ that links the causes, the effects, and the covariates of interest. Importantly, the query result $B$ induces a directed acyclic graph $G$ with $p$ variables from $X$ as nodes, and causal relations from $P$ as edges.

We assume that every variable in $B$ is continuous. We denote by $D = \{X_{1j}, X_{2j}, \ldots, X_{pj}\}_{j = 1}^{m}$ the observational data of $m$ samples from the joint distribution $P(X; \theta)$. The distribution is specified in terms of parameters $\theta$. We denote by $R \subset X$ the set of nodes in $G$ without parents (root nodes).

3.2. Querying a Knowledge Graph to Obtain a Qualitative Causal Model

Given a biological knowledge graph $K$ and a causal query of interest $Q$, our first objective is to generate a qualitative causal model $B$ capable of answering the query. To this end, we need to explore all potential directed acyclic paths in $K$ from the cause to the effect in $Q$, and then consider all covariates that may act as confounders of the causal question. This is done with the steps in Algorithm 1. The algorithm can be implemented on any knowledge graph that represents causal relationships as directed edges, such as BEL or the Systems Biology Graphical Notation Activity Flow (SBGN-AF) language [41].

In the case of MAPK, the qualitative causal model that is capable of answering the counterfactual question in eq. (8) corresponds to the result of this query: $Q = \{X^{c} = kin(p(MEK)), X^{e} = kin(p(ERK)), X^{z} = kin(p(RAF))\}$.

Algorithm 1. Causal query to Biological Expression Language (QUERY2BEL) algorithm

Inputs: knowledge graph $K$; causal query $Q = \{X^{c}, X^{e}, X^{z}\}$
Outputs: $B$
1: procedure query2bel($X^{c}, X^{e}, X^{z}, K$)
2:   ▸ Get all pathways from cause to effect
3:   for each cause $X_{i}^{c} \in X^{c}$ and for each effect $X_{j}^{e} \in X^{e}$ do
4:     find all pathways $\{P(X_{i}^{c}, X_{j}^{e})\}$
5:   ▸ Get all pathways from covariates to causes
6:   for each covariate $X_{i}^{z} \in X^{z}$ and for each cause $X_{j}^{c} \in X^{c}$ do
7:     find all pathways $\{P(X_{i}^{z}, X_{j}^{c})\}$
8:   ▸ Get all pathways from covariates to effects
9:   for each covariate $X_{i}^{z} \in X^{z}$ and for each effect $X_{j}^{e} \in X^{e}$ do
10:    find all pathways $\{P(X_{i}^{z}, X_{j}^{e})\}$
11:  $B = \{P(X_{i}^{c}, X_{j}^{e})\} \cup \{P(X_{i}^{z}, X_{j}^{c})\} \cup \{P(X_{i}^{z}, X_{j}^{e})\}$
12:  return $B$

We execute Algorithm 1 steps 3 and 4 to obtain all pathways from the cause to the effect:

$$ kin(p(MEK)) \to kin(p(ERK)). $$

We execute Algorithm 1 steps 6 and 7 to obtain all pathways from the covariate to the cause:

$$ kin(p(RAF)) \to kin(p(MEK)). $$

We execute Algorithm 1 steps 9 and 10, but since the pathways from the covariate $kin(p(RAF))$ to the effect $kin(p(ERK))$ contribute no new triples, this step adds nothing to the model. The final returned model is:

$$ kin(p(RAF)) \to kin(p(MEK)) \to kin(p(ERK)). $$
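The query above can be reproduced with a small sketch of Algorithm 1 that stores the knowledge graph as a list of (subject, predicate, object) tuples and enumerates simple directed pathways by depth-first search. This is an illustrative reading of the algorithm, not the implementation used in the paper; the function names, the predicate strings, and the extra `bp(proliferation)` triple are hypothetical.

```python
def find_pathways(triples, source, target):
    """Return every simple directed pathway from source to target as a list of triples."""
    outgoing = {}
    for subj, pred, obj in triples:
        outgoing.setdefault(subj, []).append((subj, pred, obj))
    pathways = []

    def extend(node, path, visited):
        if node == target and path:
            pathways.append(list(path))
            return
        for triple in outgoing.get(node, []):
            nxt = triple[2]
            if nxt not in visited:          # keep pathways acyclic
                path.append(triple)
                extend(nxt, path, visited | {nxt})
                path.pop()

    extend(source, [], {source})
    return pathways


def query2bel(causes, effects, covariates, triples):
    """Union of the triples on all cause-effect, covariate-cause, and covariate-effect pathways."""
    model = set()
    for starts, ends in ((causes, effects), (covariates, causes), (covariates, effects)):
        for start in starts:
            for end in ends:
                for pathway in find_pathways(triples, start, end):
                    model.update(pathway)
    return model


# Toy knowledge graph: the MAPK chain plus one hypothetical triple outside the query.
MAPK_KG = [
    ("kin(p(RAF))", "directlyIncreases", "kin(p(MEK))"),
    ("kin(p(MEK))", "directlyIncreases", "kin(p(ERK))"),
    ("kin(p(ERK))", "increases", "bp(proliferation)"),
]
B = query2bel(["kin(p(MEK))"], ["kin(p(ERK))"], ["kin(p(RAF))"], MAPK_KG)
```

In this sketch the covariate-to-effect search does find the pathway through $kin(p(MEK))$, but every triple on it is already in the model, so the union is unchanged and `B` contains only the two MAPK triples.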

3.3. Compiling a Qualitative Causal Model to a Quantitative Structural Causal Model

Our second objective is to convert the qualitative causal structure in $B$ into a quantitative SCM, and estimate the parameters of the SCM from experimental data. These steps are described in Algorithm 2.

Input. The algorithm takes as input a BEL causal query result $B$ and observed measurements $D$ on its variables.

Get Network Structure $G$ From $B$ (Algorithm 2 Line 3). Since a set of BEL statements identifies parents and children, it induces a causal network structure. We determine this structure by traversing the BEL statements breadth first, starting with the root variables (such as $Raf$ in Fig. 2). A non-root variable is visited only after all of its parents have been traversed.
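A breadth-first traversal that defers each node until all of its parents have been visited is a topological sort in the style of Kahn's algorithm. The sketch below shows one way to derive the parent sets and the visiting order from a list of triples; the function name `traversal_order` and the node labels are assumptions for illustration.

```python
from collections import deque


def traversal_order(triples):
    """Return (order, parents): a parents-first node order and each node's parent set."""
    parents, children = {}, {}
    for subj, _pred, obj in triples:
        parents.setdefault(subj, set())
        parents.setdefault(obj, set()).add(subj)
        children.setdefault(subj, set()).add(obj)

    remaining = {node: len(pa) for node, pa in parents.items()}
    queue = deque(sorted(node for node, count in remaining.items() if count == 0))  # roots
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in sorted(children.get(node, ())):
            remaining[child] -= 1
            if remaining[child] == 0:   # all parents traversed, child may be visited
                queue.append(child)
    if len(order) != len(parents):
        raise ValueError("the statements contain a cycle, so they do not induce a DAG")
    return order, parents


order, parents = traversal_order([
    ("Raf", "directlyIncreases", "Mek"),
    ("Mek", "directlyIncreases", "Erk"),
])
```

The returned order can drive the two loops of Algorithm 2: nodes with an empty parent set are the roots, and the remaining nodes are processed after their parents.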

For Each Root Node $R$, Use $D$ to Estimate Parameters $\theta$ of $P(R; \theta)$ (Algorithm 2 Line 5). In order to specify the SCM, we need to define the type and parameters of the marginal probability distributions of the root variables $P(R; \theta)$. The BEL statements provide prior knowledge about the distribution in a parametric form. Therefore, this step involves techniques such as maximum likelihood to estimate the parameters of this distribution.

Algorithm 2. Biological Expression Language to Structural Causal Models (BEL2SCM) algorithm

Inputs: BEL statements $B$; data $D \sim P(X_{1}, \ldots, X_{p})$
Outputs: SCM $M = \{f_{i}({PA}_{i}, N_{i})\}_{i = 1}^{p}$
1: procedure bel2scm($B$, $D$)
2:   $M = \{\}$
3:   Get network structure $G$ from $B$.
4:   for each $R \in R$ in $G$ do
5:     ▸ Use $D$ to estimate parameters $\theta$ of $P(R; \theta)$
6:     $\theta = \arg\max_{\theta} P(R; \theta \mid D)$
7:     ▸ Reparameterize $P(R; \theta)$ in terms of $f_{R}$ and $N_{R}$
8:     $N_{R} \sim Uniform(0, 1)$
9:     $f_{R}(N_{R}) = F_{P(R; \theta)}^{-1}(N_{R})$
10:    $M$.Add($f_{R}(N_{R})$)
11:  for each $X \in X \setminus R$ in $G$ do
12:    ▸ Estimate parameters $w$ and $b$ of sigmoid function
13:    $\log\left(\frac{X}{\beta_{X} - X}\right) = w' {PA}_{X} + b$
14:    ▸ Define distribution of $N_{X}$ from model residuals.
15:    $residual = X - \frac{\beta_{X}}{1 + \exp(-w' {PA}_{X} - b)}$
16:    $N_{X} \sim N(0, MSE(residual))$
17:    ▸ Get $f_{X}({PA}_{X}, N_{X})$ with additive $N_{X}$.
18:    $f_{X}({PA}_{X}, N_{X}) = \frac{\beta_{X}}{1 + \exp(-w' {PA}_{X} - b)} + N_{X}$
19:    $M$.Add($f_{X}({PA}_{X}, N_{X})$)
20:  return $M$

For example, in a stochastic MAPK system at equilibrium, the root variable, i.e., the number of active $Raf$ molecules in a cell, follows a Binomial distribution. When the maximum number of active or inactive particles in the system is large, the Binomial distribution can be approximated with a Normal distribution with $\theta_{Raf} = (\mu_{Raf}, \sigma_{Raf}^{2})$. We then estimate $\theta_{Raf}$ using maximum likelihood from the observed $Raf$ in $D$.

For Each Root Node $R$, Reparameterize $P(R; \theta)$ in Terms of $f_{R}$ and $N_{R}$ (Algorithm 2 Line 7). The specification of an SCM requires us to separate the deterministic and the stochastic components of variation of each variable, as shown in Fig. 1. We accomplish this using a reparameterization technique popularized by variational autoencoders [42], which was shown to make counterfactual inference consistent with core biological assumptions [43]. In the case of root nodes, we draw the noise from Uniform(0, 1), and then pass it to the inverse CDF of $P(R; \theta)$, as follows

$$
\begin{aligned}
\text{Original:} \quad & R \sim P(R; \theta) \\
\text{Reparametrized:} \quad & N_{R} \sim Uniform(0, 1) \\
& f_{R}(N_{R}) = F_{P(R; \theta)}^{-1}(N_{R}),
\end{aligned} \tag{13}
$$

where $F_{P(R; \theta)}^{-1}(N_{R})$ is the inverse cumulative distribution function of $P(R; \theta)$. In the case of MAPK, since $Raf$ follows a Normal distribution with parameters $\theta_{Raf}$, the reparameterization simplifies even further to

$$
\begin{aligned}
\text{Original:} \quad & Raf \sim N(\mu_{Raf}, \sigma_{Raf}^{2}) \\
\text{Reparametrized:} \quad & N_{Raf} \sim N(0, 1) \\
& f_{Raf}(N_{Raf}) = \sigma_{Raf} N_{Raf} + \mu_{Raf}.
\end{aligned} \tag{14}
$$
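For a Normal root node, both forms of the reparameterization fit in a few lines using only the Python standard library. The sketch below estimates $\mu$ and $\sigma$ by maximum likelihood and then builds the structural assignment in the affine form of eq. (14) and in the inverse-CDF form of eq. (13); the function names and the three data points are hypothetical.

```python
from statistics import NormalDist, fmean, pstdev


def fit_root_normal(samples):
    """Maximum-likelihood estimates (mu, sigma) of a Normal distribution."""
    mu = fmean(samples)
    sigma = pstdev(samples, mu)   # the MLE divides by m, i.e., the population standard deviation
    return mu, sigma


def f_root_affine(noise, mu, sigma):
    """Eq. (14): structural assignment with standard Normal noise."""
    return sigma * noise + mu


def f_root_inverse_cdf(u, mu, sigma):
    """Eq. (13): structural assignment with Uniform(0, 1) noise passed through the inverse CDF."""
    return NormalDist(mu, sigma).inv_cdf(u)


mu_raf, sigma_raf = fit_root_normal([48.0, 50.0, 52.0])   # hypothetical Raf measurements
raf_from_normal = f_root_affine(1.0, mu_raf, sigma_raf)
raf_from_uniform = f_root_inverse_cdf(NormalDist().cdf(1.0), mu_raf, sigma_raf)
```

The two calls at the end return the same value up to numerical error, which illustrates that the affine form is the inverse-CDF form with the uniform noise re-expressed on the standard Normal scale.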

Add $R$ to $M$ (Algorithm 2 Line 10). For each root node, we add the corresponding function $f_{R}(N_{R})$ and its noise variable $N_{R}$ to $M$. For example, since MAPK has only one root node $Raf$, the algorithm adds $f_{Raf}(N_{Raf})$ to $M$.

For Each $X \in X \setminus R$, Estimate Parameters $w$ and $b$ of Sigmoid Function (Algorithm 2 Line 12). In order to specify the SCM for non-root nodes, we need to define the form (polynomial, linear, non-linear, sigmoid, etc.) of the functional assignments linking the measurements on the parent nodes to the measurements on the child. We chose the functional assignment in the form of a sigmoid function

$$ \log\left(\frac{X}{\beta_{X} - X}\right) = w' {PA}_{X} + b, \tag{15} $$

where $\beta_{X}$ is the maximum number of activated protein molecules. For a node $X$ with $q$ parents, ${PA}_{X}$ is a $q \times 1$ vector of measurements on the parent nodes, $w$ is a $q \times 1$ vector of weights, $w'$ is the transpose of $w$, and $b$ is a scalar bias. Parameters $w$ and $b$ of the sigmoid function are estimated from the data, e.g., using a smooth $L_{1}$ loss function.
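Because eq. (15) is linear in $w$ and $b$ on the logit scale, a node with a single parent can be fitted in closed form. The sketch below uses ordinary least squares on the transformed response as a simple stand-in for the gradient-based smooth $L_{1}$ fit mentioned above; it is not the paper's estimator. It assumes $\beta_{X}$ is known and every measurement lies strictly between 0 and $\beta_{X}$. The function name and data are hypothetical.

```python
import math


def fit_sigmoid_logit(parent_values, child_values, beta):
    """Least-squares estimates (w, b) of eq. (15) for a node with one parent."""
    # Transform the child measurements to the logit scale of eq. (15).
    logits = [math.log(x / (beta - x)) for x in child_values]
    n = len(logits)
    mean_p = sum(parent_values) / n
    mean_y = sum(logits) / n
    sxx = sum((p - mean_p) ** 2 for p in parent_values)
    sxy = sum((p - mean_p) * (y - mean_y) for p, y in zip(parent_values, logits))
    w = sxy / sxx
    b = mean_y - w * mean_p
    return w, b


# Hypothetical noise-free data generated from w = 0.1, b = -3, beta = 100.
raf = [10.0, 20.0, 30.0, 40.0, 50.0]
mek = [100.0 / (1.0 + math.exp(-(0.1 * p - 3.0))) for p in raf]
w_hat, b_hat = fit_sigmoid_logit(raf, mek, beta=100.0)
```

Least squares on the logit scale weights observations differently from a loss computed on the original scale, so with noisy data the two approaches generally give somewhat different estimates.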

In the example of the MAPK pathway, $Mek$ has only one parent, $Raf$. Therefore $f_{Mek}$ has the form

$$ f_{Mek}(Raf, N_{Mek}) = \frac{\beta_{Mek}}{1 + \exp(-w_{Mek} Raf - b)} + N_{Mek}. \tag{16} $$

We use the sigmoid function in eq. (15) as a special case of the Hill equation. The full parametric description of the Hill equation has a precise biochemical interpretation. For example, the parameter $n$ represents the number of times a protein must be phosphorylated before it becomes active and can therefore be obtained from domain knowledge. However, it is difficult to estimate this parameter from data. The sigmoid function retains the saturating behavior of the Hill equation, but with a reduced set of parameters that are easier to estimate. Fig. 3 shows that the approximation is reasonable for a range of parameter values.

Fig. 3. *Examples of Hill function and sigmoid function for two variables.* $X$ is a single node that has a single parent ${PA}_{X}$. We use the Hill function ($X = \beta \frac{{PA}_{X}^{n}}{K^{n} + {PA}_{X}^{n}}$) and the sigmoid function as in eq. (15) to predict the value of $X$ given its parent value. In the Hill function, $K$ is the activation rate, $n$ defines the steepness of the function, and $\beta$ is fixed at 100. Blue lines correspond to the Hill equation with $K = 30$ and $n \in \{1, 2, 3\}$. Brown lines correspond to the sigmoid function where $b \in \{0.4, 0.3, 0.4\}$ and $w \in \{0.025, 0.1, 0.5\}$.

Define Distribution of $N_{X}$ From Model Residuals (Algorithm 2 Line 14). Similarly to the root variables, for non-root variables we assume that the noise variables follow a Normal distribution with zero mean. The variance of this distribution is estimated from the residuals of the model fit in the previous step. For example, in the MAPK pathway, $f_{Mek}$ has only one parent $Raf$. Therefore, the residuals of the sigmoid curve fit for $Mek$ are defined as

$$\mathit{residual}_{Mek} = Mek - \frac{β_{Mek}}{1 + \exp(-w_{Mek} Raf - b)}, \tag{17}$$

and the distribution of the noise variable is defined as $N_{Mek} \sim N(0, \mathit{MSE}(\mathit{residual}_{Mek}))$.
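The residual-based noise model can be sketched as follows. The observations and fitted parameters are hypothetical; the only point illustrated is that the mean squared residual is used as the variance of a zero-mean Normal noise term.

```python
import math
import random

def fitted_mek(raf, w=0.1, b=-4.0, beta=100.0):
    """Noise-free sigmoid prediction; parameter values are hypothetical."""
    return beta / (1.0 + math.exp(-w * raf - b))

# Hypothetical (Raf, Mek) observations.
data = [(30.0, 29.0), (40.0, 52.0), (50.0, 71.0), (60.0, 90.0)]

residuals = [mek - fitted_mek(raf) for raf, mek in data]
mse = sum(r * r for r in residuals) / len(residuals)

# N_Mek ~ Normal(0, MSE): the MSE plays the role of the variance, so the
# standard deviation passed to the sampler is its square root.
rng = random.Random(0)
noise_draw = rng.gauss(0.0, math.sqrt(mse))
print(mse, noise_draw)
```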

Get $f_{X} ({PA}_{X}, N_{X})$ With Additive $N_{X}$ (Algorithm 2 Line 17). This step combines the sigmoid functional assignment and the independent noise variable. In the example of $Mek$ in the MAPK pathway, the step outputs

$$f_{Mek}(Raf, N_{Mek}) = \frac{β_{Mek}}{1 + \exp(-w_{Mek} Raf - b)} + N_{Mek}. \tag{18}$$

Add $f_{X} ({PA}_{X}, N_{X})$ to SCM (Algorithm 2 Line 19). The step iteratively adds $(f_{X}, N_{X})$ for all $X \in \mathbf{X}$.

Output (Algorithm 2 Line 20). The algorithm returns a generative structural causal model $M = \{f_{i} ({PA}_{i}, N_{i})\}_{i = 1}^{p}$ where ${PA}_{i} \subset \mathbf{X}$. For example, in the case of the MAPK model, it returns $[N_{Raf}, N_{Mek}, N_{Erk}, f_{Raf} (N_{Raf}), f_{Mek} (Raf, N_{Mek}), f_{Erk} (Mek, N_{Erk})]$.
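A generative SCM of this kind can be represented, in a minimal sketch, as an ordered collection of (parents, deterministic part, noise scale) entries that is sampled in topological order. The dictionary layout, the `sample` helper, and every numeric value are assumptions made for illustration; they do not reproduce the BEL2SCM data structures.

```python
import math
import random

def sigmoid(z, beta=100.0):
    return beta / (1.0 + math.exp(-z))

# node -> (parents, deterministic part g(parents), noise standard deviation)
# All numeric values are hypothetical.
SCM = {
    "Raf": ((), lambda: 50.0, 10.0),
    "Mek": (("Raf",), lambda raf: sigmoid(0.1 * raf - 5.0), 2.0),
    "Erk": (("Mek",), lambda mek: sigmoid(0.1 * mek - 5.0), 2.0),
}

def sample(scm, rng):
    """Draw one joint sample; dict insertion order is topological here."""
    values = {}
    for node, (parents, g, noise_std) in scm.items():
        noise = rng.gauss(0.0, noise_std)
        values[node] = g(*(values[p] for p in parents)) + noise
    return values

rng = random.Random(1)
print(sample(SCM, rng))
```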

3.4. Counterfactual Inference Procedure

The generated SCM enables counterfactual inference using a standard procedure [5]. Given a new observation $D^{new}$,

1) Abduction: Update the probability $P (N_{X})$ to obtain $P (N_{X} | D^{new})$.
2) Action: Replace the equations determining the variables in set $X^{c}$ by $X^{c} = x^{c'}$.
3) Prediction: Sample from the modified model to generate the target distribution $X_{do (X^{c} = x^{c'})}^{e}$.
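The three steps can be sketched for the special case in which every variable is observed and the noise is additive. Under those assumptions the abduction step reduces to exact inversion (noise equals observed value minus the deterministic part); this is a simplification of the posterior update described above, which in general requires approximate inference. The model parameters and the data point are hypothetical.

```python
import math

def sigmoid(z, beta=100.0):
    return beta / (1.0 + math.exp(-z))

# Deterministic parts of a hypothetical Raf -> Mek -> Erk model.
PARENTS = {"Raf": (), "Mek": ("Raf",), "Erk": ("Mek",)}
G = {
    "Raf": lambda: 50.0,
    "Mek": lambda raf: sigmoid(0.1 * raf - 5.0),
    "Erk": lambda mek: sigmoid(0.1 * mek - 5.0),
}

def abduct(observed):
    """Recover each noise term as observed value minus deterministic part."""
    return {x: observed[x] - G[x](*(observed[p] for p in PARENTS[x]))
            for x in PARENTS}

def predict(noise, interventions):
    """Re-evaluate the model with fixed noise and do()-replaced equations."""
    values = {}
    for x in PARENTS:  # insertion order is topological here
        if x in interventions:
            values[x] = interventions[x]
        else:
            values[x] = G[x](*(values[p] for p in PARENTS[x])) + noise[x]
    return values

d_new = {"Raf": 55.0, "Mek": 60.0, "Erk": 75.0}
noise = abduct(d_new)                            # 1) abduction
counterfactual = predict(noise, {"Mek": 40.0})   # 2) action, 3) prediction
ite_erk = counterfactual["Erk"] - d_new["Erk"]
print(counterfactual, ite_erk)
```

Calling `predict(noise, {})` with no intervention reproduces the observed data point, which is a quick consistency check on the abduction step.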

After generating the target distribution of the intervention model, we estimate causal effects. Algorithm 3 describes the detailed steps of both counterfactual inference (with $D^{new}$) and forward simulation (if $D^{new}$ is empty).

3.5. Implementation

QUERY2BEL was implemented manually using a publicly available instance of BioDati Studio, then validated using Bob with BioAgents [10], the interactive dialogue system of the Integrated Dynamical Reasoner and Assembler (INDRA) [10]. Parameter estimation in BEL2SCM was implemented in PyTorch. Let $C$ be the number of nodes in causal graph $G$ with parents. Let $k$ be the number of iterations for gradient descent, let $N$ be the number of samples in the data, and let $d$ be the maximum number of parents in graph $G$. The computational complexity of the parameter estimation step is $O (C k N d)$.
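The origin of the $O (C k N d)$ cost can be made visible with a plain-Python sketch of the fitting loop: one pass over $C$ nodes, $k$ gradient steps per node, $N$ samples per step, and $d$ parents per sample. This is not the PyTorch implementation used in the study; the data are hypothetical, rescaled to $[0, 1]$, and $β$ is fixed at 1 to keep the step size simple.

```python
import math

def smooth_l1(r):
    """Smooth L1 loss with threshold 1, returned with its derivative."""
    if abs(r) < 1.0:
        return 0.5 * r * r, r
    return abs(r) - 0.5, math.copysign(1.0, r)

def fit_node(parent_rows, targets, iterations=500, lr=1.0):
    """Full-batch gradient descent for y = 1 / (1 + exp(-w'x - b))."""
    d = len(parent_rows[0])
    w, b = [0.0] * d, 0.0
    history = []
    for _ in range(iterations):                          # k iterations
        grad_w, grad_b, total = [0.0] * d, 0.0, 0.0
        for x, y in zip(parent_rows, targets):           # N samples
            z = sum(wj * xj for wj, xj in zip(w, x)) + b  # d parents
            pred = 1.0 / (1.0 + math.exp(-z))
            loss, dloss = smooth_l1(pred - y)
            total += loss
            common = dloss * pred * (1.0 - pred)
            for j in range(d):
                grad_w[j] += common * x[j]
            grad_b += common
        n = len(targets)
        history.append(total / n)   # loss before this update
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b, history

# Hypothetical noise-free data with one parent per node.
xs = [[i / 20.0] for i in range(21)]
ys = [1.0 / (1.0 + math.exp(-(4.0 * x[0] - 2.0))) for x in xs]
nodes_with_parents = {"Mek": (xs, ys), "Erk": (xs, ys)}   # C nodes
fits = {name: fit_node(*data) for name, data in nodes_with_parents.items()}
print(fits["Mek"][0], fits["Mek"][1], fits["Mek"][2][-1])
```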

SCM-based counterfactual inference was performed with Pyro [44], chosen for its ability to perform interventions on probabilistic models and its scalability to larger models, as described in Algorithm 3. Specifically, the implementation relies on the following functionalities in Pyro. The pyro.do method is an implementation of Pearl's do-operator used for causal inference. The pyro.infer.SVI method performs abduction using stochastic variational inference with the ELBO loss. The pyro.infer.Importance method performs posterior inference by importance sampling. The pyro.infer.EmpiricalMarginal method computes the empirical marginal distribution of selected sites from the trace posterior.

Algorithm 3. Estimate causal effect on $X^{E}$ upon intervening on $X^{C}$

Inputs: new data point $D^{new}$; effect node $X^{E}$; observational data for effect node $D^{E} \in D^{new}$; intervention value $c$; node to intervene upon $X^{C}$; number of iterations $I$; network structure $G$; SCM $M$

Outputs: causal effect $CE$

1: procedure getCausalEffect($D^{new}, E, D^{E}, X^{C}, c, I, G, M$)
2:     $\hat{N} = \{\}$
3:     $▸$ Interventional data for effect node $X^{E}$
4:     ${ID}^{E} = \{\}$
5:     for $I$ do
6:         for each $X \in \{\mathbf{X} \setminus X^{C}\}$ in $G$ do
7:             $▸$ Abduction: Apply stochastic variational inference
8:             ${\hat{N}}_{X} = SVI(D^{new})$
9:             $\hat{N}$.Add(${\hat{N}}_{X}$)
10:         $▸$ Action: Apply intervention on $X^{C}$
11:         $CM = \text{pyro.do}(M, X^{C} = c)$
12:         $▸$ Get posterior of $CM$ with importance sampling
13:         $CMP = \text{pyro.infer.Importance}(CM, \hat{N})$
14:         $▸$ Prediction: Get EmpiricalMarginal (EM) for $X^{E}$
15:         $CMM = \text{pyro.infer.EM}(CMP, X^{E})$
16:         ${ID}^{E}$.Add($CMM$)
17:     $CE = {ID}^{E} - D^{E}$
18:     return $CE$

Experiments in this manuscript took between 13 and 82 seconds, depending on the graph size, on a system with an Intel Core i7 8th Gen CPU, 16 GB RAM, and the Ubuntu 18.04 operating system. The code is available at https://github.com/bel2scm.

4. Case Studies

Below we introduce two biological case studies investigated using the approach proposed in this manuscript. The first case study allows us to evaluate the accuracy of the results based on known ground truth. The second uses counterfactual reasoning to pinpoint the mechanism by which SARS-CoV-2 infection can lead to a cytokine storm in severely ill coronavirus disease 2019 (COVID-19) patients. The details of the case studies, the parameter values of the simulations, and the results are available at https://github.com/bel2scm.

4.1. Case Study 1: The IGF Signaling System

The System. The IGF signaling pathway (Fig. 4) regulates growth and energy metabolism of a cell. The IGF system has been extensively investigated, and its dynamics are well characterized in the form of ODE and SDE models [25]. Activated by external stimuli, insulin-like growth factor (IGF) or epidermal growth factor (EGF) triggers a signaling event, which includes the MAPK signaling pathway in eq. (1). Similarly to eq. (1), nodes in the system are kinase activities, and edges represent whether the kinase activity of the upstream protein directly increases or decreases the kinase activity of the downstream protein. However, the system is larger and more complex. It includes two different paths from $Ras$ to $Erk$, one direct and the other through $PI3K$ and $Akt$. This complicates the estimation of outcomes of interventions. In this case study, we assume that the IGF system has no unobserved confounders.

Intervention. We considered two interventions. The first fixes the kinase activity of $Mek$ to 40. The second fixes the kinase activity of $Ras$ to 30.

Causal Effects of Interest. We are interested in two causal questions. First, what would have been the kinase activity of $Erk$ had we intervened to fix the kinase activity of $Mek$ to 40? The second query is as above, but with the intervention fixing the kinase activity of $Ras$ to 30. More formally, we are interested in the average treatment effect

$$\{Erk_{do(Mek = 40)} - Erk\} \tag{19}$$

$$\{Erk_{do(Ras = 30)} - Erk\}. \tag{20}$$

Next, we introduce a new piece of information about a specific data point generated from the ODE-based simulation. We wish to estimate the causal effect of intervention for this specific data point. More formally, we are interested in the individual treatment effect

$$\{Erk_{do(Mek = 40)} - Erk\} \mid D^{new} \tag{21}$$

$$\{Erk_{do(Ras = 30)} - Erk\} \mid D^{new}, \tag{22}$$

where $D^{new}$ is a new data point. We note that this counterfactual inference can only be performed with an SCM. We wish to compare these estimates of causal effects, in order to characterize the ability of counterfactual inference via $D^{new}$ to improve the precision of the estimates.

Evaluation. The kinetic equations described by the ODE and SDE represent the true underlying dynamics of the IGF signaling pathway. Since the ODE and the SDE can estimate the causal effects by forward simulation, we view the estimates as the ground truth. We then wish to compare the estimates from the SCM against the ground-truth estimates from the ODE and the SDE. Since an SCM represents causal relationships at steady state, we train the parameters of the SCM using data generated from the ground-truth SDE after it has reached steady state.

We consider two types of evaluations. First, we compare the estimates of the forward simulation of the ODE and SDE with the forward simulation of the SCM. This allows us to characterize the impact of SCM specification and estimates of weights on the accuracy of causal effects. We do not expect to see a substantial difference between these two approaches for a correctly specified SCM. We then compare the SCM-based counterfactual inference of causal effects with the estimates based on forward simulation. We expect that the counterfactual inference will provide more precise estimates, illustrating the statistical efficiency of counterfactual inference as compared to the forward simulation.

4.2. Case Study 2: Host Response to Viral Infection

The System. Retrospective studies have indicated that high levels of pro-inflammatory cytokine Interleukin 6 (IL6) are strongly associated with severely ill COVID-19 patients [45]. One recently proposed explanation for this is the viral induction of a positive feedback loop, known as Interleukin 6 Amplifier (IL6-AMP) [46]. IL6-AMP is stimulated by simultaneous activation of nuclear factor kappa-light-chain-enhancer of activated B cells (NF-$κ$B) and Signal Transducer and Activator of Transcription 3 (STAT3) [47]. This in turn induces various pro-inflammatory cytokines and chemokines, including Interleukin 6, which recruit activated T cells and macrophages. This strengthens the Interleukin 6 Amplifier into a positive feedback loop leading to a cytokine storm [48], which is believed to be responsible for the tissue damage observed in patients with acute respiratory distress syndrome (ARDS) [46].

Intervention. Originally developed to treat autoimmune disorders such as rheumatoid arthritis [49], Tocilizumab (Toci) is an immunosuppressive drug consisting of a recombinant monoclonal antibody that targets the soluble Interleukin 6 receptor and can effectively block the IL6 signal transduction pathway [50]. Tocilizumab has emerged as a promising drug repurposing candidate to reduce mortality in severely ill COVID-19 patients [51], [52].

Causal Effect of Interest. We define a severely ill COVID-19 patient as someone with CytokineStorm > 65. We are interested in the individual treatment effect (ITE)

$$\{\mathit{CytokineStorm}_{do(Toci = 0)} - \mathit{CytokineStorm}\} \mid D^{new}, \tag{23}$$

where $D^{new}$ is an observed patient who received Tocilizumab treatment and became severely ill. We wish to characterize the severity of the cytokine storm that would have occurred had she not received the treatment. We further wish to compare the ITE with the ATE

$$\{\mathit{CytokineStorm}_{do(Toci = 0)} - \mathit{CytokineStorm}\}. \tag{24}$$

Evaluation. Tocilizumab is known to have a strong inhibitory effect on soluble Interleukin 6 receptor. We therefore expect that the severity of the cytokine storm would have been worse had the patient not received treatment. Unfortunately, at the time of writing, there were no ODE or SDE-based models of the pathway, nor were there publicly available COVID-19 datasets quantifying the kinase activity of the Interleukin 6 Amplifier pathway at the single-cell level. Therefore, we simulated data from a “ground-truth” sigmoidal structural causal model, where the topology reflects the causal structure of the pathway, and the numeric values of the parameters were fixed to reflect our prior qualitative knowledge of the IL6-AMP pathway.

We evaluate the ITE estimated by the proposed approach in two ways. First, we train the parameters of the SCM using the simulated data, and compare the counterfactual inference of the ITE obtained from the “trained” SCM to the counterfactual inference of the ITE from the “ground-truth” SCM. This comparison allows us to characterize the impact of weight estimation on the accuracy of causal effects. We expect that the need to estimate the weights will inflate the variance of the estimates. Second, we compare the estimates of the ITE to the estimates of the ATE using the trained SCM. This comparison allows us to characterize the statistical efficiency of counterfactual inference when estimating causal effects. We expect that the ITE will provide much more precise estimates.

5. Results

5.1. Case Study 1: The IGF Signaling System

Generating BEL Causal Model. The BEL representation of the IGF system was manually curated using PyBEL [40], to match the existing ODE and SDE. The BEL representation of the IGF system specified all the node types as belonging to the category abundance. All the relationships between parent and child nodes were of type increase, except for the parent node $Akt$, where the relationship was of type decrease.

Observational Data. We mimicked the process of collecting observational data by simulating kinase activity from the corresponding ODE and SDE. The initial number of particles for the receptor was 37 for $EGF$ and 5 for $IGF$. The deterministic simulation numerically solved the ODE using the deSolve [53] R package. The stochastic simulation used the Gillespie algorithm [29] from the smfsb [54] R package.

Appropriateness of Model Assumptions. SCM-based estimates of functional assignments with sigmoid approximations were well within the range of the SDE-based data (as shown for $Raf$ and $Mek$ in Fig. 5). Similar results were obtained for estimates of $Ras$, $PI3K$, $AKT$, $Raf$, and $Erk$. The fitted functional assignment had little curvature. This indicates that a more complicated function with more parameters, such as the Hill equation, was unnecessary in this case.

Fig. 5. — *Case Study 1: IGF Model* Scatter Plot of $Mek$ Versus $Raf$. Blue points are the data points generated by SDE. Yellow points are the estimates from SCM. The red line is the fitted sigmoid curve in Algorithm 2 Line 12.

To further evaluate the plausibility of the assumptions, Fig. 6 shows the histograms of the SDE-generated abundances of root nodes, which were not affected by functional assignments in SCM. The shape of the histograms indicates that the assumption of Normal distribution was plausible.

Accuracy of Causal Effects. Figs. 7c and 7d show that the average treatment effects (ATEs) on $Erk$ of fixing $Mek$ and $Ras$, based on forward simulation of the ODE, SDE and SCM, were consistent. Figs. 7a and 7b show that the ITE based on counterfactual inference has a smaller variance than the ATE. Since counterfactual inference reduces nuisance variation by sharing stochastic components in contexts with and without intervention, it increases the statistical efficiency of the estimation.
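The variance reduction obtained by sharing stochastic components can be illustrated with a toy simulation. The sketch below compares two ways of forming the difference between an intervened and a non-intervened outcome: with independent noise draws in the two worlds, and with the same noise draw reused in both. The assignment for $Erk$, the noise scale, and the distribution of the observed $Mek$ are all hypothetical, and the sketch omits conditioning on a specific data point.

```python
import math
import random
import statistics

def sigmoid(z, beta=100.0):
    return beta / (1.0 + math.exp(-z))

def erk(mek, noise):
    """Hypothetical assignment for Erk with additive noise."""
    return sigmoid(0.1 * mek - 5.0) + noise

rng = random.Random(0)
independent, shared = [], []
for _ in range(2000):
    mek_obs = rng.gauss(60.0, 5.0)
    n1, n2 = rng.gauss(0.0, 3.0), rng.gauss(0.0, 3.0)
    # Forward simulation: the two worlds draw their noise independently.
    independent.append(erk(40.0, n1) - erk(mek_obs, n2))
    # Counterfactual: both worlds reuse the same noise term, which cancels.
    shared.append(erk(40.0, n1) - erk(mek_obs, n1))

sd_independent = statistics.stdev(independent)
sd_shared = statistics.stdev(shared)
print(sd_independent, sd_shared)
```

Because the noise is additive here, it cancels exactly in the shared-noise difference, so the remaining spread comes only from variation in the observed $Mek$.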

Fig. 7. — *Case Study 1: Estimated causal effects of the IGF signaling pathway using Algorithm 3.* The ODE and SDE represent the true underlying dynamics of the IGF signaling pathway. The ODE and SDE-based forward simulation can only estimate the average treatment effect. These estimates are viewed as ground truth. In contrast, an SCM can estimate both the average treatment effect (ATE) and the individual treatment effect (ITE). (a) Comparison of ITE vs ATE for $Erk$ when $Mek$ is fixed. (b) Comparison of ITE vs ATE for $Erk$ when $Ras$ is fixed. (c) Comparison of SCM, SDE and ODE estimates of the ATE for $Erk$ when $Mek$ is fixed. (d) Comparison of SCM, SDE and ODE estimates of the ATE for $Erk$ when $Ras$ is fixed.

The individual treatment effect on $Erk$ of fixing $Mek$ was much stronger than the ITE on $Erk$ of fixing $Ras$, for the following reason. While $Mek$ directly influences $Erk$ (i.e., there is a single path from $Mek$ to $Erk$), $Ras$ has two pathways to $Erk$. The path through $AKT$ has an inhibiting (deactivating) effect on $Raf$, which is reflected in the negative estimated weights in the sigmoid function in eq. (15). The alternative path, a cascade from $Ras$ to $Erk$, has the opposite (activating) effect on $Erk$. The two paths partially offset each other, which attenuates the overall causal effect of $Ras$ on $Erk$.
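The offsetting of an activating and an inhibiting path can be reproduced in a small noise-free sketch. Here $Ras$ acts on $Raf$ both directly and through $PI3K$ and $Akt$, and the weight on $Akt$ is either zero or negative. The topology is a simplification of the IGF system and all weights are hypothetical.

```python
import math

def sigmoid(z, beta=100.0):
    return beta / (1.0 + math.exp(-z))

def raf_given_ras(ras, w_akt):
    """Noise-free Raf when Ras acts directly and through PI3K -> Akt."""
    pi3k = sigmoid(0.1 * ras - 4.5)
    akt = sigmoid(0.1 * pi3k - 5.0)
    return sigmoid(0.1 * ras + w_akt * akt - 4.5)

def effect_of_ras(w_akt, low=30.0, high=60.0):
    """Change in Raf when Ras is moved from `low` to `high`."""
    return raf_given_ras(high, w_akt) - raf_given_ras(low, w_akt)

without_inhibition = effect_of_ras(w_akt=0.0)
with_inhibition = effect_of_ras(w_akt=-0.03)
print(round(without_inhibition, 1), round(with_inhibition, 1))
```

With the negative weight in place, raising $Ras$ also raises $Akt$, which pulls $Raf$ back down, so the net change is much smaller than in the single-path case.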

5.2. Case Study 2: Host Response to Viral Infection

Generating BEL Causal Model. The steps of the proposed Algorithm 1 produced the qualitative causal model in Fig. 8, and the corresponding BEL causal model $B$, as follows. In accordance with the inputs to Algorithm 1, we defined the knowledge base $K$ as the Covid-19 knowledge network automatically assembled from the Covid-19 document corpus using the INDRA workflow. We defined the cause $X^{c}$ as sIL6R$α$, the effect $X^{e}$ as cytokine storm, and the covariates $X^{z}$ as SARS-CoV-2 and Toci. Therefore the causal query of interest was defined as $Q = \{$sIL6R$α$, CytokineStorm, $\{$SARS-CoV-2, Toci$\}\}$.

Algorithm 1 line 2 generated all pathways from Interleukin 6 to Cytokine Release Syndrome, resulting in $kin(p(\text{sIL6R}α)) \to kin(p(\text{IL6-STAT3})) \to bp(\text{IL6-AMP}) \to \text{CytokineStorm}$, where $bp()$ is a biological process. Next, line 5 generated all pathways from Tocilizumab to Interleukin 6: $a(\text{Toci}) \dashv kin(p(\text{sIL6R}α))$, where $a()$ is the dosage level of Tocilizumab. We then generated all pathways from severe acute respiratory syndrome coronavirus 2 to Interleukin 6 receptor: $pop(\text{SARS-CoV-2}) \dashv cat(\text{ACE2}) \dashv a(\text{Angiotensin II}) \to kin(p(\text{AGTR1})) \to kin(p(\text{ADAM17})) \to kin(p(\text{sIL6R}α))$, where $pop()$ is the viral load of SARS-CoV-2 and $cat()$ is the normal catalytic activity of Angiotensin Converting Enzyme 2.

Line 8 found no new branches from Tocilizumab to Cytokine Release Syndrome. Finally, we generated all pathways from severe acute respiratory syndrome coronavirus 2 to Cytokine Release Syndrome, which resulted in three new branches: $pop(\text{SARS-CoV-2}) \to kin(p(\text{PRR})) \to kin(p(\text{NF-}κ\text{B})) \to bp(\text{IL6-AMP})$, $kin(p(\text{ADAM17})) \to p(\text{EGF}) \to kin(p(\text{EGFR})) \to kin(p(\text{NF-}κ\text{B}))$, and $kin(p(\text{EGFR})) \to kin(p(\text{TNF}α)) \to kin(p(\text{NF-}κ\text{B}))$.
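The "generate all pathways" steps amount to enumerating simple directed paths between two nodes. The sketch below does this with a depth-first search over an adjacency list transcribed from the pathways listed above. It is a simplified stand-in for the knowledge-graph query, not the QUERY2BEL implementation; node names are abbreviated, BEL function wrappers are dropped, and edge signs are omitted because only reachability matters here.

```python
# Directed edges transcribed from the pathways described in the text.
EDGES = {
    "SARS-CoV-2": ["ACE2", "PRR"],
    "ACE2": ["Angiotensin II"],
    "Angiotensin II": ["AGTR1"],
    "AGTR1": ["ADAM17"],
    "ADAM17": ["sIL6Ra", "EGF"],
    "EGF": ["EGFR"],
    "EGFR": ["NF-kB", "TNFa"],
    "TNFa": ["NF-kB"],
    "PRR": ["NF-kB"],
    "NF-kB": ["IL6-AMP"],
    "Toci": ["sIL6Ra"],
    "sIL6Ra": ["IL6-STAT3"],
    "IL6-STAT3": ["IL6-AMP"],
    "IL6-AMP": ["CytokineStorm"],
}

def all_simple_paths(graph, source, target, path=None):
    """Return every path from source to target that repeats no node."""
    path = (path or []) + [source]
    if source == target:
        return [path]
    found = []
    for child in graph.get(source, []):
        if child not in path:  # skip nodes already on the current path
            found.extend(all_simple_paths(graph, child, target, path))
    return found

paths = all_simple_paths(EDGES, "SARS-CoV-2", "CytokineStorm")
for p in paths:
    print(" -> ".join(p))
```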

Observational Data. We simulated observational data from a “ground-truth” sigmoidal structural causal model, where the topology reflects the causal structure in Fig. 8, and the parameters reflect our prior qualitative knowledge of the IL6-AMP pathway. The root nodes SARS-CoV-2 and Tocilizumab were sampled from a Normal distribution with mean of 50 and standard deviation of 10. The non-root nodes were sampled from a sigmoid function as in eq. (15). Since we have prior qualitative knowledge that IL6-AMP is only activated due to simultaneous activation of NF-$κ$B and IL6-STAT3, we set the threshold for activation above what could be achieved by NF-$κ$B or IL6-STAT3 alone. Since we also know that Toci is a strong inhibitor of sIL6R$α$, we set the inhibition coefficient to a large negative number. The parameters of the sigmoid function were chosen to ensure that the variables were in the desired range of 0–100. Finally, we randomly generated two new individuals $D^{new}$ with Cytokine Release Syndrome $> 65$ to represent severely ill patients. The first patient had a higher viral load of SARS-CoV-2 and received a lower dose of Toci. The second patient had a lower viral load of SARS-CoV-2 and received a higher dose of Toci.
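The thresholding idea for IL6-AMP can be sketched with a two-parent sigmoid whose bias is set so that neither parent alone, even at its maximum of 100, reaches the activation midpoint. The weights and bias below are hypothetical and are not the values used in the simulation.

```python
import math

def il6_amp(nfkb, stat3, w=0.1, b=-15.0, beta=100.0):
    """Hypothetical sigmoid for IL6-AMP with two activating parents."""
    return beta / (1.0 + math.exp(-(w * nfkb + w * stat3) - b))

# The midpoint sits at nfkb + stat3 = 150, above the single-parent maximum.
alone_nfkb = il6_amp(100.0, 0.0)
alone_stat3 = il6_amp(0.0, 100.0)
both = il6_amp(100.0, 100.0)
print(round(alone_nfkb, 2), round(alone_stat3, 2), round(both, 2))
```

A strong inhibitor such as Toci could be encoded in the same way by giving its weight a large negative value in the assignment for sIL6R$α$.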

Estimation of Individual-Level Treatment Effect. Fig. 9 evaluates the SCM-based estimates of the individual treatment effect of withholding treatment from two COVID-19 patients who were severely ill. The distribution of the individual treatment effect obtained with the SCM trained using Algorithm 2 was consistent with, but had a slightly larger variance than, the distribution of the ITE obtained with the “ground truth” SCM with known weights. Even though both patients had the same severity of illness prior to the intervention, patient B was estimated to have a more severe cytokine storm after Toci was withheld.

Fig. 10 further compared the individual treatment effect obtained with the SCM trained using Algorithm 2 with the average treatment effect estimated from the same model using forward simulation. The distribution of the individual treatment effect was patient-specific and had smaller variance, thus illustrating the statistical efficiency of counterfactual inference.

6. Discussion

We proposed a general approach that leverages structured qualitative prior knowledge, automatically generates a quantitative SCM, and enables answers to counterfactual research questions. In both case studies, the use of the Biological Expression Language allowed us to leverage large repositories of structured biological knowledge to specify an SCM and perform counterfactual inference in an automated manner, which would otherwise require a substantial manual effort. The application to the IGF signaling system demonstrated the appropriateness of the underlying assumptions, and the accuracy of the results when compared to ODE- and SDE-based forward simulation. The application to a study of host response to SARS-CoV-2 infection demonstrated the feasibility, versatility and usefulness of this approach as applied to an urgent public health issue. In particular, the approach can help determine the amount of Tocilizumab (Toci) required to reduce the severity of each individual's cytokine storm. Furthermore, in situations where treatment options are limited (as is the case for SARS-CoV-2), counterfactual estimates enable a more precise conclusion regarding who would likely live without receiving the treatment, who would likely die even if they did receive the treatment, and who would likely live only after receiving the treatment.

The approach opens multiple directions for future research. In particular, future work can extend the configurability of the BEL2SCM algorithm by incorporating the rich type information in BEL, mapping parent-child type signatures to functional forms such as post-nonlinear models, neural networks, mass action kinetics and Hill equations, and incorporating additional data types such as binary variables, categorical variables, and continuous variables with constraints on their domains. In some cases, the variables in the model may not be directly observable, but may nonetheless be characterized by means of detectable molecular signatures. For example, even if interferon signaling may not be directly observable using transcriptomics measurements, it may still be possible to infer the activity of interferon signaling by an upregulation of interferon stimulated genes (ISG). Future work will focus on leveraging molecular signature databases to infer the activity of variables in the model, and on learning and/or evaluating the models using experimental data [55].

We also note that experimentalists typically formulate biological processes as linear pathways (e.g., from $S_{1}$ to $Erk$ in the MAPK example) that can be effectively perturbed and measured in a laboratory setting. Yet such boundaries of biological processes are quite arbitrary, and are therefore highly susceptible to confounders. One way to address this issue is to search the knowledge graph for all common causes of variables in the causal model, use an identification algorithm [56] to find the minimal valid adjustment set of the augmented model, and then prune all common causes that do not contribute to that set. This approach will require us to tackle the issues of parameter and causal identifiability in the presence of confounders.

In addition to unobserved confounders, the validity of causal inferences can be threatened by feedback loops, model misspecification, missing data, and out-of-sample distributions. To address the possibility of feedback loops, we must consider the time scale at which these feedbacks reach steady-state: fast timescale feedback loops can be addressed with the chain graph interpretation of SCMs [57], [58]; intermediate timescale feedbacks can be addressed with non-recursive structural causal models [5]; slow timescale feedback loops can be handled by unrolling the structure of the SCM as is done with dynamic Bayesian networks [59], or simply by representing the entire feedback loop as a biological process, as we did with IL6-AMP. In the case of model misspecification, we will investigate the ability of counterfactual inference to improve the estimation [43]. For missing data, we can leverage causal inference recoverability algorithms that have been published recently [60], and for handling out-of-sample distributions, we can leverage recent results applying causal inference to the problem of external validity [61]. Future work will focus on addressing these threats to validity when applied to real biological data.

Acknowledgments

This work was supported by funds from the PNNL Mathematics and Artificial Reasoning Systems Laboratory Directed Research and Development Initiative. Knowledge curation environments were provided by BioDati.com and Causaly.com. The authors would also like to acknowledge Jessica Stothers and Rose Glavin at CoronaWhy.org and Marek Ostaszewski at the COVID-19 Disease Map Initiative for providing valuable feedback about the IL6-AMP model. Jeremy Zucker, Kaushal Paneri, Sara Mohammad-Taheri contributed equally to this work.

Biographies

Jeremy Zucker is currently the principal investigator for the MARS causal inference for viral pathogenesis project. He has more than 15 years of experience developing causal models to obtain actionable insights from systems biology data to advance knowledge in the study of metabolic engineering, circadian rhythms, evolution, human health and infectious disease.

Kaushal Paneri received the master's degree in data science from Northeastern University. He is a data scientist at Microsoft, currently working on the counterfactual platform for Bing Ads Marketplace Optimization. His main research interests include causality, optimization, and machine learning.

Sara Mohammad-Taheri received the bachelor's and master's degrees in mathematics from the Sharif University of Technology. She is currently working toward the PhD degree in computer science with Northeastern University's Khoury College of Computer Sciences, advised by Professor Olga Vitek. Her research interests include causal inference techniques in computational biology and causal discovery of biomolecular data. She is also interested in developing statistical and computational methods and open source software for systems-wide molecular investigations of biological organisms, including quantitative genomics and proteomics. She is a member of the statistical methods for studies of biomolecular systems group.

Somya Bhargava received the master's degree in data science from Northeastern University. She is currently working with Embedded Healthcare. She has been working in the healthcare industry and is experienced in using natural language processing, machine learning, statistical analysis, and causal inference to research new products and enhance existing ones.

Pallavi Kolambkar received the bachelor's degree in computer science, and the master's degree in computer applications, from India. She majored in data science at Northeastern University and is currently working with Tesla. She has worked with companies from different domains to explore and visualize different dynamics of data.

Craig Bakker received a PhD degree in engineering from the University of Cambridge, where his research focused on optimization algorithms, differential geometry, and computational methods for model decomposition. Following this, he did postdoctoral research in climate change, food security, and economic modelling at Johns Hopkins University. He is currently a research scientist with the Pacific Northwest National Laboratory. He works in game theory, machine learning, and optimal control.

Jeremy Teuton received the PhD degree in cell and molecular biology (virology). He is an experienced interdisciplinary researcher and project leader. He is proficient in experimental design, trouble shooting, data analysis, and interdisciplinary application of scientific principles and approaches including cyber security and signal detection/classification. He excels in challenging environments, where problem-solving skills and experience in adapting technologies, systems and processes/approaches can be of most use.

Charles Tapley Hoyt received the PhD degree in computational life sciences from the University of Bonn. His research interests cover the interface of biocuration, knowledge graphs, and machine learning with systems biology, networks biology, and drug discovery. He is an advocate of open source software, reproducibility, and open science. His open source projects PyBEL and PyKEEN are used by several academic and industrial groups.

Kristie Oxford is a virologist, with expertise in host-pathogen interactions. Her research at Pacific Northwest National Laboratory (PNNL) primarily involves characterizing and interpreting host biomolecular responses to viral infection. She and her team analyze systems biology data from cells infected in vitro and in vivo with mammalian viruses representing many genera and families, in order to understand mechanisms of disease and to identify targets for medical countermeasures. The systems approach interrogates the host transcriptomic response to infection from microarray or RNA sequencing data as well as the proteomic, lipidomic, and metabolomic response from high resolution mass spectrometry analysis. She and her team have studied host-virus interactions from thousands of samples representing more than 12 human viruses, identifying gene, protein, and metabolite candidates for medical intervention and/or mechanistic studies.

Robert Ness received the PhD degree in mathematical statistics from Purdue University, and then he worked as a research engineer in various AI startups. He didn't start in machine learning. He started his career by becoming fluent in Mandarin Chinese and moving to Tibet to do developmental economics fieldwork. He later obtained a graduate degree from Johns Hopkins School of Advanced International Studies. After switching to the tech industry, his interests shifted to modeling data. He has published in journals and venues across these spaces, including Research in Computational Molecular Biology and NeurIPS, on topics including causal inference, probabilistic modeling, sequential decision processes, and dynamic models of complex systems. In addition to startup work, currently he is a machine learning professor with Northeastern University.

Olga Vitek received the PhD degree in statistics from Purdue University. She is currently a professor with the Khoury College of Computer Sciences at Northeastern University. Her research interests include statistical science, machine learning, mass spectrometry and systems biology. Statistical methods and open-source software MSstats and Cardinal developed in her lab are used in academia and industry, and were recently recognized with the Chan Zuckerberg Essential Open Source Software for Science Award. She is a senior member of the International Society for Computational Biology, and an elected member of the Council of HUPO and of the board of directors of USHUPO. She is a member of the editorial advisory board of Molecular and Cellular Proteomics and of Journal of Proteome Research.

Contributor Information

Jeremy Zucker, Email: jeremy.zucker@pnnl.gov.

Kaushal Paneri, Email: kaushalpaneri@gmail.com.

Sara Mohammad-Taheri, Email: mohammadtaheri.s@northeastern.edu.

Somya Bhargava, Email: bhargavasomyav2@gmail.com.

Pallavi Kolambkar, Email: kolambkar.p@husky.neu.edu.

Craig Bakker, Email: craig.bakker@pnnl.gov.

Jeremy Teuton, Email: Jeremy.Teuton@pnnl.gov.

Charles Tapley Hoyt, Email: charles.hoyt@envedatx.com.

Kristie Oxford, Email: kristie.oxford@pnnl.gov.

Robert Ness, Email: robertness@gmail.com.

Olga Vitek, Email: o.vitek@northeastern.edu.

References

[1] Pezeshki A., Ovsyannikova I. G., McKinney B. A., Poland G. A., and Kennedy R. B., “The role of systems biology approaches in determining molecular signatures for the development of more effective vaccines,” Expert Rev. Vaccines, vol. 18, 2019, Art. no. 253.
[2] Pedragosa M., et al., “Linking cell dynamics with gene coexpression networks to characterize key events in chronic virus infections,” Front. Immunol., vol. 10, 2019, Art. no. 1002.
[3] Nguyen V. K., Klawonn F., Mikolajczyk R., and Hernandez-Vargas E. A., “Analysis of practical identifiability of a viral infection model,” PloS One, vol. 11, 2016, Art. no. e0167568.
[4] Arazi A., Pendergraft W. F., Ribeiro R. M., Perelson A. S., and Hacohen N., “Human systems immunology: Hypothesis-based modeling and unbiased data-driven approaches,” Seminars Immunol., vol. 25, 2013, Art. no. 193.
[5] Pearl J., Causality: Models, Reasoning and Inference. Cambridge, MA, USA: Cambridge Univ. Press, 2013.
[6] Peters J., Janzing D., and Schölkopf B., Elements of Causal Inference: Foundations and Learning Algorithms. Cambridge, MA, USA: MIT Press, 2017.
[7] Allen J. F., Swift M., and De Beaumont W., “Deep semantic analysis of text,” Proc. Conf. Semantics Text Process., 2008, vol. 1, Art. no. 343.
[8] McDonald D. D., “Issues in the Representation of Real Texts: The Design of KRISP,” in Proc. Natural Lang. Process. Knowl. Representation: Lang. Knowl. Knowl. Lang., 2000, pp. 77–110.
[9] Valenzuela-Escárcega M. A., et al., “Large-scale automated machine reading discovers new cancer-driving mechanisms,” Database, vol. 2018, 2018, Art. no. 1.
[10] Gyori B. M., Bachman J. A., Subramanian K., Muhlich J. L., Galescu L., and Sorger P. K., “From word models to executable models of signaling networks using automated assembly,” Mol. Syst. Biol., vol. 13, 2017, Art. no. 954.
[11] Hoyt C. T., et al., “Re-curation and rational enrichment of knowledge graphs in biological expression language,” Database, vol. 2019, 2019, Art. no. baz068.
[12] Cerami E. G., et al., “Pathway Commons, a web resource for biological pathway data,” Nucl. Acids Res., vol. 39, pp. D685–D690, 2011.
[13] Fabregat A., et al., “The Reactome pathway knowledgebase,” Nucl. Acids Res., vol. 46, pp. D649–D655, 2018.
[14] Kanehisa M., Furumichi M., Tanabe M., Sato Y., and Morishima K., “KEGG: New perspectives on genomes, pathways, diseases and drugs,” Nucl. Acids Res., vol. 45, pp. D353–D361, 2017.
[15] Perfetto L., et al., “SIGNOR: A database of causal relationships between biological entities,” Nucl. Acids Res., vol. 44, pp. D548–D554, 2016.
[16] Slenter D. N., et al., “WikiPathways: A multifaceted pathway database bridging metabolomics to other omics research,” Nucl. Acids Res., vol. 46, pp. D661–D667, 2018.
[17] Demir E., et al., “The BioPAX community standard for pathway data sharing,” Nat. Biotechnol., vol. 28, 2010, Art. no. 1308.
[18] Hucka M., et al., “The Systems Biology Markup Language (SBML): Language specification for level 3 version 2 core,” J. Integrative Bioinf., 2018, Art. no. 20170081.
[19] Le Novere N., et al., “The systems biology graphical notation,” Nat. Biotechnol., vol. 27, pp. 735–741, 2009.
[20] Slater T., “Recent advances in modeling languages for pathway maps and computable biological networks,” Drug Discov. Today, vol. 19, pp. 193–198, 2014.
[21] Machamer P., Darden L., and Craver C. F., “Thinking about mechanisms,” Philosophy Sci., vol. 67, 2000, Art. no. 1.
[22] Li Y., Roberts J., AkhavanAghdam Z., and Hao N., “Mitogen-activated protein kinase (MAPK) dynamics determine cell fate in the yeast mating response,” The J. Biol. Chem., vol. 292, pp. 20354–20361, 2017.
[23] Chen L., Wang R., Li C., and Aihara K., Modeling Biomolecular Networks in Cells: Structures and Dynamics. Berlin, Germany: Springer, 2010.
[24] Gratie D., Iancu B., and Petre I., “ODE analysis of biological systems,” in International School on Formal Methods for the Design of Computer, Communication and Software Systems. Berlin, Germany: Springer, 2013, Art. no. 29.
[25] Bianconi F., Baldelli E., Ludovini V., Crino L., Flacco A., and Valigi P., “Computational model of EGFR and IGF1R pathways in lung cancer: A systems biology approach for translational oncology,” Biotechnol. Adv., vol. 30, pp. 142–153, 2012.
[26] Kim E. K. and Choi E.-J., “Pathological roles of MAPK signaling pathways in human diseases,” Biochimica et Biophysica Acta - Mol. Basis Disease, vol. 1802, pp. 396–405, 2010.
[27] Ness R., Paneri K., and Vitek O., “Integrating Markov processes with structural causal modeling enables counterfactual inference in complex systems,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, Art. no. 14211.
[28] Paneri K., “Integrating Markov process and structural causal models enables counterfactual inference in complex systems,” Northeastern Univ., 2019.
[29] Gillespie D. T., “Exact stochastic simulation of coupled chemical reactions,” The J. Phys. Chem., vol. 81, pp. 2340–2361, 1977.
[30] Jha S. K. and Langmead C. J., “Exploring behaviors of stochastic differential equation models of biological systems using change of measures,” BMC Bioinf., vol. 13, 2012, Art. no. S8.
[31] Alon U., An Introduction to Systems Biology: Design Principles of Biological Circuits. Boca Raton, FL, USA: CRC Press, 2019.
[32] Bongers S. and Mooij J. M., “From random differential equations to structural causal models: The stochastic case,” ArXiv, vol. abs/1803.08784, 2018.
[33] Jerrum M., Sinclair A., and Hochbaum D. S., “The Markov chain Monte Carlo method: An approach to approximate counting and integration,” in Approximation Algorithms for NP-Hard Problems, PWS Publishing, 1996.
[34] Gelfand A. E., “Gibbs sampling,” J. Amer. Statist. Assoc., vol. 95, 2000, Art. no. 1300.
[35] Hoffman M. D. and Gelman A., “The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo,” J. Mach. Learn. Res., vol. 15, pp. 1593–1623, 2014.
[36] Hoffman M. D., Blei D. M., Wang C., and Paisley J., “Stochastic variational inference,” The J. Mach. Learn. Res., vol. 14, pp. 1303–1347, 2013.
[37] Blom T., Bongers S., and Mooij J. M., “Beyond structural causal models: Causal constraints models,” in Proc. 35th Conf. Uncertainty Artif. Intell., 2019, pp. 585–594.
[38] Madan S., et al., “The extraction of complex relationships and their conversion to biological expression language (BEL) overview of the BioCreative VI (2017) BEL track,” Database, J. Biol. Databases Curation, vol. 2019, 2019, Art. no. baz084.
[39] Hoyt C. T., et al., “Integration of structured biological data sources using biological expression language,” BioRxiv, Cold Spring Harbor Lab., pp. 631812, 2019.
[40] Hoyt C. T., Konotopez A., Ebeling C., and Wren J., “PyBEL: A computational framework for biological expression language,” Bioinformatics, vol. 34, pp. 703–704, 2018.
[41] Mi H., et al., “Systems biology graphical notation: Activity flow language level 1 version 1.2,” J. Integrative Bioinf., vol. 12, 2015, Art. no. 265.
[42] Rezende D. J., Mohamed S., and Wierstra D., “Stochastic backpropagation and variational inference in deep latent Gaussian models,” in Proc. Int. Conf. Mach. Learn., vol. 2, 2014.
[43] Ness R., Paneri K., and Vitek O., “Integrating Markov processes with structural causal modeling enables counterfactual inference in complex systems,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, Art. no. 14234.
[44] Bingham E., et al., “Pyro: Deep Universal Probabilistic Programming,” J. Mach. Learn. Res., vol. 20, pp. 1–6, 2018.
[45] Ulhaq Z. S. and Soraya G. V., “Interleukin-6 as a potential biomarker of COVID-19 progression,” Med. Mal. Infect., vol. 50, pp. 382–383, 2020.
[46] Hirano T. and Murakami M., “COVID-19: A new virus, but a familiar receptor and cytokine release syndrome,” Immunity, vol. 52, pp. 731–733, 2020.
[47] Murakami M. and Hirano T., “The pathological and physiological roles of IL-6 amplifier activation,” Int. J. Biol. Sci., vol. 8, pp. 1267–1280, 2012.
[48] Ogura H., et al., “Interleukin-17 promotes autoimmunity by triggering a positive-feedback loop via interleukin-6 induction,” Immunity, vol. 29, pp. 628–636, 2008.
[49] Oldfield V., Dhillon S., and Plosker G. L., “Tocilizumab: A review of its use in the management of rheumatoid arthritis,” Drugs, vol. 69, pp. 609–632, 2009.
[50] Zhang C., Wu Z., Li J.-W., Zhao H., and Wang G.-Q., “Cytokine release syndrome in severe COVID-19: Interleukin-6 receptor antagonist Tocilizumab may be the key to reduce mortality,” Int. J. Antimicrob. Agents, vol. 55, 2020, Art. no. 105954.
[51] Coomes E. A. and Haghbayan H., “Interleukin-6 in COVID-19: A systematic review and meta-analysis,” MedRxiv, Cold Spring Harbor Lab. Press, 2020.
[52] Xu X., et al., “Effective Treatment of Severe COVID-19 Patients with Tocilizumab,” Proc. Nat. Acad. Sci. USA, vol. 117, pp. 10970–10975, 2020.
[53] Soetaert K. E. R., Petzoldt T., and Setzer R. W., “Solving differential equations in R: Package deSolve,” J. Statist. Softw., vol. 33, pp. 77–83, 2010.
[54] Wilkinson D., “Smfsb-stochastic modelling for systems biology,” R Package Version, vol. 1, 2018.
[55] Liu A., Trairatphisan P., Gjerga E., Didangelos A., Barratt J., and Saez-Rodriguez J., “From expression footprints to causal pathways: Contextualizing large signaling networks with CARNIVAL,” Syst. Biol. Appl., vol. 5, 2019, Art. no. 40.
[56] Tikka S. and Karvanen J., “Identifying causal effects with the R package causaleffect,” J. Statist. Softw., vol. 76, 2017, Art. no. 1.
[57] Lauritzen S. L. and Richardson T. S., “Chain graph models and their causal interpretations,” J. Roy. Statist. Soc.: Series B, vol. 64, pp. 321–348, 2002.
[58] Sherman E. and Shpitser I., “Identification and estimation of causal effects from dependent data,” Proc. Int. Conf. Neural Inf. Process. Syst., 2018, vol. 2018, Art. no. 9446.
[59] Koller D. and Friedman N., Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. Cambridge, MA, USA: MIT Press, 2009.
[60] Nabi R., Bhattacharya R., and Shpitser I., “Full law identification in graphical models of missing data: Completeness results,” 2020, arXiv:2004.04872.
[61] Bareinboim E. and Pearl J., “Causal inference and the data-fusion problem,” Proc. Nat. Acad. Sci. USA, vol. 113, 2016, Art. no. 7345.
