Summary
Directed acyclic graphs (DAGs) have been used to describe causal relationships between variables. The standard method for determining such relations uses interventional data. For complex systems with high-dimensional data, however, such interventional data are often not available. Therefore, it is desirable to estimate causal structure from observational data without subjecting variables to interventions. Observational data can be used to estimate the skeleton of a DAG and the directions of a limited number of edges. We develop a Bayesian framework to estimate a DAG using surrogate interventional data, where the interventions are applied to a set of external variables, and thus such interventions are considered to be surrogate interventions on the variables of interest. Our work is motivated by expression quantitative trait locus (eQTL) studies, where the variables of interest are the expression of genes, the external variables are DNA variations, and interventions are applied to DNA variants during the process of a randomly selected DNA allele being passed to a child from either parent. Our method, surrogate intervention recovery of a DAG (
), first constructs a DAG skeleton using penalized regressions and subsequent partial correlation tests, and then estimates the posterior probabilities of all the edge directions after incorporating DNA variant data. We demonstrate the utility of
by simulation and an application to an eQTL study for 550 breast cancer patients.
Keywords: Directed acyclic graphs, eQTL, Surrogate intervention
1. Introduction
A directed acyclic graph (DAG) is a useful tool for studying conditional independence relations among a set of random variables, where each vertex represents a random variable and each directed edge represents a conditional dependence relation, with a restriction of no directed cycles. We consider a DAG
for a set of random variables
, with vertices
corresponding to variables
, and a collection of directed edges
. We assume that
follows a multivariate Gaussian distribution
. We denote the parent vertices of vertex
by
and the corresponding random variables by
, and then can decompose the likelihood of
, based on the Markov property that
is independent of all the remaining variables given its parents
:
$$f(X_1, \ldots, X_p) = \prod_{i=1}^{p} f\left(X_i \mid X_{\mathrm{pa}_i}\right) \tag{1.1}$$
Observational data provide the conditional independence relations among the random variables. Since a set of conditional independence restrictions may be compatible with multiple DAGs, observational data can identify only the skeleton of a DAG (i.e., all the edges in a DAG without directions) and the directions of a limited number of edges. The collection of all the DAGs corresponding to the same set of conditional independence restrictions constitutes a Markov equivalence class; observational data can therefore identify a Markov equivalence class but cannot distinguish the DAGs within it. For example, the DAGs
,
, and
all represent the (conditional) independence/dependence that
,
,
, and
, where
indicates that
and
are dependent, and
indicates that
is independent of
given
. Therefore, these three DAGs constitute a Markov equivalence class and cannot be distinguished by observational data.
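The point can be checked numerically. The following is a minimal sketch, assuming a three-variable chain and a three-variable fork with an arbitrary coefficient of 0.8; the helper `partial_corr` and all simulation settings are illustrative assumptions, not part of the method described here. Both structures show a clear marginal correlation between the two outer variables and a partial correlation near zero given the middle variable, so the data cannot tell them apart.

```python
import numpy as np

def partial_corr(data, i, j, cond):
    """Partial correlation of columns i and j given the columns listed in cond."""
    idx = [i, j] + list(cond)
    prec = np.linalg.inv(np.cov(data[:, idx], rowvar=False))
    return -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])

rng = np.random.default_rng(0)
n = 20000

# Chain: X1 -> X2 -> X3
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(size=n)
chain = np.column_stack([x1, x2, x3])

# Fork: X1 <- X2 -> X3
y2 = rng.normal(size=n)
y1 = 0.8 * y2 + rng.normal(size=n)
y3 = 0.8 * y2 + rng.normal(size=n)
fork = np.column_stack([y1, y2, y3])

for name, d in [("chain", chain), ("fork", fork)]:
    marginal = partial_corr(d, 0, 2, [])      # plain correlation of X1 and X3
    conditional = partial_corr(d, 0, 2, [1])  # correlation of X1 and X3 given X2
    print(f"{name}: corr(X1,X3)={marginal:.3f}, corr(X1,X3|X2)={conditional:.3f}")
```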
In this article, we relate the probabilistic DAG that describes conditional independences among genes to a causal DAG representation of genes under the causal Markov and causal faithfulness assumptions (Spirtes and others, 2000). We also need to assume that there are no unmeasured confounders for relations among
. For example, in the causal DAG,
, the structure,
can be interpreted as meaning that
is the common cause of effects
and
. Extensive efforts have been devoted to estimating causally sufficient DAG structures using observational data. These efforts have produced, for example, the PC algorithm (named after its authors, Peter and Clark) (Spirtes and others, 2000), the PC-stable algorithm that resolves the order-dependency of the PC algorithm (Colombo and Maathuis, 2012), the greedy equivalence search algorithm (Chickering, 2003), and penalized regression methods (Schmidt and others, 2007; Ha and others, 2016). In this article, we aim to estimate DAGs in another setting, a “surrogate experiment” (Bareinboim and Pearl, 2012). In a surrogate experiment, the variables of interest can be influenced by another group of variables that may be subject to interventions, and thus the interventions do not directly apply to the variables of interest. We refer to such interventions as surrogate interventions.
Our method is motivated by the problem of studying gene–gene networks using gene expression quantitative trait locus (eQTL) data. In an eQTL study, both gene expression and the genotypes of DNA variants are collected in a common set of samples. To simplify the discussion in this article, we assume the DNA variants are single nucleotide polymorphisms (SNPs), though other types of DNA variants, such as copy number variants or indels (small insertions or deletions), also can be used.
We use a DAG to describe the gene–gene network, where each vertex is the expression of a gene, and each edge indicates a causal relation between two genes. Interventions on SNP genotypes can influence gene expression, and thus are surrogate interventions. Direct intervention on an SNP genotype is often impossible. The randomization of the SNP alleles passing to daughter cells during meiosis, however, is analogous to a randomized experiment (Li and others, 2006). This is the so-called Mendelian randomization (MR), which has been applied to instrumental variable analysis (IVA), where genetic markers (SNPs) are instrumental variables to assess the effect of an exposure (e.g., gene expression) on an outcome (Burgess and others, 2017). The IVA for MR is designed to facilitate a randomized trial to control for confounding variables, except that the random allocation of the exposure is performed by nature, rather than by an investigator. The key assumption is that the SNPs have an effect on the exposure, but not on the outcome, except through the exposure. In this article, we assume that there are no unobserved confounders, and thus MR mainly helps us justify that the association between an SNP and a gene has the causal interpretation SNP → gene, which is consistent with the biological principle that DNA can affect gene expression but gene expression can rarely influence DNA sequence.
Previous studies have used eQTL data to dissect the causal relations among three variables, including an eQTL, a gene expression trait, and a third variable, which can be a clinical phenotype (Schadt and others, 2005), another gene expression trait (Kulp and Jagalur, 2006; Chen and others, 2007), or the activity of a transcription factor (Sun and others, 2007). Neto and others (2008) first estimated the skeleton of a DAG, and then used eQTLs to orient the edges in the network. Then they extended this method to jointly estimate causal networks of gene expression traits and the underlying genetic architecture using Bayesian model averaging and a modified Metropolis–Hastings algorithm (Neto and others, 2010). Hageman and others (2011) jointly estimated gene–gene networks and eQTLs using a Bayesian method, while placing constraints on the network through a structural prior. Another approach for graphical model estimation is the use of structural equation models that permit both cyclic and acyclic graphs. Li and others (2006) employed a score-based model selection method. Logsdon and Mezey (2010) estimated network skeletons by applying an adaptive lasso regression for each gene expression trait, and then transformed the skeleton into a DAG or a directed cyclic graph, based on eQTL perturbations. Cai and others (2013) extended the work of Logsdon and Mezey (2010) by providing the initial parameter estimates for the adaptive lasso from penalized regressions using a lasso penalty.
Despite the success of previous work, using high-dimensional gene expression data with a limited sample size to estimate DAGs remains challenging. We propose a high-dimensional DAG estimation method, surrogate intervention recovery of a DAG (
) that integrates an upstream external dataset to learn causal structure. The
incorporates recent methodological developments on skeleton estimation and eQTL mapping. Specifically, we employ the PenPC method (Ha and others, 2016) to estimate the network skeleton, which uses penalized regression and shows substantial advantages over existing methods, such as the PC-stable algorithm for estimating high-dimensional DAGs (Colombo and Maathuis, 2012). We also employ a recently developed method to identify cis-eQTLs, using allele-specific expression (ASE) of RNA-seq data (Sun, 2012). A local or distant eQTL identified by standard eQTL analysis using total expression, rather than ASE, may impact gene expression indirectly by modifying the expression of other genes or biological processes. In contrast, a cis-eQTL identified from ASE must have a direct influence on gene expression (Doss and others, 2005). Through simulation studies under various settings, we demonstrate the improved performance of
over other existing methods, even when only a small proportion of genes in the network have eQTLs. We illustrate the prognostic utility of our method using The Cancer Genome Atlas (TCGA) breast cancer data. The computational time of
is discussed in Section S1 of the supplementary material available at Biostatistics online.
2. Methods
2.1. Notations
We study the causal relations of
variables
by a DAG
. We assume that
, where the distribution,
, is Markov and faithful to the causal DAG,
. We assume Gaussian distribution on
that is Markov and faithful to the true unknown underlying causal DAG (Spirtes and others, 2000). We further assume that we observe all the relevant variables, and that there are no hidden confounding variables for the relations among
. Let
be an additional set of variables, such that they are direct causes of the variables
and subject to interventions. Let
be the set of vertices that are associated with at least one variable in
. For example, in eQTL studies,
is the set of genes with at least one eQTL. For any
,
denotes the
variables that directly influence
(
for all
). Let
be the sample size and denote the
observed data matrix of
by
that is centered for each column to have mean 0. Following the notations in Bareinboim and Pearl (2012), we denote the interventional values on variables
by
. The sufficient conditions for interventions on
to be surrogate experiments require that for each
,
has no direct effect on variables
for
(Pearl, 2000). In this study, we assume there are no hidden variables, and thus this assumption is testable. We did not pursue such testing because we restrict the eQTLs to be cis-eQTLs, and cis-regulatory elements in the human genome are rarely shared across genes.
2.2. Background: inductive causation algorithm
We briefly review the framework of the inductive causation algorithm (ICA). Using observational data, the Markov equivalence class of a DAG is identifiable. Two DAGs are Markov equivalent if and only if they have the same skeleton and the same v-structures (Andersson and others, 1997). The skeleton of a DAG
is an undirected graph that is obtained by removing the directions of all the edges in the DAG. We denote the skeleton of
by
where
. A v-structure is an ordered triplet of vertices
, such that
contains the directed edges
and
, and
and
are not adjacent in
. Given a Markov equivalence class, a directed edge is compelled if this edge exists in every DAG in the equivalence class; it is reversible otherwise. The edges participating in v-structures of a DAG are compelled edges. A Markov equivalence class can be represented by a completed partially directed acyclic graph (CPDAG), which is a partially directed acyclic graph (PDAG) consisting of directed edges for all compelled edges in the equivalence class, and undirected edges for all reversible edges in the equivalence class (Chickering, 2002).
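The equivalence criterion above (same skeleton and same v-structures) is straightforward to check in code. The sketch below is illustrative only; the edge-set representation and the function names `skeleton`, `v_structures`, and `markov_equivalent` are assumptions introduced here.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of a DAG given as a set of (parent, child) pairs."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """All triplets (i, k, j) with i -> k <- j where i and j are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, pa in parents.items():
        for i, j in combinations(sorted(pa), 2):
            if frozenset((i, j)) not in skel:
                found.add((i, child, j))
    return found

def markov_equivalent(edges_a, edges_b):
    """Same skeleton and same v-structures (Andersson and others, 1997)."""
    return (skeleton(edges_a) == skeleton(edges_b)
            and v_structures(edges_a) == v_structures(edges_b))

chain = {(1, 2), (2, 3)}     # 1 -> 2 -> 3
fork = {(2, 1), (2, 3)}      # 1 <- 2 -> 3
collider = {(1, 2), (3, 2)}  # 1 -> 2 <- 3

print(markov_equivalent(chain, fork))      # True
print(markov_equivalent(chain, collider))  # False
```

The collider is the only orientation of this skeleton that forms a v-structure, which is why it sits in its own equivalence class.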
The ICA to estimate a CPDAG from observational data consists of three steps: (1) estimation of the skeleton
, usually by conditional independence testing, (2) v-structure identification, and (3) completion of the PDAG obtained from (1) and (2) to obtain the CPDAG (Pearl, 2009). Both Steps (2) and (3) are deterministic. More specifically, at Step (2), a triplet
is assigned a v-structure
, if
,
,
, and
is not included in the conditioning set that makes
and
independent. The completion in Step (3) is to orient the undirected edges as much as possible, with restrictions of no directed cycles and no extra v-structures, which can be done by applying the rules in Meek (1995).
In our proposed method, we directly use the surrogate intervention for the edge orientation of the skeleton estimated in Step (1). For the estimation of the skeleton, the PC algorithm (Spirtes and others, 2000; Kalisch and Bühlmann, 2007) starts from a completely connected graph where all pairs of genes are connected, and iteratively removes edges by testing for zero partial correlations, starting from zero-order (marginal) up to
order, with sample size
with a significance level
. The PC algorithm requires an exhaustive search, especially for edges that are present in the skeleton; the corresponding pair of variables is tested for conditional independence given all possible subsets of the other variables. Thus, estimating high-dimensional DAGs without prior knowledge of their structures is a challenging problem. The PenPC algorithm (Ha and others, 2016) for DAG skeleton estimation employs a neighborhood selection method to choose a Markov blanket of each vertex, and then a greatly reduced number of partial-correlation tests are performed to remove false positive edges between co-parents of v-structures. The PenPC algorithm improves computational efficiency over the PC algorithm in the following areas: (1) PenPC restricts the search space to the selected Markov blanket for the subsequent partial-correlation tests; (2) the neighborhood selection approach, which converts the Gaussian graphical model (GGM) into a more tractable node-wise multiple regression model, enables parallel computing, so the computational time can be substantially reduced if multicore processors are available; and (3) for the second stage of PenPC, we further reduced the number of partial correlation tests needed to declare an edge in the skeleton by removing redundant conditioning sets for testing. Moreover, for the neighborhood selection, we employed the log penalty
(Mazumder and others, 2011), which significantly improves the accuracy of the Markov blanket search for high-dimensional problems (Sun and others, 2010), and we provided theoretical justifications of the estimation consistency of high-dimensional DAG skeletons (Ha and others, 2016).
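For readers unfamiliar with the partial-correlation tests used by PC-type algorithms, a minimal sketch of the usual Fisher z-test for a zero partial correlation under Gaussianity follows. The function name `fisher_z_pvalue` and the simulated chain are assumptions for illustration; this is not the PenPC implementation.

```python
import math
import numpy as np

def fisher_z_pvalue(data, i, j, cond):
    """Two-sided p-value for H0: partial correlation of i and j given cond is zero."""
    n = data.shape[0]
    idx = [i, j] + list(cond)
    prec = np.linalg.inv(np.cov(data[:, idx], rowvar=False))
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])
    r = min(max(r, -0.999999), 0.999999)         # guard against |r| = 1
    z = 0.5 * math.log((1 + r) / (1 - r))        # Fisher z-transform
    stat = math.sqrt(n - len(cond) - 3) * abs(z)  # approximately N(0, 1) under H0
    return math.erfc(stat / math.sqrt(2))         # equals 2 * (1 - Phi(stat))

# Illustrative data from a chain X1 -> X2 -> X3 (coefficients are arbitrary).
rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(size=n)
data = np.column_stack([x1, x2, x3])

p_12 = fisher_z_pvalue(data, 0, 1, [])            # true edge: expect a tiny p-value
p_13 = fisher_z_pvalue(data, 0, 2, [])            # marginally dependent
p_13_given_2 = fisher_z_pvalue(data, 0, 2, [1])   # independent given X2
print(p_12, p_13, p_13_given_2)
```

An edge is removed from the working skeleton when some conditioning set yields a p-value above the chosen significance level.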
2.3. Edge orientation given surrogate experiments
Now we consider orienting the edges in the skeleton
. To simplify the notations, we assume each variable
is causally affected by a set of external variables
, while
can be an empty set. Given the external variables, the factorization of the joint probability density becomes
![]() |
(2.1) |
where
includes
fixed intervention values. We will orient edges based on this augmented likelihood function. Each conditional distribution in the product term of equation (2.1) can be expressed by the following linear regression: given
,
![]() |
(2.2) |
where
is
sub-matrix of
corresponding to the parent set of
from
,
is the vector of corresponding regression coefficients,
is the vector of regression coefficients for the intervention values
, and
. We assume linearity for relations between genes and eQTLs, because both the PC-stable and PenPC algorithms are based on a linearity assumption between a node and its parents, and linear relations between gene expression and eQTLs are often assumed to be reasonable approximations in eQTL studies (Kendziorski and Wang, 2006; Ritchie and others, 2015).
To orient the skeleton
, we aim to estimate the posterior probability of
versus
for an undirected edge
in
by Bayesian model averaging (Hoeting and others, 1999). Suppose
is a DAG that can be constructed from the skeleton
by adding directions. Let
be the set of parents of
in graph
, which can be an empty set if
has no parent in graph
. Given
, let
be all the parameters, including
for all
in the model (2.2). Denoting
as the data including
and
, and letting
be an indicator function,
![]() |
(2.3) |
where
for
are all possible DAGs, given the skeleton
. Given
,
is either 0 or 1, thus
. The posterior of a DAG,
is computed by
![]() |
assuming all the
’s have the same prior probability. This is reasonable, since they all have the same skeleton. The marginal probability can be expressed as the following integral:
![]() |
(2.4) |
Instead of using the marginal probability in (2.4), we replace
by
, where
is the maximum a posteriori (MAP) estimate under uniform priors,
(i.e., maximum likelihood estimate). Therefore,
![]() |
(2.5) |
where
![]() |
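To make the plug-in idea concrete, the following sketch compares the two orientations of a single edge by their maximized Gaussian likelihoods under node-wise linear regressions, with equal prior weight on the two DAGs. It is a toy illustration under stated assumptions: two genes, one simulated eQTL acting on gene 1, arbitrary effect sizes, and a hypothetical helper `max_loglik`. It is not the authors' implementation.

```python
import numpy as np

def max_loglik(y, X):
    """Maximized Gaussian log-likelihood of the regression of y on the columns of X."""
    n = y.shape[0]
    if X.shape[1] == 0:
        resid = y
    else:
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
    sigma2 = resid @ resid / n                    # maximum likelihood variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

rng = np.random.default_rng(1)
n = 2000
z = rng.binomial(2, 0.3, size=n).astype(float)   # genotype of an eQTL for gene 1
x1 = 0.8 * z + rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)               # true direction: x1 -> x2
z, x1, x2 = (v - v.mean() for v in (z, x1, x2))  # center, so no intercept is needed
empty = np.empty((n, 0))

# With the eQTL: z is a parent of x1 in both candidate DAGs.
ll_a = max_loglik(x1, z[:, None]) + max_loglik(x2, x1[:, None])              # x1 -> x2
ll_b = max_loglik(x2, empty) + max_loglik(x1, np.column_stack([x2, z]))      # x2 -> x1
log_w = np.array([ll_a, ll_b])
post = np.exp(log_w - log_w.max())
post /= post.sum()                               # weights under equal DAG priors
print("posterior weights with eQTL:", post)

# Without the eQTL the two orientations are Markov equivalent: identical fit.
ll_fwd = max_loglik(x1, empty) + max_loglik(x2, x1[:, None])
ll_rev = max_loglik(x2, empty) + max_loglik(x1, x2[:, None])
print("log-likelihood gap without eQTL:", ll_fwd - ll_rev)
```

The eQTL breaks the tie because the reversed DAG implies that the genotype is marginally independent of gene 2, which the simulated data contradict.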
As an alternative approach, we consider using a conjugate normal-inverse gamma prior for the parameters
. The model in (2.2) is re-expressed as: for each node
,
![]() |
where
and
. Let
. We consider normal-gamma priors: for each
,
![]() |
(2.6) |
where
,
, and
are fixed positive values. The variance structure of
is defined according to Zellner’s g-prior. A larger
provides more uncertainty that the coefficients are zero, and we employ the unit information prior by setting
. The marginal probability in (2.4) under the normal-inverse gamma priors can be derived as
![]() |
where
and
.
We call our methods, which use MAP and normal-gamma priors,
and
, respectively. Using the marginal likelihood,
in
is the standard Bayesian approach for model selection and penalizes the likelihood for increasing model complexity. In our framework, however, we assume that the skeleton is sparse (i.e., the number of parents for each node is low), and a limited number of eQTLs with strong gene signals are used for inferring the directions of the skeleton. In simulation studies (Section 2.5), we compared the performance of these two approaches and found no significant difference, even when there were multiple eQTLs.
Finally, we use posterior probabilities of the two possible directions for an edge in
. We select the direction
, if the ratio
for a cutoff
, with a constraint of no directed cycle. Let
. The values
can be considered as estimates of the local false discovery rate (FDR) (Efron and Tibshirani, 2002), if the direction
is called a discovery. Given a desired FDR level
, the set of directed edges,
, for
and
are called discoveries. The expected FDR for a given
is
![]() |
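One common way to turn local FDR estimates into a discovery set is sketched below: treat one minus the posterior of the called direction as its local FDR, and keep the largest set of most confident calls whose average local FDR stays within the target. This is an assumption about how such a rule can be implemented, not a transcription of the expression above; the function name `select_by_fdr` and the example posteriors are hypothetical.

```python
def select_by_fdr(posteriors, alpha):
    """Pick directed edges so that the average local FDR is at most alpha.

    posteriors: dict mapping a called direction (i, j) to its posterior
    probability, with one entry per undirected edge of the skeleton.
    """
    ranked = sorted(posteriors.items(), key=lambda kv: 1 - kv[1])
    selected, total = [], 0.0
    for edge, prob in ranked:
        # The running mean is non-decreasing, so we can stop at the first violation.
        if (total + (1 - prob)) / (len(selected) + 1) > alpha:
            break
        total += 1 - prob
        selected.append(edge)
    expected_fdr = total / len(selected) if selected else 0.0
    return selected, expected_fdr

example = {(1, 2): 0.99, (2, 3): 0.95, (3, 4): 0.80, (4, 5): 0.60}
selected, expected_fdr = select_by_fdr(example, alpha=0.10)
print(selected, round(expected_fdr, 4))
```

In this example the first three directions are kept, since adding the fourth would push the average local FDR above 0.10.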
2.4. Sampling DAGs
The model averaging across all possible
DAGs in equation (2.3) is not computationally feasible for a network skeleton with a large number of edges. For example, a skeleton with 10 undirected edges admits up to $2^{10} = 1024$ directed structures if the acyclicity constraint is ignored. Therefore, when there are a large number of undirected edges in the skeleton, instead of evaluating all the possible DAGs, we evaluate a large number of high-likelihood DAGs. To construct a high-likelihood DAG from a skeleton, we sequentially orient undirected edges one by one, starting from edges that involve genes with eQTLs, using posteriors evaluated under local graphs. Let the current PDAG for
be
. For an undirected edge
in
, a directed local graph for
is defined by
. Denote
as the data including
and
. The posteriors for edge directions are computed by
![]() |
where the priors are assumed to be
. If two nodes are connected by an undirected edge, and at least one of these two genes has eQTLs, we refer to such an undirected edge as a starting edge. We identify a high-likelihood DAG in two steps:
Step 1. Set
as the skeleton. For each of the starting edges, update the direction of
in
by randomly selecting a direction
or
with posterior probabilities
and
, subject to acyclic constraints among the directions of all starting edges. Note that
and
are null sets, as
is the skeleton in the first step.
Step 2. Sample one of the nodes that have eQTLs. Orient the edges to the neighboring nodes (i.e., those connected by undirected edges), say
of the selected node, based on their posterior probabilities of
and
, subject to the acyclic constraint. Then set
as the neighboring nodes of
, and apply the same procedure. Repeat until
is an empty set. This procedure is performed for each connected component of the skeleton.
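The two steps can be sketched as a sequential sampler that draws a direction for each edge and flips it whenever the draw would close a directed cycle. This is a simplified illustration: the local posteriors are passed in as fixed, hypothetical numbers, whereas in the procedure above they are recomputed under the current local graph. The names `creates_cycle` and `sample_dag` are assumptions.

```python
import random

def creates_cycle(directed, tail, head):
    """True if adding tail -> head to the directed edge set would close a cycle."""
    stack, seen = [head], set()
    while stack:
        node = stack.pop()
        if node == tail:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(child for parent, child in directed if parent == node)
    return False

def sample_dag(edge_order, prob_forward, rng):
    """Orient undirected edges one by one, in the given visiting order.

    edge_order: list of (i, j) pairs, starting edges first.
    prob_forward: dict mapping (i, j) to a (hypothetical) local posterior of i -> j.
    """
    directed = set()
    for i, j in edge_order:
        forward = rng.random() < prob_forward[(i, j)]
        tail, head = (i, j) if forward else (j, i)
        if creates_cycle(directed, tail, head):
            # In an acyclic graph at most one of the two directions closes a cycle.
            tail, head = head, tail
        directed.add((tail, head))
    return directed

edges = [(1, 2), (1, 3), (2, 3), (3, 4)]   # node 1 plays the role of the gene with an eQTL
probs = {(1, 2): 0.9, (1, 3): 0.8, (2, 3): 0.5, (3, 4): 0.6}
dag = sample_dag(edges, probs, random.Random(0))
print(sorted(dag))
```

Repeating the draw with different seeds or visiting orders yields a collection of candidate DAGs whose model posteriors can then be averaged.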
As an illustrative example, a skeleton among 4 nodes
with one eQTL,
for
is displayed in Figure 1. The starting edges from the skeleton are,
and
. In Step (1), we randomly sample the direction of each starting edge by their posteriors under the local graphs,
and
. If the orientation for the starting edges is
, in Step (2), we orient edges
under the local graph
, and
under the local graph
. The orientation
is not allowed because of the acyclic constraint. For computing
, MAP and normal-gamma priors are used for
and
, respectively. The QTL-directed dependency graph (QDG) method (Neto and others, 2008) uses a logarithm of the odds (LOD) score to compare the two possible orientations based on the local graphs
versus
, and iteratively updates the edge orientation until convergence without an acyclic restriction. Thus, their algorithm depends only on local graphs, not on the global graph structure. Although their local-graph based approach is similar to the procedure of sampling DAGs of
, the posterior
from our framework is computed by averaging over the high-likelihood DAGs, weighted by their posterior model probability
(i.e., the global model posterior). The randomness of our DAG generation procedure comes from sampling directions of the starting edges in Step 1, weighted by their local posterior probabilities; the orientation in Step 2 can be different, depending on the node we start from, as we sequentially orient the edges by moving to the neighboring nodes based on local posteriors. In other words, using local posteriors may provide different DAGs if we start from a different node. Note that the number of DAGs that we can generate may be limited by the acyclic constraints. From a simulation study (Section S1), however, we found that
with different
provided stable estimated networks.
Fig. 1.

An example of a skeleton for
with an eQTL
for 
2.5. Simulation
We evaluate the performance of the
under various simulation settings by generating DAGs with different levels of sparsity, and single or multiple surrogate interventional variables for a gene. We simulate random DAGs following the Erdős and Rényi (1960) (ER) model, with a connection probability
, which is the probability that one vertex is connected to any other vertex and controls the sparsity of the graphs, and with a Barabási and Albert (1999) (BA) model with one edge added at each time step (Ha and others, 2016). Denote the total number of eQTLs across
genes by
, and the number of genes that have at least one eQTL by
. We generate random DAGs, as well as gene expression and eQTL genotype datasets as follows:
1. Simulate the
eQTL genotype data matrix
that includes
random vectors distributed as
, where
is a predetermined minor allele frequency, and
0, 1, or 2 with probabilities
,
, and
, respectively.
2. Construct a
matrix
and a
matrix
where all elements are zero. The former is regression coefficients of eQTL effect sizes and the latter is regression coefficients for gene–gene associations.
3. Generate a random graph from the ER model with the parameter
, or the BA model with one edge to add at each time. The graph structure determines the zero structure of
, and the nonzero elements are replaced by independent realizations of random variable
, where
, an exponential distribution with parameter
. The sign of each nonzero element is randomly assigned with probability 0.5.
4. For randomly chosen
number of rows of
, we assign the
eQTLs by randomly selecting nonzero elements and filling in values by realizations of random variable
, where
with parameter
that depends on the number of total eQTLs across all
genes. The sign of each nonzero element is randomly assigned with probability 0.5.
5. Generate
gene expression data matrix,
by
where
. We simulate
for
sequentially, and after simulating each
, we scale it so that it has variance 1 before simulating
.
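The data generation steps above can be sketched as follows for the ER case with a single eQTL per selected gene. Every numeric choice here (the offset 0.5, the exponential scale 1.0, the dimensions in the usage line) is a placeholder assumption rather than the authors' setting, and the function name `simulate` is hypothetical.

```python
import numpy as np

def simulate(n, p, q_genes, p_edge, maf, rng):
    """Toy version of Steps 1-5: returns expression X, genotypes Z, and coefficients."""
    # Step 1: genotypes coded 0/1/2, one eQTL for each of q_genes genes (assumption).
    Z = rng.binomial(2, maf, size=(n, q_genes)).astype(float)

    # Steps 2-3: ER-type random DAG; a strictly lower-triangular coefficient
    # matrix guarantees acyclicity. Effect sizes are hypothetical.
    B = np.zeros((p, p))
    mask = np.tril(rng.random((p, p)) < p_edge, k=-1)
    size = 0.5 + rng.exponential(1.0, size=(p, p))
    sign = rng.choice([-1.0, 1.0], size=(p, p))
    B[mask] = (size * sign)[mask]

    # Step 4: eQTL effects for randomly chosen genes (rows of A), random signs.
    A = np.zeros((p, q_genes))
    genes = rng.choice(p, size=q_genes, replace=False)
    A[genes, np.arange(q_genes)] = rng.choice([-1.0, 1.0], size=q_genes) * (
        0.5 + rng.exponential(1.0, size=q_genes))

    # Step 5: simulate genes in topological order, rescaling each to unit variance
    # before it is used as a parent of later genes.
    X = np.zeros((n, p))
    for j in range(p):
        X[:, j] = X @ B[j] + Z @ A[j] + rng.normal(size=n)
        X[:, j] /= X[:, j].std()
    return X, Z, B, A

X, Z, B, A = simulate(n=200, p=30, q_genes=10, p_edge=0.1, maf=0.3,
                      rng=np.random.default_rng(0))
print(X.shape, Z.shape, int(np.count_nonzero(B)), int(np.count_nonzero(A)))
```

Ordering the genes so that parents always have smaller indices is a convenience of the sketch; any topological order of the generated DAG would serve the same purpose.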
We fix the data dimension for the simulation study by
and
to match the dimension with the TCGA breast cancer data in Section 2.6. We consider eight methods in total for comparison: (1) ICA from PC-stable and PenPC algorithms, (2) the QTL-directed dependency graph (QDG) method (Neto and others, 2008), (3)
with skeletons from PC-stable algorithm, (4)
and
with skeletons from PenPC algorithm, and (5)
and
with skeletons from PenPC algorithm when randomly selected
genes have eQTLs. QDG employs a two-step approach to first estimate the skeleton using a PC-stable algorithm, and then orient the edges using eQTL data. A limitation of the QDG algorithm is that all genes in the network should have at least one eQTL (
), while
allows
. To accommodate the limitation of the QDG algorithm, and compare
and QDG, we generate the simulation datasets under
. Then, for the methods in (5), we randomly select
genes among
to have eQTLs. The PC-stable algorithm includes one tuning parameter,
, which is the significance level for the partial correlation tests. The PenPC algorithm has two steps: neighborhood selection, followed by a modified PC-stable algorithm. For the first step, the tuning parameters are selected by the extended Bayesian information criterion (Ha and others, 2016), and the extra tuning parameter for the second step is
, which is the same tuning parameter as in the PC-stable algorithm. For
, we have hyper-parameters,
and
in priors (2.6). Because we standardize the datasets before analyzing, we set
for all
. For
, we decide the directions of edges based on the posterior probabilities by the ratio
, where
is a predetermined cutoff, and we set
.
To evaluate the structural difference between an estimated graph and the true graph, we use the structural Hamming distance (SHD), which directly compares the structure of the learned network with that of the original network, where either may be a PDAG or a CPDAG (Tsamardinos and others, 2006). SHD is defined as the number of the following operations required to make the two graphs match: add or delete an undirected edge, and add, remove, or reverse the orientation of an edge (Algorithm 4 in Tsamardinos and others, 2006). For a true edge,
as an example,
,
,
, and no edge between
and
provide SHD values of 1 (add), 0, 1 (reverse), 1 (delete), respectively. Figure 2 displays the SHD between the estimated graphs and the true DAG as a function of the significance level, which is the tuning parameter of the PC-stable and PenPC algorithms and controls the sparsity of skeletons. We consider four scenarios in Figure 2: (a) ER model with
and
(all genes have single eQTL); (b) ER model with
and
; (c) ER model with
and
and
(all genes have at least one eQTL, i.e., multiple eQTLs); and (d) BA model and
. We set
and
in the data generation Step 4. Our simulation results are based on averages over 100 replications of the data generation Steps 1–5.
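Under the per-edge scoring described above (one unit for every vertex pair whose edge status differs between the two graphs), SHD can be computed with a few lines of code. The dictionary representation and the function name `shd` are assumptions made for this sketch.

```python
def shd(true_graph, est_graph):
    """Structural Hamming distance between two (partially) directed graphs.

    Each graph maps frozenset({i, j}) to a tuple (i, j) for the directed edge
    i -> j, or to the string "undirected". A missing key means no edge.
    """
    pairs = set(true_graph) | set(est_graph)
    return sum(
        true_graph.get(pair, "absent") != est_graph.get(pair, "absent")
        for pair in pairs
    )

fs = frozenset
true_dag = {fs((1, 2)): (1, 2), fs((2, 3)): (2, 3),
            fs((3, 4)): (3, 4), fs((1, 4)): (1, 4)}
estimate = {fs((1, 2)): "undirected",   # direction missing: 1
            fs((2, 3)): (2, 3),         # correct: 0
            fs((3, 4)): (4, 3),         # reversed: 1
            fs((2, 4)): "undirected"}   # extra edge: 1 (and 1 - 4 is missed: 1)

print(shd(true_dag, estimate))  # 4
```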
Fig. 2.
Structural Hamming distance (SHD) according to the significance level (
) for various simulation settings for
,
: (a) ER model with
, (b) ER model with
, (c) ER model with
and multiple eQTLs (
eQTLs are randomly assigned to the
nodes), and (d) BA model.
Comparison between PC-stable and PenPC. The CPDAG obtained from ICA with PenPC had lower SHD across the
values, except that PenPC showed slightly higher SHD in the case of BA model with
.
Comparison between QDG and
. Since the QDG algorithm uses the PC-stable algorithm for the skeleton estimation, we adjusted
to be comparable to QDG by using PC-stable for skeletons. Across all simulation settings, the SHD values are significantly decreased by using our
, compared to using QDG. The reduction in SHD is highest when the networks are dense (Figure 2b).
Comparison between
and
. We have two different types of methods,
and
, by the prior specifications. Across all simulation settings, the two methods showed similar performances, although
showed slightly higher SHD than
when we used only
genes with eQTLs in the dense graph case (Figure 2b).
Comparison between QDG and
with
. When we used the eQTLs for only
genes, the estimated graph was still more accurate than that achieved by other methods, including QDG, across all
values.
Overall,
with the PenPC algorithm showed the best performance across the various simulation settings, and
and
showed similar accuracy in structure estimation. The main advantage of using
over the QDG algorithm is its applicability, with accuracy gains, when only a subset of the genes in a network have eQTLs. We also evaluated the performance of
when eQTLs for a gene are in linkage disequilibrium (LD), and our methods showed better performance than ICA algorithms (Section S1 of the supplementary material available at Biostatistics online).
2.6. Application to TCGA breast cancer data
Our method is applied to a breast cancer study in The Cancer Genome Atlas (TCGA). We use RNA-seq data from tumor tissues obtained from 550 female Caucasian patients, restricted to genes in the pathways PI3K/AKT, P53, cell cycle, apoptosis, mTOR, MAPK, RAS, and ERBB, which we selected based on the literature: the TCGA breast cancer study (Network and others, 2012) and the pan-cancer study (Akbani and others, 2014). We selected 764 genes that are included in at least one of the pathways using the KEGG database (http://www.genome.jp/kegg). The expression of each gene within each sample is measured by the total read count (TReC), and we use the log-transformed TReC (logTReC) in this study. A
residual data matrix is obtained after removing the effects of several important covariates by linear regression: the 75th percentile of logTReC (which captures read depth), plate, institution, age, and six genotype principal components. Our goal is to estimate a network among those 764 genes that are included in the eight important pathways in cancer progression.
Applying the PenPC algorithm, we obtained the GGM (Step 1) with 13 061 undirected edges and the skeleton (Step 2) with 2255 undirected edges, using the p-value cutoff
. Node degrees in the skeleton ranged from 1 to 14, and the NCK2 and RXRA genes had the highest degrees. Then we oriented the undirected edges in the skeleton using cis-eQTLs for 76 genes (10%), where we kept the most significant cis-eQTL per gene. The cis-eQTLs are identified by eQTL mapping using both TReC and ASE (Sun, 2012). After applying
, we oriented 2092 edges (93%).
We evaluated the prognostic utility of the estimated network in selecting gene signatures. Using gene-specific causal modules, which consist of a gene and its parent set, based on the network from
, we identified which genes had a significant effect on patients’ survival times. The unsupervised causal structural learning by
provides a PDAG. The main idea is that when we model a gene in relation to patient survival times, we adjust for its parent genes, which are obtained from the PDAG. If the DAG is known, the parent genes of every gene are uniquely determined. Using the unique gene module for each gene, we can find the effect of a gene on the survival time by including the expression of its parent genes in a Cox proportional hazards model. However, the remaining undirected edges in the PDAG generate uncertainty in determining the parent genes, because an undirected edge is compatible with either direction. To obtain reasonable estimates of the causal effect while accounting for such uncertainty, similar to the method of Maathuis and others (2009), we estimate the lower bound of the effect size of the gene on survival time. This lower bound is calculated by switching the directions of the undirected edges, subject to the constraints of no extra v-structures and no directed cycles, and taking the minimum absolute value of the resulting coefficients. In particular, if all edges to the neighboring genes of
are directed, then we have an obvious module for gene
, and the module will be used to compute the prognostic effect of the gene using the Cox proportional hazards model, with the genes in the module as covariates. For genes with neighboring genes that have undirected edges to the gene, we compute multiple candidate modules for the gene in the equivalent class of the PDAG; this results in multiple prognostic effects for the gene, denoted by
for gene
and
, DAGs that are in the Markov equivalence class of the PDAG. We use
, where
for the prognostic gene ranking. The detailed algorithm for constructing multiple modules for a gene is described in Maathuis and others (2009).
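The module construction can be sketched as follows, assuming the local validity check of Maathuis and others (2009): a subset of undirected neighbors may be added to the parent set only if doing so creates no new v-structure at the gene. The function names and the toy graph are hypothetical. The sketch does not check for directed cycles and does not fit the Cox model; the effect estimates passed to `lower_bound_effect` are placeholders for coefficients fitted elsewhere.

```python
from itertools import combinations


def candidate_parent_sets(parents, undirected, adjacent):
    """Enumerate locally valid parent sets of one gene in a PDAG.

    parents:    nodes with a directed edge into the gene
    undirected: nodes joined to the gene by an undirected edge
    adjacent:   function (a, b) -> True if a and b share any edge
    """
    valid = []
    siblings = sorted(undirected)
    for size in range(len(siblings) + 1):
        for extra in combinations(siblings, size):
            all_parents = set(parents) | set(extra)
            # A new v-structure a -> gene <- b arises when a newly added
            # parent is not adjacent to some other parent of the gene.
            ok = all(adjacent(s, p) for s in extra for p in all_parents - {s})
            if ok:
                valid.append(frozenset(all_parents))
    return valid


def lower_bound_effect(effects):
    """Return the estimate with the smallest absolute value, sign retained."""
    return min(effects, key=abs)


# Hypothetical toy graph: A -> gene, gene - B, gene - C, and A - B adjacent.
edges = {frozenset(("A", "B"))}
modules = candidate_parent_sets(
    {"A"}, {"B", "C"}, lambda a, b: frozenset((a, b)) in edges
)
```

In this toy graph only the parent sets {A} and {A, B} survive, because adding C would create a new v-structure with A.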
The resulting PDAG from sirDAG is displayed in Figure 3, where the nodes are weighted by the effect sizes and colored by the signs of the effects. To evaluate the performance of the edge orientation, we compared sirDAG, which exploits the external cis-eQTLs, with the ICA using the PenPC algorithm (Section 2.2). The network obtained from sirDAG produced stronger effect sizes for 38 genes than that obtained from the ICA. Our method identified the collagen gene family, including types 1, 4, and 6, which have been shown to play important roles in breast cancer progression by altering the extracellular matrix architecture and composition (Barsky and others, 1983; Kauppila and others, 1998; Burnier and others, 2011; Fang and others, 2014). Table 1 shows the top 10 genes identified by sirDAG, together with the effect sizes, the concordance indices (c-indices) (Therneau and Grambsch, 2013) evaluated on test data using 10-fold cross-validation, and the parent sets from sirDAG and ICA. The signs of the effect sizes were consistently estimated between sirDAG and ICA, except for the top gene, COL4A2, which showed a 1147.03% relative increase in effect size from sirDAG compared to ICA. Moreover, the c-indices of sirDAG and the ICA for the gene COL4A2 were 0.58 and 0.51, respectively. The superior prognostic power of our method for the gene COL4A2 comes from the different choices of directions for the neighboring genes THBS2 and COL4A1, which were selected as parents in the network from sirDAG, while they were selected as children of COL4A2 in the network from CPDAG. Similarly, the COL1A1 gene showed a 1230.65% increase in effect size by selecting the COL1A2 gene as the parent gene using sirDAG, while the signs were the same. Overall, sirDAG showed better prognostic power than ICA across the top genes, based on the c-indices evaluated from test data (Table 1).
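For readers unfamiliar with the c-index, the following sketch computes a concordance index for right-censored survival data in the style of Harrell's estimator. This is an assumed, simplified form (tied survival times are ignored), and the function name and toy data are hypothetical; the reported values come from the authors' own cross-validated Cox models.

```python
def concordance_index(times, events, risk):
    """Fraction of comparable pairs ordered correctly by the risk score.

    times:  observed follow-up times
    events: 1 if the event was observed, 0 if censored
    risk:   predicted risk scores (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four patients, all events observed.
c = concordance_index([1, 2, 3, 4], [1, 1, 1, 1], [4.0, 3.0, 2.0, 1.0])
```

A value of 0.5 corresponds to a non-informative score and 1.0 to perfect ranking, which gives context for values such as 0.58 versus 0.51.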
Fig. 3.
TCGA breast cancer-specific networks for PI3K/AKT, P53, cell cycle, apoptosis, mTOR, MAPK, RAS, and ERBB pathways. The node sizes are weighted by the effect sizes (absolute values of the coefficients of the gene) in relation to patients’ survival times; green (red) nodes indicate positive (negative) effects.
Table 1.
Top 10 cancer genes (BRCA) identified among the 764 genes included in the eight major pathways, ranked by the sirDAG model: effects on survival time, C-index (SD) on test data using 10-fold cross-validation, and the parent genes, from sirDAG and ICA, and the relative increase in effect size resulting from sirDAG compared to ICA

| Rank | Gene | sirDAG effect | sirDAG C-index (SD) | sirDAG parents | ICA effect | ICA C-index (SD) | ICA parents | Relative increase |
|---|---|---|---|---|---|---|---|---|
| 1 | COL4A2 | 1.19 | 0.58 (0.036) | THBS2, COL4A1, FLNA | 0.1 | 0.51 (0.038) | FLNA | 1147.03% |
| 2 | LAMB1 | 0.68 | 0.60 (0.082) | LAMA1, SEPT4, LAMA4 | 0.68 | 0.60 (0.082) | LAMA1, SEPT4, LAMA4 | 0% |
| 3 | CDC20 | 0.66 | 0.58 (0.016) | MKNK1, ORC1, PLK1 | 0.74 | 0.53 (0.085) | MKNK1, HDAC1, CDK7, PLK1 | 11.8% |
| 4 | COL6A2 | 0.61 | 0.57 (0.08) | COL6A1 | 0.61 | 0.57 (0.08) | COL6A1 | 0% |
| 5 | COL1A1 | 0.61 | 0.52 (0.08) | COL1A2 | 0.05 | 0.50 (0.09) | | 1230.65% |
| 6 | PKMYT1 | 0.6 | 0.64 (0.04) | E2F2, RB1, PLK1 | 0.73 | 0.54 (0.01) | E2F2, HSP90AB1, TSC2, RB1, SESN3, CHEK1, PLK1 | 18.29% |
| 7 | CTSB | 0.59 | 0.61 (0.08) | RRAGB, FN1, PIK3R3, BCL2A1, CTSS | 0.61 | 0.58 (0.02) | CTSZ, FN1, PIK3R3, BCL2A1, CTSS | 3.8% |
| 8 | ORC6 | 0.58 | 0.62 (0.01) | CDC45, RBL2, FLNB, SIAH1 | 0.58 | 0.62 (0.01) | CDC45, RBL2, FLNB, SIAH1 | 0% |
| 9 | MAP3K12 | 0.52 | 0.58 (0.095) | LAMB2, SYNGAP1 | 0.49 | 0.52 (0.067) | EIF4B, PLA2G4C, MAPK8IP1, CACNA2D4, LAMB2, SYNGAP1 | 6.92% |
| 10 | BUB1B | 0.51 | 0.57 (0.047) | CCNB2 | 0.26 | 0.52 (0.041) | BRCA1, ESPL1 | 99.65% |
3. Discussion
Estimation of a DAG based on observational data is a challenging problem, because the conditional independence relations implied by a distribution satisfying the Markov property may be consistent with several DAGs. We have developed a method to estimate the DAG when there is an additional set of variables that are subject to interventions and are direct causes of the variables in the DAG. Simulation studies demonstrate the satisfactory performance of our method. We applied our method to construct a regulatory network from high-dimensional gene expression data, using genotype data of DNA polymorphisms as surrogate interventional data. Using the regulatory gene modules based on the graph structure estimated by sirDAG, we showed the prognostic performance of our method.
Our method is based on the assumptions that the gene expression values follow a multivariate Gaussian distribution and that there are no unobserved confounding variables. Faithfulness (which is more general than the Gaussian assumption) and the absence of hidden confounders are fundamental assumptions required for the identification of causal DAGs. Both assumptions are infeasible to check; however, we evaluated our approach in the framework of prognostic modeling using real data. For the external variables, we impose no distributional assumption, but we require that each DNA variant associated with a gene has no direct effect on other genes, which is a sufficient condition for admitting the external variables as surrogate variables. This assumption on the external variables implies that regulatory site-sharing between genes is not allowed, although genetic variants may shape multiple phenotypes (pleiotropy) (Tong and others, 2017).
Our method provides a flexible modeling framework that can incorporate other types of DNA data, including methylation, copy number, and mutation, as surrogate interventions for gene expression. With surrogate interventions appropriately selected using the biological hierarchy and filtering steps, sirDAG can be extended to other integrative modeling frameworks incorporating genomic, epigenomic, transcriptomic, and proteomic data.
Supplementary Material
Acknowledgments
Conflict of Interest: None declared.
4. Software
Software in the form of R code, together with a sample input data set and complete documentation, is available at https://github.com/MinJinHa/sirDAG.
References
- Akbani R., Ng P. K. S., Werner H. M. J., Shahmoradgoli M., Zhang F., Ju Z., Liu W., Yang J.-Y., Yoshihara K., Li J. and others (2014). A pan-cancer proteomic perspective on The Cancer Genome Atlas. Nature Communications 5, 3887.
- Andersson S. A., Madigan D. and Perlman M. D. (1997). A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics 25, 505–541.
- Barabási A.-L. and Albert R. (1999). Emergence of scaling in random networks. Science 286, 509–512.
- Bareinboim E. and Pearl J. (2012). Causal inference by surrogate experiments: z-identifiability. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (pp. 113–120). AUAI Press.
- Barsky S. H., Togo S., Garbisa S. and Liotta L. A. (1983). Type IV collagenase immunoreactivity in invasive breast carcinoma. The Lancet 321, 296–297.
- Burgess S., Small D. S. and Thompson S. G. (2017). A review of instrumental variable estimators for Mendelian randomization. Statistical Methods in Medical Research 26, 2333–2355.
- Burnier J. V., Wang N., Michel R. P., Hassanain M., Li S., Lu Y., Metrakos P., Antecka E., Burnier M. N., Ponton A. and others (2011). Type IV collagen-initiated signals provide survival and growth cues required for liver metastasis. Oncogene 30, 3766.
- Cai X., Bazerque J. A. and Giannakis G. B. (2013). Inference of gene regulatory networks with sparse structural equation models exploiting genetic perturbations. PLoS Computational Biology 9, e1003068.
- Chen L. S., Emmert-Streib F., Storey J. D. and others (2007). Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biology 8, R219.
- Chickering D. M. (2002). Learning equivalence classes of Bayesian-network structures. The Journal of Machine Learning Research 2, 445–498.
- Chickering D. M. (2003). Optimal structure identification with greedy search. The Journal of Machine Learning Research 3, 507–554.
- Colombo D. and Maathuis M. H. (2012). A modification of the PC algorithm yielding order-independent skeletons. CoRR, abs/1211.3295.
- Doss S., Schadt E. E., Drake T. A. and Lusis A. J. (2005). Cis-acting expression quantitative trait loci in mice. Genome Research 15, 681–691.
- Efron B. and Tibshirani R. (2002). Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology 23, 70–86.
- Erdős P. and Rényi A. (1960). On the evolution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences 5, 17–61.
- Fang M., Yuan J., Peng C. and Li Y. (2014). Collagen as a double-edged sword in tumor progression. Tumor Biology 35, 2871–2882.
- Ha M. J., Sun W. and Xie J. (2016). PenPC: a two-step approach to estimate the skeletons of high-dimensional directed acyclic graphs. Biometrics 72, 146–155.
- Hageman R. S., Leduc M. S., Korstanje R., Paigen B. and Churchill G. A. (2011). A Bayesian framework for inference of the genotype–phenotype map for segregating populations. Genetics 187, 1163–1170.
- Hoeting J. A., Madigan D., Raftery A. E. and Volinsky C. T. (1999). Bayesian model averaging: a tutorial. Statistical Science, 1, 382–401.
- Kalisch M. and Bühlmann P. (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. The Journal of Machine Learning Research 8, 613–636.
- Kauppila S., Stenbäck F., Risteli J., Jukkola A. and Risteli L. (1998). Aberrant type I and type III collagen gene expression in human breast cancer in vivo. The Journal of Pathology 186, 262–268.
- Kendziorski C. and Wang P. (2006). A review of statistical methods for expression quantitative trait loci mapping. Mammalian Genome 17, 509–517.
- Kulp D. C. and Jagalur M. (2006). Causal inference of regulator-target pairs by gene mapping of expression phenotypes. BMC Genomics 7, 125.
- Li R., Tsaih S.-W., Shockley K., Stylianou I. M., Wergedal J., Paigen B. and Churchill G. A. (2006). Structural model analysis of multiple quantitative traits. PLoS Genetics 2, e114.
- Logsdon B. A. and Mezey J. (2010). Gene expression network reconstruction by convex feature selection when incorporating genetic perturbations. PLoS Computational Biology 6, e1001014.
- Maathuis M. H., Kalisch M. and Bühlmann P. (2009). Estimating high-dimensional intervention effects from observational data. The Annals of Statistics 37, 3133–3164.
- Mazumder R., Friedman J. H. and Hastie T. (2011). Sparsenet: coordinate descent with nonconvex penalties. Journal of the American Statistical Association 106, 1125–1138.
- Meek C. (1995). Causal inference and causal explanation with background knowledge. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., pp. 403–410.
- Neto E. C., Ferrara C. T., Attie A. D. and Yandell B. S. (2008). Inferring causal phenotype networks from segregating populations. Genetics 179, 1089–1100.
- Neto E. C., Keller M. P., Attie A. D. and Yandell B. S. (2010). Causal graphical models in systems genetics: a unified framework for joint inference of causal network and genetic architecture for correlated phenotypes. The Annals of Applied Statistics 4, 320.
- Cancer Genome Atlas Network (2012). Comprehensive molecular portraits of human breast tumors. Nature 490, 61.
- Pearl J. (2000). Causality: Models, Reasoning and Inference, Volume 29. Cambridge University Press.
- Pearl J. (2009). Causality: Models, Reasoning and Inference. Cambridge University Press.
- Ritchie M. E., Phipson B., Wu D., Hu Y., Law C. W., Shi W. and Smyth G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43, e47–e47.
- Schadt E. E., Lamb J., Yang X., Zhu J., Edwards S., Guhathakurta D., Sieberts S. K., Monks S., Reitman M., Zhang C. and others (2005). An integrative genomics approach to infer causal associations between gene expression and disease. Nature Genetics 37, 710–717.
- Schmidt M., Niculescu-Mizil A. and Murphy K. (2007). Learning graphical model structure using L1-regularization paths. Proceedings of the 22nd national conference on Artificial intelligence-Volume 2 (pp. 1278–1283). AAAI Press.
- Spirtes P., Glymour C. N. and Scheines R. (2000). Causation, Prediction and Search, Volume 81. MIT Press, Cambridge, MA.
- Sun W. (2012). A statistical framework for eQTL mapping using RNA-seq data. Biometrics 68, 1–11.
- Sun W., Ibrahim J. G. and Zou F. (2010). Genomewide multiple-loci mapping in experimental crosses by iterative adaptive penalized regression. Genetics 185, 349–359.
- Sun W., Yu T. and Li K.-C. (2007). Detection of eQTL modules mediated by activity levels of transcription factors. Bioinformatics 23, 2290–2297.
- Therneau T. M. and Grambsch P. M. (2013). Modeling survival data: extending the Cox model. Springer Science & Business Media.
- Tong P., Monahan J. and Prendergast J. G. D. (2017). Shared regulatory sites are abundant in the human genome and shed light on genome evolution and disease pleiotropy. PLoS Genetics 13, e1006673.
- Tsamardinos I., Brown L. E. and Aliferis C. F. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning 65, 31–78.