Biostatistics (Oxford, England). 2018 Dec 31;21(4):659–675. doi: 10.1093/biostatistics/kxy080

Estimation of high-dimensional directed acyclic graphs with surrogate intervention

Min Jin Ha, Wei Sun
PMCID: PMC7776804  PMID: 30596892

Summary

Directed acyclic graphs (DAGs) have been used to describe causal relationships between variables. The standard method for determining such relations uses interventional data. For complex systems with high-dimensional data, however, such interventional data are often not available. Therefore, it is desirable to estimate causal structure from observational data without subjecting variables to interventions. Observational data can be used to estimate the skeleton of a DAG and the directions of a limited number of edges. We develop a Bayesian framework to estimate a DAG using surrogate interventional data, where the interventions are applied to a set of external variables, and thus such interventions are considered to be surrogate interventions on the variables of interest. Our work is motivated by expression quantitative trait locus (eQTL) studies, where the variables of interest are the expression of genes, the external variables are DNA variations, and interventions are applied to DNA variants during the process of a randomly selected DNA allele being passed to a child from either parent. Our method, surrogate intervention recovery of a DAG (Inline graphic), first constructs a DAG skeleton using penalized regressions and subsequent partial correlation tests, and then estimates the posterior probabilities of all the edge directions after incorporating DNA variant data. We demonstrate the utility of Inline graphic through simulations and an application to an eQTL study of 550 breast cancer patients.

Keywords: Directed acyclic graphs, eQTL, Surrogate intervention

1. Introduction

A directed acyclic graph (DAG) is a useful tool for studying conditional independence relations among a set of random variables, where each vertex represents a random variable and each directed edge represents a conditional dependence relation, with a restriction of no directed cycles. We consider a DAG Inline graphic for a set of random variables Inline graphic, with vertices Inline graphic corresponding to variables Inline graphic, and a collection of directed edges Inline graphic. We assume that Inline graphic follows a multivariate Gaussian distribution Inline graphic. We denote the parent vertices of vertex Inline graphic by Inline graphic and the corresponding random variables by Inline graphic, and then can decompose the likelihood of Inline graphic, based on the Markov property that Inline graphic is independent of all the remaining variables given its parents Inline graphic:

graphic file with name M16.gif (1.1)

Observational data provide the conditional independence relations among the random variables. Since a set of conditional independence restrictions may be compatible with multiple DAGs, observational data can identify the skeleton of a DAG (i.e., all the edges in a DAG without directions) and the directions of a limited number of edges. The collection of all the DAGs corresponding to the same set of conditional independence restrictions constitutes a Markov equivalence class, and thus observational data can only identify a Markov equivalence class, but cannot distinguish the DAGs within the Markov equivalence class. For example, the DAGs Inline graphic, Inline graphic, and Inline graphic all represent the (conditional) independence/dependence that Inline graphic, Inline graphic, Inline graphic, and Inline graphic, where Inline graphic indicates that Inline graphic and Inline graphic are dependent, and Inline graphic indicates that Inline graphic is independent of Inline graphic given Inline graphic. Therefore, these three DAGs constitute a Markov equivalence class and cannot be distinguished by observational data.

In this article, we relate the probabilistic DAG that describes conditional independences among genes to a causal DAG representation of genes under the causal Markov and causal faithfulness assumptions (Spirtes and others, 2000). We also need to assume that there are no unmeasured confounders for relations among Inline graphic. For example, in the causal DAG Inline graphic, the structure Inline graphic can be interpreted as Inline graphic being the common cause of the effects Inline graphic and Inline graphic. Extensive efforts have been devoted to estimating causally sufficient DAG structures using observational data. These efforts have produced, for example, the PC algorithm (named after its authors, Peter and Clark) (Spirtes and others, 2000), the PC-stable algorithm that resolves the order-dependency of the PC algorithm (Colombo and Maathuis, 2012), the greedy equivalence search algorithm (Chickering, 2003), and the penalized regression methods (Schmidt and others, 2007; Ha and others, 2016). In this article, we aim to estimate DAGs in another setting, a “surrogate experiment” (Bareinboim and Pearl, 2012). In a surrogate experiment, the variables of interest can be influenced by another group of variables that may be subject to interventions, and thus the interventions do not directly apply to the variables of interest. We refer to such interventions as surrogate interventions.

Our method is motivated by the problem of studying gene–gene networks using gene expression quantitative trait locus (eQTL) data. In an eQTL study, both gene expression and the genotypes of DNA variants are collected in a common set of samples. To simplify the discussion in this article, we assume the DNA variants are single nucleotide polymorphisms (SNPs), though other types of DNA variants, such as copy number variants or indels (small insertions or deletions), also can be used.

We use a DAG to describe the gene–gene network, where each vertex is the expression of a gene, and each edge indicates a causal relation between two genes. Interventions on SNP genotypes can influence gene expression, and thus are surrogate interventions. Direct intervention on an SNP genotype is often impossible. The randomization of the SNP alleles passing to daughter cells during meiosis, however, is analogous to a randomized experiment (Li and others, 2006). This is the so-called Mendelian randomization (MR), which has been applied to instrumental variable analysis (IVA), where genetic markers (SNPs) are instrumental variables to assess the effect of an exposure (e.g., gene expression) on an outcome (Burgess and others, 2017). The IVA for MR is designed to facilitate a randomized trial to control for confounding variables, except that the random allocation of the exposure is performed by nature, rather than by an investigator. The key assumption is that the SNPs have an effect on the exposure, but not on an outcome, except through the exposure. In this article, we assume that there are no unobserved confounders, and thus MR mainly helps us to justify that the association between an SNP and a gene has the causal interpretation SNP Inline graphic gene, which is consistent with the biological principle that DNA can affect gene expression but gene expression can rarely influence DNA sequence.

Previous studies have used eQTL data to dissect the causal relations among three variables, including an eQTL, a gene expression trait, and a third variable, which can be a clinical phenotype (Schadt and others, 2005), another gene expression trait (Kulp and Jagalur, 2006; Chen and others, 2007), or the activity of a transcription factor (Sun and others, 2007). Neto and others (2008) first estimated the skeleton of a DAG, and then used eQTLs to orient the edges in the network. Then they extended this method to jointly estimate causal networks of gene expression traits and the underlying genetic architecture using Bayesian model averaging and a modified Metropolis–Hastings algorithm (Neto and others, 2010). Hageman and others (2011) jointly estimated gene–gene networks and eQTLs using a Bayesian method, while placing constraints on the network through a structural prior. Another approach for graphical model estimation is the use of structural equation models that permit both cyclic and acyclic graphs. Li and others (2006) employed a score-based model selection method. Logsdon and Mezey (2010) estimated network skeletons by applying an adaptive lasso regression for each gene expression trait, and then transformed the skeleton into a DAG or a directed cyclic graph, based on eQTL perturbations. Cai and others (2013) extended the work of Logsdon and Mezey (2010) by providing the initial parameter estimates for the adaptive lasso from penalized regressions using a lasso penalty.

Despite the success of previous work, using high-dimensional gene expression data with a limited sample size to estimate DAGs remains challenging. We propose a high-dimensional DAG estimation method, surrogate intervention recovery of a DAG (Inline graphic), that integrates an upstream external dataset to learn causal structure. Inline graphic incorporates recent methodological developments on skeleton estimation and eQTL mapping. Specifically, we employ the PenPC method (Ha and others, 2016) to estimate the network skeleton, which uses penalized regression and shows substantial advantages over existing methods, such as the PC-stable algorithm for estimating high-dimensional DAGs (Colombo and Maathuis, 2012). We also employ a recently developed method to identify cis-eQTLs, using allele-specific expression (ASE) of RNA-seq data (Sun, 2012). A local or distant eQTL identified by standard eQTL analysis using total expression, rather than ASE, may impact gene expression indirectly by modifying the expression of other genes or biological processes. In contrast, a cis-eQTL identified from ASE must have a direct influence on gene expression (Doss and others, 2005). Through simulation studies under various settings, we demonstrate the improved performance of Inline graphic over other existing methods, even when only a small proportion of genes in the network have eQTLs. We illustrate the prognostic utility of our method using The Cancer Genome Atlas (TCGA) breast cancer data. The computational time of Inline graphic is discussed in Section S1 of the supplementary material available at Biostatistics online.

2. Methods

2.1. Notations

We study the causal relations of Inline graphic variables Inline graphic by a DAG Inline graphic. We assume that Inline graphic, where the distribution, Inline graphic, is Markov and faithful to the causal DAG, Inline graphic. We assume a Gaussian distribution on Inline graphic that is Markov and faithful to the true unknown underlying causal DAG (Spirtes and others, 2000). We further assume that we observe all the relevant variables, and that there are no hidden confounding variables for the relations among Inline graphic. Let Inline graphic be an additional set of variables, such that they are direct causes of the variables Inline graphic and subject to interventions. Let Inline graphic be the set of vertices that are associated with at least one variable in Inline graphic. For example, in eQTL studies, Inline graphic is the set of genes with at least one eQTL. For any Inline graphic, Inline graphic denotes the Inline graphic variables that directly influence Inline graphic (Inline graphic for all Inline graphic). Let Inline graphic be the sample size and denote the Inline graphic observed data matrix of Inline graphic by Inline graphic, with each column centered to have mean 0. Following the notations in Bareinboim and Pearl (2012), we denote the interventional values on variables Inline graphic by Inline graphic. The sufficient conditions for interventions on Inline graphic to be surrogate experiments require that for each Inline graphic, Inline graphic has no direct effect on variables Inline graphic for Inline graphic (Pearl, 2000). In this study, we assume there are no hidden variables, and thus this assumption is testable. We did not pursue such testing because we restrict the eQTLs to be cis-eQTLs, and cis-regulatory elements in the human genome are rarely shared across genes.

2.2. Background: inductive causation algorithm

We briefly review the framework of the inductive causation algorithm (ICA). Using observational data, the Markov equivalence class of a DAG is identifiable. Two DAGs are Markov equivalent if and only if they have the same skeleton and the same v-structures (Andersson and others, 1997). The skeleton of a DAG Inline graphic is an undirected graph that is obtained by removing the directions of all the edges in the DAG. We denote the skeleton of Inline graphic by Inline graphic where Inline graphic. A v-structure is an ordered triplet of vertices Inline graphic, such that Inline graphic contains the directed edges Inline graphic and Inline graphic, and Inline graphic and Inline graphic are not adjacent in Inline graphic. Given a Markov equivalence class, a directed edge is compelled if this edge exists in every DAG in the equivalence class; it is reversible otherwise. The edges participating in v-structures of a DAG are compelled edges. A Markov equivalence class can be represented by a completed partially directed acyclic graph (CPDAG), which is a partially directed acyclic graph (PDAG) consisting of directed edges for all compelled edges in the equivalence class, and undirected edges for all reversible edges in the equivalence class (Chickering, 2002).
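The equivalence criterion above (same skeleton, same v-structures) can be checked mechanically. The following is a minimal sketch, assuming a DAG is stored as a set of (parent, child) pairs; the vertex names X1, X2, X3 and the function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def skeleton(dag):
    """Undirected version of a DAG given as a set of (parent, child) pairs."""
    return {frozenset(edge) for edge in dag}


def v_structures(dag):
    """All triplets (i, k, j) with i -> k <- j where i and j are not adjacent."""
    skel = skeleton(dag)
    parents = {}
    for parent, child in dag:
        parents.setdefault(child, set()).add(parent)
    found = set()
    for child, pa in parents.items():
        for i, j in combinations(sorted(pa), 2):
            if frozenset((i, j)) not in skel:
                found.add((i, child, j))
    return found


def markov_equivalent(dag_a, dag_b):
    """Same skeleton and same v-structures (Andersson and others, 1997)."""
    return (skeleton(dag_a) == skeleton(dag_b)
            and v_structures(dag_a) == v_structures(dag_b))


# Hypothetical three-vertex examples.
chain = {("X1", "X2"), ("X2", "X3")}          # X1 -> X2 -> X3
reverse_chain = {("X3", "X2"), ("X2", "X1")}  # X1 <- X2 <- X3
fork = {("X2", "X1"), ("X2", "X3")}           # X1 <- X2 -> X3
collider = {("X1", "X2"), ("X3", "X2")}       # X1 -> X2 <- X3

print(markov_equivalent(chain, fork))      # chain and fork share a class
print(markov_equivalent(chain, collider))  # the collider has a v-structure
```

The chain, reverse chain, and fork share a skeleton and contain no v-structure, so they fall in one class; the collider has the same skeleton but a v-structure, so it does not.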

The ICA to estimate a CPDAG from observational data consists of three steps: (1) estimation of the skeleton Inline graphic, usually by conditional independence testing, (2) v-structure identification, and (3) completion of the PDAG obtained from (1) and (2) to obtain the CPDAG (Pearl, 2009). Both Steps (2) and (3) are deterministic. More specifically, at Step (2), a triplet Inline graphic is assigned a v-structure Inline graphic, if Inline graphic, Inline graphic, Inline graphic, and Inline graphic is not included in the conditioning set that makes Inline graphic and Inline graphic independent. The completion in Step (3) is to orient the undirected edges as much as possible, with restrictions of no directed cycles and no extra v-structures, which can be done by applying the rules in Meek (1995).
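Step (2) is deterministic once the skeleton and the separating sets from Step (1) are known. The sketch below illustrates that rule only; the data structures (a set of `frozenset` edges and a dictionary of separating sets) and the function name are assumptions made for this example, not the interface of any particular package.

```python
from itertools import combinations


def orient_v_structures(skeleton_edges, sepsets):
    """Return the directed edges (parent, child) implied by v-structures.

    skeleton_edges: set of frozenset({i, j}) undirected edges.
    sepsets: dict mapping frozenset({i, j}) for non-adjacent pairs to the
             conditioning set that rendered i and j independent.
    """
    neighbors = {}
    for edge in skeleton_edges:
        i, j = tuple(edge)
        neighbors.setdefault(i, set()).add(j)
        neighbors.setdefault(j, set()).add(i)

    directed = set()
    for k, nbrs in neighbors.items():
        for i, j in combinations(sorted(nbrs), 2):
            pair = frozenset((i, j))
            if pair in skeleton_edges:
                continue  # i and j are adjacent, so (i, k, j) is not a candidate
            if pair in sepsets and k not in sepsets[pair]:
                # k was not needed to separate i and j: orient i -> k <- j
                directed.add((i, k))
                directed.add((j, k))
    return directed


# Hypothetical skeleton A - C - B with A and B non-adjacent.
skel = {frozenset(("A", "C")), frozenset(("B", "C"))}
collider_edges = orient_v_structures(skel, {frozenset(("A", "B")): set()})
no_edges = orient_v_structures(skel, {frozenset(("A", "B")): {"C"}})
print(sorted(collider_edges), sorted(no_edges))
```

When the separating set of A and B is empty, C is oriented as a collider; when C itself separates A and B, no edge is oriented at this step.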

In our proposed method, we directly use the surrogate intervention for the edge orientation of the skeleton estimated in Step (1). For the estimation of the skeleton, the PC algorithm (Spirtes and others, 2000; Kalisch and Bühlmann, 2007) starts from a completely connected graph where all pairs of genes are connected, and iteratively removes edges by testing zero partial correlations, starting from zero-order (marginal) up to Inline graphic order, with sample size Inline graphic and a significance level Inline graphic. The PC algorithm requires an exhaustive search, especially for edges that are present in the skeleton; the corresponding pair of variables are tested for the conditional independences given all possible subsets of the other variables. Thus, estimating high-dimensional DAGs without prior knowledge on their structures is a challenging problem. The PenPC algorithm (Ha and others, 2016) for DAG skeleton estimation employs a neighborhood selection method to choose a Markov blanket of each vertex, and then a greatly reduced number of partial-correlation tests are performed to remove false positive edges between co-parents of v-structures. The PenPC algorithm improves computational efficiency over the PC algorithm in the following areas: (1) PenPC reduces the search space within the selected Markov blanket for the subsequent partial-correlation tests; (2) the neighborhood selection approach that converts the Gaussian graphical model (GGM) into a more tractable node-wise multiple regression model enables parallel computing, and the computational time can be substantially reduced if multicore processors are available; and (3) for the second stage of PenPC, we further reduced the number of partial correlation tests to declare an edge in the skeleton by removing redundant conditioning sets for testing. Moreover, for the neighborhood selection, we employed the log penalty Inline graphic (Mazumder and others, 2011), which significantly improves the accuracy of the Markov blanket search for high-dimensional problems (Sun and others, 2010), and we provided theoretical justifications of the estimation consistency of high-dimensional DAG skeletons (Ha and others, 2016).
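The zero-partial-correlation tests that both the PC and PenPC algorithms rely on can be illustrated with the usual Fisher z-transform. This is a minimal sketch under Gaussian assumptions, using the standard recursive formula for partial correlations; the function names, the toy chain X1 -> X2 -> X3, and the sample size are hypothetical and are not taken from the article.

```python
import math
import random
from statistics import NormalDist


def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(data, i, j, cond):
    """Partial correlation of columns i and j given the columns listed in cond."""
    if not cond:
        return corr(data[i], data[j])
    k, rest = cond[0], cond[1:]
    r_ij = partial_corr(data, i, j, rest)
    r_ik = partial_corr(data, i, k, rest)
    r_jk = partial_corr(data, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


def fisher_z_pvalue(r, n, cond_size):
    """Two-sided p-value for H0: partial correlation = 0 (Fisher z-transform)."""
    z = math.atanh(r) * math.sqrt(n - cond_size - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Toy chain X1 -> X2 -> X3: X1 and X3 are dependent marginally,
# but independent given X2.
rng = random.Random(0)
n = 500
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [a + rng.gauss(0, 1) for a in x1]
x3 = [b + rng.gauss(0, 1) for b in x2]
data = [x1, x2, x3]

r_marginal = partial_corr(data, 0, 2, [])
r_partial = partial_corr(data, 0, 2, [1])
p_marginal = fisher_z_pvalue(r_marginal, n, 0)
p_partial = fisher_z_pvalue(r_partial, n, 1)
print(r_marginal, p_marginal, r_partial, p_partial)
```

In a skeleton search, the edge between X1 and X3 would be removed when the test given X2 fails to reject at the chosen significance level; restricting the candidate conditioning sets, as PenPC does with Markov blankets, reduces how many such tests are needed.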

2.3. Edge orientation given surrogate experiments

Now we consider orienting the edges in the skeleton Inline graphic. To simplify the notations, we assume each variable Inline graphic is causally affected by a set of external variables Inline graphic, while Inline graphic can be an empty set. Given the external variables, the factorization of the joint probability density becomes

graphic file with name M100.gif (2.1)

where Inline graphic includes Inline graphic fixed intervention values. We will orient edges based on this augmented likelihood function. Each conditional distribution in the product term of equation (2.1) can be expressed by the following linear regressions: given Inline graphic,

graphic file with name M104.gif (2.2)

where Inline graphic is Inline graphic sub-matrix of Inline graphic corresponding to the parent set of Inline graphic from Inline graphic, Inline graphic is the corresponding regression coefficients, Inline graphic is the regression coefficients for the intervention values Inline graphic, and Inline graphic. We assume linearity for relations between genes and eQTLs, because both PC-stable and PenPC algorithms are based on linearity assumption between a node and its parents, and the linear relations between gene expressions and eQTLs are often assumed as reasonable approximations in eQTL studies (Kendziorski and Wang, 2006; Ritchie and others, 2015).

To orient the skeleton Inline graphic, we aim to estimate the posterior probability of Inline graphic versus Inline graphic for an undirected edge Inline graphic in Inline graphic by Bayesian model averaging (Hoeting and others, 1999). Suppose Inline graphic is a DAG that can be constructed from the skeleton Inline graphic by adding directions. Let Inline graphic be the set of parents of Inline graphic in graph Inline graphic, which can be an empty set if Inline graphic has no parent in graph Inline graphic. Given Inline graphic, let Inline graphic be all the parameters including Inline graphic for all Inline graphic in the model (2.2). Denoting Inline graphic as the data including Inline graphic and Inline graphic, and letting Inline graphic be an indicator function,

graphic file with name M134.gif (2.3)

where Inline graphic for Inline graphic are all possible DAGs, given the skeleton Inline graphic. Given Inline graphic, Inline graphic is either 0 or 1, thus Inline graphic. The posterior of a DAG, Inline graphic is computed by

graphic file with name M142.gif

assuming all the Inline graphic’s have the same prior probability. This is reasonable, since they all have the same skeleton. The marginal probability can be expressed as the following integral:

graphic file with name M144.gif (2.4)

Instead of using the marginal probability in (2.4), we replace Inline graphic by Inline graphic, where Inline graphic is the maximum a posteriori (MAP) estimate under uniform priors, Inline graphic (i.e., maximum likelihood estimate). Therefore,

graphic file with name M149.gif (2.5)

where

graphic file with name M150.gif
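To make the plug-in idea concrete, the sketch below scores the two orientations of a single undirected edge between two genes when one gene has an eQTL. It assumes, as described above, equal prior probabilities for the candidate DAGs and node-wise Gaussian linear regressions evaluated at the maximum likelihood estimates; the exact expression in (2.5) is not reproduced here, and all names, effect sizes, and the simulated data are hypothetical.

```python
import math
import random


def rss(y, predictors):
    """Residual sum of squares of y regressed (no intercept) on predictor columns."""
    k = len(predictors)
    # Augmented normal equations, solved by Gaussian elimination with pivoting.
    aug = [[sum(a * b for a, b in zip(u, v)) for v in predictors]
           + [sum(a * b for a, b in zip(u, y))] for u in predictors]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    beta = [0.0] * k
    for r in reversed(range(k)):
        tail = sum(aug[r][c] * beta[c] for c in range(r + 1, k))
        beta[r] = (aug[r][k] - tail) / aug[r][r]
    return sum((yt - sum(b * p[t] for b, p in zip(beta, predictors))) ** 2
               for t, yt in enumerate(y))


def max_loglik(dag_parents, genes, eqtls):
    """Maximized Gaussian log-likelihood summed over the node-wise regressions."""
    total = 0.0
    for node, parents in dag_parents.items():
        cols = [genes[p] for p in parents] + eqtls.get(node, [])
        n = len(genes[node])
        sigma2 = rss(genes[node], cols) / n  # ML estimate of the error variance
        total += -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return total


def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Hypothetical data: eQTL Z -> X1 -> X2, genotypes coded 0/1/2.
rng = random.Random(1)
n = 500
z = center([(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n)])
x1 = center([1.0 * g + rng.gauss(0, 1) for g in z])
x2 = center([0.8 * a + rng.gauss(0, 1) for a in x1])
genes, eqtls = {"X1": x1, "X2": x2}, {"X1": [z]}

candidates = {
    "X1->X2": {"X1": [], "X2": ["X1"]},
    "X2->X1": {"X1": ["X2"], "X2": []},
}
loglik = {name: max_loglik(pa, genes, eqtls) for name, pa in candidates.items()}
top = max(loglik.values())
weights = {name: math.exp(v - top) for name, v in loglik.items()}
posterior = {name: w / sum(weights.values()) for name, w in weights.items()}
print(posterior)
```

Both candidate DAGs have the same number of parameters here, so the comparison is driven by fit: the orientation X2 -> X1 forces X2 to be independent of the eQTL, which the simulated data contradict, and the posterior therefore concentrates on X1 -> X2.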

As an alternative approach, we consider using a conjugate normal-inverse gamma prior for the parameters Inline graphic. The model in (2.2) is re-expressed as: for each node Inline graphic,

graphic file with name M153.gif

where Inline graphic and Inline graphic. Let Inline graphic. We consider normal-gamma priors: for each Inline graphic,

graphic file with name M158.gif (2.6)

where Inline graphic, Inline graphic, and Inline graphic are fixed positive values. The variance structure of Inline graphic is defined according to Zellner’s g-prior. A larger Inline graphic provides more uncertainty that the coefficients are zero, and we employ the unit information prior by setting Inline graphic. The marginal probability in (2.4) under the normal-inverse gamma priors can be derived as

graphic file with name M165.gif

where Inline graphic and Inline graphic.

We call our methods, which use MAP and normal-gamma priors, Inline graphic and Inline graphic, respectively. Using the marginal likelihood, Inline graphic in Inline graphic is the standard Bayesian approach for model selection and penalizes the likelihood with respect to the increasing complexity of models. In our framework, however, we assume that the skeleton is sparse (i.e., the number of parents for each node is low), and a limited number of eQTLs with strong gene signals are used for inferring the directions of the skeleton. In simulation studies (Section 2.5), we compared the performance of those two approaches and found no significant difference, even when there were multiple eQTLs.

Finally, we use posterior probabilities of the two possible directions for an edge in Inline graphic. We select the direction Inline graphic, if the ratio Inline graphic for a cutoff Inline graphic, with a constraint of no directed cycle. Let Inline graphic. The values Inline graphic can be considered as estimates of the local false discovery rate (FDR) (Efron and Tibshirani, 2002), if the direction Inline graphic is called a discovery. Given a desired FDR level Inline graphic, the set of directed edges, Inline graphic, for Inline graphic and Inline graphic are called discoveries. The expected FDR for a given Inline graphic is

graphic file with name M184.gif
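The selection rule can be illustrated numerically. Because the displayed formula is not recoverable from this text, the sketch below relies on a stated assumption: the local FDR of a called direction is taken to be one minus its posterior probability, and the expected FDR of a set of discoveries is taken to be the average of their local FDR values, which is a common construction. The acyclicity constraint is ignored for brevity, and all names and numbers are hypothetical.

```python
def local_fdr_values(posterior_forward):
    """Map each edge to its more probable direction and that direction's local FDR.

    posterior_forward: dict mapping (j, k) to the posterior probability of j -> k.
    """
    values = {}
    for (j, k), prob in posterior_forward.items():
        direction = (j, k) if prob >= 0.5 else (k, j)
        values[direction] = 1.0 - max(prob, 1.0 - prob)
    return values


def expected_fdr(local_fdr, threshold):
    """Average local FDR among directions called at the given threshold (assumed form)."""
    called = [q for q in local_fdr.values() if q <= threshold]
    return sum(called) / len(called) if called else 0.0


def select_discoveries(posterior_forward, alpha):
    """Largest threshold-defined set of directions whose expected FDR is <= alpha."""
    local_fdr = local_fdr_values(posterior_forward)
    chosen = []
    for threshold in sorted(set(local_fdr.values())):
        if expected_fdr(local_fdr, threshold) <= alpha:
            chosen = [edge for edge, q in local_fdr.items() if q <= threshold]
    return chosen


# Hypothetical posterior probabilities for three skeleton edges.
posteriors = {("A", "B"): 0.99, ("B", "C"): 0.90, ("C", "D"): 0.55}
discoveries = select_discoveries(posteriors, alpha=0.10)
print(sorted(discoveries))
```

With these toy values the first two directions are called (average local FDR about 0.055), while adding the third would raise the average to about 0.19 and exceed the target level.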

2.4. Sampling DAGs

The model averaging across all possible Inline graphic DAGs in equation (2.3) is not computationally feasible for a network skeleton with a large number of edges. For example, if there are 10 undirected edges in a skeleton, there are at most Inline graphic directed structures when the acyclic constraint is ignored. Therefore, when there are a large number of undirected edges in the skeleton, instead of evaluating all the possible DAGs, we evaluate a large number of high-likelihood DAGs. To construct a high-likelihood DAG from a skeleton, we sequentially orient undirected edges one by one, starting from edges that involve eQTLs, using posteriors evaluated under local graphs. Let the current PDAG for Inline graphic be Inline graphic. For an undirected edge Inline graphic in Inline graphic, a directed local graph for Inline graphic is defined by Inline graphic. Denoting Inline graphic as the data including Inline graphic and Inline graphic, the posteriors for edge directions are computed by

graphic file with name M196.gif

where the priors are assumed to be Inline graphic. If two nodes are connected by an undirected edge, and at least one of these two genes has eQTLs, we refer to such an undirected edge as a starting edge. We identify a high-likelihood DAG in two steps:

Step 1. Set Inline graphic as the skeleton. For each of the starting edges, update the direction of Inline graphic in Inline graphic by randomly selecting a direction Inline graphic or Inline graphic with posterior probabilities Inline graphic and Inline graphic, subject to acyclic constraints among the directions of all starting edges. Note that Inline graphic and Inline graphic are null sets, as Inline graphic is the skeleton in the first step.

Step 2. Sample one of the nodes that have eQTLs. Orient the edges to the neighboring nodes (i.e., those connected by undirected edges), say Inline graphic, of the selected node, based on their posterior probabilities of Inline graphic and Inline graphic, subject to the acyclic constraint. Then set Inline graphic as the neighboring nodes of Inline graphic, and apply the same procedure. Repeat until Inline graphic is an empty set. This procedure is performed for each connected component of the skeleton.
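The two steps can be sketched as follows. This is an illustrative outline only: the local posterior is supplied as a hypothetical callback (`local_posterior`), whereas the method computes it from local graphs; the traversal visits neighbors breadth-first from one sampled eQTL node; and components without an eQTL node are left unoriented in this sketch. For scale, a skeleton with 10 undirected edges has 2**10 = 1024 orientations before the acyclicity check.

```python
import random
from collections import deque


def creates_cycle(directed, parent, child):
    """True if adding parent -> child would close a directed cycle."""
    stack, seen = [child], set()
    while stack:
        node = stack.pop()
        if node == parent:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(c for p, c in directed if p == node)
    return False


def sample_direction(j, k, directed, local_posterior, rng):
    """Draw j -> k with its local posterior; flip the draw if it would create a cycle."""
    prob_jk = local_posterior(j, k, directed)
    parent, child = (j, k) if rng.random() < prob_jk else (k, j)
    if creates_cycle(directed, parent, child):
        parent, child = child, parent  # the reverse is safe while the graph is acyclic
    return parent, child


def sample_dag(skeleton, eqtl_nodes, local_posterior, rng):
    all_edges = {frozenset(edge) for edge in skeleton}
    undirected, directed = set(all_edges), set()
    # Step 1: orient starting edges (at least one endpoint has an eQTL).
    for edge in sorted(undirected, key=sorted):
        if edge & eqtl_nodes:
            j, k = sorted(edge)
            directed.add(sample_direction(j, k, directed, local_posterior, rng))
            undirected.discard(edge)
    # Step 2: start from a sampled eQTL node and move outward through neighbors.
    queue, visited = deque([rng.choice(sorted(eqtl_nodes))]), set()
    while queue:
        node = queue.popleft()
        if node in visited:
            continue
        visited.add(node)
        for edge in sorted((e for e in all_edges if node in e), key=sorted):
            (other,) = edge - {node}
            if edge in undirected:
                directed.add(sample_direction(node, other, directed, local_posterior, rng))
                undirected.discard(edge)
            queue.append(other)
    return directed


# Hypothetical four-node skeleton with one eQTL node and a toy local posterior.
skeleton = [("X1", "X2"), ("X1", "X3"), ("X2", "X4"), ("X3", "X4")]
toy_posterior = lambda j, k, directed: 0.8 if j < k else 0.2
dag = sample_dag(skeleton, {"X1"}, toy_posterior, random.Random(0))
print(sorted(dag))
```

Repeating the call with different random seeds yields a collection of candidate DAGs, each of which can then be weighted by its global model posterior.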

As an illustrative example, a skeleton among 4 nodes Inline graphic with one eQTL, Inline graphic for Inline graphic, is displayed in Figure 1. The starting edges from the skeleton are Inline graphic and Inline graphic. In Step (1), we randomly sample the direction of each starting edge by their posteriors under the local graphs, Inline graphic and Inline graphic. If the orientation for the starting edges is Inline graphic, in Step (2), we orient edges Inline graphic under the local graph Inline graphic, and Inline graphic under the local graph Inline graphic. The orientation Inline graphic is not allowed because of the acyclic constraint. For computing Inline graphic, MAP and normal-gamma priors are used for Inline graphic and Inline graphic, respectively. The QTL-directed dependency graph (QDG) (Neto and others, 2008) method uses a logarithm of the odds (LOD) score to compare the two possible orientations based on the local graphs Inline graphic versus Inline graphic, and iteratively updates the edge orientation until convergence without acyclic restriction. Thus, their algorithm is only dependent on local graphs, not the global graph structure. Although their local-graph based approach is similar to the procedure of sampling DAGs of Inline graphic, the posterior Inline graphic from our framework is computed by averaging over the high-likelihood DAGs, weighted by their posterior model probability Inline graphic (i.e., the global model posterior). The randomness of our DAG generation procedure comes from sampling directions of the starting edges in Step 1, weighted by their local posterior probabilities; the orientation in Step 2 can be different, depending on the node we start from, as we sequentially orient the edges by moving to the neighboring nodes based on local posteriors. In other words, using local posteriors may provide different DAGs if we use a different node to start. Note that the number of DAGs that we can generate may be limited by the acyclic constraints. From a simulation study (Section S1), however, we found that Inline graphic with different Inline graphic provided stable estimated networks.

Fig. 1.

An example of a skeleton for Inline graphic with an eQTL Inline graphic for Inline graphic

2.5. Simulation

We evaluate the performance of Inline graphic under various simulation settings by generating DAGs with different levels of sparsity, and single or multiple surrogate interventional variables for a gene. We simulate random DAGs following the Erdős and Rényi (1960) (ER) model, with a connection probability Inline graphic, which is the probability that one vertex is connected to any other vertex and controls the sparsity of the graphs, and with a Barabási and Albert (1999) (BA) model with one edge added in each time step (Ha and others, 2016). Denote the total number of eQTLs across Inline graphic genes by Inline graphic, and the number of genes that have at least one eQTL by Inline graphic. We generate random DAGs, as well as gene expression and eQTL genotype datasets as follows:

  1. Simulate the Inline graphic eQTL genotype data matrix Inline graphic that includes Inline graphic random vectors distributed as Inline graphic, where Inline graphic is a predetermined minor allele frequency, and Inline graphic 0, 1, or 2 with probabilities Inline graphic, Inline graphic, and Inline graphic, respectively.

  2. Construct a Inline graphic matrix Inline graphic and a Inline graphic matrix Inline graphic where all elements are zero. The former holds the regression coefficients for eQTL effects and the latter holds the regression coefficients for gene–gene associations.

  3. Generate a random graph from the ER model with the parameter Inline graphic, or the BA model with one edge added at each time step. The graph structure determines the zero structure of Inline graphic, and the nonzero elements are replaced by independent realizations of random variable Inline graphic, where Inline graphic, an exponential distribution with parameter Inline graphic. The sign of each nonzero element is randomly assigned with probability 0.5.

  4. For randomly chosen Inline graphic number of rows of Inline graphic, we assign the Inline graphic eQTLs by randomly selecting nonzero elements and filling in values by realizations of random variable Inline graphic, where Inline graphic with parameter Inline graphic that depends on the number of total eQTLs across all Inline graphic genes. The sign of each nonzero element is randomly assigned with probability 0.5.

  5. Generate Inline graphic gene expression data matrix, Inline graphic by
    graphic file with name M272.gif
    where Inline graphic. We simulate Inline graphic for Inline graphic sequentially, and after simulating each Inline graphic, we scale it so that it has variance 1 before simulating Inline graphic.
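Steps 1 to 5 can be sketched in a few lines. Several quantities are not recoverable from the text above, so the following are assumptions made only for illustration: the default dimensions, the minor allele frequency, the form of the effect-size distribution (a constant offset plus an exponential draw), the way eQTLs are spread over genes, and the use of a fixed vertex ordering to keep the ER graph acyclic. Columns are also centered in addition to being scaled.

```python
import random
import statistics


def simulate(n=200, p=5, q=5, p_q=5, edge_prob=0.3, maf=0.3,
             rate=1.0, offset=0.5, seed=0):
    """Toy version of data-generation Steps 1-5; all defaults are hypothetical."""
    rng = random.Random(seed)

    def effect():
        # Random sign (probability 0.5) times an assumed offset + Exp(rate) magnitude.
        return rng.choice((-1.0, 1.0)) * (offset + rng.expovariate(rate))

    # Step 1: genotypes 0/1/2 ~ Binomial(2, maf), stored one list per eQTL.
    Z = [[(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
         for _ in range(q)]

    # Step 2: zero matrices for eQTL effects (p x q) and gene-gene effects (p x p).
    A = [[0.0] * q for _ in range(p)]
    B = [[0.0] * p for _ in range(p)]

    # Step 3: ER graph; allowing only edges k -> j with k < j keeps it acyclic.
    for j in range(p):
        for k in range(j):
            if rng.random() < edge_prob:
                B[j][k] = effect()

    # Step 4: choose p_q genes and spread the q eQTLs over them.
    genes_with_eqtl = rng.sample(range(p), p_q)
    for l in range(q):
        A[genes_with_eqtl[l % p_q]][l] = effect()

    # Step 5: simulate genes sequentially, standardizing each before the next.
    X = []
    for j in range(p):
        column = [sum(B[j][k] * X[k][t] for k in range(j))
                  + sum(A[j][l] * Z[l][t] for l in range(q))
                  + rng.gauss(0, 1) for t in range(n)]
        mean, sd = statistics.fmean(column), statistics.pstdev(column)
        X.append([(v - mean) / sd for v in column])
    return X, Z, A, B


X, Z, A, B = simulate()
print(len(X), len(X[0]), [round(statistics.pvariance(col), 6) for col in X])
```

Setting `p_q` below `p` mimics the setting in which only a subset of genes have eQTLs, and choosing `q` larger than `p_q` gives some genes multiple eQTLs.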

We fix the data dimension for the simulation study by Inline graphic and Inline graphic to match the dimension with the TCGA breast cancer data in Section 2.6. We consider eight methods in total for comparison: (1) ICA from PC-stable and PenPC algorithms, (2) the QTL-directed dependency graph (QDG) method (Neto and others, 2008), (3) Inline graphic with skeletons from PC-stable algorithm, (4) Inline graphic and Inline graphic with skeletons from PenPC algorithm, and (5) Inline graphic and Inline graphic with skeletons from PenPC algorithm when randomly selected Inline graphic genes have eQTLs. QDG employs a two-step approach to first estimate the skeleton using a PC-stable algorithm, and then orient the edges using eQTL data. A limitation of the QDG algorithm is that all genes in the network should have at least one eQTL (Inline graphic), while Inline graphic allows Inline graphic. To accommodate the limitation of the QDG algorithm, and compare Inline graphic and QDG, we generate the simulation datasets under Inline graphic. Then, for the methods in (5), we randomly select Inline graphic genes among Inline graphic to have eQTLs. The PC-stable algorithm includes one tuning parameter, Inline graphic, which is the significance level for the partial correlation tests. The PenPC algorithm has two steps: neighborhood selection, and a modified PC-stable algorithm. For the first step, the tuning parameters are selected by the extended Bayesian information criterion (Ha and others, 2016), and then the extra tuning parameter for the second step is Inline graphic, which is the same tuning parameter as in the PC-stable algorithm. For Inline graphic, we have hyper-parameters, Inline graphic and Inline graphic in priors (2.6). Because we standardize the datasets before analyzing, we set Inline graphic for all Inline graphic. For Inline graphic, we decide the directions of edges based on the posterior probabilities by the ratio Inline graphic, where Inline graphic is a predetermined cutoff, and we set Inline graphic.

To evaluate the structural difference between an estimated graph and the true graph, we use the structural Hamming distance (SHD), which directly compares the structure of the learned network with that of the original network, where either may be a PDAG or CPDAG (Tsamardinos and others, 2006). SHD is defined as the number of the following operations required to make the two graphs match: add or delete an undirected edge, and add, remove, or reverse the orientation of an edge [Algorithm 4 in (Tsamardinos and others, 2006)]. For a true edge, Inline graphic as an example, Inline graphic, Inline graphic, Inline graphic, and no edge between Inline graphic and Inline graphic provide SHD values of 1 (add), 0, 1 (reverse), 1 (delete), respectively. Figure 2 displays the SHD between the estimated graphs and the true DAG across values of Inline graphic, the tuning parameter of the PC-stable and PenPC algorithms that controls the sparsity of skeletons. We consider four scenarios in Figure 2: (a) ER model with Inline graphic and Inline graphic (all genes have a single eQTL); (b) ER model with Inline graphic and Inline graphic; (c) ER model with Inline graphic and Inline graphic and Inline graphic (all genes have at least one eQTL, i.e., multiple eQTLs); and (d) BA model and Inline graphic. We set Inline graphic and Inline graphic in the data generation Step 4. Our simulation results are based on averages over 100 replications of the data generation Steps 1–5.
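The per-edge costs listed above can be reproduced with a short function. This is a simplified sketch rather than a transcription of Algorithm 4 in Tsamardinos and others (2006): it assumes a partially directed graph is stored as a dictionary with a set of directed (parent, child) pairs and a set of undirected `frozenset` edges, and it counts one unit for every vertex pair whose edge status differs. The names are hypothetical.

```python
from itertools import combinations


def edge_state(graph, a, b):
    """Status of the pair (a, b): 'a->b', 'b->a', 'a-b', or 'none'."""
    if (a, b) in graph["directed"]:
        return "a->b"
    if (b, a) in graph["directed"]:
        return "b->a"
    if frozenset((a, b)) in graph["undirected"]:
        return "a-b"
    return "none"


def shd(estimated, truth, nodes):
    """Count the vertex pairs whose edge status differs between two graphs."""
    return sum(edge_state(estimated, a, b) != edge_state(truth, a, b)
               for a, b in combinations(nodes, 2))


nodes = ["j", "k"]
truth = {"directed": {("j", "k")}, "undirected": set()}
candidates = {
    "undirected": {"directed": set(), "undirected": {frozenset(("j", "k"))}},
    "same": {"directed": {("j", "k")}, "undirected": set()},
    "reversed": {"directed": {("k", "j")}, "undirected": set()},
    "missing": {"directed": set(), "undirected": set()},
}
distances = {name: shd(graph, truth, nodes) for name, graph in candidates.items()}
print(distances)
```

For a single true edge, the undirected, identical, reversed, and missing cases give distances 1, 0, 1, and 1, matching the values stated in the text.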

Fig. 2.

Structural Hamming distance (SHD) according to the significance level (Inline graphic) for various simulation settings for Inline graphic, Inline graphic: (a) ER model with Inline graphic, (b) ER model with Inline graphic, (c) ER model with Inline graphic and multiple eQTLs (Inline graphic eQTLs are randomly assigned to the Inline graphic nodes), and (d) BA model.

Comparison between PC-stable and PenPC. The CPDAG obtained from ICA with PenPC had lower SHD than that obtained with PC-stable across the Inline graphic values, except that PenPC showed slightly higher SHD in the case of the BA model with Inline graphic.

Comparison between QDG and Inline graphic. Since the QDG algorithm uses the PC-stable algorithm for skeleton estimation, we made Inline graphic comparable to QDG by also using PC-stable for its skeletons. Across all simulation settings, the SHD values decreased significantly when using our Inline graphic compared with QDG. The reduction in SHD is largest when the networks are dense (Figure 2b).

Comparison between Inline graphic and Inline graphic. We have two variants of our method, Inline graphic and Inline graphic, which differ in their prior specifications. Across all simulation settings, the two methods performed similarly, although Inline graphic showed slightly higher SHD than Inline graphic when we used only Inline graphic genes with eQTLs in the dense graph case (Figure 2b).

Comparison between QDG and Inline graphic with Inline graphic. When we used the eQTLs for only Inline graphic genes, the estimated graph was still more accurate than that achieved by other methods, including QDG, across all Inline graphic values.

Overall, Inline graphic with the PenPC algorithm showed the best performance across the various simulation settings, and Inline graphic and Inline graphic showed similar accuracy in structure estimation. The main advantage of Inline graphic over the QDG algorithm is that it remains applicable, with accuracy gains, when only a subset of the genes in a network have eQTLs. We also evaluated the performance of Inline graphic when eQTLs for a gene are in linkage disequilibrium (LD), and our methods showed better performance than the ICA algorithms (Section S1 of the supplementary material available at Biostatistics online).

2.6. Application to TCGA breast cancer data

We apply our method to a breast cancer study in The Cancer Genome Atlas (TCGA). We use RNA-seq data from tumor tissues obtained from 550 female Caucasian patients and focus on genes in the pathways PI3K/AKT, P53, cell cycle, apoptosis, mTOR, MAPK, RAS, and ERBB, which we selected based on the literature, namely the TCGA breast cancer study (Cancer Genome Atlas Network, 2012) and a pan-cancer study (Akbani and others, 2014). We selected 764 genes that are included in at least one of the pathways using the KEGG database (http://www.genome.jp/kegg). The expression of each gene within each sample is measured by the total read count (TReC), and we use the log-transformed TReC (logTReC) in this study. A Inline graphic residual data matrix is obtained after removing the effects of several important covariates by linear regression: the 75th percentile of logTReC (which captures read depth), plate, institution, age, and six genotype principal components. Our goal is to estimate a network among the 764 genes that are included in these eight pathways, which are important in cancer progression.
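The covariate adjustment described above amounts to replacing each gene's expression vector by its ordinary least-squares residuals. The sketch below illustrates this for one gene. It is dependency-free for readability and uses Gram-Schmidt orthogonalization, whereas in practice a linear algebra or regression routine would normally be used. The function name `residualize` and the toy numbers are illustrative assumptions, not the authors' code or data.

```python
def residualize(y, covariates):
    """Residuals of y after least-squares regression on an intercept plus covariates.

    y          : list of n expression values for one gene
    covariates : list of covariate columns, each a list of n numbers
    """
    n = len(y)
    columns = [[1.0] * n] + [[float(v) for v in col] for col in covariates]

    # Build an orthonormal basis of the column space (Gram-Schmidt).
    basis = []
    for col in columns:
        v = col[:]
        for q in basis:
            coef = sum(a * b for a, b in zip(v, q))
            v = [a - coef * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-12:  # skip columns that are linearly dependent
            basis.append([a / norm for a in v])

    # Subtract the projection of y onto each basis vector.
    resid = [float(a) for a in y]
    for q in basis:
        coef = sum(a * b for a, b in zip(resid, q))
        resid = [a - coef * b for a, b in zip(resid, q)]
    return resid


# Toy example: one gene, one covariate (e.g. a read-depth proxy).
depth = [0.0, 1.0, 2.0, 3.0]
expression = [1.0, 2.0, 2.0, 5.0]
print(residualize(expression, [depth]))
```

Applying such a function to every gene, with the full set of covariates, would give the residual data matrix used for network estimation.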

Applying the PenPC algorithm, we obtained the GGM (Step 1) with 13 061 undirected edges and the skeleton (Step 2) with 2255 undirected edges, using the p-value cutoff Inline graphic. Node degrees in the skeleton ranged from 1 to 14, and the NCK2 and RXRA genes had the highest degrees. We then oriented the undirected edges in the skeleton using cis-eQTLs for 76 genes (10%), keeping the most significant cis-eQTL per gene. The cis-eQTLs were identified by eQTL mapping using both TReC and ASE (Sun, 2012). After applying Inline graphic, we oriented 2092 edges (93%).

We evaluated the prognostic utility of the estimated network in selecting gene signatures. Using gene-specific causal modules, each consisting of a gene and its parent set in the network from Inline graphic, we identified genes that had a significant effect on patients’ survival times. The unsupervised causal structural learning by Inline graphic provides a PDAG. The main idea is that when we model a gene in relation to patient survival times, we adjust for its parent genes, which are obtained from the PDAG. If the DAG is known, the parent genes of every gene are uniquely determined. Using the unique module for each gene, we can estimate the effect of a gene on survival time by including the expression of its parent genes in a Cox proportional hazards model. However, the remaining undirected edges in the PDAG create uncertainty in determining the parent genes, because an undirected edge may take either direction. To obtain reasonable estimates of the causal effect while accounting for this uncertainty, we estimate the lower bound of the effect size of the gene on survival time, similar to the method of Maathuis and others (2009). This lower bound is calculated by switching the directions of the undirected edges, subject to the constraint that no extra v-structure or directed cycle is created, and taking the minimum absolute value of the resulting coefficients. In particular, if all edges between gene Inline graphic and its neighboring genes are directed, then we have a single module for gene Inline graphic, and this module is used to compute the prognostic effect of the gene using the Cox proportional hazards model, with the genes in the module as covariates. For genes that have undirected edges to some neighboring genes, we compute multiple candidate modules for the gene within the equivalence class of the PDAG; this results in multiple prognostic effects for the gene, denoted by Inline graphic for gene Inline graphic and Inline graphic, the DAGs that are in the Markov equivalence class of the PDAG. We use Inline graphic, where Inline graphic, for the prognostic gene ranking. The detailed algorithm for constructing multiple modules for a gene is described in Maathuis and others (2009).
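The lower-bound step can be sketched as follows, assuming that the Cox regression coefficient of the gene has already been estimated once for each candidate module. The function names `effect_lower_bound` and `rank_genes`, and the numbers used below, are hypothetical and serve only to illustrate taking the minimum absolute coefficient and ranking genes by it.

```python
def effect_lower_bound(candidate_effects):
    """Return (size, sign) of the coefficient with the smallest absolute value.

    candidate_effects: coefficients of one gene, one per candidate parent set
    (i.e. one per admissible orientation of its undirected edges).
    """
    best = min(candidate_effects, key=abs)
    sign = 1 if best > 0 else -1 if best < 0 else 0
    return abs(best), sign


def rank_genes(effects_by_gene):
    """Order genes by decreasing lower bound of the absolute effect size."""
    bounds = {g: effect_lower_bound(e)[0] for g, e in effects_by_gene.items()}
    return sorted(bounds, key=bounds.get, reverse=True)


# Hypothetical coefficients for three genes (not values from the paper).
effects = {
    "geneA": [-1.2, -0.8, -1.5],  # three candidate modules
    "geneB": [0.4],               # all neighbors directed: a single module
    "geneC": [0.9, -0.1],         # sign and size depend on the orientation
}
print(effect_lower_bound(effects["geneA"]))
print(rank_genes(effects))
```

Ranking by the minimum absolute value is conservative: a gene ranks highly only if its effect remains large under every admissible orientation of its undirected edges.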

The resulting PDAG from Inline graphic is displayed in Figure 3, where the nodes are weighted and colored by the effect sizes, Inline graphic, and signs, Inline graphic. To evaluate the performance of the edge orientation, we compared Inline graphic, which exploits the external cis-eQTLs, with the ICA using the PenPC algorithm (Section 2.2). The network obtained from Inline graphic produced stronger effect sizes for 38 genes than that obtained from the ICA. Our method identified the collagen gene family, including types 1, 4, and 6, which have been shown to play important roles in breast cancer progression by altering the extracellular matrix architecture and composition (Barsky and others, 1983; Kauppila and others, 1998; Burnier and others, 2011; Fang and others, 2014). Table 1 shows the top 10 genes identified by Inline graphic, their effect sizes, their concordance indices (c-index) (Therneau and Grambsch, 2013) on test data using 10-fold cross-validation, and their parent sets from Inline graphic and ICA. The signs of the effect sizes were consistently estimated between Inline graphic and ICA, except for the top gene, COL4A2, which showed a 1147.03% relative increase in effect size from Inline graphic compared to ICA. Moreover, the c-indices of Inline graphic and ICA for the gene COL4A2 were 0.58 and 0.51, respectively. The superior prognostic power of our method for the gene COL4A2 comes from the different choices of directions for the neighboring genes THBS2 and COL4A1, which were selected as parents in the network from Inline graphic, whereas they were selected as children of COL4A2 in the network from the CPDAG. Similarly, the COL1A1 gene showed a 1230.65% increase in effect size when the COL1A2 gene was selected as its parent gene using Inline graphic, while the signs were the same. Overall, Inline graphic showed better prognostic power than ICA across the top genes, based on the c-indices evaluated on test data (Table 1).
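For readers unfamiliar with the c-index, the sketch below computes a simplified version of Harrell's concordance index for right-censored survival data: among all pairs in which the patient with the shorter follow-up time had an observed event, it is the fraction of pairs in which that patient also has the higher predicted risk, with ties in risk counted as one half. This is an illustrative assumption about the quantity reported, not the authors' implementation, and it ignores refinements such as the handling of tied event times.

```python
def concordance_index(times, events, risk_scores):
    """Simplified Harrell's c-index for right-censored data.

    times       : follow-up times
    events      : 1 if the event was observed, 0 if censored
    risk_scores : predicted risk (higher means shorter expected survival),
                  e.g. the linear predictor of a Cox model
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when the shorter time ends in an observed event.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
print(concordance_index(times, events, [4, 3, 2, 1]))  # risk decreases with time
print(concordance_index(times, events, [1, 1, 1, 1]))  # uninformative scores
```

A value of 0.5 corresponds to an uninformative predictor, which is the natural baseline for the c-indices reported in Table 1.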

Fig. 3.

TCGA breast cancer-specific networks for PI3K/AKT, P53, cell cycle, apoptosis, mTOR, MAPK, RAS, and ERBB pathways. The node sizes are weighted by the effect sizes (absolute values of the coefficients of the gene) in relation to patients’ survival times; green (red) nodes indicate positive (negative) effects.

Table 1.

Top 10 cancer genes (BRCA) we identified within the data set of 764 genes that are included in the 8 major pathways, ranked by the Inline graphic model: effects on survival time, C-index (SD) to test data using 10-fold cross-validation and the parent genes, from Inline graphic and ICA, and relative increase in effect size resulting from Inline graphic compared to ICA

Rank | Gene | Inline graphic: Effect Inline graphic | Inline graphic: C-index (SD) | Inline graphic: Parents | ICA: Effect Inline graphic | ICA: C-index (SD) | ICA: Parents | Relative increase Inline graphic
1 | COL4A2 | −1.19 | 0.58 (0.036) | THBS2, COL4A1, FLNA | 0.1 | 0.51 (0.038) | FLNA | 1147.03%
2 | LAMB1 | −0.68 | 0.60 (0.082) | LAMA1, SEPT4, LAMA4 | −0.68 | 0.60 (0.082) | LAMA1, SEPT4, LAMA4 | 0%
3 | CDC20 | −0.66 | 0.58 (0.016) | MKNK1, ORC1, PLK1 | −0.74 | 0.53 (0.085) | MKNK1, HDAC1, CDK7, PLK1 | −11.8%
4 | COL6A2 | 0.61 | 0.57 (0.08) | COL6A1 | 0.61 | 0.57 (0.08) | COL6A1 | 0%
5 | COL1A1 | −0.61 | 0.52 (0.08) | COL1A2 | −0.05 | 0.50 (0.09) | (none) | 1230.65%
6 | PKMYT1 | −0.6 | 0.64 (0.04) | E2F2, RB1, PLK1 | −0.73 | 0.54 (0.01) | E2F2, HSP90AB1, TSC2, RB1, SESN3, CHEK1, PLK1 | −18.29%
7 | CTSB | 0.59 | 0.61 (0.08) | RRAGB, FN1, PIK3R3, BCL2A1, CTSS | 0.61 | 0.58 (0.02) | CTSZ, FN1, PIK3R3, BCL2A1, CTSS | −3.8%
8 | ORC6 | 0.58 | 0.62 (0.01) | CDC45, RBL2, FLNB, SIAH1 | 0.58 | 0.62 (0.01) | CDC45, RBL2, FLNB, SIAH1 | 0%
9 | MAP3K12 | 0.52 | 0.58 (0.095) | LAMB2, SYNGAP1 | 0.49 | 0.52 (0.067) | EIF4B, PLA2G4C, MAPK8IP1, CACNA2D4, LAMB2, SYNGAP1 | 6.92%
10 | BUB1B | 0.51 | 0.57 (0.047) | CCNB2 | 0.26 | 0.52 (0.041) | BRCA1, ESPL1 | 99.65%

3. Discussion

Estimation of a DAG based on observational data is a challenging problem, because the conditional independence relations implied by a distribution satisfying the Markov property may be represented by several DAGs. We have developed a method to estimate the DAG when there is an additional set of variables that are subject to interventions and are direct causes of the variables in the DAG. Simulation studies demonstrate the satisfactory performance of our method. We apply our method to construct a regulatory network from high-dimensional gene expression data, where we use genotype data of DNA polymorphisms as surrogate interventional data. Using the regulatory gene modules derived from the graph structure estimated by Inline graphic, we showed the prognostic performance of our method.

Our method is based on the assumptions that the gene expressions Inline graphic follow a multivariate Gaussian distribution and that there are no unobserved confounding variables. Faithfulness (which is more general than the Gaussian assumption) and the absence of hidden confounders are fundamental assumptions required for the identification of causal DAGs. Both assumptions are infeasible to check; however, we evaluated our approach in the framework of prognostic modeling using real data. For the external variables Inline graphic, we impose no distributional assumption, but, as a sufficient condition for admitting Inline graphic as surrogate variables, we require that each DNA variant of a gene has no direct effect on other genes. This assumption on Inline graphic implies that regulatory site-sharing between genes is not allowed, although genetic variants may shape multiple phenotypes (pleiotropy) (Tong and others, 2017).

Our method provides a flexible modeling framework by incorporating other types of DNA data, including methylation, copy number, and mutation as surrogate interventions for gene expressions. With the appropriately selected surrogate interventions using the biological hierarchy and filtering steps, Inline graphic can be extended to other integrative modeling frameworks incorporating genomic, epigenomic, transcriptomic, and proteomic data.

Supplementary Material

kxy080_Supplementary_Materials

Acknowledgments

Conflict of Interest: None declared.

4. Software

Software in the form of R code, together with a sample input data set and complete documentation, is available at https://github.com/MinJinHa/sirDAG.

References

  1. Akbani R., Ng P. K. S., Werner H. M. J., Shahmoradgoli M., Zhang F., Ju Z., Liu W., Yang J.-Y., Yoshihara K., Li J. and others (2014). A pan-cancer proteomic perspective on The Cancer Genome Atlas. Nature Communications 5, 3887.
  2. Andersson S. A., Madigan D. and Perlman M. D. (1997). A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics 25, 505–541.
  3. Barabási A.-L. and Albert R. (1999). Emergence of scaling in random networks. Science 286, 509–512.
  4. Bareinboim E. and Pearl J. (2012). Causal inference by surrogate experiments: z-identifiability. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (pp. 113–120). AUAI Press.
  5. Barsky S. H., Togo S., Garbisa S. and Liotta L. A. (1983). Type IV collagenase immunoreactivity in invasive breast carcinoma. The Lancet 321, 296–297.
  6. Burgess S., Small D. S. and Thompson S. G. (2017). A review of instrumental variable estimators for Mendelian randomization. Statistical Methods in Medical Research 26, 2333–2355.
  7. Burnier J. V., Wang N., Michel R. P., Hassanain M., Li S., Lu Y., Metrakos P., Antecka E., Burnier M. N., Ponton A. and others (2011). Type IV collagen-initiated signals provide survival and growth cues required for liver metastasis. Oncogene 30, 3766.
  8. Cai X., Bazerque J. A. and Giannakis G. B. (2013). Inference of gene regulatory networks with sparse structural equation models exploiting genetic perturbations. PLoS Computational Biology 9, e1003068.
  9. Chen L. S., Emmert-Streib F., Storey J. D. and others (2007). Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biology 8, R219.
  10. Chickering D. M. (2002). Learning equivalence classes of Bayesian-network structures. The Journal of Machine Learning Research 2, 445–498.
  11. Chickering D. M. (2003). Optimal structure identification with greedy search. The Journal of Machine Learning Research 3, 507–554.
  12. Colombo D. and Maathuis M. H. (2012). A modification of the PC algorithm yielding order-independent skeletons. CoRR, abs/1211.3295.
  13. Doss S., Schadt E. E., Drake T. A. and Lusis A. J. (2005). Cis-acting expression quantitative trait loci in mice. Genome Research 15, 681–691.
  14. Efron B. and Tibshirani R. (2002). Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology 23, 70–86.
  15. Erdős P. and Rényi A. (1960). On the evolution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences 5, 17–61.
  16. Fang M., Yuan J., Peng C. and Li Y. (2014). Collagen as a double-edged sword in tumor progression. Tumor Biology 35, 2871–2882.
  17. Ha M. J., Sun W. and Xie J. (2016). PenPC: a two-step approach to estimate the skeletons of high-dimensional directed acyclic graphs. Biometrics 72, 146–155.
  18. Hageman R. S., Leduc M. S., Korstanje R., Paigen B. and Churchill G. A. (2011). A Bayesian framework for inference of the genotype–phenotype map for segregating populations. Genetics 187, 1163–1170.
  19. Hoeting J. A., Madigan D., Raftery A. E. and Volinsky C. T. (1999). Bayesian model averaging: a tutorial. Statistical Science, 1, 382–401.
  20. Kalisch M. and Bühlmann P. (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. The Journal of Machine Learning Research 8, 613–636.
  21. Kauppila S., Stenbäck F., Risteli J., Jukkola A. and Risteli L. (1998). Aberrant type I and type III collagen gene expression in human breast cancer in vivo. The Journal of Pathology 186, 262–268.
  22. Kendziorski C. and Wang P. (2006). A review of statistical methods for expression quantitative trait loci mapping. Mammalian Genome 17, 509–517.
  23. Kulp D. C. and Jagalur M. (2006). Causal inference of regulator-target pairs by gene mapping of expression phenotypes. BMC Genomics 7, 125.
  24. Li R., Tsaih S.-W., Shockley K., Stylianou I. M., Wergedal J., Paigen B. and Churchill G. A. (2006). Structural model analysis of multiple quantitative traits. PLoS Genetics 2, e114.
  25. Logsdon B. A. and Mezey J. (2010). Gene expression network reconstruction by convex feature selection when incorporating genetic perturbations. PLoS Computational Biology 6, e1001014.
  26. Maathuis M. H., Kalisch M. and Bühlmann P. (2009). Estimating high-dimensional intervention effects from observational data. The Annals of Statistics 37, 3133–3164.
  27. Mazumder R., Friedman J. H. and Hastie T. (2011). Sparsenet: coordinate descent with nonconvex penalties. Journal of the American Statistical Association 106, 1125–1138.
  28. Meek C. (1995). Causal inference and causal explanation with background knowledge. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., pp. 403–410.
  29. Neto E. C., Ferrara C. T., Attie A. D. and Yandell B. S. (2008). Inferring causal phenotype networks from segregating populations. Genetics 179, 1089–1100.
  30. Neto E. C., Keller M. P., Attie A. D. and Yandell B. S. (2010). Causal graphical models in systems genetics: a unified framework for joint inference of causal network and genetic architecture for correlated phenotypes. The Annals of Applied Statistics 4, 320.
  31. Cancer Genome Atlas Network (2012). Comprehensive molecular portraits of human breast tumors. Nature 490, 61.
  32. Pearl J. (2000). Causality: Models, Reasoning and Inference, Volume 29 Cambridge University Press.
  33. Pearl J. (2009). Causality: Models, Reasoning and Inference. Cambridge University Press.
  34. Ritchie M. E., Phipson B., Wu D., Hu Y., Law C. W., Shi W. and Smyth G. K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43, e47–e47.
  35. Schadt E. E., Lamb J., Yang X., Zhu J., Edwards S., Guhathakurta D., Sieberts S. K., Monks S., Reitman M., Zhang C. and others (2005). An integrative genomics approach to infer causal associations between gene expression and disease. Nature Genetics 37, 710–717.
  36. Schmidt M., Niculescu-Mizil A. and Murphy K. (2007). Learning graphical model structure using L1-regularization paths. Proceedings of the 22nd national conference on Artificial intelligence-Volume 2 (pp. 1278–1283). AAAI Press.
  37. Spirtes P., Glymour C. N. and Scheines R. (2000). Causation, Prediction and Search, Volume 81 MIT Press, Cambridge, MA.
  38. Sun W. (2012). A statistical framework for eqtl mapping using RNA-seq data. Biometrics 68, 1–11.
  39. Sun W., Ibrahim J. G. and Zou F. (2010). Genomewide multiple-loci mapping in experimental crosses by iterative adaptive penalized regression. Genetics 185, 349–359.
  40. Sun W., Yu T. and Li K.-C. (2007). Detection of eQTL modules mediated by activity levels of transcription factors. Bioinformatics 23, 2290–2297.
  41. Therneau T. M. and Grambsch P. M. (2013). Modeling survival data: extending the Cox model. Springer Science & Business Media.
  42. Tong P., Monahan J. and Prendergast J. G. D. (2017). Shared regulatory sites are abundant in the human genome and shed light on genome evolution and disease pleiotropy. PLoS Genetics 13, e1006673.
  43. Tsamardinos I., Brown L. E. and Aliferis C. F. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning 65, 31–78.


Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
