Skip to main content
PLOS Computational Biology logoLink to PLOS Computational Biology
. 2013 May 23;9(5):e1003068. doi: 10.1371/journal.pcbi.1003068

Inference of Gene Regulatory Networks with Sparse Structural Equation Models Exploiting Genetic Perturbations

Xiaodong Cai 1,*, Juan Andrés Bazerque 2, Georgios B Giannakis 2
Editor: Satoru Miyano3
PMCID: PMC3662697  PMID: 23717196

Abstract

Integrating genetic perturbations with gene expression data not only improves accuracy of regulatory network topology inference, but also enables learning of causal regulatory relations between genes. Although a number of methods have been developed to integrate both types of data, the desiderata of efficient and powerful algorithms still remains. In this paper, sparse structural equation models (SEMs) are employed to integrate both gene expression data and cis-expression quantitative trait loci (cis-eQTL), for modeling gene regulatory networks in accordance with biological evidence about genes regulating or being regulated by a small number of genes. A systematic inference method named sparsity-aware maximum likelihood (SML) is developed for SEM estimation. Using simulated directed acyclic or cyclic networks, the SML performance is compared with that of two state-of-the-art algorithms: the adaptive Lasso (AL) based scheme, and the QTL-directed dependency graph (QDG) method. Computer simulations demonstrate that the novel SML algorithm offers significantly better performance than the AL-based and QDG algorithms across all sample sizes from 100 to 1,000, in terms of detection power and false discovery rate, in all the cases tested that include acyclic or cyclic networks of 10, 30 and 300 genes. The SML method is further applied to infer a network of 39 human genes that are related to the immune function and are chosen to have a reliable eQTL per gene. The resulting network consists of 9 genes and 13 edges. Most of the edges represent interactions reasonably expected from experimental evidence, while the remaining may just indicate the emergence of new interactions. The sparse SEM and efficient SML algorithm provide an effective means of exploiting both gene expression and perturbation data to infer gene regulatory networks. An open-source computer program implementing the SML algorithm is freely available upon request.

Author Summary

Deciphering the structure of gene regulatory networks is crucial for understanding gene functions and cellular dynamics, as well as system-level modeling of individual genes and cellular functions. Computational methods exploiting gene expression and other types of data generated from high-throughput experiments provide an efficient and low-cost means of inferring gene networks. Sparse structural equation models are employed to: i) integrate both gene expression and genetic perturbation data for inference of gene networks; and, ii) develop an efficient sparsity-aware inference algorithm. Computer simulations corroborate that the novel algorithm markedly outperforms state-of-the-art alternatives. The algorithm is further applied to infer a real human gene network unveiling possible interactions between several genes. Since gene networks can be perturbed not only by genetic variations but also by other means such as gene copy number changes, gene knockdown or controlled gene over-expression, this paper's method can be applied to a number of practical scenarios.

Introduction

Genes in living organisms do not function in isolation, but may interact with each other and act together forming intricate networks [1]. Deciphering the structure of gene regulatory networks is crucial for understanding gene functions and cellular dynamics, as well as for system-level modeling of individual genes and cellular functions. Although physical interactions among individual genes can be experimentally deduced (e.g., by identifying transcription factors and their regulatory target genes or discovering protein-protein interactions), such experimental approach is time-consuming and labor intensive. Given the explosive number of combinations of genes involved in any possible gene interaction, such an approach may not be practically feasible to reconstruct or “reverse engineer” gene networks. On the other hand, technological advances allow for high-throughput measurement of gene expression levels to be carried out efficiently and in a cost-effective manner. These genome-wide expression data reflect the state of the underlying network in a specific condition and provide valuable information that can be fruitfully exploited to infer the network structure.

Indeed, a number of computational methods have been developed to infer gene networks from gene expression data. One class leverages a similarity measure, such as the correlation or mutual information present in pairs of genes, to construct a so-termed co-expression or relevance network [2], [3]. Another approach relies on Gaussian graphical models with edges being present (absent) if the corresponding gene pairs are conditionally dependent (respectively independent), given expression levels of all other genes [4], [5]. While the approach based on Gaussian graphical models entails undirected graphs, directed acyclic graphs (DAGs) or Bayesian networks have also been employed to infer the dependency structure among genes [6], [7]. The fourth approach employs linear regression models and associated inference methods to find the dependency among genes and to infer gene networks [8][11]. Finally, while these approaches use gene expression data in the steady-state, several methods exploiting time-series expression data have also been reported; see e.g., [12], [13] and references therein.

Recently, gene expression data from gene-knockout experiments have been combined with time series comprising gene expression data with perturbations to considerably improve the accuracy of network inference [14]. When a gene is knocked out or silenced, expression levels of other genes are perturbed. Different from using gene expression levels of the original network alone, comparing gene expression levels in the perturbed network with those in the original network reveals extra information about the underlying network structure. Gene perturbations can be performed with other experimental approaches such as controlled gene over-expression and treatment of cells with certain chemical compounds [8], [9]. However, these gene perturbation experiments may not be feasible for all genes or organisms. To overcome this hurdle, one can exploit naturally occurring genetic variations that can be viewed as perturbations to gene networks [15]. More importantly, such genetic variations enable inference of the causal relationship between different genes or between genes and certain phenotypes.

Several approaches are available to capitalize on both genetic variations and gene expression data for inference of gene networks. The first approach models a gene network as a Bayesian network, and then infers the network by incorporating prior information about the network obtained from expression quantitative trait loci (eQTLs) [16][18]. In the second approach, a likelihood test is employed to search for a casual model that “best” explains the observed gene expression and eQTL data [19][23]. The third approach relies on the structural equation model (SEM) to infer gene [24][27] or phenotype networks [28][34]. While these approaches focus on inference of gene networks incorporating information from eQTL, another approach employs both phenotype and QTL genotype data to jointly decipher the phenotype network and identify eQTLs that are causal for each phenotype [35]. Logsdon and Mezey [26] proposed an adaptive Lasso (AL) [36] based algorithm to infer gene networks modeled with an SEM. They compared the performance of a number of methods using simulated directed acyclic or cyclic networks. Their simulations showed that the AL-based algorithm outperformed all other methods tested. Despite its superiority over other methods, the AL-based algorithm does not fully exploit the structure of the SEM. Therefore, it is expected that a more systematic inference algorithm may significantly improve the performance of the SEM-based approach.

Motivated by the fact that gene networks or more general biochemical networks are sparse [8], [37][39], a sparse SEM is advocated in this paper to infer gene networks from both gene expression and eQTL data. Incorporating network sparsity constraints, a sparsity-aware maximum likelihood (SML) algorithm is developed for network topology inference. The core technique used is to maximize the likelihood function regularized by the Inline graphic-norm of the parameter vector determining the network structure. The Inline graphic-norm controls complexity of the SEM, and thus yields a sparse network. The key innovative element of the SML algorithm is a block coordinate ascent method derived to maximize the Inline graphic-regularized likelihood function, which makes the SML algorithm computationally efficient. The simulations provided demonstrate that the novel SML algorithm offers significantly better performance than the two state-of-the-art algorithms: the AL [26], and the QDG algorithm [21]. The SML algorithm is further applied to infer a human network of 39 human genes related to the immune function.

Results

Sparse SEM model for gene regulatory networks

Consider expression levels of Inline graphic genes from Inline graphic individuals measured using e.g., microarray or RNA-seq. Let Inline graphic denote the Inline graphic vector collecting the expression levels of these Inline graphic genes of individual Inline graphic. Suppose that a set of perturbations to these genes has been also observed. These perturbations can be due to naturally occurring genetic variations near or within the genes, gene copy number changes, gene knockdown by RNAi or controlled gene over-expression. In this paper, focus is placed on genetic variations observed at eQTLs, although the network model and the inference method described in the next section are also applicable to cases where other perturbations are available. As in [26], it is assumed that each gene has at least one cis-eQTL so that the structure of the underlying gene network is uniquely identifiable. Let Inline graphic denote the genotype of Inline graphic eQTLs of individual Inline graphic. The goal is to infer the network structure of the Inline graphic genes from the available gene expression measurements Inline graphic, Inline graphic, and eQTL observations Inline graphic, Inline graphic.

As in [25], [26], the gene network is postulated to obey the SEM

graphic file with name pcbi.1003068.e018.jpg (1)

where Inline graphic matrix Inline graphic contains unknown parameters defining the network structure; Inline graphic matrix Inline graphic captures the effect of each eQTL; Inline graphic vector Inline graphic accounts for possible model bias; and Inline graphic vector Inline graphic captures the residual error, which is modeled as a zero-mean Gaussian vector with covariance Inline graphic, where Inline graphic denotes the Inline graphic identity matrix. It is assumed that no self-loops are present per gene, which implies that the diagonal entries of Inline graphic are zero. As mentioned in [26], lack of self-loops and a diagonal covariance matrix of Inline graphic are commonly assumed in almost all graph-based network inference methods. It is further assumed that the loci of Inline graphic eQTLs have been determined using an existing eQTL method, but the effective size of each eQTL is unknown. Therefore, Inline graphic has Inline graphic unknown entries whose locations are known and Inline graphic remaining zero entries (for instance Inline graphic is a diagonal matrix when Inline graphic).

The network inference task is to estimate Inline graphic unknown entries of Inline graphic, and as a byproduct, the Inline graphic unknown entries of Inline graphic. Without any knowledge about the network, no restriction is imposed on the structure specified by Inline graphic. Therefore, the network is considered as a general directed graph that can possibly be a directed cyclic graph (DCG) or a DAG. Network inference is challenging since the number of unknowns to be estimated is very large for a moderately large Inline graphic. Note that under the assumption that each gene has at least one cis-eQTL, the “Recovery” Theorem in [26] guarantees that the network is identifiable for both DCGs and DAGs.

As discussed in [8], [37][39], gene regulatory networks or more general biochemical networks are sparse meaning that a gene directly regulates or is regulated by a small number of genes relative to the total number of genes in the network. Taking into account sparsity, only a relatively small number of the entries of Inline graphic are nonzero. These nonzero entries determine the network structure and the regulatory effect of one gene on other genes. The SEM in (1) under the aforementioned sparsity assumption will be henceforth referred to as the sparse SEM. Exploiting the sparsity inherent to the network, an efficient and powerful algorithm for network inference will be developed in the ensuing section.

Sparsity-aware inference method

Upon defining Inline graphic, Inline graphic, and Inline graphic, the SEM in (1) can be compactly written as Inline graphic, where 1 is the Inline graphic vector of all-ones. Given Inline graphic and Inline graphic, the log-likelihood function can be written as

graphic file with name pcbi.1003068.e052.jpg (2)

where Inline graphic denotes matrix determinant, and Inline graphic denotes the Frobenius norm.

As mentioned earlier, Inline graphic is a sparse matrix having most entries equal to zero. In order to obtain a sparse estimate of Inline graphic, the natural approach is to maximize the log likelihood regularized by the weighed Inline graphic term Inline graphic, where Inline graphic denotes the Inline graphicth entry of Inline graphic. In a linear regression model, it is well known that the Inline graphic-regularized least-squares estimation also known as Lasso [40] can yield a sparse estimate of the regression coefficient vector. Similarly, the Inline graphic-regularized maximum likelihood (ML) approach used here is expected to shrink most of the entries of Inline graphic toward zero, thereby yielding a sparse matrix. It is easy to show that maximizing Inline graphic with respect to (w.r.t.) Inline graphic yields Inline graphic, where Inline graphic and Inline graphic. Upon defining Inline graphic, Inline graphic, Inline graphic, Inline graphic, and substituting Inline graphic for Inline graphic in (2), the proposed Inline graphic-penalized ML estimation approach yields

graphic file with name pcbi.1003068.e077.jpg (3)

where Inline graphic denotes the set of row and column indices of the entries of Inline graphic known to be zero. As assumed earlier, each phenotype has at least one cis-eQTL that has been identified, which implies that the locations of nonzero entries of Inline graphic or equivalently the set Inline graphic is known. However, our sparse SEM and inference method are also applicable to more general cases where some or all phenotypes have cis-eQTLs that have not been identified. In these cases, the locations of nonzero entries of Inline graphic corresponding to the unidentified cis-eQTLs are unknown. We can form a weighted Inline graphic-norm of the entries of Inline graphic excluding those corresponding to the identified cis-eQTL and then add a penalty term involving this Inline graphic-norm to the objective function in (3). This new optimization problem can be solved efficiently using a method modified from the one solving (3), as it is described in the supporting text S1.

Weights Inline graphic in the penalty term are introduced to improve estimation accuracy in line with the AL [36]. They are selected as Inline graphic, where Inline graphic is found using a preliminary estimate of Inline graphic obtained via ridge regression as

graphic file with name pcbi.1003068.e090.jpg (4)

The sparsity-controlling parameters Inline graphic in (3) and Inline graphic in (4) are selected via cross validation (CV), while Inline graphic is estimated as the sample variance of the error using Inline graphic and Inline graphic. In adaptive Lasso based linear regression [36], Zou suggested using the ordinary least squares (OLS) estimate to determine the weights; if the OLS estimate does not exist due to, e.g., collinearity, Zou suggested the estimate obtained from ridge regression, although it remains to show if the ridge regression estimate is consistent in this case and if the resulting adaptive Lasso yields the desired oracle properties. If OLS is used for estimating Inline graphic and Inline graphic in the SEM, the solution usually does not exist since the number of unknowns is typically larger than the number of samples. However, even in this case the solution can always be obtained from ridge regression as in (4). Moreover, every entry of the solution is typically nonzero, which yields a finite weight for every variable, and thus every variable will be included in the following Inline graphic-penalized ML procedure. An alternative approach is to replace the weighed Inline graphic-norm in (3) with an unweighted Inline graphic-norm to obtain a preliminary estimate of Inline graphic and then calculate the weights from this preliminary estimate, as in [26]. However, the unweighted Inline graphic-penalized ML procedure may shrink many variables to zero and exclude them from the weighted Inline graphic-penalized ML estimator, possibly yielding a biased estimate. For this reason, the inference method in this paper uses ridge regression to determine Inline graphic, with the additional advantage of (4) admitting a closed-form solution.

A block diagram of the novel inference algorithm, abbreviated as the sparsity-aware maximum likelihood (SML) algorithm, is depicted in Figure 1. The first and third blocks in Figure 1 perform cross-validation to select optimal parameters Inline graphic and Inline graphic to be used in (3) and (4), respectively (see the description of the cross-validation procedure in the supporting text S1.) The third block produces weights Inline graphic and error-variance estimate Inline graphic after solving (4). Finally, the fourth block takes data Inline graphic and Inline graphic together with Inline graphic, Inline graphic and Inline graphic and solves (3) to yield Inline graphic, representing the SML estimator for Inline graphic in (1) and revealing the genetic-interaction network. As it will be described in the Methods section, (4) is separable across rows of Inline graphic and Inline graphic, and each row of Inline graphic and Inline graphic becomes available in closed form [cf. (8)–(9)]. The Inline graphic-regularized ML problem (3) is solved efficiently using a novel block coordinate ascent iterative scheme given by (11)–(16) in the Methods section. Precise description of the overall SML algorithm is also presented in the Methods section as Algorithm 1, which was used to yield an executable computer program.

Figure 1. Block diagram of the sparsity-aware maximum likelihood (SML) algorithm.

Figure 1

The first and third blocks perform cross-validation to select optimal parameters Inline graphic and Inline graphic to be used in (3) and (4), respectively. The second block produces weights Inline graphic and error-variance estimate Inline graphic after solving (4). Finally, the fourth block takes data Inline graphic and Inline graphic together with Inline graphic, Inline graphic and Inline graphic and solves (3) to yield Inline graphic, which represents the SML estimator for Inline graphic in (1) revealing the genetic-interaction network. A more detailed description of the SML algorithm is given in Algorithm 1 in the Methods section.

Table Algorithm 1. Algorithm 1. SML.

1: Select the optimal value of Inline graphic in (4), Inline graphic, via cross validation
2: Solve (4) with Inline graphic for Inline graphic and Inline graphic
3: Estimate Inline graphic as the sample variance of E = Inline graphic
4: Compute weights Inline graphic
5: Compute Inline graphic via (S2) Inline graphic
6: Compute Inline graphic via (S9)
7: Select the optimal value of Inline graphic, Inline graphic, via cross validation
8: for Inline graphic do
9: Compute Inline graphic via (S4)
10:  Initialize Inline graphic, Inline graphic, Inline graphic and Inline graphic
11:  while errInline graphic do
12:   for Inline graphic,Inline graphic do
13:    Obtain Inline graphic by computing its row via (7) with Inline graphic
14:   end for
15:   for Inline graphic,Inline graphic do
16:    for Inline graphic,Inline graphic do
17:     if Inline graphic then
18:       Compute cofactor of Inline graphic, Inline graphic
19:       if Inline graphic then
20:        Compute Inline graphic via (12)
21:       else
22:        Compute Inline graphic via (16)
23:      end if
24:     end if
25:    end for
26:   end for
27:   Compute Inline graphic
28:   Set Inline graphic and Inline graphic
29:  end while
30:  Compute Inline graphic via (S1) Inline graphic
31: end for
32: Output Inline graphic and Inline graphic.

Simulation studies and performance comparison of inference algorithms

In their simulation studies, Logsdon and Mezey [26] compared the performance of their AL-based algorithm with that of several other algorithms including the PC-algorithm [41], [42], the QDG algorithm [21], the QTLnet algorithm [35], and the NEO algorithm [22]. In two out of four simulation setups, the AL outperformed all other algorithms; and in the other two simulation setups, the AL and QDG algorithms exhibited comparable performance, but consistently outperformed the other two algorithms. Logsdon and Mezey [26] also considered other existing algorithms [25], [43], but these were deemed either computationally too demanding [43] or prohibitively complex [25]. For these reasons, the AL and QDG algorithms are regarded as state-of-the-art in the field. Their performance was compared against this paper's SML algorithm.

Following the setup of Logsdon and Mezey [26], two types of acyclic gene networks were simulated first: one with 10 genes and another with 30 genes. Specifically, a random DAG of 10 or 30 nodes with an expected Inline graphic edges per node was generated by creating directed edges between two randomly picked nodes. Care was taken to avoid any cycle in the simulated graph. If an edge from node Inline graphic to node Inline graphic was emerging, Inline graphic was generated from a random variable uniformly distributed over the interval Inline graphic or Inline graphic; otherwise, Inline graphic. The genotype per eQTL was simulated from an F2 cross. Values 1 and 3 were assigned to two homozygous genotypes, respectively, and 2 to the heterozygous genotype. Hence, Inline graphic was generated as a ternary random variable taking values Inline graphic with corresponding probabilities Inline graphic. Matrix Inline graphic was the Inline graphic identity matrix, Inline graphic was sampled from a Gaussian distribution with zero mean and variance Inline graphic, and Inline graphic was set to zero. Finally, Inline graphic was calculated from Inline graphic

For each type of gene network, 100 realizations or replicates of the network were generated, and then the SML, the AL and the QDG algorithms were run to infer the network topology. When running the SML algorithm, 10-fold CV was employed to determine the optimal values of parameters Inline graphic and Inline graphic and then use these values to infer the network. An edge from gene Inline graphic to Inline graphic was deemed present if Inline graphic. The AL algorithm also automatically ran using CV to determine the values of its parameters. For 100 replicates of the network, Inline graphic counted the total number of edges, Inline graphic denoted the total number of edges detected by the inference algorithm. Among Inline graphic detected edges, Inline graphic stands for the number of true edges presented in the simulated networks, and Inline graphic for the number of false edges. The power of detection (PD) was then found as Inline graphic, and the false discovery rate (FDR) as Inline graphic. The PD and the FDR of the SML, AL, and QDG algorithms for different sample sizes are depicted in Figure 2. It is seen from Figures 2 (a) and (c) that the PD of the SML algorithm exceeds 0.9 for both networks across all sample sizes, whereas the PD of the AL algorithm is about 0.65 for Inline graphic and 0.35 for Inline graphic. The PD of the QDG algorithm is even lower ranging from 0.22 to 0.33. As shown in Figures 2 (b) and (d), the FDR of the SML algorithm is on the order of Inline graphic for most sample sizes, and is much lower than that of the AL and QDG algorithms, which is about 0.3 for Inline graphic and over the range from 0.31 to 0.6 for Inline graphic.

Figure 2. Performance of SML, AL and QDG algorithms for directed acyclic networks of Inline graphic [(a) and (b)] or 30 [(c) and (d)] genes.

Figure 2

Expected number of nodes per node is Inline graphic. PD and FDR were obtained from 100 replicates of the network with different sample sizes (Inline graphic to 1,000).

Two types of cyclic networks were subsequently simulated: one with 10 genes and the other with 30 genes. The average number of edges per gene is again equal to 3. The same procedures used in simulating acyclic networks described earlier were employed, except that DCGs instead of DAGs were simulated. Again, 100 replicates for each type of the networks were randomly generated. The PD and the FDR of three algorithms are depicted in Figure 3. As shown in Figure 3 (a) and (c), the PD of the SML algorithm is between 0.83 and 0.9, whereas the PD of the AL algorithm is about 0.52 for Inline graphic and 0.29 for Inline graphic, and the PD of the QDG algorithm is between 0.16 and 0.28. As shown in Figures 3 (b) and (d), the FDR of the SML algorithm is Inline graphic, which is much smaller than that of the AL and QDG algorithms over the range from 0.33 to 0.68. For the convenience of comparison, the results in Figures 2 and 3 at sample size 500 are summarized in Table 1.

Figure 3. Performance of SML, AL and QDG algorithms for directed cyclic networks of Inline graphic [(a) and (b)] or 30 [(c) and (d)] genes.

Figure 3

Expected number of nodes per node is Inline graphic. PD and FDR were obtained from 100 replicates of the network with different sample sizes (Inline graphic to 1,000).

Table 1. Performance of SML, AL and QDG algorithms.

Network Inline graphic PD FDR
SML AL QDG SML AL QDG
DAG 10 0.9887 0.6564 0.3014 0.0007 0.2586 0.2991
30 0.9891 0.3544 0.3232 0.0010 0.4548 0.3403
DCG 10 0.8872 0.5330 0.2677 0.0067 0.3268 0.3783
30 0.8931 0.2941 0.2254 0.0020 0.6086 0.5047

Expected number of nodes per node is Inline graphic. PD and FDR were obtained from 100 replicates of the network with a sample size of Inline graphic.

As confirmed by Figures 2 and 3, the SML algorithm offers much better performance in terms of PD and FDR than the AL and QDG algorithms. However, these results were obtained for gene networks of small size. To test performance of the SML algorithm for networks of relatively large size, an acyclic network of 300 genes was simulated with an expected Inline graphic edge per node, and 10 replicates of the network were randomly generated. PD and FDR of the SML and AL algorithms obtained from these replicates are depicted in Figure 4. The PD of SML exceeds Inline graphic across all sample sizes from 100 to 1,000, whereas that of the AL algorithm is about 0.04 for sample sizes from 100 to 500, and gradually increases to 0.42 at the sample size of 1,000. The FDR of SML stays below Inline graphic for sample sizes from 400 to 1,000, whereas the FDR of the AL algorithm is on the order of Inline graphic for the same sample size. When the sample size is relatively small (in the range from 100 to 300), the FDR of SML is higher than that of the AL algorithm, but it is still relatively small (Inline graphic). Note that the AL algorithm essentially does not work for sample sizes Inline graphic, since its power is too small. All simulation results show that the novel SML algorithm significantly outperforms the AL and QDG algorithms in terms of PD and FDR.

Figure 4. Performance of the SML and AL algorithms for directed acyclic networks of Inline graphic genes [(a) power of detection, and (b) false discover rate].

Figure 4

Expected number of nodes per node is Inline graphic. PD and FDR were obtained from 10 replicates of the network with different sample sizes (Inline graphic to 1,000).

An extra set of simulations assessing the stability of SML is described in the section of “Stability of model selection under CV with different folds” in supporting text S1, and in Figures S1 and S2. As an alternative to CV, stability selection (STS) [44] provides a means of selecting an appropriate sparsity level to guarantee that the FDR is less than a theoretical upper bound. The STS procedure was applied to the SML algorithm as described in the supporting text S1, and was used with the selection probability cutoff Inline graphic and an upper bound or target FDR = 0.1 in simulations for the networks in Figures 2[(c) and (d)] and 3 [(c) and (d)]. As shown in Figure S3, the FDR of the STS is indeed much smaller than the target FDR and almost uniform across different sample sizes, but the PD of the STS is smaller than that of CV. In fact, the FDR of the STS is on the same order as that of the CV except at the sample size of 100 for the DAG. As seen from these simulation results, although the STS guarantees a FDR upper bound, this upper bound is loose for the simulation setups tested, which may sacrifice detection power. Nevertheless, the STS procedure can select a set of stable variables as described in [44] and verified by our simulations.

So far, all the simulated data were generated with noise variance Inline graphic. Next, the performance of SML was analyzed for simulated networks of 30 genes, when Inline graphic was increased to 0.05 and Inline graphic was changed from 3 to 1 or 5. Reducing Inline graphic from 3 to 1 improved the performance of SML for most of the sample sizes, as it is depicted in Figure 5, withstanding the increase in the noise variance. Increasing Inline graphic at constant Inline graphic, or increasing Inline graphic at constant Inline graphic degraded the performance, most notably in the later case. Comparing Figure 5 with Figures 2 and 3 [(c) and (d)] demonstrates that in both cases the SML estimates still achieve higher detection power and lower FDR than those estimates obtained with the AL algorithm for Inline graphic and Inline graphic.

Figure 5. Performance of the SML algorithms for DAGs [(a) and (b)] or DCGs [(c) and (d)] of Inline graphic genes with an expected number of nodes per node Inline graphic and error variance Inline graphic.

Figure 5

PD and FDR were obtained from 100 replicates of the network with different sample sizes (Inline graphic to 1,000).

Inference of a network of immune-related human genes

Pickrell et al. [45] used RNA-Seq technology to sequence RNA from 69 lymphoblastoid cell lines derived from unrelated Nigerian individuals extensively genotyped by the International HapMap Project [46]. For each gene, they evaluated possible associations between its gene expression level calculated from RNA-Seq reads and all 3.8 million single nucleotide polymorphisms (SNPs) using the genotypes from phases II and III of the HapMap Project. At FDR = 0.1, they identified 929 genes or putative new exons that have eQTLs within 200 kb of the gene or the exon. From these 929 genes, 39 genes that are related to immune functions were selected manually by an expert as mentioned in the Acknowledgements section; expression levels and the genotypes of the eQTLs of these 39 genes in 69 individuals were used to infer the underlying regulatory network.

Pickrell et al. normalized expression values using quantile normalization before performing eQTL mapping. They also provided a data set that contains the number of reads mapped to each of 929 genes. This data set was obtained and the number of reads for each of 39 genes was normalized with the length of the gene to yield expression value. Such kind of values may better reflect the real expression values than the values normalized with quantile normalization, and thus they were used to infer the network. To ensure the quality of the data, the SAS ROBUSTREG procedure was applied to 69 expression values of each of 39 genes to detect outliers. The default M estimation method of the ROBUSTREG procedure was employed and the outliers were detected at a significance level of 0.05. Several gene expression values were identified as outliers since they are much larger than the remaining values that were classified as non-outliers. The outliers were replaced with the largest non-outlier. More sophisticated means of revealing and imputing outliers are possible using robust statistical schemes; see e.g., [47]. The genotypes of the eQTLs of the 39 genes were downloaded from HapMap database using the SNP IDs for the eQTL provided by Pickrell et al.. About 12% genotypes are missing. These missing genotypes were imputed using the program IMPUTE2 [48]. The name and a brief description of each gene were obtained from DAVID [49] using the Ensembl gene IDs provided by Pickrell et al. Information of these 39 genes including their Ensembl gene IDs and names, a brief description of each gene, and HapMap SNP IDs of the associated eQTLs can be found in Table S1 in the supporting information.

The SML algorithm was run with the expression levels and genotypes of eQTLs of these 39 genes. An edge from gene Inline graphic to Inline graphic was detected if Inline graphic. To improve the reliability of the detected edges, the SML algorithm was run with stability selection at an FDR Inline graphic using 100 random subsamples, yielding 13 directional edges as shown in Figure 6. The frequency of each edge detected in 100 runs is given in Table S2. It is interesting to see from Figure 6 that only 9 genes are involved in the network, and the remaining 30 genes are not connected with any other genes and thus not shown in the figure. AL and QDG algorithms were also run with stability selection at an FDR Inline graphic using 100 random subsamples. The edges detected by AL and QDG algorithms and their frequencies are included in Table 4. The AL algorithm detected only one edge that was not detected by the SML algorithm. The QDG yielded 3 edges, one of which was also detected by the SML algorithm. The relatively small number of edges detected by three algorithms was likely due to relatively low signal-to-noise ratio (SNR) in this data set. The estimated noise variance was Inline graphic and the estimated SNR was Inline graphic dB, which was much lower than that (about 25.8 dB) in the case of Inline graphic in Figure 5. However, comparing the results of three algorithms shows that our SML algorithm detected more edges than the other two algorithms at the same FDR due to its higher detection power as confirmed also by the simulations. When the FDR was increased to Inline graphic, the SML algorithm with stability selection yielded a network of 16 genes that have 42 edges as shown in Figure S4 in the supporting information. Since only 39 genes were used to construct the network, an edge between two genes may not necessarily imply a direct regulatory effect, but may reflect the fact that two genes are either directly linked or very close to each other in the real network that consists of all genes. Particularly, if two genes are co-regulated by another gene which is not included in the 39 genes, these two genes may have a unidirectional or bidirectional edge.

Figure 6. The network of 39 human genes inferred from gene expression and eQTL data with the SML algorithm.

Figure 6

The 39 genes related to the immune function were chosen from [45] to have a reliable eQTL per gene. The SML algorithm was run with stability selection and edges were detected at an Inline graphic. See Table 3 for the IDs and description of 39 genes. IGH in this figure corresponds to gene ID ENSG00000211897. A Inline graphic edge stands for inhibitory effect and a Inline graphic edge stands for activating effect.

Most edges in Figure 6 are between major histocompatibility complex (MHC) genes (HLA-A, HLA-DPA1, HLA-DQA2, HLA-DQB1, HLA-DRB4 and HLA-DRB5), which is expected since these genes may interact with each other and/or be co-regulated. FCRLA is a member of Fc receptor-like family of genes. It is expressed in B cells and interacts with IgG and IgM [50], [51]. IGH, encoding the heavy chain of immunoglobulin, characterizes the B-cell origin of the samples. Hence, it is not surprising to see an edge between FCRLA and IGH. Interleukin-4-induced gene 1 (IL4I1) was first described in the mouse [52] and subsequently characterized in human B cells [53]. Human IL4I1 is expressed by antigen-presenting cells [54], which may allude to the edge between HLA-A and IL4I1, but this may be speculative since there is no edges between IL4I1 and MHC class II genes in the network. The edges between IGH and HLA-A and between IGH and HLA-DRB4 may reflect the coordinated effect of antibody and MHC as a response to antigens. In fact, IGH is connected to most of MCH genes in Figure S4, which may imply the wide coordination between the two classes of molecules.

Discussion

Integrating genetic perturbations with gene expression data for inference of gene networks not only improves inference accuracy, but also enables learning of causal regulatory relations among genes. Although much progress has been made recently on the development of inference methods that integrate both types of data, a truly efficient algorithm is missing. The SEM provides a systematic framework to integrate both types of data, and offers flexibility to model both directed cyclic as well as acyclic graphs. However, there is no systematically designed inference method for SEMs of relatively high dimension, which is particularly true for gene networks typically including hundreds or thousands of genes. Traditionally, inference for SEMs has relied on the ML or generalized least-squares methods implemented with a numerical optimization algorithm [55], [56]; but recently, Bayesian alternatives [57] have emerged too, based on Markov chain Monte Carlo simulations [58], [59]. These methods not only are computationally intensive, but also may be inaccurate for sparse SEMs of relatively high dimension, since they do not account for sparsity present in the model.

In the context of QTL mapping, Newton's method is employed in [27] to implement the ML method, while the genetic algorithm [60], [61] is used in [24], [25] to maximize the likelihood function, and in conjunction with a model selection method using a Inline graphic test or Occam's window to search for the best network topology. These methods are not scalable to SEMs of relatively high dimension. The AL-based algorithm proposed in [26] is more efficient because it automatically incorporates variable selection into the inference process, and also takes into account the sparsity present in gene networks. However, the AL-based scheme borrows the adaptive Lasso [36] optimally designed for the linear regression model instead of the SEM. In contrast, the SML algorithm proposed in this paper directly maximizes the Inline graphic-regularized likelihood function of the SEM, which fully exploits the information present in the data and therefore improves inference accuracy. Moreover, the novel block coordinate ascent method combined with discarding rules can efficiently maximize the Inline graphic-regularized likelihood function, rendering the SML algorithm applicable to SEMs of high dimension. However, unlike the AL-based algorithm, the SML algorithm maximizes a non-convex objective function as given in (3). Although the “Recovery” Theorem in [26] guarantees the identifiability of the network, the algorithm can converge to a local maximum that may not necessarily be coincident with the global maximum corresponding to the optimal network. A common technique for alleviating this problem is to use multiple random initial values. We tested multiple initial values in our simulations and observed that the algorithm converged to the same solution. In Algorithm 2, we used the pathwise coordinate optimization strategy as used in [62], where the solution of (3) obtained with Inline graphic was used as the initial point for the run with Inline graphic. The pertinence of this strategy is corroborated by simulated numerical tests, showing significant performance gains of the SML algorithm in terms of detection power and FDR when compared to the AL-based algorithm.

Comparisons in the Simulation Studies section, as summarized in Figures 25, demonstrated that the SML algorithm markedly outperforms two state-of-the-art algorithms: the AL [26] and QDG [21] algorithms. For three directed acyclic networks with number of genes Inline graphic and 300, respectively, the PD of the SML algorithm exceeds 0.9 for all sample sizes from 100 to 1,000, and is greater than 0.99 for most sample sizes. This is much greater than the PD of the AL and QDG algorithm that ranges from 0.004 to 0.67. In fact, The QDG algorithm was too time-consuming to obtain results for Inline graphic. The FDR of SML is on the order of Inline graphic for most sample sizes, which is much smaller than those of the AL and QDG algorithms, that are between 0.25 and 0.6 for Inline graphic and 30. The FDR of the AL algorithm for Inline graphic is between 0.02 and 0.1. The only case where the FDR of SML exceeds that of the AL algorithm is when Inline graphic, and the sample size Inline graphic. However, the AL algorithm essentially does not work in this case, since its PD is about 0.04. In the case of directed cyclic networks, all algorithms offer slightly degraded performance when compared to that of directed acyclic networks. However, the SML algorithm still considerably outperforms the AL and QDG algorithms.

Using a limited amount of available data [45], 39 genes related to the immune system and having one eQTL per gene were selected to infer a possible network among these genes. At an FDR Inline graphic10% for the detected edges, a network of 9 out of 39 genes containing 13 edges were obtained. An edge between two genes in the inferred network may be an indication of the direct regulator effect, or indirect interaction or co-regulation mediated by some other genes that are not among the 39 genes. The majority of the edges were reasonably expected from the experimental results in the literature, while the remaining edges may represent new interactions to be elucidated.

Structural equation modeling has a long history of about a century, with well-documented contributions to various fields including biology, psychology, econometrics and other social sciences [55], [56], [63], [64]. The model considered in this paper belongs to a class of SEMs with observed variables [55]. The SML algorithm is the first one that is systematically developed for inferring sparse SEMs with observed variables. It is expected to accelerate the application of high-dimensional SEMs not only in biology, but also in other fields.

Methods

Ridge regression

Closed-form solution

Problem (4) can be solved row by row independently in closed form. Let Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic denote the Inline graphicth row of Inline graphic, Inline graphic, Inline graphic, Inline graphic, and Inline graphic, respectively. Then, problem (4) is equivalent to the following problem

graphic file with name pcbi.1003068.e279.jpg (5)

where Inline graphic stands for the Inline graphicth element of Inline graphic and Inline graphic denotes the Inline graphicth element of Inline graphic.

The constraints in (5) can be imposed directly by discarding elements of Inline graphic and Inline graphic known to be zero. To this end, define an Inline graphic vector Inline graphic and a vector Inline graphic collecting the entries of Inline graphic whose indexes are not in Inline graphic. Let Inline graphic and Inline graphic denote the solution for Inline graphic and Inline graphic, respectively. Similarly, let Inline graphic be a sub-matrix of Inline graphic formed by removing the Inline graphicth row of Inline graphic, and Inline graphic collecting those rows of Inline graphic whose indexes are not in Inline graphic. Under these definitions, (5) is equivalent to

graphic file with name pcbi.1003068.e304.jpg (6)

Minimizing for Inline graphic first, one arrives at

graphic file with name pcbi.1003068.e306.jpg (7)

Substituting (7) into (6) after defining Inline graphic, yields

graphic file with name pcbi.1003068.e308.jpg

which is a standard ridge regression problem with solution given by

graphic file with name pcbi.1003068.e309.jpg (8)

Finally, substituting (8) into (7) yields

graphic file with name pcbi.1003068.e310.jpg (9)

Vectors Inline graphic and Inline graphic are obtained by inserting zeros into Inline graphic and Inline graphic at appropriate positions specified by the constraints in (5). Collecting Inline graphic and Inline graphic, Inline graphic, yields the solution of (4), namely Inline graphic and Inline graphic.

Parameter Inline graphic is required to solve (4). A Inline graphic-fold CV scheme is adopted for this purpose with typical choices of Inline graphic or 10, as suggested in [65]. A detailed description of the CV procedure [65] is given in supporting text S1.

Inline graphic-regularized ML method

Coordinate-ascent algorithm

Solving (3) is performed by a cyclic block-coordinate ascent iteration. Consider a specific cycle where estimates of Inline graphic and Inline graphic obtained in the previous cycle are denoted by Inline graphic and Inline graphic, respectively. The first step of the cycle entails maximizing the objective function in (3) w.r.t. Inline graphic with Inline graphic fixed to Inline graphic, which yields a new estimate of Inline graphic denoted as Inline graphic. This step coincides with the minimization of the objective function in (4) w.r.t. Inline graphic, which admits a closed-form solution per row given by (7). In each of the next Inline graphic steps of the cycle, the objective function in (3) is maximized w.r.t. a single entry of Inline graphic, namely Inline graphic, with the remaining entries of Inline graphic equal to the corresponding entries of Inline graphic and Inline graphic. An expression for the new estimate of Inline graphic, Inline graphic is derived next.

Define matrix Inline graphic having all entries equal to those of Inline graphic except for its Inline graphicth entry, which is replaced by the variable Inline graphic, where Inline graphic and Inline graphic denote the Inline graphicth and Inline graphicth canonical vectors in Inline graphic, respectively. Then, the objective in (3) can be written as

graphic file with name pcbi.1003068.e351.jpg (10)

Upon re-arranging and discarding constant terms, (10) simplifies to

graphic file with name pcbi.1003068.e352.jpg (11)

where Inline graphic denotes the Inline graphicth co-factor of matrix Inline graphic, and Inline graphic are defined as

graphic file with name pcbi.1003068.e357.jpg
graphic file with name pcbi.1003068.e358.jpg
graphic file with name pcbi.1003068.e359.jpg

with Inline graphic representing the Inline graphic entry of the matrix between brackets. For numerical stability and computational savings, all co-factors Inline graphic, per row can be computed simultaneously by solving Inline graphic with Inline graphic. After an iteration step is completed and Inline graphic is computed, Inline graphic can be updated using the matrix inversion lemma as Inline graphic before updating Inline graphic.

A new estimate of Inline graphic is formed by maximizing Inline graphic in (11). To this end, consider two cases with Inline graphic and Inline graphic. If Inline graphic, the logarithmic term can be dropped from (11) yielding a standard Lasso problem with solution

graphic file with name pcbi.1003068.e374.jpg (12)

When Inline graphic, three hypotheses are tested, namely: i) Inline graphic; ii) Inline graphic; and, iii) Inline graphic. For hypotheses i) and iii), the solution can be found in closed form after equating to zero the derivative of (11) w.r.t. Inline graphic. The roots found in both cases have to be tested against the corresponding hypothesis. Then, the surviving roots are grouped with Inline graphic as candidate solutions, and the candidate yielding the maximum Inline graphic is the new estimate Inline graphic.

Specifically, under hypothesis i) where Inline graphic, the derivative of Inline graphic in (11) takes the form Inline graphic, which upon multiplication with Inline graphic turns into

graphic file with name pcbi.1003068.e387.jpg (13)

under the definitions

graphic file with name pcbi.1003068.e388.jpg
graphic file with name pcbi.1003068.e389.jpg

Consider the equation obtained by setting (13) equal to zero. If it has root(s), then they are given by

graphic file with name pcbi.1003068.e390.jpg (14)

Let Inline graphic stand for the set containing the positive root(s) in (14). If the equation does not have a solution, Inline graphic equals the empty set.

Similarly for hypothesis iii) where Inline graphic, setting the derivative of (11) equal to zero, one obtains an equation. If this equation has root(s), they are given by

graphic file with name pcbi.1003068.e394.jpg (15)

Let Inline graphic denote the set containing the negative root(s) in (15). If the equation does not have a solution, Inline graphic becomes the empty set. Considering all three hypotheses, one arrives at

graphic file with name pcbi.1003068.e397.jpg (16)

After a cycle is completed, the algorithm is checked for convergence by verifying whether the inequality Inline graphic is satisfied, where Inline graphic is a prespecified small constant. If yes, the algorithm is stopped and Inline graphic and Inline graphic are output as the final estimates of Inline graphic and Inline graphic; otherwise, Inline graphic and Inline graphic and one proceeds to execute the next cycle.

In order to increase the speed of the SML algorithm, the discarding rules proposed for sparse linear regression [66], [67] were adapted to the sparse SEM setup. Given Inline graphic, the discarding rules provide a means of computing a matrix Inline graphic, whose entries determining entries of Inline graphic that can be set to zero a priori without be updated during the coordinate-ascent iterations. A detailed description of the discarding rules, together with the CV procedure to select the optimal Inline graphic, and the expression for the required Inline graphic, that is, the minimum value of Inline graphic for which the solution to (3) is null, are provided in the supporting text S1.

SML algorithm

The overall SML approach described in the Methods section, including the ridge regression weights, the discarding rules, and the coordinate descent cycle is depicted step-by-step in Algorithm 1. The for-loop starting from line 8 and ending at the last line is the Inline graphic-regularized ML method for computing Inline graphic and Inline graphic in (3), which comprises the block coordinate ascent algorithm and discarding rules. In our computer program, these lines were written as a subroutine. Since the CV on line 7 needs to solve (3), the subroutine is also called on line 3 with Inline graphic varying from Inline graphic to Inline graphic. An additional subroutine implementing ridge regression was written to solve (4), and subsequently called on lines 1 and 2.

In the supporting text S1, three relevant extensions to the SML algorithm are described. First, stability selection [44] is applied to the SML, as an alternative to CV, to select the sparsity level so that the FDR is controlled. Second, the SML is extended to handle heteroscedasticity in the SEM error. Third, the SML is modified to enable inference of unknown eQTLs. In addition, supporting text S1 gives a description of the state-of-the-art AL-based and QDG algorithms that were considered for comparison with SML.

Supporting Information

Text S1

Supporting text.

(PDF)

Figure S1

Performance of the SML algorithm for DAGs [(a) and (b)] or DCGs [(c) and (d)] of Inline graphic  = 30 genes obtained with 5 (solid line) or 10 (dashed line) fold cross validation. Expected number of nodes per node is Inline graphic. PD and FDR were obtained from 100 replicates of the network with different sample sizes from 100 to 1,000.

(TIF)

Figure S2

Performance of the SML algorithm for DAGs [ (a) and (b)] or DCGs [(c) and (d)] of Inline graphic  = 30 genes obtained with the optimal Inline graphic (solid line) or an Inline graphic 10% less than the optimal Inline graphic (dashed line). Expected number of nodes per node is Inline graphic. PD and FDR were obtained from 100 replicates of the network with different sample sizes from 100 to 1,000.

(TIF)

Figure S3

Performance of the SML algorithm with stability selection (STS) or cross validation for DAGs [ (a) and (b)] or DCGs [(c) and (d)] of Inline graphic genes. Expected number of nodes per node is Inline graphic. PD and FDR were obtained from 100 replicates of the network with different sample sizes from 100 to 1,000.

(TIF)

Figure S4

The network of 39 human genes inferred from gene expression and eQTL data with the SML algorithm. The 39 genes related to the immune system were chosen from [45] to have a reliable eQTL per gene. The SML algorithm was run with stability selection and edges were detected at an Inline graphic. See Table S1 for the IDs and description of 39 genes. IGH in this figure corresponds to gene ID ENSG00000211897. A Inline graphic edge stands for inhibitory effect and a Inline graphic edge stands for activating effect.

(TIF)

Table S1

Thirty nine immune-related human genes used to infer a network.

(XLSX)

Table S2

Edges of the gene network in Figure 6 inferred with the SML algorithm and edges detected with AL and QDG algorithms.

(XLSX)

Software S1

Software package implementing the SML algorithm.

(ZIP)

Acknowledgments

A preliminary version of the SML algorithm fully developed in this paper was presented at 2011 IEEE International Workshop on Genomic Signal Processing and Statistics, December 4–6, 2011, San Antonio, Texas, USA. We would like to thank Dr. Zhibin Chen in the Department of Microbiology and Immunology at the University of Miami for selecting genes used in the inference of the human gene network and for his help with interpreting the inferred network. We would also thank Anhui Huang at the University of Miami for his help with imputing the missing genotypes for the data used in the inference of the human gene network.

Funding Statement

This work was supported by National Science Foundation grant CCF-0746882 and National Institute of Health grant R01GM104975-01 to XC. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae . Science 298: 799–804. [DOI] [PubMed] [Google Scholar]
  • 2. Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS (2000) Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc Natl Acad Sci USA 97: 12182–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Basso K, Margolin AA, Stolovitzky G, Klein U, Dalla-Favera R, et al. (2005) Reverse engineering of regulatory networks in human B cells. Nat Genet 37: 382–90. [DOI] [PubMed] [Google Scholar]
  • 4. Dobra A, Hans C, Jones B, Nevins JR, Yao G, et al. (2004) Sparse graphical models for exploring gene expression data. J Multivar Anal 90: 196–212. [Google Scholar]
  • 5. Schäfer J, Strimmer K (2005) An empirical Bayes approach to inferring large-scale gene association networks. Bioinform 21: 754–764. [DOI] [PubMed] [Google Scholar]
  • 6. Friedman N, Linial M, Nachman I, Pe'er D (2000) Using Bayesian network to analyze expression data. J Comput Biol 7: 601–620. [DOI] [PubMed] [Google Scholar]
  • 7. Segal E, Shapira M, Regev A, Pe'er D, Botstein D, et al. (2003) Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet 34: 166–178. [DOI] [PubMed] [Google Scholar]
  • 8. Gardner TS, di Bernardo D, Lorenz D, Collins JJ (2003) Inferring genetic networks and identifying compound mode of action via expression profiling. Science 301: 102–105. [DOI] [PubMed] [Google Scholar]
  • 9. di Bernardo D, Thompson MJ, Gardner TS, Chobot SE, Eastwood EL, et al. (2005) Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nat Biotechnol 23: 377–383. [DOI] [PubMed] [Google Scholar]
  • 10. Schäfer J, Strimmer K (2005) A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol 4: article 32. [DOI] [PubMed] [Google Scholar]
  • 11. Bonneau R, Reiss D, Shannon P, Facciotti M, Hood L, et al. (2006) The inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo . Genome Biol 7: R36. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Sima C, Hua J, Jung S (2009) Inference of gene regulatory networks using time-series data: a survey. Curr Genomics 10: 416–429. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Penfold CA, Wild DL (2011) How to infer gene networks from expression profiles, revisited. Interface Focus 1: 857–870. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Yip KY, Alexander RP, Yan KK, Gerstein M (2010) Improved reconstruction of in silico gene regulatory networks by integrating knockout and perturbation data. PLoS ONE 5: e8121. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Rockman MV (2009) Reverse engineering the genotype-phenotype map with natural genetic variation. Nature 456: 738–744. [DOI] [PubMed] [Google Scholar]
  • 16. Zhu J, Lum PY, Lamb J, GuhaThakurta D, Edwardsa S, et al. (2004) An integrative genomics approach to the reconstruction of gene networks in segregating populations. Cytogenet Genome Res 105: 363–374. [DOI] [PubMed] [Google Scholar]
  • 17. Zhu J, Wiener MC, Zhang C, Fridman A, Minch E, et al. (2007) Increasing the power to detect causal associations by combining genotypic and expression data in segregating populations. PLoS Comput Biol 3: e69. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Zhu J, Zhang B, Smith EN, Drees B, Brem RB, et al. (2008) Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks. Nat Genet 40: 854–61. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Kulp DC, Jagalur M (2006) Causal inference of regulator-target pairs by gene mapping of expression phenotypes. BMC Genet 7: 125. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Chen LS, Emmert-Streib F, Storey JD (2007) Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biol 8: R219. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Neto EC, Ferrara CT, Attie AD, Yandell BS (2008) Inferring causal phenotype networks from segregating populations. Genetics 179: 1089–1100. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Aten JE, Fuller TF, Lusis AJ, Horvath S (2008) Using genetic markers to orient the edges in quantitative trait networks: The NEO software. BMC Syst Biol 2: 34. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Millstein J, Zhang B, Zhu J, Schadt EE (2009) Disentangling molecular relationships with a causal inference test. BMC Genet 10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Xiong M, Li J, Fang X (2004) Identification of genetic networks. Genetics 166: 1037–1052. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Liu B, de la Fuente A, Hoeschele I (2008) Gene network inference via structural equation modeling in genetical genomics experiments. Genetics 178: 1763–1776. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Logsdon BA, Mezey J (2010) Gene expression network reconstruction by convex feature selection when incorporating genetic perturbations. PLoS Comput Biol 6: e1001014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Mi XJ, Eskridge K, Wang D (2010) Regression-based multi-trait QTL mapping using a structural equation model. Stat Appl Genet Mol Biol 9: 38. [DOI] [PubMed] [Google Scholar]
  • 28. Gianola D, Sorensen D (2004) Quantitative genetic models for describing simultaneous and recursive relationships between phenotypes. Genetics 167: 1407–1424. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. de los Campos G, Gianola D, Heringstad B (2006) A structural equation model for describing relationships between somatic cell score and milk yield in first-lactation dairy cows. J Dairy Sci 89: 4445–4455. [DOI] [PubMed] [Google Scholar]
  • 30. Wu XL, Heringstad B, Chang YM (2007) Inferring relationships between somatic cell score and milk yield using simultaneous and recursive models. J Dairy Sci 90: 3508–3521. [DOI] [PubMed] [Google Scholar]
  • 31. Jamrozik J, Bohmanova J, Schaeffer LR (2010) Relationships between milk yield and somatic cell score in Canadian Holsteins from simultaneous and recursive random regression models. J Dairy Sci 93: 1216–1233. [DOI] [PubMed] [Google Scholar]
  • 32. Valente BD, Rosa GJM, de los Campos G (2010) Searching for recursive causal structures in multivariate quantitative genetics mixed models. Genetics 185: 633–644. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Wu XL, Heringstad B, Gianola D (2010) Bayesian structural equation models for inferring relationships between phenotypes: a review of methodology, identifiability, and applications. J Anim Breed Genet 127: 3–15. [DOI] [PubMed] [Google Scholar]
  • 34. Rosa GJM, Valente BD, de los Campos G (2011) Inferring causal phenotype networks using structural equation models. Genet Sel Evol 43: 6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Neto EC, Keller MP, Attie AD, Yandell BS (2010) Causal graphical models in systems genetics: A unified framework for joint inference of causal network and genetic architecture for correlated phenotypes. Ann Appl Stat 4: 320–339. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Zou H (2006) The adaptive Lasso and its oracle properties. J Amer Stat Assoc 101: 1418–1429. [Google Scholar]
  • 37. Tegner J, Yeung MK, Hasty J, Collins JJ (2003) Reverse engineering gene networks: integrating genetic perturbations with dynamical modeling. Proc Natl Acad Sci USA 100: 5944–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Jeong H, Mason SP, Barabássi AL, Oltvai ZN (2001) Lethality and centrality in protein networks. Nature 411: 41–42. [DOI] [PubMed] [Google Scholar]
  • 39. Thieffry D, Huerta AM, Pérez-Rueda E, Collado-Vides J (1998) From specific gene regulation to genomic networks: a global analysis of transcriptional regulation in Escherichia coli . Bioessays 20: 433–440. [DOI] [PubMed] [Google Scholar]
  • 40. Tibshirani R (1996) Regression shrinkage and selection via the Lasso. J R Statistical Soc Ser B 58: 267–288. [Google Scholar]
  • 41.Spirtes P, Glymour C, Scheines R (2000) Causation, Prediction, and Search, 2nd edition. Cambridge, MA: MIT Press.
  • 42. Kalisch M, Bühlmann P (2007) Estimating high-dimensional directed acyclic graphs with the PCalgorithm. J Mach Learn Res 8: 613–636. [Google Scholar]
  • 43. Li R, Tsaih SW, Shockley K (2006) Structural model analysis of multiple quantitative traits. PLoS Genet 2: e114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Meinshausen N, Bhlmann P (2010) Stability selection. J R Statist Soc B 72: 417–473. [Google Scholar]
  • 45. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, et al. (2010) Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464: 768–772. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851–861. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Giannakis G, Mateos G, Farahmand S, Kekatos V, Zhu H (2011) Uspacor: Universal sparsitycontrolling outlier rejection. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 1952–1955.
  • 48. Howie BN, Donnelly P, Marchini J (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 5: e1000529. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Huang DW, Sherman BT, Lempicki RA (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4: 44–57. [DOI] [PubMed] [Google Scholar]
  • 50. Santiago T, Kulemzin SV, Reshetnikova ES, Chikaev NA, Volkova OY, et al. (2011) Fcrla is a resident endoplasmic reticulum protein that associates with intracellular igs, igm, igg and iga. Int Immunol 23: 43–53. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Wilson TJ, Gilfillan S, Colonna M (2010) Fc receptor-like a associates with intracellular igg and igm but is dispensable for antigen-specific immune responses. J Immunol 185: 2960–2967. [DOI] [PubMed] [Google Scholar]
  • 52. Chu CC, Paul WE (1997) An interleukin 4-induced mouse B cell gene isolated by cDNA representational difference analysis. Proc Natl Acad Sci USA 94: 2507–2512. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Chavana SS, Tiana W, Hsueha K, Jawaheerd D, Gregersend PK, et al. (2002) Characterization of the humanhomolog of the IL-4 induced gene-1. Proc Natl Acad Sci USA 1576: 7080. [Google Scholar]
  • 54. Boulland ML, Marquet J, Molinier-Frenkel V, Mller P, Guiter C, et al. (2007) Human IL4I1 is a secreted l-phenylalanine oxidase expressed by mature dendritic cells that inhibits T-lymphocyte proliferation. Blood 110: 220–227. [DOI] [PubMed] [Google Scholar]
  • 55.Bollen KA (1989) Structural Equations with Latent Variables. Wiley-Interscience.
  • 56.Kaplan D (2009) Structural Equation Modeling: Foundations and Extensions, 2nd edition. Sage Publications.
  • 57.Lee SY (2007) Structural Equation Modeling: A Bayesian Approach. Wiley.
  • 58.Robert CP, Casella G (2004) Monte Carlo statistical method, 2nd edition. Springer.
  • 59.Carlin BP, Louis TA (2008) Bayesian Methods for Data Analysis. 3rd edition. Chapman and Hall/CRC.
  • 60.Holland JH (1972) Adaptation in Natural and Artificial Systems. Ann Arbor, MI: University of Michigan Press.
  • 61.Goldberg DE (1989) Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley.
  • 62. Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33: 1–22. [PMC free article] [PubMed] [Google Scholar]
  • 63.Shipley B (2002) Cause and Correlation in Biology: A User's Guide to Path Analysis, Structural Equations and Causal Inference. Cambridge University Press.
  • 64.Pearl J (2009) Causality: Models, Reasoning, and Inference. 2nd edition. Cambridge University Press.
  • 65.Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edition. New York: Springer.
  • 66.El Ghaoui L, Viallon V, Rabbani T (2010) Safe feature elimination in sparse supervised learning. Technical Report UC/EECS-2010-126, EECS Dept., University of California at Berkeley.
  • 67. Tibshirani R, Bien J, Friedman J, Hastie T, Simon N, et al. (2012) Strong rules for discarding predictors in lasso-type problems. J R Statist Soc B 74: 245266. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Text S1

Supporting text.

(PDF)

Figure S1

Performance of the SML algorithm for DAGs [(a) and (b)] or DCGs [(c) and (d)] of Inline graphic  = 30 genes obtained with 5 (solid line) or 10 (dashed line) fold cross validation. Expected number of nodes per node is Inline graphic. PD and FDR were obtained from 100 replicates of the network with different sample sizes from 100 to 1,000.

(TIF)

Figure S2

Performance of the SML algorithm for DAGs [ (a) and (b)] or DCGs [(c) and (d)] of Inline graphic  = 30 genes obtained with the optimal Inline graphic (solid line) or an Inline graphic 10% less than the optimal Inline graphic (dashed line). Expected number of nodes per node is Inline graphic. PD and FDR were obtained from 100 replicates of the network with different sample sizes from 100 to 1,000.

(TIF)

Figure S3

Performance of the SML algorithm with stability selection (STS) or cross validation for DAGs [ (a) and (b)] or DCGs [(c) and (d)] of Inline graphic genes. Expected number of nodes per node is Inline graphic. PD and FDR were obtained from 100 replicates of the network with different sample sizes from 100 to 1,000.

(TIF)

Figure S4

The network of 39 human genes inferred from gene expression and eQTL data with the SML algorithm. The 39 genes related to the immune system were chosen from [45] to have a reliable eQTL per gene. The SML algorithm was run with stability selection and edges were detected at an Inline graphic. See Table S1 for the IDs and description of 39 genes. IGH in this figure corresponds to gene ID ENSG00000211897. A Inline graphic edge stands for inhibitory effect and a Inline graphic edge stands for activating effect.

(TIF)

Table S1

Thirty nine immune-related human genes used to infer a network.

(XLSX)

Table S2

Edges of the gene network in Figure 6 inferred with the SML algorithm and edges detected with AL and QDG algorithms.

(XLSX)

Software S1

Software package implementing the SML algorithm.

(ZIP)


Articles from PLoS Computational Biology are provided here courtesy of PLOS

RESOURCES