Biostatistics (Oxford, England). 2015 Apr 1;16(4):670–685. doi: 10.1093/biostatistics/kxv013

Statistical completion of a partially identified graph with applications for the estimation of gene regulatory networks

Donghyeon Yu 1, Won Son 2, Johan Lim 2, Guanghua Xiao 3,*
PMCID: PMC4570579  PMID: 25837438

Abstract

We study the estimation of a Gaussian graphical model whose dependent structures are partially identified. In a Gaussian graphical model, an off-diagonal zero entry in the concentration matrix (the inverse covariance matrix) implies the conditional independence of two corresponding variables, given all other variables. A number of methods have been proposed to estimate a sparse large-scale Gaussian graphical model or, equivalently, a sparse large-scale concentration matrix. In practice, the graph structure to be estimated is often partially identified by other sources or a pre-screening. In this paper, we propose a simple modification of existing methods to take into account this information in the estimation. We show that the partially identified dependent structure reduces the error in estimating the dependent structure. We apply the proposed method to estimating the gene regulatory network from lung cancer data, where protein–protein interactions are partially identified from the human protein reference database. The application shows that the proposed method identified many important cancer genes as hub genes in the constructed lung cancer network. In addition, we validated the prognostic importance of a newly identified cancer gene, PTPN13, in four independent lung cancer datasets. The results indicate that the proposed method could facilitate studying underlying lung cancer mechanisms and identifying reliable biomarkers for lung cancer prognosis.

Keywords: Concentration matrix, Gaussian graphical models, Gene regulatory network, Lung cancer, Partially identified graph, Protein–protein interaction

1. Introduction

In recent years, statistical approaches have been developed to construct gene regulatory networks (GRNs) from mRNA expression data. A GRN describes the interactions among genes and how the genes work together to form modules of cell functions under specific contexts, such as disease status. It provides a systematic understanding of the molecular mechanisms underlying the biological processes (Friedman, 2004). In GRNs, highly connected genes are called hub genes. Because the hub genes are in key positions, their activities may affect many genes in the network and hence play an important role in biological processes. Recently, analysis of hub genes has been shown to be a promising approach in identifying key disease driver genes (Akavia and others, 2010) and important biomarkers for predicting disease progression (Taylor and others, 2009; Tang and others, 2013). Currently, most of the existing computational methods use purely data-driven approaches to construct gene regulatory networks from gene expression data. These approaches do not rely on any prior knowledge about the network and are widely suited to many applications. However, for gene regulatory networks, information about many known connections (edges) between genes has been accumulated over decades of biological research, such as protein–protein interactions or transcription factor-binding sites. Using these known edges, we can turn a network construction problem into the problem of statistically completing a partially identified graph, which could lead to much better power in identifying the unknown edges. In this paper, we propose a statistical completion of a partially identified graph (SCPG) method, which is a modification of existing methods to incorporate the information about known edges. We show that the information on known edges reduces the error in identifying the unknown edges and improves the accuracy of the constructed networks.

Consider a $p$-dimensional random vector from a multivariate Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$, or the concentration matrix $\Omega = \Sigma^{-1}$. In the Gaussian graphical model, the dependent structure among $p$ variables $X_1, \ldots, X_p$ can be expressed using a graph $G = (V, E)$, where a vertex set $V$ represents $p$ variables and an edge set $E$ represents pairs of random variables that are dependent on each other, given all other variables. In other words, if $\omega_{ij}$, the $(i,j)$th element of $\Omega$, is equal to $0$, then $X_i$ and $X_j$ are conditionally independent given all other variables $X_k$, $k \neq i, j$.
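The correspondence between zero entries of the concentration matrix and missing edges can be illustrated with a minimal sketch. The function name `edges_from_concentration`, the tolerance `tol`, and the 4-variable chain matrix are illustrative assumptions, not part of the paper.

```python
from itertools import combinations


def edges_from_concentration(omega, tol=1e-12):
    """Return the edge set {(i, j): i < j} implied by a concentration matrix.

    An edge (i, j) is present exactly when the off-diagonal entry
    omega[i][j] is non-zero (up to the numerical tolerance `tol`).
    """
    p = len(omega)
    return {(i, j) for i, j in combinations(range(p), 2) if abs(omega[i][j]) > tol}


# Hypothetical 4-variable chain: only neighbouring variables interact directly.
omega = [
    [1.0, -0.4, 0.0, 0.0],
    [-0.4, 1.2, -0.4, 0.0],
    [0.0, -0.4, 1.2, -0.4],
    [0.0, 0.0, -0.4, 1.0],
]
chain_edges = edges_from_concentration(omega)
```

In this toy matrix, variables 0 and 2 have a zero off-diagonal entry, so no edge joins them even though they are marginally correlated through variable 1.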

Covariance selection, first introduced by Dempster (1972), is a class of problems that estimate dependent structures among multivariate Gaussian variables by detecting non-zero elements of the concentration matrix $\Omega$. Recently, researchers revisited the problem and studied the estimation of large-scale concentration matrices from a small number of observations. The $\ell_1$-regularization on a concentration matrix or a partial correlation matrix is popularly used to obtain a sparse estimate of the dependent structure. Yuan and Lin (2007) propose the $\ell_1$-regularized maximum likelihood estimator (MLE) of the concentration matrix and an algorithm to solve it using the determinant maximization problem (MAXDET), which has computational complexity of order Inline graphic and becomes slower as $p$ increases. Friedman and others (2007) propose a block coordinate descent procedure to solve the $\ell_1$-regularized MLE, namely the graphical lasso. Meinshausen and Bühlmann (2006) formulate the covariance selection problem as a set of lasso regression problems and solve each of the lasso regression problems independently. Peng and others (2009) propose the sparse partial correlation estimation (SPACE) method, which solves the set of lasso regression problems jointly under the symmetry of the concentration matrix $\Omega$. This symmetry is not guaranteed by the method of Meinshausen and Bühlmann (2006). More recently, Cai and others (2011) propose the constrained $\ell_1$-minimization for inverse matrix estimation (CLIME) that directly minimizes the $\ell_1$-norm of the concentration matrix with a relaxed constraint for the condition $\Sigma\Omega = I$.

In this paper, we consider a simple modification of existing methods that incorporates “partially identified” dependent structures. The dependent structure to be estimated is often partially identified in practice. We also denote partially identified structures as “pre-identified” to emphasize that these structures are previously known. For example, GRNs and protein–protein interaction (PPI) networks are available in public databases that were constructed from many previous laboratory experiments. In comparison, the pre-screening procedures recently proposed by many authors identify pairs of variables that are conditionally independent (Bair and others, 2006; Wasserman and Roeder, 2009). The existing methods do not take into account these partially identified structures. They frequently estimate a known dependence as independence, or vice versa, because small samples carry limited information.

The modification is done by simply redefining the existing $\ell_1$-regularization. To be specific, $\omega_{ij}$ and $\omega_{ji}$ are not penalized in the objective function or its constraints if $X_i$ and $X_j$ are pre-identified as conditionally dependent. The modification can be applied to the $\ell_1$-regularized MLE, the $\ell_1$-minimization and the regression-based methods. However, in this paper, we restrict our discussion to the modification of the regression-based method, i.e., the SPACE method.

The paper is organized as follows. In Section 2, we briefly introduce the SPACE method and propose the SCPG method as a modification to incorporate pre-identified dependent structures. In Section 3, we analytically show that the SCPG method reduces the asymptotic probability of mistakenly identifying an independent pair of variables as dependent. In Section 4, we numerically investigate the gains in accuracy by estimating the network from assuming the pre-identified graph structure. In Section 5, we apply the SCPG method to estimating the gene regulatory network from lung cancer data. We conclude the paper in Section 6.

2. Modification for partially identified graph

Suppose $X = (X_1, \ldots, X_p)^{\mathrm{T}}$ is a $p$-dimensional random vector with mean $0$ and positive definite covariance matrix $\Sigma$. The partial correlation between $X_i$ and $X_j$, denoted by $\rho^{ij}$, is the conditional correlation of $X_i$ and $X_j$ given $\{X_k : k \neq i, j\}$. This partial correlation is closely related to the concentration matrix $\Omega = \Sigma^{-1} = (\omega_{ij})$. It is known that $\rho^{ij} = -\omega_{ij}/\sqrt{\omega_{ii}\omega_{jj}}$. Also, for every $i = 1, \ldots, p$,

$$X_i = \sum_{j \neq i} \beta_{ij} X_j + \epsilon_i, \tag{2.1}$$

where $\beta_{ij} = -\omega_{ij}/\omega_{ii} = \rho^{ij}\sqrt{\omega_{jj}/\omega_{ii}}$, and $\epsilon_i$ is uncorrelated with $\{X_j : j \neq i\}$ and has mean 0 and variance $1/\omega_{ii}$.
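As a small numerical illustration of how partial correlations and the regression coefficients in (2.1) relate to the concentration matrix, the sketch below uses the standard identities (partial correlation equals the negated off-diagonal entry scaled by the two diagonal entries; the regression coefficient equals the partial correlation times the square root of the ratio of diagonal entries). The function names and the 3×3 matrix are hypothetical.

```python
import math


def partial_correlation(omega, i, j):
    """Partial correlation of variables i and j given all the others."""
    return -omega[i][j] / math.sqrt(omega[i][i] * omega[j][j])


def regression_coefficient(omega, i, j):
    """Coefficient of variable j when variable i is regressed on all others."""
    rho = partial_correlation(omega, i, j)
    return rho * math.sqrt(omega[j][j] / omega[i][i])


# Hypothetical positive definite concentration matrix for three variables.
omega = [
    [2.0, -0.5, 0.0],
    [-0.5, 2.0, -1.0],
    [0.0, -1.0, 4.0],
]
rho_01 = partial_correlation(omega, 0, 1)      # 0.5 / sqrt(2 * 2)
beta_12 = regression_coefficient(omega, 1, 2)  # equals -omega[1][2] / omega[1][1]
beta_21 = regression_coefficient(omega, 2, 1)  # equals -omega[2][1] / omega[2][2]
```

The partial correlation is symmetric in its two indices, whereas the regression coefficients are not: here `beta_12` and `beta_21` differ because the two diagonal entries differ.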

The identity (2.1) introduces a regression-based method to estimate $\Omega$ or $\rho^{ij}$. Let $X^k = (X_1^k, \ldots, X_p^k)^{\mathrm{T}}$ be the $k$th observation of the random vector $X$ for $k = 1, \ldots, n$. Meinshausen and Bühlmann (2006) propose the neighborhood selection method to solve a set of lasso regression problems with respect to $\beta_{ij}$s; that is, for $i = 1, \ldots, p$,

$$\min_{\beta_{ij},\, j \neq i}\ \frac{1}{2} \sum_{k=1}^{n} \Big( X_i^k - \sum_{j \neq i} \beta_{ij} X_j^k \Big)^2 + \lambda \sum_{j \neq i} |\beta_{ij}|. \tag{2.2}$$

Later, Peng and others (2009) propose the SPACE method, which minimizes the weighted sum of $p$ squared loss functions in (2.2) with a penalty term on the $\ell_1$-norm of the partial correlation $\rho^{ij}$s:

$$\min\ \frac{1}{2} \sum_{i=1}^{p} w_i \sum_{k=1}^{n} \Big( X_i^k - \sum_{j \neq i} \rho^{ij} \sqrt{\omega_{jj}/\omega_{ii}}\, X_j^k \Big)^2 + \lambda \sum_{1 \le i < j \le p} |\rho^{ij}|, \tag{2.3}$$

where $\omega_{ii}$ is the $i$th diagonal element of the concentration matrix and $w_i$ is a nonnegative weight for the $i$th squared loss function.

The SPACE method has several advantages over the neighborhood selection method by Meinshausen and Bühlmann (2006) in estimating the concentration matrix and the graph structure. First, the SPACE method estimates the partial correlations and the diagonal elements of the concentration matrix. Thus, the estimate of the concentration matrix can easily be calculated by the relationship $\omega_{ij} = -\rho^{ij}\sqrt{\omega_{ii}\omega_{jj}}$ in the SPACE method, while the neighborhood selection method only obtains information about whether or not each off-diagonal element of the concentration matrix is zero. Second, the estimated edges from the SPACE method are symmetrical in the sense that $\hat{\rho}^{ij} = \hat{\rho}^{ji}$; thus, if $\hat{\rho}^{ij} \neq 0$ (or $\hat{\rho}^{ij} = 0$), then $\hat{\rho}^{ji} \neq 0$ (or $\hat{\rho}^{ji} = 0$). In contrast, the neighborhood selection method separately solves $p$ problems in (2.2) and may obtain contradictory edges (i.e., $\hat{\beta}_{ij} \neq 0$ and $\hat{\beta}_{ji} = 0$). Finally, the SPACE method outperforms the neighborhood selection method in estimating graph structure and finding hubs in practice. This comparison study is reported in Peng and others (2009).

We now introduce the SCPG method as a modification of the SPACE method to take into account the pre-identified graph structure. The same modification can be applied to the $\ell_1$-regularized MLEs and the $\ell_1$-minimization, but those examples are omitted here. We consider the concentration matrix $\Omega$ and its induced graph $G = (V, E)$. Let $E_0 \subset E$ be a set of pre-identified edges in $G$. In this paper, we propose to solve

$$\min\ \frac{1}{2} \sum_{i=1}^{p} w_i \sum_{k=1}^{n} \Big( X_i^k - \sum_{j \neq i} \rho^{ij} \sqrt{\omega_{jj}/\omega_{ii}}\, X_j^k \Big)^2 + \lambda \sum_{i < j,\ (i,j) \notin E_0} |\rho^{ij}|. \tag{2.4}$$

The modification in (2.4) only removes the penalties on the partial correlations corresponding to the pre-identified edges from (2.3). Thus, we can directly apply the active shooting algorithm, proposed by Peng and others (2009), to solve the modified problem. To be specific, we first rewrite the main problem (2.4) using matrix notation. The problem (2.4) without symmetry constraints becomes

2.

where Inline graphic for Inline graphic.

Let Inline graphic denote a set of pairs such that Inline graphic (i.e., Inline graphic) and Inline graphic denote a set of edges such that Inline graphic and Inline graphic (i.e., Inline graphic). Let Inline graphic be a Inline graphic-dimensional vector of Inline graphics for Inline graphic; let Inline graphic be a Inline graphic-dimensional vector of Inline graphics for Inline graphic; and let Inline graphic be a Inline graphic-dimensional vector of Inline graphics for Inline graphic. For Inline graphic, let

2.

We define a covariate matrix Inline graphic of Inline graphic, for Inline graphic, as a matrix with a size of Inline graphic and, if Inline graphic, its Inline graphicth column vector Inline graphic is defined as

2. (2.5)

where Inline graphic for Inline graphic.

The covariate matrices Inline graphic and Inline graphic with sizes of Inline graphic and Inline graphic, respectively, are defined similarly for coefficient vectors Inline graphic and Inline graphic. The whole group of covariate matrices Inline graphic and Inline graphic are then defined as

2.

The first part of the objective function in (2.3) is read as the least square error of the linear model

2. (2.6)

where Inline graphic is a Inline graphic-dimensional vector, Inline graphic is a design matrix with a size of Inline graphic, Inline graphic is from the Inline graphic-dimensional multivariate normal distribution with mean Inline graphic and covariance matrix Inline graphic, where Inline graphic and an operator Inline graphic denotes the Kronecker product. Thus, we can represent the problem (2.4) as

2.

Note that we set the weights $w_i$s to one in this paper since we do not assume any strengths for nodes; we assume only that we have partial information about true edges. If there is information about strength or importance for specific nodes, that information can be incorporated by changing the weights.
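The effect of removing the penalty on pre-identified coefficients can be seen in a deliberately simplified setting: a single lasso regression solved by plain cyclic coordinate descent ("shooting"), where coefficients in a known set are left unpenalized. This is only a sketch under that simplification; it is not the joint SPACE/SCPG loss, nor the active shooting algorithm of the Supplementary material, and the names `lasso_known_support` and `known` are hypothetical.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_known_support(X, y, lam, known, n_sweeps=200):
    """Minimise 0.5 * ||y - X b||^2 + lam * sum_{j not in known} |b_j|.

    Cyclic coordinate descent; coefficients whose index is in `known`
    receive no penalty, mirroring the treatment of pre-identified edges.
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # residual y - X b, starting from b = 0
    col_ss = [sum(X[k][j] ** 2 for k in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue  # an all-zero column carries no information
            # Inner product of column j with the residual that excludes b_j.
            z = sum(X[k][j] * resid[k] for k in range(n)) + col_ss[j] * b[j]
            threshold = 0.0 if j in known else lam
            new_bj = soft_threshold(z, threshold) / col_ss[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for k in range(n):
                    resid[k] -= X[k][j] * delta
                b[j] = new_bj
    return b


# Toy orthogonal design: both signals (0.5) lie below the threshold lam = 1.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [0.5, 0.5]
b_plain = lasso_known_support(X, y, lam=1.0, known=set())
b_known = lasso_known_support(X, y, lam=1.0, known={1})
```

In the toy example the fully penalized fit shrinks both coefficients to zero, while declaring coefficient 1 as known keeps it at its least-squares value; the penalized coefficient is still shrunk.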

Now we briefly describe the proposed algorithm, which depends largely on the algorithm in Peng and others (2009). We first set initial values Inline graphic for Inline graphic. The proposed algorithm alternately updates the estimates Inline graphics and Inline graphics by the following steps:

  • Step 1: For a given Inline graphic for Inline graphic,
    graphic file with name M155.gif
    where Inline graphic is defined by (2.5) and (2.6) with Inline graphic for Inline graphic.
  • Step 2: Based on the identity (2.1), for a given Inline graphic and Inline graphic for Inline graphic,
    graphic file with name M162.gif
    where Inline graphic is the estimate of Inline graphic from the previous iteration.
  • Step 3: Repeat Steps 1 and 2 until the convergence occurs.

In Step 1, we apply the modified active shooting algorithm to incorporate the pre-identified edges Inline graphic. Details on the active shooting algorithm we propose are given in Appendix A of Supplementary material available at Biostatistics online.
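The update formulas for the steps above are not reproduced in this text, so the following is only a sketch of what a Step 2 update could look like, assuming it takes the residual-variance form used by the SPACE algorithm of Peng and others (2009): since the error term in (2.1) has variance equal to the reciprocal of the corresponding diagonal element, each diagonal element is re-estimated as the reciprocal of the mean squared residual. The function name `update_diagonal` and the toy data are hypothetical.

```python
import math


def update_diagonal(X, rho, omega_diag):
    """One residual-variance update of the diagonal of the concentration matrix.

    X          : list of n observations, each a list of p centred values
    rho        : p x p symmetric matrix of current partial correlation estimates
    omega_diag : current estimates of the p diagonal elements
    Returns n / RSS_i for every variable i (assumes every RSS_i is positive).
    """
    n, p = len(X), len(X[0])
    new_diag = []
    for i in range(p):
        rss = 0.0
        for row in X:
            # Fitted value from identity (2.1) with the current estimates.
            fitted = sum(
                rho[i][j] * math.sqrt(omega_diag[j] / omega_diag[i]) * row[j]
                for j in range(p)
                if j != i
            )
            rss += (row[i] - fitted) ** 2
        new_diag.append(n / rss)
    return new_diag


# Toy data: two uncorrelated, centred variables with mean squares 1 and 4.
X = [[1.0, 2.0], [-1.0, -2.0], [1.0, -2.0], [-1.0, 2.0]]
rho = [[0.0, 0.0], [0.0, 0.0]]
diag = update_diagonal(X, rho, omega_diag=[1.0, 1.0])
```

With all partial correlations at zero, the update reduces to the reciprocal of each variable's mean square. In a full implementation this step would alternate with the penalized update of the partial correlations until the estimates stabilize.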

3. Asymptotic error probabilities

In this section, we analytically find the changes in asymptotic error probability by using the information on pre-identified edges. As shown in Section 2, our main problem can be rewritten as the estimation of the sparse linear model (lasso regression), and we are able to compute the asymptotic true negative/positive probabilities of the model both with and without the information on dependent pairs of variables. The computation shows that the pre-identified dependent information asymptotically increases the true negative probability (the probability of identifying independent pairs as independent) while the true positive probability (the probability of identifying dependent pairs as dependent) of both methods converges to 1. Thus, the SCPG method reduces the error probability asymptotically. Our analysis in this section relies heavily on the results of Knight and Fu (2000) and Anderson (1955), which are reviewed in Appendix B of Supplementary material available at Biostatistics online.

For the asymptotic true negative probability, we consider the simplified model

3. (3.1)

where Inline graphic and Inline graphic. Under this model, we compare the true negative probabilities of the SCPG and SPACE methods.

Theorem 3.1 —

Suppose we have knowledge about Inline graphic. Let Inline graphic and Inline graphic be the solutions of the SPACE and SCPG methods, respectively. Then, the asymptotic true negative probabilities Inline graphic and Inline graphic satisfy the inequality

graphic file with name M174.gif

Proof —

See Appendix C.1 of Supplementary material available at Biostatistics online.

We next compare the asymptotic true positive probabilities of the SCPG and SPACE methods. Here, we consider the simplified model

3.

where both Inline graphic and Inline graphic. Suppose we have knowledge on Inline graphic. Let Inline graphic and Inline graphic be the solutions of the SPACE and SCPG methods, respectively. This section shows the following theorem:

Theorem 3.2 —

The asymptotic true positive probabilities of both the SPACE and SCPG models (which are Inline graphic and Inline graphic, respectively) converge to one as Inline graphic.

Proof —

See Appendix C.2 of Supplementary material available at Biostatistics online.

In comparing the asymptotic error probabilities between the SPACE and SCPG models, we show that the SCPG model asymptotically improves the true negative probability while matching the SPACE model in true positive probability. Note that there is a difference between the asymptotic biases for Inline graphic and Inline graphic. However, a direct comparison of these asymptotic biases is difficult since the difference varies with the signs and structures of partial correlations in the model.

4. Numerical study

In the previous section, we showed that the SCPG method improves the asymptotic true negative probability and has the same performance for the asymptotic true positive probability in estimating graph structure for a given $\lambda$. This is mainly due to the bias reduction by using the prior information about partially identified edges. In practice, however, we generally encounter datasets with finite samples and choose a tuning parameter that minimizes an information criterion, such as the Bayesian information criterion (BIC) (Schwarz, 1978). Thus, we additionally investigate the performance of the SCPG method with finite samples for several graph structures and also compare the SCPG method with the SPACE method to confirm the improvements of the SCPG method in estimating a graph structure.

We first consider the Gaussian graphical model accompanied by the following AR(1), AR(2), hub and scale-free networks in simulation. The AR(1) and the AR(2) networks are from the time series model and the hub and scale-free networks reflect real biological networks. These four networks are illustrated in Figure 1. The details of the four networks, including how they are generated, are given in Appendix D of Supplementary material available at Biostatistics online.

Fig. 1.


Graphs of four networks used in simulation. Black nodes (Inline graphic) denote nodes whose degrees are Inline graphic. (a) (N1) AR(1), (b) (N2) AR(2), (c) (N3) hub network, and (d) (N4) scale-free.

We consider moderate-sized networks with Inline graphic nodes and sample sizes of Inline graphic, Inline graphic, and Inline graphic. To apply the SCPG method, we define two pre-identified edge sets Inline graphic and Inline graphic by randomly selecting 10% and 30% of the true edges, respectively. These two pre-identified edge sets are also used to find the effects of the amount of information on estimating a graph structure. In each network, we generate 50 datasets from a Gaussian distribution with mean 0 and covariance matrix Inline graphic defined with the Inline graphicth element Inline graphic. Note that, for the hub and scale-free networks, we make the network have five exclusive sub-networks, each of which has Inline graphic nodes; the nodes in one sub-network are not connected to those in other sub-networks. This procedure is applied in Peng and others (2009) to describe the module-based networks frequently observed in real networks.
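Drawing a pre-identified edge set as a random fraction of the true edges can be sketched as follows. The helper `sample_known_edges`, the seed, and the 20-edge chain are illustrative assumptions; the paper does not state how its random selection was implemented.

```python
import random


def sample_known_edges(true_edges, fraction, seed=0):
    """Randomly select a given fraction of the true edges as pre-identified."""
    rng = random.Random(seed)   # fixed seed so the draw is reproducible
    edges = sorted(true_edges)  # sort first: set iteration order is arbitrary
    k = round(fraction * len(edges))
    return set(rng.sample(edges, k))


# Hypothetical AR(1)-like chain with 20 true edges.
true_edges = {(i, i + 1) for i in range(20)}
known_10 = sample_known_edges(true_edges, 0.10)
known_30 = sample_known_edges(true_edges, 0.30)
```

Both sets contain only true edges, which corresponds to the setting of this section in which the pre-identified information is correct.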

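Let $\rho$ and $\hat{\rho}(\lambda)$ be $p(p-1)/2$-dimensional vectors of the true partial correlations and the estimates of partial correlations at $\lambda$, respectively. To investigate the theoretical properties of the SCPG method with finite samples, we introduce the true positive rate (TPR) and the true negative rate (TNR) defined as follows:

$$\mathrm{TPR}(\lambda) = \frac{\mathrm{TP}(\lambda)}{\mathrm{TP}(\lambda) + \mathrm{FN}(\lambda)}, \qquad \mathrm{TNR}(\lambda) = \frac{\mathrm{TN}(\lambda)}{\mathrm{TN}(\lambda) + \mathrm{FP}(\lambda)},$$

where $\mathrm{TP}(\lambda) = \sum_{i<j} I(\hat{\rho}^{ij}(\lambda) \neq 0,\ \rho^{ij} \neq 0)$, $\mathrm{FN}(\lambda) = \sum_{i<j} I(\hat{\rho}^{ij}(\lambda) = 0,\ \rho^{ij} \neq 0)$, $\mathrm{TN}(\lambda) = \sum_{i<j} I(\hat{\rho}^{ij}(\lambda) = 0,\ \rho^{ij} = 0)$, $\mathrm{FP}(\lambda) = \sum_{i<j} I(\hat{\rho}^{ij}(\lambda) \neq 0,\ \rho^{ij} = 0)$, and $I(\cdot)$ denotes an indicator function. We additionally define the false discovery rate (FDR) as

$$\mathrm{FDR}(\lambda) = \frac{\mathrm{FP}(\lambda)}{\mathrm{TP}(\lambda) + \mathrm{FP}(\lambda)},$$

where $\mathrm{TP}(\lambda)$ and $\mathrm{FP}(\lambda)$ count the correctly and incorrectly estimated edges. Note that the FDR is not defined if $\mathrm{TP}(\lambda) + \mathrm{FP}(\lambda) = 0$. In this case, we consider the FDR value to be 0 to summarize results with all datasets.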
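These rates can be computed directly from edge sets, as in the sketch below. It assumes edges are stored as pairs `(i, j)` with `i < j`, uses the standard definitions of TPR, TNR, and FDR, and applies the convention from the text that the FDR is 0 when no edge is estimated. The function name `edge_rates` and the zero fallbacks for an empty true edge set or an empty set of non-edges are assumptions.

```python
def edge_rates(true_edges, est_edges, p):
    """Return (TPR, TNR, FDR) for an estimated graph on p nodes.

    Edges are pairs (i, j) with i < j; every pair not listed is a non-edge.
    """
    n_pairs = p * (p - 1) // 2
    tp = len(true_edges & est_edges)  # true edges that were recovered
    fp = len(est_edges - true_edges)  # estimated edges that are not true
    fn = len(true_edges - est_edges)  # true edges that were missed
    tn = n_pairs - tp - fp - fn       # non-edges correctly left out
    # Fallback to 0.0 for empty denominators is an assumption, except for the
    # FDR, where the text sets the value to 0 when no edge is estimated.
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    fdr = fp / (tp + fp) if tp + fp else 0.0
    return tpr, tnr, fdr


true_edges = {(0, 1), (1, 2), (2, 3)}
est_edges = {(0, 1), (1, 2), (0, 3)}
tpr, tnr, fdr = edge_rates(true_edges, est_edges, p=4)
empty_rates = edge_rates(true_edges, set(), p=4)
```

In the example, two of the three true edges are recovered and one of the three estimated edges is false, so the TPR is 2/3 and the FDR is 1/3.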

Figure 2 plots the averages of TPRs, TNRs, and FDRs for various values of $\lambda$ in the aforementioned four networks and shows several interesting features, including results related to the theoretical properties in the previous section. Compared with SPACE, the SCPG method improves TPRs for a given $\lambda$ in all networks considered except the AR(1) network.

Fig. 2.


The averages of TPR, TNR, and FDR for (N1)–(N4) networks in Figure 1 with Inline graphic and Inline graphic. “None”, “C-10%”, and “C-30%” denote the SPACE method (solid), the SCPG method with 10% (dashed) and 30% (dotted) partially identified edges, respectively. (a) (N1) TPRInline graphic, (b) (N1) TNRInline graphic, (c) (N1) FDRInline graphic, (d) (N2) TPRInline graphic, (e) (N2) TNRInline graphic, (f) (N2) FDRInline graphic, (g) (N3) TPRInline graphic, (h) (N3) TNRInline graphic, (i) (N3) FDRInline graphic, (j) (N4) TPRInline graphic, (k) (N4) TNRInline graphic, and (l) (N4) FDRInline graphic.

In terms of TNRs, the SCPG method improves on the SPACE method for any given $\lambda$ in all networks we consider, and the TNRs increase as the amount of pre-identified information increases. This result shows that the theoretical result for the true negative probability described in the previous section still holds with finite samples. Moreover, the SCPG method decreases FDRs for any given $\lambda$ in all networks compared with the SPACE method. Interestingly, in the AR(1) network, the FDRs decrease even though the TPRs also decrease as the amount of pre-identified information increases.

The tuning parameter $\lambda$ in both the SCPG and SPACE methods plays an important role in estimating the network, where a large (or a small) value of $\lambda$ results in a sparse (or a dense) estimate of the network with low false positives (or low false negatives). Several information criteria, such as the Akaike information criterion and Bayesian information criterion (BIC), have been used heuristically for network models (Danaher and others, 2014; Yuan and Lin, 2007; Peng and others, 2009). They were originally designed for the linear regression model and some of them are theoretically shown to select the correct model (Wang and others, 2009; Fan and Tang, 2013). However, these results are limited to the linear regression model, and there is no optimal rule for choosing $\lambda$ in the network model. In this paper, we adopt the generalized information criterion (GIC) proposed by Fan and Tang (2013), which is shown to outperform the BIC in identifying the correct model in the linear regression. The “GIC-type” criterion used in this paper is defined like the “BIC-type” criterion in Peng and others (2009) as

4.

where Inline graphic and Inline graphic is the cardinality of a set Inline graphic. For each dataset, we evaluate the Inline graphic on a grid of Inline graphic and choose a tuning parameter Inline graphic such that Inline graphic, where Inline graphic. In addition, the models selected by the GIC for the SPACE and SCPG methods are evaluated by the TPR, TNR, FDR, the mis-specification rate (MISR), and the Matthews correlation coefficient (MCC). The first three measures have been defined already. Here, we introduce the MISR and MCC, defined as

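$$\mathrm{MISR} = \frac{\mathrm{FP} + \mathrm{FN}}{p(p-1)/2}, \qquad \mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}},$$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives among the $p(p-1)/2$ pairs of variables, respectively. Here, the MISR corresponds to the total error rate of a classifier and the MCC, with a value between $-1$ and $1$, measures the accuracy of a classifier, where $1$, $0$, and $-1$, respectively, denote a perfect classification, a random classification, and a total discordance of classification.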
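A minimal sketch of the two measures, using the standard definitions of the total error rate and the Matthews correlation coefficient computed from the four confusion counts. Returning 0 for the MCC when a margin is empty is a common convention assumed here, not something stated in the paper.

```python
import math


def misr(tp, tn, fp, fn):
    """Mis-specification rate: share of all variable pairs that are misclassified."""
    return (fp + fn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column margin is empty
    return (tp * tn - fp * fn) / denom


perfect = mcc(tp=3, tn=3, fp=0, fn=0)
discordant = mcc(tp=0, tn=0, fp=3, fn=3)
uninformative = mcc(tp=2, tn=2, fp=2, fn=2)
error_rate = misr(tp=2, tn=2, fp=1, fn=1)
```

The three MCC calls reproduce the three reference values described above: a perfect classification, a total discordance, and a classification no better than random.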

Tables 1 and 2 report the averages of these five measures over 50 datasets, which reveal some interesting features of the proposed SCPG method. First, compared with the SPACE method, the TNRs and the MCCs of the SCPG method increase as the amount of pre-identified information increases, for all the cases we consider. Second, the SCPG method has smaller error rates than the SPACE method in terms of the FDRs and the MISRs. Finally, the TPRs of the SCPG method are approximately equal to or higher than those of the SPACE method (without pre-identified information) in all the cases we consider except the AR(1) model. In summary, these features indicate that the SCPG method generally outperforms the SPACE method; the only exception is a slightly lower TPR in the AR(1) network at the smallest sample size.

Table 1.

The averages of Inline graphic, TPR, TNR, FDR, MISR, and MCC for AR(1) and AR(2) networks over 50 datasets

Network Inline graphic Info. Inline graphic TPR TNR FDR MISR MCC
AR(1) Inline graphic Inline graphic None 627.64 99.97 99.9 20.45 0.1 89.12
(2.55) (0.01) (0) (0.32) (0) (0.18)
10% 613.7 99.92 99.91 18.7 0.09 90.08
(2.22) (0.02) (0) (0.29) (0) (0.16)
30% 584.38 99.82 99.93 14.68 0.07 92.24
(2.51) (0.03) (0) (0.36) (0) (0.19)
Inline graphic None 609.58 100 99.91 18.09 0.09 90.46
(2.24) (0) (0) (0.3) (0) (0.17)
10% 596.62 100 99.92 16.3 0.08 91.44
(2.4) (0) (0) (0.33) (0) (0.18)
30% 570.74 100 99.94 12.52 0.06 93.5
(1.87) (0) (0) (0.28) (0) (0.15)
Inline graphic None 600.84 100 99.92 16.9 0.08 91.12
(2.18) (0) (0) (0.3) (0) (0.17)
10% 584.68 100 99.93 14.61 0.07 92.37
(1.83) (0) (0) (0.26) (0) (0.14)
30% 566.2 100 99.95 11.83 0.05 93.87
(1.62) (0) (0) (0.25) (0) (0.13)
AR(2) Inline graphic Inline graphic None 1431.8 75.21 99.45 47.59 0.74 62.4
(14.03) (0.63) (0.01) (0.14) (0) (0.24)
10% 1509.6 87.28 99.48 42.29 0.61 70.68
(10.81) (0.37) (0.01) (0.2) (0) (0.11)
30% 1403.8 94.87 99.63 32.56 0.41 79.8
(7.23) (0.19) (0) (0.25) (0) (0.12)
Inline graphic None 1873.38 100 99.29 46.75 0.7 72.71
(6.57) (0) (0.01) (0.19) (0.01) (0.13)
10% 1724.14 100 99.41 42.14 0.58 75.84
(6.22) (0) (0.01) (0.21) (0) (0.14)
30% 1455.84 100 99.63 31.49 0.37 82.61
(4) (0) (0) (0.19) (0) (0.11)
Inline graphic None 1801.44 100 99.35 44.61 0.64 74.17
(7.39) (0) (0.01) (0.23) (0.01) (0.15)
10% 1665.48 100 99.46 40.09 0.54 77.18
(6.52) (0) (0.01) (0.23) (0.01) (0.15)
30% 1418.54 100 99.66 29.69 0.34 83.7
(3.67) (0) (0) (0.18) (0) (0.11)

Inline graphic denotes the number of estimated edges. All values except for Inline graphic are multiplied by 100. The numbers in parentheses denote the standard errors of measures.

Table 2.

The averages of Inline graphic, TPR, TNR, FDR, MISR, and MCC for hub and scale-free networks over 50 datasets

Network Inline graphic Info. Inline graphic TPR TNR FDR MISR MCC
Hub Inline graphic Inline graphic None 318 48.94 99.97 12.15 0.26 65.38
(4.5) (0.5) (0) (0.49) (0) (0.26)
10% 343.1 53.79 99.97 10.56 0.24 69.2
(4.17) (0.44) (0) (0.44) (0) (0.21)
30% 363.4 60.18 99.98 5.61 0.2 75.24
(3.69) (0.42) (0) (0.35) (0) (0.18)
Inline graphic None 586.38 87.05 99.93 15.43 0.13 85.71
(3.34) (0.18) (0) (0.37) (0) (0.17)
10% 574.96 87.45 99.94 13.39 0.12 86.96
(2.74) (0.18) (0) (0.31) (0) (0.14)
30% 564.5 88.56 99.95 10.68 0.1 88.88
(2.54) (0.17) (0) (0.29) (0) (0.13)
Inline graphic None 654.32 97.01 99.92 15.57 0.1 90.44
(2.78) (0.09) (0) (0.33) (0) (0.17)
10% 644.34 96.99 99.93 14.31 0.09 91.12
(2.29) (0.08) (0) (0.28) (0) (0.14)
30% 619.42 96.92 99.95 10.92 0.07 92.88
(2.21) (0.08) (0) (0.29) (0) (0.15)
Scale-free Inline graphic Inline graphic None 396.6 66.73 99.95 16.58 0.19 74.49
(3.02) (0.26) (0) (0.42) (0) (0.18)
10% 399.04 69.17 99.95 14.04 0.17 77
(3.11) (0.24) (0) (0.45) (0) (0.16)
30% 402.3 73.28 99.97 9.72 0.14 81.25
(2.74) (0.24) (0) (0.38) (0) (0.15)
Inline graphic None 526.1 89.28 99.93 15.87 0.11 86.59
(3.3) (0.18) (0) (0.42) (0) (0.2)
10% 518.12 89.57 99.94 14.31 0.1 87.54
(3.12) (0.17) (0) (0.41) (0) (0.19)
30% 500.22 90.47 99.96 10.37 0.08 89.99
(2.9) (0.18) (0) (0.39) (0) (0.17)
Inline graphic None 561.52 96.79 99.93 14.61 0.08 90.86
(2.26) (0.09) (0) (0.34) (0) (0.19)
10% 551.34 96.63 99.94 13.18 0.07 91.55
(2.2) (0.09) (0) (0.34) (0) (0.19)
30% 536.6 97.14 99.96 10.33 0.06 93.29
(2.09) (0.07) (0) (0.33) (0) (0.17)

Inline graphic denotes the number of estimated edges. All values except for Inline graphic are multiplied by 100. The numbers in parentheses denote the standard errors of measures.

Before we end this section, we report three additional numerical studies. First, we compare the performances of the SCPG and naive methods in estimating the structure of the network. Here, the naive method means the direct addition of pre-identified edges to the network estimated by SPACE. The results show that the SCPG method outperforms the naive method in all cases considered. Details of this comparison are given in Appendix H of Supplementary material available at Biostatistics online and the results are summarized in Table H.1 of Supplementary material available at Biostatistics online. This indicates that incorporating information on pre-identified edges in the estimation itself helps the estimation of network structure. In the second study, we investigate how sensitive the SCPG method is to the misspecification rate (the ratio of false positives in the pre-identified edges). Details and results of the second study are reported in Appendix I of Supplementary material available at Biostatistics online. The results show that the SCPG method still works better than the SPACE method in terms of error rates as long as the misspecification rate is low (not Inline graphic15% in the study). However, we recommend that readers choose the pre-identified edges in a conservative way. Finally, to understand how the SCPG performs in a large-scale network, we repeat the same numerical study as above for the hub and scale-free networks with 1000 nodes; these two networks are the most common assumptions for a large-scale network. The results are similar to those in Table 2. They are reported in Table J.1 in Appendix J of Supplementary material available at Biostatistics online.
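The naive method and the misspecification rate described above are simple set operations, sketched here with hypothetical helper names. Edges are assumed to be stored as pairs `(i, j)` with `i < j`, and returning 0 for an empty pre-identified set is an assumption.

```python
def naive_completion(est_edges, known_edges):
    """Naive method: add the pre-identified edges to an already estimated graph."""
    return set(est_edges) | set(known_edges)


def misspecification_rate(known_edges, true_edges):
    """Share of pre-identified edges that are not true edges (false positives)."""
    known = set(known_edges)
    if not known:
        return 0.0  # assumed convention for an empty pre-identified set
    return len(known - set(true_edges)) / len(known)


true_edges = {(0, 1), (1, 2), (2, 3)}
est_edges = {(0, 1), (0, 3)}   # hypothetical SPACE-style estimate
known_edges = {(1, 2), (0, 2)}  # one correct and one wrong pre-identified edge
naive_graph = naive_completion(est_edges, known_edges)
rate = misspecification_rate(known_edges, true_edges)
```

The naive graph inherits every wrong pre-identified edge and leaves the data-driven part of the estimate unchanged, whereas SCPG uses the pre-identified edges during estimation.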

5. Applications with lung cancer adenocarcinoma

Two recent studies have shown that the hub genes in lung cancer gene regulatory networks may be potential robust biomarkers for lung cancer progression. To study whether our proposed method could discover novel gene biomarkers for cancer progression, we applied the proposed method to construct a network based on a microarray dataset from the Lung Cancer Consortium (Shedden and others, 2008). This dataset measures gene expression levels in 442 lung adenocarcinoma patients. We identified 794 genes whose expression levels are significantly associated with patients’ survival time, after adjusting for clinical variables, based on a univariate Cox regression (see Appendix E of Supplementary material available at Biostatistics online). In addition, we used a list of PPIs from the human protein reference database (HPRD), which provided 39 240 pairs of PPIs for 9617 genes. Only 222 pairs of PPIs for 211 genes were matched to the 794 genes in the lung cancer dataset. We used these 222 pairs as the pre-identified information.

In this study, we compared the gene regulatory networks constructed using (i) the SPACE method and (ii) the proposed SCPG method, with the tuning parameter determined by the GIC. An overview of the networks constructed using the SPACE method and the proposed method is shown in Figure 3. The SPACE method estimated 297 edges among 135 of the 794 genes (659 genes had no connection). The SCPG method estimated 455 edges among 299 genes (495 genes had no connection). To identify hub genes in the estimated graphs, we applied a procedure similar to that described in Peng and others (2009). From the estimated graph structures, we first selected genes whose degrees exceeded the 0.95 quantile of the degree distribution. Then, we calculated the ranks of the degrees of the selected genes for various values of the tuning parameter. We selected as potential hub genes those for which the average of the degree ranks was Inline graphic20 and the standard deviation was Inline graphic2. Following this procedure, we identified 17 hub genes with the SPACE method and 20 hub genes with the proposed method, which incorporates the PPI network information. The identified hub genes are summarized in Table 3. Eleven genes (highlighted in Table 3) were identified by both approaches, among them several key lung cancer genes: NKX2-1, HOP, and SFTPB (further information is given in Appendix F of Supplementary material available at Biostatistics online). Comparing the two methods, we noted that the SCPG method identified nine genes that were missed by the SPACE method: CTNNB1, CSNK2A1, ESR1, NEDD9, FYN, BRCA1, PTPN13, PIK3R1, and SLC34A2. Seven of these nine genes (identified only by SCPG) had been reported to play important roles in lung cancer, whereas, based on our literature search, only two (UBE2C and TYMS) of the six genes identified only by the SPACE method are associated with lung cancer (further details are given in Appendix G of Supplementary material available at Biostatistics online).
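The hub-gene screening described above can be sketched in a few lines of Python. This is only one reading of the procedure, written under several assumptions that the text does not settle: candidates are taken from a single reference graph, ranks are computed among all genes with rank 1 for the largest degree, ties are broken by gene name, the population standard deviation is used, and the two cutoffs (20 and 2 by default) are applied as non-strict upper bounds. All function names and the toy graphs are hypothetical.

```python
from collections import Counter
from math import ceil
from statistics import mean, pstdev


def degrees(edges, genes):
    """Degree of every gene in an undirected edge list (0 if unconnected)."""
    count = Counter()
    for a, b in edges:
        count[a] += 1
        count[b] += 1
    return {g: count.get(g, 0) for g in genes}


def upper_quantile(values, q=0.95):
    """Simple empirical quantile: the ceil(q * n)-th smallest value."""
    ordered = sorted(values)
    return ordered[ceil(q * len(ordered)) - 1]


def rank_by_degree(deg):
    """Rank 1 is the largest degree; ties are broken by gene name (assumption)."""
    ordered = sorted(deg, key=lambda g: (-deg[g], g))
    return {g: i + 1 for i, g in enumerate(ordered)}


def select_hubs(graphs, genes, reference, q=0.95, max_mean_rank=20, max_sd_rank=2):
    """graphs maps a tuning-parameter label to the edge list estimated at that value."""
    ref_deg = degrees(graphs[reference], genes)
    cutoff = upper_quantile(ref_deg.values(), q)
    candidates = [g for g in genes if ref_deg[g] > cutoff]
    ranks = {g: [] for g in candidates}
    for edges in graphs.values():
        current = rank_by_degree(degrees(edges, genes))
        for g in candidates:
            ranks[g].append(current[g])
    # Non-strict upper bounds are an assumption of this sketch.
    return sorted(
        g for g in candidates
        if mean(ranks[g]) <= max_mean_rank and pstdev(ranks[g]) <= max_sd_rank
    )


# Toy data: 50 placeholder genes and two made-up graphs (illustration only).
genes = ["h1", "h2"] + [f"g{i}" for i in range(48)]
graphs = {
    "lam_a": [("h1", f"g{i}") for i in range(10)] + [("h2", f"g{i}") for i in range(10, 18)],
    "lam_b": [("h1", f"g{i}") for i in range(9)] + [("h2", f"g{i}") for i in range(10, 17)],
}
hubs = select_hubs(graphs, genes, reference="lam_a")
print(hubs)
```

The stability criterion is what separates a hub from a gene that is highly connected at only one tuning-parameter value: a gene whose rank changes sharply across the graphs has a large rank standard deviation and is screened out.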

Fig. 3.

Estimated graph structures for the SPACE and SCPG methods. The nodes with numbers denote the detected hub genes reported in Table 3. (a) SPACE and (b) SCPG with PPI networks.

Table 3.

List of potential hub genes identified by the SPACE and SCPG methods

SPACE                           SCPG
No.  Gene symbol  Degree  CR    No.  Gene symbol  Degree  CR
1    PRC1         39      -     1    GPR116*      18      -
2    RRM2*        18      -     2    NKX2-1*      18      Yes
3    CYP2B7P1*    17      -     3    RRM2*        18      -
4    GPR116*      17      -     4    CTNNB1       17      Yes
5    SFTPB*       17      Yes   5    CYP2B7P1*    17      -
6    NKX2-1*      16      Yes   6    CSNK2A1      16      Yes
7    TFF1*        16      -     7    TFF1*        15      -
8    HOP*         15      Yes   8    C1orf116*    14      -
9    C1orf116*    14      -     9    HOP*         14      Yes
10   FMO5*        14      -     10   SFTPB*       14      Yes
11   CD302*       12      -     11   ESR1         13      Yes
12   HSD17B6*     12      -     12   FMO5*        12      -
13   HOXD1        9       -     13   CD302*       11      -
14   TMPRSS2      9       -     14   NEDD9        11      Yes
15   TPX2         9       -     15   FYN          10      Yes
16   UBE2C        8       Yes   16   PTPN13       10      -
17   TYMS         7       Yes   17   BRCA1        9       Yes
-    -            -       -     18   HSD17B6*     9       -
-    -            -       -     19   PIK3R1       9       Yes
-    -            -       -     20   SLC34A2      9       -

Genes identified by both methods (set in bold in the original typesetting) are marked with an asterisk (*). “CR” denotes cancer-related genes identified by previous studies; “Yes” marks such genes and “-” marks an empty cell.

In addition, the SCPG method identified the PTPN13 gene, which had not previously been reported as a lung cancer-related gene. To study this gene further, we downloaded the mRNA expression data together with the clinical annotation from four public lung cancer datasets: (1) Tomida and others (2009) (Inline graphic), (2) Bhattacharjee and others (2001) (Inline graphic), (3) Raponi and others (2006) (Inline graphic), and (4) Jones and others (2004) (Inline graphic). These four datasets were selected because they were published in high-profile journals, contained relatively large samples (at least 80 samples), and were measured on different microarray platforms. Interestingly, under-expression of the PTPN13 gene is consistently associated with poor prognosis of lung cancer patients in all four independent datasets (see Fig. G.1 of Supplementary material available at Biostatistics online). These results indicate that the mRNA expression of the PTPN13 gene is a novel and robust prognostic biomarker of potential clinical importance.

6. Conclusion

Recently, reconstructions of GRNs based on genome-wide mRNA expression data have been widely used to study biological mechanisms and identify novel biomarkers. Learning gene network structures from gene expression data is challenging because the number of possible network edges is extremely large, while the sample sizes available to infer the true edges are small. However, for GRNs, many edges (i.e., gene regulations) have already been identified from pathway information, protein–protein interaction databases, and transcription factor binding databases. So instead of learning the structure of a GRN from scratch, we can incorporate the known edges to ease the daunting task of network reconstruction. In this study, we proposed the SCPG method, a simple but effective modification of the SPACE method, to incorporate partially identified edges in estimating graph structure with a Gaussian graphical model. Compared with the SPACE method, the SCPG method asymptotically increases the true negative probability while attaining the same true positive probability. Moreover, we showed numerically that the SCPG method not only increases the true negative rate but also reduces the false discovery rate. The SCPG method was applied here to estimate the gene regulatory network from lung cancer data with pre-identified edges from the HPRD database, and it identified more cancer-related hub genes than the SPACE method. More importantly, the SCPG method identified a novel prognostic biomarker, the PTPN13 gene. We validated the prognostic performance of PTPN13 gene expression using four independent lung cancer mRNA expression datasets across different experimental platforms. The results indicate that the proposed SCPG method performs well in reconstructing a gene regulatory network and could be used to identify novel biomarkers for predicting disease outcomes.

In this study, we demonstrated that the inference of gene network structures can be improved by incorporating information about previously identified edges from other resources. However, caution is needed because gene regulation can vary among tissues or biological conditions, while most available information about previously identified edges (gene–gene interactions) is not condition specific. As a result, some edges reported in existing databases may not be edges under the specific conditions being studied, which may lead to false-positive edges. A reasonable way to avoid this is to select only those reported edges whose gene pairs show high expression correlations in the expression data used for constructing the network (Ahn and others, 2011). This step helps to identify the gene–gene interactions that are appropriate for the specific conditions under study. In addition, we used the GIC to select the tuning parameter, which produced satisfactory results in the real data application. However, there may be other examples where the GIC performs poorly, and other methods for selecting the tuning parameter may be superior to the GIC. In summary, methodology for objectively selecting tuning parameters is an interesting area for future research.
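The correlation-based screening of database edges mentioned above can be illustrated with a short Python sketch. It is a simplified stand-in rather than the procedure of Ahn and others (2011): the use of the absolute Pearson correlation, the cutoff of 0.5, the function names, and the toy expression values are all assumptions chosen for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def filter_edges_by_correlation(edges, expression, min_abs_corr=0.5):
    """Keep reported edges whose gene pair is correlated in the data at hand.

    min_abs_corr is a hypothetical cutoff; the source gives no value.
    """
    return [
        (a, b) for a, b in edges
        if abs(pearson(expression[a], expression[b])) >= min_abs_corr
    ]


# Toy expression profiles over five samples (made-up numbers).
expression = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 10],
    "geneC": [5, 1, 4, 2, 3],
}
reported_edges = [("geneA", "geneB"), ("geneA", "geneC"), ("geneB", "geneC")]

kept_edges = filter_edges_by_correlation(reported_edges, expression)
print(kept_edges)
```

Only the edges that survive this screen would then be passed on as pre-identified information, which is one way to keep the misspecification rate of the pre-identified set low.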

Supplementary Material

Supplementary material is available online at http://biostatistics.oxfordjournals.org.

Funding

This work was supported by the National Institutes of Health (R01CA172211 to Guanghua Xiao) and by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2011-0029104 to Johan Lim).

References

  1. Ahn J., Yoon Y., Park C., Shin E., Park S. (2011). Integrative gene network construction for predicting a set of complementary prostate cancer genes. Bioinformatics 27(13), 1846–1853.
  2. Akavia U. D., Litvin O., Kim J., Sanchez-Garcia F., Kotliar D., Causton H. C., Pochanard P., Mozes E. and others (2010). An integrated approach to uncover drivers of cancer. Cell 143(6), 1005–1017.
  3. Anderson T. W. (1955). The integral of a symmetric unimodal function over a symmetric convex set and some probability inequalities. Proceedings of the American Mathematical Society 6(2), 170–176.
  4. Bair E., Hastie T., Paul D., Tibshirani R. (2006). Prediction by supervised principal components. Journal of the American Statistical Association 101, 119–137.
  5. Bhattacharjee A., Richards W. G., Staunton J., Li C., Monti S., Vasa P., Ladd C., Beheshti J. and others (2001). Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proceedings of the National Academy of Sciences of the United States of America 98(24), 13790–13795.
  6. Cai T., Liu W. D., Luo X. (2011). A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association 106(494), 594–607.
  7. Danaher P., Wang P., Witten D. (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B 76(2), 373–397.
  8. Dempster A. (1972). Covariance selection. Biometrics 28, 157–175.
  9. Fan Y., Tang C. Y. (2013). Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B 75(3), 531–552.
  10. Friedman N. (2004). Inferring cellular networks using probabilistic graphical models. Science 303(5659), 799–805.
  11. Friedman J., Hastie T., Tibshirani R. (2007). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3), 432–441.
  12. Jones M. H., Virtanen C., Honjoh D., Miyoshi T., Satoh Y., Okumura S., Nakagawa K., Nomura H., Ishikawa Y. (2004). Two prognostically significant subtypes of high-grade lung neuroendocrine tumours independent of small-cell and large-cell neuroendocrine carcinomas identified by gene expression profiles. The Lancet 363(9411), 775–781.
  13. Knight K., Fu W. (2000). Asymptotics for lasso-type estimators. The Annals of Statistics 28(5), 1356–1378.
  14. Meinshausen N., Bühlmann P. (2006). High-dimensional graph and variable selection with the lasso. The Annals of Statistics 34(3), 1436–1462.
  15. Peng J., Wang P., Zhou N., Zhu J. (2009). Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association 104, 735–746.
  16. Raponi M., Zhang Y., Yu J., Chen G., Lee G., Taylor J. M., Macdonald J., Thomas D. and others (2006). Gene expression signatures for predicting prognosis of squamous cell and adenocarcinomas of the lung. Cancer Research 66(15), 7466–7472.
  17. Schwarz G. (1978). Estimating the dimension of a model. The Annals of Statistics 6(2), 461–464.
  18. Shedden K., Taylor J. M., Enkemann S. A., Tsao M. S., Yeatman T. J., Gerald W. L., Eschrich S., Jurisica I. and others (2008). Gene expression-based survival prediction in lung adenocarcinoma: a multi-site, blinded validation study. Nature Medicine 14(8), 822–827.
  19. Tang H., Xiao G., Behrens C., Schiller J., Allen J., Chow C. W., Suraokar M., Corvalan A. and others (2013). A 12-gene set predicts survival benefits from adjuvant chemotherapy in non-small cell lung cancer patients. Clinical Cancer Research 19(6), 1577–1586.
  20. Taylor I. W., Linding R., Warde-Farley D., Liu Y., Pesquita C., Faria D., Bull S., Pawson T. and others (2009). Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nature Biotechnology 27(2), 199–204.
  21. Tomida S., Takeuchi T., Shimada Y., Arima C., Matsuo K., Mitsudomi T., Yatabe Y., Takahashi T. (2009). Relapse-related molecular signature in lung adenocarcinomas identifies patients with dismal prognosis. Journal of Clinical Oncology 27(17), 2793–2799.
  22. Wang H., Li B., Leng C. (2009). Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B 71(3), 671–683.
  23. Wasserman L., Roeder K. (2009). High dimensional variable selection. The Annals of Statistics 37(5), 2178–2201.
  24. Yuan M., Lin Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94(1), 19–35.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
