Abstract
Recent advances in high-throughput biotechnologies have generated various types of genetic, genomic, epigenetic, transcriptomic and proteomic data across different biological conditions. Integrating data from diverse experiments may lead to a more unified and global view of biological systems and complex diseases. We present a coherent statistical framework for integrating various types of data from distinct but related biological conditions through graphical models. Specifically, our statistical framework is designed for modeling multiple networks with shared regulatory mechanisms from heterogeneous high-dimensional datasets. The performance of our approach is illustrated through simulations and its applications to cancer genomics.
Keywords: Cancer genomics, data integration, graphical models
1. Introduction.
Recent advances in high-throughput technologies have generated unprecedented types and amounts of data for biomedical research. Examples include genome-wide characterizations of DNA variations (e.g., genotyping arrays, whole exome or genome sequencing), gene expression variations (e.g., gene expression microarrays, RNA sequencing), epigenetic variations and protein expression variations. Each data type, for example, genomic, transcriptomic or proteomic data, provides a comprehensive, but one-layer, view of the biological system being studied. Integrating data of diverse types is likely to lead to a more unified and global view. Thus, increasing research attention is being paid to the integrative analysis and modeling of various types of biomedical data. For instance, Varambally et al. (2005) reported the signatures of metastatic progression through integrative genomic and proteomic analysis of prostate cancer. Ouyang, Zhou and Wong (2009) proposed a predictive model to integrate ChIP-Seq and RNA-Seq to capture cooperation among regulators. Chen, Slack and Zhao (2013) developed a statistical framework for joint analysis of expression profiles of microRNAs and messenger RNAs from multiple cancers. Troyanskaya et al. (2003) proposed a Bayesian framework for combining heterogeneous data sources for gene function prediction in Saccharomyces cerevisiae. Myers and Troyanskaya (2007) developed a network prediction approach that leverages biological context information, and applied it to Saccharomyces cerevisiae. Myers et al. (2005) proposed a Bayesian approach to identifying biological networks from diverse functional genomic data. Myers et al. (2006) assessed several evaluation methods and suggested a new approach to evaluation in functional genomics. Shen and Tseng (2010) developed a new meta-analysis approach to pathway enrichment analysis when combining multiple genomic studies.
For more literature review, see Ge, Walhout and Vidal (2003), Hawkins, Hon and Ren (2010), Hecker et al. (2009), Joyce and Palsson (2006), Ritchie et al. (2015).
In this paper, we focus on the problem of discovering regulatory relationships among heterogeneous genomic variables from biological conditions with potentially shared regulation mechanisms. In this scenario, genomic variables can be genomic variants (for instance, mutations and copy number alterations), epigenetic states (for instance, methylation status) and gene expression profiles. Biological conditions can be different tissue types or different cancer types, etc. The heterogeneous genomic variables can be binary, categorical or continuous. Different biological systems have both shared regulations and tissue or disease specific regulations. Thus, we need a statistical method to jointly learn conditional independence among a set of discrete or continuous variables across a set of distinct but related conditions. Conditional independence among variables can be represented by a graphical model in which nodes represent variables and the absence of an edge between two variables implies conditional independence.
In recent years, many efforts have been devoted to estimating undirected graphical models, especially in the high-dimensional setting under the assumption that the underlying graph is sparse. In most of the published work, the nodes in the graphical models represent either continuous or discrete variables, but not both. In the case of continuous variables, much interest has been focused on estimating Gaussian graphical models of the relationships among a set of random variables with a joint multivariate normal distribution, where zero entries in the precision (or concentration) matrix correspond to conditional independence. Meinshausen and Bühlmann (2006) proposed to estimate the precision matrix via a marginal penalized regression approach. Peng, Zhou and Zhu (2009) extended this approach to estimate partial correlations of Gaussian random variables by joint sparse regression models. Instead of performing regressions, Yuan and Lin (2006), Friedman, Hastie and Tibshirani (2008) and others took a penalized log-likelihood approach. This approach has been extended by Guo et al. (2011) and Danaher, Wang and Witten (2013) to infer multiple Gaussian graphical models based on data collected from distinct but related conditions such as different cancer types. Yin and Li (2011) and Li, Chun and Zhao (2012) considered external effects on the inferred edges through modeling conditional Gaussian graphical models. Chun et al. (2013) proposed joint conditional Gaussian graphical models with multiple sources of genomic data. In the case of discrete variables, the Ising model can be used to model conditional independence. Höfling and Tibshirani (2009) proposed a pseudolikelihood approach to estimating sparse binary pairwise Markov networks. Ravikumar, Wainwright and Lafferty (2010) and Guo et al. (2010) formulated the model selection methods for high-dimensional Ising models under a penalized logistic regression framework.
When both discrete and continuous variables are considered, Lauritzen (1996) proposed a mixed graphical model in the low-dimensional setting. Recently, several methods have been proposed to estimate a mixed graphical model in the high-dimensional setting. Lee and Hastie (2012) proposed a pairwise graphical model over continuous and discrete variables using a group lasso penalty. Cheng, Levina and Zhu (2013) provided an approach that substitutes the l1 penalty for the group lasso penalty to reduce computation. Fellinghauer et al. (2013) took a random forests approach to mixed variables. Chen, Witten and Shojaie (2015) and Yang et al. (2013) investigated the pairwise graphical model in which the conditional distributions of the nodes belong to an exponential family.
However, in the scenario of multiple networks with mixed types of measurements, for instance, multiple cancer types with copy number variations and mutation measurements, the methods mentioned above are not suitable to be applied directly to gain biologically interpretable results. For instance, if we simply treat biological conditions (cancer types in the example) as categorical variables with equal roles as mutations, we may end up with a network modeling interactions among cancer types. These interactions are not as biologically meaningful as the interactions among mutations and/or copy number variations. Moreover, if we ignore the similarities among biological conditions and estimate the networks separately, we may get less accurate networks. Thus, there is a need to treat biological conditions and genomic measurements differently, and to discover multiple related mixed graphical models in the high-dimensional setting to represent distinct but related relationships under different conditions. We will use cancer genomic data as a motivating example to illustrate our method.
Cancers are complex diseases involving many different mechanisms. High-throughput technologies applied to human cancers have generated large genomic datasets, such as The Cancer Genome Atlas (TCGA) [Tomczak, Czerwińska and Wiznerowicz (2015)]. TCGA provides molecular landscapes of thousands of human cancers at multiple layers, including mutations and copy number alterations. It facilitates the study of regulatory mechanisms underlying various cancers. For example, Ciriello et al. (2013) identified distinct oncogenic processes as well as unexpected similarities among tumors originating from different tissues. However, the molecular regulatory networks underlying cancers are still largely unknown, impeding our understanding of cancer classifications and patient stratification, an important issue in precision medicine.
In this paper, we consider statistical learning of multiple graphical models that consist of both continuous and discrete variables, and develop a method named Data Integration through Graphical models (DIG). We formally introduce our model in Section 2, and propose appropriate penalty schemes in Section 3 to handle high-dimensional data. The problem of estimating multiple mixed graphical models is then formulated as an optimization problem. We propose an algorithm for parameter estimation in Section 4 and a procedure for tuning parameter selection in Section 5. We then illustrate our method through simulations in Section 6 and a real application to cancer genomic data in Section 7. We conclude the paper with a discussion in Section 8.
2. Model.
We assume that there are a total of K groups where we have observations consisting of both continuous and discrete variables from each group. Let (xp×1, yq×1)k denote a mixed (i.e., having both continuous and discrete variables) random vector, where k is the group label (such as tissue or disease), xp×1 is a p-dimensional vector of discrete variables, and yq×1 is a q-dimensional vector of continuous variables. We assume that the density function f (x, y) has the following form proposed by Lauritzen (1996), with k omitted for simplicity:
\[ f(x, y) = \exp\left\{ g_x + h_x^{\top} y - \tfrac{1}{2}\, y^{\top} \Omega_x\, y \right\}, \tag{2.1} \]
where gx is a real-valued function of x, hx is a q-vector-valued function of x taking discrete values, and Ωx is a q × q positive definite symmetric matrix of x.
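Completing the square in y, one can note that equation (2.1) can be rewritten as
\[ f(x, y) = \exp\left\{ g_x + \tfrac{1}{2}\, h_x^{\top} \Omega_x^{-1} h_x - \tfrac{1}{2}\, \bigl(y - \Omega_x^{-1} h_x\bigr)^{\top} \Omega_x \bigl(y - \Omega_x^{-1} h_x\bigr) \right\}. \]
Thus, the density defined in equation (2.1) implies a conditional Gaussian distribution of y|x with mean $\Omega_x^{-1} h_x$ and variance $\Omega_x^{-1}$. Integrating out y, the marginal distribution of the discrete variables x has the following form:
\[ f(x) = (2\pi)^{q/2}\, |\Omega_x|^{-1/2} \exp\left\{ g_x + \tfrac{1}{2}\, h_x^{\top} \Omega_x^{-1} h_x \right\}. \]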
We further simplify equation (2.1) by ignoring all interaction terms between discrete variables of order higher than two, and assuming that discrete variables affect continuous variables in a linear form, that is, the conditional precision (inverse covariance) matrix Ωx and the canonical mean vector hx of the continuous (Gaussian) variables are modeled as linear functions of the discrete variables. Therefore, we have the following specifications for the functional forms of gx, hx and Ωx:
where λ0 is the normalizing constant,
each xj takes integer values 1 to Lj; λj (·) is a discrete function taking on Lj possible values; λjm(·, ·) is a bi-variate function with Lj × Lm possible values, and λjm(xj, xm) = λmj (xm, xj); η0 is a q-dimensional vector; ηj(·) is a q-dimensional function with Lj possible values for each dimension; Φ0 is a q × q matrix; and Φj (·) is a q × q matrix with each element having Lj possible values, which are all 0. For identifiability, we set λj(1) = 0, ηjr(1) = 0, Φjrs(1) = 0 and λjm(1, ·) = λjm(·, 1) = 0, for j ∈ {1, … , p}, m ∈ {1, … , p}, r ∈ {1, … , q} and s ∈ {1, … , q}.
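For reference, one specification that is consistent with the parameter definitions above can be written as follows. This is a reconstruction under assumptions, not a verbatim reproduction: in particular, summing the pairwise terms over j < m (rather than over all ordered pairs with a factor of 1/2) is an assumed convention.

```latex
\begin{aligned}
g_x      &= \lambda_0 + \sum_{j=1}^{p} \lambda_j(x_j)
            + \sum_{1 \le j < m \le p} \lambda_{jm}(x_j, x_m), \\
h_x      &= \eta_0 + \sum_{j=1}^{p} \eta_j(x_j), \\
\Omega_x &= \Phi_0 + \sum_{j=1}^{p} \Phi_j(x_j).
\end{aligned}
```

Under this reading, the identifiability constraints make level 1 of each discrete variable the reference level, so that $h_x = \eta_0$ and $\Omega_x = \Phi_0$ when every $x_j = 1$.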
Now, the model is parametrized by λj, λjm, η0, ηj, Φ0 and Φj, where j ∈ {1, … , p}, m ∈ {1, … , p}. For simplicity, we use Θ to denote the collection of the parameters mentioned above. Among these parameters, λj and η0 are nuisance parameters, which refer to discrete and continuous node potentials, respectively. The rest are responsible for edge potentials, which are the parameters of interest. Specifically, is the continuous-continuous edge potential; λjm is the discrete–discrete edge potential, which is a bivariate function taking on Lj × Lm values; ηjr (xj) is the continuous–discrete edge potential, which takes Lj values.
The model covers the two special situations when there are only discrete or continuous variables naturally. In the case of only discrete variables, the mixed model reduces to a discrete Markov random field,
while in the case of only continuous variables, the mixed model reduces to a multivariate Gaussian graphical model,
For a graphical model, the conditional distributions are important because they characterize the conditional dependence among the variables. The conditional distributions of the proposed model are as follows:
- The conditional distribution of xj given the rest is multinomial,
- The conditional distribution of yr given the rest is Gaussian,
where . This implies a regression model as
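The Gaussian conditional above can be checked numerically. The following is a minimal sketch, assuming a fixed value of x so that the continuous part has density proportional to exp(hᵀy − ½ yᵀΩy); the function name and the numerical values are hypothetical and chosen only for illustration. For such a density, yr given the other coordinates is Gaussian with variance 1/Ωrr and mean (hr − Σs≠r Ωrs ys)/Ωrr, which is the linear regression of yr on the remaining continuous variables.

```python
import numpy as np

def conditional_of_y_r(h, omega, y, r):
    """Mean and variance of y[r] given the other coordinates, for a density
    proportional to exp(h @ y - 0.5 * y @ omega @ y)."""
    others = [s for s in range(len(y)) if s != r]
    var = 1.0 / omega[r, r]
    mean = (h[r] - omega[r, others] @ y[others]) * var
    return mean, var

# Hypothetical example values (not taken from the paper).
omega = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.5, -0.3],
                  [0.0, -0.3, 1.0]])
h = np.array([0.4, -0.2, 1.0])
y = np.array([0.1, 0.7, -0.5])

mean, var = conditional_of_y_r(h, omega, y, r=1)
```

The regression coefficients of yr on ys are −Ωrs/Ωrr, so a zero entry Ωrs removes ys from the regression of yr.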
We assume that the observations from different classes, labeled by k where k varies from 1 to K, are independent. Given the observed data for class k with nk samples, the negative log-likelihood for class k is
The minimization of the negative log-likelihood incorporates the calculation of the normalization scalar. Because the mixed model includes the discrete model as a special case which is computationally intractable, directly minimizing (2.2) is challenging and impractical. Instead, we propose to use a computationally efficient and consistent estimation approach, the pseudolikelihood method, which is formed by the product of all conditional distributions as below:
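\[ \tilde{L}_k\bigl(\Theta^{(k)}\bigr) = \prod_{i=1}^{n_k} \left[ \prod_{j=1}^{p} p\bigl(x_{ij}^{(k)} \mid x_{i,-j}^{(k)}, y_i^{(k)}; \Theta^{(k)}\bigr) \prod_{r=1}^{q} p\bigl(y_{ir}^{(k)} \mid x_i^{(k)}, y_{i,-r}^{(k)}; \Theta^{(k)}\bigr) \right]. \tag{2.3} \]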
The model above treats biological conditions differently from categorical biological measurements such as SNPs, and estimates multiple biological networks from different biological conditions jointly. Mathematically, one alternative is to treat the biological conditions and the categorical biological measurements such as SNPs equally, and to estimate a single mixed graphical model. This approach may result in edges among biological conditions, and such a network is hard to interpret biologically. Thus, we propose a framework of joint mixed graphical models instead of treating biological conditions as categorical biological measurements and estimating one mixed graphical model.
Another alternative is to estimate the networks from different biological conditions separately instead of jointly. As illustrated in Danaher, Wang and Witten (2013) through a toy example that consists of observations from two classes following two normal distributions with distinct covariance matrices, estimating networks separately in each class results in less accurate estimates than estimating networks jointly. Thus, we adopt a joint graphical model approach in our application scenario.
It is easy to show that the additive negative log-pseudolikelihood is jointly convex in all the parameters {Θ(k)} over the region [see the Supplementary Material, Zhang, Ouyang and Zhao (2017)].
3. Penalty terms.
In the graphical representation of probability distributions, the absence of an edge between two variables corresponds to conditional independence between the two variables. In the proposed mixed model for each class (with k omitted for simplicity), there are three types of edges:
Discrete–discrete: If λjm(xj, xm) = 0 for all values of xj and xm, then there is no edge between nodes xj and xm. The corresponding edge potential is in either or of equation (2.3).
Discrete–continuous: If ηjr = 0 for all values of xj and Φjrs = 0 for all values of ys , s ∈ {1, … , q}, then there is no edge between nodes xj and yr. The corresponding edge potential parameter is in either or of equation (2.3).
Continuous–continuous: If Φ0rs = 0 for all values of yr and ys , and Φjrs = 0 for all values of yr, ys and xj, then there is no edge between nodes yr and ys. These parameters are in either or of equation (2.3).
In summary, we have the following equivalences:
and
and
To estimate the parameters for high-dimensional data, we assume that the underlying true graph is sparse, and incorporate penalization on the number of edges in the minimization of the additive negative log-pseudolikelihood to obtain a sparse graphical model. For the l0-norm, which is defined as the number of nonzero elements in a vector, that is, , we may use the l0 penalty to infer the graphical model for each class separately. Also, we notice that, for edges involving discrete variables, the absence of that edge requires the entire matrix λjm (for discrete–discrete), or matrix Φjrs and vector ηjr (for discrete–continuous) to be 0. Specifically, we set up the following optimization problem (with k omitted for simplicity):
| (3.1) |
where λjm is an Lj × Lm matrix, ηjr is a vector with length Lj, Φjrs is a vector with length Lj, Φ0rs is a scalar, and ρ is a tuning parameter. The l0 function involved is integer valued and nonconvex, so the optimization is generally hard to solve. In the machine learning literature, such difficult optimization problems are usually solved through an appropriate relaxation [see, for example, Cheng, Levina and Zhu (2013)]. Notice that, for any vector b, and for any matrix B, . Thus, we can replace the l0 norm in optimizing (3.1) with an appropriate convex relaxation, for example,
We also notice that, for any vector b, , and for any matrix B = (bij), . Thus, we can replace and with the corresponding upper bound penalties, leading to the following optimization problem:
| (3.2) |
For simplicity, we let C denote the indices such that (θuv)u∈C,v ∈C contain parameters {λjm}, {ηj }, {Φ0} and {Φj} only. Then problem (3.2) can be written as
After discussing the optimization function for a single class, we now formulate our problem for joint analysis of multiple classes. The basic assumption for joint graphical model analysis across different biological conditions is that there are commonalities shared among multiple classes. We propose two penalization approaches (fused lasso and group lasso) to encourage borrowing information from multiple biological conditions for the estimation of the joint mixed graphical models:
| (3.3) |
Specifically, we define the penalty terms as follows.
In the case of the fused graphical lasso,
where ρ1 and ρ2 are tuning parameters. The fused graphical lasso penalty implies that graphs from multiple biological conditions are the same except for a few edges.
In the case of the group graphical lasso,
where ρ1 and ρ2 are tuning parameters. The group graphical lasso penalty treats the related biological conditions as one group, and implies that the underlying multiple graphs are the same.
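To make the two penalties concrete, here is a minimal sketch that evaluates them on a small array of edge parameters. It assumes the penalties take the same form as in the joint graphical lasso of Danaher, Wang and Witten (2013): an l1 sparsity term weighted by ρ1, plus either pairwise absolute differences between classes (fused) or an l2 norm of each edge across classes (group), weighted by ρ2. The function names and the layout of `thetas` (one row per class, one column per edge parameter) are hypothetical.

```python
import numpy as np

def fused_penalty(thetas, rho1, rho2):
    """Assumed fused form: rho1 * sum |theta| + rho2 * sum_{k<l} |theta_k - theta_l|."""
    thetas = np.asarray(thetas, dtype=float)
    n_classes = thetas.shape[0]
    sparsity = np.abs(thetas).sum()
    fusion = sum(np.abs(thetas[k] - thetas[l]).sum()
                 for k in range(n_classes) for l in range(k + 1, n_classes))
    return rho1 * sparsity + rho2 * fusion

def group_penalty(thetas, rho1, rho2):
    """Assumed group form: rho1 * sum |theta| + rho2 * sum_edges ||theta[:, edge]||_2."""
    thetas = np.asarray(thetas, dtype=float)
    sparsity = np.abs(thetas).sum()
    grouping = np.sqrt((thetas ** 2).sum(axis=0)).sum()
    return rho1 * sparsity + rho2 * grouping

# Hypothetical edge parameters for K = 2 classes and 3 candidate edges.
identical = [[1.0, 0.0, -2.0], [1.0, 0.0, -2.0]]
differing = [[1.0, 0.0, -2.0], [0.0, 0.0, -2.0]]
```

Under these assumed forms, the fusion term vanishes when the classes share identical edge values, whereas the group term only rewards a shared zero pattern, which is one way to see the different kinds of similarity the two penalties encourage.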
4. Algorithm.
In this section, we focus on the numerical algorithms to solve the optimization problem proposed above. This constrained optimization problem can be simplified and solved by replacing it with a series of distributed problems through an augmented Lagrangian scheme. We first make the objective function separable by rewriting (3.3) as
| (4.1) |
subject to the constraint that Z(k) = Θ(k) for k = 1, … , K , where {Z} = {Z(1), … , Z(K)}. Then we carry out the function optimization and regularization locally and coordinate them globally via constraints by further rewriting problem (4.1) using the scaled augmented Lagrangian [Boyd et al. (2011), Hestenes (1969)] as
| (4.2) |
where U = {U(1), … , U(K)} are the dual feasibility-tolerance variables, and d is a scalar. The augmented Lagrangian optimization problem (4.2) can be solved by the alternating direction method of multipliers (ADMM), which is guaranteed to converge to the global optimum [Boyd et al. (2011)]. The skeleton of the algorithm at the ith iteration includes the following three steps:
Please refer to Supplementary Material for details [Zhang, Ouyang and Zhao (2017)]. Briefly, to estimate Θ, we use a coordinate-wise descent approach to obtain each parameter in Θ, and directly apply a well-suited proximal gradient algorithm, which can achieve ε-optimality within iterations. The convergence rates and properties of proximal gradient algorithms and their accelerated variants have been well studied [Auslender and Teboulle (2006), Beck and Teboulle (2009)]. To update Z, the optimization problem is separable with respect to each pair of elements in the matrix, and thus can be solved using the fused lasso signal approximator in Hoefling (2010) or the group lasso operator in Friedman, Hastie and Tibshirani (2010) depending on the choice of penalty P.
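As an illustration of the alternating scheme, the sketch below runs scaled-form ADMM in the style of Boyd et al. (2011): a Θ update that minimizes the loss plus a quadratic coupling term, a Z update that applies the proximal operator of the penalty, and a dual update U ← U + Θ − Z. To keep it short, the pseudolikelihood is replaced by a toy quadratic loss ½‖Θ − A‖², which is an assumption made purely for illustration so that the Θ update has a closed form; the group penalty form (l1 term plus an l2 norm of each edge across classes) is likewise assumed, and all names are hypothetical.

```python
import numpy as np

def soft_threshold(a, t):
    """Elementwise soft-thresholding, the proximal operator of t * |.|."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def group_prox(a, t1, t2):
    """Prox of t1 * sum|z| + t2 * sum_edges ||z[:, edge]||_2 for a of shape (K, m)."""
    s = soft_threshold(a, t1)
    norms = np.sqrt((s ** 2).sum(axis=0))
    scale = np.maximum(1.0 - t2 / np.maximum(norms, 1e-12), 0.0)
    return s * scale

def admm_group(a, rho1, rho2, d=1.0, n_iter=300):
    """Scaled ADMM for min 0.5 * ||theta - a||^2 + penalty(z) subject to theta = z."""
    theta = np.zeros_like(a)
    z = np.zeros_like(a)
    u = np.zeros_like(a)
    for _ in range(n_iter):
        theta = (a + d * (z - u)) / (1.0 + d)          # Theta update (closed form here)
        z = group_prox(theta + u, rho1 / d, rho2 / d)  # Z update (penalty prox)
        u = u + theta - z                              # scaled dual update
    return theta, z

# Hypothetical input: K = 2 classes, 3 edge parameters.
a = np.array([[2.0, 0.3, -1.0],
              [1.5, -0.2, 0.1]])
theta_hat, z_hat = admm_group(a, rho1=0.25, rho2=0.5)
```

For this toy loss the exact minimizer is `group_prox(a, rho1, rho2)`, which gives a simple check that the iterations converge and that Θ and Z agree at convergence. With the actual pseudolikelihood, the Θ update has no closed form and would be carried out iteratively, as described above.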
We note that separate regressions have been used in estimating a single graphical model, including the Gaussian graphical model [Meinshausen and Bühlmann (2006)], the Ising model [Ravikumar, Wainwright and Lafferty (2010)] and the mixed graphical model [Chen, Witten and Shojaie (2015), Cheng, Levina and Zhu (2013), Yang et al. (2013)]. The regression-type approach is computationally convenient by virtue of effective regression tools, such as glmnet [Friedman, Hastie and Tibshirani (2009)]. However, node-wise regression yields asymmetric estimates of edge potentials for an undirected graph, so the final parameter estimates must be selected in an arbitrary or ad hoc way. The computational scheme employed in our approach yields symmetric estimates of edge potentials for the undirected graphs. In this respect, the proposed optimization method is preferable to the simple approach to parameter estimation via separate node-wise regression. Moreover, as discussed in Section 2, it is not appropriate to treat biological conditions as discrete measurements in the modeling and estimate one mixed graph. Furthermore, estimating networks separately in each class can result in less accurate estimates than estimating networks jointly [Danaher, Wang and Witten (2013)]. Thus, we use the proposed symmetric pseudolikelihood method to jointly estimate mixed graphical models.
It is also notable that our model covers a special situation when there are only continuous variables across multiple biological conditions, which was studied by joint graphical lasso (JGL) [Danaher, Wang and Witten (2013)]. In this simple case, one only needs to estimate precision matrices of multivariate Gaussian random variables, which results in estimated concentration graphs. It has been shown that the thresholded sample covariance graph induces the same connected components as those induced by the estimated concentration graph under the same regularization parameter [Mazumder and Hastie (2012), Witten, Friedman and Simon (2011)]. With this nice property of a path of graphical lasso solutions, it can result in a faster computation by employing screenings of empirical covariance matrices to determine whether the solution to concentration graphs is block diagonal upon feature permutation, and by performing the JGL algorithm on the features within each block separately [Danaher, Wang and Witten (2013)].
5. Tuning parameter selection.
For the selection of tuning parameters, we propose the following Bayesian information criterion (BIC) type of approach:
where is the pseudolikelihood for the observations from the kth class with the tuning parameters ρ1 and ρ2, and Ek is the number of edges in the kth mixed graphical model. It is notable that the proposed BIC-type approach departs from classical BIC approaches by using the pseudolikelihood rather than the likelihood.
We notice that the Akaike information criterion (AIC) approach has been used for the selection of Gaussian graphical models [Danaher, Wang and Witten (2013)]. Analogously, one may use the following AIC-type approach for model selection in our research scenario using the pseudolikelihood:
We compared the two model selection criteria through simulation studies as shown in Section 6. Our analysis suggests that the AIC-type approach tends to choose larger but less accurate models than the proposed BIC-type criterion. Thus, we use the proposed BIC-type approach in the real application as illustrated in Section 7.
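The difference between the two criteria can be illustrated with a small sketch. It assumes the classical forms with the pseudolikelihood in place of the likelihood, summed over classes: twice the negative log-pseudolikelihood plus log(nk)·Ek for the BIC-type score, and plus 2·Ek for the AIC-type score. The exact expressions used in the paper may differ, and the function names and numbers below are hypothetical.

```python
import math

def bic_type_score(neg_log_pl, n_samples, n_edges):
    """Assumed BIC-type score: sum_k [2 * nlpl_k + log(n_k) * E_k]."""
    return sum(2.0 * l + math.log(n) * e
               for l, n, e in zip(neg_log_pl, n_samples, n_edges))

def aic_type_score(neg_log_pl, n_edges):
    """Assumed AIC-type score: sum_k [2 * nlpl_k + 2 * E_k]."""
    return sum(2.0 * l + 2.0 * e for l, e in zip(neg_log_pl, n_edges))

# Hypothetical fits for K = 2 classes with 500 samples each.
n = [500, 500]
sparse_fit = {"nlpl": [1000.0, 1010.0], "edges": [20, 22]}
dense_fit = {"nlpl": [975.0, 985.0], "edges": [40, 42]}

bic_sparse = bic_type_score(sparse_fit["nlpl"], n, sparse_fit["edges"])
bic_dense = bic_type_score(dense_fit["nlpl"], n, dense_fit["edges"])
aic_sparse = aic_type_score(sparse_fit["nlpl"], sparse_fit["edges"])
aic_dense = aic_type_score(dense_fit["nlpl"], dense_fit["edges"])
```

With these made-up numbers the BIC-type score prefers the sparser fit while the AIC-type score prefers the denser one, because log(nk) exceeds 2 once nk is larger than about 7, so each additional edge is penalized more heavily under the BIC-type score.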
6. Simulations.
We demonstrate the performance of our approach through simulations. Without loss of generality, in the following and for the simplicity of simulation, we focus on Ωx = Ω, that is, Φjrs(xj) = 0 for any j ∈ {1, … , p}, r ∈ {1, … , q}, s ∈ {1, … , q} and xj ∈ {1, … , Lj }. With this assumption, the covariance matrix for the continuous variables is independent of the values of the discrete variables for the same class.
6.1. Random networks.
We first considered two mixed graphical models (representing two classes), each consisting of 10 categorical (with two levels, 1 and 2) and 10 Gaussian variables. The topologies for the two simulated networks are shown in Figure 1. The left panel of Figure 1 is the adjacency matrix of the mixed graphical model of the first class, while the right panel represents the second class. The first 10 rows/columns correspond to discrete variables, while the second 10 rows/columns correspond to continuous variables. The degree distributions are almost uniform, similar to those in the synthetic experiments in Lee and Hastie (2012). Based on the edge sets defined in Figure 1, we assigned nonzero potentials on the edges for each mixed graphical model as follows.
Fig. 1. The true adjacency matrices of simulated random networks in Section 6.1.
First, we generated a p × p edge potential matrix connecting discrete variables as below:
| (6.1) |
Second, we assigned 2p × q elements for an edge potential matrix connecting discrete and continuous random variables as follows:
| (6.2) |
Third, we generated a precision matrix Ω = (ωrs) of continuous variables as below:
To draw samples (x, y) from the joint density f (x, y), we first drew samples x ~ f (x) of the following form:
with
To overcome the difficulty with direct sampling from f (x), we adapted the Gibbs sampling approach in Lee and Hastie (2012). We drew 202,000 samples in total for discrete random variables of each mixed graphical model, and discarded the first 2000 samples which were generated in a burn-in period. Then we took one sample every 100 draws to preserve independence. After sampling x, we sampled y from the conditional distribution f (y|x), which is N (Ω−1(η0 + η(x)), Ω−1) with η0 = 0.
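The two-stage sampling scheme can be sketched as follows for binary discrete variables and a constant precision matrix Ω. This is a simplified illustration under stated assumptions, not the implementation used in the paper: levels 1 and 2 are coded as 0 and 1, so that by the identifiability constraints only the level-2 potentials matter; `lam1`, `lam2` and `eta` are hypothetical arrays holding those potentials; and the burn-in and thinning values are much smaller than those reported above. The discrete marginal is taken as f(x) ∝ exp{gx + ½ hxᵀΩ⁻¹hx} with hx = η0 + η(x).

```python
import numpy as np

def log_marginal(x, lam1, lam2, eta0, eta, sigma):
    """Unnormalized log f(x) for binary x (0 = reference level) and constant Omega."""
    h = eta0 + x @ eta
    return lam1 @ x + 0.5 * x @ lam2 @ x + 0.5 * h @ sigma @ h

def sample_mixed(n, lam1, lam2, eta0, eta, omega, burn_in=200, thin=10, seed=0):
    """Gibbs-sample x from f(x), then draw y | x ~ N(Omega^{-1} h_x, Omega^{-1})."""
    rng = np.random.default_rng(seed)
    p, q = eta.shape
    sigma = np.linalg.inv(omega)
    x = rng.integers(0, 2, size=p).astype(float)
    xs = []
    for sweep in range(burn_in + n * thin):
        for j in range(p):
            x1, x0 = x.copy(), x.copy()
            x1[j], x0[j] = 1.0, 0.0
            log_odds = (log_marginal(x1, lam1, lam2, eta0, eta, sigma)
                        - log_marginal(x0, lam1, lam2, eta0, eta, sigma))
            x[j] = float(rng.random() < 1.0 / (1.0 + np.exp(-log_odds)))
        # Keep one draw every `thin` sweeps after the burn-in period.
        if sweep >= burn_in and (sweep - burn_in) % thin == 0:
            xs.append(x.copy())
    xs = np.array(xs)
    means = (eta0 + xs @ eta) @ sigma  # rows are Omega^{-1} h_x (sigma is symmetric)
    noise = rng.multivariate_normal(np.zeros(q), sigma, size=n)
    return xs, means + noise

# Hypothetical parameters: p = 3 binary and q = 2 Gaussian variables.
lam1 = np.zeros(3)
lam2 = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
eta0 = np.zeros(2)
eta = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, -1.0]])
omega = np.array([[1.0, 0.3], [0.3, 1.0]])
xs, ys = sample_mixed(50, lam1, lam2, eta0, eta, omega)
```

Thinning reduces the autocorrelation between retained Gibbs draws; it does not make them exactly independent.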
Using the proposed method DIG, we discovered the network structures for two classes over a range of tuning parameters. We recorded the total number of identified edges for each pair of tuning parameters and calculated the number of true positive edges and the number of false positive edges. It took 52 seconds to obtain the results with the proposed algorithm on a 3-GHz Intel Core i7 processor. For comparison, we also applied the JGL proposed by Danaher, Wang and Witten (2013) by treating the two values (1 and 2) of discrete random variables as continuous variables. Similarly, we calculated the number of true positive edges and the number of false positive edges for a range of tuning parameters of JGL. The results are shown in Figure 2. One can see that our method has better performance with both group lasso and fused lasso penalty schemes. This shows the benefit of explicitly modeling discrete and continuous variables in discovering the underlying networks.
Fig. 2. Comparison of DIG and JGL on random networks using the simulated data in Section 6.1.
We then investigated the performance of the tuning parameter selection procedure by checking the sensitivity and specificity of the selected model. The sensitivity and specificity are defined as below, respectively,
\[ \text{sensitivity} = \frac{TP}{TP + FN}, \qquad \text{specificity} = \frac{TN}{TN + FP}, \]
where TP refers to true positives, FP refers to false positives, TN refers to true negatives and FN refers to false negatives. For each pair of tuning parameters, we calculated the corresponding sensitivity, specificity and score for the proposed BIC-type model selection criterion as shown in Figure 3. Figure 3(a) shows the sensitivities of DIG with the group lasso penalty over a range of tuning parameters, while Figure 3(b) shows the sensitivities of DIG with the fused lasso penalty. Figure 3(c) shows the specificities of DIG with the group lasso penalty over a range of tuning parameters, while Figure 3(d) shows the specificities of DIG with the fused lasso penalty. Figures 3(e) and (f) show the corresponding BIC-type scores for the group lasso and fused lasso, respectively, with the selected model for each type of penalty indicated by a purple diamond. The results show that DIG achieves high sensitivities (1 for both group lasso and fused lasso penalties) and specificities (0.93 for group lasso penalty, 0.96 for fused lasso penalty) with the proposed BIC-type model selection approach. To compare, we also investigated model selection performance through the AIC-type approach. The orange circles in Figure 3 indicate the corresponding selected models for the group lasso penalty and the fused lasso penalty. In this simulation study, although the graphical models selected by the AIC-type approach can achieve as high sensitivities as the BIC-type approach, the corresponding specificities are lower (0.78 for group lasso penalty, 0.49 for fused lasso penalty) than those of the BIC-type approach. The results suggest that the AIC-type approach tends to choose a larger but less accurate model than the BIC-type approach.
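Sensitivity and specificity of an estimated graph can be computed directly from adjacency matrices, as in the following minimal sketch. It uses the standard definitions TP/(TP + FN) and TN/(TN + FP) and counts each undirected edge once through the upper triangle; the function name and the example graphs are hypothetical.

```python
import numpy as np

def edge_recovery_rates(true_adj, est_adj):
    """Return (sensitivity, specificity) of an estimated undirected graph."""
    true_adj = np.asarray(true_adj, dtype=bool)
    est_adj = np.asarray(est_adj, dtype=bool)
    upper = np.triu_indices_from(true_adj, k=1)  # each node pair counted once
    t, e = true_adj[upper], est_adj[upper]
    tp = np.sum(t & e)
    fn = np.sum(t & ~e)
    tn = np.sum(~t & ~e)
    fp = np.sum(~t & e)
    return tp / (tp + fn), tn / (tn + fp)

def adjacency(n_nodes, edges):
    """Symmetric 0/1 adjacency matrix from a list of undirected edges."""
    adj = np.zeros((n_nodes, n_nodes), dtype=int)
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1
    return adj

# Hypothetical 4-node example: two of three true edges found, one false edge.
true_adj = adjacency(4, [(0, 1), (1, 2), (2, 3)])
est_adj = adjacency(4, [(0, 1), (1, 2), (0, 3)])
sensitivity, specificity = edge_recovery_rates(true_adj, est_adj)
```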
Fig. 3. Performance of DIG with a range of tuning parameter pairs on random networks using the simulated data in Section 6.1. Purple diamonds indicate the model selected by the proposed BIC procedure. Orange circles indicate the model selected by the AIC-type criterion.
6.2. Scale-free networks.
It has been shown that many real networks are scale-free, that is, their degree distributions follow a power law. In this simulation, we investigate the performance of our approach for scale-free networks where the probability that a node has a connectivity of d is proportional to d^(−γ). It has been found that the γ values for real-world networks usually vary from 2 to 3 [Albert, Jeong and Barabási (2000), Barabási and Albert (1999), Govindan and Tangmunarunkit (2000), Jeong et al. (2001), Yook, Oltvai and Barabási (2004)]. Thus, we randomly generated two networks for two classes where γ is 2.433 and 2.317, respectively. The nodes in each network consist of 10 discrete variables and 10 continuous variables. The simulated structures of the two scale-free networks guided us to assign edge potentials. We used the formula of equation (6.1) for the potentials of the edges connecting discrete variables, and the formula of equation (6.2) for the potentials of the edges between discrete and continuous variables. For the potentials of the edges connecting continuous variables, we drew random values using the following approach such that the precision matrix Ω is a positive definite matrix. For each class, we first created a q × q matrix with ones on the diagonal and zeros on elements not corresponding to network edges. Then we drew nonzero random values on elements corresponding to edges. To obtain each of these nonzero random values, we first drew a random number a from a uniform distribution U (0, 1), then we randomly picked a number b from {0.1, −0.4} with equal probability. The nonzero random values assigned to the elements corresponding to edges are 0.3a + b. Then we divided each off-diagonal element by 1.5 times the sum of the absolute values of the off-diagonal elements in its row. Finally, we added the transpose of the matrix to the matrix itself to achieve a symmetric and positive definite matrix Ω.
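The construction of Ω described above can be sketched as follows. This is an illustrative reading of the steps, with a hypothetical function name and a small ring graph standing in for the simulated scale-free network; positive definiteness is checked numerically in the example rather than assumed.

```python
import numpy as np

def make_precision(adj, seed=0):
    """Build a symmetric matrix with the sparsity pattern of `adj`, following
    the steps in the text: draw 0.3*a + b on edges, rescale rows, add transpose."""
    rng = np.random.default_rng(seed)
    adj = np.asarray(adj, dtype=bool)
    q = adj.shape[0]
    a = rng.uniform(0.0, 1.0, size=(q, q))        # a ~ U(0, 1)
    b = rng.choice([0.1, -0.4], size=(q, q))      # b from {0.1, -0.4}, equal probability
    m = np.where(adj, 0.3 * a + b, 0.0)
    np.fill_diagonal(m, 0.0)
    row_sums = np.abs(m).sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0                 # isolated nodes: nothing to rescale
    m = m / (1.5 * row_sums)                      # rescale each row's off-diagonal entries
    np.fill_diagonal(m, 1.0)                      # ones on the diagonal
    return m + m.T                                # symmetrize by adding the transpose

# Hypothetical edge set: a ring over q = 6 continuous variables.
q = 6
ring = np.zeros((q, q), dtype=bool)
for i in range(q):
    ring[i, (i + 1) % q] = ring[(i + 1) % q, i] = True

omega = make_precision(ring)
min_eigenvalue = np.linalg.eigvalsh(omega).min()
```

For the ring graph every node has two neighbors, so the resulting matrix is strictly diagonally dominant and `min_eigenvalue` is positive. For graphs with high-degree hubs it is worth inspecting the smallest eigenvalue explicitly before using the matrix as a precision matrix.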
To draw samples for each graphical model, we adapted the sample generation procedure in Section 6.1. We set the burn-in threshold as 2000 in the Gibbs sampler. For each graphical model, we took one observation every 100 draws to preserve independence, and obtained 2000 samples in total for the following investigation.
Using the proposed DIG approach, we discovered network topologies using the simulated samples over a range of tuning parameters. Figures 4(a) and (b) show the sensitivities of DIG with group lasso penalty and fused lasso penalty, respectively, while Figures 4(c) and (d) show the corresponding specificities. Figures 4(e) and (f) show the corresponding scores of the proposed BIC-type approach, with the selected model for each type of penalty indicated by a purple diamond. The results show that DIG can achieve high sensitivity (1 for both group lasso and fused lasso penalties) and specificity (0.87 for group lasso penalty, 0.86 for fused lasso penalty) with the proposed BIC-type approach. This suggests that the proposed BIC-type model selection approach can select suitable tuning parameters. We also investigated the effects of sample sizes on the performance of DIG. With a smaller sample size of 800, we obtained sensitivity as high as 1 for both group lasso and fused lasso, with a reduced specificity of 0.61 for group lasso and 0.62 for fused lasso.
Fig. 4. Performance of DIG with a range of tuning parameter pairs on scale-free networks using the simulated data in Section 6.2. Purple diamonds indicate the models (group lasso penalty: ρ1 = 0.0001, ρ2 = 0.1; fused lasso penalty: ρ1 = 0.05, ρ2 = 0.1) selected by the proposed BIC procedure.
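For readers who wish to reproduce this type of evaluation, the sensitivity and specificity of an estimated network can be computed from the true and estimated edge sets. The sketch below assumes that both quantities are computed over all unordered node pairs of an undirected graph; the function name and the toy edge lists are hypothetical and are not results from our simulation.

```python
from itertools import combinations


def sensitivity_specificity(n_nodes, true_edges, est_edges):
    """Edge-level sensitivity and specificity for undirected graphs."""
    true_set = {frozenset(e) for e in true_edges}
    est_set = {frozenset(e) for e in est_edges}
    all_pairs = {frozenset(p) for p in combinations(range(n_nodes), 2)}
    tp = len(true_set & est_set)               # true edges that were recovered
    fn = len(true_set - est_set)               # true edges that were missed
    fp = len(est_set - true_set)               # estimated edges that are not true
    tn = len(all_pairs - true_set - est_set)   # non-edges correctly left out
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Toy example with five nodes (hypothetical edges).
truth = [(0, 1), (1, 2), (2, 3)]
estimate = [(0, 1), (2, 1), (2, 3), (3, 4), (0, 4)]
print(sensitivity_specificity(5, truth, estimate))
```

In the toy example all three true edges are recovered, so the sensitivity is 1, while two of the seven true non-edges are wrongly included, giving a specificity of 5/7.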
We also compared our approach with JGL in Danaher, Wang and Witten (2013). The results are shown in Figure 5. In the scenario of scale-free mixed networks, our method outperforms JGL with higher true positive discovery rates and lower false positive discovery rates in both group lasso and fused lasso penalty schemes.
Fig. 5. Comparison of DIG and JGL on scale-free networks using the simulated data in Section 6.2.
7. Application to cancer genomic data.
We applied our DIG approach to TCGA datasets of two cancer types: colorectal carcinoma (coadread) and breast invasive carcinoma (brca). We obtained the mutation and copy number variation (CNV) data compiled by Ciriello et al. (2013) for 491 coadread subjects and 488 brca subjects. We used the PI3K-mTOR-AKT pathway to illustrate our method. Our data include 62 genes with mutation information and their corresponding copy number measurements for each subject. We treated mutations as discrete variables with two levels representing the presence and absence of mutations, and copy number variations as continuous variables. We applied the fused lasso penalty to the datasets from the two cancer types, and used the proposed BIC approach to choose the tuning parameters (ρ1 = 0.3, ρ2 = 0.5). The selected tuning parameter ρ2 = 0.5 for the fused lasso indicates that similarities exist in our data between coadread and brca. It took 134 seconds to obtain the results using a 3-GHz Intel Core i7 processor with our current DIG implementation. We identified 1660 edges for the coadread class and 1632 edges for the brca class. Among the coadread interactions, 16% are mutation-mutation interactions, 42.3% are mutation-CNV interactions, and the rest are CNV-CNV interactions. For brca, 16.2% of edges are mutation-mutation interactions, 43% are mutation-CNV interactions, and the rest are CNV-CNV interactions. The two tumor networks share 1584 edges. We studied the community structures, or modules, in the identified networks through the eigenspectrum decomposition of the modularity matrices described in Newman (2006). We found four modules in coadread with sizes of 16, 66, 22 and 20, and two modules in brca with sizes of 58 and 66. Interestingly, the second module is shared by coadread and brca. This module is shown in the left panel of Figure 6 with the common interactions in coadread and brca plotted.
The nodes with the highest degrees are indicated by their names. The hubs in this common module for both coadread and brca are important oncogenes, including TP53, that play important roles in many cancers. We also performed enrichment analysis using INGENUITY (www.ingenuity.com) on the genes involved in this common module. As shown in the right panel of Figure 6, the identified functions are highly relevant to the studied biological context and critical to tumor development, for example, melanoma signaling and cell-cycle events. The remaining tumor-specific communities for coadread and brca are presented in Supplementary Figure 1, and are indicated by different colors. The representative nodes with the highest degrees in each module are indicated by their names as well. We also plotted the detailed tumor-specific interactions in Supplementary Figure 2. There are 76 coadread-specific interactions and 48 brca-specific interactions. Nodes are shown in different colors corresponding to degree differences between the two types of tumors. Specifically, red nodes have the same degrees in the two conditions, while blue nodes have higher degrees in the coadread condition, and orange nodes have higher degrees in the brca condition. We also zoomed in on some known oncogenes, with their names shown in the figures. We identified genetic variants that are known to be implicated in individual cancer types. For example, the MTOR copy number has a higher degree in coadread versus brca, which is consistent with the activation of the PI3K-mTOR-AKT pathway in coadread [Ciriello et al. (2013)]. Also, the BRCA1 mutation has a higher degree in brca versus coadread, which corresponds to the inactivation of BRCA1 in breast tumors that leads to defective cell cycle arrest in response to DNA damage [Network et al. (2012)].
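To make the module detection step concrete, the following sketch performs a single spectral bisection based on the leading eigenvector of the modularity matrix, in the spirit of Newman (2006). It is an illustration under simplifying assumptions, not the code used for our analysis: the graph is treated as unweighted and undirected, only one split is made (the full procedure splits the resulting groups repeatedly), the leading eigenvector is obtained by a shifted power iteration, and the function names and toy graph are hypothetical.

```python
def modularity_matrix(n, edges):
    """Return B with B[i][j] = A[i][j] - k_i * k_j / (2m), together with 2m."""
    adj = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0
    deg = [sum(row) for row in adj]
    two_m = sum(deg)
    b = [[adj[i][j] - deg[i] * deg[j] / two_m for j in range(n)] for i in range(n)]
    return b, two_m


def leading_eigenvector(b, n_iter=500):
    """Power iteration on B + shift * I, which has no negative eigenvalues."""
    n = len(b)
    shift = max(sum(abs(x) for x in row) for row in b)
    v = [float(i + 1) for i in range(n)]       # fixed, non-constant starting vector
    for _ in range(n_iter):
        w = [sum(b[i][j] * v[j] for j in range(n)) + shift * v[i] for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def bisect_by_modularity(n, edges):
    """Split nodes by the sign of the leading eigenvector; return both groups and Q."""
    b, two_m = modularity_matrix(n, edges)
    v = leading_eigenvector(b)
    s = [1 if x > 0 else -1 for x in v]
    # Modularity of the split: Q = s^T B s / (4m).
    q = sum(b[i][j] * s[i] * s[j] for i in range(n) for j in range(n)) / (2 * two_m)
    group_a = {i for i in range(n) if s[i] > 0}
    group_b = {i for i in range(n) if s[i] < 0}
    return group_a, group_b, q


# Toy graph: two triangles joined by a single edge.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
print(bisect_by_modularity(6, toy_edges))
```

On the toy graph the split separates the two triangles, and the corresponding modularity is 5/14.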
Fig. 6. Common module and functions for networks identified by DIG. (a) The shared module with common interactions present in both coadread and brca. (b) Functions identified by enrichment analysis of the genes involved in the common module. P-values are obtained by Fisher’s exact tests.
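As an illustration of how an enrichment p-value of this kind can be obtained, the sketch below computes a one-sided Fisher's exact test as the upper tail of a hypergeometric distribution. The choice of the one-sided (over-representation) version is an assumption of this sketch, the function name is hypothetical, and the counts in the example are made up for illustration rather than taken from our analysis.

```python
from math import comb


def enrichment_p_value(overlap, module_size, annotated, background):
    """P(X >= overlap) when module_size genes are drawn from a background of
    `background` genes, of which `annotated` carry the function of interest."""
    upper = min(module_size, annotated)
    total = comb(background, module_size)
    tail = sum(
        comb(annotated, x) * comb(background - annotated, module_size - x)
        for x in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 3 of 4 module genes are annotated, 5 of 20 background genes are.
print(enrichment_p_value(overlap=3, module_size=4, annotated=5, background=20))
```

In this toy setting the p-value equals (C(5,3)C(15,1) + C(5,4)C(15,0)) / C(20,4) = 155/4845, or about 0.032.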
For comparison, we also performed the analysis without considering the similarities of coadread and brca, that is, with ρ2 = 0. We selected the optimal value for the tuning parameter controlling sparsity as ρ1 = 0.5. We found 1222 edges for coadread and 981 edges for brca. Among them, there are 1000 coadread-specific interactions, 759 brca-specific interactions and 222 common interactions. The number of common interactions in this analysis is much smaller than that in the joint analysis above. We also performed modularity analysis for the resulting networks. We found three modules for coadread with sizes of 67, 1 and 56, as well as four modules for brca with sizes of 36, 45, 22 and 21. Moreover, the results show that the shared edges cannot form a common community. Furthermore, we also applied JGL [Danaher, Wang and Witten (2013)] to this cancer dataset by treating mutations as continuous variables, like CNVs. This yielded 890 interactions for coadread and 1218 interactions for brca. Among these interactions, for coadread, 1.69% are mutation-mutation interactions and 21.5% are mutation-CNV interactions. For brca, 0.41% are mutation-mutation interactions and 14.4% are mutation-CNV interactions. The results suggest that JGL is less capable of identifying interactions related to mutations. Specifically, JGL identified only one interaction (NRG1) associated with the mTOR mutation in coadread, and no interaction was identified with the mTOR mutation in brca. In contrast, DIG identified 20 and 19 interactions associated with the mTOR mutation in coadread and brca, respectively. In coadread, DIG suggests that the mTOR mutation is associated with the mutations of TP53, RB1, MAP3K1, COL4A5 and PTEN, as well as CNVs of MTOR, ARID1A, DNMT3A, TET2, FBXW7, SDK1, NRG1, CDKN1B, HCN4, CTCF, CDH1, MAP2K4, SMAD4, PHLPP1 and EP300.
In brca, DIG indicates that the mTOR mutation is connected with the mutations of TP53, RB1, MAP3K1, COL4A5 and PTEN, as well as CNVs of MTOR, DNMT3A, TET2, FBXW7, SDK1, NRG1, CDKN1B, HCN4, CTCF, CDH1, MAP2K4, SMAD4, PHLPP1 and EP300. As the PI3K-mTOR-AKT pathway is the biological context considered in our application, JGL is likely to miss important interactions relevant to the mTOR mutation compared to DIG. One piece of supporting evidence is that activation of p53 inhibits mTOR activity, and inhibited mTOR also affects p53 activity [Feng et al. (2005)].
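The counts of common and condition-specific interactions, and the breakdown of edges by variable type, reduce to simple set operations on the estimated edge lists. The sketch below shows one way to compute them; the function names, node labels and toy edges are hypothetical and do not correspond to the networks estimated from the TCGA data.

```python
def compare_networks(edges_a, edges_b):
    """Split two undirected edge lists into common and condition-specific edges."""
    a = {frozenset(e) for e in edges_a}
    b = {frozenset(e) for e in edges_b}
    return {"common": a & b, "only_a": a - b, "only_b": b - a}


def edge_type_fractions(edges, discrete_nodes):
    """Fraction of edges joining two discrete, one discrete and one continuous,
    or two continuous variables."""
    labels = ["continuous-continuous", "discrete-continuous", "discrete-discrete"]
    counts = dict.fromkeys(labels, 0)
    for u, v in edges:
        n_discrete = (u in discrete_nodes) + (v in discrete_nodes)
        counts[labels[n_discrete]] += 1
    total = len(edges)
    return {label: counts[label] / total for label in labels}


# Toy networks: "m*" nodes stand for mutations (discrete), "c*" nodes for CNVs (continuous).
net_a = [("m1", "m2"), ("m1", "c1"), ("c1", "c2"), ("m2", "c2")]
net_b = [("m2", "m1"), ("c1", "c2"), ("m2", "c1")]
split = compare_networks(net_a, net_b)
print({name: len(edge_set) for name, edge_set in split.items()})
print(edge_type_fractions(net_a, discrete_nodes={"m1", "m2"}))
```

The same bookkeeping underlies the counts reported above: for example, 1222 coadread edges with 222 in common leave 1000 coadread-specific interactions, and 981 brca edges leave 759 brca-specific interactions.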
8. Discussion.
In this paper, we have proposed a coherent statistical framework, DIG, for the problem of estimating multiple related mixed graphical models from high-dimensional data with both discrete and continuous variables and with observations belonging to distinct but related biological conditions. The application has been illustrated using cancer studies. DIG is a general statistical framework that can be applied to the genomics of other diseases. For future work, it is natural to extend the proposed framework by employing exponential families for mixed graphical models [Chen, Witten and Shojaie (2015), Yang et al. (2013)]. Furthermore, it would be interesting to develop hypothesis testing methods such that the final mixed graphical models are accompanied by a p-value for each edge and an overall estimate of the edge false discovery rate. A systematic investigation of model selection and hypothesis testing for the components of the mixed graphical models would be important future work.
Acknowledgments.
We thank the Associate Editor, two referees and Zhiyi Chi for their careful reading of the manuscript and many helpful suggestions.
Supported in part by the InCHIP Faculty Affiliate Seed Grant at UConn, Faculty Research Excellence Program Award at UConn and the CICATS PreK Career Development Award at UConn.
Supported in part by the Research Starter Grant in Informatics from PhRMA Foundation.
Supported in part by National Science Foundation Grant DMS-11-06738, and National Institutes of Health (NIH) Grants R01 GM59507 and P01 CA154295.
Footnotes
SUPPLEMENTARY MATERIAL
Supplement to “A statistical framework for data integration through graphical models with application to cancer genomics.” (DOI: 10.1214/16-AOAS998SUPP; .pdf). We present technical and methodological details regarding the model and algorithm in Sections 2 and 4. Furthermore, complementary results for the application in Section 7 are provided.
Contributor Information
YUPING ZHANG, DEPARTMENT OF STATISTICS, INSTITUTE FOR SYSTEMS GENOMICS, CENTER FOR QUANTITATIVE MEDICINE, INSTITUTE FOR COLLABORATION ON HEALTH, INTERVENTION, AND POLICY, THE CONNECTICUT INSTITUTE FOR THE BRAIN AND COGNITIVE SCIENCES, UNIVERSITY OF CONNECTICUT, STORRS, CONNECTICUT 06269, USA, yuping.zhang@uconn.edu.
ZHENGQING OUYANG, THE JACKSON LABORATORY FOR GENOMIC MEDICINE, DEPARTMENT OF BIOMEDICAL ENGINEERING, DEPARTMENT OF GENETICS AND GENOME SCIENCES, INSTITUTE FOR SYSTEMS GENOMICS, UNIVERSITY OF CONNECTICUT, FARMINGTON, CONNECTICUT 06030, USA, zhengqing.ouyang@jax.org.
HONGYU ZHAO, DEPARTMENT OF BIOSTATISTICS, YALE SCHOOL OF PUBLIC HEALTH, NEW HAVEN, CONNECTICUT 06510, USA, hongyu.zhao@yale.edu.
REFERENCES
- ALBERT R, JEONG H and BARABÁSI A-L (2000). Error and attack tolerance of complex networks. Nature 406 378–382.
- AUSLENDER A and TEBOULLE M (2006). Interior gradient and proximal methods for convex and conic optimization. SIAM J. Optim. 16 697–725 (electronic). MR2197553
- BARABÁSI A-L and ALBERT R (1999). Emergence of scaling in random networks. Science 286 509–512. MR2091634
- BECK A and TEBOULLE M (2009). Gradient-based algorithms with applications to signal recovery. Convex Optim. Signal Process. Commun. 42–88.
- BOYD S, PARIKH N, CHU E, PELEATO B and ECKSTEIN J (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3 1–122.
- CHEN X, SLACK FJ and ZHAO H (2013). Joint analysis of expression profiles from multiple cancers improves the identification of microRNA–gene interactions. Bioinformatics 29 2137–2145.
- CHEN S, WITTEN DM and SHOJAIE A (2015). Selection and estimation for mixed graphical models. Biometrika 102 47–64.
- CHENG J, LEVINA E and ZHU J (2013). High-dimensional mixed graphical models. Preprint. Available at arXiv:1304.2810.
- CHUN H, CHEN M, LI B and ZHAO H (2013). Joint conditional Gaussian graphical models with multiple sources of genomic data. Front. Genet. 4 Article ID 294. DOI: 10.3389/fgene.2013.00294.
- CIRIELLO G, MILLER ML, AKSOY BA, SENBABAOGLU Y, SCHULTZ N and SANDER C (2013). Emerging landscape of oncogenic signatures across human cancers. Nat. Genet. 45 1127–1133.
- DANAHER P, WANG P and WITTEN DM (2013). The joint graphical lasso for inverse covariance estimation across multiple classes. J. R. Stat. Soc. Ser. B. Stat. Methodol. 76 373–397. MR3164871
- FELLINGHAUER B, BÜHLMANN P, RYFFEL M, VON RHEIN M and REINHARDT JD (2013). Stable graphical model estimation with random forests for discrete, continuous, and mixed variables. Comput. Statist. Data Anal. 64 132–152. MR3061894
- FENG Z, ZHANG H, LEVINE AJ and JIN S (2005). The coordinate regulation of the p53 and mTOR pathways in cells. Proc. Natl. Acad. Sci. USA 102 8204–8209.
- FRIEDMAN J, HASTIE T and TIBSHIRANI R (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 432–441.
- FRIEDMAN J, HASTIE T and TIBSHIRANI R (2009). Glmnet: Lasso and elastic-net regularized generalized linear models. R Package Version 1
- FRIEDMAN J, HASTIE T and TIBSHIRANI R (2010). A note on the group lasso and a sparse group lasso. Technical report, Dept. Statistics, Stanford Univ., Stanford.
- GE H, WALHOUT AJ and VIDAL M (2003). Integrating ‘omic’ information: A bridge between genomics and systems biology. Trends Genet. 19 551–560.
- GOVINDAN R and TANGMUNARUNKIT H (2000). Heuristics for Internet map discovery. In Proceedings IEEE INFOCOM 2000. Conference on Computer Communications. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies 3 1371–1380. IEEE, New York.
- GUO J, LEVINA E, MICHAILIDIS G and ZHU J (2010). Joint structure estimation for categorical Markov networks. Technical report, Dept. Statistics, Univ. of Michigan, Ann Arbor.
- GUO J, LEVINA E, MICHAILIDIS G and ZHU J (2011). Joint estimation of multiple graphical models. Biometrika 98 1–15. MR2804206
- HAWKINS RD, HON GC and REN B (2010). Next-generation genomics: An integrative approach. Nat. Rev. Genet. 11 476–486.
- HECKER M, LAMBECK S, TOEPFER S, VAN SOMEREN E and GUTHKE R (2009). Gene regulatory network inference: Data integration in dynamic models - A review. Biosystems 96 86–103.
- HESTENES MR (1969). Multiplier and gradient methods. J. Optim. Theory Appl. 4 303–320. MR0271809
- HOEFLING H (2010). A path algorithm for the fused lasso signal approximator. J. Comput. Graph. Statist. 19 984–1006. Supplementary materials available online. MR2791265
- HÖFLING H and TIBSHIRANI R (2009). Estimation of sparse binary pairwise Markov networks using pseudo-likelihoods. J. Mach. Learn. Res. 10 883–906. MR2505138
- JEONG H, MASON SP, BARABÁSI A-L and OLTVAI ZN (2001). Lethality and centrality in protein networks. Nature 411 41–42.
- JOYCE AR and PALSSON BØ (2006). The model organism as a system: Integrating “omics” data sets. Nat. Rev. Mol. Cell Biol. 7 198–210.
- LAURITZEN SL (1996). Graphical Models. Oxford Statistical Science Series 17. Oxford Univ. Press, New York. MR1419991
- LEE JD and HASTIE TJ (2012). Learning mixed graphical models. Preprint. Available at arXiv:1205.5012.
- LI B, CHUN H and ZHAO H (2012). Sparse estimation of conditional graphical models with application to gene networks. J. Amer. Statist. Assoc. 107 152–167. MR2949348
- MAZUMDER R and HASTIE T (2012). Exact covariance thresholding into connected components for large-scale graphical lasso. J. Mach. Learn. Res. 13 781–794. MR2913718
- MEINSHAUSEN N and BÜHLMANN P (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462. MR2278363
- MYERS CL and TROYANSKAYA OG (2007). Context-sensitive data integration and prediction of biological networks. Bioinformatics 23 2322–2330.
- MYERS CL, ROBSON D, WIBLE A, HIBBS MA, CHIRIAC C, THEESFELD CL, DOLINSKI K and TROYANSKAYA OG (2005). Discovery of biological networks from diverse functional genomic data. Genome Biol. 6 Article ID R114. DOI: 10.1186/gb-2005-6-13-r114.
- MYERS CL, BARRETT DR, HIBBS MA, HUTTENHOWER C and TROYANSKAYA OG (2006). Finding function: Evaluation methods for functional genomic data. BMC Genomics 7 187.
- NETWORK CGA et al. (2012). Comprehensive molecular portraits of human breast tumours. Nature 490 61–70.
- NEWMAN MEJ (2006). Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E (3) 74 Article ID 036104. MR2282139
- OUYANG Z, ZHOU Q and WONG WH (2009). ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells. Proc. Natl. Acad. Sci. USA 106 21521–21526.
- PENG J, ZHOU N and ZHU J (2009). Partial correlation estimation by joint sparse regression models. J. Amer. Statist. Assoc. 104 735–746. MR2541591
- RAVIKUMAR P, WAINWRIGHT MJ and LAFFERTY JD (2010). High-dimensional Ising model selection using ℓ1-regularized logistic regression. Ann. Statist. 38 1287–1319. MR2662343
- RITCHIE MD, HOLZINGER ER, LI R, PENDERGRASS SA and KIM D (2015). Methods of integrating data to uncover genotype-phenotype interactions. Nat. Rev. Genet. 16 85–97.
- SHEN K and TSENG GC (2010). Meta-analysis for pathway enrichment analysis when combining multiple genomic studies. Bioinformatics 26 1316–1323.
- TOMCZAK K, CZERWINSKA P and WIZNEROWICZ M (2015). The Cancer Genome Atlas (TCGA): An immeasurable source of knowledge. Contemp. Oncol. 19 A68–A77.
- TROYANSKAYA OG, DOLINSKI K, OWEN AB, ALTMAN RB and BOTSTEIN D (2003). A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc. Natl. Acad. Sci. USA 100 8348–8353.
- VARAMBALLY S, YU J, LAXMAN B, RHODES DR, MEHRA R, TOMLINS SA, SHAH RB, CHANDRAN U, MONZON FA, BECICH MJ et al. (2005). Integrative genomic and proteomic analysis of prostate cancer reveals signatures of metastatic progression. Cancer Cell 8 393–406.
- WITTEN DM, FRIEDMAN JH and SIMON N (2011). New insights and faster computations for the graphical lasso. J. Comput. Graph. Statist. 20 892–900. MR2878953
- YANG E, RAVIKUMAR P, ALLEN GI and LIU Z (2013). On graphical models via univariate exponential family distributions. Preprint. Available at arXiv:1301.4183.
- YIN J and LI H (2011). A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Ann. Appl. Stat. 5 2630–2650. MR2907129
- YOOK S-H, OLTVAI ZN and BARABÁSI A-L (2004). Functional and topological characterization of protein interaction networks. Proteomics 4 928–942.
- YUAN M and LIN Y (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B. Stat. Methodol. 68 49–67. MR2212574
- ZHANG Y, OUYANG Z and ZHAO H (2017). Supplement to “A statistical framework for data integration through graphical models with application to cancer genomics.” DOI: 10.1214/16-AOAS998SUPP.
