Biostatistics (Oxford, England). 2018 Dec 26;21(3):561–576. doi: 10.1093/biostatistics/kxy078

Bayesian inference of networks across multiple sample groups and data types

Elin Shaddox, Christine B Peterson, Francesco C Stingo, Nicola A Hanania, Charmion Cruickshank-Quinn, Katerina Kechris, Russell Bowler, Marina Vannucci
PMCID: PMC7672535  PMID: 30590505

Summary

In this article, we develop a graphical modeling framework for the inference of networks across multiple sample groups and data types. In medical studies, this setting arises whenever a set of subjects, which may be heterogeneous due to differing disease stage or subtype, is profiled across multiple platforms, such as metabolomics, proteomics, or transcriptomics data. Our proposed Bayesian hierarchical model first links the network structures within each platform using a Markov random field prior to relate edge selection across sample groups, and then links the network similarity parameters across platforms. This enables joint estimation in a flexible manner, as we make no assumptions on the directionality of influence across the data types or the extent of network similarity across the sample groups and platforms. In addition, our model formulation allows the number of variables and number of subjects to differ across the data types, and only requires that we have data for the same set of groups. We illustrate the proposed approach through both simulation studies and an application to gene expression levels and metabolite abundances on subjects with varying severity levels of chronic obstructive pulmonary disease.

Keywords: Bayesian inference, Chronic obstructive pulmonary disease (COPD), Data integration, Gaussian graphical model, Markov random field prior, Spike and slab prior

1. Introduction

Gaussian graphical models, which describe the dependence relations among a set of random variables, have been widely applied to estimate biological networks on the basis of high-throughput data. When all samples are collected under similar conditions or reflect a single type of disease, methods such as the graphical lasso (Meinshausen and Bühlmann, 2006; Yuan and Lin, 2007; Friedman and others, 2008) or Bayesian network inference approaches (Roverato, 2002; Wang and Li, 2012) can be applied to infer a sparse network. In many studies, however, such as the COPDGene study (Regan and others, 2010) of this article, described below, samples are obtained for different subtypes of disease, varying experimental settings, or other heterogeneous conditions. In this setting, applying standard graphical model inference approaches to the pooled data across conditions will lead to spurious findings, while separate estimation for each subgroup reduces statistical power. The challenge becomes even more formidable when multiple data types (or platforms) are under consideration, specifically gene expression levels and metabolite abundances in the COPDGene study, measured on multiple subjects. In this case, pooling the data is not appropriate, as it ignores the fact that direct connections between variables of different data types may not be sensible. Nonetheless, analyzing data from each platform separately ignores potential commonalities, for example, that subjects with more advanced disease may have more extensive disruption of functional mechanisms across data types. The need for statistical methods to address these questions is particularly pressing given the increasing number of studies investing in comprehensive profiling of subjects across multiple data platforms.
Our proposed statistical method enables joint inference of networks across sample groups and data types, providing accurate characterization of complex disease mechanisms which can be used to develop targeted treatment approaches.

Recently, methods have been proposed to estimate multiple networks on a common set of variables. Early work includes approaches that encourage either common edge selection or precision matrix similarity by penalizing cross-group differences (Guo and others, 2011; Danaher and others, 2014; Zhu and others, 2014; Cai and others, 2016). These methods use a single penalty parameter to control network similarity across all groups. Hao and others (2018) have extended the approach to simultaneously infer graph clustering via an additional penalty on the estimated cluster mean. In contrast, more recent proposals encourage network similarity in a more tailored manner, assuming that the networks for each sample group are related within a tree structure (Oates and Mukherjee, 2014), or, more generally, within an undirected weighted graph (Saegusa and Shojaie, 2016; Ma and Michailidis, 2016). These methods require that the cross-group relations are known a priori or inferred in a preliminary step. More flexible approaches that employ Bayesian frameworks to simultaneously learn the networks for each group and their similarity have been proposed in Peterson and others (2015) and Shaddox and others (2018).

In this work, we develop a graphical modeling framework which enables the joint inference of network structures when there is heterogeneity among both sets of subjects (i.e., at different stages of a disease) and sets of variables (i.e., types of data or platforms). Our proposed Bayesian hierarchical model first links the network structures within each platform using a Markov random field (MRF) prior to relate edge selection across the sample subgroups, and then links the measures of cross-group similarity across platforms. This is a flexible modeling approach, which allows the number of variables and number of subjects to differ across the data types, and only requires that we have data for the same set of subgroups. Consider, for example, the gene expression and metabolite abundances measured on healthy controls and on moderate and severe COPD subjects of our case study. These two platforms measure different aspects of the same biological pathway. Small compounds and metabolites are measured by the LC/MS platform, while gene expression levels of enzymes and proteins are measured by the microarray platform. Also, alterations in the pathway affect different components (metabolites or enzymes) of the pathway. In this type of scenario, we can expect data between the two platforms to be related. Our modeling framework is concerned with two types of network similarity: within and between platforms. Within each platform, we assess how similar subgroups are in terms of their graph structure. This results in a super-graph for each platform expressing whether two subgroups are similar, i.e., connected, within each platform. We then assess whether or not these super-graphs are similar between platforms. This approach enables the joint estimation of the biological networks in a flexible manner, as we make no assumptions on the directionality of influence across the data types, nor on the extent of network similarity across the sample groups and platforms.
In this regard, our approach differs from many of the existing methods for integrative analysis, which typically model the association between different types of observed random variables assuming a direction of influence among the data types; see, for example, Wang and others (2013) and Cassese and others (2014) for the use of multi-component hierarchical models, Chen and others (2015) for mixed graphical models, and Lin and others (2016) for a multi-layered Gaussian graphical model where directed edges are allowed across layers of each data type. Instead, we infer measures of relative similarity based on the data, which provide valuable insight into the extent of network relatedness across sample groups and data types.

The article is organized as follows. We present the motivating chronic obstructive pulmonary disease (COPD) case study in Section 2. In Section 3, we describe the proposed Bayesian model and procedures for posterior inference. We return to the COPD data set in Section 4, where we apply our proposed method to infer metabolic and gene co-expression networks for varying disease stages. Section 5 provides simulation studies illustrating the performance of the proposed method against competing approaches. Finally, we conclude with a discussion in Section 6.

2. The COPDGene study

Our work has been motivated, in particular, by a collaborative study aimed at understanding how cellular metabolic and gene expression networks are disrupted by COPD, the 3rd leading cause of death in the United States (National Center for Health Statistics, 2016), and one of the top causes of hospitalization. While smoking is the primary risk factor for COPD, only 20% of smokers will ever develop the disease. There is a poor understanding of risk factors accounting for disease susceptibility, as well as the underlying pathogenic mechanisms resulting in airway inflammation and emphysema. Understanding the genetic, clinical, and molecular factors that determine why some smokers develop COPD is the primary motivation of the NIH funded multicenter observational study, COPDGene, which has over 10 000 participants and includes extensive molecular profiling using transcriptomics, metabolomics, and proteomics. For this study, subjects 45–80 years old with at least a 10 pack-year history of smoking were recruited and biomarker measurements were attained from blood (Regan and others, 2010). There is a high degree of heterogeneity in the patient population, which includes subjects from various clinical stages, defined using the global initiative for chronic obstructive lung disease (GOLD) staging criteria. We apportioned subjects according to GOLD COPD stages and model resulting networks for a control group (GOLD stage = 0), a mild or moderate group (GOLD stage = 1 or 2), and a severe group (GOLD stage = 3 or 4). Here, we focus in particular on a subset of subjects for whom gene expression levels or metabolite abundances are available. For the gene platform, this apportionment resulted in a control group (GOLD stage = 0) of 42 subjects, a mild or moderate group (GOLD stage = 1 or 2) of 42 subjects, and a severe group (GOLD stage = 3 or 4) of 42 subjects. 
For the metabolite platform, the control group again had 42 subjects, whereas the moderate and severe group had 45 and 44 subjects, respectively. Ten subjects had GOLD stage Inline graphic, indicating that although they had abnormal lung function, they did not satisfy the clinical criteria for COPD. Therefore, these subjects were excluded from the analysis. This data set illustrates the need for our proposed method, which can be used to analyze the multi-platform data across the heterogenous patient groups in a coherent and integrative fashion. In summary, this article is concerned with the analysis of data measured on two platforms, genes and metabolites, for three subgroups of subjects classified by COPD GOLD stage.

3. Proposed method

In this section, we provide details on the proposed method, including the likelihood, prior formulation, and procedures for posterior inference. Graphical representations are provided in Figure 1.

Fig. 1.

Left: Graphical model representation of the proposed model, illustrating variables, parameters, and hyperparameters for each of the $K$ groups and $S$ platforms. Right: A graphical illustration with $K = 3$ subgroups and $S = 2$ platforms.

3.1. Likelihood

Suppose we observe data on $S$ data types and $K$ subgroups. In our COPDGene case study, we have $S = 2$ and $K = 3$. For each subgroup and each platform, let $X_{ks}$ be the $n_{ks} \times p_s$ data matrix, with $k = 1, \ldots, K$ indexing the subgroup, $s = 1, \ldots, S$ indexing the platform type, $n_{ks}$ the sample size for subgroup $k$ from platform $s$, and $p_s$ the total number of observed variables for platform $s$. Assuming that the samples are independent and identically distributed within each of the $K$ subgroups and $S$ platforms, we can write the likelihood for subject $i$ in subgroup $k$ and platform $s$ as the multivariate normal distribution

$$x_{ks,i} \sim \mathcal{N}(\mu_{ks}, \Omega_{ks}^{-1}), \quad i = 1, \ldots, n_{ks}, \tag{3.1}$$

where the mean vector $\mu_{ks}$ and precision matrix $\Omega_{ks}$ are specific to subgroup $k$ and platform $s$. For simplicity, we column center the data for each subgroup, and therefore assume $\mu_{ks} = 0$. We note that $\Omega_{ks}$ is constrained to the space $M^+$ of $p_s \times p_s$ positive-definite symmetric matrices. We denote the entry in the $i$th row and $j$th column of $\Omega_{ks}$ as $\omega_{ks,ij}$.
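To make the likelihood concrete, the following minimal sketch evaluates the zero-mean multivariate normal log-likelihood of a column-centered data matrix for a given precision matrix. The function names `cholesky` and `gaussian_loglik` are hypothetical and chosen only for illustration; this is not the authors' implementation, and it uses plain Python lists rather than a numerical library.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(d)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def gaussian_loglik(x, omega):
    """Log-likelihood of the rows of x (n-by-p, column-centered) under N(0, omega^{-1})."""
    n, p = len(x), len(omega)
    chol = cholesky(omega)  # also checks positive definiteness
    logdet = 2.0 * sum(math.log(chol[i][i]) for i in range(p))
    quad = sum(
        row[i] * omega[i][j] * row[j]
        for row in x
        for i in range(p)
        for j in range(p)
    )
    return 0.5 * n * (logdet - p * math.log(2.0 * math.pi)) - 0.5 * quad


# Toy example with hypothetical values: two centered observations, two variables.
toy_x = [[0.5, -0.5], [-0.5, 0.5]]
toy_omega = [[2.0, 0.5], [0.5, 1.0]]
toy_value = gaussian_loglik(toy_x, toy_omega)
```

The Cholesky step serves two purposes: it gives the log-determinant of the precision matrix and rejects matrices outside the positive-definite cone.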

3.2. Prior formulation

The patterns of zeros in the precision matrices $\Omega_{ks}$ correspond to undirected graphs among the variables. Specifically, $\omega_{ks,ij} = 0$ if and only if the corresponding edge $(i, j)$ is missing in the conditional dependence graph for subgroup $k$ from platform $s$. The goal of our modeling formulation is to infer a sparse version of the precision matrices $\Omega_{ks}$ in a manner that links inference across platforms.

The graph for each platform $s$ and subgroup $k$ can be defined by a set of vertices $V_s$ and edges $E_{ks}$, and may be expressed as a symmetric binary matrix $G_{ks}$, where each off-diagonal element $g_{ks,ij}$ denotes the inclusion of edge $(i, j)$. Our proposed model first links the edge inclusion indicators across sample subgroups within each platform, and then links platforms based on the dependences across subgroups within each platform. We now describe in detail the components of our prior.
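The correspondence between zeros of a precision matrix and missing edges can be illustrated with a short sketch. The helper name `graph_from_precision` and the tolerance `tol` are assumptions made for this example; the article itself selects edges through posterior inclusion probabilities rather than by thresholding a single matrix.

```python
def graph_from_precision(omega, tol=1e-8):
    """Symmetric binary adjacency matrix: edge (i, j) is present when |omega_ij| > tol."""
    p = len(omega)
    return [
        [1 if i != j and abs(omega[i][j]) > tol else 0 for j in range(p)]
        for i in range(p)
    ]


# Hypothetical tridiagonal precision matrix: variables 0 and 2 are
# conditionally independent given variable 1, so edge (0, 2) is absent.
toy_precision = [
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 2.0],
]
toy_adjacency = graph_from_precision(toy_precision)
```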

3.2.1. Mixture prior on precision matrix elements.

We rely on the mixture prior proposed in Wang (2015) to infer a sparse version of $\Omega_{ks}$. This prior is attractive as it allows direct modeling of the latent graph $G_{ks}$ and is computationally scalable. Mathematically, it can be expressed as a product of $p_s(p_s - 1)/2$ normal mixture densities on the off-diagonal elements, and $p_s$ exponential densities on the diagonal elements, normalized to have total probability of one. This is equivalent to a hierarchical model

$$p(\Omega_{ks} \mid G_{ks}, v_0, v_1, \lambda) \propto \prod_{i<j} \mathcal{N}(\omega_{ks,ij} \mid 0, v_{g_{ks,ij}}^2) \prod_{i} \mathrm{Exp}(\omega_{ks,ii} \mid \lambda/2)\, 1\{\Omega_{ks} \in M^+\}, \tag{3.2}$$

where $v_{g_{ks,ij}} = v_1$ if edge $(i, j)$ is present in graph $G_{ks}$ and $v_{g_{ks,ij}} = v_0$ otherwise, with $v_0$ ($> 0$) being set to a small number. The two-component normal mixture model has been shown to be a successful prior in the context of variable selection, which in our case is equivalent to edge selection, and the choice of hyperparameters $v_0$ and $v_1$ has been closely studied by George and McCulloch (1993). If, for example, $v_0$ is chosen to be small, the event $g_{ks,ij} = 0$ indicates that $\omega_{ks,ij}$ comes from the $\mathcal{N}(0, v_0^2)$, or concentrated, component of the mixture, and consequently $\omega_{ks,ij}$ is close to zero and can be estimated as zero. In contrast, if $v_1$ is chosen to be large, the event $g_{ks,ij} = 1$ means $\omega_{ks,ij}$ comes from the other component $\mathcal{N}(0, v_1^2)$, and $\omega_{ks,ij}$ can then be thought of as substantially different from zero.
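The behavior of the two-component normal mixture can be seen in a simplified, elementwise calculation. The sketch below computes the probability that a single off-diagonal value belongs to the wide component, under two assumptions that depart from the full model: the positive-definiteness constraint is ignored, and a fixed prior inclusion probability `prior_incl` stands in for the MRF prior. The standard deviations 0.05 and 1.0 are hypothetical values, not the settings used in the article.

```python
import math


def normal_pdf(x, sd):
    """Density of N(0, sd^2) evaluated at x."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def wide_component_probability(value, sd_narrow, sd_wide, prior_incl):
    """P(edge indicator = 1 | value) for one element of a two-component normal mixture."""
    wide = prior_incl * normal_pdf(value, sd_wide)
    narrow = (1.0 - prior_incl) * normal_pdf(value, sd_narrow)
    return wide / (wide + narrow)


# A value near zero is attributed to the narrow component (no edge),
# while a value far from zero is attributed to the wide component (edge).
prob_near_zero = wide_component_probability(0.0, 0.05, 1.0, 0.5)
prob_far_from_zero = wide_component_probability(0.5, 0.05, 1.0, 0.5)
```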

3.2.2. MRF priors for linked network inference.

MRF priors (Besag, 1974) have been successfully employed to capture network structure among the variables in Bayesian variable selection modeling frameworks (Li and Zhang, 2010; Stingo and others, 2011) and more recently to link the selection of edges across multiple networks (Peterson and others, 2015; Shaddox and others, 2018). Here, we build upon this line of work and utilize MRF priors both to link edge selection across networks within a platform, and to link the network similarity parameters across platforms.

Prior linking networks within each platform: Let $g_{ij,s} = (g_{1s,ij}, \ldots, g_{Ks,ij})^T$ represent the vector of binary inclusion indicators of edge $(i, j)$ across the $K$ graphs for platform $s$. We define an MRF prior on this vector of binary inclusion indicators, linking edge selection across networks within a platform as

$$p(g_{ij,s} \mid \nu_{ij,s}, \Theta_s) \propto \exp\left(\nu_{ij,s} 1^T g_{ij,s} + g_{ij,s}^T \Theta_s\, g_{ij,s}\right), \tag{3.3}$$

with $\nu_{ij,s}$ a sparsity parameter and $\Theta_s$ a $K \times K$ symmetric matrix characterizing pairwise relatedness across sample subgroups. The diagonal elements of $\Theta_s$ are constrained to be 0, while the off-diagonal elements $\theta_{km,s}$ drive the within-platform similarity and link the edge selection between sample subgroups $k$ and $m$, such that a larger magnitude represents increased preference for shared structure between those two subgroups. In our experience, these entries can be interpreted on a relative rather than an absolute scale, as magnitude can vary depending on hyperparameter settings, although ordering is generally preserved. Additionally, the vector of binary inclusion indicators allows easy interpretation of the off-diagonal elements of $\Theta_s$ as regression coefficients of a probit model. If we introduce the notation $\nu_s = (\nu_{ij,s})_{i<j}$, then we can write the joint prior across graphs $G_{1s}, \ldots, G_{Ks}$ for platform $s$ as the product of the densities for each edge as

$$p(G_{1s}, \ldots, G_{Ks} \mid \nu_s, \Theta_s) = \prod_{i<j} p(g_{ij,s} \mid \nu_{ij,s}, \Theta_s). \tag{3.4}$$
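For a small number of subgroups, an MRF prior on a vector of binary edge indicators can be normalized by brute-force enumeration. The sketch below assumes the exponential form used in Peterson and others (2015), with unnormalized weight exp(sparsity times the number of included edges plus the quadratic form in the similarity matrix); whether this matches the article's display exactly is an assumption, and the function name `mrf_prior` and the numeric values are illustrative only.

```python
import itertools
import math


def mrf_prior(sparsity, similarity):
    """Normalized MRF probabilities over all 2^K binary indicator vectors.

    sparsity   -- scalar; more negative values favor edge exclusion
    similarity -- K-by-K symmetric matrix with zero diagonal
    """
    k = len(similarity)
    weights = {}
    for g in itertools.product((0, 1), repeat=k):
        quad = sum(
            similarity[a][b] * g[a] * g[b] for a in range(k) for b in range(k)
        )
        weights[g] = math.exp(sparsity * sum(g) + quad)
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


# Hypothetical setting with K = 3 subgroups in which only subgroups 0 and 1 are related.
toy_similarity = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
toy_probs = mrf_prior(-1.0, toy_similarity)
```

With these toy values, the configuration in which the edge is shared by the two related subgroups receives more prior mass than a configuration with the same number of edges spread over unrelated subgroups, which is the sharing behavior the prior is designed to encourage.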

Imposing sparsity on the matrix $\Theta_s$ results in a “super-graph” describing relatedness of the networks across the sample subgroups, with zero entries indicating that the networks are sufficiently different that edge selection should not be shared. This is achieved by assuming a spike-and-slab prior on the off-diagonal entries of $\Theta_s$, with a Gamma as the slab, since only positive values for $\theta_{km,s}$ are sensible,

$$p(\theta_{km,s} \mid \gamma_{km,s}) = (1 - \gamma_{km,s})\, \delta_0 + \gamma_{km,s}\, \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, \theta_{km,s}^{\alpha - 1} e^{-\beta \theta_{km,s}}, \tag{3.5}$$

where $\Gamma(\cdot)$ represents the gamma function, $\alpha$ and $\beta$ are fixed hyperparameters, and the latent indicator variable $\gamma_{km,s}$ indicates the event that the network for subgroup $k$ is related to subgroup $m$ on platform $s$. The joint prior on the off-diagonal entries is the product of the marginal densities

$$p(\Theta_s \mid \gamma_s) = \prod_{k<m} p(\theta_{km,s} \mid \gamma_{km,s}). \tag{3.6}$$

This prior construction allows sharing of information between subgroups when appropriate, without forcing similarity in cases where the networks are actually different. Additionally, we specify a prior on the sparsity parameter $\nu_{ij,s}$ as

$$p(\nu_{ij,s}) = \frac{1}{B(a, b)}\, \frac{e^{a \nu_{ij,s}}}{(1 + e^{\nu_{ij,s}})^{a + b}}, \tag{3.7}$$

where $B(\cdot)$ denotes the Beta function, and $a$ and $b$ are fixed hyperparameters. Platform-specific hyperparameters may be chosen in cases where sparsity is known to be different from one platform to another.

Prior linking cross-group relations across platforms: To link networks at the platform level, we model the overall relationship between each pair of platforms based on the dependencies across subgroups within each platform. This is a flexible approach which allows the number of variables and number of subjects to differ across the platforms, and only requires that we have data for the same set of subgroups. Specifically, we construct an MRF prior on the vector of binary indicators for network relatedness between subgroups $k$ and $m$ across all platforms, $\gamma_{km} = (\gamma_{km,1}, \ldots, \gamma_{km,S})^T$, as

$$p(\gamma_{km} \mid w_{km}, \Phi) \propto \exp\left(w_{km} 1^T \gamma_{km} + \gamma_{km}^T \Phi\, \gamma_{km}\right), \tag{3.8}$$

with $w_{km}$ capturing the sparsity of the vector $\gamma_{km}$ and $\Phi$ an $S \times S$ symmetric matrix denoting pairwise similarity across platforms, in a similar manner to the matrices $\Theta_s$ described previously. The off-diagonal elements of $\Phi$ drive the between-platform similarity: a non-zero $\phi_{st}$ indicates that platforms $s$ and $t$ have similar super-graphs $\Theta_s$ and $\Theta_t$. As above, we place a spike-and-slab prior on the entries of $\Phi$,

$$p(\phi_{st} \mid \zeta_{st}) = (1 - \zeta_{st})\, \delta_0 + \zeta_{st}\, \frac{\eta^{\kappa}}{\Gamma(\kappa)}\, \phi_{st}^{\kappa - 1} e^{-\eta \phi_{st}}, \tag{3.9}$$

with $\kappa$ and $\eta$ fixed hyperparameters, and $\zeta_{st}$ a latent binary variable which indicates that platforms $s$ and $t$ have related cross-group dependencies. Off-diagonal entries $\phi_{st}$ in the symmetric $S \times S$ matrix signify the magnitude of pairwise relatedness across platforms, modeling the relations across different platforms as learned from the data in an innovative and versatile manner. We then place independent Bernoulli$(w)$ priors on the latent indicators $\zeta_{st}$, with $w$ a fixed hyperparameter $\in [0, 1]$, and specify a prior on $w_{km}$ similarly to (3.7) with hyperparameters $c$ and $d$ to complete the model.

3.3. Posterior inference

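Let $\Psi$ denote the set of all parameters and $X$ denote the observed data for all sample subgroups and all platforms. We can write the joint posterior as

$$p(\Psi \mid X) \propto \prod_{s=1}^{S} \left[ \prod_{k=1}^{K} p(X_{ks} \mid \Omega_{ks})\, p(\Omega_{ks} \mid G_{ks}) \prod_{i<j} p(g_{ij,s} \mid \nu_{ij,s}, \Theta_s)\, p(\nu_{ij,s}) \prod_{k<m} p(\theta_{km,s} \mid \gamma_{km,s}) \right] \prod_{k<m} p(\gamma_{km} \mid w_{km}, \Phi)\, p(w_{km}) \prod_{s<t} p(\phi_{st} \mid \zeta_{st})\, p(\zeta_{st}). \tag{3.10}$$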

As this distribution is analytically intractable, we construct a Markov chain Monte Carlo (MCMC) sampler to obtain a posterior sample of the parameters of interest.

3.3.1. MCMC sampling scheme.

Our MCMC scheme includes a block Gibbs sampler to sample the precision matrix $\Omega_{ks}$ and graph $G_{ks}$ for each platform $s$ and subgroup $k$. Then, we sample the graph similarity parameters $\Theta_s$ and $\gamma_{km,s}$ for each platform using a Metropolis–Hastings method that is equivalent to a reversible jump and incorporates between-model and within-model moves. Next, we use Metropolis–Hastings steps to sample the edge-specific sparsity parameters $\nu_{ij,s}$ and the cross-subgroup relation sparsity parameters $w_{km}$ from their respective posterior conditional distributions. Lastly, we update the cross-platform parameters $\Phi$ and $\zeta_{st}$ using a Metropolis–Hastings method similar to the one used to update $\Theta_s$ and $\gamma_{km,s}$. A detailed description of the MCMC algorithm is provided in the supplementary material available at Biostatistics online.

3.3.2. Model selection.

There are various approaches for making inference on the graph structures based on the MCMC output. One approach is to use the maximum a posteriori (MAP) estimate, which represents the mode of the posterior distribution for each graph. However, this approach is generally not preferred in the context of large networks since the space of possible graphs is large, and we may only visit a particular graph a few times during the MCMC. We therefore rely on a more practical approach for model selection, and estimate the marginal posterior probability (MPP) of inclusion for each edge $(i, j)$, which we calculate as the proportion of MCMC iterations, after burn-in, where edge $(i, j)$ was included in graph $G_{ks}$. Final inference is performed by selecting edges according to the median model (i.e., with MPP $> 0.5$) for inclusion in our posterior selected graphs (Barbieri and Berger, 2004).
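The MPP calculation and the median-model rule can be sketched as follows, assuming the sampled graphs for one subgroup and platform are stored as a list of symmetric binary adjacency matrices. The function names and the toy chain are hypothetical.

```python
def marginal_inclusion(samples, burn_in):
    """Proportion of post-burn-in samples in which each edge is included."""
    kept = samples[burn_in:]
    if not kept:
        raise ValueError("no samples left after burn-in")
    p = len(kept[0])
    return [
        [sum(g[i][j] for g in kept) / len(kept) for j in range(p)]
        for i in range(p)
    ]


def median_model(mpp, threshold=0.5):
    """Select edges whose marginal posterior probability exceeds the threshold."""
    p = len(mpp)
    return [
        [1 if i != j and mpp[i][j] > threshold else 0 for j in range(p)]
        for i in range(p)
    ]


# Toy chain on two variables: the first sample is discarded as burn-in,
# and the edge appears in two of the three retained samples.
toy_samples = [
    [[0, 0], [0, 0]],
    [[0, 1], [1, 0]],
    [[0, 1], [1, 0]],
    [[0, 0], [0, 0]],
]
toy_mpp = marginal_inclusion(toy_samples, burn_in=1)
toy_graph = median_model(toy_mpp)
```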

4. Case study on COPD severity

We are interested in studying the reshaping of gene and metabolite networks as disease stage worsens. Our ultimate goal is to be able to map the underlying molecular causes of disease progression and to determine whether biological platforms describe the same mechanisms.

Gene expression levels were measured from peripheral blood mononuclear cells using the Affymetrix Human Genome U133 Plus 2.0 Array (Bahr and others, 2013), and plasma metabolite abundances were generated from liquid chromatography/mass spectrometry (Bowler and others, 2014). Candidate pathways were selected as follows. Differentially expressed genes and differentially abundant metabolites were identified for airflow obstruction (forced expiratory volume in 1 s percent predicted), correcting for age, sex, body mass index, and current smoking status. KEGG Pathways (Kanehisa and others, 2014) that showed enrichment of the significant genes and metabolites were then prioritized. Top candidate pathways may play a role in the response to cigarette smoke exposure and are interesting candidates for more detailed exploration in emphysema.

Below, we report results on one of the top candidate pathways we analyzed, Regulation of autophagy (RegAuto). Results on a second candidate pathway, FcγR-mediated phagocytosis (FcγR), can be found in the supplementary material available at Biostatistics online. Expression levels were measured for 28 (RegAuto) and 104 (FcγR) probesets. These were collapsed to 20 (RegAuto) and 58 (FcγR) unique genes by selecting, for each gene, the probeset with the strongest association to emphysema. Metabolite data were matched to lipid and aqueous annotation files in order to extract KEGG IDs for each sample. After subsetting to the RegAuto and FcγR pathways, we were left with 117 (RegAuto) and 60 (FcγR) measurements, but numerous instances of duplicate KEGG IDs. To reduce redundancy and exclude highly correlated covariates, we carried out an iterative principal component analysis procedure to select a subset of less correlated variables for analysis. This procedure is outlined in the supplementary material available at Biostatistics online, and example code is provided online. After this procedure, 21 (RegAuto) and 23 (FcγR) metabolites were left for analysis.

4.1. Hyperparameter settings

The application of our model requires the specification of several hyperparameters. Here we describe the specification we used to obtain the results reported below and refer to the simulation study for more insights and sensitivity analyses. For prior (3.2) on the precision matrix elements, hyperparameters were specified as Inline graphic and Inline graphic according to published guidelines given in Wang (2015). As for the prior (3.5) on the off-diagonal entries of Inline graphic linking sample subgroups within a platform, we specified the slab portion of the mixture prior as a GammaInline graphic with Inline graphic and Inline graphic for both platforms. This resulted in a prior with mean approximately equal to 0.1 and Inline graphic, which avoids assigning high values to the off diagonal entries of Inline graphic. For the prior (3.7) on the sparsity parameter Inline graphic of the MRF prior linking networks within each platform, we specified Inline graphic and Inline graphic resulting in a prior probability of edge inclusion around 0.125. The similarly specified prior on sparsity parameter Inline graphic in the MRF prior (3.8) linking cross-subgroup relations across platforms was specified as Inline graphic and Inline graphic, for all subgroup pairs Inline graphic, resulting in approximately Inline graphic prior probability of subgroup relatedness. The mixture prior (3.9) on the off-diagonal entries of Inline graphic was specified as GammaInline graphic with Inline graphic and Inline graphic, resulting in a prior mean of Inline graphic and Inline graphic, avoiding assigning high values to the off diagonal entries of Inline graphic. Lastly, the hyperparameter Inline graphic in the Bernoulli prior on the indicators of platform similarity Inline graphic was specified as Inline graphic. 
Sensitivity analyses reported in the supplementary material available at Biostatistics online show that hyperparameter settings have minimal impact on graph learning performance, as the inferred network remains fairly stable. With certain settings, large changes may occur in the magnitude of relative similarity measures Inline graphic and Inline graphic; however, ordering is generally preserved. The results we report here and in the supplementary material available at Biostatistics online were obtained by running two MCMC samplers, with different starting points, for 10 000 burn-in iterations followed by 30 000 iterations used for inference. To verify convergence of the chains, we compared the MPPs resulting from the two chains. Their Pearson correlations were in the range Inline graphic. Final results were obtained by pooling together the output of the two chains.
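The convergence check described above compares edge-level MPPs from two independent chains. A minimal sketch, assuming each chain's MPPs are stored as a symmetric matrix, is given below; the helper names and the toy matrices are hypothetical and do not reproduce values from the case study.

```python
import math


def upper_triangle(matrix):
    """Flatten the strictly upper-triangular entries of a square matrix."""
    p = len(matrix)
    return [matrix[i][j] for i in range(p) for j in range(i + 1, p)]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical MPP matrices from two chains on three variables.
mpp_chain_1 = [[0.0, 0.9, 0.1], [0.9, 0.0, 0.6], [0.1, 0.6, 0.0]]
mpp_chain_2 = [[0.0, 0.8, 0.2], [0.8, 0.0, 0.7], [0.2, 0.7, 0.0]]
agreement = pearson(upper_triangle(mpp_chain_1), upper_triangle(mpp_chain_2))
```

A correlation close to one indicates that the two chains agree on which edges are well supported, although it does not by itself guarantee convergence of every parameter.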

4.2. Results

Estimated graphs for control, moderate, and severe subgroups, for the RegAuto pathway, obtained by selecting edges with MPPs greater than 0.5, are shown in Figure 2, and those for the FcγR pathway are reported in the supplementary material available at Biostatistics online. In these plots, obtained using the software Cytoscape (Shannon and others, 2003), the size of a node is drawn proportionally to the number of edges connecting that node to others in the same graph (i.e., the “degree” of the node). For the RegAuto pathway, relative network similarities across subgroups were estimated as

Fig. 2.

RegAuto pathway, gene (top), and metabolite (bottom) platforms: estimated graphs for control (left), moderate (middle), and severe (right) subgroups, obtained by selecting edges with MPPs greater than 0.5. The size of the nodes is proportional to their degree.

graphic file with name M194.gif

with relative similarity across platforms estimated as Inline graphic. These values indicate a preference for shared structure across platforms and sample subgroups. Histograms of posterior distributions of non-zero values of Inline graphic and Inline graphic are shown in the supplementary material available at Biostatistics online.

Table 1 indicates the total number of inferred pair interactions across the two pathways, together with the counts of pairs that exhibit evidence of disrupted interactions due to disease severity. In the table, for each pathway, the three disease subgroups ordered from least to most severe are coded with 0’s and 1’s, with 1 indicating a high MPP ($> 0.50$) of edge inclusion in the subgroup network. For instance, 110 would indicate that the edge is present in the control and moderate subgroups, yet not in the severe subgroup. That is, in the severe subgroup the MPP of edge inclusion falls below the threshold of 0.50. Group codings of 100 and 110 indicate greater interaction in the control and group codings 011 and 001 indicate greater interaction in disease. For the gene platform, counts for known protein–protein interactions are included in parentheses. The Biological General Repository for Interaction Datasets (BioGRID) v. 3.4.156 (Chatr-Aryamontri and others, 2017) was used to obtain protein–protein interactions, and disease annotation information was gathered from Stelzer and others (2011). We observe 50–60% disruption in total pairs of genes and metabolites, and different patterns of disruption for metabolite and gene interactions. For both metabolites and genes, there are a large number of connections in control subjects that are then disrupted in moderate/severe subjects. For metabolites, however, there is also a relatively large number of connections in severe subjects that are not present in the moderate/control subjects, suggesting that parts of the metabolite pathway are activated as disease severity increases. These results also illustrate that while our method takes advantage of commonalities between the platforms, it can also highlight platform-specific differences.
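The three-digit coding of an edge across the control, moderate, and severe subgroups can be expressed in a few lines. The sketch below uses the 0.50 threshold stated above; the function names are hypothetical, and the "other" label for codes that the text does not discuss (such as 000 or 111) is an assumption made for completeness.

```python
def code_edge(mpps, threshold=0.5):
    """Code an edge as a string of 0/1 flags, one per subgroup (least to most severe)."""
    return "".join("1" if m > threshold else "0" for m in mpps)


def classify(code):
    """Map a three-digit code to the disruption pattern described in the text."""
    if code in ("100", "110"):
        return "greater interaction in control"
    if code in ("011", "001"):
        return "greater interaction in disease"
    return "other"


# Hypothetical MPPs for one edge in the control, moderate, and severe networks.
toy_code = code_edge((0.9, 0.7, 0.2))
toy_label = classify(toy_code)
```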

Table 1.

Case study on COPD: numbers of total pairs of unique gene interactions and numbers of disease disrupted pairs based on disease severity. Numbers in parentheses reflect the number of pairs with known protein protein interactions

Pathway    Platform      Total pairs   100       110      011      001      Total disrupted
FcγR       Metabolites   73            17        5        3        18       43
FcγR       Genes         656 (49)      151 (8)   63 (7)   74 (3)   63 (4)   351 (22)
Reg Auto   Metabolites   66            14        4        5        17       40
Reg Auto   Genes         101 (6)       23 (2)    13       7        8        51 (2)
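
As an illustration of the coding scheme used in Table 1, the following Python sketch maps the three subgroup MPPs of each edge to a code such as 110 and tallies the disrupted pairs. The edge labels and MPP values are hypothetical, the strict inequality at the 0.50 threshold is an assumption, and counting a pair toward the total when it is present in at least one subgroup is an assumed reading of the table rather than a stated definition.

```python
from collections import Counter

THRESHOLD = 0.5  # MPP cutoff stated in the text
DISRUPTED_CODES = ("100", "110", "011", "001")  # codes tallied in Table 1


def edge_code(mpps, threshold=THRESHOLD):
    """Code one edge from its MPPs ordered control, moderate, severe.

    Assumption: an edge is 'present' when its MPP is strictly above the cutoff.
    """
    return "".join("1" if p > threshold else "0" for p in mpps)


def summarize(edge_mpps):
    """Count edges per code, ignoring pairs absent from every subgroup."""
    codes = Counter(edge_code(m) for m in edge_mpps.values())
    codes.pop("000", None)  # never selected, so not an inferred pair
    disrupted = sum(codes[c] for c in DISRUPTED_CODES)
    return {
        "total_pairs": sum(codes.values()),
        "disrupted": disrupted,
        "by_code": dict(codes),
    }


# Hypothetical MPPs for four candidate edges (not taken from the case study).
example = {
    ("A", "B"): (0.90, 0.80, 0.20),  # 110: lost in the severe subgroup
    ("A", "C"): (0.70, 0.90, 0.95),  # 111: present throughout
    ("B", "C"): (0.10, 0.30, 0.60),  # 001: appears only in severe
    ("B", "D"): (0.20, 0.10, 0.30),  # 000: not an inferred pair
}
print(summarize(example))
```

Under these assumptions, codes 111, 101 and 010 contribute to the total but not to the disrupted count, which matches the way the "Total disrupted" column equals the sum of the four coded columns.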

To gain further intuition on the properties of the estimated graphs, we calculated a number of graph metrics across all subgroups and pathways. Results on number of edges, global clustering coefficient, averaged betweenness centrality, and count of hub nodes are reported in Table 2. The global clustering coefficient of a graph is based on node triplets, i.e., three connected nodes, and is defined as the number of closed triplets divided by the total number of connected triplets. It measures the degree to which nodes in a graph tend to cluster together, with values closer to one if the graph is more modular, i.e., if it can be divided into clusters of highly connected nodes. Betweenness centrality quantifies the number of times a node acts as a bridge along the shortest path between two other nodes, and thus measures how important the node is as a connector between other nodes in the graph.

Table 2.

Case study on COPD: graph measures results, including number of edges, global clustering, betweenness centrality, and count of hub nodes, for each subgroup. Hub nodes are defined as nodes with a degree ≥ 4, or at least four connections.

FcγR pathway
                          Metabolites                     Genes
Measure                   Group 1   Group 2   Group 3     Group 1   Group 2   Group 3
Number of edges           59        57        58          405       444       332
Global clustering         0.1665    0.2430    0.1683      0.4268    0.4495    0.4442
Betweenness centrality    0.2122    0.2783    0.3348      0.0771    0.0483    0.0995
Count of hub nodes        12        5         10          50        53        46

Reg Auto pathway
                          Metabolites                     Genes
Measure                   Group 1   Group 2   Group 3     Group 1   Group 2   Group 3
Number of edges           49        51        54          71        76        49
Global clustering         0.0881    0.2143    0.1003      0.4649    0.5175    0.4123
Betweenness centrality    0.1524    0.21117   0.1862      0.1800    0.1205    0.1435
Count of hub nodes        9         8         6           14        15        8

Specific hub nodes and extended degree results can be found in the supplementary material available at Biostatistics online.
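
The graph measures reported in Table 2 can be computed directly from an edge list. The sketch below is a minimal pure-Python illustration on a hypothetical toy graph and is not the code used for the case study. The function names are illustrative, and the normalization of betweenness by (n-1)(n-2)/2 followed by averaging over nodes is an assumption, since the text does not state how the averaged betweenness was scaled.

```python
from collections import deque
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def global_clustering(adj):
    """Closed triplets divided by all connected triplets."""
    closed = total = 0
    for nbrs in adj.values():
        for u, w in combinations(nbrs, 2):  # triplet centered on this node
            total += 1
            closed += w in adj[u]
    return closed / total if total else 0.0


def betweenness(adj):
    """Unnormalized shortest-path betweenness (Brandes' algorithm, unweighted)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:  # breadth-first search counting shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate dependencies back to s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}  # each pair was counted twice


def mean_normalized_betweenness(adj):
    """Assumed scaling: divide by (n-1)(n-2)/2, then average over nodes."""
    n = len(adj)
    if n < 3:
        return 0.0
    norm = (n - 1) * (n - 2) / 2
    return sum(betweenness(adj).values()) / norm / n


def hub_count(adj, min_degree=4):
    """Nodes with at least four connections, as in the Table 2 caption."""
    return sum(len(nbrs) >= min_degree for nbrs in adj.values())


# Hypothetical toy graph: a triangle a-b-c with three extra leaves on c.
toy = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"),
                       ("c", "d"), ("c", "e"), ("c", "f")])
print(global_clustering(toy), mean_normalized_betweenness(toy), hub_count(toy))
```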

A close inspection of the estimated networks and our results suggests that, in general, estimated gene networks exhibit a trend of decreased connectivity, or a large drop in connections, as disease severity increases, while metabolite networks do not show such a trend. There may be several reasons why the network patterns differ between genes and metabolites. One possible reason is that the same metabolites are present in other biological pathways that may be compensating for the changes due to disease. Another is that the plasma metabolomics data may reflect activity in multiple organs, while the gene-level data primarily reflect changes in gene expression in the blood. Additionally, results in Table 2 generally indicate higher global clustering coefficients and degree centrality measures for gene platforms than for metabolite platforms. This suggests that, within each pathway, gene networks are generally more clustered into dense, highly connected subnetworks than metabolite networks. Moreover, interpreting degree centrality measures in the context of information flow within networks suggests that disruption of highly connected genes may affect network communication more than disruption of metabolite interactions.

4.3. Hub node analysis

Further analysis of the results was carried out on hub nodes, for both platforms, to validate findings against known protein–protein interactions and to examine disease-related gene annotation. Hub listings were generated for each pathway and each platform to allow analysis of node connectivity and of variations in connectivity as disease severity increased. A summary of results is given in Table 2, where the hub node count for our application setting is the number of nodes per group with a degree ≥ 4, or at least four connections. As an example, for genes in the FcγR pathway, we find fewer connections per node, fewer hub nodes, and fewer connections overall in the Severe subjects than in the Moderate/Control subjects, suggesting an overall disruption of this pathway at the gene expression level. In the supplementary material available at Biostatistics online, we provide biological background on specific genes, metabolites, and connections in the estimated networks for the two pathways.

5. Simulation studies

In this section, we compare our proposed method with three recently proposed graphical model learning methods: Fused Graphical Lasso, Group Graphical Lasso, and Hub Graphical Lasso. The first two methods are designed to learn the network structure of related subgroups (Danaher and others, 2014): the Fused Graphical Lasso encourages both shared structure and shared edge values, while the Group Graphical Lasso encourages shared graph structures but not shared edge values. The Hub Graphical Lasso (Mohan and others, 2014) encourages similarity across networks based on the presence or absence of highly connected hub nodes. None of the competing methods encourages similarity across platforms.
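
For reference, the two joint graphical lasso penalties can be written as follows. The notation is that of Danaher and others (2014), not notation introduced in this article: $\Theta^{(k)}$ is the precision matrix of group $k$ with entries $\theta_{ij}^{(k)}$, $S^{(k)}$ is the sample covariance matrix, $n_k$ is the group sample size, and $\lambda_1, \lambda_2 \geq 0$ are tuning parameters. This is a summary under those assumed symbols rather than a description of the software settings used in the comparison.

```latex
\begin{align*}
\{\hat{\Theta}\} &= \arg\max_{\{\Theta\}} \;
  \sum_{k=1}^{K} n_k \left[ \log\det \Theta^{(k)}
  - \mathrm{tr}\!\left( S^{(k)} \Theta^{(k)} \right) \right]
  - P(\{\Theta\}), \\
P_{\mathrm{fused}}(\{\Theta\}) &=
  \lambda_1 \sum_{k=1}^{K} \sum_{i \neq j} \left| \theta_{ij}^{(k)} \right|
  + \lambda_2 \sum_{k < k'} \sum_{i,j}
    \left| \theta_{ij}^{(k)} - \theta_{ij}^{(k')} \right|, \\
P_{\mathrm{group}}(\{\Theta\}) &=
  \lambda_1 \sum_{k=1}^{K} \sum_{i \neq j} \left| \theta_{ij}^{(k)} \right|
  + \lambda_2 \sum_{i \neq j}
    \Bigl( \sum_{k=1}^{K} \bigl( \theta_{ij}^{(k)} \bigr)^{2} \Bigr)^{1/2}.
\end{align*}
```

The second term of the fused penalty shrinks differences between corresponding entries, which favors shared values as well as shared support, whereas the second term of the group penalty only favors a common pattern of zeros across groups.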

5.1. Comparison study

We investigate whether alternative methods can produce satisfactory results, in terms of network accuracy, in settings that mimic our COPD data (two platforms and three sampling subgroups). We consider two set-ups for generating Inline graphic adjacency and precision matrices, for each sampling group Inline graphic:

Inline graphic Scale free networks: the probability that a given node has Inline graphic edges is proportional to Inline graphic. We kept Inline graphic, the default setting as stated in the igraph package (Csardi and Nepusz, 2006), and simulated networks of the same size as the pathways analyzed in the COPD case study (Inline graphic nodes).

Inline graphic AR(2) networks: the entries of the Inline graphic precision matrix are defined as Inline graphic for Inline graphic, Inline graphic for Inline graphic and Inline graphic for Inline graphic. We simulated networks of larger size than the pathways analyzed in the COPD case study (Inline graphic nodes).
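
The banded structure of an AR(2) precision matrix can be sketched as follows. The entries used here (1 on the diagonal, 0.5 at lag one, 0.25 at lag two) and the dimension are placeholder values chosen for illustration only; they are assumptions and should not be read as the values used in the simulations.

```python
def ar2_precision(p, diag=1.0, lag1=0.5, lag2=0.25):
    """Banded p x p matrix that is nonzero only where |i - j| <= 2.

    diag, lag1 and lag2 are hypothetical placeholder values.
    """
    values = {0: diag, 1: lag1, 2: lag2}
    return [[values.get(abs(i - j), 0.0) for j in range(p)] for i in range(p)]


def edges_from_precision(omega, tol=1e-12):
    """Edges of the implied graph: nonzero off-diagonal entries (i < j)."""
    p = len(omega)
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(omega[i][j]) > tol]


omega = ar2_precision(6)            # hypothetical dimension
edges = edges_from_precision(omega)
print(len(edges), edges[:3])
```

In a graph of this type each node is connected to its neighbors at distance one and two, so a network on p nodes has (p - 1) + (p - 2) edges.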

As our model learns similarity between networks and does not enforce similarity unless supported by the data, current modeling allows for all patterns of similarity. In particular, from the preliminary adjacency matrices above, in our simulations we considered two settings of pairwise similarity across sampling groups for each platform. In setting one, for platform 1, Groups 1 and 2 were set up to be “similar” while Group 3 was set up to be different. We generated “similar” networks across all three groups for the second platform. Here, two groups are defined as “similar” if the precision matrix of one group shares approximately Inline graphic of edges with the precision matrix of the other group. In setting two, both platforms were set up to have different networks across all three subgroups.

For scale free networks, to ensure that each generated precision matrix was positive definite, we used an approach similar to that of Danaher and others (2014), where each off-diagonal element is divided by the sum of the off-diagonal elements in its row, and then the matrix is averaged with its transpose. Consequently, precision matrices generated via the scale free network method have lower signal, in terms of magnitude of the non-zero elements of the precision matrices, than the AR(2) networks; we simulated scale free networks of size Inline graphic and AR(2) networks of size Inline graphic to ensure a minimal signal strength. After all precision matrices were determined, data matrices Inline graphic of size Inline graphic for Inline graphic and Inline graphic, were generated from normal distributions Inline graphic and variables were standardized to have a standard deviation of one. We used the same hyperparameter setting used in the analysis of the COPD data, and ran our MCMC samplers for 10 000 burn-in iterations followed by 30 000 iterations used for inference. Additional sensitivity analyses may be found in the supplementary material available at Biostatistics online. Using a 2-core 1.7 GHz Intel Core i7 processor with 8 GB memory, our code takes approximately 40 min to run 5000 iterations for a two platform scenario with 40 variables per platform. Alternative methods, such as the fused and group graphical lasso, are computationally more efficient, although grid searches and trials to determine optimized penalty parameters can be quite time consuming.
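
The row rescaling and symmetrization step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: absolute values are used in the row sums, the `factor` argument is a hypothetical extra shrinkage knob, and the starting matrix is a made-up three-node example. Because rescaling by itself need not yield a positive definite matrix for every graph, the sketch checks the result explicitly with a Cholesky factorization.

```python
import math


def rescale_and_symmetrize(omega, factor=1.0):
    """Divide each off-diagonal entry by factor * (sum of absolute
    off-diagonal entries in its row), then average with the transpose.

    `factor` is a hypothetical knob; values above 1 shrink more strongly.
    """
    p = len(omega)
    scaled = [row[:] for row in omega]
    for i in range(p):
        row_sum = sum(abs(omega[i][j]) for j in range(p) if j != i)
        if row_sum == 0:
            continue  # isolated node: nothing to rescale
        for j in range(p):
            if j != i:
                scaled[i][j] = omega[i][j] / (factor * row_sum)
    return [[(scaled[i][j] + scaled[j][i]) / 2 for j in range(p)]
            for i in range(p)]


def is_positive_definite(m):
    """True if a symmetric matrix admits a Cholesky factorization."""
    n = len(m)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    return False  # non-positive pivot
                chol[i][i] = math.sqrt(s)
            else:
                chol[i][j] = s / chol[j][j]
    return True


# Hypothetical starting matrix: node 0 linked to nodes 1 and 2.
raw = [[1.0, 0.4, 0.4],
       [0.4, 1.0, 0.0],
       [0.4, 0.0, 1.0]]
for factor in (1.0, 1.5):
    result = rescale_and_symmetrize(raw, factor)
    print(factor, is_positive_definite(result))
```

On this toy matrix the check fails for `factor=1.0` and passes for `factor=1.5`, which is why an explicit test of positive definiteness is a reasonable safeguard before simulating data.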

In Table 3, we report network accuracy metrics averaged over 25 replicates; we considered the true positive rate (TPR), the false positive rate (FPR), the Matthews correlation coefficient (MCC), and area under the curve (AUC). Overall, the proposed method performs comparatively well, and it is the only approach that controls the false positive rate across all scenarios. The differences in performance in favor of the proposed approach are particularly large in setting two. This is not surprising, since the proposed approach is the only joint graph inference approach that learns from the data whether groups are related and, consequently, does not always enforce similarity across groups. Additionally, in the supplementary material available at Biostatistics online, we show a comparison of TPRs attained across methods for fixed FDRs, providing some evidence that our proposed method improves power with respect to methods that employ separate estimations for each subgroup.
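
The edge-level accuracy metrics can be computed by comparing the upper triangles of the true and estimated adjacency matrices. The sketch below uses the standard definitions of TPR, FPR, and MCC on a hypothetical four-node example; the function names are illustrative, and AUC, which requires sweeping the selection threshold or penalty, is not covered.

```python
import math


def confusion_counts(true_adj, est_adj):
    """Count TP, FP, TN, FN over unordered node pairs (i < j)."""
    tp = fp = tn = fn = 0
    p = len(true_adj)
    for i in range(p):
        for j in range(i + 1, p):
            truth, est = bool(true_adj[i][j]), bool(est_adj[i][j])
            if truth and est:
                tp += 1
            elif est:
                fp += 1
            elif truth:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy_metrics(true_adj, est_adj):
    """TPR, FPR and MCC for edge selection (0.0 when a ratio is undefined)."""
    tp, fp, tn, fn = confusion_counts(true_adj, est_adj)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"TPR": tpr, "FPR": fpr, "MCC": mcc}


def adjacency(p, edges):
    """Symmetric 0/1 adjacency matrix from a list of (i, j) pairs."""
    adj = [[0] * p for _ in range(p)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1
    return adj


# Hypothetical example: a chain 0-1-2-3 versus an estimate with one wrong edge.
truth = adjacency(4, [(0, 1), (1, 2), (2, 3)])
estimate = adjacency(4, [(0, 1), (1, 2), (0, 3)])
print(accuracy_metrics(truth, estimate))
```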

Table 3.

Simulation study: in setting one, one group on one of the two platforms is dissimilar from the others. In setting two, both platforms have dissimilar groups. Network accuracy metrics are reported as Mean (Standard Error) over 25 replicates for Inline graphic scenarios and 50 replicates for Inline graphic scenarios.

Methods TPR FPR MCC AUC
Setting one, Inline graphic
Inline graphic 0.743 (0.0031) 0.028 (0.0004) 0.639 (0.0028) 0.936 (0.0020)
Inline graphic 0.785 (0.0031) 0.060 (0.0005) 0.536 (0.0023) 0.912 (0.0022)
Inline graphic 0.123 (0.0049) 0.005 (0.0004) 0.263 (0.0062) 0.899 (0.0022)
Inline graphic 0.611 (0.0063) 0.022 (0.0005) 0.579 (0.0055) 0.895 (0.0036)
Setting two, Inline graphic
Inline graphic 0.907 (0.0016) 0.157 (0.0006) 0.439 (0.0012) 0.963 (0.0003)
Inline graphic 0.930 (0.0015) 0.167 (0.0005) 0.436 (0.0010) 0.954 (0.0004)
Inline graphic 1.000 (0.0000) 0.467 (0.0048) 0.232 (0.0021) 0.945 (0.0004)
Inline graphic 1.000 (0.0001) 0.028 (0.0004) 0.794 (0.0020) 1.000 (0.0001)
Setting one, Inline graphic
Inline graphic 0.657 (0.0037) 0.035 (0.0005) 0.546 (0.0029) 0.919 (0.0023)
Inline graphic 0.777 (0.0031) 0.069 (0.0005) 0.506 (0.0022) 0.916 (0.0023)
Inline graphic 0.263 (0.0064) 0.005 (0.0003) 0.427 (0.0050) 0.905 (0.0023)
Inline graphic 0.636 (0.0063) 0.023 (0.0005) 0.597 (0.0053) 0.941 (0.0037)
Setting two, Inline graphic
Inline graphic 0.735 (0.0017) 0.080 (0.0004) 0.451 (0.0015) 0.957 (0.0009)
Inline graphic 0.998 (0.0002) 0.270 (0.0006) 0.343 (0.0005) 0.938 (0.0229)
Inline graphic 1.000 (0.0000) 0.464 (0.0044) 0.233 (0.0019) 0.945 (0.0004)
Inline graphic 1.000 (0.0001) 0.026 (0.0004) 0.808 (0.0021) 1.000 (0.0001)

6. Conclusion

Motivated by a collaborative study on COPD progression, we have proposed a novel approach for joint multiple platform network analysis (here, genes and metabolites). Our Bayesian approach uses computationally efficient priors on precision matrices and hierarchical MRF priors to link similarities across subgroups and platforms. Although less scalable than alternative methods, a Bayesian framework makes use of all information in the data, sharing it across subgroups when appropriate, and enables joint estimation in a very flexible manner, as we make no assumptions on the directionality of influence across the data types or on the extent of network similarity. In addition, our model formulation allows the numbers of variables and subjects to differ across data types. We have demonstrated improved performance over alternative approaches for multiple networks using simulated data. On the COPDGene data, we have jointly inferred metabolite and gene networks across subgroups of disease stage, identifying notable interactions that illustrate disease progression and suggesting pathway compensation as a consequence of disease. These interactions pinpoint molecular targets for further study and provide potential therapy options.

Supplementary Material

kxy078_Supplementary_Materials

Supplementary Data

Acknowledgments

The authors also acknowledge support from NSF/RTG 1547433 and thank Dessy Akinfenwa and Ami Sheth for help with the simulation study. Conflict of Interest: None declared.

7. Software

MATLAB code is available at https://github.com/elinshaddox/MultiplePlatformBayesianNetworks.

Funding

NSF/DMS 1811568/1811445 (to M.V. and C.P.) NHLBI (U01HL089897, U01HL089856, and P20HL113445) and Butcher Foundation. NLM Training Program (T15 LM007093 to E.S.); NIH/NCI (P30CA016672 to C.B.P.). COPDGene study (NCT00608764) supported by the COPD Foundation through contributions to an Industry Advisory Committee comprised of AstraZeneca, Boehringer-Ingelheim, GlaxoSmithKline, Novartis, Pfizer, Siemens, and Sunovion.

References

  1. Bahr T. M., Hughes G. J., Armstrong M., Reisdorph R., Coldren C. D., Edwards M. G., Schnell C., Kedi R., LaFlamme D. J., Reisdorph N., Kechris K. J. and others. (2013). Peripheral blood mononuclear cell gene expression in chronic obstructive pulmonary disease. American Journal of Respiratory Cell and Molecular Biology 49, 316–323.
  2. Barbieri M. M. and Berger J. O. (2004). Optimal predictive model selection. The Annals of Statistics 32, 870–897.
  3. Besag J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B 36, 192–236.
  4. Bowler R. P., Jacobson S., Cruickshank C., Hughes G. J., Siska C., Ory D. S., Petrache I., Schaffer J. E., Reisdorph N. and Kechris K. (2014). Plasma sphingolipids associated with COPD phenotypes. American Journal of Respiratory and Critical Care Medicine 191, 275–284.
  5. Cai T. T., Li H., Liu W. and Xie J. (2016). Joint estimation of multiple high dimensional precision matrices. Statistica Sinica 26(2), 445–464.
  6. Cassese A., Guindani M., Tadesse M. G., Falciani F. and Vannucci M. (2014). A hierarchical Bayesian model for inference of copy number variants and their association to gene expression. Annals of Applied Statistics 8, 148–175.
  7. Chatr-Aryamontri A., Oughtred R., Boucher L., Rust J., Chang C., Kolas N. K., O’Donnell L., Oster S., Theesfeld C., Sellam A., Stark C., Breitkreutz B., Dolinski K. and others. (2017). The BioGRID interaction database: 2017 update. Nucleic Acids Research 45, D369–D379.
  8. Chen S., Witten D. M. and Shojaie A. (2015). Selection and estimation for mixed graphical models. Biometrika 102, 47–64.
  9. Csardi G. and Nepusz T. (2006). The igraph software package for complex network research. InterJournal Complex Systems, 1695.
  10. Danaher P., Wang P. and Witten D. (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society, Series B 76, 373–397.
  11. Friedman J., Hastie T. and Tibshirani R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441.
  12. George E. and McCulloch R. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association 88, 881–889.
  13. Guo J., Levina E., Michailidis G. and Zhu J. (2011). Joint estimation of multiple graphical models. Biometrika 98, 1–15.
  14. Hao B., Sun W., Liu Y. and Cheng G. (2018). Simultaneous clustering and estimation of heterogeneous graphical models. Journal of Machine Learning Research 217, 1–58.
  15. Kanehisa M., Goto S., Sato Y., Kawashima M., Furumichi M. and Tanabe M. (2014). Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Research 42, 199–205.
  16. Li F. and Zhang N. (2010). Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. Journal of the American Statistical Association 105, 1202–1214.
  17. Lin J., Basu S., Banerjee M. and Michailidis G. (2016). Penalized maximum likelihood estimation of multi-layered Gaussian graphical models. Journal of Machine Learning Research 17, 1–51.
  18. Ma J. and Michailidis G. (2016). Joint structural estimation of multiple graphical models. Journal of Machine Learning Research 17, 1–48.
  19. Meinshausen N. and Bühlmann P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 34, 1436–1462.
  20. Mohan K., London P., Fazel M., Witten D. and Lee S. (2014). Node-based learning of multiple Gaussian graphical models. Journal of Machine Learning Research 15, 445–488.
  21. National Center for Health Statistics. (2016). Health, United States, 2015: With Special Feature on Racial and Ethnic Health Disparities.
  22. Oates C. and Mukherjee S. (2014). Joint structure learning of multiple non-exchangeable networks. Proceedings of the 17th International Conference on Artificial Intelligence and Statistics 33, 687–695.
  23. Peterson C., Stingo F. and Vannucci M. (2015). Bayesian inference of multiple Gaussian graphical models. Journal of the American Statistical Association 110, 159–174.
  24. Regan E. A., Hokanson J. E., Murphy J. R., Make B., Lynch D. A., Beaty T. H., Curran-Everett D., Silverman E. K. and Crapo J. D. (2010). Genetic epidemiology of COPD (COPDGene) study design. COPD 7, 32–43.
  25. Roverato A. (2002). Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scandinavian Journal of Statistics 29, 391–411.
  26. Saegusa T. and Shojaie A. (2016). Joint estimation of precision matrices in heterogeneous populations. Electronic Journal of Statistics 10, 1341–1392.
  27. Shaddox E., Stingo F., Peterson C. B., Jacobson S., Cruickshank-Quinn C., Kechris K., Bowler R. and Vannucci M. (2018). A Bayesian approach for learning gene networks underlying disease severity in COPD. Statistics in Biosciences 10, 59–85.
  28. Shannon P., Markiel A., Ozier O., Baliga N. S., Wang J. T., Ramage D., Amin N., Schwikowski B. and Ideker T. (2003). Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Research 13, 2498–2504.
  29. Stelzer G., Dalah I., Stein T., Satanower Y., Rosen N., Nativ N., Oz-Levi D., Olender T., Belinky F., Bahir I., Krug H., Perco P., Mayer B., Kolker E., Safran M. and others. (2011). In-silico human genomics with GeneCards. Human Genomics 5, 709–717.
  30. Stingo F. C., Chen Y. A., Tadesse M. G. and Vannucci M. (2011). Incorporating biological information into linear models: A Bayesian approach to the selection of pathways and genes. The Annals of Applied Statistics 5, 1978–2002.
  31. Wang H. (2015). Scaling it up: stochastic search structure learning in graphical models. Bayesian Analysis 10, 351–377.
  32. Wang H. and Li S. (2012). Efficient Gaussian graphical model determination under G-Wishart prior distributions. Electronic Journal of Statistics 6, 168–198.
  33. Wang W., Baladandayuthapani V., Morris J. S., Broom B. M., Manyam G. and Do K. A. (2013). iBAG: integrative Bayesian analysis of high-dimensional multiplatform genomics data. Bioinformatics 29, 149–159.
  34. Yuan M. and Lin Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94, 19–35.
  35. Zhu Y., Shen X. and Pan W. (2014). Structural pursuit over multiple undirected graphs. Journal of the American Statistical Association 109, 1683–1696.


Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
