Biostatistics (Oxford, England). 2019 Aug 2;22(2):233–249. doi: 10.1093/biostatistics/kxz027

Fast hybrid Bayesian integrative learning of multiple gene regulatory networks for type 1 diabetes

Bochao Jia 1, Faming Liang 2; The TEDDY Study Group 2
PMCID: PMC8035990  PMID: 33838043

SUMMARY

Motivated by the study of the molecular mechanism underlying type 1 diabetes with gene expression data collected from both patients and healthy controls at multiple time points, we propose a hybrid Bayesian method for jointly estimating multiple dependent Gaussian graphical models with data observed under distinct conditions, which avoids inversion of high-dimensional covariance matrices and thus can be executed very fast. We prove the consistency of the proposed method under mild conditions. The numerical results indicate the superiority of the proposed method over existing ones in both estimation accuracy and computational efficiency. Extension of the proposed method to joint estimation of multiple mixed graphical models is straightforward.

Keywords: Data integration, Meta-analysis, Multiple Gaussian graphical models, ψ-Learning

1. Introduction

Type 1 diabetes (T1D) is one of the most common autoimmune diseases. The Environmental Determinants of Diabetes in the Young (TEDDY) study is designed to identify environmental exposures triggering islet autoimmunity and T1D in genetically high-risk children. A large dataset has been collected through the study, including clinical data, genetic data, and demographic data. While great efforts have been made to identify the genetic and environmental factors that contribute to the etiology of the disease, the molecular mechanism underlying the disease is still far from understood. To enhance our understanding of the molecular mechanism, this work aims to learn gene regulatory networks (GRNs) by integrating the gene expression data measured from both the patients and healthy controls at multiple time points. Figure 1 shows the structure of the data, where the gene expression was measured for each of the case and control children at nine time points within 4 years of age. How to integrate the data collected under the 18 distinct conditions has posed a great challenge to current statistical methods.

Fig. 1.


Structure of the T1D data considered in the article, where the numbers represent nine time points at which gene expression data were collected and the arrows represent joint estimation of Gaussian graphical models by integrating the data across different time points and case–control groups.

During the past decade, a variety of approaches have been proposed for estimating multiple GRNs with data collected under multiple distinct conditions. These approaches can be roughly grouped into two categories, namely, regularization and Bayesian.

The regularization approaches work with specific penalty functions that enhance the shared structure of the graphical models. For example, Guo and others (2011) employed a hierarchical penalty that targets the removal of common zeros in the precision matrices across conditions. Danaher and others (2014) employed fused lasso or group lasso penalties that encourage shared elements of the precision matrices. Chun and others (2015) employed a class of nonconvex penalty functions that regularize the common and condition-specific structures hierarchically. A shortcoming of these approaches is that they assume the observations under different conditions are independent. This assumption is hard to satisfy for temporal data, where the observations are taken from the same cohort at multiple time points. To address this issue, Zhou and others (2010) and Qiu and others (2016) proposed to model the temporal data as a high-dimensional time series and then estimate the time-varying graphical structure using a nonparametric method, assuming that the covariance changes smoothly over time. These approaches usually require the time series to be fairly long, say, 50 time points or longer.

As an analog to regularization approaches, Bayesian approaches enhance the shared structure of multiple graphical models by employing specific priors. For example, Peterson and others (2015) linked the estimation of graph structures via a Markov random field (MRF) prior which encourages common edges. However, since this method involves repeated calculations of concentration matrices (i.e., inverses of covariance matrices), it is only applicable when the graph is not very large. To accelerate computation, Lin and others (2017) proposed a Bayesian analog of the neighborhood selection method (Meinshausen and Bühlmann, 2006) to learn the structure of multiple graphical models with the MRF prior.

In this article, we propose a fast hybrid Bayesian integrative analysis (FHBIA) method for jointly estimating multiple Gaussian graphical models. The proposed method consists of both frequentist and Bayesian components. First, it applies a ψ-learning method, which is a frequentist method, to transform the original data to edge-wise ψ-scores. The ψ-score, which forms an equivalent measure of the partial correlation coefficient, provides a good summary of the graph structure information contained in the data under each condition. Then, it applies a Bayesian method to model the ψ-scores for edge clustering and applies a meta-analysis method for integrating data information across distinct conditions. Finally, it applies a multiple hypothesis test method for edge determination. Due to the use of the ψ-score transformation, FHBIA avoids inversion of high-dimensional covariance matrices and thus can be executed very fast. The multiple hypothesis test produces a q-value (Storey, 2002), which can be viewed as an uncertainty measure, for each potential edge of the multiple Gaussian graphs. We prove consistency of the proposed method under mild conditions and illustrate its performance using simulated and real data examples. The numerical results indicate the superiority of the proposed method over the existing ones in both estimation accuracy and computational efficiency.

2. Fast hybrid Bayesian integrative analysis

The FHBIA method consists of a few steps, including ψ-score calculation, Bayesian clustering and meta-analysis, and joint edge detection (JED), with the diagram shown in Figure 2.

Fig. 2.


Diagram of the FHBIA method: (i) datasets Inline graphic’s for Inline graphic are first transformed to edgewise ψ-scores Inline graphic’s through the step of ψ-score transformation; (ii) ψ-scores are processed through the step of Bayesian clustering and meta-analysis to get Bayesian integrated ψ-scores denoted by Inline graphic’s; (iii) Bayesian integrated ψ-scores are further processed through the step of JED to get graph estimates Inline graphic’s.

2.1. ψ-Score transformation

Suppose that we have a dataset of Inline graphic variables observed under Inline graphic distinct conditions. Let Inline graphic denote the dataset observed under condition Inline graphic, where Inline graphic denotes the sample size under condition Inline graphic; and Inline graphic is a Inline graphic-dimensional random vector distributed according to the multivariate normal distribution Inline graphic, and Inline graphic and Inline graphic are the mean and covariance matrix of the distribution, respectively. The sample size Inline graphic is not necessarily the same for all conditions. Without loss of generality, we assume that Inline graphic is a zero vector for all Inline graphic. With slight abuse of notation, we let Inline graphic denote the Inline graphic variables that are common for all Inline graphic datasets. Let Inline graphic denote the index set of the variables, where each variable is called a node in the terminology of graphs.

We adopt the ψ-learning algorithm to transform each dataset Inline graphic to edge-wise scores independently. The ψ-learning algorithm first produces a ψ-partial correlation coefficient for each pair of nodes. Denote the ψ-partial correlation coefficients by Inline graphic for all Inline graphic, which are equivalent to the true partial correlation coefficients Inline graphic for determining the structure of the Gaussian Graphical Model (GGM) in the sense that Inline graphic, Inline graphic. Further, the ψ-learning algorithm converts the ψ-partial correlation coefficients to ψ-scores, denoted by Inline graphic, via Fisher’s transformation and the probit transformation such that Inline graphic approximately holds under the null hypothesis Inline graphic for any Inline graphic. Therefore, the ψ-score can be used as a test statistic for identifying nonzero partial correlation coefficients and thus the structure of the GGM. The use of ψ-scores enables the proposed method to avoid the inversion of high-dimensional covariance matrices that existing Bayesian methods often need to deal with, and hence the proposed method can be executed very fast. Refer to the supplementary material available at Biostatistics online for the details of the ψ-learning algorithm.
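The conversion from a partial correlation coefficient to an approximately standard normal score can be sketched as follows. This is only an illustration under stated assumptions, not the ψ-learning algorithm itself (its details are in the supplementary material): the function name `fisher_probit_score`, the conditioning-set size argument `cond_size`, and the degrees-of-freedom correction `n - cond_size - 3` are assumptions taken from the standard Fisher z-test for partial correlations.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def fisher_probit_score(r, n, cond_size):
    """Signed, approximately N(0, 1) score for a partial correlation r.

    r         : estimated partial correlation, strictly inside (-1, 1)
    n         : sample size under the condition
    cond_size : size of the conditioning set (hypothetical argument)
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    dof = n - cond_size - 3
    if dof <= 0:
        raise ValueError("sample size too small for this conditioning set")
    # Fisher's transformation, scaled to unit variance under the null r = 0.
    z = math.sqrt(dof) * math.atanh(r)
    # Two-sided p-value of the scaled statistic.
    p_value = 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))
    if p_value <= 0.0:
        # The p-value underflowed; the probit step cannot be inverted, keep z.
        return z
    # Probit step: map the p-value back to a normal quantile, restoring the sign.
    return math.copysign(_STD_NORMAL.inv_cdf(1.0 - p_value / 2.0), r)
```

For example, `fisher_probit_score(0.3, n=100, cond_size=2)` is positive and `fisher_probit_score(-0.3, n=100, cond_size=2)` is its negative, so the sign of the partial correlation is carried by the score.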

Since the GGM is undirected, we have a total of Inline graphic ψ-scores to calculate for each dataset Inline graphic. For convenience, we re-arrange the ψ-scores for each dataset Inline graphic into an Inline graphic-vector Inline graphic with Inline graphic and re-arrange the ψ-scores for all Inline graphic datasets into an Inline graphic matrix Inline graphic with Inline graphic and Inline graphic.
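The re-arrangement can be illustrated with a small sketch. It assumes that each dataset's scores arrive as a symmetric square matrix and that node pairs are ordered lexicographically; the function names and the ordering are illustrative choices, not taken from the article. An undirected graph on p nodes has p(p-1)/2 candidate edges, which is the number of rows produced here.

```python
from itertools import combinations


def edge_pairs(p):
    """All unordered node pairs (i, j) with i < j, in lexicographic order."""
    return list(combinations(range(p), 2))


def stack_scores(score_matrices):
    """Stack edge-wise scores into rows (one per pair) and columns (one per condition).

    score_matrices : list of symmetric p x p nested lists, one per condition.
    Returns a list with p * (p - 1) / 2 rows, each holding one score per condition.
    """
    p = len(score_matrices[0])
    return [[m[i][j] for m in score_matrices] for i, j in edge_pairs(p)]
```

Each row of the result then holds the scores of one candidate edge across all conditions, which is the unit that the clustering step of Section 2.2 works on.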

2.2. Bayesian clustering and meta-analysis

Consider the ψ-scores Inline graphic, where each pair Inline graphic corresponds to one candidate edge in the graph Inline graphic. Let Inline graphic be the indicator for the status of edge Inline graphic in the underlying graph Inline graphic; Inline graphic if the edge exists and 0 otherwise. The Inline graphic’s work as latent variables in FHBIA. Conditioned on Inline graphic, we assume that Inline graphic’s are mutually independent and follow a two-component mixture Gaussian distribution given by

graphic file with name M82.gif (2.1)

for Inline graphic and Inline graphic. When Inline graphic, Inline graphic’s have a value close to 0; otherwise, Inline graphic’s might have a large negative or positive value depending on the sign of the partial correlation coefficient. Under the assumption that the structure of the GGM changes only slightly under adjacent conditions, it is reasonable to assume that for each Inline graphic, the sign of Inline graphic’s does not change when the edge exists; therefore, Inline graphic’s can be modeled by a two-component mixture Gaussian distribution. In some cases, e.g., when Inline graphic grows, a three-component mixture Gaussian distribution might be needed, which allows us to handle the scenario that an edge is included in multiple graphs, but its partial correlation coefficients have different signs in different graphs. The derivation under this scenario, which is a simple extension of the derivation presented below, is given in the supplementary material available at Biostatistics online. Regarding the two-component mixture distribution (2.1), we further note that Inline graphic can be simply set to 0 considering the physical meaning of ψ-scores. However, as shown below, this general setup does not cause any computational difficulty.

Essentially, we have formulated the problem of inference of Inline graphic’s as a clustering problem, grouping Inline graphic into at most two clusters. The case of the three-component mixture distribution is similar. Let Inline graphic and Inline graphic. Conditioned on Inline graphic, the joint likelihood function of Inline graphic is given by

graphic file with name M100.gif (2.2)

where Inline graphic is the density function of the Gaussian distribution with mean Inline graphic and variance Inline graphic. Taking a product of (2.2) over Inline graphic, we have the joint distribution of all ψ-scores Inline graphic conditioned on Inline graphic’s and other parameters. Then, using Bayes’ theorem, Inline graphic’s can be inferred with appropriate priors for Inline graphic’s and other parameters. For example, the MRF prior used in Peterson and others (2015) and Lin and others (2017) can again be used here as the prior of Inline graphic’s. In this case, the posterior distribution can be sampled using a Markov chain Monte Carlo (MCMC) algorithm.
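As a rough illustration of the conditional likelihood for a single candidate edge, the sketch below evaluates the log of a product of Gaussian densities, where each score uses the mean and variance of the component selected by its edge-status indicator. The argument layout (`mu` and `var` as two-element sequences indexed by the indicator value) and the function names are assumptions made for the example.

```python
import math


def normal_logpdf(x, mean, var):
    """Log density of a Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def edge_loglik(scores, status, mu, var):
    """Log-likelihood of one edge's scores given its status under each condition.

    scores : scores of the edge, one per condition
    status : 0/1 indicators (edge absent/present), one per condition
    mu     : (mean for absent edges, mean for present edges)
    var    : (variance for absent edges, variance for present edges)
    """
    return sum(normal_logpdf(s, mu[e], var[e]) for s, e in zip(scores, status))
```

With `mu = (0.0, 3.0)` and unit variances, the scores `[0.1, 2.9]` are far more likely under the configuration `[0, 1]` than under `[1, 0]`, which is the kind of contrast the clustering step relies on.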

Instead of specifying a joint prior distribution for all Inline graphic’s, we assume in this article that Inline graphic’s are a priori independent for different Inline graphic’s, as we believe that the neighboring dependence of the Gaussian graphical network has been accounted for in the calculation of the ψ-scores. To enhance shared edges among distinct conditions, we consider two types of priors for Inline graphic’s, namely, temporal prior and spatial prior, with terms borrowed from geostatistics. Figure 3 illustrates the application scenarios of the two types of priors. The temporal prior can be used in the scenario that the networks Inline graphic’s evolve sequentially along with the index Inline graphic. In this scenario, it is quite common to consider the index Inline graphic as the time of experiments. The spatial prior can be used in the scenario that the networks or precision matrices evolve independently from a common structure. For example, the temporal prior can be applied when we construct genetic networks using a set of gene expression data measured for the same tissue at multiple time points, and the spatial prior can be applied if the gene expression data are measured for different tissues at the same time point.

Fig. 3.


Illustration of application scenarios of the temporal and spatial priors: (a) networks evolve along with time (temporal prior); (b) networks evolve independently from a common structure (spatial prior).

2.2.1. Temporal prior

To enhance the similarity of the networks between adjacent conditions, we let Inline graphic be subject to the following prior distribution

graphic file with name M120.gif (2.3)

where Inline graphic indicates the change of the status of the edge Inline graphic from condition Inline graphic to condition Inline graphic, and Inline graphic is a prior hyperparameter representing the prior probability of edge status changes. In this article, we assume that Inline graphic follows a beta distribution Inline graphic, where Inline graphic and Inline graphic are pre-specified parameters. Further, we let Inline graphic and Inline graphic be subject to an improper uniform distribution, i.e., Inline graphic and Inline graphic, and let Inline graphic and Inline graphic be subject to an inverted-gamma distribution, i.e., Inline graphic, where Inline graphic and Inline graphic are pre-specified constants. Then the joint posterior distribution of Inline graphic is given by

graphic file with name M140.gif

where Inline graphic’s denote the respective prior distributions. After integrating out the parameters Inline graphic, Inline graphic, Inline graphic, Inline graphic, and Inline graphic, we have the marginal posterior distribution of Inline graphic given by

graphic file with name M148.gif (2.4)

when Inline graphic and Inline graphic hold, where Inline graphic, Inline graphic, Inline graphic and Inline graphic. When Inline graphic and Inline graphic, we have Inline graphic. When Inline graphic and Inline graphic, we have Inline graphic.
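The temporal prior depends on an edge's configuration through the number of status changes between adjacent conditions. A minimal sketch follows, assuming that, for a fixed change probability, the prior takes the Bernoulli-type form in which every change contributes that probability and every non-change contributes its complement. This functional form, the function names, and the argument `q` are assumptions for illustration and should be checked against (2.3).

```python
import math


def n_status_changes(status):
    """Number of adjacent conditions between which the edge status flips."""
    return sum(abs(b - a) for a, b in zip(status, status[1:]))


def temporal_log_prior(status, q):
    """Unnormalized log prior of a configuration for change probability q in (0, 1).

    Assumed form: q ** c * (1 - q) ** (K - 1 - c), where c is the number of
    status changes and K is the number of conditions.
    """
    c = n_status_changes(status)
    return c * math.log(q) + (len(status) - 1 - c) * math.log(1.0 - q)
```

With a small `q`, a stable configuration such as `[1, 1, 1, 1]` receives more prior mass than an oscillating one such as `[1, 0, 1, 0]`, which is how the prior favors similar networks under adjacent conditions.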

Given Inline graphic distinct conditions, the total number of possible configurations of Inline graphic is Inline graphic. When Inline graphic is small, we can provide an exhaustive evaluation of the Inline graphic configurations. That is, for each possible configuration of Inline graphic, we can calculate its posterior probability and integrated ψ-scores exactly. For each possible configuration Inline graphic for Inline graphic, we denote the posterior probability by Inline graphic and the integrated ψ-score by Inline graphic. According to Stouffer’s meta-analysis method (Stouffer and others, 1949), which is also known as the inverse normal method for combining p-values (Zaykin, 2011), we define the integrated ψ-score as

graphic file with name M175.gif (2.5)

for Inline graphic, where the weight Inline graphic might account for the size or quality of the samples collected under each condition. In this article, we set Inline graphic for all Inline graphic. Such a weighted average score integrates data information on the edge across all conditions. Then the Bayesian Stouffer integrated ψ-score, or Bayesian integrated ψ-score for short, is given by

graphic file with name M182.gif (2.6)
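The two steps above can be sketched for a single candidate edge. The sketch assumes that, for a given configuration, the integrated score under a condition combines the scores of all conditions sharing that condition's edge status using Stouffer's rule (weighted sum divided by the square root of the sum of squared weights), and that the Bayesian integrated score is the posterior-weighted average over configurations. The grouping rule, the function names, and the dictionary `posterior` (mapping each configuration tuple to its posterior probability, supplied by the caller) are assumptions for illustration.

```python
import math
from itertools import product


def stouffer_scores(scores, status, weights=None):
    """Stouffer-integrated score per condition for one configuration of an edge."""
    K = len(scores)
    w = [1.0] * K if weights is None else list(weights)
    integrated = []
    for k in range(K):
        # Conditions whose edge status agrees with that of condition k.
        group = [i for i in range(K) if status[i] == status[k]]
        numerator = sum(w[i] * scores[i] for i in group)
        denominator = math.sqrt(sum(w[i] ** 2 for i in group))
        integrated.append(numerator / denominator)
    return integrated


def bayesian_integrated_scores(scores, posterior, weights=None):
    """Posterior-weighted average of Stouffer scores over all 0/1 configurations."""
    K = len(scores)
    total = [0.0] * K
    for status in product((0, 1), repeat=K):  # exhaustive enumeration
        prob = posterior.get(status, 0.0)
        if prob == 0.0:
            continue
        for k, value in enumerate(stouffer_scores(scores, status, weights)):
            total[k] += prob * value
    return total
```

With unit weights, four conditions that all share one status and scores of 1 give an integrated score of 2 for each condition, since the sum 4 is divided by the square root of 4. The exhaustive loop visits every 0/1 configuration, so it is only practical for a small number of conditions.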

When Inline graphic is large, the Inline graphic’s can be estimated with a short MCMC run, say, by running the Gibbs sampler (Geman and Geman, 1984) for a few hundred iterations. Since the MCMC can be run in parallel for different Inline graphic’s, the computation is not a big burden in this case.

It is interesting to point out that the Bayesian Stouffer integrated ψ-score Inline graphic is different from the conventional Bayesian estimator of Inline graphic. The latter is given by

graphic file with name M189.gif (2.7)

and

graphic file with name M190.gif (2.8)

It is easy to see that the Bayesian Stouffer integrated ψ-score amplifies the Bayesian averaged ψ-score (2.7) by a factor between 1 and Inline graphic. Such amplification makes the two clusters of edges more separable in the scores and, as pointed out in the Proof of Lemma 3.4 (in the supplementary material available at Biostatistics online), helps to improve the power of the proposed method by reducing the false negative error. Also, we would point out that for each Inline graphic, if the edge clustering pattern Inline graphic is correct and Inline graphic, then the Stouffer integrated ψ-score Inline graphic has a constant variance of 1, while the simply averaged ψ-score Inline graphic has a varied variance depending on the value of Inline graphic. Therefore, the Bayesian Stouffer integrated ψ-scores are more comparable than the Bayesian averaged ψ-scores in edge determination.
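The variance claim can be checked with a short calculation. As a sketch, assume unit weights and that a cluster contains m conditions whose scores are independent with unit variance; the symbol m and the independence assumption are introduced here for illustration only.

```latex
\operatorname{Var}\!\left(\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\psi_i\right)
  = \frac{1}{m}\sum_{i=1}^{m}\operatorname{Var}(\psi_i) = 1,
\qquad
\operatorname{Var}\!\left(\frac{1}{m}\sum_{i=1}^{m}\psi_i\right)
  = \frac{1}{m^{2}}\sum_{i=1}^{m}\operatorname{Var}(\psi_i) = \frac{1}{m},
\qquad
\frac{\frac{1}{\sqrt{m}}\sum_{i=1}^{m}\psi_i}{\frac{1}{m}\sum_{i=1}^{m}\psi_i} = \sqrt{m}.
```

Under these assumptions the Stouffer score keeps variance 1 whatever the cluster size, the plain average has variance 1/m, and the ratio of the two scores is the square root of m, which is at least 1 and at most the square root of the number of conditions.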

2.2.2. Spatial prior

To encode our prior knowledge that there exists a common structure for all the networks from which they evolve independently, we let Inline graphic’s be subject to the prior distribution

graphic file with name M205.gif (2.9)

where Inline graphic indicates the status change of the edge Inline graphic at condition Inline graphic from Inline graphic, and Inline graphic is the mode of Inline graphic and represents the common status of the edge Inline graphic across all networks. With this prior distribution, the posterior distribution Inline graphic can also be expressed in the form of (2.4) but with Inline graphic and Inline graphic.

2.3. Joint edge detection

To jointly estimate the structure of multiple GGMs based on the Bayesian Stouffer integrated ψ-scores (2.6), a multiple hypothesis test can be applied. The multiple hypothesis test classifies the integrated ψ-scores into two classes, presence of edges and absence of edges. In this article, we adopt the empirical Bayesian method developed by Liang and Zhang (2008) for the multiple hypothesis test, which models the integrated ψ-scores by a mixture distribution Inline graphic, where Inline graphic denotes an integrated ψ-score; Inline graphic and Inline graphic denote the probabilities of edge absence and edge presence, respectively; and Inline graphic and Inline graphic denote the probability density functions of integrated ψ-scores with edge absence and edge presence, respectively. As in Liang and Zhang (2008), we parameterize Inline graphic by an exponential power distribution, and parameterize Inline graphic by a mixture of exponential power distributions. The parameters Inline graphic, Inline graphic and those contained in Inline graphic and Inline graphic are estimated using the stochastic approximation method by minimizing the Kullback–Leibler divergence between the density Inline graphic and the empirical one. The threshold values for grouping the integrated ψ-scores into the classes Inline graphic and Inline graphic are determined according to the value of Inline graphic, a pre-specified false discovery rate level. How to specify the value of Inline graphic will be discussed in Section 2.4. Note that this multiple hypothesis test method allows for dependence between test statistics, i.e., integrated ψ-scores for this problem. Other methods that account for the dependence between test statistics, e.g., the two-stage method by Benjamini and Yekutieli (2001), can also be applied here.

Finally, we would like to point out that the empirical Bayesian method used above produces a q-value (Storey, 2002) for each potential edge of the multiple graphs. The q-value, like the p-value for the single hypothesis test, provides an uncertainty measure for each potential edge.
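For readers who want a simple stand-in for the edge-determination step, the sketch below applies the step-up procedure with the harmonic-sum correction from Benjamini and Yekutieli (2001), which controls the false discovery rate under dependence. It is not the empirical Bayesian method of Liang and Zhang (2008) adopted in this article, and treating the integrated scores as standard normal under edge absence is an assumption made only for this example.

```python
from statistics import NormalDist


def two_sided_pvalues(scores):
    """Two-sided p-values, assuming scores are N(0, 1) when the edge is absent."""
    nd = NormalDist()
    return [2.0 * (1.0 - nd.cdf(abs(z))) for z in scores]


def benjamini_yekutieli(pvalues, alpha):
    """Boolean rejection flags from the step-up procedure at FDR level alpha."""
    m = len(pvalues)
    c_m = sum(1.0 / j for j in range(1, m + 1))  # harmonic correction
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / (m * c_m):
            cutoff_rank = rank  # keep the largest rank passing its threshold
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected
```

A typical call would be `benjamini_yekutieli(two_sided_pvalues(scores), alpha=0.05)`, where `scores` pools the integrated scores of all candidate edges; a flag of `True` marks an edge declared present.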

2.4. Parameter setting

FHBIA contains two free parameters, i.e., Inline graphic and Inline graphic, which refer to the significance levels of the multiple hypothesis tests conducted in correlation screening and JED, respectively. Following the suggestion of Liang and others (2015), we set Inline graphic and Inline graphic as the default values. Where other values are used, they are stated in the context. In general, a high significance level of correlation screening will lead to slightly larger conditioning sets in the calculation of ψ-partial correlation coefficients, which reduces the risk of missing important variables in the conditioning sets. Including a few false variables in the conditioning sets will not much hurt the accuracy of the ψ-partial correlation coefficients. As shown in Xu and others (2019), the performance of the ψ-learning algorithm can be quite robust to the choice of Inline graphic. In contrast, the setting of Inline graphic, which determines the sparsity of the resulting graphs, is quite free. A smaller value of Inline graphic might be used if sparse graphs are preferred.

In addition to the two free parameters, FHBIA contains four prior hyperparameters, i.e., Inline graphic, Inline graphic, Inline graphic, and Inline graphic. Since the probability Inline graphic usually takes a small value, we set Inline graphic for its prior distribution Beta(Inline graphic,Inline graphic). Since the variance of the ψ-scores is approximately equal to 1 under the null hypothesis that the true partial correlation coefficient is equal to 0, we set Inline graphic for its prior distribution IG(Inline graphic, Inline graphic). The same prior hyperparameter settings have been used in all examples of this article.

2.5. Consistency

Under the faithfulness assumption, sparsity assumption, and other regularity conditions for the joint Gaussian distribution, e.g., the dimension Inline graphic is allowed to grow exponentially with the sample size Inline graphic for some constant Inline graphic and the largest eigenvalue of the covariance matrix can grow with Inline graphic at a restricted rate, Liang and others (2015) showed that the multiple hypothesis test based on the ψ-scores produces a consistent estimate for the GGM with data observed under a single condition. Essentially, Liang and others (2015) showed that the ψ-scores are separable in probability for the pairs of nodes with edge absence and edge presence.

To accommodate the change from single condition to multiple conditions, we modified the assumptions of Liang and others (2015) and added an assumption about Inline graphic. Under the new set of assumptions, we proved that the FHBIA method is consistent.

Theorem 2.1

Assume Inline graphicInline graphic (see supplementary material available at Biostatistics online) hold. Then

Theorem 2.1

where Inline graphic denotes the true network under condition Inline graphic, Inline graphic denotes the FHBIA estimator of Inline graphic, and Inline graphic denotes a threshold value of Bayesian integrated ψ-scores based on which the edges are determined for all Inline graphic graphs.

The proof of the theorem is given in the supplementary material available at Biostatistics online. Theorem 2.1 implies that for all graphs there exists a common threshold with respect to which the Bayesian integrated ψ-scores are separable in probability for the pairs of nodes with edge presence and edge absence. Here, we would like to highlight three points. First, as indicated by our proof [see inequality (S29) in the Proof of Lemma 3.5 in the supplementary material available at Biostatistics online], the data integration step can indeed improve the power of the proposed method. Second, following from the inequalities (S29) and (S30) and the condition Inline graphic given in the supplementary material available at Biostatistics online, we can conclude the sign consistency of the estimator Inline graphic; i.e., for any edge of the graph, the Bayesian integrated ψ-score has the same sign as the true partial correlation coefficient when the sample size Inline graphic becomes large. Third, the assumption imposed on Inline graphic, i.e., Inline graphic, is rather weak, where Inline graphic, Inline graphic, and Inline graphic are all positive constants as defined in other assumptions and Inline graphic (see the supplementary material available at Biostatistics online). For example, we can choose Inline graphic and thus Inline graphic. This is consistent with our numerical results; the method can perform very well even with a small value of Inline graphic.

3. Simulation studies

3.1. Scenario with temporal priors

To illustrate the performance of the proposed FHBIA method under the scenario with temporal priors, we consider three types of network structures, namely, autoregressive (AR), scale-free, and hub, which are all allowed to change slightly as the conditions evolve. For all the types of structures, we fixed Inline graphic and Inline graphic, and varied the sample size Inline graphic and 500. We let Inline graphic denote the precision matrix at condition Inline graphic for Inline graphic. At each condition Inline graphic, we generated 10 independent datasets of size Inline graphic by drawing from the multivariate Gaussian distribution Inline graphic.

For the AR network structure, the precision matrix at condition 1 is given by

graphic file with name M305.gif (3.10)

which represents an AR(2) graphical model. To construct Inline graphic, we employed the following random edge deleting–adding procedure: we first randomly removed 5% of the edges in Inline graphic by setting the corresponding nonzero elements to 0, and then added the same number of edges at random by replacing zeros in Inline graphic with values drawn from the uniform distribution defined on Inline graphic; to ensure that Inline graphic is positive definite, we set the diagonal elements of Inline graphic to be the absolute value of the smallest eigenvalue of Inline graphic plus a small positive number, where Inline graphic is obtained from Inline graphic by setting the diagonal elements to zero. Using the same procedure, we generated Inline graphic conditioned on Inline graphic and then generated Inline graphic conditioned on Inline graphic. We note that similar procedures have been used in Peterson and others (2015) and Lin and others (2017) to generate multiple precision matrices. For the scale-free and hub structures, we first generated the precision matrix Inline graphic using the R package huge, then applied the random edge deleting–adding procedure to generate Inline graphic’s for Inline graphic in a sequential manner.
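The random edge deleting–adding procedure can be sketched in plain Python as follows. Two details are assumptions rather than the authors' choices: the interval `(low, high)` for new edge values is a placeholder, and the diagonal is set from a Gershgorin row-sum bound instead of the eigenvalue-based rule described above. The bound is more conservative but still guarantees positive definiteness through strict diagonal dominance.

```python
import random


def perturb_precision(theta, frac=0.05, low=-0.3, high=0.3, eps=0.01, rng=None):
    """Return a perturbed copy of a symmetric precision matrix (nested lists).

    frac      : fraction of existing edges to delete, then re-add elsewhere
    low, high : hypothetical range for the values of newly added edges
    eps       : small positive margin added to the diagonal bound
    """
    if rng is None:
        rng = random.Random()
    p = len(theta)
    new = [row[:] for row in theta]
    upper = [(i, j) for i in range(p) for j in range(i + 1, p)]
    edges = [(i, j) for i, j in upper if theta[i][j] != 0.0]
    non_edges = [(i, j) for i, j in upper if theta[i][j] == 0.0]
    n_swap = min(round(frac * len(edges)), len(non_edges))
    # Delete edges by zeroing the corresponding symmetric entries.
    for i, j in rng.sample(edges, n_swap):
        new[i][j] = new[j][i] = 0.0
    # Add the same number of edges at previously empty positions.
    for i, j in rng.sample(non_edges, n_swap):
        new[i][j] = new[j][i] = rng.uniform(low, high)
    # Gershgorin bound: a diagonal above every off-diagonal absolute row sum
    # makes the symmetric matrix strictly diagonally dominant, hence positive definite.
    bound = max(sum(abs(new[i][j]) for j in range(p) if j != i) for i in range(p))
    for i in range(p):
        new[i][i] = bound + eps
    return new
```

Calling the function repeatedly on its own output produces a sequence of slowly changing matrices, as in the temporal scenario; calling it several times on one fixed matrix produces independent variations of a common structure, as in the spatial scenario.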

The FHBIA method was first applied to this example. To assess the performance of the method, we plot the precision-recall curves in Figure S1 of supplementary material available at Biostatistics online. The same rule applies to other tables and figures included in the supplementary material available at Biostatistics online. The precision and recall are defined by precision = TP/(TP + FP) and recall = TP/(TP + FN), where TP, FP, and FN denote true positives, false positives, and false negatives, respectively, as defined in Table S1 of supplementary material available at Biostatistics online. To draw the precision-recall curves shown in Figure S1 of supplementary material available at Biostatistics online, we fixed the significance level of correlation screening to Inline graphic and varied the value of Inline graphic, the significance level of JED. The precision and recall values were calculated by cumulating the TP, FP, FN, and TN values across all Inline graphic conditions. In this article, we employ the precision-recall curve instead of the Receiver Operating Characteristic (ROC) curve because the classification problem involved in recovering the network structure is severely imbalanced: owing to the network sparsity, it contains a large number of negative cases. As pointed out by Saito and Rehmsmeier (2015) and Davis and Goadrich (2006), the precision-recall curve can be more informative than the ROC curve in the imbalanced classification scenario.
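The pooled precision and recall can be computed with a few lines of code. The sketch below assumes that each true and estimated graph is given as a symmetric 0/1 adjacency matrix (nested lists); the function names are illustrative.

```python
def confusion_counts(true_adj, est_adj):
    """TP, FP, FN, TN over the unordered node pairs of one graph."""
    p = len(true_adj)
    tp = fp = fn = tn = 0
    for i in range(p):
        for j in range(i + 1, p):
            truth, estimate = bool(true_adj[i][j]), bool(est_adj[i][j])
            if truth and estimate:
                tp += 1
            elif estimate:
                fp += 1
            elif truth:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def pooled_precision_recall(true_graphs, est_graphs):
    """Precision and recall after cumulating the counts across all conditions."""
    tp = fp = fn = 0
    for true_adj, est_adj in zip(true_graphs, est_graphs):
        t, f_pos, f_neg, _ = confusion_counts(true_adj, est_adj)
        tp, fp, fn = tp + t, fp + f_pos, fn + f_neg
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall
```

Evaluating this pair of numbers over a grid of JED significance levels traces out one precision-recall curve.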

For comparison, the MRF method (Lin and others, 2017), fused graphical Lasso (FGL), and group graphical Lasso (GGL) (Danaher and others, 2014) were applied to this example. The Matlab code of the MRF is available at https://github.com/linzx06/Spatial-and-Temporal-GGM. Both FGL and GGL are available in the R package JGL. For a thorough comparison, we also applied the original ψ-learning algorithm to this example, for which the models under each condition were estimated separately. The results are summarized in Figure S1 and Table S2 of supplementary material available at Biostatistics online. The comparison indicates that FHBIA significantly outperforms the existing methods, especially when the sample size is small. When the sample size is large, FHBIA, MRF, FGL, and GGL tend to perform similarly for the scale-free and hub networks. It is not surprising that FHBIA always outperforms the separated ψ-learning algorithm, which implies the importance of data integration for such high-dimensional problems.

Table S3 of supplementary material available at Biostatistics online reports the CPU time cost by FGL, GGL, MRF, separated ψ-learning and FHBIA for one dataset of AR(2) structure, where the CPU time was measured on a Linux desktop with an Intel Core i7-4790 CPU at 3.6 GHz. All computations reported in this article were done on the same computer. The CPU times of these methods for the other two graph structures are about the same. FGL is extremely slow for this example, as it needs to search over a grid of possible values for an optimal setting of Inline graphic. The grid we used consists of 100 different pairs of Inline graphic. Moreover, for each pair of Inline graphic, it needs to solve a generalized fused Lasso problem for which a closed-form solution does not exist when Inline graphic is greater than 2. Solving the generalized fused Lasso problem is time consuming and has a computational complexity of Inline graphic. GGL is faster because a closed-form solution to the regularized parameter optimization problem exists under each setting of Inline graphic, although the optimal setting of Inline graphic also needs to be searched over a grid of 100 points. The computational complexity of MRF is Inline graphic (Lin and others, 2017) while that of FHBIA is Inline graphic, which can be pretty fast for a small value of Inline graphic. The separated ψ-learning is a little more time consuming than FHBIA because it needs to conduct multiple hypothesis tests under each condition.

3.2. Scenario with spatial priors

As in the scenario with temporal priors, we considered three types of network structures: AR(2), scale-free, and hub. For each type of structure, we set Inline graphic and Inline graphic, and tried two sample sizes Inline graphic and Inline graphic. For AR(2), we first generated the precision matrix Inline graphic according to (2.1). Conditioned on Inline graphic, we generated the precision matrices Inline graphic, Inline graphic, independently using the random edge deleting–adding procedure as described in the scenario of temporal priors. For the other two types of structures, we generated the precision matrices Inline graphic using the R package huge, and then generated Inline graphic, Inline graphic independently using the random edge deleting–adding procedure. Given the precision matrices, we then generated 10 independent datasets of size Inline graphic by drawing from the multivariate Gaussian distribution Inline graphic for each condition Inline graphic.
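The data-generating step for the AR(2) structure can be sketched as follows. This is a minimal illustration rather than the simulation code used in the article: equation (2.1) is not reproduced in this section, so the off-diagonal values 0.5 and 0.25 are assumptions, the function names are hypothetical, and the random edge deleting–adding procedure is not implemented.

```python
import math
import random


def ar2_precision(p, a=0.5, b=0.25):
    """Banded precision matrix: 1 on the diagonal, a on the first and
    b on the second off-diagonal (a and b are assumed values)."""
    theta = [[0.0] * p for _ in range(p)]
    for i in range(p):
        theta[i][i] = 1.0
        if i + 1 < p:
            theta[i][i + 1] = theta[i + 1][i] = a
        if i + 2 < p:
            theta[i][i + 2] = theta[i + 2][i] = b
    return theta


def cholesky(m):
    """Lower-triangular L with L L^T = m, for symmetric positive-definite m."""
    p = len(m)
    low = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def sample_ggm(theta, n, rng):
    """Draw n samples from N(0, theta^{-1}) by solving L^T x = z, z ~ N(0, I)."""
    p = len(theta)
    low = cholesky(theta)
    samples = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        x = [0.0] * p
        for i in reversed(range(p)):  # back substitution on upper-triangular L^T
            tail = sum(low[k][i] * x[k] for k in range(i + 1, p))
            x[i] = (z[i] - tail) / low[i][i]
        samples.append(x)
    return samples


theta = ar2_precision(6)
data = sample_ggm(theta, 10, random.Random(0))
```

Factorizing the precision matrix directly, rather than inverting it to obtain the covariance matrix, keeps the sketch free of any matrix inversion.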

The FHBIA, MRF, FGL, GGL, separated ψ-learning and graphical EM (Xie and others, 2016) methods were applied to this example. The graphical EM algorithm was specially designed for jointly estimating multiple dependent Gaussian graphical models under this scenario.

Figure S2 of supplementary material available at Biostatistics online shows the precision-recall curves produced for two datasets by FHBIA, MRF, FGL, GGL, separated ψ-learning, and graphical EM. Table S4 of supplementary material available at Biostatistics online summarizes the performance of these methods for all simulated datasets of this example. The comparison indicates that FHBIA significantly outperforms all other methods, especially when the sample size is small.
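For readers unfamiliar with precision-recall curves in the context of edge recovery, the following sketch shows how such a curve can be computed from edge-wise scores and a known true edge set. The function names and the toy scores are hypothetical, and the trapezoidal area is only a rough summary (linear interpolation in precision-recall space can be optimistic; see Davis and Goadrich, 2006).

```python
def precision_recall_curve(scores, truth):
    """scores: dict mapping edge -> score (larger means stronger evidence).
    truth: set of true edges.
    Returns a list of (recall, precision) pairs, one per ranked edge."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    curve, tp = [], 0
    for k, edge in enumerate(ranked, start=1):
        if edge in truth:
            tp += 1
        curve.append((tp / len(truth), tp / k))
    return curve


def area_under_curve(curve):
    """Trapezoidal area under the (recall, precision) points from recall 0."""
    area, prev_r, prev_p = 0.0, 0.0, curve[0][1]
    for r, p in curve:
        area += (r - prev_r) * (p + prev_p) / 2.0
        prev_r, prev_p = r, p
    return area


# Toy example with made-up scores for four candidate edges.
scores = {(0, 1): 3.0, (1, 2): 2.5, (0, 2): 0.4, (2, 3): 1.9}
truth = {(0, 1), (2, 3)}
curve = precision_recall_curve(scores, truth)
auc = area_under_curve(curve)
```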

Table S5 of supplementary material available at Biostatistics online reports the CPU time cost by MRF, FGL, GGL, separated ψ-learning, graphical EM, and FHBIA for one dataset of AR(2) structure. The CPU times for the other two graph structures are about the same. For FGL, this example is even more time consuming than the previous one, although it was run under exactly the same setting for the two examples. One reason is that Inline graphic has increased from 4 to 5. For FHBIA, the CPU time is not much increased compared to the previous example.

4. TEDDY data analysis

This section applies the FHBIA method to the mRNA gene expression data collected in the TEDDY study. In the study, to reduce potential bias and retain study power while reducing the costs by limiting the numbers of samples requiring laboratory analyses, the gene expression data were collected from the nested matched case–control cohort. A subject who developed one or both of the two primary outcomes, persistent confirmed islet autoimmunity (i.e., the presence of one confirmed autoantibody, GADA65A, IA-2A, or IAA, on two or more consecutive samples) and T1D, was defined as a case. The controls were randomly selected among cohort members who had not yet developed the disease at the time a case was diagnosed. For each subject, the gene expression data were collected at multiple time points within 4 years of age. Refer to Lee and others (2014) for a detailed description of the study. Our goal is to integrate all the data to construct one gene network under each distinct condition.

The dataset consists of 21 285 genes and 742 samples collected at multiple time points from a total of 313 subjects. Among the 742 samples, half are from cases and half are from controls. The dataset also contains some external variables for each patient, which include age (the time of data collection), gender, race, race ethnicity, season of birth, number of older siblings, and country. To simplify the analysis, we first filtered out genes that were not differentially expressed across the case and control conditions. This was done by conducting a paired t-test for each gene at each time point and then applying the multiple hypothesis test method by Liang and Zhang (2008) to identify the set of genes that are significantly differentially expressed under the two conditions at least at one time point. With this filtering process, 572 genes were selected for further study. Figure S3 of supplementary material available at Biostatistics online shows the histogram of the ages of the samples. Based on this histogram, we selected only the samples falling into the first nine groups for further analysis, where each mode of the histogram is treated as a group. The respective group sizes are 29, 40, 49, 43, 32, 27, 27, 23, and 21, which are the same for both the case and control. Since the samples were grouped by age, the index Inline graphic can be understood as the time of experiments. In grouping the samples, we have ensured that in each group, each sample corresponds to a different patient and thus the samples within the same group can be treated as mutually independent. Since the sample size of each group is small, we set Inline graphic and Inline graphic, which are smaller than the default values.
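The per-gene test statistic used in the filtering step can be illustrated as below. This is a sketch only: the function name and the toy expression values are hypothetical, the conversion of the statistic to a p-value (a t distribution with the returned degrees of freedom) is left out, and the multiple hypothesis test of Liang and Zhang (2008) is not reproduced.

```python
import math
import statistics


def paired_t_statistic(case, control):
    """Paired t statistic and degrees of freedom for matched case/control
    expression values of one gene at one time point."""
    diffs = [a - b for a, b in zip(case, control)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


# Toy values for one gene measured in three matched case/control pairs.
t_stat, df = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

In the analysis described above, one such statistic would be computed for every gene at every time point, and a gene would be retained if it is declared significant at least at one time point.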

To adjust for the effect of external variables, we adopted the method proposed by Liang and others (2015). Let Inline graphic denote the external variables observed at condition Inline graphic. To adjust for their effects, we can replace the empirical correlation coefficient used in the ψ-score calculation step by the p-value obtained in testing the hypotheses Inline graphic for the regression

graphic file with name M367.gif (4.11)

where Inline graphic denotes the expression value of gene Inline graphic measured at condition Inline graphic, and Inline graphic denotes a vector of Gaussian random errors. Similarly, we can replace the ψ-partial correlation coefficient calculated in the ψ-score calculation step by the p-value obtained in testing the hypotheses Inline graphic for the regression

graphic file with name M375.gif (4.12)

where Inline graphic is the separator of Inline graphic and Inline graphic under condition Inline graphic. With the p-values, we can define the adjusted ψ-score as Inline graphic, where Inline graphic is the p-value obtained from equation (4.12) for edge Inline graphic at condition Inline graphic.
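A p-value can be turned into a z-type score with the inverse standard normal distribution function. The sketch below assumes the common form Φ^{-1}(1 − p); the exact expression used in the article is not recoverable from this text, so both the formula and the function name should be read as assumptions.

```python
from statistics import NormalDist


def adjusted_score(p_value, eps=1e-12):
    """Map a p-value to a z-type score via the inverse standard normal CDF.
    The form inv_cdf(1 - p) is an assumed definition, not taken from the article."""
    p = min(max(p_value, eps), 1.0 - eps)  # keep inv_cdf away from 0 and 1
    return NormalDist().inv_cdf(1.0 - p)


# Small p-values map to large positive scores; p = 0.5 maps to 0.
scores = [adjusted_score(p) for p in (0.5, 0.025, 1e-6)]
```

Clipping the p-value is a practical safeguard, since the inverse distribution function is undefined at exactly 0 and 1.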

For this dataset, the effects of all available demographical variables, including age (the time of data collection), gender, race, race ethnicity, season of birth, number of older siblings, and country, have been adjusted for. With the adjusted ψ-scores, the FHBIA method is ready to be applied to construct the gene networks. Given the complexity of the dataset, which contains case and control groups and multiple time points for each group, we calculated the integrated ψ-scores in two steps. First, we integrated the ψ-scores across nine time points under the case and control, separately. Then, for each time point, we integrated the ψ-scores across the case and control conditions. In this way, all information in the data collected under the 18 conditions was integrated. Figure 1 shows a schematic diagram for this two-step procedure. Finally, we applied the multiple hypothesis test to the Bayesian integrated ψ-scores to determine the structure of the gene networks under the 18 conditions. The total CPU time cost by FHBIA was 19.2 h, which is long because Inline graphic is large. For a larger value of Inline graphic, we might resort to MCMC for estimating the posterior probabilities Inline graphic’s.
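As a point of reference for the integration step, the classical way to pool z-type scores across conditions is Stouffer's method (Stouffer and others, 1949; Zaykin, 2011). The sketch below shows only this building block, under the simplifying assumption that an edge has the same status in every pooled condition. It is not the Bayesian integration of FHBIA, which weights the possible edge configurations by their posterior probabilities; the function name and toy scores are hypothetical.

```python
import math


def stouffer(z_scores, weights=None):
    """Stouffer's combined z-score: sum(w_i * z_i) / sqrt(sum(w_i ** 2))."""
    if weights is None:
        weights = [1.0] * len(z_scores)
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    return numerator / math.sqrt(sum(w * w for w in weights))


# Toy example: one edge with a modest score of 1.0 at each of nine time points.
pooled = stouffer([1.0] * 9)
```

With equal weights, nine scores of 1.0 pool to 3.0, which illustrates how weak but consistent evidence accumulates across conditions.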

Figure 4 shows the networks constructed by FHBIA for the case samples at nine time points. The networks have identified quite a few hub genes, which refer to the genes with high connectivity. Table 1 shows the top 5 hub genes identified at each time point for the case samples. The lists of hub genes are fairly stable. For example, RPS26P11 and RPS26 consistently appear as the top 2 genes at all time points, the gene ADAM10 appeared at five out of nine time points, and quite a few genes appeared two or more times, such as PRF1, POGZ, BCL11B, GGNBP2, and TMEM159. Note that RPS26P11 is a pseudo-gene, which represents a segment of the gene RPS26.
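Given an estimated network stored as an edge list, hub genes can be ranked by their number of links. The sketch below is illustrative only: the function name is hypothetical and the toy edges are made up, not taken from the estimated TEDDY networks.

```python
from collections import Counter


def top_hubs(edges, k=5):
    """Return the k genes with the most links as (gene, links) pairs."""
    degree = Counter()
    for gene_a, gene_b in edges:
        degree[gene_a] += 1
        degree[gene_b] += 1
    # Sort by links (descending), then by name so that ties are reproducible.
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))[:k]


# Made-up toy network, used only to show the call.
toy_edges = [
    ("RPS26", "RPS26P11"),
    ("RPS26", "ADAM10"),
    ("RPS26", "POGZ"),
    ("RPS26P11", "ADAM10"),
]
hubs = top_hubs(toy_edges, k=2)
```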

Fig. 4.

Gene networks produced by FHBIA for the case TEDDY samples at nine time points. The red edge lines denote new connections appearing in the current network compared with the network at the previous time point; the blue edge lines denote the connections that disappear in the network of the next time point; the green edge lines denote connections that are both newly appearing and disappearing; the gray edge lines denote the connections unchanged between the current network and the network at the previous time point. (a) Time 1, (b) time 2, (c) time 3, (d) time 4, (e) time 5, (f) time 6, (g) time 7, (h) time 8, and (i) time 9.
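The color coding described in the caption amounts to a comparison of edge sets at consecutive time points. A minimal sketch is given below; the function name is hypothetical, and how the first and last time points are handled (where no previous or next network exists) is left to the caller.

```python
def color_edges(previous, current, following):
    """Classify each edge of the current network by the Fig. 4 color scheme.
    Each argument is a set of frozenset({gene_a, gene_b}) edges."""
    colors = {}
    for edge in current:
        is_new = edge not in previous
        will_vanish = edge not in following
        if is_new and will_vanish:
            colors[edge] = "green"   # appears now and is gone at the next time point
        elif is_new:
            colors[edge] = "red"     # absent at the previous time point
        elif will_vanish:
            colors[edge] = "blue"    # absent at the next time point
        else:
            colors[edge] = "gray"    # unchanged
    return colors


# Toy example with placeholder gene names.
e1, e2 = frozenset({"A", "B"}), frozenset({"A", "C"})
e3, e4 = frozenset({"B", "C"}), frozenset({"C", "D"})
colors = color_edges({e1, e2}, {e1, e2, e3, e4}, {e1, e3})
```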

Table 1.

Top 5 hub genes identified by FHBIA for the case TEDDY samples at nine time points. “Links” denotes the number of links of the gene to other genes, Inline graphic is the index of time points, * indicates that other genes have the same number of links as this gene, and Inline graphic indicates that this gene has been verified as a T1D-related gene in the literature.

Case group
Gene Links Gene Links Gene Links
Inline graphic Inline graphicRPS26 104 Inline graphicRPS26 68 Inline graphicRPS26 64
Inline graphicRPS26P11 40 Inline graphicRPS26P11 15 Inline graphicRPS26P11 12
Inline graphicADAM10 4 Inline graphic Inline graphicADAM10 5 Inline graphic Inline graphicADAM10 5
Inline graphic POGZ 3 Inline graphic PRF1 4 U2SURP 4
Inline graphicTMEM159* 3 Inline graphic POGZ 3 Inline graphicBCL11B* 3
Inline graphic Inline graphicRPS26 99 Inline graphicRPS26 91 Inline graphicRPS26 86
Inline graphicRPS26P11 14 Inline graphicRPS26P11 18 Inline graphicRPS26P11 42
Inline graphicADAM10 6 Inline graphic Inline graphicADAM10 4 Inline graphic Inline graphicBCL11B 3
Inline graphicBCL11B 3 Inline graphicBCL11B 3 GNPTG 3
Inline graphicPOGZ* 3 Inline graphicPOGZ* 3 Inline graphicGGNBP2 3
Inline graphic Inline graphicRPS26 78 Inline graphicRPS26 70 Inline graphicRPS26 61
Inline graphicRPS26P11 46 Inline graphicRPS26P11 39 Inline graphicRPS26P11 30
Inline graphicBCL11B 3 Inline graphic Inline graphic PRF1 4 Inline graphic Inline graphicTMEM159 3
Inline graphicTMEM159 3 Inline graphicBCL11B 3 Inline graphic GGNBP2 3
Inline graphicGGNBP2 3 Inline graphicGGNBP2 3 Inline graphicOGT* 2

Table 1 includes 11 different genes in total. Among the 11 genes, 9 have been verified in the literature to be T1D-associated genes. For example, Schadt and others (2008) reported that RPS26 is a T1D causal gene, and Ma and Hart (2013) reported that the gene O-GlcNAc transferase (OGT) is directly linked to many metabolic diseases including diabetes. Besides identifying some verified T1D-associated genes, we also have some new findings, such as the gene PRF1. Orilieri and others (2008) claimed that PRF1 variations are susceptibility factors for T1D development. In Table 1, PRF1 appeared as a hub gene twice, which suggests that the connection between PRF1 and T1D might be worth exploring further. Moreover, we also identified some connection changes in the networks. As shown in Figure 4, the newly appearing and disappearing connections are marked in different colors at each time point, which reveals some patterns in how the network evolves.

For comparison, the GGL method was also applied to this example, for which the regularization parameters were chosen according to the minimum AIC criterion. The total CPU time cost by the method was 20.2 h. FGL was not applied to this example, as it would take an extremely long CPU time. Figure S4 available at Biostatistics online shows the networks constructed by GGL for the case samples at all nine time points. Table S6 available at Biostatistics online shows the top 5 hub genes identified by GGL at each time point for the case samples. The lists of hub genes are fairly stable and consist of only seven different genes. Among the seven genes, only three, RPS26, OGT, and JMJD1C, have been verified in the literature as T1D-associated genes. Moreover, as shown in Figure S4 available at Biostatistics online, the hub genes in the networks are almost identical at each time point. In summary, FHBIA tends to outperform GGL for this real data example, as it identifies more hub genes that are associated with T1D.

From the perspective of data analysis, one might also be interested in estimating the gene networks constructed from the controls, as well as the differences between the networks from the cases and controls. For comparing the networks from the cases and controls, we can adopt the method described in Section 6 of Liang and others (2015). However, since the method by Liang and others (2015) requires that the two networks under comparison are independent, the sample information from the cases and controls should not be integrated in this case. We leave this work for future research.

5. Discussion

We have proposed the FHBIA method for jointly estimating multiple GGMs under distinct conditions and applied it to TEDDY data. The FHBIA method consists of three main steps: it first summarizes the graph structure information contained in the data using the ψ-learning algorithm, then integrates information via a meta-analysis procedure under the Bayesian framework, and finally determines the structures of multiple graphs via a multiple hypothesis test. Compared to the existing methods, FHBIA has a few significant advantages. First, FHBIA includes a meta-analysis procedure to explicitly integrate information across distinct conditions. In contrast, the existing methods integrate information through prior distributions or penalty functions, which is often less efficient. Second, FHBIA can be run very fast, especially when Inline graphic is small. The overall computational complexity of FHBIA is Inline graphic, where the factor Inline graphic is the total number of possible configurations of an edge across all Inline graphic conditions. When Inline graphic is large, we need to resort to MCMC for an efficient estimation of the posterior probabilities Inline graphic’s for Inline graphic. Since Inline graphic’s can be estimated for each Inline graphic independently, this step can be done in parallel. In addition, we note that the correlation coefficients and ψ-scores can also be calculated in parallel. Hence, the whole method can be executed very fast on a parallel architecture. Moreover, instead of working on the original data, the Bayesian integration step works on the edge-wise ψ-scores, which avoids inverting high-dimensional covariance matrices and thus can be very fast. Note that, in the calculation of ψ-scores, the ψ-learning algorithm also avoids inverting high-dimensional covariance matrices through correlation screening. Third, the empirical Bayesian method that FHBIA employs for multiple hypothesis tests produces a q-value (Storey, 2002) for each potential edge of the multiple graphs. The q-value provides an uncertainty measure for each potential edge. This is beyond the ability of many of the existing methods, especially when Inline graphic is large.
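To see why the number of edge configurations drives the cost, note that an edge is either present or absent under each condition, so the configurations across k conditions are the 2^k binary vectors of length k. The short sketch below enumerates them; the function name is hypothetical, and the values 9 and 18 are the numbers of time points and of conditions in the TEDDY analysis.

```python
from itertools import product


def edge_configurations(k):
    """All 2**k present/absent patterns of one edge across k conditions."""
    return list(product((0, 1), repeat=k))


# Nine time points within one group versus all 18 conditions at once.
n_nine = len(edge_configurations(9))
n_eighteen = 2 ** 18  # counted directly instead of materializing the list
```

The count grows from 512 for nine conditions to 262 144 for eighteen, which is why exhaustive enumeration is practical only for a small number of conditions and why MCMC is suggested otherwise.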

The FHBIA method has a very flexible framework, which can be easily extended to joint estimation of multiple mixed graphical models. For example, consider the scenario that the data consist of only Gaussian and multinomial random variables, for which the joint distribution is well defined (Lee and Hastie, 2015). For such mixed data, the ψ-learning algorithm can be performed under the framework of generalized linear models; that is, we can replace the correlation coefficients and ψ-partial correlation coefficients used in the algorithm by the corresponding p-values obtained in the marginal variable screening tests (Fan and Song, 2010) and conditional independence tests. Then, we can replace the ψ-scores by the Inline graphic-scores corresponding to the p-values of the conditional independence tests. For other types of continuous random variables, we can apply the nonparanormal transformation (Liu and others, 2009) to Gaussianize them prior to the application of the FHBIA method. For certain types of discrete data, e.g., next generation sequencing data, we can apply the transformations developed by Jia and others (2017) to continuize and Gaussianize them prior to the application of the FHBIA method.
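The idea of Gaussianizing a continuous variable can be illustrated with a plain rank-based normal-score transformation. This is a simplified stand-in, not the nonparanormal estimator of Liu and others (2009), which uses a truncated empirical distribution function; the function name is hypothetical and ties are broken by position rather than averaged.

```python
from statistics import NormalDist


def normal_scores(values):
    """Rank-based Gaussianization: replace each value by the standard normal
    quantile of rank / (n + 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = NormalDist().inv_cdf(rank / (n + 1))
    return out


# A skewed toy variable is mapped to symmetric normal quantiles.
transformed = normal_scores([10.0, 0.5, 3.0])
```

Dividing by n + 1 rather than n keeps the largest observation away from probability 1, where the normal quantile is infinite.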

Finally, we would like to mention that the sparsity assumption imposed on the networks does not limit the applications of the FHBIA method. Sparsity is just a device adopted for making statistical inference when not enough data are available, e.g., in dealing with small-n-large-p problems. In this article, the sparsity assumption A4 (in the supplementary material available at Biostatistics online) is given in the form of the sample size Inline graphic, which leads to a bounded neighborhood size of Inline graphic for each node; see Lemma 3.2 and the ψ-learning algorithm presented in the supplementary material available at Biostatistics online. Therefore, when the sample size Inline graphic is large enough, e.g., when using the UK biobank data, which consist of over 500 K samples, to construct GRNs, FHBIA can be applied to learn dense genetic networks (Boyle and others, 2017). For UK biobank data, we have Inline graphic, which is much larger than the number of genes (Inline graphic K) we usually consider, and thus neighborhood truncation in the correlation screening step of ψ-learning will not be triggered. When the data size is not large enough, FHBIA will identify only the strongest connections in terms of partial correlation coefficients. This property is inherited from the ψ-learning algorithm.
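The scale of the neighborhood bound can be checked with a line of arithmetic. The expression for the bound is not recoverable from this text, so the sketch below assumes the form n / log(n) with the natural logarithm; both the formula and the function name are assumptions made for illustration.

```python
import math


def neighborhood_bound(n):
    """Assumed neighborhood-size bound n / log(n), natural logarithm."""
    return n / math.log(n)


# UK biobank scale (about 500 K samples) versus a TEDDY-sized group of 29 samples.
large = neighborhood_bound(500_000)
small = neighborhood_bound(29)
```

Under this assumed form, a sample of 500 K gives a bound of roughly 38 100, whereas a group of 29 samples allows fewer than ten neighbors per node, which is consistent with only the strongest connections being retained when the sample is small.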

Supplementary Material

kxz027_Supplementary_Material

Acknowledgments

The authors thank the editor, associate editor, two referees, and Dr. George Tseng for their constructive comments which have led to significant improvement of this article. Members of the TEDDY Study Group are listed in the Supplementary File. Conflict of Interest: None declared.

6. Software

The software accompanying this article is available as a module called JGGM in the R package equSA at https://cran.r-project.org/web/packages/equSA/index.html.

Funding

Leona M. and Harry B. Helmsley Charitable Trust (2015PG-T1D050); F.L.’s research was supported in part by the grants USF-ITN-15-11-MH, DMS-1612924, DMS/NIH R01-GM117597, and NIH R01-GM126089. The TEDDY Study is funded by U01 DK63829, U01 DK63861, U01 DK63821, U01 DK63865, U01 DK63863, U01 DK63836, U01 DK63790, UC4 DK63829, UC4 DK63861, UC4 DK63821, UC4 DK63865, UC4 DK63863, UC4 DK63836, UC4 DK95300, UC4 DK100238, UC4 DK106955, UC4 DK112243, UC4 DK117483, and Contract No. HHSN267200700014C from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institute of Allergy and Infectious Diseases (NIAID), National Institute of Child Health and Human Development (NICHD), National Institute of Environmental Health Sciences (NIEHS), Centers for Disease Control and Prevention (CDC), and JDRF. NIH/NCATS Clinical and Translational Science Awards to the University of Florida (UL1 TR000064) and the University of Colorado (UL1 TR001082), in part.

References

  1. Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29, 1165–1188.
  2. Boyle, E. A., Li, Y. I. and Pritchard, J. K. (2017). An expanded view of complex traits: from polygenic to omnigenic. Cell 169, 1177–1186.
  3. Chun, H., Zhang, X. and Zhao, H. (2015). Gene regulation network inference with joint sparse Gaussian graphical models. Journal of Computational and Graphical Statistics 24, 954–974.
  4. Danaher, P., Wang, P. and Witten, D. M. (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76, 373–397.
  5. Davis, J. and Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, Pennsylvania, USA. New York, NY, USA: ACM, pp. 233–240.
  6. Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. Annals of Statistics 38, 3567–3604.
  7. Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distribution and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741.
  8. Guo, J., Levina, E., Michailidis, G. and Zhu, J. (2011). Joint estimation of multiple graphical models. Biometrika 98, 1–15.
  9. Jia, B., Xu, S., Xiao, G., Lamba, V. and Liang, F. (2017). Learning gene regulatory networks from next generation sequencing data. Biometrics 73, 1221–1230.
  10. Lee, H. S., Burkhardt, B. R., McLeod, W., Smith, S., Eberhard, C., Lynch, K., Hadley, D., Rewers, M., Simell, O., She, J. X. and others (2014). Biomarker discovery study design for type 1 diabetes in The Environmental Determinants of Diabetes in the Young (TEDDY) study. Diabetes/Metabolism Research and Reviews 30, 424–434.
  11. Lee, J. and Hastie, T. J. (2015). Learning the structure of mixed graphical models. Journal of Computational and Graphical Statistics 24, 230–253.
  12. Liang, F., Song, Q. and Qiu, P. (2015). An equivalent measure of partial correlation coefficients for high dimensional Gaussian graphical models. Journal of the American Statistical Association 110, 1248–1265.
  13. Liang, F. and Zhang, J. (2008). Estimating the false discovery rate using the stochastic approximation algorithm. Biometrika 95, 961–977.
  14. Lin, Z., Wang, T., Yang, C. and Zhao, H. (2017). On joint estimation of Gaussian graphical models for spatial and temporal data. Biometrics 73, 769–779.
  15. Liu, H., Lafferty, J. and Wasserman, L. (2009). The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research 10, 2295–2328.
  16. Ma, J. and Hart, G. W. (2013). Protein O-GlcNAcylation in diabetes and diabetic complications. Expert Review of Proteomics 10, 365–380.
  17. Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Annals of Statistics 34, 1436–1462.
  18. Orilieri, E., Cappellano, G., Clementi, R., Cometa, A., Ferretti, M., Cerutti, E., Cadario, F., Martinetti, M., Larizza, D., Calcaterra, V. and others (2008). Variations of the perforin gene in patients with type 1 diabetes. Diabetes 57, 1078–1083.
  19. Peterson, C., Stingo, F. C. and Vannucci, M. (2015). Bayesian inference of multiple Gaussian graphical models. Journal of the American Statistical Association 110, 159–174.
  20. Qiu, H., Han, F., Liu, H. and Caffo, B. (2016). Joint estimation of multiple graphical models from high dimensional time series. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78, 487–504.
  21. Saito, T. and Rehmsmeier, M. (2015). The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One 10, e0118432.
  22. Schadt, E. E., Molony, C., Chudin, E., Hao, K., Yang, X., Lum, P. Y., Kasarskis, A., Zhang, B., Wang, S., Suver, C. and others (2008). Mapping the genetic architecture of gene expression in human liver. PLoS Biology 6, e107.
  23. Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64, 479–498.
  24. Stouffer, S. A., Suchman, E. A., Devinney, L. C., Star, S. A., Williams, R. M., Jr. and others (1949). The American Soldier: Adjustment during Army Life (Studies in social psychology in World War II), Volume 1. Oxford, England: Princeton University Press.
  25. Xie, Y., Liu, Y. and Valdar, W. (2016). Joint estimation of multiple dependent Gaussian graphical models with applications to mouse genomics. Biometrika 103, 493–511.
  26. Xu, S., Jia, B. and Liang, F. (2019). Learning moral graphs in construction of high-dimensional Bayesian networks for mixed data. Neural Computation 31, 1183–1214.
  27. Zaykin, D. V. (2011). Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. Journal of Evolutionary Biology 24, 1836–1841.
  28. Zhou, S., Lafferty, J. and Wasserman, L. (2010). Time varying undirected graphs. Machine Learning 80, 295–319.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
