SUMMARY
Motivated by the study of the molecular mechanism underlying type 1 diabetes with gene expression data collected from both patients and healthy controls at multiple time points, we propose a hybrid Bayesian method for jointly estimating multiple dependent Gaussian graphical models with data observed under distinct conditions, which avoids inversion of high-dimensional covariance matrices and thus can be executed very fast. We prove the consistency of the proposed method under mild conditions. The numerical results indicate the superiority of the proposed method over existing ones in both estimation accuracy and computational efficiency. Extension of the proposed method to joint estimation of multiple mixed graphical models is straightforward.
Keywords: Data integration, Meta-analysis, Multiple Gaussian graphical models, ψ-Learning
1. Introduction
Type 1 diabetes (T1D) is one of the most common autoimmune diseases. The Environmental Determinants of Diabetes in the Young (TEDDY) study is designed to identify environmental exposures triggering islet autoimmunity and T1D in genetically high-risk children. A large dataset has been collected through the study, including clinical, genetic, and demographic data. While great efforts have been made to identify the genetic and environmental factors that contribute to the etiology of the disease, the molecular mechanism underlying the disease is still far from being understood. To enhance our understanding of the molecular mechanism, this work aims to learn gene regulatory networks (GRNs) by integrating the gene expression data measured from both the patients and healthy controls at multiple time points. Figure 1 shows the structure of the data, where the gene expression was measured for each of the case and control children at nine time points within 4 years of age. How to integrate the data collected under the 18 distinct conditions has posed a great challenge to current statistical methods.
Fig. 1.
Structure of the T1D data considered in the article, where the numbers represent nine time points at which gene expression data were collected and the arrows represent joint estimation of Gaussian graphical models by integrating the data across different time points and case–control groups.
During the past decade, a variety of approaches have been proposed for estimating multiple GRNs with data collected under multiple distinct conditions. These approaches can be roughly grouped into two categories, namely, regularization and Bayesian.
The regularization approaches work with specific penalty functions that enhance the shared structure of the graphical models. For example, Guo and others (2011) employed a hierarchical penalty that targets the removal of common zeros in the precision matrices across conditions. Danaher and others (2014) employed penalized fused lasso or group lasso penalties that encourage shared elements of the precision matrices. Chun and others (2015) employed a class of nonconvex penalty functions that regularize the common and condition-specific structures hierarchically. A shortcoming of these approaches is that they assume the observations under different conditions are independent. This assumption is hard to satisfy for temporal data, where the observations are taken from the same cohort at multiple time points. To address this issue, Zhou and others (2010) and Qiu and others (2016) proposed to model the temporal data as a high-dimensional time series and then estimate the time-varying graphical structure using a nonparametric method, by assuming that the covariance changes smoothly over time. These approaches usually require the time series to be fairly long, say, 50 or longer.
As an analog to the regularization approaches, Bayesian approaches enhance the shared structure of multiple graphical models by employing specific priors. For example, Peterson and others (2015) linked the estimation of graph structures via a Markov random field (MRF) prior which encourages common edges. However, since this method involves repeated calculations of concentration matrices (i.e., inverses of covariance matrices), it is only applicable when the graph is not very large. To accelerate computation, Lin and others (2017) proposed a Bayesian analog of the neighborhood selection method (Meinshausen and Bühlmann, 2006) to learn the structure of multiple graphical models with the MRF prior.
In this article, we propose a fast hybrid Bayesian integrative analysis (FHBIA) method for jointly estimating multiple Gaussian graphical models. The proposed method consists of both frequentist and Bayesian components. First, it applies the ψ-learning method, which is a frequentist method, to transform the original data to edge-wise ψ-scores. The ψ-score, which forms an equivalent measure of the partial correlation coefficient, provides a good summary of the graph structure information contained in the data under each condition. Then, it applies a Bayesian method to model the ψ-scores for edge clustering and applies a meta-analysis method for integrating data information across distinct conditions. Finally, it applies a multiple hypothesis test for edge determination. Due to the use of the ψ-score transformation, FHBIA avoids inversion of high-dimensional covariance matrices and thus can be executed very fast. The multiple hypothesis test produces a $q$-value (Storey, 2002), which can be viewed as an uncertainty measure, for each potential edge of the multiple Gaussian graphs. We prove consistency of the proposed method under mild conditions and illustrate its performance using simulated and real data examples. The numerical results indicate the superiority of the proposed method over the existing ones in both estimation accuracy and computational efficiency.
2. Fast hybrid Bayesian integrative analysis
The FHBIA method consists of a few steps, including ψ-score calculation, Bayesian clustering and meta-analysis, and joint edge detection (JED), with the diagram shown in Figure 2.
Fig. 2.
Diagram of the FHBIA method: (i) the datasets $X^{(k)}$'s, $k = 1, \ldots, M$, are first transformed to edgewise ψ-scores $z^{(k)}$'s through the step of ψ-score transformation; (ii) the ψ-scores are processed through the step of Bayesian clustering and meta-analysis to get the Bayesian integrated ψ-scores $\bar{z}$'s; (iii) the Bayesian integrated ψ-scores are further processed through the step of JED to get the graph estimates $\hat{G}^{(k)}$'s.
2.1. ψ-Score transformation

Suppose that we have a dataset of $p$ variables observed under $M$ distinct conditions. Let $X^{(k)} = \{x_1^{(k)}, \ldots, x_{n_k}^{(k)}\}$ denote the dataset observed under condition $k$, where $n_k$ denotes the sample size under condition $k$; and $x_i^{(k)}$ is a $p$-dimensional random vector distributed according to the multivariate normal distribution $N_p(\mu^{(k)}, \Sigma^{(k)})$, where $\mu^{(k)}$ and $\Sigma^{(k)}$ are the mean and covariance matrix of the distribution, respectively. The sample size $n_k$ is not necessarily the same for all conditions. Without loss of generality, we assume that $\mu^{(k)}$ is a zero vector for all $k$. With slight abuse of notation, we let $X_1, \ldots, X_p$ denote the $p$ variables that are common for all $M$ datasets. Let $V = \{1, 2, \ldots, p\}$ denote the index set of the variables, where each variable is called a node in the terminology of graphs.
We adopt the ψ-learning algorithm to transform each dataset $X^{(k)}$ to edge-wise scores independently. The ψ-learning algorithm first produces a ψ-partial correlation coefficient for each pair of nodes. Denote the ψ-partial correlation coefficients by $\psi_{ij}^{(k)}$ for all pairs $(i,j)$ with $1 \le i < j \le p$, which are equivalent to the true partial correlation coefficients $\rho_{ij}^{(k)}$ for determining the structure of the Gaussian graphical model (GGM) in the sense that $\psi_{ij}^{(k)} = 0$ if and only if $\rho_{ij}^{(k)} = 0$. Further, the ψ-learning algorithm converts the ψ-partial correlation coefficients to ψ-scores, denoted by $z_{ij}^{(k)}$, via Fisher's transformation and the probit transformation such that $z_{ij}^{(k)} \sim N(0,1)$ approximately holds under the null hypothesis $\rho_{ij}^{(k)} = 0$ for any pair $(i,j)$. Therefore, the ψ-score can be used as a test statistic for identifying nonzero partial correlation coefficients and thus the structure of the GGM. The use of ψ-scores enables the proposed method to avoid inversion of high-dimensional covariance matrices that the existing Bayesian methods often need to deal with, and hence the proposed method can be executed very fast. Refer to the supplementary material available at Biostatistics online for the details of the ψ-learning algorithm.
Since the GGM is undirected, we have a total of $p(p-1)/2$ ψ-scores to calculate for each dataset $X^{(k)}$. For convenience, we re-arrange the ψ-scores for each dataset $X^{(k)}$ into a $p(p-1)/2$-vector $z^{(k)}$ with elements $z_{ij}^{(k)}$, $1 \le i < j \le p$, and re-arrange the ψ-scores for all $M$ datasets into a $[p(p-1)/2] \times M$ matrix $Z$ with columns $z^{(1)}, \ldots, z^{(M)}$.
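To make the score transformation concrete, the sketch below (in Python; the function name is ours, and it uses a plain estimated partial correlation in place of the ψ-partial correlation as a simplification) shows the Fisher transformation that maps a correlation estimated given a conditioning set of size $|S|$ from $n$ samples to an approximately standard Gaussian score under the null:

```python
import numpy as np

def fisher_z(r, n, sep_size):
    """Transform an estimated (partial) correlation r, computed from n
    samples given a conditioning set of size sep_size, into a score that
    is approximately N(0, 1) under the null hypothesis r = 0."""
    return np.sqrt(n - sep_size - 3.0) * np.arctanh(r)

# Example: a weak partial correlation from 100 samples, 2 conditioning nodes
score = fisher_z(0.1, n=100, sep_size=2)
```

Under the null, such scores are directly comparable across edges and across conditions, which is what the clustering and meta-analysis steps below rely on.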
2.2. Bayesian clustering and meta-analysis

Consider the ψ-scores $z_{ij} = (z_{ij}^{(1)}, \ldots, z_{ij}^{(M)})$, where each pair $(i,j)$ corresponds to one candidate edge in the graphs. Let $e_{ij}^{(k)}$ be the indicator for the status of edge $(i,j)$ in the underlying graph $G^{(k)}$: $e_{ij}^{(k)} = 1$ if the edge exists and 0 otherwise. The $e_{ij}^{(k)}$'s work as latent variables in FHBIA. Conditioned on $e_{ij}^{(k)}$, we assume that the $z_{ij}^{(k)}$'s are mutually independent and follow a two-component mixture Gaussian distribution given by

\[ z_{ij}^{(k)} \mid e_{ij}^{(k)} \sim (1 - e_{ij}^{(k)})\, N(\mu_0, \sigma_0^2) + e_{ij}^{(k)}\, N(\mu_1, \sigma_1^2), \]  (2.1)

for $k = 1, \ldots, M$ and $1 \le i < j \le p$. When $e_{ij}^{(k)} = 0$, $z_{ij}^{(k)}$ has a value close to 0; otherwise, $z_{ij}^{(k)}$ might have a large negative or positive value depending on the sign of the partial correlation coefficient. Under the assumption that the structure of the GGM changes only slightly under adjacent conditions, it is reasonable to assume that for each pair $(i,j)$, the sign of the $z_{ij}^{(k)}$'s does not change when the edge exists; therefore, the $z_{ij}^{(k)}$'s can be modeled by a two-component mixture Gaussian distribution. In some cases, e.g., when $M$ grows, a three-component mixture Gaussian distribution might be needed, which allows us to handle the scenario that an edge is included in multiple graphs but its partial correlation coefficients have different signs in different graphs. The derivation under this scenario, which is a simple extension of the derivation presented below, is given in the supplementary material available at Biostatistics online. Regarding the two-component mixture distribution (2.1), we further note that $\mu_0$ can simply be set to 0, considering the physical meaning of the ψ-scores. However, as shown below, the general setup does not cause any computational difficulty.
Essentially, we have formulated the problem of inference of the $e_{ij}^{(k)}$'s as a clustering problem, grouping $z_{ij}^{(1)}, \ldots, z_{ij}^{(M)}$ into up to two different clusters; the case of the three-component mixture distribution is similar. Let $e_{ij} = (e_{ij}^{(1)}, \ldots, e_{ij}^{(M)})$. Conditioned on $e_{ij}$, the joint likelihood function of $z_{ij}$ is given by

\[ \pi(z_{ij} \mid e_{ij}, \mu_0, \mu_1, \sigma_0^2, \sigma_1^2) = \prod_{k=1}^{M} \phi(z_{ij}^{(k)}; \mu_{e_{ij}^{(k)}}, \sigma_{e_{ij}^{(k)}}^2), \]  (2.2)

where $\phi(\cdot; \mu, \sigma^2)$ is the density function of the Gaussian distribution with mean $\mu$ and variance $\sigma^2$. Taking a product of (2.2) over all pairs $(i,j)$, we have the joint distribution of all ψ-scores conditioned on the $e_{ij}$'s and other parameters. Then, using the Bayes theorem, the $e_{ij}$'s can be inferred with appropriate priors for the $e_{ij}$'s and the other parameters. For example, the MRF prior used in Peterson and others (2015) and Lin and others (2017) can again be used here as the prior of the $e_{ij}$'s. In that case, the posterior distribution can be sampled using a Markov chain Monte Carlo (MCMC) algorithm.
Instead of specifying a joint prior distribution for all the $e_{ij}$'s, we assume in this article that the $e_{ij}$'s are a priori independent for different pairs $(i,j)$, as we believe that the neighboring dependence of the Gaussian graphical network has been accounted for in the calculation of the ψ-scores. To enhance shared edges among distinct conditions, we consider two types of priors for the $e_{ij}$'s, namely, the temporal prior and the spatial prior, with the terms borrowed from geostatistics. Figure 3 illustrates the application scenarios of the two types of priors. The temporal prior can be used in the scenario that the networks $G^{(k)}$'s evolve sequentially along with the index $k$; in this scenario, it is quite common to consider the index $k$ as the time of experiments. The spatial prior can be used in the scenario that the networks or precision matrices evolve independently from a common structure. For example, the temporal prior can be applied when we construct genetic networks using a set of gene expression data measured for the same tissue at multiple time points, and the spatial prior can be applied if the gene expression data are measured for different tissues at the same time point.
Fig. 3.
Illustration of application scenarios of the temporal and spatial priors: (a) networks evolve along with time (temporal prior); (b) networks evolve independently from a common structure (spatial prior).
2.2.1. Temporal prior

To enhance the similarity of the networks between adjacent conditions, we let $e_{ij}$ be subject to the following prior distribution

\[ \pi(e_{ij} \mid q_{ij}) \propto \prod_{k=2}^{M} q_{ij}^{d_{ij}^{(k)}} (1 - q_{ij})^{1 - d_{ij}^{(k)}}, \]  (2.3)

where $d_{ij}^{(k)} = |e_{ij}^{(k)} - e_{ij}^{(k-1)}|$ indicates the change of the status of the edge $(i,j)$ from condition $k-1$ to condition $k$, and $q_{ij}$ is a prior hyperparameter representing the prior probability of edge status changes. In this article, we assume that $q_{ij}$ follows a beta distribution $\mathrm{Beta}(a, b)$, where $a$ and $b$ are pre-specified parameters. Further, we let $\mu_0$ and $\mu_1$ be subject to an improper uniform prior distribution, i.e., $\pi(\mu_0) \propto 1$ and $\pi(\mu_1) \propto 1$, and let $\sigma_0^2$ and $\sigma_1^2$ be subject to an inverted-gamma prior distribution, i.e., $\sigma_0^2, \sigma_1^2 \sim \mathrm{IG}(\alpha, \beta)$, where $\alpha$ and $\beta$ are pre-specified constants. Then the joint posterior distribution of $(e_{ij}, q_{ij}, \mu_0, \mu_1, \sigma_0^2, \sigma_1^2)$ is given by
\[ \pi(e_{ij}, q_{ij}, \mu_0, \mu_1, \sigma_0^2, \sigma_1^2 \mid z_{ij}) \propto \pi(z_{ij} \mid e_{ij}, \mu_0, \mu_1, \sigma_0^2, \sigma_1^2)\, \pi(e_{ij} \mid q_{ij})\, \pi(q_{ij})\, \pi(\mu_0)\, \pi(\mu_1)\, \pi(\sigma_0^2)\, \pi(\sigma_1^2), \]

where the $\pi(\cdot)$'s denote the respective prior distributions. After integrating out the parameters $q_{ij}$, $\mu_0$, $\mu_1$, $\sigma_0^2$, and $\sigma_1^2$, we have the marginal posterior distribution of $e_{ij}$ given by

\[ \pi(e_{ij} \mid z_{ij}) \propto B(a + d_{ij},\, b + M - 1 - d_{ij}) \prod_{c \in \{0,1\}} \frac{(2\pi)^{-(n_c - 1)/2}}{\sqrt{n_c}} \cdot \frac{\beta^{\alpha}\, \Gamma(\alpha + (n_c - 1)/2)}{\Gamma(\alpha)\, (\beta + S_c/2)^{\alpha + (n_c - 1)/2}} \]  (2.4)

when $n_0 \geq 1$ and $n_1 \geq 1$ hold, where $B(\cdot, \cdot)$ denotes the beta function, $d_{ij} = \sum_{k=2}^{M} |e_{ij}^{(k)} - e_{ij}^{(k-1)}|$ denotes the total number of edge status changes, $n_c = \#\{k : e_{ij}^{(k)} = c\}$ denotes the size of cluster $c$, and $S_c = \sum_{k: e_{ij}^{(k)} = c} (z_{ij}^{(k)} - \bar{z}_c)^2$ with $\bar{z}_c$ being the sample mean of the ψ-scores in cluster $c$. When $n_0 = M$ and $n_1 = 0$, or $n_0 = 0$ and $n_1 = M$, all the ψ-scores fall into a single cluster, and the product in (2.4) reduces to the single factor corresponding to the nonempty cluster.
Given $M$ distinct conditions, the total number of possible configurations of $e_{ij}$ is $2^M$. When $M$ is small, we can provide an exhaustive evaluation of the $2^M$ configurations. That is, for each possible configuration of $e_{ij}$, we can calculate its posterior probability and integrated ψ-score exactly. For each possible configuration $e_{ij,t}$, $t = 1, \ldots, 2^M$, we denote the posterior probability by $\pi(e_{ij,t} \mid z_{ij})$ and the integrated ψ-score by $\tilde{z}_{ij}(e_{ij,t})$. According to Stouffer's meta-analysis method (Stouffer and others, 1949), which is also known as the inverse normal method for combining $p$-values (Zaykin, 2011), we define the integrated ψ-score as

\[ \tilde{z}_{ij}(e_{ij}) = \frac{\sum_{k=1}^{M} w_k e_{ij}^{(k)} z_{ij}^{(k)}}{\sqrt{\sum_{k=1}^{M} w_k^2 e_{ij}^{(k)}}} \]  (2.5)

for any nonzero configuration $e_{ij}$, with $\tilde{z}_{ij}(e_{ij}) = 0$ for the zero configuration, where the weight $w_k$ might account for the size or quality of the samples collected under each condition. In this article, we set $w_k = 1$ for all $k$. Such a weighted average score integrates the data information on the edge across all conditions. Then the Bayesian Stouffer integrated ψ-score, or Bayesian integrated ψ-score in short, is given by

\[ \bar{z}_{ij} = \sum_{t=1}^{2^M} \pi(e_{ij,t} \mid z_{ij})\, \tilde{z}_{ij}(e_{ij,t}). \]  (2.6)
When $M$ is large, the $\pi(e_{ij,t} \mid z_{ij})$'s can be estimated with a short MCMC run, say, by running the Gibbs sampler (Geman and Geman, 1984) for a few hundred iterations. Since the MCMC can be run in parallel for different pairs $(i,j)$, the computation is not a big burden in this case.
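The exhaustive evaluation for one candidate edge can be sketched as follows. This is a simplified illustration of our own rather than the authors' implementation: the cluster marginal uses a flat prior on both cluster means with an IG(alpha, beta) prior on the variances, the temporal Beta(a, b) prior is placed on the change probability, and all hyperparameter values shown are illustrative defaults.

```python
import itertools

import numpy as np
from scipy.special import betaln, gammaln

def log_marginal_cluster(z, alpha=2.0, beta=1.0):
    """Log marginal likelihood of the scores z in one cluster, obtained by
    integrating out the cluster mean (flat prior) and variance
    (IG(alpha, beta) prior); returns 0 for an empty cluster so that the
    corresponding factor drops out."""
    m = len(z)
    if m == 0:
        return 0.0
    s = 0.5 * np.sum((z - np.mean(z)) ** 2)
    return (-0.5 * (m - 1) * np.log(2.0 * np.pi) - 0.5 * np.log(m)
            + alpha * np.log(beta) - gammaln(alpha)
            + gammaln(alpha + 0.5 * (m - 1))
            - (alpha + 0.5 * (m - 1)) * np.log(beta + s))

def bayesian_stouffer(z, a=1.0, b=9.0):
    """Enumerate all 2^M edge configurations for one candidate edge under a
    temporal Beta(a, b) prior on the change probability, and return the
    posterior-weighted Stouffer score (unit weights w_k = 1)."""
    z = np.asarray(z, dtype=float)
    M = len(z)
    log_post, scores = [], []
    for conf in itertools.product([0, 1], repeat=M):
        e = np.array(conf)
        d = np.sum(np.abs(np.diff(e)))  # number of edge status changes
        lp = (betaln(a + d, b + M - 1 - d)
              + log_marginal_cluster(z[e == 0])
              + log_marginal_cluster(z[e == 1]))
        n1 = e.sum()
        scores.append(z[e == 1].sum() / np.sqrt(n1) if n1 > 0 else 0.0)
        log_post.append(lp)
    log_post = np.array(log_post)
    w = np.exp(log_post - log_post.max())  # normalize in log space
    w /= w.sum()
    return float(np.dot(w, scores))
```

Since the computation is independent across the $p(p-1)/2$ candidate edges, the loop over edges parallelizes trivially, which is the same observation made above for the MCMC case.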
It is interesting to point out that the Bayesian Stouffer integrated ψ-score $\bar{z}_{ij}$ is different from the conventional Bayesian estimator of $z_{ij}$. The latter is given by

\[ \hat{z}_{ij}(e_{ij}) = \frac{\sum_{k=1}^{M} w_k e_{ij}^{(k)} z_{ij}^{(k)}}{\sum_{k=1}^{M} w_k e_{ij}^{(k)}} \]  (2.7)

and

\[ \hat{z}_{ij} = \sum_{t=1}^{2^M} \pi(e_{ij,t} \mid z_{ij})\, \hat{z}_{ij}(e_{ij,t}). \]  (2.8)

It is easy to see that the Bayesian Stouffer integrated ψ-score amplifies the Bayesian averaged ψ-score (2.7) by a factor between 1 and $\sqrt{M}$. Such amplification makes the two clusters of edges more separable in the scores and, as pointed out in the Proof of Lemma 3.4 (in the supplementary material available at Biostatistics online), helps to improve the power of the proposed method by reducing the false negative error. Also, we would point out that for each configuration, if the edge clustering pattern $e_{ij}$ is correct and $w_k = 1$ for all $k$, then the Stouffer integrated ψ-score $\tilde{z}_{ij}(e_{ij})$ has a constant variance of 1, while the simply averaged ψ-score $\hat{z}_{ij}(e_{ij})$ has a varied variance depending on the number of conditions under which the edge is present. Therefore, the Bayesian Stouffer integrated ψ-scores are more comparable than the Bayesian averaged ψ-scores in edge determination.
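The variance claim can be checked numerically. The short simulation below (our own construction, not from the article) draws unit-variance scores for an edge that is present under all $M$ conditions and compares the empirical variances of the Stouffer-combined and simply averaged scores:

```python
import numpy as np

rng = np.random.default_rng(0)
M, reps = 4, 200_000
# ψ-scores for an edge present under all M conditions: N(mu1, 1) each
z = rng.normal(loc=3.0, scale=1.0, size=(reps, M))

stouffer = z.sum(axis=1) / np.sqrt(M)  # variance stays 1 for any M
average = z.mean(axis=1)               # variance shrinks to 1/M
```

With $M = 4$ the empirical variances come out close to 1 and 1/4, respectively, illustrating why a common threshold can be applied to the Stouffer-combined scores across configurations.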
2.2.2. Spatial prior

To encode our prior knowledge that there exists a common structure from which all the networks evolve independently, we let the $e_{ij}$'s be subject to the prior distribution

\[ \pi(e_{ij} \mid q_{ij}) \propto \prod_{k=1}^{M} q_{ij}^{|e_{ij}^{(k)} - \breve{e}_{ij}|} (1 - q_{ij})^{1 - |e_{ij}^{(k)} - \breve{e}_{ij}|}, \]  (2.9)

where $|e_{ij}^{(k)} - \breve{e}_{ij}|$ indicates the status change of the edge $(i,j)$ at condition $k$ from the common structure, and $\breve{e}_{ij}$ is the mode of $(e_{ij}^{(1)}, \ldots, e_{ij}^{(M)})$ and represents the common status of the edge $(i,j)$ across all networks. With this prior distribution, the posterior distribution $\pi(e_{ij} \mid z_{ij})$ can also be expressed in the form of (2.4), but with $d_{ij} = \sum_{k=1}^{M} |e_{ij}^{(k)} - \breve{e}_{ij}|$ and the beta function term replaced by $B(a + d_{ij}, b + M - d_{ij})$.
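Under the spatial prior, the quantity playing the role of the number of temporal status changes is the number of deviations from the common (modal) status. A minimal helper (our own; the tie-breaking rule toward edge presence is an arbitrary choice for illustration):

```python
import numpy as np

def spatial_changes(e):
    """Return (number of conditions deviating from the majority status,
    majority status) for a configuration e of edge indicators; ties in
    the majority vote are broken toward edge presence (an arbitrary
    illustrative convention)."""
    e = np.asarray(e)
    mode = int(e.sum() * 2 >= len(e))  # majority vote over conditions
    return int(np.sum(e != mode)), mode
```

For example, a configuration in which a single condition deviates from an otherwise shared edge receives a distance of 1 regardless of which condition deviates, in contrast to the temporal prior, where the position of the change matters.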
2.3. Joint edge detection
To jointly estimate the structure of multiple GGMs based on the Bayesian Stouffer integrated ψ-scores (2.6), a multiple hypothesis test can be applied. The multiple hypothesis test classifies the integrated ψ-scores into two classes, presence of edges and absence of edges. In this article, we adopt the empirical Bayesian method developed by Liang and Zhang (2008) for the multiple hypothesis test, which models the integrated ψ-scores by a two-component mixture distribution $f(\bar{z}) = (1 - \lambda) f_0(\bar{z}) + \lambda f_1(\bar{z})$, where $\bar{z}$ denotes an integrated ψ-score; $1 - \lambda$ and $\lambda$ denote the probabilities of edge absence and edge presence, respectively; and $f_0$ and $f_1$ denote the probability density functions of the integrated ψ-scores with edge absence and edge presence, respectively. As in Liang and Zhang (2008), we parameterize $f_0$ by an exponential power distribution, and parameterize $f_1$ by a mixture of exponential power distributions. The parameter $\lambda$ and those contained in $f_0$ and $f_1$ are estimated using the stochastic approximation method by minimizing the Kullback–Leibler divergence between the fitted mixture density and the empirical one. The threshold values for grouping the integrated ψ-scores into the two classes are determined according to the value of $\alpha_2$, a pre-specified false discovery rate level. How to specify the value of $\alpha_2$ will be discussed in Section 2.4. Note that this multiple hypothesis test method allows for dependence between the test statistics, i.e., the integrated ψ-scores for this problem. Other methods that account for the dependence between test statistics, e.g., the two-stage method by Benjamini and Yekutieli (2001), can also be applied here.

Finally, we would like to point out that the empirical Bayesian method used above produces a $q$-value (Storey, 2002) for each potential edge of the multiple graphs. The $q$-value, like the $p$-value for the single hypothesis test, provides an uncertainty measure for each potential edge.
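The empirical Bayes test of Liang and Zhang (2008) fits exponential power mixtures; as a simplified stand-in, the sketch below uses a two-component Gaussian mixture with parameters assumed already fitted, and calls edges in order of increasing posterior null probability (local false discovery rate) until the estimated FDR of the called set would exceed the target level:

```python
import numpy as np
from scipy.stats import norm

def jed_threshold(scores, lam, mu1, sigma1, fdr=0.05):
    """Classify integrated scores into edge absence/presence under a
    fitted mixture (1 - lam) * N(0, 1) + lam * N(mu1, sigma1^2), using a
    local-fdr rule: call edges with the smallest posterior null
    probabilities while their running average stays below `fdr`."""
    scores = np.asarray(scores, dtype=float)
    f0 = norm.pdf(scores, 0.0, 1.0)
    f1 = norm.pdf(scores, mu1, sigma1)
    lfdr = (1.0 - lam) * f0 / ((1.0 - lam) * f0 + lam * f1)
    order = np.argsort(lfdr)
    cum_fdr = np.cumsum(lfdr[order]) / np.arange(1, scores.size + 1)
    n_call = int(np.sum(cum_fdr <= fdr))
    called = np.zeros(scores.size, dtype=bool)
    called[order[:n_call]] = True
    return called
```

This is only a Gaussian stand-in for the exponential power mixtures used in the article, but it illustrates how a single FDR level $\alpha_2$ translates into a threshold on the integrated scores.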
2.4. Parameter setting
FHBIA contains two free parameters, $\alpha_1$ and $\alpha_2$, which refer to the significance levels of the multiple hypothesis tests conducted in correlation screening and JED, respectively. Following the suggestion of Liang and others (2015), we use the default values recommended there; otherwise, the values used will be stated in the context. In general, a high significance level of correlation screening will lead to slightly larger conditioning sets in the calculation of the ψ-partial correlation coefficients, which reduces the risk of missing important variables in the conditioning sets. Including a few false variables in the conditioning sets will not hurt the accuracy of the ψ-partial correlation coefficients much. As shown in Xu and others (2019), the performance of the ψ-learning algorithm can be quite robust to the choice of $\alpha_1$. The setting of $\alpha_2$, in contrast, is quite free; it determines the sparsity of the resulting graphs, and a smaller value of $\alpha_2$ might be used if sparser graphs are preferred.

In addition to the two free parameters, FHBIA contains four prior hyperparameters, $a$, $b$, $\alpha$, and $\beta$. Since the probability $q_{ij}$ usually takes a small value, we choose $a$ and $b$ accordingly for its prior distribution Beta($a$, $b$). Since the variance of the ψ-scores is approximately equal to 1 under the null hypothesis that the true partial correlation coefficient is equal to 0, we choose $\alpha$ and $\beta$ accordingly for the prior distribution IG($\alpha$, $\beta$). The same prior hyperparameter settings have been used in all examples of this article.
2.5. Consistency
Under the faithfulness assumption, the sparsity assumption, and other regularity conditions for the joint Gaussian distribution, e.g., the dimension $p$ is allowed to grow exponentially with the sample size $n$ at a rate controlled by some constant, and the largest eigenvalue of the covariance matrix can grow with $p$ at a restricted rate, Liang and others (2015) showed that the multiple hypothesis test based on the ψ-scores produces a consistent estimate of the GGM with data observed under a single condition. Essentially, Liang and others (2015) showed that the ψ-scores are separable in probability for the pairs of nodes with edge absence and edge presence.

To accommodate the change from a single condition to multiple conditions, we modified the assumptions of Liang and others (2015) and added an assumption about the number of conditions $M$. Under the new set of assumptions, we proved that the FHBIA method is consistent.
Theorem 2.1
Assume that the regularity conditions given in the supplementary material available at Biostatistics online hold. Then there exists a threshold value $\zeta$ such that

\[ P\left( \hat{G}_{\zeta}^{(k)} = G^{(k)},\ k = 1, \ldots, M \right) \to 1 \quad \text{as } n \to \infty, \]

where $G^{(k)}$ denotes the true network under condition $k$, $\hat{G}_{\zeta}^{(k)}$ denotes the FHBIA estimator of $G^{(k)}$, and $\zeta$ denotes a threshold value of the Bayesian integrated ψ-scores based on which the edges are determined for all $M$ graphs.
The proof of the theorem is given in the supplementary material available at Biostatistics online. Theorem 2.1 implies that for all graphs there exists a common threshold with respect to which the Bayesian integrated ψ-scores are separable in probability for the pairs of nodes with edge presence and edge absence. Here, we would like to highlight three points. First, as indicated by our proof [see inequality (S29) in the Proof of Lemma 3.5 in the supplementary material available at Biostatistics online], the data integration step can indeed improve the power of the proposed method. Second, following from the inequalities (S29) and (S30) and the condition given in the supplementary material available at Biostatistics online, we can conclude the sign consistency of the estimator; i.e., for any edge of the graph, the sign of the Bayesian integrated ψ-score is the same as that of the true partial correlation coefficient when the sample size becomes large. Third, the assumption imposed on $M$ is rather weak; it only requires $M$ to satisfy a mild condition involving some positive constants defined in the other assumptions (see the supplementary material available at Biostatistics online). This is consistent with our numerical results; the method can perform very well even with a small value of $M$.
3. Simulation studies
3.1. Scenario with temporal priors
To illustrate the performance of the proposed FHBIA method under the scenario with temporal priors, we consider three types of network structures, namely, autoregressive (AR), scale-free, and hub, which are all allowed to change slightly with the evolvement of conditions. For all types of structures, we fixed the dimension $p$ and the number of conditions $M = 4$, and varied the sample size $n$ over a few values up to 500. We let $C^{(k)}$ denote the precision matrix at condition $k$ for $k = 1, \ldots, M$. At each condition $k$, we generated 10 independent datasets of size $n$ by drawing from the multivariate Gaussian distribution $N_p(0, (C^{(k)})^{-1})$.
For the AR network structure, the precision matrix at condition 1 is the banded matrix

\[ C^{(1)} = \begin{pmatrix} 1 & 0.5 & 0.25 & & & \\ 0.5 & 1 & 0.5 & 0.25 & & \\ 0.25 & 0.5 & 1 & 0.5 & \ddots & \\ & 0.25 & 0.5 & \ddots & \ddots & 0.25 \\ & & \ddots & \ddots & 1 & 0.5 \\ & & & 0.25 & 0.5 & 1 \end{pmatrix}, \]  (3.10)

which represents an AR(2) graphical model. To construct $C^{(2)}$, we employed the following random edge deleting–adding procedure: we first randomly removed 5% of the edges in $C^{(1)}$ by setting the corresponding nonzero elements to 0, and then added the same number of edges at random by replacing zeros in $C^{(1)}$ with values drawn from a uniform distribution defined on a range bounded away from zero; to ensure $C^{(2)}$ to be positive definite, we set the diagonal elements of $C^{(2)}$ to be the absolute value of the smallest eigenvalue of $\tilde{C}^{(2)}$ plus a small positive number, where $\tilde{C}^{(2)}$ is obtained from $C^{(2)}$ by setting the diagonal elements to zero. In the same procedure, we generated $C^{(3)}$ conditioned on $C^{(2)}$ and then generated $C^{(4)}$ conditioned on $C^{(3)}$. We note that similar procedures have been used in Peterson and others (2015) and Lin and others (2017) to generate multiple precision matrices. For the scale-free and hub structures, we first generated the precision matrix $C^{(1)}$ using the R package "huge," then applied the random edge deleting–adding procedure to generate the $C^{(k)}$'s for $k = 2, 3, 4$ in a sequential manner.
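The random edge deleting–adding procedure can be sketched as follows (in Python; the magnitude range for new edges and the diagonal offset are illustrative placeholders, not the exact values used in the simulations):

```python
import numpy as np

def perturb_precision(C, frac=0.05, low=0.2, high=0.4, eps=0.1, seed=0):
    """Remove a fraction of the existing off-diagonal edges of a precision
    matrix C, add the same number of new edges with magnitudes drawn
    uniformly from [low, high] with random sign (illustrative range), and
    restore positive definiteness by resetting the diagonal."""
    rng = np.random.default_rng(seed)
    p = C.shape[0]
    A = C.copy()
    iu = np.triu_indices(p, k=1)
    present = np.flatnonzero(A[iu] != 0)
    absent = np.flatnonzero(A[iu] == 0)
    n_move = max(1, int(frac * len(present)))
    drop = rng.choice(present, n_move, replace=False)
    add = rng.choice(absent, n_move, replace=False)
    vals = rng.uniform(low, high, n_move) * rng.choice([-1, 1], n_move)
    for k in drop:
        i, j = iu[0][k], iu[1][k]
        A[i, j] = A[j, i] = 0.0
    for k, v in zip(add, vals):
        i, j = iu[0][k], iu[1][k]
        A[i, j] = A[j, i] = v
    # diagonal = |smallest eigenvalue of the off-diagonal part| + eps,
    # which shifts all eigenvalues above eps > 0
    np.fill_diagonal(A, 0.0)
    lam_min = np.linalg.eigvalsh(A).min()
    np.fill_diagonal(A, abs(lam_min) + eps)
    return A
```

Since the diagonal shift adds $|\lambda_{\min}| + \varepsilon$ to every eigenvalue of the off-diagonal part, the resulting matrix is guaranteed positive definite while the edge pattern (and hence the graph) is exactly the perturbed one.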
The FHBIA method was first applied to this example. To assess the performance of the method, we plot the precision–recall curves in Figure S1 of the supplementary material available at Biostatistics online. The same rule applies to other tables and figures included in the supplementary material available at Biostatistics online. The precision and recall are defined by $\mathrm{precision} = \mathrm{TP}/(\mathrm{TP} + \mathrm{FP})$ and $\mathrm{recall} = \mathrm{TP}/(\mathrm{TP} + \mathrm{FN})$, where TP, FP, and FN denote true positives, false positives, and false negatives, respectively, as defined in Table S1 of the supplementary material available at Biostatistics online. To draw the precision–recall curves shown in Figure S1 of the supplementary material available at Biostatistics online, we fixed the significance level of correlation screening at its default value and varied the value of $\alpha_2$, the significance level of JED. The precision and recall values were calculated by accumulating the TP, FP, FN, and TN values across all $M$ conditions. In this article, we employ the precision–recall curve instead of the receiver operating characteristic (ROC) curve because the classification problem involved in recovering the network structure is severely imbalanced, containing a large number of negative cases due to the network sparsity. As pointed out by Saito and Rehmsmeier (2015) and Davis and Goadrich (2006), the precision–recall curve can be more informative than the ROC curve in the imbalanced classification scenario.
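The accumulation of counts across conditions can be written compactly; a minimal helper of our own, operating on 0/1 adjacency matrices:

```python
import numpy as np

def precision_recall(est_graphs, true_graphs):
    """Precision and recall accumulated over all conditions, with TP, FP,
    and FN counted on the upper-triangular adjacency entries (the graphs
    are undirected)."""
    tp = fp = fn = 0
    for E, G in zip(est_graphs, true_graphs):
        iu = np.triu_indices(G.shape[0], k=1)
        e, g = E[iu].astype(bool), G[iu].astype(bool)
        tp += np.sum(e & g)
        fp += np.sum(e & ~g)
        fn += np.sum(~e & g)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall
```

Sweeping the JED significance level and recording one (recall, precision) pair per setting traces out the curves reported in the supplementary figures.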
For comparison, the MRF method (Lin and others, 2017), fused graphical Lasso (FGL), and group graphical Lasso (GGL) (Danaher and others, 2014) were applied to this example. The Matlab code of MRF is available at https://github.com/linzx06/Spatial-and-Temporal-GGM, and both FGL and GGL are implemented in the R package JGL. For a thorough comparison, we also applied the original ψ-learning algorithm to this example, for which the models under each condition were estimated separately. The results are summarized in Figure S1 and Table S2 of the supplementary material available at Biostatistics online. The comparison indicates that FHBIA significantly outperforms the existing methods, especially when the sample size is small. When the sample size is large, FHBIA, MRF, FGL, and GGL tend to perform similarly for the scale-free and hub networks. It is not surprising that FHBIA always outperforms the separated ψ-learning algorithm, which implies the importance of data integration for such high-dimensional problems.
Table S3 of the supplementary material available at Biostatistics online reports the CPU times cost by FGL, GGL, MRF, separated ψ-learning, and FHBIA for one dataset of AR(2) structure, where the CPU time was measured on a Linux desktop with an Intel Core i7-4790 CPU @3.60GHz. All computations reported in this article were done on the same computer. The CPU times of these methods for the other two graph structures are about the same. FGL is extremely slow for this example, as it needs to search over a grid of possible values of the regularization parameters $(\lambda_1, \lambda_2)$ for an optimal setting. The grid we used consists of 100 different pairs of $(\lambda_1, \lambda_2)$. Moreover, for each pair of $(\lambda_1, \lambda_2)$, it needs to solve a generalized fused Lasso problem, for which a closed-form solution does not exist when $M$ is greater than 2. Solving the generalized fused Lasso problem is time consuming, with a computational complexity that grows rapidly with the dimension $p$. GGL is better, as there exists a closed-form solution to the regularized parameter optimization problem under each setting of $(\lambda_1, \lambda_2)$, although the optimal setting also needs to be searched over a grid of 100 points. The computational complexity of MRF is given in Lin and others (2017); the cost of FHBIA is dominated by the exhaustive evaluation of the $2^M$ edge configurations for each of the $p(p-1)/2$ candidate edges, and thus FHBIA can be pretty fast for a small value of $M$. The separated ψ-learning is a little more time consuming than FHBIA because it needs to conduct multiple hypothesis tests under each condition.
3.2. Scenario with spatial priors
As in the scenario with temporal priors, we considered three types of network structures: AR(2), scale-free, and hub. For each type of structure, we fixed the dimension $p$ and set the number of conditions to $M = 5$, and tried two sample sizes. For AR(2), we first generated a common precision matrix $C^{(0)}$ according to (3.10). Conditioned on $C^{(0)}$, we generated the precision matrices $C^{(k)}$, $k = 1, \ldots, M$, independently using the random edge deleting–adding procedure described in the scenario of temporal priors. For the other two types of structures, we generated the common precision matrix $C^{(0)}$ using the R package huge, and then generated $C^{(k)}$, $k = 1, \ldots, M$, independently using the random edge deleting–adding procedure. Given the precision matrices, we then generated 10 independent datasets of size $n$ by drawing from the multivariate Gaussian distribution $N_p(0, (C^{(k)})^{-1})$ for each condition $k$.
The FHBIA, MRF, FGL, GGL, separated ψ-learning, and graphical EM (Xie and others, 2016) methods were applied to this example. The graphical EM algorithm was specially designed for jointly estimating multiple dependent Gaussian graphical models under this scenario.

Figure S2 of the supplementary material available at Biostatistics online shows the precision–recall curves produced for two datasets by FHBIA, MRF, FGL, GGL, separated ψ-learning, and graphical EM. Table S4 of the supplementary material available at Biostatistics online summarizes the performance of these methods for all simulated datasets of this example. The comparison indicates that FHBIA significantly outperforms all the other methods, especially when the sample size is small.

Table S5 of the supplementary material available at Biostatistics online reports the CPU times cost by MRF, FGL, GGL, separated ψ-learning, graphical EM, and FHBIA for one dataset of AR(2) structure. The CPU times for the other two graph structures are about the same. For FGL, this example is even more time consuming than the previous one, although it was run under exactly the same settings for the two examples. One reason is that $M$ has increased from 4 to 5. For FHBIA, the CPU time is not much increased compared with the previous example.
4. TEDDY data analysis
This section applies the FHBIA method to the mRNA gene expression data collected in the TEDDY study. In the study, to reduce potential bias and retain study power while reducing the costs by limiting the number of samples requiring laboratory analyses, the gene expression data were collected from a nested matched case–control cohort. A subject who developed either of two primary outcomes, persistent confirmed islet autoimmunity (i.e., the presence of one confirmed autoantibody, GADA65A, IA-2A, or IAA, on two or more consecutive samples) and/or T1D, was defined as a case. The controls were randomly selected among cohort members who had not yet developed the disease at the time a case was diagnosed. For each subject, the gene expression data were collected at multiple time points within 4 years of age. Refer to Lee and others (2014) for a detailed description of the study. Our goal is to integrate all the data to construct one gene network under each distinct condition.

The dataset consists of 21 285 genes and 742 samples collected at multiple time points from a total of 313 subjects. Among the 742 samples, half are for the cases and half are for the controls. The dataset also contains some external variables for each patient, which include age (the time of data collection), gender, race, race ethnicity, season of birth, number of older siblings, and country. To simplify the analysis, we first filtered out the non-differentially expressed genes across the case and control conditions. This was done by conducting a paired $t$-test for each gene at each time point and then applying the multiple hypothesis test method by Liang and Zhang (2008) to identify the set of genes that are significantly differentially expressed under the two conditions at least at one time point. With this filtering process, 572 genes were selected for further study. Figure S3 of the supplementary material available at Biostatistics online shows the histogram of the ages of the samples. Based on this histogram, we selected only the samples falling into the first nine groups for further analysis, where each mode of the histogram is treated as a group. The respective group sizes are 29, 40, 49, 43, 32, 27, 27, 23, and 21, which are the same for both the case and the control. Since the samples were grouped by age, the group index can be understood as the time of experiments. In grouping the samples, we have ensured that within each group each sample corresponds to a different patient, and thus the samples within the same group can be treated as mutually independent. Since the sample size of each group is small, we set $\alpha_1$ and $\alpha_2$ to values smaller than the defaults.
To adjust for the effects of external variables, we adopted the method proposed by Liang and others (2015). Let $z^{(k)}$ denote the vector of external variables observed under condition $k$. To adjust for their effects, we can replace the empirical correlation coefficient used in the correlation screening step by the p-value obtained in testing the hypotheses $H_0: \beta = 0$ versus $H_1: \beta \neq 0$ for the regression

$$X_i^{(k)} = \alpha + X_j^{(k)} \beta + z^{(k)} \gamma + \epsilon^{(k)}, \tag{4.11}$$

where $X_i^{(k)}$ denotes the expression value of gene $i$ measured under condition $k$, and $\epsilon^{(k)}$ denotes a vector of Gaussian random errors. Similarly, we can replace the ψ-partial correlation coefficient calculated in the ψ-score calculation step by the p-value obtained in testing the hypotheses $H_0: \beta = 0$ versus $H_1: \beta \neq 0$ for the regression

$$X_i^{(k)} = \alpha + X_j^{(k)} \beta + X_{S_{ij}}^{(k)} \eta + z^{(k)} \gamma + \epsilon^{(k)}, \tag{4.12}$$

where $S_{ij}$ is the separator of genes $i$ and $j$ under condition $k$. With the p-values, we can define the adjusted ψ-score as $\psi_{ij}^{(k)} = \Phi^{-1}(1 - q_{ij}^{(k)})$, where $q_{ij}^{(k)}$ is the p-value obtained from equation (4.12) for edge $(i,j)$ under condition $k$.
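A minimal sketch of this covariate adjustment, assuming synthetic data: the p-value for the coefficient of gene j in an ordinary least-squares regression with external covariates is computed and mapped to a probit-scale score. The function name and setup are illustrative, not from the equSA package.

```python
# Sketch of covariate adjustment: test H0: beta = 0 in the regression of
# gene i on gene j plus external covariates z, then map the p-value to a
# probit-scale score. Names and data are illustrative.
import numpy as np
from scipy import stats

def adjusted_score(x_i, x_j, z):
    """p-value for the coefficient of x_j, and Phi^{-1}(1 - p) score."""
    n = len(x_i)
    X = np.column_stack([np.ones(n), x_j, z])  # intercept, gene j, covariates
    beta, *_ = np.linalg.lstsq(X, x_i, rcond=None)
    resid = x_i - X @ beta
    df = n - X.shape[1]
    sigma2 = resid @ resid / df                # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)      # covariance of the OLS estimates
    t = beta[1] / np.sqrt(cov[1, 1])           # t-statistic for beta
    p = 2 * stats.t.sf(abs(t), df)             # two-sided p-value
    return p, stats.norm.ppf(1 - p)            # probit-transformed score

rng = np.random.default_rng(1)
n = 100
z = rng.normal(size=(n, 2))                    # two external covariates
x_j = rng.normal(size=n)
x_i = 0.8 * x_j + z @ np.array([0.5, -0.3]) + rng.normal(size=n)
p, score = adjusted_score(x_i, x_j, z)
```

A small p-value (large score) indicates a strong association between genes i and j after the covariate effects are accounted for.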
For this dataset, the effects of all available demographic variables, including age (the time of data collection), gender, race, race ethnicity, season of birth, number of older siblings, and country, have been adjusted for. With the adjusted ψ-scores, the FHBIA method is ready to be applied to construct the gene networks. Given the complexity of the dataset, which contains case and control groups and multiple time points for each group, we calculated the integrated ψ-scores in two steps. First, we integrated the ψ-scores across the nine time points under the case and control conditions separately. Then, for each time point, we integrated the ψ-scores across the case and control conditions. In this way, all information from the data collected under the 18 conditions was integrated. Figure 1 shows a schematic diagram of this two-step procedure. Finally, we applied the multiple hypothesis test to the Bayesian integrated ψ-scores to determine the structure of the gene networks under the 18 conditions. The total CPU time cost by FHBIA was 19.2 h, which is rather long because the number of conditions is large. For an even larger number of conditions, we might resort to MCMC for estimating the posterior probabilities.
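The flavor of combining edge-wise evidence across conditions can be conveyed with a simple Stouffer-type combination of z-scores (Stouffer and others, 1949; Zaykin, 2011). The actual Bayesian integration in FHBIA is more elaborate, and the values below are made up for illustration.

```python
# Stouffer-type combination of edge-wise z-scores across K conditions,
# a simplified stand-in for the Bayesian integration step of FHBIA.
import numpy as np

def stouffer(z, w=None):
    """Combine z-scores with optional weights (Zaykin, 2011)."""
    z = np.asarray(z, dtype=float)
    w = np.ones_like(z) if w is None else np.asarray(w, dtype=float)
    return (w * z).sum() / np.sqrt((w ** 2).sum())

# One edge observed under K = 18 conditions (9 time points x case/control);
# identical scores chosen purely for illustration.
z_scores = np.full(18, 1.0)
combined = stouffer(z_scores)              # 18 / sqrt(18) = sqrt(18)
```

Consistent weak evidence across many conditions thus accumulates into strong combined evidence, which is the intuition behind integrating the 18 conditions jointly.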
Figure 4 shows the networks constructed by FHBIA for the case samples at the nine time points. The networks identify quite a few hub genes, that is, genes with high connectivity. Table 1 shows the top 5 hub genes identified at each time point for the case samples. The lists of hub genes are quite stable. For example, RPS26P11 and RPS26 consistently appear as the top 2 genes at all time points, the gene ADAM10 appears at five out of nine time points, and quite a few genes appear two or more times, such as PRF1, POGZ, BCL11B, GGNBP2, and TMEM159. Note that RPS26P11 is a pseudogene representing a segment of the gene RPS26.
Fig. 4.
Gene networks produced by FHBIA for the case TEDDY samples at nine time points. Red edges denote connections newly appearing in the current network compared with the network at the previous time point; blue edges denote connections that disappear in the network at the next time point; green edges denote connections that are both newly appearing and disappearing; gray edges denote connections unchanged between the current network and the network at the previous time point. (a) Time 1, (b) time 2, (c) time 3, (d) time 4, (e) time 5, (f) time 6, (g) time 7, (h) time 8, and (i) time 9.
Table 1.
Top 5 hub genes identified by FHBIA for the case TEDDY samples at nine time points: "Links" denotes the number of links of the gene to other genes, t is the index of the time points, * indicates that other genes have the same number of links as this gene, and † indicates that the gene has been verified as a T1D-related gene in the literature
Case group:

| *t* | Gene | Links | *t* | Gene | Links | *t* | Gene | Links |
|---|---|---|---|---|---|---|---|---|
| 1 | RPS26 | 104 | 2 | RPS26 | 68 | 3 | RPS26 | 64 |
| | RPS26P11 | 40 | | RPS26P11 | 15 | | RPS26P11 | 12 |
| | ADAM10 | 4 | | ADAM10 | 5 | | ADAM10 | 5 |
| | POGZ | 3 | | PRF1 | 4 | | U2SURP | 4 |
| | TMEM159* | 3 | | POGZ | 3 | | BCL11B* | 3 |
| 4 | RPS26 | 99 | 5 | RPS26 | 91 | 6 | RPS26 | 86 |
| | RPS26P11 | 14 | | RPS26P11 | 18 | | RPS26P11 | 42 |
| | ADAM10 | 6 | | ADAM10 | 4 | | BCL11B | 3 |
| | BCL11B | 3 | | BCL11B | 3 | | GNPTG | 3 |
| | POGZ* | 3 | | POGZ* | 3 | | GGNBP2 | 3 |
| 7 | RPS26 | 78 | 8 | RPS26 | 70 | 9 | RPS26 | 61 |
| | RPS26P11 | 46 | | RPS26P11 | 39 | | RPS26P11 | 30 |
| | BCL11B | 3 | | PRF1 | 4 | | TMEM159 | 3 |
| | TMEM159 | 3 | | BCL11B | 3 | | GGNBP2 | 3 |
| | GGNBP2 | 3 | | GGNBP2 | 3 | | OGT* | 2 |
Table 1 includes 11 different genes in total. Among them, 9 genes have been verified in the literature to be T1D-associated. For example, Schadt and others (2008) reported that RPS26 is a T1D causal gene, and Ma and Hart (2013) reported that the gene O-GlcNAc transferase (OGT) is directly linked to many metabolic diseases, including diabetes. Other than recovering verified T1D-associated genes, we also have some new findings, such as the gene PRF1. Orilieri and others (2008) claimed that PRF1 variations are susceptibility factors for T1D development. In Table 1, PRF1 appears as a hub gene twice, which suggests that the connection between PRF1 and T1D might be worth further exploration. Moreover, we also identified some connection changes in the networks. As shown in Figure 4, the newly appearing and disappearing connections are marked in different colors at each time point, which reveals some evolution patterns of the network.
For comparison, the GGL method was also applied to this example, with the regularization parameters chosen according to the minimum AIC criterion. The total CPU time cost by the method was 20.2 h. FGL was not applied to this example, as it would have taken an extremely long CPU time. Figure S4 available at Biostatistics online shows the networks constructed by GGL for the case samples at all nine time points, and Table S6 available at Biostatistics online shows the top 5 hub genes identified by GGL at each time point for the case samples. The lists of hub genes are quite stable, consisting of only seven different genes. Among the seven, only three genes, RPS26, OGT, and JMJD1C, have been verified in the literature as T1D-associated. Moreover, as shown in Figure S4 available at Biostatistics online, the hub genes in the networks are almost identical at each time point. In summary, FHBIA tends to outperform GGL on this real data example, as it identifies more hub genes that are associated with T1D.
From the perspective of data analysis, one might also be interested in estimating the gene networks constructed from the controls, as well as the differences between the networks from the cases and controls. For comparing the networks from the cases and controls, we can adopt the method described in Section 6 of Liang and others (2015). However, since that method requires the two networks under comparison to be independent, the sample information from the cases and controls should not be integrated in this case. We leave this work for the future.
5. Discussion
We have proposed the FHBIA method for jointly estimating multiple GGMs under distinct conditions and applied it to the TEDDY data. The FHBIA method consists of a few important steps: first summarize the graph structure information contained in the data using the ψ-learning algorithm, then integrate the information via a meta-analysis procedure under the Bayesian framework, and finally determine the structures of the multiple graphs via a multiple hypothesis test. Compared to the existing methods, FHBIA has a few significant advantages. First, FHBIA includes a meta-analysis procedure to explicitly integrate information across distinct conditions. In contrast, the existing methods integrate information through prior distributions or penalty functions, which is often less efficient. Second, FHBIA can be run very fast, especially when the number of conditions K is small. The overall computational complexity of FHBIA includes a factor of 2^K, the total number of possible configurations of an edge across all K conditions. When K is large, we need to resort to MCMC for an efficient estimation of the posterior probabilities; since the posterior probabilities can be estimated for each edge independently, this step can be done in parallel. In addition, we note that the correlation coefficients and ψ-scores can also be calculated in parallel. Hence, the whole method can be executed very fast on a parallel architecture. Moreover, instead of working on the original data, the Bayesian integration step works on the edge-wise ψ-scores, which avoids inverting high-dimensional covariance matrices and thus can be very fast. Note that, in calculating the ψ-scores, the ψ-learning algorithm also successfully avoids inverting high-dimensional covariance matrices through correlation screening. Third, the empirical Bayesian method that FHBIA employs for multiple hypothesis tests produces a q-value (Storey, 2002) for each potential edge of the multiple graphs. The q-value provides an uncertainty measure for each potential edge, which is beyond the ability of many of the existing methods, especially when the dimension is large.
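As a toy illustration of the 2^K factor mentioned above, the following snippet (not part of equSA) enumerates the presence/absence configurations of a single edge across K conditions; the count doubles with each added condition, which is why MCMC becomes attractive for large K.

```python
# Toy enumeration of the 2^K presence/absence configurations of one edge
# across K conditions; this is the factor appearing in the complexity of
# the exact Bayesian integration step.
from itertools import product

def edge_configurations(K):
    """Return all 0/1 presence patterns of a single edge across K conditions."""
    return list(product([0, 1], repeat=K))

configs = edge_configurations(4)           # 2^4 = 16 patterns
```

For the TEDDY application with K = 18, exact enumeration over 2^18 = 262,144 patterns per edge is still feasible, but the cost grows exponentially beyond that.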
The FHBIA method has a very flexible framework, which can be easily extended to the joint estimation of multiple mixed graphical models. For example, consider the scenario where the data consist of only Gaussian and multinomial random variables, for which the joint distribution is well defined (Lee and Hastie, 2015). For such mixed data, the ψ-learning algorithm can be performed under the framework of generalized linear models; that is, we can replace the correlation coefficients and ψ-partial correlation coefficients used in the algorithm by the corresponding p-values obtained in the marginal variable screening tests (Fan and Song, 2010) and conditional independence tests. Then, we can replace the ψ-scores by the z-scores corresponding to the p-values of the conditional independence tests. For other types of continuous random variables, we can apply the nonparanormal transformation (Liu and others, 2009) to Gaussianize them prior to the application of the FHBIA method. For certain types of discrete data, e.g., next-generation sequencing data, we can apply the transformations developed in Jia and others (2017) to continuize and Gaussianize them prior to the application of the FHBIA method.
Finally, we would like to mention that the sparsity assumption imposed on the networks does not limit the applications of the FHBIA method. Sparsity is just a device adopted for making statistical inference when there is not a sufficient amount of data available, e.g., in dealing with small-n-large-p problems. In this article, the sparsity assumption A4 (in the supplementary material available at Biostatistics online) is given in terms of the sample size n, which leads to a neighborhood size of O(n/log n) for each node; see Lemma 3.2 and the ψ-learning algorithm presented in the supplementary material available at Biostatistics online. Therefore, when the sample size n is large enough, e.g., when using the UK Biobank data, which consist of over 500K samples, to construct GRNs, FHBIA can be applied to learn dense genetic networks (Boyle and others, 2017). For the UK Biobank data, the sample size is much larger than the number of genes (about 20K) we usually consider, and thus the neighborhood truncation in the correlation screening step of ψ-learning will not be triggered. When the data size is not large enough, FHBIA will identify only the strongest connections in terms of partial correlation coefficients. This property is inherited from the ψ-learning algorithm.
Acknowledgments
The authors thank the editor, associate editor, two referees, and Dr. George Tseng for their constructive comments which have led to significant improvement of this article. Members of the TEDDY Study Group are listed in the Supplementary File. Conflict of Interest: None declared.
6. Software
The software accompanied with this article is available as a module called JGGM in the R package equSA at https://cran.r-project.org/web/packages/equSA/index.html.
Funding
Leona M. and Harry B. Helmsley Charitable Trust (2015PG-T1D050); F.L.'s research was supported in part by the grants USF-ITN-15-11-MH, DMS-1612924, DMS/NIH R01-GM117597, and NIH R01-GM126089. The TEDDY Study is funded by U01 DK63829, U01 DK63861, U01 DK63821, U01 DK63865, U01 DK63863, U01 DK63836, U01 DK63790, UC4 DK63829, UC4 DK63861, UC4 DK63821, UC4 DK63865, UC4 DK63863, UC4 DK63836, UC4 DK95300, UC4 DK100238, UC4 DK106955, UC4 DK112243, UC4 DK117483, and Contract No. HHSN267200700014C from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Institute of Allergy and Infectious Diseases (NIAID), National Institute of Child Health and Human Development (NICHD), National Institute of Environmental Health Sciences (NIEHS), Centers for Disease Control and Prevention (CDC), and JDRF; and, in part, by NIH/NCATS Clinical and Translational Science Awards to the University of Florida (UL1 TR000064) and the University of Colorado (UL1 TR001082).
References
- Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29, 1165–1188.
- Boyle, E. A., Li, Y. I. and Pritchard, J. K. (2017). An expanded view of complex traits: from polygenic to omnigenic. Cell 169, 1177–1186.
- Chun, H., Zhang, X. and Zhao, H. (2015). Gene regulation network inference with joint sparse Gaussian graphical models. Journal of Computational and Graphical Statistics 24, 954–974.
- Danaher, P., Wang, P. and Witten, D. M. (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76, 373–397.
- Davis, J. and Goadrich, M. (2006). The relationship between Precision-Recall and ROC curves. In: Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, Pennsylvania, USA. New York, NY: ACM, pp. 233–240.
- Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. Annals of Statistics 38, 3567–3604.
- Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distribution and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721–741.
- Guo, J., Levina, E., Michailidis, G. and Zhu, J. (2011). Joint estimation of multiple graphical models. Biometrika 98, 1–15.
- Jia, B., Xu, S., Xiao, G., Lamba, V. and Liang, F. (2017). Learning gene regulatory networks from next generation sequencing data. Biometrics 73, 1221–1230.
- Lee, H. S., Burkhardt, B. R., McLeod, W., Smith, S., Eberhard, C., Lynch, K., Hadley, D., Rewers, M., Simell, O., She, J. X. and others (2014). Biomarker discovery study design for type 1 diabetes in The Environmental Determinants of Diabetes in the Young (TEDDY) study. Diabetes/Metabolism Research and Reviews 30, 424–434.
- Lee, J. and Hastie, T. J. (2015). Learning the structure of mixed graphical models. Journal of Computational and Graphical Statistics 24, 230–253.
- Liang, F., Song, Q. and Qiu, P. (2015). An equivalent measure of partial correlation coefficients for high dimensional Gaussian graphical models. Journal of the American Statistical Association 110, 1248–1265.
- Liang, F. and Zhang, J. (2008). Estimating the false discovery rate using the stochastic approximation algorithm. Biometrika 95, 961–977.
- Lin, Z., Wang, T., Yang, C. and Zhao, H. (2017). On joint estimation of Gaussian graphical models for spatial and temporal data. Biometrics 73, 769–779.
- Liu, H., Lafferty, J. and Wasserman, L. (2009). The nonparanormal: semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research 10, 2295–2328.
- Ma, J. and Hart, G. W. (2013). Protein O-GlcNAcylation in diabetes and diabetic complications. Expert Review of Proteomics 10, 365–380.
- Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Annals of Statistics 34, 1436–1462.
- Orilieri, E., Cappellano, G., Clementi, R., Cometa, A., Ferretti, M., Cerutti, E., Cadario, F., Martinetti, M., Larizza, D., Calcaterra, V. and others (2008). Variations of the perforin gene in patients with type 1 diabetes. Diabetes 57, 1078–1083.
- Peterson, C., Stingo, F. C. and Vannucci, M. (2015). Bayesian inference of multiple Gaussian graphical models. Journal of the American Statistical Association 110, 159–174.
- Qiu, H., Han, F., Liu, H. and Caffo, B. (2016). Joint estimation of multiple graphical models from high dimensional time series. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78, 487–504.
- Saito, T. and Rehmsmeier, M. (2015). The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One 10, e0118432.
- Schadt, E. E., Molony, C., Chudin, E., Hao, K., Yang, X., Lum, P. Y., Kasarskis, A., Zhang, B., Wang, S., Suver, C. and others (2008). Mapping the genetic architecture of gene expression in human liver. PLoS Biology 6, e107.
- Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64, 479–498.
- Stouffer, S. A., Suchman, E. A., Devinney, L. C., Star, S. A. and Williams, R. M., Jr. and others (1949). The American Soldier: Adjustment During Army Life (Studies in Social Psychology in World War II), Volume 1. Oxford, England: Princeton University Press.
- Xie, Y., Liu, Y. and Valdar, W. (2016). Joint estimation of multiple dependent Gaussian graphical models with applications to mouse genomics. Biometrika 103, 493–511.
- Xu, S., Jia, B. and Liang, F. (2019). Learning moral graphs in construction of high-dimensional Bayesian networks for mixed data. Neural Computation 31, 1183–1214.
- Zaykin, D. V. (2011). Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. Journal of Evolutionary Biology 24, 1836–1841.
- Zhou, S., Lafferty, J. and Wasserman, L. (2010). Time varying undirected graphs. Machine Learning 80, 295–319.