EURASIP Journal on Bioinformatics and Systems Biology. 2010 Aug 9;2010(1):947564. doi: 10.1155/2010/947564

A Hypothesis Test for Equality of Bayesian Network Models

Anthony Almudevar
PMCID: PMC3171365  PMID: 20981254


Bayesian network models are commonly used to model gene expression data. Some applications require a comparison of the network structure of a set of genes between varying phenotypes. In principle, separately fit models can be directly compared, but it is difficult to assign statistical significance to any observed differences. There would therefore be an advantage to the development of a rigorous hypothesis test for homogeneity of network structure. In this paper, a generalized likelihood ratio test based on Bayesian network models is developed, with significance level estimated using permutation replications. In order to be computationally feasible, a number of algorithms are introduced. First, a method for approximating multivariate distributions due to Chow and Liu (1968) is adapted, permitting the polynomial-time calculation of a maximum likelihood Bayesian network with maximum indegree of one. Second, sequential testing principles are applied to the permutation test, allowing significant reduction of computation time while preserving reported error rates used in multiple testing. The method is applied to gene-set analysis, using two sets of experimental data, and some advantage to a pathway modelling approach to this problem is reported.

1. Introduction

Graphical models play a central role in modelling genomic data, largely because the pathway structure governing the interactions of cellular components induces statistical dependence naturally described by directed or undirected graphs [1–3]. These models vary in their formal structure. While a Boolean network can be interpreted as a set of state transition rules, Bayesian or Markov networks reduce to static multivariate densities on random vectors extracted from genomic data. Such densities are designed to model coexpression patterns resulting from functional cooperation. Our concern will be with this type of multivariate model. Although the ideas presented here extend naturally to various forms of genomic data, to fix ideas we will refer specifically to multivariate samples of microarray gene expression data.

In this paper, we consider the problem of comparing network models for a common set of genes under varying phenotypes. In principle, separately fit models can be directly compared. This approach is discussed in [3] and is based on distances definable on a space of graphs. Significance levels are estimated using replications of random graphs similar in structure to the estimated models.

The algorithm proposed below differs significantly from the direct graph approach. We will formulate the problem as a two-sample test in which significance levels are estimated by randomly permuting phenotypes. This requires only the minimal assumption of independence with respect to subjects.

Our strategy will be to confine attention to Bayesian network models (Section 2). Fitting Bayesian networks is computationally difficult, so a simplified model is developed for which a polynomial-time algorithm exists for maximum likelihood calculations. A two-sample hypothesis test based on the general likelihood ratio test statistic is introduced in Section 3. In Section 4, we discuss the application of sequential testing principles to permutation replications. This may be done in a way which permits the reporting of error rates commonly used in multiple testing procedures. In Section 5, the methodology is applied to the problem of gene set (GS) analysis, in which high-dimensional arrays of gene expression data are screened for differential expression (DE) by comparing gene sets defined by known functional relationships, in place of individual gene expressions. This follows the paradigm originally proposed in gene set enrichment analysis (GSEA) [4–6]. The method will be applied to two well-known microarray data sets.

An R library of source code implementing the algorithms proposed here may be downloaded at

2. Network Models

A graphical model is developed by defining each of Inline graphic genes as a graph node, labelled by gene expression level Inline graphic for gene Inline graphic. The model incorporates two elements: first, a topology Inline graphic (a directed or undirected graph on the Inline graphic nodes), and second, a multivariate distribution Inline graphic for Inline graphic which conforms to Inline graphic in some well-defined sense. In a Bayesian network (BN) model, Inline graphic is a directed acyclic graph (DAG), and Inline graphic assumes the form

graphic file with name 1687-4153-2010-947564-i11.gif (1)

where Inline graphic is the set of parents of node Inline graphic. Intuitively, Inline graphic describes a causal relationship between node Inline graphic and nodes Inline graphic.
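For reference, the factorization that defines a Bayesian network can be written in standard notation as follows. The symbols $n$, $x_i$, and $\mathrm{pa}(i)$ are illustrative assumptions and need not match the notation of (1).

```latex
% Standard DAG factorization: each node is conditioned only on its parents.
% pa(i) denotes the parent set of node i (empty for a root node).
f(x_1, \ldots, x_n) \;=\; \prod_{i=1}^{n} f\bigl(x_i \mid x_{\mathrm{pa}(i)}\bigr)
```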

The advantage of (1) is the reduction in the degrees of freedom of the model while preserving coexpression structure. Also, some flexibility is available with respect to the choice of the conditional densities of (1), with Gaussian, multinomial, and Gamma forms commonly used [7]. We note that BNs are commonly used in many genomic applications [7–9].

2.1. Gaussian Bayesian Network Model

For this application, we will use the Gaussian BN. These models are naturally expressed using a linear regression model of node Inline graphic data Inline graphic on the data Inline graphic, Inline graphic. In [10], it is noted that in microarray data gene expression levels are aggregated over large numbers of individual cells. Linear correlations are preserved under this process, but other forms of dependence generally will not be, so we can expect linear regression to capture the dominant forms of interaction which are statistically observable. In this case the maximum log-likelihood function for a given topology reduces to

graphic file with name 1687-4153-2010-947564-i21.gif (2)

where Inline graphic is the mean squared error of a linear regression fit of the offspring expressions onto those of the parents.
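A minimal sketch of this calculation, assuming NumPy and a hypothetical `parents` list encoding the topology: each node is regressed on its parents by least squares, and the maximized Gaussian log-likelihood is accumulated from the mean squared errors. The function name and the inclusion of the additive constants are choices made for illustration only.

```python
import numpy as np

def gaussian_bn_max_loglik(data, parents):
    """Maximized Gaussian log-likelihood of a fixed DAG.

    data    : (n_samples, n_nodes) array of expression levels.
    parents : parents[i] is a list of column indices of the parents of node i
              (hypothetical encoding of the topology).
    """
    n = data.shape[0]
    total = 0.0
    for i, pa in enumerate(parents):
        # Design matrix: intercept plus the expressions of the parent nodes.
        design = np.column_stack([np.ones(n)] + [data[:, j] for j in pa])
        coef, *_ = np.linalg.lstsq(design, data[:, i], rcond=None)
        resid = data[:, i] - design @ coef
        mse = np.mean(resid ** 2)  # ML variance estimate (divides by n)
        # Per-node maximized log-likelihood of a Gaussian linear regression.
        total += -0.5 * n * (np.log(2.0 * np.pi * mse) + 1.0)
    return total
```

Because adding a regressor can never increase the residual sum of squares, this quantity is nondecreasing as parents are added, which is the behaviour assumption (A2) requires.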

2.2. Restricted Bayesian Networks

Fitting BNs involves optimization over the space of topologies and hence is computationally intensive [9]. While exact algorithms are available [11], they will generally require too great a computation time for the application described below. A recent application of exact techniques to the problem of pedigree reconstruction (a BN with maximum indegree of 2) was described in [12]. Using methods proposed in [13] the exact computation of the maximum likelihood of a pedigree with 29 individuals (nodes) required 8 minutes. The author of [12] agrees with the conclusion reported in [13], that the method is not viable for BNs with greater than 32 nodes.

It is possible to control the size of the computation by placing a cap Inline graphic on the permissible indegree of each node, though the problem remains difficult even for Inline graphic (see, e.g., [14]). On the other hand, a method for fitting BNs with constraint Inline graphic in polynomial time is available under certain assumptions satisfied in our application. This method is based on the equivalence of the approximation of multivariate probability models using tree-structured dependence and the minimum spanning tree (MST) problem as described in [15]. The objective is the minimization of an information difference Inline graphic, where Inline graphic is the target density, and Inline graphic is selected from a class of tree-structured approximating densities. Interest in [15] is restricted to discrete densities. We find, however, that the basic idea extends to general BNs in a natural way. See [16] for further discussion of this model.

Many heuristic or approximate methods exist for fitting Bayesian networks. See [17] for a recent survey. Such algorithms are usually based on MCMC techniques or heuristic algorithms such as TABU searches [18]. We note that the proposed hypothesis test will depend on the calculation of a maximum likelihood ratio, hence it is important to have reasonable guarantees that a maximum has been reached. Thus, given the choice between an exact solution of a restricted class of models or an approximate solution of a general class of models, the former seems preferable. Considering also that, in the application described below, solutions are required for a number of cases in the tens or hundreds of thousands, a polynomial-time exact solution to a restricted class of models appears to be the best choice.

Suppose we are given an Inline graphic-dimensional random vector Inline graphic. We will assume that the density is taken from a parametric family Inline graphic, Inline graphic. We write first- and second-order marginal densities Inline graphic and Inline graphic, with conditional densities Inline graphic. For convenience, we introduce a dummy vector component Inline graphic, for which Inline graphic. Let Inline graphic be the set of DAGs on nodes Inline graphic with maximum indegree 1. This means that a graph Inline graphic may be written as a mapping Inline graphic. If Inline graphic has indegree 0 set Inline graphic, otherwise Inline graphic is the parent node of Inline graphic. We must have Inline graphic for at least one Inline graphic. For each Inline graphic let Inline graphic be the set of parameters admitting the BN decomposition

graphic file with name 1687-4153-2010-947564-i50.gif (3)

Now suppose we are given Inline graphic independent and complete replicates Inline graphic of Inline graphic. Write components Inline graphic. The log likelihood function becomes, for Inline graphic,

graphic file with name 1687-4153-2010-947564-i56.gif (4)

Suppose we may construct estimators Inline graphic, Inline graphic. We then assume there is some selection rule Inline graphic for each Inline graphic. This will typically be the exact or approximate maximum likelihood estimate (MLE) on parameter space Inline graphic. We will need the following assumptions.

  (A1) For each Inline graphic, Inline graphic, and Inline graphic.

  (A2) For each Inline graphic we have Inline graphic.

We now consider the problem of maximizing Inline graphic over Inline graphic. It will be convenient to isolate the term

graphic file with name 1687-4153-2010-947564-i71.gif (5)

A spanning tree on nodes Inline graphic is an acyclic connected undirected graph. Given edge weights Inline graphic, a minimum spanning tree (MST) is any spanning tree minimizing the sum of its edge weights among all spanning trees. A number of well-known polynomial time algorithms exist to construct a MST. Two that are commonly described are Prim's and Kruskal's algorithms [19]. Kruskal's algorithm is described in [15]. In the following theorem, the problem of maximizing Inline graphic is expressed as a MST problem.

Theorem 1.

If assumptions Inline graphic hold, then maximizing Inline graphic over Inline graphic is equivalent to determining the MST for edge weights Inline graphic.


Under assumption (A1), from definition (4) it follows that Inline graphic depends on Inline graphic only through the term Inline graphic. Then suppose Inline graphic maximizes Inline graphic. For any spanning tree Inline graphic define Inline graphic and suppose Inline graphic minimizes Inline graphic. Assume Inline graphic is not connected. There must be at least two nodes Inline graphic for which Inline graphic, and for which the respective subgraphs containing Inline graphic are unconnected. In this case, extend Inline graphic to Inline graphic by adding directed edge Inline graphic. We must have Inline graphic, and by (A2) we have Inline graphic. We may therefore assume Inline graphic is connected. The undirected graph of Inline graphic is a spanning tree, so Inline graphic.

Next, note that Inline graphic can be identified with an element of Inline graphic by defining any node as a root node, enumerating all paths from the root node to terminal nodes, then assigning edge directions to conform to these paths. This implies Inline graphic, which in turn implies Inline graphic, and that Inline graphic may be selected so that Inline graphic can be identified with Inline graphic.

Remark 1.

In general, the optimizing graph from Inline graphic will not be unique. First, the solution to the MST problem need not be unique. Second, there will always be at least two extensions of a spanning tree to a BN.

Marginal means, variances, and correlations of Inline graphic are denoted Inline graphic, leading to parameters Inline graphic, Inline graphic. Each parameter in the set Inline graphic represents the class of Gaussian BNs which conform to graph Inline graphic. Following the construction in assumption (A1), let Inline graphic, Inline graphic using summary statistics Inline graphic, Inline graphic, Inline graphic. Under the usual parameterization, it can be shown that (omitting constants)

graphic file with name 1687-4153-2010-947564-i119.gif (6)

noting that, since Inline graphic, assumption (A2) holds.
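The Gaussian case can be sketched as follows, assuming NumPy and taking the edge weight between nodes i and j to be log(1 - r_ij^2), the standard per-edge log-likelihood term for a bivariate Gaussian up to a positive scale factor (the exact scaling used in (6) is not reproduced here). Prim's algorithm is grown from node 0, an arbitrary choice of root, so the edge directions fall out of the order in which nodes are attached. The function name and the use of -1 for "no parent" are hypothetical.

```python
import numpy as np

def max_likelihood_tree(corr):
    """Parent map of a maximum-likelihood Gaussian BN with indegree <= 1.

    corr : (p, p) sample correlation matrix.
    Returns a list `parent` with parent[0] == -1 (node 0 is the root).
    """
    p = corr.shape[0]
    # Edge weight log(1 - r^2): more negative means a larger likelihood gain.
    with np.errstate(divide="ignore"):
        weight = np.log1p(-np.square(corr))
    parent = [-1] * p
    in_tree = [False] * p
    best = [np.inf] * p  # cheapest known edge linking each node to the tree
    best[0] = 0.0
    for _ in range(p):
        # Prim's step: attach the cheapest node not yet in the tree.
        u = min((j for j in range(p) if not in_tree[j]), key=lambda j: best[j])
        in_tree[u] = True
        for v in range(p):
            if not in_tree[v] and weight[u, v] < best[v]:
                best[v] = weight[u, v]
                parent[v] = u  # directed edge u -> v, pointing away from the root
    return parent
```

Since every weight is nonpositive, adding an edge never lowers the likelihood, so the optimum is always a connected tree; choosing a different root yields a different orientation of the same spanning tree.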

3. General Maximum Likelihood Ratio Test

Identification of nonhomogeneity between two Bayesian networks will be based on a general maximum likelihood ratio test (MLRT). It is important to note the properties of the MLRT are well understood in parametric inference of limited dimension, and a sampling distribution can be accurately approximated with a large enough sample size. These known properties no longer apply in the type of problem considered here, primarily due to the small sample size, large number of parameters, and the fact that optimization over a discrete space is performed. In addition, the maximum likelihood principle itself favors spurious complexity when no model selection principles are used. While we cannot claim that the MLRT possesses any optimum properties in this application, the use of a permutation procedure will permit accurate estimates of the observed significance level while the use of the restricted model class will control to some degree the degrees of freedom of the model. See, for example, [20] for a general discussion of these issues.

Suppose Inline graphic is a family of densities defined on some parameter set Inline graphic. We are given two random samples Inline graphic and Inline graphic from respective densities Inline graphic and Inline graphic. Denote pooled sample Inline graphic. The density of Inline graphic and Inline graphic, respectively, are Inline graphic and Inline graphic. We consider null hypothesis Inline graphic. Under Inline graphic the joint density of Inline graphic is Inline graphic for some parameter Inline graphic. Assume the existence of maximum likelihood estimators Inline graphic, Inline graphic, and Inline graphic. The general likelihood ratio statistic in logarithmic scale is then (with large values rejecting Inline graphic)

graphic file with name 1687-4153-2010-947564-i141.gif (7)
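In generic notation, a log-scale general likelihood ratio statistic of the kind described above takes the following form. The symbols $\ell$, $\Lambda$, $X$, $Y$, and $Z$ are illustrative assumptions and need not match the notation of (7).

```latex
% X, Y: the two samples; Z: the pooled sample.
% \hat\theta_1, \hat\theta_2: MLEs from X and Y; \hat\theta_0: MLE from Z under the null.
\Lambda \;=\; \ell(\hat\theta_1; X) \;+\; \ell(\hat\theta_2; Y) \;-\; \ell(\hat\theta_0; Z)
```

Because the pooled estimate is also a feasible parameter for each sample separately, a statistic of this form is nonnegative, with large values indicating departure from the null hypothesis.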

Asymptotic distribution theory is not relevant here due to small sample size and the fact that optimization is performed in part over a discrete space of models, so a two-sample permutation procedure will be used. Permutations will be approximately balanced to reduce spurious variability when a true difference in expression pattern exists (see, e.g., [21] for discussion). This can be done by changing group labels of Inline graphic randomly selected sample vectors from each of Inline graphic and Inline graphic. This results in permutation replicate samples Inline graphic and Inline graphic. The balanced procedure ensures that each permutation replicate sample contains approximately equal proportions of the original samples.

We now define Algorithm 1.

Algorithm 1.

(1) Determine Inline graphic by maximizing Inline graphic, Inline graphic, Inline graphic (MST algorithm).

(2) Set Inline graphic.

(3) Construct Inline graphic replications Inline graphic in the following way. For each replication Inline graphic, create random replicate samples Inline graphic and Inline graphic, then determine Inline graphic which maximize Inline graphic, Inline graphic. Set Inline graphic.

(4) Set P-value

graphic file with name 1687-4153-2010-947564-i165.gif (8)

Note that the quantity Inline graphic is permutation invariant and hence need not be recalculated within the permutation procedure.
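The fixed-replication permutation procedure can be sketched generically as below, assuming NumPy. The test statistic is passed in as a callable so that any two-sample statistic (including a likelihood ratio) can be plugged in. The names `n_swap` and `n_rep`, the default of swapping half of the smaller group, and the add-one form of the p-value estimate are all assumptions made for illustration, not details taken from (8).

```python
import numpy as np

def balanced_permutation_pvalue(x, y, statistic, n_rep=999, n_swap=None, seed=0):
    """Permutation p-value for a two-sample statistic (large values reject).

    x, y      : float arrays with one row per subject.
    statistic : callable (x, y) -> float.
    n_swap    : rows exchanged between the groups in each replicate
                (hypothetical default: half of the smaller group).
    """
    rng = np.random.default_rng(seed)
    if n_swap is None:
        n_swap = min(len(x), len(y)) // 2
    observed = statistic(x, y)
    exceed = 0
    for _ in range(n_rep):
        ix = rng.choice(len(x), size=n_swap, replace=False)
        iy = rng.choice(len(y), size=n_swap, replace=False)
        xp, yp = x.copy(), y.copy()
        # Exchange the group labels of the selected rows.
        xp[ix], yp[iy] = y[iy], x[ix]
        if statistic(xp, yp) >= observed:
            exceed += 1
    # Add-one estimator (an assumption): keeps the estimate strictly positive.
    return (1 + exceed) / (1 + n_rep)
```

Any term of the statistic that depends only on the pooled sample can be computed once outside the loop, since relabelling subjects does not change it.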

4. Permutation Tests with Stopping Rules

Permutation or bootstrap tests usually reduce to the estimation of a binomial probability by direct simulation. Since interest is usually in identifying small values, it would seem redundant to continue sampling when, for example, the first ten simulations lead to an estimate of 1/2. This suggests that a stopping rule may be applied to permutation sampling, resulting in significant reduction in computation time, provided it can be incorporated into a valid inference statement. A variety of such procedures have been described in the literature but do not seem to have been widely adopted in genomic discovery applications [22–24].

Suppose, as in Algorithm 1, we have an observed test statistic Inline graphic, and can simulate indefinitely a sequence Inline graphic from a null distribution Inline graphic. By convention we assume that large values of Inline graphic tend to reject the null hypothesis. To develop a stopping rule for this sequence set

graphic file with name 1687-4153-2010-947564-i171.gif (9)

Formally, Inline graphic is a stopping time if the occurrence of event Inline graphic can be determined from Inline graphic. We may then design an algorithm which terminates after sampling a sequence of exactly length Inline graphic from Inline graphic, then outputs Inline graphic, from which the hypothesis decision is resolved. We refer to such a procedure as a stopped procedure. A fixed procedure (such as Algorithm 1) can be regarded as a special case of a stopped procedure in which Inline graphic.

An important distinction will have to be made between a single test and a multiple testing procedure (MTP), which is a collection of Inline graphic hypothesis tests with rejection rules that control for a global error rate such as false discovery rate (FDR), family-wise error rate (FWER), or per family error rate (PFER) [25]. In the single test application, we may set a fixed significance level Inline graphic and continue replications until we conclude that the P-value is above or below Inline graphic. For an MTP, it will be important to be able to estimate small P-values, so a stopping rule which permits this is needed. Although the two cases have different structure, in our development they will both be based on the sequential probability ratio test (SPRT), first proposed in [26], which we now describe.

4.1. Sequential Probability Ratio Test (SPRT)

Formally (see [27, Chapter 2]) the SPRT tests between two simple alternatives Inline graphic: Inline graphic versus Inline graphic: Inline graphic, where Inline graphic parametrizes a family of distributions Inline graphic. We assume there is a sequence of Inline graphic observations Inline graphic from Inline graphic where Inline graphic. Let Inline graphic be the likelihood function based on Inline graphic and define the likelihood ratio statistic Inline graphic. For two constants Inline graphic, define stopping time

graphic file with name 1687-4153-2010-947564-i198.gif (10)

It can be shown that Inline graphic. If Inline graphic we conclude Inline graphic and conclude Inline graphic otherwise. We define errors Inline graphic and Inline graphic. It turns out that the SPRT is optimal under the given assumptions in the sense that it minimizes Inline graphic among all sequential tests (which includes fixed sample tests) with respective error probabilities no larger than Inline graphic. Approximate formulae for Inline graphic and Inline graphic are given in [27].

Hypothesis testing usually involves composite hypotheses, with distinct interpretations for the null and alternative hypothesis. One method of adapting the SPRT to this case is to select surrogate simple hypotheses. For example, to test Inline graphic versus Inline graphic, we could select simple hypotheses Inline graphic and Inline graphic. In this case, we would need to know the entire power function, which may be estimated using simulations.

An additional issue then arises in that the expected stopping time may be very large for Inline graphic. This can be accommodated using truncation. Suppose a reasonable choice for a fixed sample size is Inline graphic. We would then use truncated stopping time Inline graphic, with Inline graphic defined in (10). When Inline graphic, we could, for example, select hypothesis Inline graphic if Inline graphic. These modifications are discussed in [27].

4.2. Single Hypothesis Test

Suppose we adopt a fixed significance level Inline graphic for a single hypothesis test. If Inline graphic is the (unknown) true significance level, we are interested in resolving the hypothesis Inline graphic:Inline graphic. The properties of the test are summarized in a power curve, that is, the probability of deciding Inline graphic is true for each Inline graphic. An example of this procedure is given in [28], for Inline graphic, using a SPRT with parameters Inline graphic, Inline graphic, Inline graphic, Inline graphic, and truncation at Inline graphic. Hypothesis Inline graphic is concluded if Inline graphic when Inline graphic; otherwise when Inline graphic.

4.3. Multiple Hypothesis Tests

We next assume that we have Inline graphic hypothesis tests based on sequences of the form (9). We wish to report a global error rate, in which case specific values of small P-values are of importance. We will consider specifically the class of MTPs referred to as either step-up or step-down procedures. If we are given a sequence of Inline graphic P-values Inline graphic which have ranks Inline graphic, then adjusted P-values Inline graphic are given by:

graphic file with name 1687-4153-2010-947564-i244.gif (11)

where the quantity Inline graphic defines the particular MTP. It is assumed that Inline graphic is an increasing function of Inline graphic for all Inline graphic. The procedure is implemented by rejecting all null hypotheses for which Inline graphic. Depending on the MTP, various forms of error, usually either family-wise error rate (FWER) or false discovery rate (FDR), are controlled at the Inline graphic level. For example, the Benjamini-Hochberg (BH) procedure is a step-up procedure defined by Inline graphicInline graphicInline graphic and controls for FDR for independent hypothesis tests. A comprehensive treatment of this topic is given in, for example, [25].
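As an illustration of a step-up adjustment, the Benjamini-Hochberg adjusted p-values can be computed as in the following sketch, which assumes NumPy and uses the standard textbook form of the procedure (each sorted p-value is scaled by m divided by its rank, then a running minimum is taken from the largest rank downward). The function name is hypothetical.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the k-th smallest p-value by m / k.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Step-up: each adjusted value is the minimum over equal or larger ranks.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted
```

Hypotheses whose adjusted value falls at or below the chosen level are rejected; for example, `bh_adjust([0.01, 0.04, 0.03, 0.005])` gives approximately `[0.02, 0.04, 0.04, 0.02]`.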

Suppose we have Inline graphic probabilities Inline graphic (P-values associated with Inline graphic tests). For each test Inline graphic, we may generate Inline graphic as the cumulative sum defined in (9). Now suppose we define any stopping time Inline graphic, bounded by Inline graphic, for each sequence Inline graphic (this may or may not be related to the SPRT). Then define estimates Inline graphic, with Inline graphic.

For a fixed MTP, the estimates Inline graphic would replace the true values in (11), yielding estimated adjusted P-values Inline graphic, while for the stopped MTP adjusted P-values Inline graphic are produced in the same manner using Inline graphic. It is easily seen that Inline graphic while the rankings of Inline graphic (accounting for ties) are equal to the rankings of Inline graphic. Furthermore, the formulae in (11) are monotone in Inline graphic, so we must have Inline graphic. Thus, the stopped procedure may be seen as being embedded in the fixed procedure. It inherits whatever error control is given for the fixed MTP, with the advantage that the calculation of the adjusted P-values Inline graphic uses only the first Inline graphic replications for the Inline graphic-th test.

The procedure will always be correct in that it is strictly more conservative than the fixed MTP in which it is embedded, no matter which stopping time is used. The remaining issue is the selection of Inline graphic which will equal Inline graphic for small enough values of Inline graphic but will also have Inline graphic for larger values of Inline graphic. It is a simple matter, then, to modify the SPRT described in Section 4.2 by eliminating the lower bound Inline graphic (equivalently Inline graphic). We will adopt this design in this paper. This gives Algorithm 2.

Algorithm 2.

(1) Same as Algorithm 1, step 1.

(2) Same as Algorithm 1, step 2.

(3) Simulate replicates Inline graphic in Algorithm 1, step 3, until the following stopping criterion is met. Set Inline graphic, and let Inline graphic, where Inline graphic. Stop sampling at the Inline graphic-th replication if Inline graphic, where Inline graphic, or until Inline graphic, whichever occurs first.

(4) Let Inline graphic be the number of replications in step 3. If Inline graphic, set

graphic file with name 1687-4153-2010-947564-i300.gif (12)

otherwise set Inline graphic.

The values Inline graphic generated by Algorithm 2 can then be used in a stopped MTP as described in this section.
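A stopped procedure of this general kind might be sketched as follows: a one-sided, truncated SPRT applied to the indicators that a null replicate is at least as large as the observed statistic. Everything specific here is an assumption for illustration: the surrogate hypotheses `p0` and `p1`, the upper boundary `log_upper`, the add-one estimate used when sampling runs to the truncation point, and the convention of reporting 1 after an early stop (a conservative choice; the exact rule in (12) is not reproduced here).

```python
import math

def stopped_permutation_pvalue(observed, draw_null, n_max,
                               p0=0.05, p1=0.10, log_upper=math.log(99.0)):
    """One-sided truncated SPRT on exceedance indicators.

    draw_null : callable returning one null replicate of the statistic.
    Returns (p_value, n_used). All tuning constants are hypothetical.
    """
    step_hit = math.log(p1 / p0)                   # replicate >= observed
    step_miss = math.log((1.0 - p1) / (1.0 - p0))  # replicate < observed
    log_ratio, exceed = 0.0, 0
    for n in range(1, n_max + 1):
        if draw_null() >= observed:
            exceed += 1
            log_ratio += step_hit
        else:
            log_ratio += step_miss
        # Only an upper boundary is used: stop once a large p-value is evident.
        if log_ratio >= log_upper and n < n_max:
            return 1.0, n
    # Sampling ran to truncation: report the usual (add-one) estimate.
    return (1 + exceed) / (1 + n_max), n_max
```

With these constants, a test whose null replicates always exceed the observed value stops after 7 draws, while a test with no exceedances uses all `n_max` draws and returns the smallest attainable estimate.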

5. Gene-Set Analysis

A recent trend in the analysis of microarray data has been to base the discovery of phenotype-induced DE on gene sets rather than individual genes. The reasoning is that if genes in a given set are related by common pathway membership or other transcriptional process, then there should be an aggregate change in gene expression pattern. This should give increased statistical power, as well as enhanced interpretability, especially given the lack of reproducibility in univariate gene discovery due to the stringent requirements imposed by multiple testing adjustments. Thus, the discovery process reduces to a much smaller number of hypothesis tests with more direct biological meaning. Some objections may be raised concerning the selection of the gene sets when these sets are themselves determined experimentally. Additionally, gene sets may overlap. While these problems need to be addressed, it is also true that such gene set methods have been shown to detect DE not uncovered by univariate screens.

A crucial problem in gene set analysis is the choice of test statistic. The problem of testing against equality of random vectors in Inline graphic, Inline graphic, is fundamentally different from the univariate case Inline graphic. The range of statistics one would consider for Inline graphic is reasonably limited, the choice being largely driven by distributional considerations. For Inline graphic, new structural or geometric considerations arise. For example, we may have differential expression between some but not all genes in the gene set, which makes selection of a single optimal test statistic impossible. Alternatively, the experimental random vectors may differ in their level of coexpression independently of their level of marginal DE.

In fact, almost all GS procedures directly measure aggregate DE, so an important question is whether or not phenotypic variation is almost completely expressible as DE. If so, then a DE based statistic will have fewer degrees of freedom, hence more power, than one based on a more complex model. Otherwise, a reasonable conjecture is that a compound GS analysis will work best, employing a DE statistic as well as one more sensitive to changes in coexpression patterns.

Correlations have been used in a number of gene discovery applications. They may be used to associate genes of unknown function with known pathways [29, 30]. Additionally, a number of GS procedures exist which incorporate correlation structure into the procedure [31–33]. However, a direct comparison of correlations is not practical due to the large number (Inline graphic) of distinct correlation parameters. Therefore, there is a considerable advantage to the statistic (7) based on the reduced BN model, in that the correlation structure can be summarized by the Inline graphic correlation parameters output by the MST algorithm, yielding a transitive dependence model similar to that effectively exploited in [29].

It is important to refer to a methodological characterization given in [34]. A distinction is made between two types of null hypotheses. Suppose we are given samples of expression levels from a gene set Inline graphic from two phenotypes. Suppose also that for each gene in Inline graphic and its complement Inline graphic, a statistical measure of differential expression is available. For a competitive test, the null hypothesis Inline graphic is that the prevalence of differential expression in Inline graphic is no greater than in Inline graphic. For a self-contained test, the null hypothesis Inline graphic is that no genes in Inline graphic are differentially expressed. In the GSEA method of [4, 5] concern is with Inline graphic. In most subsequent methods, including the one proposed here, Inline graphic is used.

For general discussions of the issues raised here, see [35–37]. Comprehensive surveys of specific methods can be found in [38] or [39].

5.1. Experimental Data

We will demonstrate the algorithm proposed here on two data sets examined elsewhere in the literature. These were obtained from the GSEA website [6]. In [5], a data set p53 is extracted from the NCI-60 collection of cancer cell lines, with 17 cell lines classified as normal, and 33 classified as carrying mutations of p53. We also examine the DIABETES data set introduced in [4], consisting of microarray profiles of skeletal muscle biopsies from 43 males. For the DIABETES data set used here, there were 17 normal glucose tolerance (NGT) subjects and 17 diabetes (DMT) subjects. For gene sets, we used one of the gene set lists compiled in [5], denoted Inline graphic, consisting of 472 gene sets with products collectively involved in various metabolic and signalling pathways, as well as 50 sets containing genes exhibiting coregulated response to various perturbations. In our analyses, FDR will be estimated using the BH procedure.

5.1.1. P53 Data

A t-test was performed on each of the 10,100 genes. Only one gene had an adjusted P-value less than FDR = 0.25 (bax, Inline graphic, Inline graphic). Several GS analyses for this data set (using Inline graphic) have been reported. We cite the GSEA analysis in [5] and a modification of the GSEA proposed in [40]. Also, in [38], this data set is used to test three procedures, each using various standardization procedures. Two are based on logistic regression (Global test [41] and ANCOVA Global test [42]). The third is an extension of the Significance Analysis of Microarrays (SAM) procedure [43] to gene sets proposed in [44] (SAM-GS).

Table 1 lists pathways selected from Inline graphic for the analysis proposed here using FDR Inline graphic 0.25, including unadjusted and adjusted P-values. For each entry we indicate whether or not the pathway was selected under the analyses reported in [5] (Sub, FDR Inline graphic 0.25), [40] (Efr, FDR Inline graphic 0.1) and [38] (Liu, nominal P-value Inline graphic in at least one procedure). It is important to note that the results indicated with an asterisk (*) are not directly comparable due to differing MTP control, and are included for completeness.

Table 1.

P53 pathways, with GS size, unadjusted P-values, and FDR-adjusted P-values.

Pathway Size P Adjusted-P Sub Efr Liu
SA_G1_AND_S_PHASES 14 <.001 .08 n y n
atmPathway 19 <.001 .08 n n y
g2Pathway 23 <.001 .08 n n n
p53Pathway 16 <.001 .08 y y y
cell_cycle_checkpointII 10 <.001 .08 n n n
SA_FAS_SIGNALLING 9 .002 .14 n n* n*
cellcyclePathway 23 .002 .16 n n* n*
DNA_DAMAGE(a) 90 .003 .17 n n* n*
SA_TRKA_RECEPTOR 16 .003 .17 n n* y*
radiation_sensitivity 26 .003 .17 y y* y*
ngfPathway 19 .004 .17 n y* n*
GO_ROS 23 .004 .17 n n* n*
etsPathway 16 .004 .17 n n* n*
ck1Pathway 15 .006 .21 n n* n*
erkPathway 29 .007 .23 n n* n*
MAP00562(b) 18 .007 .23 n n* n*
arfPathway 13 .007 .23 n n* n*

Inclusion in the analyses cited in Section 5.1 is indicated (y/n). (a) The complete name of DNA_DAMAGE is DNA_DAMAGE_SIGNALLING. (b) The complete name of MAP00562 is MAP00562_Inositol_phosphate_metabolism. *Inclusion criterion based on control rate of original analysis.

The first five pathways are directly comparable. Of these, two were not detected in any other analysis. Our procedure was repeated for these pathways using the sum of the squared t-statistics across genes. The nominal Inline graphic-values for g2 Pathway and cell cycle checkpoint II were .0044 and Inline graphic.05, respectively. Since we are interested in identifying pathways which may be detectable by pathway methods but not by DE-based methods, we will examine cell cycle checkpoint II more closely. Applying a univariate Inline graphic-test to each of the 10 genes yields one Inline graphic-value of 0.001 (cdkn2a), with the remaining Inline graphic-values greater than 0.1; hence a DE-based approach is unlikely to select this pathway. Furthermore, Inline graphic-values under 0.05 for change in correlation are reported for rbbp8/rb1, nbs1/ccng2, atr/ccne2, nbs1/tp53, and ccng2/tp53 (Inline graphic, .006, .008, .035, and .036). The difference in gene expression pattern is therefore attributable to a change in coexpression pattern rather than to differential expression. In Figure 1, the correlations for all gene pairs for wildtype and mutation groups are indicated. A clear pattern is evident, by which correlation structure present in the wildtype class does not exist in the mutation class.
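The comparison based on the sum of squared t-statistics can be sketched as a plain fixed-length permutation test. This is an illustration only: the pooled-variance form of the t-statistic, the function names, and the default number of replications are assumptions, not details given in this section.

```python
import numpy as np

def sum_sq_t(x, y):
    """Sum over genes of squared pooled-variance two-sample t-statistics.

    x, y: arrays of shape (samples, genes), one per phenotype.
    """
    nx, ny = x.shape[0], y.shape[0]
    pooled_var = ((nx - 1) * x.var(axis=0, ddof=1)
                  + (ny - 1) * y.var(axis=0, ddof=1)) / (nx + ny - 2)
    t = (x.mean(axis=0) - y.mean(axis=0)) / np.sqrt(pooled_var * (1 / nx + 1 / ny))
    return float(np.sum(t ** 2))

def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """Permutation p-value obtained by shuffling phenotype labels."""
    rng = np.random.default_rng(seed)
    observed = sum_sq_t(x, y)
    stacked = np.vstack([x, y])
    nx = x.shape[0]
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(stacked.shape[0])
        if sum_sq_t(stacked[idx[:nx]], stacked[idx[nx:]]) >= observed:
            count += 1
    # Add-one form keeps the estimate strictly positive.
    return (count + 1) / (n_perm + 1)
```

A statistic of this kind responds to shifts in mean expression, which is why a pathway whose classes differ mainly in correlation structure may go undetected by it.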

Figure 1.

Scatterplot of correlations for all gene pairs in cell_cycle_checkpoint_II pathway, using wildtype and mutation axes. Gene pairs with nominal significance levels for differential coexpression Inline graphic (Inline graphic) and Inline graphic (Inline graphic) are indicated separately.

Figure 2.

Bayesian network fits for mutation data for cell cycle checkpoint II pathway using (a) Minimum Spanning Tree algorithm (maximum indegree of 1); (b) Bayesian Information Criterion (maximum indegree of 2).

Figure 3.

Bayesian network fits for wildtype data for cell cycle checkpoint II pathway using (a) Minimum Spanning Tree algorithm (maximum indegree of 1); (b) Bayesian Information Criterion (maximum indegree of 2).

To further clarify the procedure, we compare the BN models obtained from the data for the ten genes associated with the cell cycle checkpoint II pathway, fitted separately for mutation and wildtype conditions. If there is interest in a post-hoc analysis of any particular pathway, the rationale for the MST algorithm no longer holds, since only one fit is required. It is therefore instructive to compare the MST model to a more commonly used method. In this case, we will use the Bayesian Information Criterion (BIC) (see, e.g., [7]), with a maximum indegree of 2. To fit the model we use a simulated annealing algorithm adapted from [45]. The resulting graphs are shown in Figures 2 (mutation) and 3 (wildtype). The MST and BIC fits are labelled (a) and (b), respectively. For the mutation fit, there is a very close correspondence between the topologies produced by the respective methods. For the wildtype data, some correspondence still exists, but less so than for the mutation data. The topologies differ more markedly between the conditions, as predicted by the hypothesis test.
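For readers who wish to experiment with a tree-structured fit of the kind labelled (a), the following is a minimal sketch in the spirit of Chow and Liu [15]. It assumes Gaussian data, so that the pairwise mutual information is -0.5 log(1 - r^2), and builds a maximum-weight spanning tree with Prim's algorithm before orienting edges away from an arbitrary root. The function name and the choice of root are hypothetical, and this is not the implementation used to produce Figures 2 and 3.

```python
import numpy as np

def chow_liu_tree(data, root=0):
    """Directed (parent, child) edges of a tree with maximum indegree one.

    data: array of shape (samples, genes). Gaussian assumption (see lead-in).
    """
    x = np.asarray(data, dtype=float)
    n_genes = x.shape[1]
    r = np.corrcoef(x, rowvar=False)
    r2 = np.clip(r ** 2, 0.0, 1.0 - 1e-12)
    weight = -0.5 * np.log1p(-r2)          # Gaussian pairwise mutual information

    in_tree = np.zeros(n_genes, dtype=bool)
    in_tree[root] = True
    best_weight = weight[root].copy()      # best link from each node into the tree
    best_parent = np.full(n_genes, root)
    best_weight[root] = -np.inf            # nodes already in the tree are excluded

    edges = []
    for _ in range(n_genes - 1):
        child = int(np.argmax(best_weight))
        edges.append((int(best_parent[child]), child))
        in_tree[child] = True
        best_weight[child] = -np.inf
        # Update candidate links for nodes still outside the tree.
        improve = (~in_tree) & (weight[child] > best_weight)
        best_weight[improve] = weight[child][improve]
        best_parent[improve] = child
    return edges
```

Fitting the function separately to the wildtype and mutation samples and comparing the two edge lists gives a rough, informal analogue of the topology comparison described above.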

5.1.2. Diabetes

No pathways were detected at an FDR of 0.25. The two pathways with the smallest Inline graphic-values were atrbrca Pathway and MAP00252 Alanine and aspartate metabolism (Inline graphic). In [33] the latter pathway was the single pathway reported with PFER = 1. The comparable PFER rate of the two pathways reported here would be 1.36 and 1.57. The atrbrca Pathway contains 25 genes. Of these, only fance was differentially expressed at a 0.05 significance level (Inline graphic). For each gene pair, correlation coefficients were calculated and tested for equality between classes NGT and DMT. Table 2 lists the 10 highest ranking gene pairs in terms of correlation magnitude within the NGT class. Also listed is the corresponding correlation within the DMT class, as well as the two-sample Inline graphic-value for correlation difference. The analysis is repeated after exchanging classes, also in Table 2. We note that for a sample size of 17, an approximate 95% confidence interval for a reported correlation of Inline graphic is Inline graphic, whereas the standard deviation of a sample correlation coefficient of mean zero is approximately 0.27. There is therefore likely to be considerable statistical variation in graphical structure under the null hypothesis.
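The standard deviation of approximately 0.27 quoted above agrees with the Fisher z-transformation, under which the transformed correlation has standard error 1/sqrt(n - 3), that is, 1/sqrt(14) for n = 17. This section does not state which two-sample test was used for the correlation differences, so the sketch below, based on Fisher's z, is an assumption; the function names are hypothetical.

```python
import math

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval for a correlation via Fisher's z."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

def corr_diff_pvalue(r1, n1, r2, n2):
    """Two-sided p-value for equality of two independent correlations."""
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (math.atanh(r1) - math.atanh(r2)) / se
    # Two-sided normal tail probability.
    return math.erfc(abs(z) / math.sqrt(2.0))
```

As a check under this assumption, `corr_diff_pvalue(0.81, 17, 0.30, 17)` gives approximately 0.031, which agrees with the crat/got1 entry of Table 2.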

Table 2.

Correlation analysis for DIABETES data

atr brca pathway Alanine pathway
NGT cor NGT cor
genes ngt dmt Inline graphic genes ngt dmt Inline graphic

fancc/rad17 83 69 349 crat/got1 81 30 031
fancc/brca2 76 44 156 nars/dars 80 Inline graphic24 Inline graphic1
rad9a/rad17 76 87 338 crat/gpt 75 15 028
chek2/rad17 71 35 172 got2/adss Inline graphic75 Inline graphic02 012
brca1/hus1 Inline graphic69 Inline graphic29 148 got2/abat Inline graphic73 34 001
rad17/brca2 67 56 632 ddx3x/got1 72 Inline graphic17 004
atr/mre11a Inline graphic64 Inline graphic41 403 crat/ass 72 12 037
chek1/nbs1 Inline graphic62 09 030 ddx3x/dars 71 12 043
rad51/rad1 Inline graphic62 Inline graphic23 198 gpt/got1 70 33 175
rad9a/fancc 59 76 388 ddx3x/abat Inline graphic68 Inline graphic41 305

DMT cor DMT cor
genes dmt ngt Inline graphic genes dmt ngt Inline graphic

rad9a/rad17 87 76 338 ddx3x/aars Inline graphic76 Inline graphic55 325
fanca/fance 81 14 009 crat/nars 74 26 074
rad9a/fancc 76 59 388 ddx3x/nars 73 66 715
fanca/hus1 Inline graphic72 27 002 asns/ddo 60 42 502
brca1/mre11a 71 11 039 pc/aars Inline graphic58 15 031
fancc/rad17 69 83 349 crat/pc 58 53 862
fancf/hus1 67 53 563 crat/ddx3x 58 51 813
brca1/atr Inline graphic67 16 011 got1/dars Inline graphic56 40 006
rad17/mre11a 64 11 086 pc/nars 55 18 244
fancg/rad51 64 22 160 asns/gad2 Inline graphic54 Inline graphic44 723

For each pathway and phenotype, 10 gene pairs with the largest correlation (Inline graphic100) magnitudes; correlation (Inline graphic100) of alternative phenotype; and Inline graphic-value (Inline graphic1000) against equality.

Examining the first table, differences in correlation appear to be explainable by sampling variation. In the second there are two gene pairs, fanca/fance and fanca/hus1, with small Inline graphic-values (.009, .002). We note that they share a common gene, fanca, and that one of them involves fance, the only gene exhibiting differential expression. The correlation patterns within the two samples are otherwise similar, suggesting a specific alteration of the network model.

The situation differs for the pathway MAP00252 Alanine and aspartate metabolism, summarized in Table 2 using the same analysis. The change in correlation is more widespread. The 8 gene pairs with the highest correlation magnitudes within the NGT sample differ between NGT and DMT at a 0.05 significance level. Furthermore, the number of gene pairs with correlation magnitudes exceeding 0.7 is 9 in the NGT sample, but only 3 in the DMT sample.

5.1.3. Comparison of Fixed and Stopped Procedures

Both the fixed and stopped procedures were applied to the preceding analysis. The SPRT used parameters Inline graphic, Inline graphic, Inline graphic, Inline graphic, and truncation at Inline graphic. Table 3 summarizes the computation times for each method as well as the selection agreement. In these examples, the stopped procedure required significantly less computation time with no apparent loss in power.
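To illustrate the general idea of a stopped permutation test, the sketch below applies a one-sided truncated Wald SPRT [26] to the stream of indicators recording whether each permuted statistic reaches the observed one. Sampling stops early only when the evidence favours a large p-value; otherwise it continues to the truncation point. All parameter values (`p_lo`, `p_hi`, `alpha`, `beta`, `max_reps`) are hypothetical defaults chosen for illustration and are not the settings used in this analysis, and no claim is made that this simplified rule preserves the error rates discussed in the paper.

```python
import math

def stopped_permutation_test(exceeds, p_lo=0.05, p_hi=0.10,
                             alpha=0.01, beta=0.01, max_reps=5000):
    """Truncated one-sided SPRT on permutation exceedance indicators.

    exceeds: iterable of booleans, True when a permuted statistic is at
    least as large as the observed statistic.
    """
    upper = math.log((1 - beta) / alpha)          # Wald boundary favouring p_hi
    step_hit = math.log(p_hi / p_lo)              # log-likelihood ratio, exceedance
    step_miss = math.log((1 - p_hi) / (1 - p_lo)) # log-likelihood ratio, otherwise
    llr, hits, n = 0.0, 0, 0
    for flag in exceeds:
        n += 1
        if flag:
            hits += 1
            llr += step_hit
        else:
            llr += step_miss
        if llr >= upper:
            # Evidence that the p-value is large: stop sampling this gene set.
            return {"stopped": True, "reps": n, "p_estimate": (hits + 1) / (n + 1)}
        if n >= max_reps:
            break
    return {"stopped": False, "reps": n, "p_estimate": (hits + 1) / (n + 1)}
```

With a rule of this kind, gene sets that are clearly not significant consume few replications, which is the source of the savings summarized in Table 3.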

Table 3.

For stopped (St) and fixed (Fx) procedures, the table gives computation times; mean number of replications; % gene sets completely sampled; number of pathways with Inline graphic-values Inline graphic; 01; and number of such pathways in agreement.

Data Time (hrs) Mean rep % comp # Inline graphic
St Fx St Fx St Fx St Fx Both
diab 3.7 35.8 341.0 5000 5.4 100 6 6 6
p53 2.1 30.0 612.3 5000 10.5 100 18 19 18

6. Conclusion

We have introduced a two-sample generalized likelihood ratio test for the equality of Bayesian network models. Significance levels are estimated using a permutation procedure. The algorithm was proposed as an alternative form of gene-set analysis. Because fitting Bayesian networks is computationally time consuming, efficient calculation of a model fit is needed, particularly for this application.

Two procedures were introduced to meet this requirement. First, we implemented a version of a minimum spanning tree algorithm first proposed in [15] which permits the polynomial-time calculation of the maximum likelihood Bayesian network among those with maximum indegree of one. Second, we introduced sequential testing principles to the problem of multiple testing, finding that a straightforward stopping rule could be developed which preserves group error rates for a wide range of procedures.

We may expect this form of test to be especially sensitive to changes in coexpression patterns, in contrast to most gene-set procedures, which directly measure aggregate differential expression. In an application of the algorithm to two data sets considered in [5], a number of selected gene-sets exhibited clear differences in coexpression patterns while exhibiting very little differential expression. This leads to the conjecture that the optimal approach to gene-set analysis is to couple a test which directly measures aggregate differential expression with one designed to detect differential coexpression.


This paper was supported by NIH Grant no. R21HG004648. The Clinical Translational Science Institute of the University of Rochester Medical Center also provided funding for this research.


  1. Dougherty ER, Shmulevich I, Chen J, Wang ZJ. Genomic Signal Processing and Statistics, EURASIP Book Series on Signal Processing and Communications. Vol. 2. Hindawi Publishing Corporation, New York, NY, USA; 2005.
  2. Shmulevich I, Dougherty ER. Genomic Signal Processing. Princeton University Press, Princeton, NJ, USA; 2007.
  3. Emmert-Streib F, Dehmer M. In: Analysis of Microarray Data: A Network-Based Approach. Emmert-Streib F, Dehmer M, editor. Wiley-VCH, Weinheim, Germany; 2008. Detecting pathological pathways of a complex disease by a comparative analysis of networks; pp. 285–305.
  4. Mootha VK, Lindgren CM, Eriksson K-F, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstråle M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics. 2003;34(3):267–273. doi: 10.1038/ng1180.
  5. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America. 2005;102(43):15545–15550. doi: 10.1073/pnas.0506580102.
  6. Subramanian A, Kuehn H, Gould J, Tamayo P, Mesirov JP. GSEA-P: a desktop application for gene set enrichment analysis. Bioinformatics. 2007;23(23):3251–3253. doi: 10.1093/bioinformatics/btm369.
  7. Sebastiani P, Abad M, Ramoni MF. In: Genomic Signal Processing and Statistics, EURASIP Book Series on Signal Processing and Communications. Dougherty ER, Shmulevich I, Chen J, Wang ZJ, editor. Hindawi Publishing Corporation, New York, NY, USA; 2005. Bayesian networks for genomic analysis.
  8. Friedman N, Linial M, Nachman I, Pe'er D. Using Bayesian networks to analyze expression data. Journal of Computational Biology. 2000;7(3-4):601–620. doi: 10.1089/106652700750050961.
  9. Needham CJ, Bradford JR, Bulpitt AJ, Westhead DR. A primer on learning in Bayesian networks for computational biology. PLoS Computational Biology. 2007;3(8):e129. doi: 10.1371/journal.pcbi.0030129.
  10. Chu T, Glymour C, Scheines R, Spirtes P. A statistical problem for inference to regulatory structure from associations of gene expression measurements with microarrays. Bioinformatics. 2003;19(9):1147–1152. doi: 10.1093/bioinformatics/btg011.
  11. Cowell RG, Dawid P, Lauritzen SL, Spiegelhalter DJ. Probabilistic Networks and Expert Systems: Exact Computational Methods for Bayesian Networks, Information Science and Statistics. Springer, New York, NY, USA; 1999.
  12. Cowell RG. Efficient maximum likelihood pedigree reconstruction. Theoretical Population Biology. 2009;76(4):285–291. doi: 10.1016/j.tpb.2009.09.002.
  13. Silander T, Myllymäki P. In: Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI '06), 2006. Dechter R, Richardson T, editor. AUAI Press; A simple approach to finding the globally optimal Bayesian network structure; pp. 445–452.
  14. Chickering DM. In: Learning from Data: Artificial Intelligence and Statistics V. Fisher D, Lenz H, editor. Springer, New York, NY, USA; 1996. Learning Bayesian networks is NP-complete; pp. 121–130.
  15. Chow CK, Liu CN. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory. 1968;14:462–467. doi: 10.1109/TIT.1968.1054142.
  16. Abbeel P, Koller D, Ng AY. Learning factor graphs in polynomial time and sample complexity. Journal of Machine Learning Research. 2006;7:1743–1788.
  17. Murphy K. Software packages for graphical models Bayesian networks. Bulletin of the International Society for Bayesian Analysis. 2007;14:13–15.
  18. Teyssier M, Koller D. Ordering-based search: a simple and effective algorithm for learning Bayesian networks. Proceedings of the 21st Conference on Uncertainty in AI (UAI '05), 2005. pp. 584–590.
  19. Papadimitriou CH, Steiglitz K. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Englewood Cliffs, NJ, USA; 1982.
  20. Welsh AH. Aspects of Statistical Inference. John Wiley & Sons, New York, NY, USA; 1996.
  21. Efron B. Robbins, empirical Bayes and microarrays. Annals of Statistics. 2003;31(2):366–378. doi: 10.1214/aos/1051027871.
  22. Besag J, Clifford P. Sequential Monte Carlo p-values. Biometrika. 1991;78:301–304.
  23. Lock RH. A sequential approximation to a permutation test. Communications in Statistics. Simulation and Computation. 1991;20(1):341–363. doi: 10.1080/03610919108812956.
  24. Fay MP, Follmann DA. Designing Monte Carlo implementations of permutation or bootstrap hypothesis tests. American Statistician. 2002;56(1):63–70. doi: 10.1198/000313002753631385.
  25. Dudoit S, van der Laan MJ. Multiple Testing Procedures with Applications to Genomics. Springer, New York, NY, USA; 2008.
  26. Wald A. Sequential Analysis. John Wiley & Sons, New York, NY, USA; 1947.
  27. Siegmund D. Sequential Analysis: Tests and Confidence Intervals. Springer, New York, NY, USA; 1985.
  28. Almudevar A. Exact confidence regions for species assignment based on DNA markers. Canadian Journal of Statistics. 2000;28(1):81–95.
  29. Zhou X, Kao M-CJ, Wong WH. Transitive functional annotation by shortest-path analysis of gene expression data. Proceedings of the National Academy of Sciences of the United States of America. 2002;99(20):12783–12788. doi: 10.1073/pnas.192159399.
  30. Braun R, Cope L, Parmigiani G. Identifying differential correlation in gene/pathway combinations. BMC Bioinformatics. 2008;9: article no. 488. doi: 10.1186/1471-2105-9-488.
  31. Barry WT, Nobel AB, Wright FA. Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics. 2005;21(9):1943–1949. doi: 10.1093/bioinformatics/bti260.
  32. Jiang Z, Gentleman R. Extensions to gene set enrichment. Bioinformatics. 2007;23(3):306–313. doi: 10.1093/bioinformatics/btl599.
  33. Klebanov L, Glazko G, Salzman P, Yakovlev A, Xiao Y. A multivariate extension of the gene set enrichment analysis. Journal of Bioinformatics and Computational Biology. 2007;5(5):1139–1153. doi: 10.1142/S0219720007003041.
  34. Goeman JJ, Bühlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007;23(8):980–987. doi: 10.1093/bioinformatics/btm051.
  35. Allison DB, Cui X, Page GP, Sabripour M. Microarray data analysis: from disarray to consolidation and consensus. Nature Reviews Genetics. 2006;7(1):55–65. doi: 10.1038/nrg1749.
  36. Bild A, Febbo PG. Application of a priori established gene sets to discover biologically important differential expression in microarray data. Proceedings of the National Academy of Sciences of the United States of America. 2005;102(43):15278–15279. doi: 10.1073/pnas.0507477102.
  37. Manoli T, Gretz N, Gröne H-J, Kenzelmann M, Eils R, Brors B. Group testing for pathway analysis improves comparability of different microarray datasets. Bioinformatics. 2006;22(20):2500–2506. doi: 10.1093/bioinformatics/btl424.
  38. Liu Q, Dinu I, Adewale AJ, Potter JD, Yasui Y. Comparative evaluation of gene-set analysis methods. BMC Bioinformatics. 2007;8: article no. 431. doi: 10.1186/1471-2105-8-431.
  39. Ackermann M, Strimmer K. A general modular framework for gene set enrichment analysis. BMC Bioinformatics. 2009;10: article no. 47. doi: 10.1186/1471-2105-10-47.
  40. Efron B, Tibshirani R. On testing the significance of sets of genes. Annals of Applied Statistics. 2007;1:107–129. doi: 10.1214/07-AOAS101.
  41. Goeman JJ, van de Geer S, de Kort F, van Houwelingen HC. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics. 2004;20(1):93–99. doi: 10.1093/bioinformatics/btg382.
  42. Mansmann U, Meister R. Testing differential gene expression in functional groups: Goeman's global test versus an ANCOVA approach. Methods of Information in Medicine. 2005;44(3):449–453.
  43. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences of the United States of America. 2001;98(9):5116–5121. doi: 10.1073/pnas.091062498.
  44. Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y. Improving gene set analysis of microarray data by SAM-GS. BMC Bioinformatics. 2007;8: article no. 242. doi: 10.1186/1471-2105-8-242.
  45. Almudevar A. A simulated annealing algorithm for maximum likelihood pedigree reconstruction. Theoretical Population Biology. 2003;63(2):63–75. doi: 10.1016/S0040-5809(02)00048-5.

Articles from EURASIP Journal on Bioinformatics and Systems Biology are provided here courtesy of Springer