Author manuscript; available in PMC: 2015 May 1.
Published in final edited form as: Stat Biosci. 2012 Nov 21;6(1):38–54. doi: 10.1007/s12561-012-9076-3

Likelihood-Based Approach to Gene Set Enrichment Analysis with a Finite Mixture Model

Sang Mee Lee 1, Baolin Wu 2, John H Kersey 3
PMCID: PMC4039382  NIHMSID: NIHMS424218  PMID: 24891922

Abstract

In this paper, we study a parametric modeling approach to gene set enrichment analysis. Existing methods have largely relied on nonparametric approaches employing, e.g., categorization, permutation or resampling-based significance analysis. These methods have proven useful, yet they may lack power. By formulating enrichment analysis as a model comparison problem, we adopt a likelihood ratio-based testing approach to assess the significance of enrichment. Through simulation studies and application to gene expression data, we illustrate the competitive performance of the proposed method.

Keywords: Gene set enrichment analysis, Finite mixture model, EM

1 Introduction

Differential expression analysis is a mainstay of microarray experiments. The classical statistical approach tests one gene at a time, computes a p-value for each gene, and then adjusts for multiple comparisons by controlling the familywise error rate or the false discovery rate (FDR, [3]). Although single-gene analysis gives many important insights, it has a few limitations [25]. Genes that contribute only subtle changes in expression may go undetected because the significance cutoff is determined after correcting for multiple testing. Conversely, the analysis can produce a long list of significant genes that is hard to interpret in terms of any underlying genetic pattern. Often a set of genes jointly influences a biological process or a critical function of a metabolic pathway, and gene-by-gene analysis may miss such joint effects. Many researchers have recently proposed gene set-based methods to address these challenges. These approaches are often based on gene sets that have already been annotated by functional category, and they yield more biologically interpretable results. One of the main research questions in gene set inference is gene set enrichment analysis (GSEA): we want to evaluate whether a gene set is enriched in terms of a certain characteristic of interest (e.g., differential expression) relative to other (random) gene sets.

A widely used approach starts from the list of differentially expressed genes derived from single-gene analysis, and then evaluates over-representation of a gene set within that list using Fisher’s exact test, the hypergeometric test, or other tests of independence in a 2 × 2 contingency table. This approach has been modified by many authors (see, e.g., [13] for a review), but the significance results can depend heavily on the selected cutoff value, and information may be lost by discretizing continuous values. An alternative approach is based on distribution comparisons. Typically a gene score, known as the local statistic, is computed for each gene to measure the difference in that gene’s expression across experimental conditions. Then a gene set score (global statistic) summarizing the local statistics within a gene set is compared to that of its complement. Several variations of testing methods have been developed (see, e.g., [2, 6, 19, 22, 25]). Among the existing methods, the random set-based methods proposed by [9] and [20] use standardized test statistics, which are compared to those of random gene sets, with significance assessed by permutation and random sampling. These random set-based methods are the current state of the art in the field. In this paper, we approach GSEA under a likelihood-based testing framework and develop a parametric statistical method for enrichment analysis, which can offer very competitive performance by combining information across all genes.

The rest of the paper is organized as follows. Statistical methods are introduced in Sect. 2, and we develop efficient numerical algorithms for model estimation in Sect. 3. Section 4 is devoted to simulation studies and Sect. 5 discusses applications to leukemia and p53 gene expression data. We end the paper with a discussion in Sect. 6. All technical details are relegated to the Appendix.

2 Statistical Methods

A gene pathway typically consists of a set of genes that jointly influence a system function. Genes are often divided into sets with similar functions based on their annotation information (e.g., the Gene Ontology, [1]). In the following discussion, we refer to all such sources collectively as providing gene set information.

Consider two-class microarray data, and denote the normal-transformed two-sample t-statistics for testing differential expression as $z_i$ for gene $i = 1, \ldots, m$. We propose to model $z_i$ with the following finite normal mixture model:

$$\sum_{k=0}^{K} \theta_k f_k(z), \quad f_k = N(\mu_k, \sigma_0^2), \quad \theta_k > 0, \quad \sum_{k=0}^{K} \theta_k = 1. \tag{1}$$

Here the first component, $N(\mu_0, \sigma_0^2)$, empirically models the null genes; it differs from the theoretical null (the standard normal distribution) and can take into account potential dependence among genes [7]. We can interpret $\theta_0$ as the proportion of null genes, and $\theta_k$ as the proportion of genes with differential expression of magnitude $\mu_k$. In principle, the collection of all $\mu_k$ captures the heterogeneity of differential expression across all genes. We choose $K$ based on BIC [23].
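To make model (1) concrete, the following minimal Python sketch evaluates the mixture density, its log-likelihood, and the Schwarz criterion that can be used to compare candidate values of K. The function names and the `n_free` argument (the number of free parameters, which the caller has to supply) are illustrative assumptions and are not part of the method description above.

```python
import math

def normal_pdf(z, mu, sigma):
    """Density of N(mu, sigma^2) at z."""
    u = (z - mu) / sigma
    return math.exp(-0.5 * u * u) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_density(z, theta, mu, sigma0):
    """Model (1): sum_k theta[k] * N(mu[k], sigma0^2) at z.

    theta[0], mu[0] describe the empirical null component; all
    components share the same standard deviation sigma0.
    """
    return sum(t * normal_pdf(z, m, sigma0) for t, m in zip(theta, mu))

def log_likelihood(zs, theta, mu, sigma0):
    """Log-likelihood of the z-scores zs under the mixture."""
    return sum(math.log(mixture_density(z, theta, mu, sigma0)) for z in zs)

def bic(zs, theta, mu, sigma0, n_free):
    """Schwarz criterion: -2 log L + n_free * log(m); smaller is better.

    n_free is an assumption left to the caller, because it depends on
    which null parameters are treated as estimated.
    """
    return -2.0 * log_likelihood(zs, theta, mu, sigma0) + n_free * math.log(len(zs))
```

In use, one would fit the mixture for several values of K and keep the fit with the smallest `bic` value.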

In enrichment analysis, we test whether a given gene set $A$ is significantly different from a random gene set. Note that a random gene set can be treated as a random sample from all genes. Thus comparing the given set $A$ to a random set is equivalent to comparing $A$ to all genes, which is in turn equivalent to comparing $A$ to the other genes (since $A$ is a subset of all genes). Conceptually the (modified) two-sample t-statistics of the genes in a given set can be modeled by a similar finite mixture model with different proportions for each component,

$$\sum_{k=0}^{K} \nu_k f_k(z), \quad \sum_{k=0}^{K} \nu_k = 1. \tag{2}$$

Under no enrichment, the gene set $A$ and any random gene set have the same proportion of differentially expressed genes. Therefore gene set $A$ ($j = 1$) and all the other genes, denoted $A^c$ ($j = 2$), can be modeled, respectively, with

$$\nu_0 f_0(z) + \sum_{k=1}^{K} \nu_{jk} f_k(z), \quad \nu_0 + \sum_{k=1}^{K} \nu_{jk} = 1, \quad j = 1, 2. \tag{3}$$

Under enrichment, the gene sets $A$ and $A^c$ have different proportions of differentially expressed genes, and hence can be modeled separately with

$$\sum_{k=0}^{K} \eta_{jk} f_k(z), \quad \sum_{k=0}^{K} \eta_{jk} = 1, \quad j = 1, 2. \tag{4}$$

Enrichment analysis corresponds to testing the null hypothesis $\eta_{10} = \eta_{20}$, which can be done with the likelihood ratio statistic, $e_A$, comparing models (3) and (4). The significance of $e_A$ can be approximately assessed using the chi-square distribution with one degree of freedom. Enrichment analysis is a one-sided test: we ask whether gene set $A$ is enriched with more differentially expressed genes than a random set. Therefore we adjust the p-value calculation to $0.5 + F(e_A; 1)/2$ when $\hat\eta_{10} > \hat\eta_{20}$, and $0.5 - F(e_A; 1)/2$ otherwise, where $F(\cdot; df)$ is the $\chi^2_{df}$ distribution function.
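The one-sided p-value adjustment can be sketched in a few lines of Python. This sketch assumes that the two maximized log-likelihoods (under the enrichment model (4) and the no-enrichment model (3)) are already available, and that "enriched" means a smaller estimated null proportion in the gene set than in its complement. The function names are hypothetical. The chi-square distribution function with one degree of freedom is computed from the identity F(x; 1) = erf(sqrt(x/2)).

```python
import math

def chi2_cdf_df1(x):
    """Chi-square CDF with 1 df: P(Z^2 <= x) = erf(sqrt(x / 2))."""
    return math.erf(math.sqrt(x / 2.0)) if x > 0 else 0.0

def one_sided_pvalue(loglik_enriched, loglik_null, eta10_hat, eta20_hat):
    """One-sided enrichment p-value from a likelihood ratio statistic.

    loglik_enriched: maximized log-likelihood of model (4)
    loglik_null:     maximized log-likelihood of model (3)
    eta10_hat, eta20_hat: estimated null proportions in A and in A^c
    """
    # Likelihood ratio statistic e_A, truncated at 0 to absorb small
    # negative values caused by incomplete EM convergence.
    e_A = max(0.0, 2.0 * (loglik_enriched - loglik_null))
    F = chi2_cdf_df1(e_A)
    if eta10_hat < eta20_hat:
        # Fewer null genes in A than in A^c: the enrichment direction.
        return 0.5 - F / 2.0
    return 0.5 + F / 2.0
```

With this convention a set whose estimated null proportion exceeds that of its complement always receives a p-value of at least 0.5.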

In the proposed model, we have assumed that the variance of the individual gene test statistics $z_i$, conditional on the mean expression, is fixed, and that the different mixing proportions capture the differing variability of different gene sets. Therefore it is important to allow the individual mixing proportions to vary across gene sets.

In the following we discuss estimation of the empirical null distribution, and EM algorithms [5] for solving the proposed models (1), (2), and (3).

3 Model Estimation

3.1 Empirical Null Distribution Estimation and Finite Mixture Model Fitting

Efron [8] proposed two methods for estimating $(\theta_0, \mu_0, \sigma_0)$: the geometric and analytical approaches. The geometric approach approximates the marginal log density with a quadratic curve near zero. The analytical approach is based on a truncated normal model, assuming that the non-null distribution has no support in a pre-chosen small interval around zero. The geometric approach yields almost unbiased estimates if $\theta_0$ exceeds 0.9, but it has large variation when estimating $\mu_0$. The analytical approach generally gives more stable estimates, although it depends on the pre-chosen interval. Both methods have been implemented in the R package locfdr. In our simulation studies, we have observed that the analytical approach gives satisfactory results.

Given $K$ and the estimated empirical null distribution parameters $(\hat\theta_0, \hat\mu_0, \hat\sigma_0^2)$, we can estimate $(\mu_k, \theta_k)$ for model (1) iteratively with the EM algorithm as follows (see the Appendix for technical details):

$$\theta_k \leftarrow (1 - \hat\theta_0) \frac{\sum_{i=1}^{m} T_{k,i}}{\sum_{j=1}^{K} \sum_{i=1}^{m} T_{j,i}}, \quad \mu_k \leftarrow \frac{\sum_{i=1}^{m} T_{k,i} z_i}{\sum_{i=1}^{m} T_{k,i}}, \quad k = 1, \ldots, K,$$

where

$$T_{k,i} = \frac{\theta_k f_k(z_i; \mu_k, \hat\sigma_0^2)}{\hat\theta_0 f_0(z_i; \hat\mu_0, \hat\sigma_0^2) + \sum_{j=1}^{K} \theta_j f_j(z_i; \mu_j, \hat\sigma_0^2)}, \quad k > 0,$$

$$f_j(z; \mu, \sigma^2) = \frac{1}{\sigma} \phi\left(\frac{z - \mu}{\sigma}\right).$$

Here $\phi(\cdot)$ is the standard normal density function.
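The EM iteration with a fixed empirical null can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function name, the equal-weight starting values, and the fixed number of iterations are assumptions, and no safeguards against numerical underflow for extreme z-scores are included.

```python
import math

def normal_pdf(z, mu, sigma):
    """Density of N(mu, sigma^2) at z, i.e. phi((z - mu) / sigma) / sigma."""
    u = (z - mu) / sigma
    return math.exp(-0.5 * u * u) / (sigma * math.sqrt(2.0 * math.pi))

def em_fixed_null(zs, theta0, mu0, sigma0, mu_init, n_iter=100):
    """Estimate (theta_k, mu_k), k = 1..K, with the null (theta0, mu0, sigma0) held fixed.

    mu_init holds K starting values for the non-null means.
    Returns (theta, mu), two lists of length K.
    """
    K = len(mu_init)
    mu = list(mu_init)
    theta = [(1.0 - theta0) / K] * K  # assumed start: equal non-null weights
    for _ in range(n_iter):
        sum_T = [0.0] * K
        sum_Tz = [0.0] * K
        for z in zs:
            # E-step: conditional probability T_{k,i} of each non-null component
            num = [theta[k] * normal_pdf(z, mu[k], sigma0) for k in range(K)]
            denom = theta0 * normal_pdf(z, mu0, sigma0) + sum(num)
            for k in range(K):
                T = num[k] / denom
                sum_T[k] += T
                sum_Tz[k] += T * z
        # M-step: non-null proportions share the mass 1 - theta0
        total = sum(sum_T)
        theta = [(1.0 - theta0) * s / total for s in sum_T]
        mu = [sz / s for sz, s in zip(sum_Tz, sum_T)]
    return theta, mu
```

A practical implementation would likely replace the fixed iteration count with a convergence check on the log-likelihood and try several starting values.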

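Occasionally, the analytical approach implemented in locfdr gives an abnormal estimate $\hat\theta_0$ that is larger than 1. In that case we estimate $\theta_0$ in the EM algorithm together with the other parameters as follows:

$$\theta_k \leftarrow \frac{1}{m} \sum_{i=1}^{m} T_{k,i}, \quad k \ge 0, \qquad \mu_k \leftarrow \frac{\sum_{i=1}^{m} T_{k,i} z_i}{\sum_{i=1}^{m} T_{k,i}}, \quad k > 0,$$

where

$$T_{0,i} = \frac{\theta_0 f_0(z_i; \hat\mu_0, \hat\sigma_0^2)}{\theta_0 f_0(z_i; \hat\mu_0, \hat\sigma_0^2) + \sum_{j=1}^{K} \theta_j f_j(z_i; \mu_j, \hat\sigma_0^2)}.$$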

3.2 Gene Set Model Fitting

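Given the estimates $(\hat\mu_0, \hat\sigma_0^2, \hat\mu_k)$ based on all genes, we can fit the individual model (2) for a given set $A$ iteratively as follows (see the Appendix for technical details):

$$\nu_k \leftarrow \frac{\sum_{i \in A} T_{k,i}}{m_A},$$

where $m_A$ is the size of set $A$ and, for gene $i$ in set $A$,

$$T_{k,i} = \frac{\nu_k f_k(z_i; \hat\mu_k, \hat\sigma_0^2)}{\sum_{j=0}^{K} \nu_j f_j(z_i; \hat\mu_j, \hat\sigma_0^2)}.$$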
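Because the component means and the common variance are held fixed here, only the mixing proportions are updated, so the component densities can be computed once. The sketch below illustrates this under stated assumptions: the function name, the uniform starting proportions, and the fixed iteration count are hypothetical choices.

```python
import math

def normal_pdf(z, mu, sigma):
    """Density of N(mu, sigma^2) at z."""
    u = (z - mu) / sigma
    return math.exp(-0.5 * u * u) / (sigma * math.sqrt(2.0 * math.pi))

def fit_set_proportions(z_set, mu_hat, sigma0, n_iter=200):
    """EM for the proportions nu_0..nu_K of model (2) within one gene set.

    z_set:  z-scores of the genes in the set
    mu_hat: fixed component means, with mu_hat[0] the null mean
    Returns the list nu of length K + 1.
    """
    n_comp = len(mu_hat)
    # Component densities do not change across iterations: compute them once.
    dens = [[normal_pdf(z, m, sigma0) for m in mu_hat] for z in z_set]
    nu = [1.0 / n_comp] * n_comp  # assumed start: uniform proportions
    for _ in range(n_iter):
        sums = [0.0] * n_comp
        for row in dens:
            weighted = [nu[k] * row[k] for k in range(n_comp)]
            total = sum(weighted)
            for k in range(n_comp):
                sums[k] += weighted[k] / total  # T_{k,i}
        nu = [s / len(z_set) for s in sums]     # nu_k <- sum_i T_{k,i} / m_A
    return nu
```

Fitting this function separately to the set and to its complement gives the proportion estimates needed for the enrichment model (4).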

3.3 Model Fitting for a Gene Set and All the Other Genes Under no Enrichment

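Under no enrichment, we can similarly estimate the mixture model (3) using the EM algorithm. Denote by $A^c$ the complement of set $A$. Let

$$T_{0,i} = \frac{\nu_0 f_0(z_i; \hat\mu_0, \hat\sigma_0^2)}{\nu_0 f_0(z_i; \hat\mu_0, \hat\sigma_0^2) + \sum_{l=1}^{K} \nu_{1l} f_l(z_i; \hat\mu_l, \hat\sigma_0^2)}, \quad i \in A,$$

$$T_{k,i} = \frac{\nu_{1k} f_k(z_i; \hat\mu_k, \hat\sigma_0^2)}{\nu_0 f_0(z_i; \hat\mu_0, \hat\sigma_0^2) + \sum_{l=1}^{K} \nu_{1l} f_l(z_i; \hat\mu_l, \hat\sigma_0^2)}, \quad i \in A, \; k > 0,$$

$$T_{0,j} = \frac{\nu_0 f_0(z_j; \hat\mu_0, \hat\sigma_0^2)}{\nu_0 f_0(z_j; \hat\mu_0, \hat\sigma_0^2) + \sum_{l=1}^{K} \nu_{2l} f_l(z_j; \hat\mu_l, \hat\sigma_0^2)}, \quad j \in A^c,$$

$$T_{k,j} = \frac{\nu_{2k} f_k(z_j; \hat\mu_k, \hat\sigma_0^2)}{\nu_0 f_0(z_j; \hat\mu_0, \hat\sigma_0^2) + \sum_{l=1}^{K} \nu_{2l} f_l(z_j; \hat\mu_l, \hat\sigma_0^2)}, \quad j \in A^c, \; k > 0.$$

We then iteratively update the parameters as follows (see the Appendix for technical details):

$$\nu_0 \leftarrow \frac{1}{m} \left( \sum_{i \in A} T_{0,i} + \sum_{j \in A^c} T_{0,j} \right),$$

$$\nu_{1k} \leftarrow (1 - \nu_0) \frac{\sum_{i \in A} T_{k,i}}{\sum_{l=1}^{K} \sum_{i \in A} T_{l,i}},$$

$$\nu_{2k} \leftarrow (1 - \nu_0) \frac{\sum_{j \in A^c} T_{k,j}}{\sum_{l=1}^{K} \sum_{j \in A^c} T_{l,j}}.$$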
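The constrained fit, in which the set and its complement share one null proportion but keep their own non-null proportions, can be sketched as below. The helper and function names, the starting values, and the fixed iteration count are illustrative assumptions rather than details taken from the paper.

```python
import math

def normal_pdf(z, mu, sigma):
    """Density of N(mu, sigma^2) at z."""
    u = (z - mu) / sigma
    return math.exp(-0.5 * u * u) / (sigma * math.sqrt(2.0 * math.pi))

def _responsibility_sums(zs, nu0, nu_nonnull, mu_hat, sigma0):
    """E-step for one group: return (sum_i T_{0,i}, [sum_i T_{k,i} for k = 1..K])."""
    K = len(nu_nonnull)
    s0, sk = 0.0, [0.0] * K
    for z in zs:
        w0 = nu0 * normal_pdf(z, mu_hat[0], sigma0)
        wk = [nu_nonnull[k] * normal_pdf(z, mu_hat[k + 1], sigma0) for k in range(K)]
        total = w0 + sum(wk)
        s0 += w0 / total
        for k in range(K):
            sk[k] += wk[k] / total
    return s0, sk

def fit_no_enrichment(z_A, z_Ac, mu_hat, sigma0, n_iter=200):
    """EM for model (3): shared nu0, separate non-null proportions nu1 (A) and nu2 (A^c).

    mu_hat[0] is the fixed null mean and mu_hat[1:] the fixed non-null means.
    """
    K = len(mu_hat) - 1
    m = len(z_A) + len(z_Ac)
    nu0 = 0.5                 # assumed starting values
    nu1 = [0.5 / K] * K
    nu2 = [0.5 / K] * K
    for _ in range(n_iter):
        s0_A, sk_A = _responsibility_sums(z_A, nu0, nu1, mu_hat, sigma0)
        s0_Ac, sk_Ac = _responsibility_sums(z_Ac, nu0, nu2, mu_hat, sigma0)
        # M-step: pooled null proportion, then rescale each group's non-null part
        nu0 = (s0_A + s0_Ac) / m
        nu1 = [(1.0 - nu0) * s / sum(sk_A) for s in sk_A]
        nu2 = [(1.0 - nu0) * s / sum(sk_Ac) for s in sk_Ac]
    return nu0, nu1, nu2
```

The maximized log-likelihood of this fit is what enters the likelihood ratio statistic as the no-enrichment model.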

Next we conduct a simulation study to compare the proposed likelihood-based method (denoted as Lrt) to the GSA approach (using the maxmean test statistic) studied in [9].

4 Simulation Study

For $2 \times 10^4$ genes from two groups, each with $n$ samples, we simulate expression values from the conditional normal distribution. The expression variance $\sigma_j^2$ is simulated individually for each gene from a $\chi^2$ distribution with 10 degrees of freedom. This mimics the commonly observed large variation of gene variances in microarray data. We simulate dependence by dividing the genes into 200 blocks, each with $m_g = 100$ genes and within-block pairwise gene correlation $\rho_g$. The block correlation parameter $\rho_g$ is randomly simulated from a Beta distribution, Beta(2, 2). We randomly set $m_g \theta_0$ genes in each block as null. The standardized differences of the non-null genes, $(\mu_{1j} - \mu_{2j})/\sigma_j$, are randomly simulated from a mixture of two shifted Beta distributions, 0.5 + Beta(2, 2) and −0.5 − Beta(2, 2), with equal probabilities.
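The parameter-generating part of this design can be sketched with the Python standard library. The sketch only draws the per-block correlations, per-gene variances, and standardized differences; it does not generate the correlated expression values themselves. The function name, the returned dictionary layout, and the seed handling are assumptions made for illustration.

```python
import random

def simulate_block_parameters(n_blocks=200, m_g=100, theta0=0.9, seed=0):
    """Draw simulation parameters block by block.

    Returns a list with one dict per block holding the within-block
    correlation, the gene variances, and the standardized differences
    (0.0 for null genes).
    """
    rng = random.Random(seed)
    n_null = round(m_g * theta0)
    blocks = []
    for _ in range(n_blocks):
        rho = rng.betavariate(2, 2)  # within-block pairwise correlation
        # A chi-square distribution with 10 df equals Gamma(shape=5, scale=2).
        variances = [rng.gammavariate(5.0, 2.0) for _ in range(m_g)]
        effects = [0.0] * n_null     # null genes: no mean difference
        for _ in range(m_g - n_null):
            magnitude = 0.5 + rng.betavariate(2, 2)     # lies in [0.5, 1.5]
            sign = 1.0 if rng.random() < 0.5 else -1.0  # equal mixing probabilities
            effects.append(sign * magnitude)
        rng.shuffle(effects)         # null genes placed at random within the block
        blocks.append({"rho": rho, "variances": variances, "effects": effects})
    return blocks
```

The default arguments reproduce the stated dimensions: 200 blocks of 100 genes give the 2 × 10^4 genes of the design.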

We consider three types of gene set, each with $m_e$ genes and a different dependence structure. The first type has a dependence structure similar to that of all genes and is sampled from all $G = 200$ blocks. The other two types exhibit relatively stronger dependence and are sampled from the first $G = 30$ and $G = 50$ blocks, respectively. This mimics commonly observed gene pathways whose genes interact strongly with each other. For every type of gene set, we consider two enrichment scenarios. In the first, the non-null genes in the gene set are randomly sampled from all differentially expressed genes. In the second, the non-null genes in the gene set are all up-regulated (i.e., the gene set is enriched with different differential expression categories compared to all the other genes).

For size evaluation, we randomly sample $m_e \theta_0$ null and $m_e (1 - \theta_0)$ non-null genes, and compute the enrichment p-values based on Lrt and GSA in each simulation. For power comparison, we consider gene sets with randomly sampled $m_e \theta_e$ null and $m_e (1 - \theta_e)$ non-null genes.

In the simulation, we set $n = 15$, $\theta_0 = 0.9$, and consider two sets of scenarios: (1) $\theta_e = (0.86, 0.82, 0.78)$ and $m_e = (100, 200, 300)$, and (2) $\theta_e = (0.8, 0.7, 0.6)$ and $m_e = (10, 20, 50)$, which allows us to investigate performance under different gene set sizes. In the second scenario, with relatively small gene sets, $\theta_e$ is chosen to yield a meaningful number of differentially expressed genes.

The proposed Lrt performs better than GSA under all simulation settings, with similar patterns throughout. Here we report the results for $m_e = (100, 200, 300)$ with non-null genes sampled from all differentially expressed genes. The complete results are provided in the supplementary materials.

Table 1 summarizes the estimated sizes at nominal Type I error levels $\alpha = (0.01, 0.05, 0.10)$ over 1000 simulations. Both methods have approximately the right size. The proposed Lrt is in general more conservative than GSA, whose Type I error can be inflated at relatively large significance levels.

Table 1.

Estimated type I error of Lrt and GSA over 1000 simulations (listed within parentheses are the standard errors). Non-null genes are randomly sampled from all differentially expressed genes

Entries are the estimated type I error $\hat\alpha$ at each nominal level $\alpha$
$m_e$ $G$ Method α = 0.01 α = 0.05 α = 0.10
100 200 Lrt 0.002 (5e-5) 0.014 (4e-4) 0.038 (1e-3)
100 200 GSA 0.009 (3e-4) 0.076 (2e-3) 0.176 (5e-3)
100 50 Lrt 0.005 (2e-4) 0.019 (6e-4) 0.046 (1e-3)
100 50 GSA 0.013 (4e-4) 0.085 (2e-3) 0.172 (5e-3)
100 30 Lrt 0.005 (2e-4) 0.026 (8e-4) 0.068 (2e-3)
100 30 GSA 0.012 (4e-4) 0.071 (2e-3) 0.156 (4e-3)
200 200 Lrt 0.001 (3e-5) 0.011 (3e-4) 0.035 (1e-3)
200 200 GSA 0.012 (4e-4) 0.077 (2e-3) 0.166 (4e-3)
200 50 Lrt 0.002 (6e-5) 0.034 (1e-3) 0.071 (2e-3)
200 50 GSA 0.012 (4e-4) 0.066 (2e-3) 0.173 (5e-3)
200 30 Lrt 0.012 (4e-4) 0.046 (1e-3) 0.089 (3e-3)
200 30 GSA 0.009 (3e-4) 0.067 (2e-3) 0.140 (4e-3)
300 200 Lrt 0.001 (3e-5) 0.009 (3e-4) 0.030 (9e-4)
300 200 GSA 0.007 (2e-4) 0.072 (2e-3) 0.158 (4e-3)
300 50 Lrt 0.007 (2e-4) 0.028 (9e-4) 0.072 (2e-3)
300 50 GSA 0.006 (2e-4) 0.060 (2e-3) 0.149 (4e-3)
300 30 Lrt 0.020 (6e-4) 0.051 (2e-3) 0.105 (3e-3)
300 30 GSA 0.008 (3e-4) 0.055 (2e-3) 0.126 (3e-3)

Figures 1, 2 and 3 summarize the power averaged over 1000 simulations for $m_e = (300, 200, 100)$, respectively. The red solid/dashed/dotted lines are the estimated power of Lrt under $\theta_e = (0.86, 0.82, 0.78)$, and the black lines are the corresponding power of GSA. Overall the proposed Lrt has very competitive performance compared to GSA under all settings. In general both methods lose power with increasing gene interactions within a given set and with decreasing gene set size $m_e$. With increasing gene set size $m_e$, we observe a relatively larger performance difference between the two methods.

Fig. 1.

Power of Lrt and GSA averaged over 1000 simulations for $m_e = 300$. The horizontal axis corresponds to type I error

Fig. 2.

Power of Lrt and GSA averaged over 1000 simulations for $m_e = 200$. The horizontal axis corresponds to type I error

Fig. 3.

Power of Lrt and GSA averaged over 1000 simulations for $m_e = 100$. The horizontal axis corresponds to type I error

Next we analyze leukemia and p53 gene expression microarray data to illustrate the relative performance of the proposed likelihood-based method and GSA.

5 Application to Leukemia and p53 Gene Expression Data

The leukemia gene expression data reported in [15] measure the expression of 45101 genes in five paired controls and Meis1-knockdown cases. We identified 522 gene pathways from the C2 functional collection in the Molecular Signatures Database [25]. Pathway sizes range from 2 to 365 genes. We analyze in total 357 pathways that have more than 10 genes.

To improve the accuracy of the normal distribution approximation, we apply the empirical Bayes modeling approach of [24], which computes a moderated t-statistic, $t_i$, for gene $i$ by pooling information across all genes to obtain an improved sample variance estimate (implemented in the R package limma). We then apply the normal distribution transformation to the moderated t-statistic, $z_i = \Phi^{-1}(T_d(t_i))$, where $\Phi(\cdot)$ is the standard normal distribution function and $T_d(\cdot)$ is the t-distribution function with $d$ degrees of freedom. Here the degrees of freedom $d$ are estimated from all genes using the empirical Bayes modeling approach.
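The normal transformation of a t-statistic can be illustrated with the Python standard library alone. In this sketch the t-distribution function is obtained by numerically integrating the t density with composite Simpson's rule, which is a choice made here for self-containment and not the method used in the paper. The function names are hypothetical, and the approach is only adequate for moderate |t|, because the upper tail probability loses precision when it is computed as a difference from 0.5.

```python
import math
from statistics import NormalDist

def t_pdf(x, d):
    """Density of the t-distribution with d (possibly non-integer) degrees of freedom."""
    log_c = math.lgamma((d + 1) / 2) - math.lgamma(d / 2) - 0.5 * math.log(d * math.pi)
    return math.exp(log_c - (d + 1) / 2 * math.log1p(x * x / d))

def t_cdf(t, d, n_steps=2000):
    """T_d(t) by composite Simpson's rule on [0, |t|], using symmetry about 0.

    n_steps must be even.
    """
    a = abs(t)
    if a == 0.0:
        return 0.5
    h = a / n_steps
    s = t_pdf(0.0, d) + t_pdf(a, d)
    for i in range(1, n_steps):
        s += (4 if i % 2 else 2) * t_pdf(i * h, d)
    area = s * h / 3.0
    return 0.5 + area if t > 0 else 0.5 - area

def t_to_z(t, d):
    """Map a t-statistic to a z-score through z = Phi^{-1}(T_d(t))."""
    eps = 1e-15
    p = min(max(t_cdf(t, d), eps), 1.0 - eps)  # keep p strictly inside (0, 1)
    return NormalDist().inv_cdf(p)
```

For example, the 0.975 quantile of the t-distribution with 10 degrees of freedom (about 2.228) maps to approximately 1.96, the corresponding standard normal quantile.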

When applied to the leukemia microarray data, controlling FDR at 0.05/0.1, the proposed Lrt detected 29/51 significant gene sets, while no gene pathway is identified as significant with GSA. Figure 4 shows the number of significant pathways versus the estimated FDR for Lrt and GSA.
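Counting significant gene sets at a given FDR level can be sketched as follows. The text does not state which FDR estimator produced the reported counts, so this sketch assumes the Benjamini-Hochberg step-up procedure [3]; the function names are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def count_significant(pvalues, fdr_level):
    """Number of gene sets whose adjusted p-value is at most fdr_level."""
    return sum(q <= fdr_level for q in bh_adjust(pvalues))
```

Evaluating `count_significant` over a grid of FDR levels gives a curve of the number of significant pathways against the FDR.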

Fig. 4.

The number of significant pathways versus FDR for the leukemia data

Table 2 lists the top 29 significant pathways identified by the proposed method. Many of them are closely related to cancer development. For example, several identified pathways are related to cell cycle, which is known to play an important role in cancer development: cell cycle machinery controls cell proliferation, and cancer is a disease of inappropriate cell proliferation [4]. The atrbrcaPathway is also closely related to cell cycle and cancer. Specifically the ATR gene serves as a checkpoint kinase that halts cell cycle progression and induces DNA repair when DNA is damaged. Loss of ATR results in a loss of checkpoint control in response to DNA damage, leading to cell death (see http://www.biocarta.com/pathfiles/h_ATRBRCAPATHWAY.asp). Liu et al. [18] have shown the important role of ATR in cell cycle control in MLL/Meis1 leukemia. The DNA damage signaling pathway is linked to DNA repair, cell-cycle control, growth arrest, and plays an important role in cancer development.

Table 2.

Top 29 most significant pathways identified with the proposed likelihood-based method

Pathway # genes p-value
Cell_Cycle 73 2E-13
CR_CELL_CYCLE 74 5E-11
atrbrcaPathway 18 7E-07
CR_REPAIR 35 4E-06
GLUT_DOWN 230 5E-06
cell_cycle_checkpoint 22 1E-05
DNA_DAMAGE_SIGNALING 85 1E-05
HTERT_UP 94 2E-05
CR_DNA_MET_AND_MOD 20 3E-05
LEU_DOWN 130 3E-05
cell_cycle_regulator 20 6E-05
rbPathway 11 0.0001
cell_cycle_arrest 27 0.0005
hdacPathway 28 0.0008
RAP_DOWN 169 0.0010
SA_REG_CASCADE_OF_CYCLIN_EXPR 12 0.0011
il7Pathway 16 0.0015
mRNA_processing 40 0.0018
shh_lisa 15 0.0019
GLUCOSE_DOWN 122 0.0020
MAP00020_Citrate_cycle_TCA_cycle 16 0.0022
cellcyclePathway 22 0.0022
mRNA_splicing 45 0.0023
SIG_IL4RECEPTOR_IN_B_LYPHOCYTES 26 0.0025
caspasePathway 21 0.0028
crebPathway 25 0.0030
eif4Pathway 24 0.0034
MAP00240_Pyrimidine_metabolism 38 0.0035
nfatPathway 49 0.0040

The p53 expression data are available at http://www.broadinstitute.org/gsea/datasets.jsp, and consist of 12625 genes from 33 p53 mutant and 17 p53+ cancer cell lines. We analyze in total 453 pathways that have more than 10 genes from the C2 functional collection.

Controlling FDR at 0.01/0.05, the proposed Lrt detected 26/50 significant gene sets, and GSA detected 3/8 significant gene sets. Figure 5 shows the number of significant pathways versus the estimated FDR for Lrt and GSA. Table 3 lists the top-ranked pathways by Lrt and GSA (controlling FDR at 0.05).

Fig. 5.

The number of significant pathways versus FDR for the p53 data

Table 3.

Significantly enriched pathways for the p53 data identified by Lrt and GSA (FDR ≤ 0.05)

Lrt
Pathway # genes p-value
P53_UP 49 6.3E-09
p53Pathway 43 1.2E-08
rasPathway 41 4.3E-08
GLUT_UP 294 4.0E-07
SA_PROGRAMMED_CELL_DEATH 24 5.5E-07
mitochondriaPathway 32 6.4E-07
HTERT_UP 135 6.7E-07
p53hypoxiaPathway 36 7.3E-07
SA_G1_AND_S_PHASES 26 2.3E-06
ceramidePathway 48 1.2E-05
radiation_sensitivity 61 1.3E-05
fmlppathway 65 1.3E-05
hivnefPathway 100 3.0E-05
DNA_DAMAGE_SIGNALING 154 3.0E-05
hsp27Pathway 34 3.9E-05
XINACT_MERGED 26 3.9E-05
insulinPathway 44 1.1E-04
badPathway 43 1.3E-04
integrinPathway 60 1.6E-04
igf1Pathway 47 2.9E-04
g2Pathway 41 3.4E-04
atmPathway 43 3.6E-04
tcrPathway 85 3.6E-04
Glycogen_Metabolism 50 4.6E-04
tsp1Pathway 17 4.6E-04
tall1Pathway 23 5.0E-04
cdmacPathway 32 8.2E-04
metPathway 71 1.0E-03
at1rPathway 64 1.2E-03
ngfPathway 36 1.3E-03
cxcr4Pathway 41 1.3E-03
bcl2family_and_reg_network 50 1.4E-03
eif2Pathway 12 1.5E-03
mef2dPathway 27 1.5E-03
spryPathway 27 1.8E-03
eea1Pathway 12 1.8E-03
CR_DEATH 114 2.1E-03
pgc1aPathway 35 2.6E-03
relaPathway 26 2.7E-03
rnaPathway 17 3.2E-03
ecmPathway 36 3.3E-03
INSULIN_2F_UP 200 3.5E-03
pyk2Pathway 57 3.7E-03
SA_FAS_SIGNALING 14 4.0E-03
chemicalPathway 46 4.0E-03
deathPathway 56 4.5E-03
Cell_Cycle 115 4.6E-03
breast_cancer_estrogen_signaling 162 4.7E-03
tollPathway 45 5.1E-03
SA_B_CELL_RECEPTOR_COMPLEXES 46 5.3E-03
GSA
Pathway # genes p-value Lrt rank
P53_UP 49 8.0E-6 1
p53Pathway 43 9.0E-6 2
p53hypoxiaPathway 36 1.0E-5 8
badPathway 43 1.1E-4 18
radiation_sensitivity 61 3.3E-4 11
SA_PROGRAMMED_CELL_DEATH 24 3.5E-4 5
rasPathway 41 7.2E-4 3
SA_G1_AND_S_PHASES 26 7.4E-4 9

We can see that the eight significant pathways identified by GSA are all detected by Lrt. Many of the pathways identified by Lrt are Biocarta pathways, e.g., ceramidePathway, fmlppathway, hivnefPathway, hsp27Pathway, insulinPathway, badPathway, integrinPathway, igf1Pathway, g2Pathway, and atmPathway. Most of them have been studied and shown to be related to the p53 gene (see, e.g., [10–12, 14, 16, 17, 21]). For example, the ATM gene interacts with the p53 gene to cause the disease ataxia telangiectasia, which involves an inherited predisposition to some cancers (http://www.biocarta.com/pathfiles/h_atmPathway.asp). The hsp27 gene modulates p53 signaling [21]. The igf1Pathway interacts strongly with the p53 signaling pathway, and together they regulate cell growth, proliferation, and death [16]. The g2Pathway consists of genes involved in the cell cycle G2/M checkpoint event, in which the p53 gene plays an important role (http://www.biocarta.com/pathfiles/h_g2Pathway.asp).

6 Discussion

The GSEA approach first proposed and studied in [19] and [25] provides a novel way to interpret large-scale gene expression data. Compared to individual gene-oriented analysis, gene set-based inference can often produce meaningful, easy-to-interpret results and provide additional insight into the underlying biological processes. Many simple, ad hoc statistical methods based on categorization (e.g., the widely used hypergeometric testing approach) are becoming routine in practice for gene set significance assessment. Nonparametric methods based on permutation and random sampling have been proposed and shown to be more powerful, but they can be computationally intensive. We approach GSEA from a likelihood framework and transform it into a model comparison problem, which can be addressed using the powerful likelihood ratio test. Through applications and simulation studies we have demonstrated the competitive performance of the proposed method. An interesting extension is to develop a similar method for multi-group comparison problems, which can be approached using a finite mixture of chi-square distributions. We will report those results elsewhere.

Supplementary Material

1

Acknowledgements

This research was supported in part by a Biomedical Informatics and Computational Biology research grant from the University of Minnesota-Rochester, and by National Institutes of Health grants CA134848 and GM083345. We are grateful to the University of Minnesota Supercomputing Institute for assistance with the computations. We would like to thank the associate editor and two anonymous referees for their constructive comments, which have dramatically improved the presentation of the paper.

Appendix

EM Algorithm for Estimating the Finite Mixture Model

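We begin with the finite mixture model in (1) given $(\hat\theta_0, \hat\mu_0, \hat\sigma_0^2)$ and $K$. Define indicators $w_{ik} \in \{0, 1\}$ following a multinomial distribution, $\Pr(w_{ik} = 1) = \theta_k$, $\sum_{k=0}^{K} w_{ik} = 1$, and conditionally we assume $z_i \mid w_{ik} = 1 \sim f_k$. The complete data likelihood function for $(z_i, w_{ik})$ can be written as

$$\prod_{i=1}^{m} \left\{ \hat\theta_0 f_0(z_i; \hat\mu_0, \hat\sigma_0^2) \right\}^{w_{i0}} \prod_{k=1}^{K} \left\{ \theta_k f_k(z_i; \mu_k, \hat\sigma_0^2) \right\}^{w_{ik}}.$$

In the E-step, the conditional probabilities can be checked to be

$$T_{0,i} = \frac{\hat\theta_0 f_0(z_i; \hat\mu_0, \hat\sigma_0^2)}{\hat\theta_0 f_0(z_i; \hat\mu_0, \hat\sigma_0^2) + \sum_{k=1}^{K} \theta_k f_k(z_i; \mu_k, \hat\sigma_0^2)},$$

$$T_{k,i} = \frac{\theta_k f_k(z_i; \mu_k, \hat\sigma_0^2)}{\hat\theta_0 f_0(z_i; \hat\mu_0, \hat\sigma_0^2) + \sum_{j=1}^{K} \theta_j f_j(z_i; \mu_j, \hat\sigma_0^2)}.$$

In the M-step, the conditional expected log likelihood can be checked to equal, up to an additive constant,

$$\sum_{i=1}^{m} \left\{ T_{0,i} \left( \log \hat\theta_0 - \log \hat\sigma_0 - \frac{(z_i - \hat\mu_0)^2}{2 \hat\sigma_0^2} \right) + \sum_{k=1}^{K} T_{k,i} \left( \log \theta_k - \log \hat\sigma_0 - \frac{(z_i - \mu_k)^2}{2 \hat\sigma_0^2} \right) \right\},$$

which can be easily verified to be maximized by

$$\hat\theta_k = (1 - \hat\theta_0) \frac{\sum_{i=1}^{m} T_{k,i}}{\sum_{j=1}^{K} \sum_{i=1}^{m} T_{j,i}}, \quad \hat\mu_k = \frac{\sum_{i=1}^{m} T_{k,i} z_i}{\sum_{i=1}^{m} T_{k,i}}, \quad k \ge 1.$$

Given only $(\hat\mu_0, \hat\sigma_0^2)$, with $\theta_0$ also being a parameter, we have

$$(\hat\theta_0, \ldots, \hat\theta_K) = \arg\max_{\theta_k} \sum_{k=0}^{K} \sum_{i=1}^{m} T_{k,i} \log \theta_k,$$

$$\hat\mu_k = \arg\min_{\mu_k} \sum_{i=1}^{m} \sum_{k=1}^{K} T_{k,i} (z_i - \mu_k)^2, \quad k > 0.$$

We can easily check that

$$\hat\theta_k = \frac{1}{m} \sum_{i=1}^{m} T_{k,i}, \quad k \ge 0, \qquad \hat\mu_k = \frac{\sum_{i=1}^{m} T_{k,i} z_i}{\sum_{i=1}^{m} T_{k,i}}, \quad k > 0.$$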

EM Algorithm for Estimating the Gene Set Model

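The complete data likelihood function for a gene set $A$ given $(\hat\mu_0, \hat\sigma_0^2, \hat\mu_k)$ is

$$\prod_{i \in A} \prod_{k=0}^{K} \left\{ \nu_k f_k(z_i; \hat\mu_k, \hat\sigma_0^2) \right\}^{w_{ik}}.$$

Up to terms that do not involve $\nu_k$, the conditional expected log likelihood can easily be checked to be

$$\sum_{i \in A} \sum_{k=0}^{K} T_{k,i} \log \nu_k, \quad T_{k,i} = \frac{\nu_k f_k(z_i; \hat\mu_k, \hat\sigma_0^2)}{\sum_{j=0}^{K} \nu_j f_j(z_i; \hat\mu_j, \hat\sigma_0^2)}, \quad k \ge 0.$$

We can easily verify that

$$\hat\nu_k = \frac{\sum_{i \in A} T_{k,i}}{m_A}, \quad k \ge 0.$$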

EM Algorithm for Estimating the Model Under no Enrichment

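The complete data likelihood can be written as

$$\prod_{i \in A} \left\{ \nu_0 f_0(z_i; \hat\mu_0, \hat\sigma_0^2) \right\}^{w_{i0}} \prod_{k=1}^{K} \left\{ \nu_{1k} f_k(z_i; \hat\mu_k, \hat\sigma_0^2) \right\}^{w_{ik}} \prod_{j \in A^c} \left\{ \nu_0 f_0(z_j; \hat\mu_0, \hat\sigma_0^2) \right\}^{w_{j0}} \prod_{k=1}^{K} \left\{ \nu_{2k} f_k(z_j; \hat\mu_k, \hat\sigma_0^2) \right\}^{w_{jk}},$$

where $\nu_0 + \sum_{k=1}^{K} \nu_{lk} = 1$, $l = 1, 2$. Up to terms that do not involve the mixing proportions, the conditional expected log likelihood can be easily checked to be

$$\sum_{i \in A} \left\{ T_{0,i} \log \nu_0 + \sum_{k=1}^{K} T_{k,i} \log \nu_{1k} \right\} + \sum_{j \in A^c} \left\{ T_{0,j} \log \nu_0 + \sum_{k=1}^{K} T_{k,j} \log \nu_{2k} \right\},$$

where

$$T_{0,i} = \frac{\nu_0 f_0(z_i; \hat\mu_0, \hat\sigma_0^2)}{\nu_0 f_0(z_i; \hat\mu_0, \hat\sigma_0^2) + \sum_{l=1}^{K} \nu_{1l} f_l(z_i; \hat\mu_l, \hat\sigma_0^2)}, \quad i \in A,$$

$$T_{k,i} = \frac{\nu_{1k} f_k(z_i; \hat\mu_k, \hat\sigma_0^2)}{\nu_0 f_0(z_i; \hat\mu_0, \hat\sigma_0^2) + \sum_{l=1}^{K} \nu_{1l} f_l(z_i; \hat\mu_l, \hat\sigma_0^2)}, \quad i \in A, \; k > 0,$$

$$T_{0,j} = \frac{\nu_0 f_0(z_j; \hat\mu_0, \hat\sigma_0^2)}{\nu_0 f_0(z_j; \hat\mu_0, \hat\sigma_0^2) + \sum_{l=1}^{K} \nu_{2l} f_l(z_j; \hat\mu_l, \hat\sigma_0^2)}, \quad j \in A^c,$$

$$T_{k,j} = \frac{\nu_{2k} f_k(z_j; \hat\mu_k, \hat\sigma_0^2)}{\nu_0 f_0(z_j; \hat\mu_0, \hat\sigma_0^2) + \sum_{l=1}^{K} \nu_{2l} f_l(z_j; \hat\mu_l, \hat\sigma_0^2)}, \quad j \in A^c, \; k > 0.$$

To maximize the conditional log likelihood, we use the Lagrange multiplier method

$$Q = \sum_{i \in A} \left\{ T_{0,i} \log \nu_0 + \sum_{k=1}^{K} T_{k,i} \log \nu_{1k} \right\} + \sum_{j \in A^c} \left\{ T_{0,j} \log \nu_0 + \sum_{k=1}^{K} T_{k,j} \log \nu_{2k} \right\} - \lambda_1 \left( \nu_0 + \sum_{k=1}^{K} \nu_{1k} - 1 \right) - \lambda_2 \left( \nu_0 + \sum_{k=1}^{K} \nu_{2k} - 1 \right).$$

Setting the gradient vector $\nabla Q = 0$ yields the following equations:

$$\frac{\partial Q}{\partial \nu_0} = \frac{\sum_{i \in A} T_{0,i}}{\nu_0} + \frac{\sum_{j \in A^c} T_{0,j}}{\nu_0} - \lambda_1 - \lambda_2 = 0,$$

$$\frac{\partial Q}{\partial \nu_{1k}} = \frac{\sum_{i \in A} T_{k,i}}{\nu_{1k}} - \lambda_1 = 0, \quad k > 0,$$

$$\frac{\partial Q}{\partial \nu_{2k}} = \frac{\sum_{j \in A^c} T_{k,j}}{\nu_{2k}} - \lambda_2 = 0, \quad k > 0,$$

$$\frac{\partial Q}{\partial \lambda_1} = -\left( \nu_0 + \sum_{k=1}^{K} \nu_{1k} - 1 \right) = 0,$$

$$\frac{\partial Q}{\partial \lambda_2} = -\left( \nu_0 + \sum_{k=1}^{K} \nu_{2k} - 1 \right) = 0.$$

From the first three equations we can obtain

$$\nu_0 = \frac{\sum_{i \in A} T_{0,i} + \sum_{j \in A^c} T_{0,j}}{\lambda_1 + \lambda_2}, \quad \nu_{1k} = \frac{\sum_{i \in A} T_{k,i}}{\lambda_1}, \quad \nu_{2k} = \frac{\sum_{j \in A^c} T_{k,j}}{\lambda_2}, \quad k > 0.$$

When plugging these into the last two equations, we obtain

$$\hat\nu_0 = \frac{\sum_{i \in A} T_{0,i} + \sum_{j \in A^c} T_{0,j}}{m},$$

and

$$\hat\nu_{1k} = (1 - \hat\nu_0) \frac{\sum_{i \in A} T_{k,i}}{\sum_{l=1}^{K} \sum_{i \in A} T_{l,i}}, \quad \hat\nu_{2k} = (1 - \hat\nu_0) \frac{\sum_{j \in A^c} T_{k,j}}{\sum_{l=1}^{K} \sum_{j \in A^c} T_{l,j}}.$$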
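The substitution step can be spelled out as follows. This is a sketch in which $S_0 = \sum_{i \in A} T_{0,i} + \sum_{j \in A^c} T_{0,j}$ is a shorthand introduced here for convenience. It uses the stationarity conditions $\nu_{1k} = \sum_{i \in A} T_{k,i} / \lambda_1$ and $\nu_{2k} = \sum_{j \in A^c} T_{k,j} / \lambda_2$, together with the fact that the conditional probabilities of each gene sum to one over $k = 0, \ldots, K$. The two constraints give

$$\lambda_1 = \frac{\sum_{i \in A} \sum_{k=1}^{K} T_{k,i}}{1 - \nu_0}, \quad \lambda_2 = \frac{\sum_{j \in A^c} \sum_{k=1}^{K} T_{k,j}}{1 - \nu_0}, \quad \lambda_1 + \lambda_2 = \frac{m - S_0}{1 - \nu_0},$$

so that

$$\nu_0 = \frac{S_0}{\lambda_1 + \lambda_2} = \frac{S_0 (1 - \nu_0)}{m - S_0} \;\Longrightarrow\; \nu_0 = \frac{S_0}{m}.$$

Substituting $\lambda_1$ and $\lambda_2$ back into the stationarity conditions for $\nu_{1k}$ and $\nu_{2k}$ then yields the two non-null updates.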

Contributor Information

Sang Mee Lee, Division of Biostatistics, School of Public Health, University of Minnesota, A460 Mayo Building MMC 303, 420 Delaware St SE, Minneapolis, MN 55455, USA.

Baolin Wu, Email: baolin@umn.edu, Division of Biostatistics, School of Public Health, University of Minnesota, A460 Mayo Building MMC 303, 420 Delaware St SE, Minneapolis, MN 55455, USA.

John H. Kersey, Masonic Cancer Center, University of Minnesota, Minneapolis, MN 55455, USA

References

1. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25(1):25–29. doi: 10.1038/75556.
2. Barry WT, Nobel AB, Wright FA. A statistical framework for testing functional categories in microarray data. Ann Appl Stat. 2008;2(1):286–315.
3. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B. 1995;57:289–300.
4. Collins K, Jacks T, Pavletich NP. The cell cycle and cancer. Proc Natl Acad Sci USA. 1997;94(7):2776–2778. doi: 10.1073/pnas.94.7.2776.
5. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B. 1977;39(1):1–38.
6. Dørum G, Snipen L, Solheim M, Saebø S. Rotation testing in gene set enrichment analysis for small direct comparison experiments. Stat Appl Genet Mol Biol. 2009;8:34. doi: 10.2202/1544-6115.1418.
7. Efron B. Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J Am Stat Assoc. 2004;99:96–104.
8. Efron B. Correlation and large-scale simultaneous significance testing. J Am Stat Assoc. 2007;102:93–103.
9. Efron B, Tibshirani R. On testing the significance of sets of genes. Ann Appl Stat. 2007;1(1):107–129.
10. Ferbeyre G, Stanchina ED, Lin AW, Querido E, McCurrach ME, Hannon GJ, Lowe SW. Oncogenic ras and p53 cooperate to induce cellular senescence. Mol Cell Biol. 2002;22(10):3497–3508. doi: 10.1128/MCB.22.10.3497-3508.2002.
11. Greenway AL, McPhee DA, Allen K, Johnstone R, Holloway G, Mills J, Azad A, Sankovich S, Lambert P. Human immunodeficiency virus type 1 nef binds to tumor suppressor p53 and protects cells against p53-mediated apoptosis. J Virol. 2002;76(6):2692–2702. doi: 10.1128/JVI.76.6.2692-2702.2002.
12. Jiang P, Du W, Wu M. p53 and bad: remote strangers become close friends. Cell Res. 2007;17(4):283–285. doi: 10.1038/cr.2007.19.
13. Khatri P, Draghici S. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics. 2005;21(18):3587–3595. doi: 10.1093/bioinformatics/bti565.
14. Kim SS, Chae HS, Bach JH, Lee MW, Kim KY, Lee WB, Jung YM, Bonventre JV, Suh YH. p53 mediates ceramide-induced apoptosis in SKN-SH cells. Oncogene. 2002;21(13):2020–2028. doi: 10.1038/sj.onc.1205037.
15. Kumar AR, Li Q, Hudson WA, Chen W, Sam T, Yao Q, Lund EA, Wu B, Kowal BJ, Kersey JH. A role for MEIS1 in MLL-fusion gene leukemia. Blood. 2009;113(8):1756–1758. doi: 10.1182/blood-2008-06-163287.
16. Levine AJ, Feng Z, Mak TW, You H, Jin S. Coordination and communication between the p53 and IGF-1-AKT-TOR signal transduction pathways. Genes Dev. 2006;20(3):267–275. doi: 10.1101/gad.1363206.
17. Lewis JM, Truong TN, Schwartz MA. Integrins regulate the apoptotic response to DNA damage through modulation of p53. Proc Natl Acad Sci USA. 2002;99(6):3627–3632. doi: 10.1073/pnas.062698499.
18. Liu H, Takeda S, Kumar R, Westergard TD, Brown EJ, Pandita TK, Cheng EH, Hsieh JJ. Phosphorylation of MLL by ATR is required for execution of mammalian s-phase checkpoint. Nature. 2010;467:343–346. doi: 10.1038/nature09350.
19. Mootha V, Lindgren C, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly M, Patterson N, Mesirov J, Golub T, Tamayo P, Spiegelman B, Lander E, Hirschhorn J, Altshuler D, Groop L. PGC-1 alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003;34:267–273. doi: 10.1038/ng1180.
20. Newton M, Quintana F, den Boon J, Sengupta S, Ahlquist P. Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis. Ann Appl Stat. 2007;1(1):85–106.
21. O’Callaghan-Sunol C, Gabai VL, Sherman MY. Hsp27 modulates p53 signaling and suppresses cellular senescence. Cancer Res. 2007;67(24):11779–11788. doi: 10.1158/0008-5472.CAN-07-2441.
22. Pavlidis P, Qin J, Arango V, Mann JJ, Sibille E. Using the gene ontology for microarray data mining: a comparison of methods and application to age effects in human prefrontal cortex. Neurochem Res. 2004;29(6):1213–1222. doi: 10.1023/b:nere.0000023608.29741.45.
23. Schwarz G. Estimating the dimension of a model. Ann Stat. 1978;6:461–464.
24. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol. 2004;3:1. doi: 10.2202/1544-6115.1027.
25. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. From the Cover: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102.
