Abstract
An increasing challenge in the analysis of microarray data is how to interpret, and gain biological insight from, profiles of thousands of genes. This article reviews statistical methods for the analysis of microarray data that incorporate prior biological knowledge using gene sets and biological pathways, which consist of groups of biologically similar genes. We first discuss issues of individual gene analysis. We compare several methods for analysis of gene sets including over-representation analysis, gene set enrichment analysis, principal component analysis, global test and kernel machine. We discuss the assumptions of these methods and their pros and cons. We illustrate these methods by application to a type II diabetes data set.
1 Introduction
Since the initial work of Schena et al.1 in 1995, microarrays have become a commonly used tool in biological and medical research due to their ability to simultaneously profile the expression of thousands of genes. Initial experiments were relatively simple with no replication, one array per condition, and only crude measurements for differential expression were used – a fold change of 1.5 indicated up-regulation while a fold change of 0.75 indicated down-regulation. As the complexity of these studies increased with their popularity, the need for more sophisticated tools became clear.
Currently, a standard microarray experiment consists of the simultaneous expression profiling of thousands of genes across various experimental conditions. Unless otherwise stated, we generally assume two conditions. Low-level analyses typically include image analysis (grid alignment, target detection, intensity extraction and local background correction), normalisation and computation of a gene expression value for each probe on the chip. Significant work has been done in this area2–6 as all further analyses are contingent on proper low-level processing.
High-level analyses typically begin by calculating a statistic (often a t-statistic) for each gene on the chip, measuring differential expression between experimental conditions. A p-value is usually generated for each gene, based on the statistic, via permutation or a parametric distribution. To account for the thousands of comparisons performed, procedures controlling the family wise error rate (FWER) or the false discovery rate (FDR)7, 8 are performed. Genes that survive the correction for multiple comparisons are then considered differentially expressed while genes that fail to meet the criterion for significance are non-differentially expressed. The list of differentially expressed genes is often the final goal for the statistician and once obtained, it is the responsibility of the biological or biomedical researcher to draw further conclusions.
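The multiple comparison step described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the Benjamini-Hochberg step-up procedure for FDR control and the Bonferroni correction for FWER control (the text does not fix a particular procedure); the function names and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def bonferroni(p_values):
    """Return Bonferroni adjusted p-values (FWER control), capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical per-gene p-values, for illustration only.
p = [0.001, 0.008, 0.039, 0.041, 0.60]
fdr_adjusted = benjamini_hochberg(p)
fwer_adjusted = bonferroni(p)
```

Genes whose adjusted p-value falls below the chosen level would then form the list of differentially expressed genes.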
These traditional approaches have yielded a wealth of information regarding gene interactions, functions and pathways. Recently, biologists have become interested in exploiting this information to facilitate and improve the analyses performed. The knowledge can be used to varying degrees,9 but at the most basic level, it is known that most biological phenomena occur through the concerted expression of multiple genes. We can thus use our prior knowledge of what genes belong to various signalling pathways or functional groups and focus our analyses on sets of related genes, called gene sets. Numerous databases containing gene groupings based on various criteria have been developed. Examples include KEGG10 and the Gene Ontology (GO) Consortium.11
Use of information derived from the GO consortium database is the most popular, so we briefly describe their database. The GO consortium contains three principal ontologies: biological processes, cellular components and molecular functions. Each ontology is a directed acyclic graph, creating a hierarchy of terms, called GO terms, that range from very broad functions, such as ‘physiological process’, down to more specific levels, such as ‘microtubule depolymerisation’. Each ontology and GO term has a comprehensive list of genes previously demonstrated to be associated with that ontology or GO term. A number of tools have been created for mining and using the data from the GO consortium.12
Using groupings from the GO consortium or other annotation databases, our analyses no longer consider individual genes, but rather groups of genes. This mode of analysis overcomes a number of drawbacks, which we will explore later, associated with traditional approaches to microarray analysis and is biologically more meaningful. The goal of this article is to review methods that test for the differential expression of gene sets defined by prior knowledge.
In the next section, we will briefly review the drawbacks associated with the traditional approach and discuss the use of prior biological knowledge as a remedy for some of the problems. Section 3 introduces a few of the many popular methods using prior biological knowledge. Section 4 presents statistical issues other authors have identified regarding these methods. In Section 5, we apply these methods to a real data set and compare their performances. Finally, in Section 6, we briefly summarise the main points of the article and discuss other practical issues regarding the use of prior biological knowledge.
2 Problems associated with traditional approaches
Advocates have suggested many different reasons for incorporating prior biological knowledge into the analysis of gene expression data.13–17 We present some of these arguments here.
In terms of biological rationale for testing gene sets, it is well known that most pathways are not driven by a single gene, but rather by a combination of multiple genes acting in a concerted fashion. Thus, individual gene analysis may miss important pathway effects since genes that demonstrate a high level of differential expression between conditions may not be as important as a group of genes that each shows only moderate differences between conditions. In particular highly differentially expressed genes tend to be ‘downstream’ genes. Many upstream proteins, such as transcription factors and other regulatory proteins, may only show very moderate changes, especially in contrast to high abundance proteins expressed at the end of the biological cascade. If attention is restricted to only the most highly differentially expressed genes, upstream effects are likely to be missed, despite the crucial role they play acting as activators and gatekeepers.
Practicality also motivates the use of prior biological knowledge. Many investigators have faced the problem that, after correction for multiple comparisons, no genes meet the threshold for statistical significance. Given that sample sizes for microarray experiments tend to be small, if the signal in the data is not strong relative to the noise, as in situations where exposures are mild, perhaps due to toxicity restrictions on patients, or where the biological response is simply weak by nature, then finding highly differentially expressed genes may be quite difficult. Multiple comparisons exacerbate the problem when high correlations exist in the data: the typical FDR and FWER controlling procedures assume that all of the hypotheses are independent, but genes are known to work together in a concerted fashion, so tests may be overly conservative. Use of the empirical null hypothesis serves as a means of correcting for correlation,18 but assumptions regarding the tail behaviour of the null distribution may not hold. Regardless, when using traditional approaches, failure to detect differentially expressed genes leads to failure in drawing conclusions.
The alternative to not detecting any differentially expressed genes is to find that even after correction for multiple comparisons, a long list of ‘differentially expressed’ genes remains. Although biological collaborators often prefer a long list of genes that meet the threshold for statistical significance, this presents a problem in terms of interpretation. Often it is difficult to draw out a specific theme or message, or to identify potential mechanisms based on a long gene list. The conclusions that are drawn also tend to be very subjective.
Further, comparisons of gene lists between different experiments have shown little overlap. Despite similar exposure conditions, experiments from different groups have shown dissimilar results when gene lists are compared.19 This presents a confusing picture of what is going on biologically since each group is presenting hypotheses based on their own lists of genes called differentially expressed.
Using prior biological knowledge immediately preempts the problem that no genes are individually differentially expressed after multiple comparisons as we are no longer interested in any individual gene’s significance. Furthermore, as we are no longer performing thousands of comparisons, but rather restricting attention to comparatively few pathways of specific interest, corrections for multiple comparisons need not be as extreme. Interpretation of results is facilitated as pathways are often the primary interest, and this provides a means by which the same conclusions will be drawn by different researchers presented with the same data, improving objectivity. Depending on the method applied, moderate changes can potentially be detected, and for a pathway shown to be differentially expressed, what genes are driving the difference can possibly be elucidated as well, identifying which genes are the upstream regulator genes. Finally, while a single gene is likely to show great variability in differential expression level from experiment to experiment, a pathway that contains many genes is less likely to demonstrate such variable behavior, giving more consistent results between experiments.16
3 Prior knowledge-based methods
Numerous methods utilising prior biological knowledge have been and are being developed. Here we review a few of the many methods, but emphasise that this is by no means a complete catalogue. We will compare these methods and discuss their assumptions and pros and cons in Section 4.
3.1 Over-representation analysis
Over-representation analysis (ORA) refers to an entire class of methods which are, by far, the most commonly used as they are the earliest and the simplest developed. These methods start from a list of genes that are called differentially expressed, D, and the list of genes in the gene set of interest, S. $D^c$ and $S^c$ represent the set of genes not differentially expressed and the set of genes not in the gene set, respectively. Based on these, the researcher looks for an over-representation of the genes in the gene set among differentially expressed genes, or equivalently, over-representation of differentially expressed genes in the gene set. Practically speaking, this is done by creating a 2 × 2 contingency table based on membership in D and membership in S. Let N be the total number of genes and, for any sets A and B, let $N_A$ denote the cardinality of A and $N_{AB}$ the cardinality of A ∩ B; we can then build Table 1. To generate a p-value for over-representation, we test for independence between membership in D and membership in S using Fisher’s exact test,20 specifically:

$$p = \sum_{k=N_{SD}}^{\min(N_S,\,N_D)} \frac{\binom{N_S}{k}\binom{N_{S^c}}{N_D-k}}{\binom{N}{N_D}}$$
Table 1.
2 × 2 ORA Contingency table based on membership in the list of differentially expressed genes (D) and the list of genes in the gene set (S)
| | Diff. expressed | Not diff. expressed | Total |
|---|---|---|---|
| In gene set | $N_{SD}$ | $N_{SD^c}$ | $N_S$ |
| Not in gene set | $N_{S^cD}$ | $N_{S^cD^c}$ | $N_{S^c}$ |
| Total | $N_D$ | $N_{D^c}$ | $N$ |
Many variations on this method have been developed.21 Differences focus on the construction of the list of differentially expressed genes and the test used once a 2 × 2 table has been constructed. Tests besides Fisher’s exact test include the chi-square test, the hypergeometric test and the binomial proportions z-test. In practice, the choice of test is unimportant. However, as will be demonstrated later, how the cutoff distinguishing differentially expressed genes from non-differentially expressed genes is constructed strongly influences whether or not a pathway is called differentially expressed. Criteria for differential expression may be based on the multiple comparisons threshold, but can be much simpler, e.g. using the 100 genes with smallest individual p-value, the top 5% most differentially expressed genes, or all genes with fold change greater than 2. For a full discussion of ORA methods, see Khatri and Draghici.21
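As a minimal sketch of the ORA calculation, the Python function below computes a one-sided Fisher exact p-value (the upper tail of the hypergeometric distribution) from the four counts that determine the 2 × 2 table. The function name, argument names and the example counts are hypothetical, and the one-sided form is an assumption about which alternative is of interest.

```python
from math import comb


def ora_p_value(n_total, n_set, n_de, n_overlap):
    """One-sided Fisher exact p-value for over-representation.

    n_total   -- N, total number of genes on the chip
    n_set     -- N_S, number of genes in the gene set
    n_de      -- N_D, number of genes called differentially expressed
    n_overlap -- N_SD, genes that are both in the set and called DE
    """
    upper = min(n_set, n_de)
    denom = comb(n_total, n_de)
    # Probability of seeing n_overlap or more set members among the DE genes
    # if DE status were unrelated to set membership.
    tail = sum(
        comb(n_set, k) * comb(n_total - n_set, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / denom


# Hypothetical counts: 10 983 genes, a 50-gene set, a top-100 cutoff,
# and 3 set members among the top 100.
p_example = ora_p_value(10983, 50, 100, 3)
```

Re-running the function with a different cutoff (for example, the top 5% of genes instead of the top 100) changes both `n_de` and `n_overlap`, which is how the choice of cutoff enters the result.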
3.2 Gene set enrichment analysis
While ORA is attractive because of its simplicity, it relies heavily on a potentially arbitrary hard cutoff. A method that remedies this is gene set enrichment analysis (GSEA). Instead of using a set cutoff, GSEA ranks all the genes on the chip based on some signed measure of differential expression from individual gene analysis and then tests the null that the genes in the gene set are uniformly distributed throughout the list of ranked genes against the alternative that the genes in the gene set tend to be closer to the top or bottom of the list. The assumption is that if a gene set is differentially expressed, then the component genes are likely to be more differentially expressed and thus clustered towards either the top or bottom of the list. This assumes that the direction of differential expression for genes in a differentially expressed gene set is the same.
The original GSEA approach was developed by Mootha et al.22. Using the same notation as before, the basic algorithm is as follows:
1. Rank the N genes on the chip based on a differential expression measurement, such as the t-statistic, to obtain L, the ranked gene list.
2. Calculate an enrichment score (ES) for the data set. For gene $G_i$ (the i-th gene in L), let:

$$X_i = \begin{cases} \sqrt{N_{S^c}/N_S} & \text{if } G_i \in S \\ -\sqrt{N_S/N_{S^c}} & \text{if } G_i \notin S \end{cases}$$

where $N_S$ is the number of genes in set S and $N_{S^c}$ is the number of genes not in set S. Define the enrichment score $ES(S) = \max_{1 \le j \le N} \sum_{i=1}^{j} X_i$.
3. To determine significance, permutation is used to generate the null distribution:
   a. Randomly permute the class labels.
   b. Re-rank the genes to generate a new ranked gene list L*.
   c. Calculate ES(S)*, the enrichment score based on L*.
   d. Repeat the above for 1000 permutations.
4. A p-value is generated by comparing our original ES(S) to the distribution of the ES(S)*.
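The steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' software: genes are ranked by a pooled-variance two-sample t-statistic, the data layout (a dict mapping gene name to a list of per-sample values) and all function names are hypothetical, and the permutation p-value uses the common (count + 1)/(permutations + 1) convention, which the text does not specify.

```python
import random
from math import sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Pooled-variance two-sample t-statistic (assumes non-zero variance)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def rank_genes(expr, labels):
    """Rank genes from most positive to most negative t-statistic."""
    stats = {}
    for gene, values in expr.items():
        a = [v for v, y in zip(values, labels) if y == 1]
        b = [v for v, y in zip(values, labels) if y == 0]
        stats[gene] = t_statistic(a, b)
    return sorted(stats, key=stats.get, reverse=True)


def enrichment_score(ranked_genes, gene_set):
    """Maximum of the running sum of X_i along the ranked list."""
    n_s = sum(1 for g in ranked_genes if g in gene_set)
    n_sc = len(ranked_genes) - n_s
    hit, miss = sqrt(n_sc / n_s), -sqrt(n_s / n_sc)
    running, best = 0.0, float("-inf")
    for g in ranked_genes:
        running += hit if g in gene_set else miss
        best = max(best, running)
    return best


def gsea_p_value(expr, labels, gene_set, n_perm=1000, seed=0):
    """Return the observed ES and a permutation p-value."""
    rng = random.Random(seed)
    observed = enrichment_score(rank_genes(expr, labels), gene_set)
    permuted = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)  # permute class labels, keep expression fixed
        if enrichment_score(rank_genes(expr, permuted), gene_set) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

Because the maximum of the running sum is used, this sketch is sensitive to enrichment at the top of the list; reversing the sign of the ranking statistic would look for enrichment at the bottom.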
The ES is essentially a modified Kolmogorov–Smirnov statistic. Several improvements have been made to the method. Sweet-Cordero et al.23 extended GSEA to multiple gene sets and multiple data sets, and Subramanian et al.13 modified the enrichment score so that each gene’s contribution is weighted by its correlation with the phenotypic outcome.
Many methods similar to GSEA have been developed. The gene set analysis (GSA) of Efron and Tibshirani24 is based on the GSEA method, but uses a ‘maxmean statistic’, M, instead of the Kolmogorov–Smirnov statistic for the enrichment score, potentially leading to greater power. If $t_i$ is the differential expression measurement (t-statistic) for the i-th gene in the gene set, then the maxmean statistic is given by:

$$M = \max\left\{ \frac{1}{N_S}\sum_{i=1}^{N_S} t_i\, I(t_i > 0),\; -\frac{1}{N_S}\sum_{i=1}^{N_S} t_i\, I(t_i < 0) \right\}$$

Note that the denominator of both averages is $N_S$, the full size of the gene set, rather than the number of positive or negative $t_i$. For evaluation of significance, GSA argues for permutation of genes in addition to the permutation of class labels. A method by Smythe25 and Tian et al.26 uses the averaged t-statistic for the enrichment score. Tian et al. also make further modifications to GSEA in the case where one wishes to compare differential expression of one gene set to differential expression of another. Other methods by Pavlidis et al.27 and Rahnenführer et al.28 are similar in flavour.
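A short Python sketch may help to show how the maxmean statistic differs from the averaged t-statistic. The function name and the example t-statistics are hypothetical; the sketch assumes the unsigned form of the statistic, in which the positive and negative parts are each averaged over the whole gene set and the larger of the two is returned.

```python
def maxmean(t_stats):
    """Maxmean statistic: larger of the mean positive part and mean negative part.

    Both parts are divided by the full gene set size, not by the number of
    positive or negative values.
    """
    n = len(t_stats)
    positive_part = sum(t for t in t_stats if t > 0) / n
    negative_part = -sum(t for t in t_stats if t < 0) / n
    return max(positive_part, negative_part)


# Hypothetical gene set in which genes move in both directions.
t_example = [2.0, 1.0, -0.5, -0.5]
m_example = maxmean(t_example)                   # driven by the up-regulated genes
mean_example = sum(t_example) / len(t_example)   # averaged t-statistic
```

In a set where some genes go up and others go down, the averaged t-statistic is pulled towards zero by cancellation, whereas the maxmean statistic keeps the dominant direction.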
3.3 Global test method using generalised linear models
The global test14 does not rely on the potentially unstable individual gene analyses. This method exploits the duality between association and prediction: if a gene set can be used to predict the clinical outcome, its expression pattern must differ for different outcomes. Let Y be the outcome of interest (continuous, or 1/0 for case/control status) and let X be the n × $N_S$ matrix of gene expression values for the gene set (where n is the number of samples), so that $x_{ij}$ is the gene expression value of the j-th gene in the i-th sample. The global test is motivated by a regression model to predict the outcome based on gene set expression:

$$g\{E(Y_i \mid \beta)\} = \alpha + \sum_{j=1}^{N_S} x_{ij}\beta_j \qquad (1)$$

where g(·) is a link function in generalised linear models,29 such as the logit link for the two group comparison, and α is an intercept. Then testing for an overall predictive effect for the gene set is equivalent to testing:

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_{N_S} = 0$$

In most cases, the number of genes in the gene set is large relative to the sample size, so an additional assumption that the β’s are iid with mean 0 and variance $\tau^2$ is made. Under this assumption, our null hypothesis is simply:

$$H_0: \tau^2 = 0$$

An alternative interpretation is to rewrite the earlier model by setting $r_i = \sum_{j} x_{ij}\beta_j$. Then, under the iid assumption on the β’s, $r = (r_1, \ldots, r_n)$, where n is the number of samples, has mean 0 and covariance $\tau^2 XX'$. We thus rewrite our model as:

$$g\{E(Y_i \mid r_i)\} = \alpha + r_i \qquad (2)$$
which corresponds to a random effects model. Assuming α is known, a score statistic for testing $H_0: \tau^2 = 0$ is given as

$$T = \frac{(Y-\mu)'R(Y-\mu) - \mu_2\,\mathrm{tr}(R)}{\left\{2\mu_2^2\,\mathrm{tr}(R^2) + (\mu_4 - 3\mu_2^2)\sum_i R_{ii}^2\right\}^{1/2}}$$

where $R = XX'/N_S$, $\mu = g^{-1}(\alpha)$, and $\mu_2$ and $\mu_4$ are the second and fourth central moments of Y under the null.14, 30 T can be approximated by:

$$Q = \frac{(Y-\mu)'R(Y-\mu)}{\mu_2}$$

which is also asymptotically normal under $H_0$. However, since the sample size is likely to be low, Goeman et al. suggest that significance be evaluated by permuting the class labels to obtain a null distribution for Q and then comparing the original statistic to the permuted distribution. Since α is never known in real situations, some adjustments are necessary to estimate μ and $\mu_2$.
A nice by-product of the global test statistic is that the i-th gene’s contribution to Q is simply:

$$Q_i = \frac{\{X_i'(Y-\mu)\}^2}{\mu_2}$$

where $X_i$ is the i-th column of X, as $Q = \frac{1}{N_S}\sum_i Q_i$. Thus, if a pathway is determined to be significantly differentially expressed, by estimating the contribution (influence score) for each gene, researchers can determine which genes are driving the difference, giving more information as to the biological mechanisms involved.
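To make the statistic and the per-gene contributions concrete, the Python sketch below computes Q for a binary outcome and evaluates it by permuting the class labels. It is a simplified illustration rather than the published implementation: μ is estimated by the sample proportion of cases and $\mu_2$ by the corresponding Bernoulli variance, which is an assumed stand-in for the adjustments mentioned above, and the function names and data layout (a list of per-sample rows) are hypothetical.

```python
import random


def global_test_q(x, y):
    """Return Q and the list of per-gene contributions.

    x -- list of n rows, each holding the N_S expression values of one sample
    y -- list of n outcomes coded 0/1
    """
    n, n_s = len(x), len(x[0])
    mu = sum(y) / n            # plug-in estimate of g^{-1}(alpha) under H0
    mu2 = mu * (1 - mu)        # Bernoulli variance under H0
    resid = [yi - mu for yi in y]
    contributions = []
    for j in range(n_s):
        # Squared covariance-type score between gene j and the outcome.
        score = sum(x[i][j] * resid[i] for i in range(n))
        contributions.append(score ** 2 / mu2)
    return sum(contributions) / n_s, contributions


def global_test_p(x, y, n_perm=1000, seed=0):
    """Permutation p-value for Q, permuting the outcome labels."""
    rng = random.Random(seed)
    observed, _ = global_test_q(x, y)
    permuted = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)
        if global_test_q(x, permuted)[0] >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```

Sorting the returned contributions from largest to smallest gives a simple ranking of the genes that drive a significant result.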
3.4 Global test method using kernel machines
The global test of Goeman et al.14 assumes in model (1) that the effects of genes within a gene set are additive. Genes within a pathway are often correlated and interact with each other. The additive assumption hence might be too strong in practice. Liu et al.17 proposed modelling the gene set effects using kernel machines, which allow joint flexible non-linear effects of genes within a pathway on a phenotype. Specifically, we replace the generalised linear model (1) by the generalised non-linear model

$$g\{E(Y_i \mid x_i)\} = \alpha + h(x_{i1}, \ldots, x_{iN_S}) \qquad (3)$$
where h(·) is a linear or non-linear smooth function and its functional form can be estimated from the data. When h(·) is linear in x’s, (3) reduces to the generalised linear model (1).
Liu et al. proposed to estimate h(·) using kernel machines. Estimation can proceed using (2), assuming the r are random effects with mean 0 and covariance $\tau^2 K$, where K is a kernel matrix whose (i, i′)-th element can be viewed as a measure of similarity between the gene profile of the i-th subject and that of the i′-th subject. If h(·) is a linear function of the x’s, then the (i, i′)-th element of K is $x_i^T x_{i'} + c$ and K reduces to XX′ if c = 0. If h(·) is a quadratic function of the x’s including two-way interactions of the x’s, the (i, i′)-th element of K is $(x_i^T x_{i'} + c)^2$. If h(·) is a smooth function of the x’s expanded by radial basis functions, the (i, i′)-th element of K is $\exp\{-(x_i - x_{i'})^T (x_i - x_{i'})/c\}$. A variance component test for $H_0: \tau^2 = 0$ under (3) corresponds to a global test for no gene set effect by allowing for non-linear effects when an appropriate kernel function is assumed.
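The three kernels can be written down directly. The Python sketch below builds the n × n kernel matrix from the rows of X; the function names and the default values of the tuning constant c are hypothetical, and the exponent of the radial basis kernel is taken to be negative so that similarity decreases with distance.

```python
from math import exp


def dot(u, v):
    """Inner product of two gene expression profiles."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v, c=0.0):
    return dot(u, v) + c


def quadratic_kernel(u, v, c=1.0):
    return (dot(u, v) + c) ** 2


def gaussian_kernel(u, v, c=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return exp(-sq_dist / c)


def kernel_matrix(x, kernel, **params):
    """n x n matrix of pairwise similarities between the rows (subjects) of x."""
    return [[kernel(xi, xj, **params) for xj in x] for xi in x]


# With the linear kernel and c = 0 the kernel matrix equals XX'.
x_example = [[1.0, 2.0], [3.0, 4.0]]
k_linear = kernel_matrix(x_example, linear_kernel, c=0.0)
```

Swapping the `kernel` argument is the only change needed to move from the additive model to a quadratic or smooth non-linear gene set effect.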
3.5 Principal components analysis
Another set of approaches that do not rely on individual gene analyses are based on the principle of dimension reduction. Although gene sets already contain far fewer genes than the total number on a chip, the dimensionality of the gene set still often exceeds the sample size. By sufficiently reducing the dimensionality of the data, standard univariate or multivariate methods can be applied. The most commonly used means of dimension reduction is principal components analysis (PCA).
PCA seeks to identify the b directions of greatest variability in the data and then project the data onto the space spanned by these directions. Mathematically, these directions are given by the eigenvectors of the sample covariance matrix ($\hat\Sigma$) corresponding to the b largest eigenvalues of $\hat\Sigma$. Suppose that the expression values for each gene have been centered by the gene’s sample mean. Let $V = [v_1, v_2, \ldots, v_p]$ and $E = \mathrm{diag}(e_1, e_2, \ldots, e_p)$ with $e_1 \ge e_2 \ge \cdots \ge e_p$, where $v_j$ is the eigenvector corresponding to the j-th largest eigenvalue, $e_j$. Then V and E can be found by the singular value decomposition of X:

$$X = UDV^T, \qquad \hat\Sigma = \frac{X^TX}{n-1} = VEV^T, \qquad E = \frac{D^TD}{n-1}$$
Projecting the data into a smaller subspace reduces the dimensions of the data while keeping the most information since the directions of greatest variability are retained.
To our knowledge, the first method of this type to apply these ideas to detection of differentially expressed gene sets is the method proposed by Tomfohr et al.,15 which we will refer to as PCAT. The idea is to reduce the gene set to its first principal component, so that we have a single ‘supergene’ or ‘metagene’. The supergene’s expression value for the i-th subject is the first component score:

$$X_{\mathrm{new},i} = v_1^T x_i$$

where $x_i$ is the vector of centered expression values for the i-th subject.
Since Xnew is now one dimensional, we can then use a standard two-sample univariate test, e.g. t-test or Wilcoxon test, to evaluate the significance of the supergene. If the supergene is found to be differentially expressed, then the entire gene set is considered to be differentially expressed.
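The supergene construction can be sketched without any linear algebra library. The Python code below finds the first principal component by power iteration on the sample covariance matrix; power iteration is substituted here for the singular value decomposition purely to keep the example self-contained, and the function name, iteration count and starting vector are assumptions. The sign of the returned loading vector is arbitrary.

```python
from math import sqrt


def first_principal_component(x, n_iter=500):
    """Return (loadings v1, supergene scores) for the rows of x.

    x -- list of n rows (subjects), each with p gene expression values.
    Assumes the starting vector is not orthogonal to the first eigenvector.
    """
    n, p = len(x), len(x[0])
    means = [sum(row[j] for row in x) / n for j in range(p)]
    xc = [[row[j] - means[j] for j in range(p)] for row in x]   # centre each gene
    cov = [[sum(r[j] * r[k] for r in xc) / (n - 1) for k in range(p)]
           for j in range(p)]
    v = [1.0 / sqrt(p)] * p
    for _ in range(n_iter):
        # Multiply by the covariance matrix and renormalise to unit length.
        w = [sum(cov[j][k] * v[k] for k in range(p)) for j in range(p)]
        norm = sqrt(sum(a * a for a in w))
        v = [a / norm for a in w]
    scores = [sum(r[j] * v[j] for j in range(p)) for r in xc]
    return v, scores
```

The returned scores play the role of the one-dimensional supergene and can be passed to any two-sample univariate test, while the loadings indicate how strongly each gene contributes to it.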
Often, the first principal component may be insufficient for summarising a gene set’s activity or it may capture variability not associated with differences resulting from clinical outcomes. For instance, it has been demonstrated in the genome wide association testing literature that the first principal component identifies variability resulting from differences in ancestry among subjects.31 Thus, a natural extension of PCAT is to consider additional higher order components and reduce the gene set to the first b principal components. This approach was first published by Kong et al.32 and we refer to it as PCAK. Instead of a single supergene, b supergenes summarise the gene set. Choices for b are briefly discussed below, but b can be at most the number of positive eigenvalues, $d = \mathrm{rank}(\hat\Sigma)$. If $V_{(b)} = [v_1, v_2, \ldots, v_b]$ then the new component scores, the expression values for the supergenes, for the i-th subject are:

$$X_{\mathrm{new},i} = V_{(b)}^T x_i$$
Since $X_{\mathrm{new}}$ is now an n × b matrix, one can use Hotelling’s $T^2$ test to evaluate significance. For completeness, the Hotelling’s $T^2$ statistic is found by:

$$T^2 = \frac{n_1 n_2}{n}(\bar{x}_1 - \bar{x}_2)^T \hat\Sigma_p^{-1} (\bar{x}_1 - \bar{x}_2)$$

where $n_j$ is the number of subjects with clinical outcome j, $n = n_1 + n_2$, $\bar{x}_j$ is the vector of mean expression values for the supergenes among subjects with clinical outcome j, and $\hat\Sigma_p = ((n_1 - 1)\hat\Sigma_1 + (n_2 - 1)\hat\Sigma_2)/(n - 2)$ is the pooled covariance matrix ($\hat\Sigma_j$ is the covariance matrix of the supergenes among subjects with outcome j). To generate a p-value, one can either permute the class labels and generate a permutation distribution for $T^2$, or alternatively, under the commonly used assumption of normality, it is known that $\frac{n-b-1}{(n-2)b}T^2 \sim F_{b,\,n-b-1}$.
A fundamental issue always present when using principal components is the choice of b, the number of components to use. Kong et al. simply use a hard threshold on the eigenvalues but admit that this may not be optimal. This problem has been studied in various applied settings by many authors.33–37 Suggested rules for choosing b are:
First component only: b = 1 as in Tomfohr et al.’s method.
Proportion of variability explained: The proportion of variability explained by the first q principal components is given by $r_q = \sum_{j=1}^{q} e_j \big/ \sum_{j=1}^{d} e_j$. Typical cutoffs range between 70% and 90%. The number of components that explain 70% of the total variability is found by $b = \min\{q : r_q > 0.70\}$.
Zhu’s Method: A commonly used method of estimating the number of components is to generate a scree plot (a barplot of the eigenvalues) and then look for an ‘elbow’ or ‘big gap’ in the graph. An elbow between the q-th and (q + 1)-th eigenvalue suggests that there is a rapid decrease in the relative importance of the components. In the past, this method tended to be subjective and not practical in many situations because it was not automated, but Zhu and Ghodsi propose a simple algorithm for identifying elbows.
Suppose we want to see if there is a gap between the q-th and (q + 1)-th eigenvalues. Let $\mathcal{S}_1 = \{e_1, e_2, \ldots, e_q\}$ and $\mathcal{S}_2 = \{e_{q+1}, e_{q+2}, \ldots, e_d\}$. If there truly is a gap at the q-th position, $\mathcal{S}_1$ and $\mathcal{S}_2$ can be considered as samples from two different distributions, $f(e; \theta_1)$ and $f(e; \theta_2)$, respectively. If we assume the two samples are independent, then the log-likelihood of our data is given by:

$$\ell(q) = \sum_{j=1}^{q} \log f(e_j; \theta_1) + \sum_{j=q+1}^{d} \log f(e_j; \theta_2)$$

For convenience, we use the normal density for f and we can obtain a profile log-likelihood by plugging in $\hat\theta_1 = [\bar{e}_1, s^2]$ and $\hat\theta_2 = [\bar{e}_2, s^2]$, where $\bar{e}_1 = \frac{1}{q}\sum_{j=1}^{q} e_j$, $\bar{e}_2 = \frac{1}{d-q}\sum_{j=q+1}^{d} e_j$ and $s^2 = \frac{(q-1)s_1^2 + (d-q-1)s_2^2}{d-2}$, with $s_1^2$ and $s_2^2$ equaling the variances of $\mathcal{S}_1$ and $\mathcal{S}_2$, respectively. b is then set to the value of q that maximises the profile likelihood. Despite the naive, but convenient, assumptions of normality and independence, empirical results suggest that the overall algorithm is still effective.
Guttman–Kaiser’s average eigenvalue rule: All eigenvalues greater in magnitude than the average of the eigenvalues are retained. The method was initially designed for PCA based on the correlation matrix. If all of the genes were independent, then the principal components would be identical to the original data and have unit variance. Thus, any eigenvalue less than 1 in magnitude carries less information than one of the original variables and is not worth keeping. Noting that 1 is the mean of the eigenvalues from the correlation matrix, we instead compare the eigenvalues from the covariance matrix to the mean.
Jolliffe’s modified average eigenvalue rule: All eigenvalues greater in magnitude than 0.7 times the average of the eigenvalues are retained. The constant 0.7 was chosen based on simulation.
Bartlett’s test: This method sequentially tests for equality of the smallest eigenvalues, working up from the bottom of the spectrum. If the last d − q eigenvalues are equal, then they contain equally little information and should be discarded. To determine b, start with q = d − 2 and test whether the last d − q eigenvalues are all equal against the alternative that there are at least two that are different. If we reject the null, then b is set to q + 1; otherwise we decrease q by 1 and test again.
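To actually test for the equality of the last d − q eigenvalues, we use the statistic:

$$n\left[(d-q)\log\left(\frac{1}{d-q}\sum_{j=q+1}^{d} e_j\right) - \sum_{j=q+1}^{d}\log e_j\right]$$

which approximately follows a $\chi^2_\nu$ distribution with ν = 0.5(d − q + 2)(d − q − 1). Replacing n with n − (2d + 11)/6 improves the approximation.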
The authors do not agree as to which rules are optimal. This may depend on the individual setting and correlation structure of the gene set.
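Several of the rules above depend only on the ordered eigenvalues and are easy to compare side by side. The Python sketch below implements the proportion-of-variability rule, the average eigenvalue rule (with a factor of 1 for Guttman–Kaiser and 0.7 for Jolliffe’s modification) and the profile likelihood rule of Zhu and Ghodsi under the normal, common-variance assumption. Function names and the example eigenvalues are hypothetical, the profile likelihood search is restricted to q = 1, …, d − 1 and assumes d ≥ 3, and Bartlett’s test is omitted because it requires chi-square quantiles.

```python
from math import log, pi


def b_proportion(eigs, cutoff=0.70):
    """Smallest q whose leading eigenvalues explain more than `cutoff`."""
    total, running = sum(eigs), 0.0
    for q, e in enumerate(eigs, start=1):
        running += e
        if running / total > cutoff:
            return q
    return len(eigs)


def b_average_rule(eigs, factor=1.0):
    """Number of eigenvalues exceeding factor * (average eigenvalue)."""
    threshold = factor * sum(eigs) / len(eigs)
    return sum(1 for e in eigs if e > threshold)


def b_profile_likelihood(eigs):
    """q maximising the two-group normal profile log-likelihood."""
    d = len(eigs)
    best_q, best_ll = 1, float("-inf")
    for q in range(1, d):
        g1, g2 = eigs[:q], eigs[q:]
        m1, m2 = sum(g1) / len(g1), sum(g2) / len(g2)
        # Pooled sum of squares around the two group means.
        ss = sum((e - m1) ** 2 for e in g1) + sum((e - m2) ** 2 for e in g2)
        if ss == 0.0:
            return q          # the two groups are separated perfectly
        s2 = ss / (d - 2)     # pooled variance estimate
        ll = -0.5 * d * log(2 * pi * s2) - ss / (2 * s2)
        if ll > best_ll:
            best_q, best_ll = q, ll
    return best_q


# Hypothetical eigenvalues, sorted from largest to smallest.
eigs_example = [10.0, 9.0, 1.0, 0.5, 0.4, 0.1]
choices = {
    "proportion (70%)": b_proportion(eigs_example),
    "Guttman-Kaiser": b_average_rule(eigs_example, 1.0),
    "Jolliffe": b_average_rule(eigs_example, 0.7),
    "profile likelihood": b_profile_likelihood(eigs_example),
}
```

Running the rules on the eigenvalues of each gene set makes it easy to see how often they disagree for a given data set.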
If a pathway is determined to be significantly differentially expressed by PCAT, genes driving the difference are identified as the genes with the greatest loadings. Tomfohr et al. refer to these as ‘activity levels’. If PCAK is used, one can find activity levels by identifying which supergenes are most differentially expressed and examining the loadings for generating those supergenes. Multiple differentially expressed supergenes may suggest differing mechanisms for differential expression.
4 Statistical issues
Numerous methods other than those described exist. In the next section we will compare the described methods on a real data set, but here we first introduce two statistical considerations that have recently been identified. This work is largely the result of Goeman and Buhlmann.38
4.1 The null hypothesis
Each of the methods seeks to test for differential expression of the gene set between experimental conditions. However, as Goeman and Buhlmann38 and Tian et al.26 point out, there are two ways of actually formulating the null hypothesis:
Formulation 1: The genes in the gene set S are at most as differentially expressed as the genes not in S.
Formulation 2: The genes in the gene set S are not differentially expressed.
Methods that use formulation 1 are the most prevalent as they include all of the currently used over-representation analysis methods and the original GSEA method. The methods described here that test formulation 2 are the global test, PCA, and some later variants of GSEA. While the two nulls seem similar, they are actually quite different. Goeman and Buhlmann call formulation 1 a competitive null hypothesis while formulation 2 is a self-contained null hypothesis. Fundamentally, a competitive null pits the genes in S against all other genes in the experiment, while a self-contained null looks only at the genes in the gene set and ignores all of the other genes on the microarray. Both Tian et al. and Goeman and Buhlmann favour self-contained tests.
Criticism of tests using formulation 1 revolves primarily around issues of power. Generally, a self-contained test will tend to have more power than a competitive test, as the self-contained null tends to imply the competitive null. Under the competitive setup, a gene set’s significance is penalised in experiments where many genes are differentially expressed as the standard for significance has been raised. Allison et al.39 and subsequently Goeman and Buhlmann describe the competitive framework as a ‘zero-sum game’. Aside from the power considerations, this creates negative correlation between p-values and is problematic as the standard false discovery rate corrections may not be valid under negative correlation. As self-contained nulls completely ignore other gene sets, this is not an issue.
Goeman and Buhlmann also note that self-contained tests are direct generalisations of single gene tests. This is a nice property as testing a gene set with a single gene is equivalent to testing the gene individually. This property does not hold for competitive tests. A related result is that a self-contained test can directly test the null that there are no differentially expressed genes on the chip, potentially serving as a quality check. A competitive null cannot treat the entire data set as a gene set as there would be no complement to compare with.
The main criticism of self-contained tests is that they may be overly powerful in settings where many genes appear to be highly differentially expressed. In particular, large gene sets may contain a few genes that appear to be differentially expressed, leading to a significant p-value for that gene set despite its biological irrelevance. In such cases, Goeman and Buhlmann suggest using self-contained tests as an initial screening and then following up with a competitive test in the second stage.
4.2 The sampling unit
Goeman and Buhlmann identify another important statistical issue that raises the most fundamental question: what is the sampling unit? Classical tests are based on experiments that have a population of subjects, and then subject sampling is performed: we consider our data a sample of subjects drawn from the population. Each subject has the same set of fixed attributes (in this case genes). In contrast, the over-representation methods described above perform gene sampling. The tests used still assume the samples are drawn from the population, but in this case the population considered is the set of genes. This reverses the classical setup: we now consider our data as a sample of genes coming from a fixed set of subjects. In the first set up, our sample size is the number of subjects (arrays) while in the latter it is actually the number of genes. GSEA, GSA, global test and PCA all sample subjects.
This has a dramatic impact on the interpretation of results obtained. Specifically, a significant p-value gives confidence that if we were to draw a new sample, a gene set would again be differentially expressed, i.e. results generalise to the population from which we sampled. Under the subject sampling scheme, this means that we are confident that the association between genes and the experimental conditions will be found for a new group of samples. In contrast, under the gene sampling scheme, a significant p-value gives confidence that for a new set of genes from the same subjects, a similar association between being in the gene set and being called ‘differentially expressed’ will be found.
The gene sampling scheme is usually not the preferred setup. Generally, experiments are performed with the intention of finding results that generalise beyond the sample of subjects. Indeed, a new experiment would likely take a new sample of subjects rather than a new sample of genes. Also, tests for both sampling schemes assume that sampling units are independent and identically distributed. Assuming that genes are independent is extremely unrealistic: the entire purpose of using prior biological knowledge is to exploit the information that genes work together.
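The two sampling schemes can be made concrete with a small simulation. The sketch below is illustrative only: the data are synthetic, and the function names and the set statistic (mean absolute t-statistic) are assumptions chosen for brevity rather than the statistics used by any method reviewed here. It computes two permutation p-values for the same gene set, one from a null built by shuffling subject labels (subject sampling) and one from a null built by drawing random gene sets of the same size (gene sampling).

```python
import numpy as np

def t_stats(X, labels):
    """Pooled-variance two-sample t-statistic per gene; X is genes x subjects."""
    a, b = X[:, labels == 1], X[:, labels == 0]
    na, nb = a.shape[1], b.shape[1]
    pooled = ((na - 1) * a.var(axis=1, ddof=1)
              + (nb - 1) * b.var(axis=1, ddof=1)) / (na + nb - 2)
    return (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(pooled * (1 / na + 1 / nb))

def subject_sampling_pvalue(X, labels, idx, n_perm, rng):
    """Null distribution: shuffle subject labels, keep the gene set fixed."""
    observed = np.abs(t_stats(X[idx], labels)).mean()
    null = np.array([np.abs(t_stats(X[idx], rng.permutation(labels))).mean()
                     for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (n_perm + 1)

def gene_sampling_pvalue(X, labels, idx, n_perm, rng):
    """Null distribution: keep subject labels fixed, draw random gene sets."""
    abs_t = np.abs(t_stats(X, labels))
    observed = abs_t[idx].mean()
    null = np.array([abs_t[rng.choice(abs_t.size, size=len(idx), replace=False)].mean()
                     for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (n_perm + 1)

# Synthetic example: 200 genes, 10 subjects per group (all values hypothetical).
rng = np.random.default_rng(0)
labels = np.array([1] * 10 + [0] * 10)
X = rng.normal(size=(200, 20))
gene_set = np.arange(10)      # the first 10 genes form the gene set
X[:10, :10] += 2.0            # shift the gene set in group 1 only

p_subject = subject_sampling_pvalue(X, labels, gene_set, 199, rng)
p_gene = gene_sampling_pvalue(X, labels, gene_set, 199, rng)
print(p_subject, p_gene)
```

The two p-values answer different questions: the first concerns a new sample of subjects, the second a new sample of genes measured on the same subjects. The gene-sampling null also treats genes as exchangeable, which is the independence assumption criticised above.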
5 Comparison on type II diabetes data
This data set consists of gene expression profiles of muscle tissue from 17 subjects with type II diabetes and 17 subjects with normal glucose tolerance (a third group with impaired glucose tolerance was omitted from our analysis). It was first analyzed in Mootha et al.22 using GSEA and is available from the Broad Institute website (http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi). Details on low-level processing are available in the original manuscript. After removing all genes with no single measure greater than 100 (genes not expressed in the data), 10 983 genes remained. To compare the performance of the described methods, to each of 133 pathways (the original 149 considered by Mootha et al. less the pathways containing 0 or 1 gene only) we applied:
ORA considering the top 100 genes as ‘differentially expressed’ and testing for association using Fisher’s exact test.
ORA considering the top 100 genes as ‘differentially expressed’ and testing for association using the χ2-test.
ORA considering the top 5% most differentially expressed genes (as determined by two-sample t-test) as ‘differentially expressed’ and testing for association using Fisher’s exact test.
ORA considering the top 5% most differentially expressed genes as ‘differentially expressed’ and testing for association using the χ2-test.
The original GSEA method, with genes ranked by two-sample t-statistics.
GSA, with genes ranked by two-sample t-statistics (software available from the authors’ website: http://www-stat.stanford.edu/tibs/GSA/index.html).
Global test (software available from Bioconductor: http://www.bioconductor.org/).
PCAT
PCAK with the number of components to use determined by taking the maximum of 3 or the number of components necessary to account for 70% of the variability. For gene sets with fewer than three genes, the number of components used was equal to the number of genes in the gene set.
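As a minimal sketch of the ORA items in the list above, the following code builds the 2×2 table of gene-set membership against ‘differentially expressed’ status and computes a one-sided Fisher’s exact p-value and a Pearson χ2 p-value. The function names, the set size of 40 and the overlap of 3 are hypothetical; only the total of 10 983 genes and the top-100 cut-off come from the text. The text does not state whether the original analysis used one- or two-sided tests or a continuity correction, so the choices made here are assumptions.

```python
from math import comb, erfc, sqrt

def fisher_upper_tail(k, set_size, n_de, n_total):
    """One-sided Fisher's exact test: P(overlap >= k) under the hypergeometric null."""
    upper = min(set_size, n_de)
    numerator = sum(comb(set_size, i) * comb(n_total - set_size, n_de - i)
                    for i in range(k, upper + 1))
    return numerator / comb(n_total, n_de)

def chi2_pvalue(k, set_size, n_de, n_total):
    """Pearson chi-square test on the same 2x2 table (1 df, no continuity correction)."""
    observed = [[k, set_size - k],
                [n_de - k, n_total - set_size - n_de + k]]
    rows = [set_size, n_total - set_size]
    cols = [n_de, n_total - n_de]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n_total
            stat += (observed[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 degree of freedom.
    return erfc(sqrt(stat / 2))

# Hypothetical pathway of 40 genes, 3 of which are among the top 100 genes.
p_fisher = fisher_upper_tail(3, 40, 100, 10983)
p_chi2 = chi2_pvalue(3, 40, 100, 10983)
print(p_fisher, p_chi2)
```

Both tests use only the counts in the 2×2 table, so the result depends entirely on where the ‘differentially expressed’ cut-off is placed.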
Results comparing the number of gene sets identified as differentially expressed, at the nominal α = 0.05 level, by each method are given on the diagonal of Table 2. Off-diagonal entries contain the number of gene sets identified by both methods. As an example, 18 pathways were identified as differentially expressed by GSEA and 9 pathways were identified as differentially expressed by both GSEA and GSA.
Table 2.
Diabetes dataset results
| | Top 100 genes (Fisher) | Top 100 genes (χ2-test) | Top 5% (Fisher) | Top 5% (χ2-test) | GSEA | GSA | Global | PCAT | PCAK |
|---|---|---|---|---|---|---|---|---|---|
| 100 (Fisher) | 7 | 7 | 7 | 7 | 2 | 5 | 2 | 2 | 2 |
| 100 (χ2-test) | | 32 | 10 | 32 | 3 | 5 | 3 | 3 | 2 |
| 5% (Fisher) | | | 60 | 60 | 8 | 17 | 4 | 3 | 14 |
| 5% (χ2-test) | | | | 82 | 9 | 17 | 4 | 4 | 14 |
| GSEA | | | | | 18 | 9 | 1 | 0 | 6 |
| GSA | | | | | | 20 | 2 | 2 | 10 |
| Global | | | | | | | 4 | 1 | 3 |
| PCAT | | | | | | | | 5 | 1 |
| PCAK | | | | | | | | | 16 |
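The layout of Table 2 follows from a simple matrix product. In the sketch below, `significant` is a hypothetical 0/1 matrix (pathways × methods) with made-up entries; it is not the diabetes result. The diagonal of the product counts the pathways called by each method and the off-diagonal entries count the pathways called by both methods.

```python
import numpy as np

methods = ["GSEA", "GSA", "PCAK"]

# Rows are pathways, columns are methods; 1 means p < 0.05 (values are made up).
significant = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 1],
    [0, 0, 0],
], dtype=int)

# Entry (i, j) counts pathways called significant by both method i and method j.
overlap = significant.T @ significant

# Keep only the upper triangle, as in the table layout.
upper = np.triu(overlap)
print(upper)
```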
Over-representation analysis is clearly very sensitive to the cut-off used. If the top 100 genes, as ranked by t-statistic, were called ‘differentially expressed’, only seven pathways were identified by Fisher’s exact test. In contrast, if the cut-off is relaxed so that the top 5% of genes are called ‘differentially expressed’, then 60 pathways are identified. When comparing Fisher’s exact test with the χ2-test, it initially appears that the χ2-test identifies many more pathways at the same cut-off. However, the pathways identified by the χ2-test but not by Fisher’s exact test all contain very few genes (fewer than five). In such situations, the χ2-test may not be appropriate. For pathways comprising more genes, results were essentially the same.
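One way to see why the χ2-test is questionable for very small gene sets is to look at the expected cell counts of the 2×2 table. The sketch below assumes the common rule of thumb that every expected count should be at least 5; this criterion is an assumption introduced here, not one stated in the analysis. The function name and the set sizes are hypothetical, and 549 is used as roughly 5% of the 10 983 genes.

```python
def min_expected_count(set_size, n_de, n_total):
    """Smallest expected cell count of the 2x2 ORA table under independence."""
    rows = (set_size, n_total - set_size)
    cols = (n_de, n_total - n_de)
    return min(r * c / n_total for r in rows for c in cols)

n_total = 10983
n_de = 549  # roughly the top 5% of genes

for set_size in (4, 50, 200):
    smallest = min_expected_count(set_size, n_de, n_total)
    # Flag tables where the chi-square approximation is doubtful.
    print(set_size, round(smallest, 2), "ok" if smallest >= 5 else "too small")
```

For a pathway of four genes the smallest expected count is far below 5 at either cut-off, so a single ‘differentially expressed’ member can produce a very large χ2 statistic.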
GSA identifies only two more pathways as differentially expressed than GSEA (20 versus 18), but the two methods agree on only nine pathways, even though GSA was developed from GSEA.
Although theoretical justifications suggest that self-contained tests are more powerful, the global test and PCAT identified only four and five pathways, respectively. The global test may not perform optimally in identifying pathways consisting of many genes with no predictive ability and only a few with predictive ability. Under the assumptions of the test, the global test should identify such pathways, but inclusion of many other genes may dilute the signal and introduce extraneous noise. A method employing variable selection may perform better.
Using PCAT assumes that the first principal component sufficiently summarises the entire pathway’s activity. Given that signal is low and noise often high in expression profiling experiments and that we have not accounted for other baseline effects such as ancestry, it is not surprising that the first principal component does not provide good separation of diabetics from normal patients. Indeed, Bair et al.40 suggest some filtering of genes is necessary in order for the first component to capture the difference of interest.
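The idea behind PCAT, as described in this paragraph, can be sketched as follows: summarise the gene set by each subject's score on the first principal component and compare the scores between groups with a two-sample t-statistic. The data are synthetic, the function names are hypothetical, and the choices of centring without scaling and of a pooled-variance t-statistic are assumptions.

```python
import numpy as np

def first_pc_scores(X):
    """X is subjects x genes for one gene set; returns each subject's PC1 score."""
    Xc = X - X.mean(axis=0)                      # centre each gene
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, 0] * s[0]                        # equals Xc @ Vt[0]

def two_sample_t(scores, labels):
    """Pooled-variance two-sample t-statistic on a single summary score."""
    a, b = scores[labels == 1], scores[labels == 0]
    pooled = ((len(a) - 1) * a.var(ddof=1)
              + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled * (1 / len(a) + 1 / len(b)))

# Synthetic gene set: 17 + 17 subjects, 12 genes, half of them shifted in group 1.
rng = np.random.default_rng(1)
labels = np.array([1] * 17 + [0] * 17)
X = rng.normal(size=(34, 12))
X[:17, :6] += 1.0

scores = first_pc_scores(X)
t = two_sample_t(scores, labels)   # the sign is arbitrary, as is the sign of a PC
print(t)
```

If the group difference does not dominate the variance of the gene set, the first component may point in an unrelated direction and the t-statistic will be small, which is the failure mode discussed above.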
PCAK identifies more pathways (16) than PCAT, suggesting that using more principal components provides a better summary of a gene set’s expression. On the other hand, of the five pathways identified as differentially expressed by PCAT, only one was identified by PCAK. Truth of the PCAK null implies truth of the PCAT null, so rejection of the PCAT null should imply rejection of the PCAK null. In practice, however, although the first component differs between groups, this difference is diluted by the additional components used, suggesting that the number of components used is not optimal. A better method for determining the number of components to test needs to be developed. Of the 16 pathways identified by PCAK, 6 were also identified by the original GSEA programme. These six were also identified as significantly differentially expressed by GSA. It is important to note that GSEA and GSA assume that all of the genes in a significantly differentially expressed pathway will be differentially expressed in the same direction, i.e. the genes will tend to all be closer to the top of the ranked gene list or all be closer to the bottom of the ranked gene list. In contrast, PCAK (as well as PCAT and the global test) does not consider directionality. This better matches biological assumptions: in a given pathway, certain genes will be turned on and others turned off in response to stimuli. Pathways identified by PCAK but not by GSEA or GSA may be pathways in which the direction of differential expression differs among genes.
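The rule used here to choose the number of components for PCAK (the larger of 3 and the number of components needed to account for 70% of the variability, capped at the number of genes for very small gene sets) can be written down directly. The sketch below is an illustration under stated assumptions: the function name is hypothetical, genes are centred but not scaled, and the data are synthetic.

```python
import numpy as np

def n_components_pcak(X, min_components=3, var_threshold=0.70):
    """X is subjects x genes for one gene set; returns the number of PCs to test."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)   # cumulative variance explained
    # First index at which the cumulative proportion reaches the threshold, plus one.
    k_var = int(np.searchsorted(explained, var_threshold)) + 1
    # Never use more components than there are genes in the set.
    return min(X.shape[1], max(min_components, k_var))

rng = np.random.default_rng(2)
two_genes = rng.normal(size=(34, 2))        # tiny gene set
isotropic = rng.normal(size=(34, 20))       # no dominant direction
dominant = rng.normal(size=(34, 10)) * 0.1
dominant[:, 0] *= 100                       # one gene carries almost all variance

print(n_components_pcak(two_genes),
      n_components_pcak(isotropic),
      n_components_pcak(dominant))
```

With a fixed threshold, gene sets without a dominant direction receive many components, which is one way the signal carried by the first component can be diluted.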
6 Discussion
When applied to the type II diabetes data, we see that using prior biological knowledge can potentially identify pathways of interest. If the traditional approach had been taken, no conclusions could have been drawn as no genes met the criteria for differential expression after controlling for the false discovery rate. The number of pathways identified by ORA tends to be somewhat unstable depending on the number of genes called ‘differentially expressed’, though if this method is used, the specific test used does not appear to make a difference as long as the sizes of the gene sets are not small. The interpretation of the ORA results is difficult, however, as they treat all genes as the sampling unit. These methods should be used very cautiously, if at all. Global test and PCAT may not perform well for gene sets that include many irrelevant genes. The enrichment scoring methods appear to function well as does PCAK. All three methods produce results that are biologically reasonable, but it is not clear which method is preferable in practice, despite the competitive nature of the enrichment scoring methods.
A major weakness shared by all of the methods described here is their dependence on the quality of the prior biological knowledge incorporated. The methods all analyse gene sets that are grouped according to some biological principle, under the assumption that all of the genes in a grouping are associated with each other in a biologically meaningful fashion. However, this assumption does not always hold: the quality of the groupings is not guaranteed. Databases such as KEGG are of good quality, as are other databases curated by humans. Databases curated by algorithms tend to contain the most inaccuracies and errors. For instance, data from the GO consortium are the most common source of gene set groupings, but they also tend to include information from weaker sources. In particular, annotations based on IEA (Inferred from Electronic Annotation) are viewed quite skeptically, but according to the GO annotation website (http://www.geneontology.org) as of May 2009, only 64 568 of 160 498 human GO annotations are from non-IEA sources. Mistakes also tend to arise from inconsistencies and ambiguity in gene names/symbols. Use of high-quality groupings is absolutely essential.
Use of prior biological knowledge can alleviate some of the problems associated with analysis of gene expression profiles. Use of these methods has led to a better understanding of the biological mechanisms underlying phenotypic responses. These methods do, however, have problems of their own: in addition to relying heavily on the quality of the information used, we have seen that methods seeking to accomplish the same task provide differing results, all of which may be reasonable. Issues regarding multiple comparisons are also of concern since gene sets may be very highly correlated, differing by only a few genes. Clearly, more research is necessary to deal with the unresolved statistical issues and the problem of inconsistent results.
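The concern about highly correlated gene sets can be checked before testing by measuring how much the sets share. The sketch below uses the Jaccard index, which is a choice made here for illustration and not a measure used in the analysis above; the set names and gene identifiers are placeholders.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two gene sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Placeholder gene sets; real sets would come from a pathway database.
gene_sets = {
    "set_A": {"g1", "g2", "g3", "g4"},
    "set_B": {"g1", "g2", "g3", "g5"},
    "set_C": {"g6", "g7"},
}

# Pairwise overlap for every pair of gene sets.
pairs = {(x, y): jaccard(gene_sets[x], gene_sets[y])
         for x, y in combinations(sorted(gene_sets), 2)}

for pair, value in pairs.items():
    print(pair, value)
```

Pairs with an index close to 1 differ by only a few genes, so their test results are strongly dependent, which matters when correcting for multiple comparisons across gene sets.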
Acknowledgments
This research is supported by a grant from the National Cancer Institute (CA–76404).
References
- 1. Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995;270(5235):467. doi: 10.1126/science.270.5235.467.
- 2. Li C, Wong WH. Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proceedings of the National Academy of Sciences. 2001;98(1):31–36. doi: 10.1073/pnas.011404098.
- 3. Tseng GC, Oh MK, Rohlin L, Liao JC, Wong WH. Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Research. 2001;29(12):2549. doi: 10.1093/nar/29.12.2549.
- 4. Bolstad B, Collin F, Simpson K, Irizarry R, Speed T. Experimental design and low-level analysis of microarray data. International Review of Neurobiology. 2004;60:25. doi: 10.1016/S0074-7742(04)60002-X.
- 5. Speed T. Statistical analysis of gene expression microarray data. CRC Press; Boca Raton, FL: 2003.
- 6. Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F, Spencer F. A model-based background adjustment for oligonucleotide expression arrays. Journal of the American Statistical Association. 2004;99(468):909–17.
- 7. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B (Methodological). 1995;57:289–300.
- 8. Storey J. A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2002;64(3):479–98.
- 9. Nacu S, Critchley-Thorne R, Lee P, Holmes S. Gene expression network analysis and applications to immunology. Bioinformatics. 2007;23(7):850–58. doi: 10.1093/bioinformatics/btm019.
- 10. Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research. 2000;28(1):27. doi: 10.1093/nar/28.1.27.
- 11. Ashburner M, Ball CA, Blake JA, et al. Gene ontology: tool for the unification of biology. Nature Genetics. 2000;25(1):25–29. doi: 10.1038/75556.
- 12. Gentleman R. Using GO for statistical analyses. In: Antoch J, editor. Proceedings in Computational Statistics COMPSTAT 2004. Physica Verlag; Heidelberg: 2004. pp. 171–180.
- 13. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences. 2005;102(43):15545–50. doi: 10.1073/pnas.0506580102.
- 14. Goeman JJ, Van De Geer SA, De Kort F, Van Houwelingen HC. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics. 2004;20(1):93–99. doi: 10.1093/bioinformatics/btg382.
- 15. Tomfohr J, Lu J, Kepler TB. Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformatics. 2005;6(1):225. doi: 10.1186/1471-2105-6-225.
- 16. Manoli T, Gretz N, Groene HJ, Marc K, Eils R, Brors B. Group testing for pathway analysis improves comparability of different microarray datasets. Bioinformatics. 2006;22(20):2500–06. doi: 10.1093/bioinformatics/btl424.
- 17. Liu D, Lin X, Ghosh D. Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics. 2007;63(4):1079–88. doi: 10.1111/j.1541-0420.2007.00799.x.
- 18. Efron B. Correlation and large-scale simultaneous significance testing. Journal of the American Statistical Association. 2007;102(477):93–103.
- 19. Fortunel N, Otu H, Ng H, et al. Comment on “‘Stemness’: transcriptional profiling of embryonic and adult stem cells” and “a stem cell molecular signature”. Science. 2003;302(5644):393. doi: 10.1126/science.1086384.
- 20. Draghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA. Global functional profiling of gene expression. Genomics. 2003;81(2):98–104. doi: 10.1016/s0888-7543(02)00021-6.
- 21. Khatri P, Draghici S. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics. 2005;21(18):3587–95. doi: 10.1093/bioinformatics/bti565.
- 22. Mootha VK, Lindgren CM, Eriksson KF, et al. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics. 2003;34(3):267–73. doi: 10.1038/ng1180.
- 23. Sweet-Cordero A, Mukherjee S, Subramanian A, et al. An oncogenic KRAS2 expression signature identified by cross-species gene-expression analysis. Nature Genetics. 2004;37:48–55. doi: 10.1038/ng1490.
- 24. Efron B, Tibshirani R. On testing the significance of sets of genes. Annals of Applied Statistics. 2007;1(1):107–29.
- 25. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology. 2004;3(1):3. doi: 10.2202/1544-6115.1027.
- 26. Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ. Discovering statistically significant pathways in expression profiling studies. Proceedings of the National Academy of Sciences. 2005;102(38):13544–49. doi: 10.1073/pnas.0506577102.
- 27. Pavlidis P, Lewis DP, Noble WS. Exploring gene expression data with class scores. Pacific Symposium on Biocomputing; New Jersey. 2002. pp. 474–85.
- 28. Rahnenfuhrer J, Domingues FS, Maydt J, Lengauer T. Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology. 2004;3(1):16. doi: 10.2202/1544-6115.1055.
- 29. McCullagh P, Nelder JA. Generalized linear models. Monographs on statistics and applied probability. Chapman and Hall; London: 1989.
- 30. Lin X. Variance component testing in generalised linear models with random effects. Biometrika. 1997;84(2):309–26.
- 31. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics. 2006;38(8):904–09. doi: 10.1038/ng1847.
- 32. Kong SW, Pu WT, Park PJ. A multivariate approach for integrating genome-wide expression data and biological knowledge. Bioinformatics. 2006;22(19):2373. doi: 10.1093/bioinformatics/btl401.
- 33. Cangelosi R, Goriely A. Component retention in principal component analysis with application to cDNA microarray data. Biology Direct. 2007;2(1):2. doi: 10.1186/1745-6150-2-2.
- 34. Peres-Neto PR, Jackson DA, Somers KM. How many principal components? Stopping rules for determining the number of non-trivial axes revisited. Computational Statistics and Data Analysis. 2005;49(4):974–97.
- 35. Valle S, Li W, Qin SJ. Selection of the number of principal components: the variance of the reconstruction error criterion with a comparison to other methods. Industrial & Engineering Chemistry Research. 1999;38(11):4389–4401.
- 36. Jolliffe I. Principal component analysis. Springer; New York: 2002.
- 37. Zhu M, Ghodsi A. Automatic dimensionality selection from the scree plot via the use of profile likelihood. Computational Statistics and Data Analysis. 2006;51(2):918–30.
- 38. Goeman JJ, Buhlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007;23(8):980. doi: 10.1093/bioinformatics/btm051.
- 39. Allison DB, Cui X, Page GP, Sabripour M. Microarray data analysis: from disarray to consolidation and consensus. Nature Reviews Genetics. 2006;7(1):55–65. doi: 10.1038/nrg1749.
- 40. Bair E, Hastie T, Paul D, Tibshirani R. Prediction by supervised principal components. Journal of the American Statistical Association. 2006;101(473):119–37.
