Bioinformation. 2007 Apr 10;1(10):436–446. doi: 10.6026/97320630001436

False discovery rate paradigms for statistical analyses of microarray gene expression data

Cheng Cheng 1,*, Stan Pounds 1
PMCID: PMC1896060  PMID: 17597936

Abstract

The microarray gene expression applications have greatly stimulated the statistical research on the massive multiple hypothesis tests problem. There is now a large body of literature in this area and basically five paradigms of massive multiple tests: control of the false discovery rate (FDR), estimation of FDR, significance threshold criteria, control of family-wise error rate (FWER) or generalized FWER (gFWER), and empirical Bayes approaches. This paper contains a technical survey of the developments of the FDR-related paradigms, emphasizing precise formulation of the problem, concepts of error measurements, and considerations in applications. The goal is not to do an exhaustive literature survey, but rather to review the current state of the field.

Keywords: multiple tests, false discovery rate, q-value, significance threshold selection, profile information criterion, microarray, gene expression

1 Background

An important component in the analysis of a microarray gene expression experiment is to identify a list of genes that are differentially expressed under a few different biological conditions (including time course) or across several cell types (normal vs. cancer, different subtypes of a cancer, etc.), or are associated with one or more particular phenotypes of interest. This is often referred to as gene expression profiling. In many studies the goal is to identify and, to some extent, validate the gene expression profile in order to elucidate the biological process [13]; in others the genes in the expression profile are used as biomarkers to build classifiers for a phenotype (e.g., treatment outcome). [4] This paper focuses on a particular statistical aspect in identifying a microarray gene expression profile – the massive multiple tests issue. Henceforth the terms “probe”, “probeset” and “gene” will be used interchangeably.

Gene expression profiling usually consists of four major steps: (1) generate and normalize expression signals; (2) test each probe for its differential expression or association with the phenotype; (3) apply proper statistical significance criteria to identify the gene expression profile, that is, a specific list of genes differentially expressed or associated with the phenotype; (4) investigate functions and pathways of the genes in the expression profile, and perform some sort of validation with wet-lab experiments, external data sets, permutation test, or cross validation. Although there are a number of statistical issues in each step, those in steps (2) and (3) are the topic of this paper.

The test of a probe for differential expression or association in step (2) is carried out by testing a statistical hypothesis properly formulated for the study. For example, gene expression profiling for comparison of normal versus a type of cancer cells would test if the mean or median expression level of each probe is the same in the two cell types (the null hypothesis) vs. the opposite (the alternative hypothesis). Gene expression profiling for association with a quantitative trait would use regression modeling appropriate for the phenotype. Because a statistical hypothesis is tested for each probe, and there are typically tens of thousands of probes, such analysis creates a problem of massive multiple hypothesis tests. It is then imperative to either control or effectively assess the levels of false positive (type-I) and false negative (type-II) errors in step (3) when statistical significance criteria are considered. Microarray gene expression applications have greatly stimulated statistical research on the massive multiple hypothesis tests problem. There is now a large body of literature in this area and basically five paradigms of massive multiple tests: control of the false discovery rate (FDR), estimation of FDR, significance threshold criteria, control of family-wise error rate (FWER) or generalized FWER (gFWER), and empirical Bayes approaches.
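The scale of the problem can be illustrated with a minimal simulation sketch. It assumes that every null hypothesis is true and that P values are independent and uniformly distributed on (0, 1); the number of probes, the nominal level, and the function name `count_nominal_hits` are illustrative choices, not taken from the paper.

```python
import random


def count_nominal_hits(m, alpha, seed=0):
    """Count P values below alpha when all m null hypotheses are true."""
    rng = random.Random(seed)
    # Assumption: under a true null with a continuous test statistic,
    # each P value is Uniform(0, 1) and the tests are independent.
    pvalues = [rng.random() for _ in range(m)]
    return sum(p < alpha for p in pvalues)


m, alpha = 10_000, 0.05
hits = count_nominal_hits(m, alpha)
print(f"{hits} of {m} true nulls have P < {alpha}; about {m * alpha:.0f} expected")
```

With no differentially expressed probe at all, an unadjusted cut-off of 0.05 still declares roughly 5% of the probes significant, which is why step (3) needs an explicit error criterion.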

The traditional approaches to controlling the FWER have proven to be too conservative in applications of microarray data analyses. Recent attention has been focused on the control of FDR. A recent non-technical review of FDR methods is available elsewhere. [5] The goal of this paper is to provide an overview of a few advancements of the FDR-based inference and related methodology under a unified set of notation and assumptions pertinent to microarray gene expression applications, so as to reflect the essence of the current state of the field. With a representation of multiple hypotheses tests as an estimation problem, this paper provides a technical survey of the FDR paradigms commonly used in microarray gene expression data analyses. Section 2 contains a brief review of FDR and related error measurements for massive multiple tests; Section 3 contains a review of FDR control procedures; Section 4 contains a review of the estimation of the proportion of true null hypotheses; Section 5 contains a review of FDR estimation methods; Section 6 contains a brief review of data-driven significance threshold criteria; Section 7 contains a brief review of sample size determination for FDR control; and some concluding remarks are made in Section 8. More application-oriented readers can read Section 8 first to get a non-technical summary of the issues and the state of the field, and then read Sections 2 through 7 to obtain more technical details.
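To make the contrast between FWER control and FDR control concrete, the sketch below compares the Bonferroni correction (a classical FWER-controlling procedure) with the Benjamini-Hochberg step-up procedure (a classical FDR-controlling procedure). The choice of these two procedures and the ten P values are assumptions made for illustration only; the procedures actually reviewed in Section 3 may differ.

```python
def bonferroni_reject(pvalues, alpha):
    """Reject H_i when p_i <= alpha / m (controls the FWER at level alpha)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg_reject(pvalues, alpha):
    """Step-up rule: find the largest rank k with p_(k) <= k * alpha / m
    and reject the hypotheses with the k smallest P values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            k = rank  # keep the largest rank that satisfies the bound
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject


# Hypothetical P values, chosen only to show the difference in stringency.
pvalues = [0.0001, 0.004, 0.012, 0.018, 0.021, 0.2, 0.4, 0.6, 0.8, 0.9]
n_bonferroni = sum(bonferroni_reject(pvalues, 0.05))
n_bh = sum(benjamini_hochberg_reject(pvalues, 0.05))
print("Bonferroni rejections:", n_bonferroni)
print("Benjamini-Hochberg rejections:", n_bh)
```

On the same data, every hypothesis rejected by Bonferroni is also rejected by the step-up rule, so the latter can only reject more; the price is that it bounds the expected proportion of false rejections rather than the probability of any false rejection.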

The PDF file is linked below

2 False discovery rate and related error measurements

The PDF file is linked below

3 FDR control

The PDF file is linked below

4 Estimation of the null proportion

The PDF file is linked below

5 FDR estimation

The PDF file is linked below

6 Data driven significance criteria

The PDF file is linked below

7 Sample size determination for FDR control

The PDF file is linked below

8 Conclusion

We have reviewed a few massive multiple hypothesis test paradigms related to FDR. This is by no means an exhaustive survey; other variations on the theme can be easily found in the literature, but the essence of the current state of the field has been well reflected.

For applications, it is not advisable to totally ignore the mathematical development of a concept. For example, an empirical q-value is often misinterpreted as a (frequentist) estimate of the FDR at the corresponding P value, when in fact it is not. A q-value is meaningful only under a specific Bayesian framework that regards the hypotheses as random; it is then the probability that the corresponding null hypothesis is true given the data (observed P values), and the empirical q-value is an estimate of this probability. In their theoretical development, Storey (2002) [7] and Storey et al. (2003) [8] did not demonstrate that the empirical q-values can be used as (frequentist) FDR control quantities, but did demonstrate that the empirical q-values are conservative estimators of the population q-values (cf. Sections 2.2 and 5.1). Additionally, there has been empirical evidence that regarding the empirical q-values as FDR estimates gives downward-biased estimators; see the simulation study in Pounds and Cheng (2004). [30]
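For readers who have not seen how an empirical q-value is computed, the following is a minimal sketch in the style of Storey (2002). It assumes a fixed tuning parameter `lam = 0.5` for the null-proportion estimate, and the P values are hypothetical; the formulation in Sections 2.2 and 5.1 may differ in detail. Per the caution above, the output should not be read as a frequentist FDR estimate.

```python
def estimate_pi0(pvalues, lam=0.5):
    """Estimate the null proportion as #{p > lam} / (m * (1 - lam)), capped at 1.
    Assumption: lam is fixed at 0.5 rather than chosen adaptively."""
    m = len(pvalues)
    n_above = sum(p > lam for p in pvalues)
    return min(1.0, n_above / (m * (1.0 - lam)))


def empirical_qvalues(pvalues, lam=0.5):
    """q(p_(i)) = min over j >= i of pi0 * m * p_(j) / j."""
    m = len(pvalues)
    pi0 = estimate_pi0(pvalues, lam)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that q-values are monotone in p.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[idx] / rank)
        qvalues[idx] = running_min
    return qvalues


# Hypothetical P values for illustration.
pvalues = [0.001, 0.008, 0.039, 0.041, 0.2, 0.35, 0.55, 0.7, 0.85, 0.95]
pi0_hat = estimate_pi0(pvalues)
qvalues = empirical_qvalues(pvalues)
print("estimated null proportion:", pi0_hat)
print("empirical q-values:", [round(q, 4) for q in qvalues])
```

The running minimum is what makes the third and fourth q-values equal in this example: a larger P value with a better rank-adjusted ratio pulls down the q-value of the smaller one.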

There are numerous methods for FDR control and FDR estimation. Thus, selecting a reasonable procedure for a specific application can be challenging. Pounds (2006) [5] notes that the choice can be simplified by a few basic application-specific considerations: whether FDR estimation or control is preferred, whether the P values are one-sided or discrete, and the correlation among P values. In studies designed with adequate power for a pre-specified FDR control level (Section 7), FDR-control procedures (Section 3) should be used because in these settings they typically offer greater power than do FDR-estimation methods. For undesigned (retrospective or observational) studies, however, it is not always clear what an appropriate FDR control level should be. In such studies, FDR-estimation methods (Sections 4 and 5) are preferred because interpreting the results of FDR-control procedures as FDR estimates can under-represent the actual prevalence of false positives. The data-driven significance threshold criteria (Section 6) can provide a rough guideline for the P value cut-off or FDR level to consider, and obtaining estimates of the FDR and pFDR at the data-driven significance threshold should be a part of the analysis. The sidedness, discreteness, and correlation of P values are important considerations to guide the selection of a method. Several methods have been shown to maintain their desirable statistical properties under mild or limited dependence among tests. Other methods have been developed to address strong or extensive dependence between the tests. Some methods implicitly assume that P values are continuously distributed in estimating π0; these methods perform very poorly when applied to discrete P values. [29] Additionally, Pounds and Cheng (2006) observed that one-sided P values may be stochastically greater than uniform under the null hypothesis, thus violating the assumption that P values are uniformly distributed under the null. [29] This violation can cause such methods to perform in unpredictable and undesirable ways. Thus, Pounds and Cheng (2006) [29] developed an FDR-estimation method specifically for applications involving P values that are discrete or one-sided.
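The effect of null P values that are stochastically greater than uniform can be seen in a small sketch. It applies a common threshold-based estimator of π0, #{p > λ} / (m(1 − λ)), to two simulated sets of null P values. The "conservative" null distribution used here (the maximum of two uniforms) is a hypothetical stand-in chosen for simplicity; it is not the setting studied by Pounds and Cheng (2006), and this is not their robust method.

```python
import random


def pi0_lambda(pvalues, lam=0.5):
    """Uncapped estimate #{p > lam} / (m * (1 - lam)) of the null proportion.
    It is calibrated on the assumption that null P values are Uniform(0, 1)."""
    m = len(pvalues)
    return sum(p > lam for p in pvalues) / (m * (1.0 - lam))


rng = random.Random(0)
m = 5000

# All hypotheses are true nulls in both sets, so the target value is 1.
uniform_nulls = [rng.random() for _ in range(m)]
# Hypothetical null P values that are stochastically greater than uniform.
conservative_nulls = [max(rng.random(), rng.random()) for _ in range(m)]

pi0_uniform = pi0_lambda(uniform_nulls)
pi0_conservative = pi0_lambda(conservative_nulls)
print(f"uniform nulls:      pi0 estimate = {pi0_uniform:.2f}")
print(f"conservative nulls: pi0 estimate = {pi0_conservative:.2f}")
```

When the uniformity assumption fails in this way, the estimate overshoots the largest admissible value of 1, and any FDR estimate built on it inherits the distortion.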

Certainly, there are still a number of open questions in the field. An important question is whether the correlation structure of microarray data satisfies the conditions required for the procedures to maintain their stated statistical properties. In our view, the answer to this question is likely to be specific to the particular application and methods under consideration. Yakovlev and colleagues [20,40] have used resampling and permutation techniques to study the performance of several FDR procedures for the analysis of a data set of gene expression in pediatric acute lymphoblastic leukemia (Yeoh et al. 2002). [41] They found that gene-gene correlations may induce a high degree of variability in the number of rejections of many FDR procedures. However, by applying similar techniques to a data set of gene expression in pediatric acute myeloid leukemia (Ross et al. 2004) [42], we observed that our robust FDR estimation method performs quite well (Pounds and Cheng 2006). [29] Consequently, we believe it would be useful to develop tests that determine whether a data set provides significant evidence of departure from the assumptions of specific methods. Such tests could be helpful for determining when computationally intensive resampling methods [19,20] are required.

Most research has focused on controlling or estimating the expected value of the ratio V/R or similar quantities. Future work should also attempt to estimate or control the variance of V/R; Owen (2005) [43] has done some initial work on this topic. As previously mentioned, some empirical studies of real microarray data sets have found that the variance in the number of rejections determined by multiple-testing procedures can be quite large. [20] This observation indicates that the interpretation of analysis results should be tempered by consideration of the variability of the FDR estimation or control procedures. Thus, it would be useful to develop procedures that also consider the variance of V/R. Additionally, incorporating variance considerations into the procedures may lead to interval estimates for the FDR. Storey (2002) [7] has mentioned that bootstrapping the P values is potentially one way to construct such an interval estimate.
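As a rough illustration of the bootstrap idea, the sketch below resamples the P values with replacement and recomputes a simple plug-in FDR estimate, π0 · m · t / R(t), at a fixed threshold t, then reports percentile limits. Everything specific here is an assumption made for illustration: the plug-in estimator, the fixed `lam = 0.5`, the percentile method, and the simulated P values. In particular, resampling P values as if they were independent ignores the gene-gene correlation discussed above, so such an interval may be too narrow for real microarray data.

```python
import random


def fdr_estimate(pvalues, threshold, lam=0.5):
    """Plug-in estimate pi0 * m * t / R(t), capped at 1."""
    m = len(pvalues)
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))
    n_rejected = max(1, sum(p <= threshold for p in pvalues))  # avoid 0 / 0
    return min(1.0, pi0 * m * threshold / n_rejected)


def bootstrap_interval(pvalues, threshold, n_boot=200, level=0.95, seed=0):
    """Percentile bootstrap limits for the plug-in FDR estimate."""
    rng = random.Random(seed)
    m = len(pvalues)
    estimates = sorted(
        fdr_estimate(rng.choices(pvalues, k=m), threshold) for _ in range(n_boot)
    )
    lo_idx = int((1.0 - level) / 2.0 * (n_boot - 1))
    hi_idx = int((1.0 + level) / 2.0 * (n_boot - 1))
    return estimates[lo_idx], estimates[hi_idx]


# Simulated data: 900 true nulls and 100 alternatives with small P values.
rng = random.Random(1)
pvalues = [rng.random() for _ in range(900)] + [rng.random() ** 4 for _ in range(100)]

point = fdr_estimate(pvalues, threshold=0.01)
lower, upper = bootstrap_interval(pvalues, threshold=0.01)
print(f"FDR estimate at t = 0.01: {point:.3f} (bootstrap limits {lower:.3f} to {upper:.3f})")
```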

It is also important to compare the performance of procedures against one another. So far, little effort has been invested in learning which methods are best suited for which settings. Considering the biological and technical complexity of microarray data, it is unlikely that the assumptions of any method will strictly hold for any application. Certainly, no method will be superior across all applications. Thus, it is important to identify which procedures are best suited for use in certain sets of conditions. This research would likely involve a lengthy series of traditional simulation studies and simulation-like studies performed by resampling, perturbing, or permuting numerous real data sets.
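A traditional simulation study of the kind described above could be sketched as follows. The sketch compares two procedures (Bonferroni and the Benjamini-Hochberg step-up rule) on the same simulated data sets and records the realized false discovery proportion V/R and the power. The independent normal test statistics, the effect size, and all parameter values are assumptions chosen for brevity; a realistic study would also vary the correlation structure, which is the main concern raised in this section.

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()


def two_sided_pvalue(z):
    """Two-sided P value for a standard normal test statistic."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def bonferroni_reject(pvalues, alpha):
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def bh_reject(pvalues, alpha):
    """Benjamini-Hochberg step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            k = rank
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject


def simulate(procedure, n_reps=20, m=1000, m1=100, effect=3.0, alpha=0.05, seed=0):
    """Return (mean V/R, mean power) over n_reps independent data sets."""
    rng = random.Random(seed)  # same seed gives every procedure the same data
    fdps, powers = [], []
    for _ in range(n_reps):
        # The first m1 statistics come from true alternatives, the rest from nulls.
        z = [rng.gauss(effect, 1.0) for _ in range(m1)]
        z += [rng.gauss(0.0, 1.0) for _ in range(m - m1)]
        reject = procedure([two_sided_pvalue(v) for v in z], alpha)
        r = sum(reject)       # total rejections R
        s = sum(reject[:m1])  # true positives S
        fdps.append((r - s) / r if r else 0.0)  # V/R, taken as 0 when R = 0
        powers.append(s / m1)
    return mean(fdps), mean(powers)


bonf_fdp, bonf_power = simulate(bonferroni_reject)
bh_fdp, bh_power = simulate(bh_reject)
print(f"Bonferroni: mean V/R = {bonf_fdp:.3f}, mean power = {bonf_power:.3f}")
print(f"BH:         mean V/R = {bh_fdp:.3f}, mean power = {bh_power:.3f}")
```

Replacing `mean` with a variance or quantile summary of the per-replicate V/R values would give a first look at the variability of the realized false discovery proportion as well.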

Supplementary Material

Data 1
97320630001436S1.pdf (20.5KB, pdf)
Data 2
97320630001436S2.pdf (147.3KB, pdf)
Data 3
97320630001436S3.pdf (123.1KB, pdf)
Data 4
97320630001436S4.pdf (139.9KB, pdf)
Data 5
97320630001436S5.pdf (132KB, pdf)
Data 6
97320630001436S6.pdf (136.2KB, pdf)
Data 7
97320630001436S7.pdf (109.4KB, pdf)

Table 1. PKS and PW matrices of Murine B-cell data.

True Hypotheses    Rejected    Not Rejected    Total
H0                 V           m0 - V          m0
HA                 S           m1 - S          m1
Total              R           m - R           m
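
In the notation of the tabulation above (V false rejections and S true rejections among R rejections in total, with m0 true null hypotheses among m tests), the error measurements named in this paper are commonly written as follows. These are the standard forms from the literature (Benjamini and Hochberg 1995 for the FDR; Storey 2002 [7] for the pFDR) and are given here as an assumption about the intended definitions; Section 2 may state them in a slightly different form.

```latex
\begin{align*}
R &= V + S, \qquad m = m_0 + m_1, \qquad \pi_0 = m_0 / m, \\
\mathrm{FWER} &= \Pr(V \ge 1), \\
\mathrm{FDR}  &= E\!\left[\frac{V}{\max(R, 1)}\right]
               = E\!\left[\frac{V}{R} \,\middle|\, R > 0\right] \Pr(R > 0), \\
\mathrm{pFDR} &= E\!\left[\frac{V}{R} \,\middle|\, R > 0\right].
\end{align*}
```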

Acknowledgments

This research is supported in part by the NIH/NIGMS Pharmacogenetics Research Network and Database (U01 GM61393, U01GM61374, http://pharmgkb.org/) from the National Institutes of Health (CC), Cancer Center Support Grant P30 CA-21765 (CC, SP), and the American Lebanese and Syrian Associated Charities (ALSAC).

Footnotes

Citation: Cheng & Pounds, Bioinformation 1(10): 436-446 (2007)

References


Articles from Bioinformation are provided here courtesy of Biomedical Informatics Publishing Group
