Author manuscript; available in PMC: 2012 Nov 18.
Published in final edited form as: Bioinformatics. 2010 Feb 25;26(8):1050–1056. doi: 10.1093/bioinformatics/btq085

Post-hoc power estimation in large-scale multiple testing problems

Sonja Zehetmayer 1, Martin Posch 1,*
PMCID: PMC3500624  EMSID: EMS33364  PMID: 20189938

Abstract

Background

The statistical power or multiple Type II error rate in large scale multiple testing problems as, for example, in gene expression microarray experiments, depends on typically unknown parameters and is therefore difficult to assess a priori. However, it has been suggested to estimate the multiple Type II error rate post-hoc, based on the observed data.

Methods

We consider a class of post-hoc estimators that are functions of the estimated proportion of true null hypotheses among all hypotheses. Numerous estimators for this proportion have been proposed and we investigate the statistical properties of the derived multiple Type II error rate estimators in an extensive simulation study.

Results

The performance of the estimators in terms of the mean squared error depends sensitively on the distributional scenario. Estimators based on empirical distributions of the null hypotheses are superior in the presence of strongly correlated test statistics.

1 INTRODUCTION

In genomic and proteomic research thousands of hypotheses are tested simultaneously. Numerous procedures have been proposed to control multiple Type I error rates, e.g. the False Discovery Rate (FDR) or the Family-wise Error Rate. The multiple Type II error rate or the statistical power, in contrast, have received less attention so far. The power of a multiple testing procedure can be defined in several ways (Senn and Bretz, 2007). For example, one can consider the probability to reject at least one alternative hypothesis or the probability to reject all alternative hypotheses. We consider an intermediate approach, the average power, defined as the arithmetic mean of the elementary power values of all alternative hypotheses. The average power can be interpreted as expectation of the proportion of rejected alternative hypotheses among all alternative hypotheses and corresponds to the so called False Negative Rate (FNR) (e.g., Delongchamp et al., 2004; Pawitan et al., 2005). The FNR is defined as the expectation of the False Negative Proportion (FNP), the proportion of retained true alternative hypotheses among all true alternative hypotheses and is related to the average power by Average Power = 1 – FNR. The FNR depends on a number of parameters as the sample size, the effect sizes, and the proportion of true null hypotheses among all hypotheses. Therefore, the FNR of an experiment is unknown a-priori and can only be guessed based on preliminary assumptions. However, if data have already been observed, the FNR can be estimated based on the observed test statistics. This concept of ’post-hoc power analysis’ has been criticized for the case of a single hypothesis test (e.g., Hoenig and Heisey, 2001) where the post-hoc power is just a 1:1 function of a single p-value. However, in experiments with a large number of hypotheses one can utilize the empirical distribution of the test statistics to estimate the FNR. 
Under suitable assumptions (e.g., if the elementary test statistics are sufficiently independent), the FNR is asymptotically equivalent to the FNP such that an estimator for the FNR can also be used to estimate the FNP. An essential element in the estimation of the FNR is the estimation of the proportion of true null hypotheses among all hypotheses, denoted by π0. Numerous estimators for π0 have been proposed in the literature and we assess their performance in the estimation of the FNR.

The FNR is a measure for the fraction of undetected alternative hypotheses and is therefore a crucial parameter to interpret negative findings in experiments where a large number of hypotheses is investigated. In addition, in large-scale multiple testing problems FNR estimates can be used to define stopping rules for sequential testing. For example, one could continue sampling until the estimated FNR falls below a pre-specified threshold. As shown in Posch et al. (2009) under suitable assumptions such sequential testing asymptotically does not inflate the FDR if the sample size is increased for all hypotheses simultaneously and only the test at the final interim analysis determines which hypotheses are rejected.

The nomenclature for the FNR is not consistent in the literature: We label the expected proportion of retained true alternatives among all true alternatives ’False Negative Rate’ according to Pawitan et al. (2005) and Norris and Kahn (2006). Delongchamp et al. (2004) label this quantity ’fraction of genes not selected’, and Craiu and Sun (2008) ’Non-Discovery Rate’. Additionally, the term FNR has also been used to denote the proportion of false negatives among all retained hypotheses (Sarkar, 2004), a quantity that has also been labeled, e.g., ’False Nondiscovery Rate’ (Genovese and Wasserman, 2002).

In Section 2.1 we introduce a family of estimators of the FNR that depend on the data only through the number of rejected null hypotheses, the critical value applied in the multiple testing procedure, and an estimate of the proportion of true null hypotheses π0. In addition, we consider an estimator that is based on estimation of local false discovery rates. In Section 2.2 we review published and implemented estimators for π0. These π0 estimators and the resulting estimators of the FNR are evaluated in a simulation study in Section 3 for a variety of scenarios including different dependency structures of test statistics across hypotheses, such as independence, weak dependence and equi-correlation. Finally, we give several real data examples in Section 4.

2 METHODS

2.1 Estimating the FNR and FNP

Consider a multiple testing procedure to test m hypotheses of which m0 (m1) are true null (alternative) hypotheses. Then the proportion of true null hypotheses among all hypotheses is π0 = m0/m. Assume that for each of the elementary hypotheses a test with p-value pi, i = 1, … , m, is defined.

Let γ denote the critical value applied to the unadjusted elementary p-values. E.g., for the Bonferroni test γ = α/m while for the Benjamini-Hochberg (BH) test γ = dα/m, where d = max{i : p(i) ≤ iα/m} and p(1) ≤ … ≤ p(m) denote the ordered p-values. We set γ = 0 if p(i) > iα/m for all i = 1, … , m.

Let R = R(γ) = #{pi ≤ γ} denote the total number of rejected hypotheses and V the number of rejected true null hypotheses. Then the FDR, defined as the expected proportion of erroneous rejections among all rejections, is given by FDR = E(V/ max{R, 1}) (Benjamini and Hochberg, 1995). Let S = R − V denote the number of rejected true alternative hypotheses. The False Negative Proportion is defined as

$$\mathrm{FNP} = 1 - \frac{S}{m_1} \tag{1}$$

and the False Negative Rate is given by

$$\mathrm{FNR} = E(\mathrm{FNP}) = 1 - \frac{E(S)}{m_1}. \tag{2}$$
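The definitions above can be illustrated with a minimal sketch. It assumes the BH critical value γ = dα/m with d the largest index i for which p(i) ≤ iα/m, and it assumes that the truth labels are known, which is only the case in simulations. The function names are hypothetical and not taken from any package.

```python
from typing import Sequence


def bh_critical_value(pvalues: Sequence[float], alpha: float) -> float:
    """Return gamma = d*alpha/m with d = max{i : p_(i) <= i*alpha/m}; 0 if no such i exists."""
    m = len(pvalues)
    d = 0
    for i, p in enumerate(sorted(pvalues), start=1):
        if p <= i * alpha / m:
            d = i  # step-up rule: keep the largest index that satisfies the bound
    return d * alpha / m


def false_negative_proportion(
    pvalues: Sequence[float], is_alternative: Sequence[bool], gamma: float
) -> float:
    """FNP = 1 - S/m1, where S counts rejected true alternatives (definition (1))."""
    m1 = sum(1 for alt in is_alternative if alt)
    if m1 == 0:
        return 0.0  # no alternatives, so no false negative decision is possible
    s = sum(1 for p, alt in zip(pvalues, is_alternative) if alt and p <= gamma)
    return 1.0 - s / m1
```

Averaging the FNP over repeated simulation runs gives a Monte Carlo approximation of the FNR in (2).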

2.1.1 A class of FNR estimators based on estimates of π0

To estimate the FNR we consider estimators of the quantities E(S) and m1. Let π̂0 denote an estimate of π0 (in Section 2.2 we discuss several such estimators). Let m̂1 = (1 − π̂0)m and estimate E(S) = E(R) − E(V) in two steps: E(R) is estimated by the observed number of rejections R, and E(V) by mF0(γ)π̂0, where F0 denotes the cumulative distribution function (c.d.f.) of the p-values corresponding to the true null hypothesis. Thus, an estimator of the FNR is given by

$$\widehat{\mathrm{FNR}} = 1 - \frac{R(\gamma) - m F_0(\gamma)\hat{\pi}_0}{m(1 - \hat{\pi}_0)}. \tag{3}$$

If the null distribution of the p-values is estimated from the data then F0 in (3) is replaced by its estimator F^0. Note that π0 = 1 implies that for all hypotheses the null hypothesis holds. Then m1 = 0 and we define FNR = 0, since in this case no false negative decision can be made. Similarly, for π̂0 = 1 we set FNR^ = 0.

If we additionally assume that the p-values are uniformly distributed under H0 then the estimator simplifies to

$$\widehat{\mathrm{FNR}} = 1 - \frac{R(\gamma) - m\gamma\hat{\pi}_0}{m(1 - \hat{\pi}_0)}. \tag{4}$$

This estimator has been considered by Posch et al. (2009), who use the Storey (2002) method to estimate π0, and Delongchamp et al. (2004), who apply a method by Hsueh et al. (2003) to estimate π0. If γ is chosen to control the FDR at level α, an alternative estimator is given by

$$\widehat{\mathrm{FNR}} = 1 - \frac{R(\gamma)(1 - \alpha)}{m(1 - \hat{\pi}_0)}, \tag{5}$$

which has been applied by Norris and Kahn (2006) and Craiu and Sun (2008) using the Storey estimator or the Smoother estimator (Storey and Tibshirani, 2003) of π0, respectively.
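As a minimal sketch, the plug-in estimators (3) to (5) can be written as two small functions. The function and argument names are hypothetical; the restriction of the result to [0,1] mirrors the convention used in the simulation study and is not part of the formulas themselves.

```python
from typing import Optional


def _clip_unit(x: float) -> float:
    """Restrict a value to the interval [0, 1]."""
    return min(1.0, max(0.0, x))


def fnr_estimate(
    num_rejected: int,
    m: int,
    gamma: float,
    pi0_hat: float,
    null_cdf_at_gamma: Optional[float] = None,
) -> float:
    """Estimator (3); if no null c.d.f. value is supplied, F0(gamma) = gamma as in (4)."""
    if pi0_hat >= 1.0:
        return 0.0  # convention: estimated global null, no false negatives possible
    f0 = gamma if null_cdf_at_gamma is None else null_cdf_at_gamma
    estimate = 1.0 - (num_rejected - m * f0 * pi0_hat) / (m * (1.0 - pi0_hat))
    return _clip_unit(estimate)


def fnr_estimate_fdr_level(num_rejected: int, m: int, alpha: float, pi0_hat: float) -> float:
    """Estimator (5), for a critical value chosen to control the FDR at level alpha."""
    if pi0_hat >= 1.0:
        return 0.0
    estimate = 1.0 - num_rejected * (1.0 - alpha) / (m * (1.0 - pi0_hat))
    return _clip_unit(estimate)
```

For example, with m = 10000, R = 500, α = 0.05 (so that the BH critical value is γ = 500 · 0.05/10000 = 0.0025) and π̂0 = 0.9, estimator (4) gives 1 − (500 − 22.5)/1000 = 0.5225 and estimator (5) gives 1 − 475/1000 = 0.525.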

A large variety of estimators for π0 have been proposed, mainly as a tool to estimate the FDR. By (3) each such estimator defines also an estimator for the FNR. Given the underlying π0 estimators are consistent as m → ∞ and the observations are sufficiently independent across hypotheses (and some additional technical assumptions), the FNR estimators defined by (3) are consistent as well; if the π0 estimator is asymptotically biased, the asymptotic bias of the resulting FNR estimator has the opposite sign (see Section 1 in the supplementary material for details). In Section 2.2 we give a review of estimators for π0 and investigate the properties of the resulting FNR estimators for finite m in Section 3.

2.1.2 FNR estimators based on estimates of local false discovery rates

An alternative estimator of the FNR can be defined using Efron’s (2007b) empirical Bayes estimator of the density of the test statistics and estimates of the so called local false discovery rate. Consider the test of two-sided hypotheses Hi : μi = 0 versus Hi′ : μi ≠ 0, i = 1, … , m. Then z-values zi are calculated by taking the standard normal quantile of the one-sided p-values. According to a Bayesian mixture model it is assumed that the z-statistics belong to one of two classes, either corresponding to the true null hypothesis or the true alternative. The prior probability that a z-statistic corresponds to the true null hypothesis is π0. Denoting the density of the z-statistics under the null (alternative) hypothesis by f0(z) (f1(z)), the mixture density f(z) of the z-statistics is given by

$$f(z) = \pi_0 f_0(z) + (1 - \pi_0) f_1(z). \tag{6}$$

The local false discovery rate for hypothesis i is defined as the posterior probability in the Bayesian mixture model that for hypothesis i the null hypothesis holds. The local false discovery rate is given by fdr(z) = π0f0(z)/f(z), i = 1, … , m, such that f1(z) = (1 − fdr(z))f(z) / ∫(1 − fdr(z))f(z)dz (Efron, 2007b). Let c1−γ denote the (1 − γ)-quantile of the standard normal distribution. Based on this representation of f1(z) we define an FNR estimator by

$$\mathrm{FNR} = \int_{c_{\gamma/2}}^{c_{1-\gamma/2}} f_1(z)\,dz = \frac{\int_{c_{\gamma/2}}^{c_{1-\gamma/2}} (1 - \mathrm{fdr}(z))\,f(z)\,dz}{\int_{-\infty}^{\infty} (1 - \mathrm{fdr}(z))\,f(z)\,dz} \tag{7}$$

replacing fdr(z) and f(z) by appropriate estimates fdr^(z) and f^(z). Since fdr(z) = π0f0(z)/f(z), the estimate of fdr depends on the estimates π̂0, f^0, and f^. As in Efron (2007b) we estimate f(z) by a natural spline and consider two estimators for f0(z) and π0. First, it can be assumed that the null density is a normal distribution with mean δ and variance σ2 and that the z-values corresponding to alternative hypotheses have support outside a known interval [−x0, x0] only. Then maximum likelihood estimates of δ, σ2, and π0 can be derived based on all observations falling into [−x0, x0]. The threshold x0 is chosen based on a heuristic algorithm. For the second option it is assumed that f0(z) follows a standard normal distribution (the ’theoretical null distribution’ which corresponds to the assumption of uniformly distributed p-values in (4)) and π0 can be estimated as above.

For the actual computation we approximate fdr^ and f^ by piecewise constant functions provided by the R-package locfdr such that the integral simplifies to a sum. We consider two FNR estimators: LocThe based on the theoretical null distribution and LocMLE based on the estimated null distribution. Note that this FNR estimator is in [0,1] by construction. If fdr^(z) = 1 for all z, this suggests that the global null hypothesis holds and we set FNR^ = 0.
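The reduction of the integrals in (7) to sums can be sketched as follows. This is an illustration only: it assumes that estimates of fdr and f are already available on an equally spaced grid of z-values (so that the bin width cancels in the ratio), and it does not reproduce the output format of the locfdr package. The function name is hypothetical.

```python
from statistics import NormalDist
from typing import Sequence


def fnr_from_local_fdr(
    z_grid: Sequence[float],
    fdr_values: Sequence[float],
    density_values: Sequence[float],
    gamma: float,
) -> float:
    """Approximate (7) by sums over an equally spaced grid of z-values."""
    if gamma <= 0.0:
        lower, upper = float("-inf"), float("inf")  # nothing is rejected
    else:
        std_normal = NormalDist()
        lower = std_normal.inv_cdf(gamma / 2.0)
        upper = std_normal.inv_cdf(1.0 - gamma / 2.0)
    numerator = 0.0
    denominator = 0.0
    for z, fdr, f in zip(z_grid, fdr_values, density_values):
        fdr = min(1.0, max(0.0, fdr))  # clip the estimated fdr to [0, 1]
        weight = (1.0 - fdr) * f       # proportional to (1 - pi0) * f1(z)
        denominator += weight
        if lower <= z <= upper:        # grid point lies in the acceptance region
            numerator += weight
    if denominator == 0.0:
        return 0.0  # fdr = 1 everywhere: treated as the global null
    return numerator / denominator
```

Because numerator and denominator are sums of the same non-negative weights, the result lies in [0,1] without any further truncation.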

Additionally, we consider two FNR estimators based on (7) where fdr(z) and f(z) are replaced by corresponding estimates from the fdrtool package. See the paragraph on the fdrtool package in Section 2.2 for details.

2.1.3 Relationship between the two types of FNR estimators

The FNR estimators based on π0 estimators are closely related to those based on local false discovery rates. Each term in (3) corresponds to a term in (7): R(γ)/m is an estimator of the cumulative distribution function of the (two-sided) p-values at γ, which equals the integral of f(z) over the rejection region, ∫_{z ∉ [c_{γ/2}, c_{1−γ/2}]} f(z)dz; the term F0(γ)π̂0 is an estimator for F0(γ)π0 = ∫_{z ∉ [c_{γ/2}, c_{1−γ/2}]} fdr(z)f(z)dz. Finally, π̂0 in the denominator of (3) is an estimator of π0 = ∫fdr(z)f(z)dz, where the integral is taken over the whole real line. Thus, the estimators based on (3) and (7) have essentially the same structure. An advantage of the estimator based on (7) is that this FNR estimator is in [0,1] by construction. Additionally, the estimated fdrs are clipped to the range [0,1] which can reduce the variability of the resulting estimates especially if the actual fdr is close to 1. Finally, while in (3) E(R(γ)) is estimated by the empirical distribution function of the p-values, in (7) more sophisticated estimators are used. However, these estimators rely on specific assumptions and may be inefficient if these assumptions are violated.
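The correspondence can be checked with a short calculation under the mixture model (6). The symbols A = [c_{γ/2}, c_{1−γ/2}] for the acceptance region and F for the c.d.f. of the two-sided p-values are introduced here for illustration only.

$$
\begin{aligned}
\int_A (1 - \mathrm{fdr}(z))\,f(z)\,dz &= \int_A f(z)\,dz - \pi_0 \int_A f_0(z)\,dz = \bigl(1 - F(\gamma)\bigr) - \pi_0\bigl(1 - F_0(\gamma)\bigr),\\
\int_{-\infty}^{\infty} (1 - \mathrm{fdr}(z))\,f(z)\,dz &= 1 - \pi_0,\\
\mathrm{FNR} &= \frac{\bigl(1 - F(\gamma)\bigr) - \pi_0\bigl(1 - F_0(\gamma)\bigr)}{1 - \pi_0} = 1 - \frac{F(\gamma) - \pi_0 F_0(\gamma)}{1 - \pi_0}.
\end{aligned}
$$

Replacing F(γ) by the empirical value R(γ)/m and π0 by π̂0 in the last expression yields estimator (3).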

2.2 Estimators for π0

In this section we give a short description of the considered π0 estimators. They were chosen based on their availability in R (R Development Core Team, 2008). As in Section 2.1.2 we assume that the distribution of the z-statistics and p-values can be modeled via a mixture model. According to Eq. (6), the mixture density f(p) of the p-values is given by

$$f(p) = \pi_0 f_0(p) + (1 - \pi_0) f_1(p). \tag{8}$$

We use the same symbols to refer to densities of z and p-values and distinguish them by the argument only. If we additionally assume that the p-values under the null hypothesis are uniformly distributed on [0, 1], then

$$f(p) = \pi_0 + (1 - \pi_0) f_1(p). \tag{9}$$

2.2.1 Estimators based on p-value density estimation

In this section we consider estimators of π0 that are based on estimates of the density f(p). Under the mixture model (9), π̂0 = min_p f(p) gives an estimate of π0 which is positively biased if min_p f1(p) > 0. If f1 takes its minimum at p = 1, the estimate for π0 simplifies to f(1). The estimators for π0 considered in the following are based on different estimators f^ of the mixture density f.

Beta Uniform model (Bum)

Pounds and Morris (2003) fit the mixture of a uniform and a beta distribution f(x) = λ + (1 − λ)a x^(a−1) to the observed p-values. Since the maximum likelihood estimates (MLE) λ̂ and â (obtained by numerical optimization) appear to have a high variability, λ̂ is not a useful estimate for π0. However, the density estimate f^(x) = λ̂ + (1 − λ̂)â x^(â−1) appears to be less variable such that Pounds and Morris propose to estimate π0 by π̂0 = f^(1). R-package oompa at http://bioinformatics.mdanderson.org/Software/OOMPA.

Convest

Langaas et al. (2005) use a nonparametric maximum likelihood estimator f^ of the p-value density. It is based on the mixture model (9) and the assumption that f1 is decreasing and convex. π0 is then estimated by π̂0 = f^(1). R-function convest in Bioconductor package limma.

Poisson regression approach (Pre)

Broberg (2005) constructs a histogram of the p-values and models the number of hits in each subinterval by an inhomogeneous Poisson process. The expected number of hits in each subinterval is then fitted by a polynomial of low degree in the midpoints of the subintervals based on Poisson regression. This leads to an estimate for f, and π0 is estimated by π̂0 = min_p f^(p). R-function p0.mom in Bioconductor package SAGx.

Spacing LOESS histogram (Splosh)

Pounds and Cheng (2004) estimate the density f by smoothing the slope of the empirical distribution function of the p-values with local regression and set π̂0 = min_p f^(p). Splosh.R provided at http://www.stjuderesearch.org/depts/biostats/splosh.html.

Empirical Bayes (EfronThe, EfronMLE)

As described in Section 2.1.2, Efron (2007b) proposes to estimate π0 based on a mixture model of the z-transformed p-values. We consider two such estimators for π0: EfronThe, derived under the assumption of the theoretical null distribution, and EfronMLE, where the null distribution is also estimated from the data. For a reliable estimation of the null distribution Efron suggests using the method only if π0 > 0.9. R-package locfdr.

Fdrtool (Pfndr, Zfndr, Tfndr, Ppct0, Zpct0, Tpct0)

Strimmer (2008) applies a truncated maximum likelihood approach to estimate π0 from the empirical distribution of either p-values, z-scores, or t-statistics. If p-values are provided, the theoretical null distribution (i.e., uniform distribution on [0,1]) is assumed. For z-statistics f0 is assumed to follow a N(0, σ0²) distribution and σ0 and π0 are estimated with the truncated maximum likelihood approach. Similarly, for t-statistics f0 is assumed to follow a t-distribution and the corresponding degrees of freedom and π0 are estimated. As in Efron’s approach, an interval [−x0, x0] is chosen which is assumed to contain only test statistics from true null hypotheses. Two different methods to determine x0 are proposed: The FNDR method is based on minimizing the estimated false nondiscovery rate (FNDR) which is defined as the expected proportion of retained true alternative hypotheses among all retained hypotheses. For the second method x0 is chosen such that a predefined fraction (default: 0.75) of the observed test statistics lies in the interval [−x0, x0]. Then π̂0 is obtained via MLE of the truncated data. We denote the π0 estimators applying the minimization of the FNDR by Pfndr, Zfndr, and Tfndr (based on either p-values, z-statistics or t-statistics, respectively). Likewise, Ppct0, Zpct0, and Tpct0 denote the estimators based on the prefixed truncation fraction of 0.75. As outlined in Section 2.1.2 we considered also two FNR estimators, LocZfndr and LocZpct0, which are based on fdr estimators from the fdrtool package: Both fdr estimators are based on z-statistics. For LocZfndr, the interval [−x0, x0] is chosen based on FNDR estimates as described above. For LocZpct0, the central 75% of z-values are used to estimate the null distribution. R-package fdrtool.

EbayesThresh (ETLapl, ETCau)

Johnstone and Silverman (2004) propose an empirical Bayes approach to estimate π0 for sparse data. They assume that the observations for each hypothesis i are drawn independently from a normal distribution with mean μi and variance 1. The prior distributions of the μi are mixtures of a probability mass of π0 at zero with a given symmetric heavy-tailed distribution. The mixing weight π0 is estimated based on the marginal maximum likelihood. We considered the π0 estimates resulting from a Laplace (ETLapl) and quasi-Cauchy prior (ETCau). R-package EbayesThresh.

2.2.2 Estimation based on the c.d.f. of the p-values

The proportion of true null hypotheses can be estimated via

$$\hat{\pi}_0(\lambda) = \frac{\#\{p_i > \lambda\}}{m(1 - \lambda)}, \tag{10}$$

where λ is a tuning parameter. This approach goes back to Schweder and Spjotvoll (1982), who gave a graphical motivation. The numerator gives the observed number of p-values larger than λ, the denominator is the expected number of p-values larger than λ, given that for all hypotheses the null hypothesis holds. Assuming that no p-value larger than λ corresponds to an alternative hypothesis, (10) is an unbiased estimate. There is a trade-off between bias (which decreases for λ → 1) and variance (which increases for λ → 1). Several choices for λ have been considered.
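A minimal sketch of estimator (10), assuming p-values that are uniformly distributed under the null hypothesis; the function name is hypothetical and the default λ = 0.5 is only one of the choices discussed in this section.

```python
from typing import Sequence


def pi0_lambda(pvalues: Sequence[float], lam: float = 0.5) -> float:
    """Estimator (10): observed count of p-values above lam over its expectation under the global null."""
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    m = len(pvalues)
    observed = sum(1 for p in pvalues if p > lam)
    return observed / (m * (1.0 - lam))
```

The raw estimate is not restricted to [0,1]; it can exceed 1 when more p-values than expected fall above λ, which is why a truncation is often applied afterwards.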

Storey

Storey (2002) proposes λ = 0.5.

Bootstrap

Storey (2002) suggests to choose λ such that the bootstrap estimate of the mean squared error of π̂0(λ) is minimized. R-function pval.estimate.eta0 in package fdrtool.

Smoother

Storey and Tibshirani (2003) fit a natural cubic spline y with 3 degrees of freedom to π̂0(λ) and set π̂0 = y(1). R-function pval.estimate.eta0 in package fdrtool.

Lowest Slope (LSL)

The lowest slope estimator (Benjamini and Hochberg, 2000) is based on the slope of the c.d.f. of the p-values F^(p). Let Si = (1 − p(i))/(m + 1 − i) denote the slope of the line from (p(i), F^(p(i))) to the point (1, 1). Here p(i) denote the ordered p-values. Let i* denote the smallest i such that Si < Si−1 and define π̂0 = min[1, 1/m + 1/(Si* m)]. R-function pval.estimate.eta0 in package fdrtool.
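The lowest slope rule can be sketched as follows. This is an illustration of the definition given above, not the fdrtool implementation; returning 1 when the slopes never decrease (or when the selected slope is zero) is an assumption made here to keep the function total.

```python
from typing import Sequence


def pi0_lowest_slope(pvalues: Sequence[float]) -> float:
    """Lowest slope estimate: stop at the first index where the slope S_i decreases."""
    m = len(pvalues)
    ordered = sorted(pvalues)
    # S_i = (1 - p_(i)) / (m + 1 - i) for i = 1, ..., m
    slopes = [(1.0 - p) / (m + 1 - i) for i, p in enumerate(ordered, start=1)]
    for k in range(1, m):  # 0-based position k corresponds to i = k + 1
        if slopes[k] < slopes[k - 1]:
            if slopes[k] <= 0.0:
                return 1.0  # assumed fallback: slope of zero (p-value equal to 1)
            return min(1.0, 1.0 / m + 1.0 / (slopes[k] * m))
    return 1.0  # assumed fallback: slopes never decrease
```

For the ten p-values 0.001, 0.002, 0.003, 0.004, 0.3, 0.5, 0.6, 0.8, 0.9, 0.95 the slopes increase up to i = 4 and first decrease at i = 5, where S5 = 0.7/6, so that π̂0 = 1/10 + 6/7 ≈ 0.957.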

Howmany

Meinshausen and Rice (2006) suggest an upper confidence bound π̂α for π0 such that P(π̂α ≥ π0) ≥ 1 − α. The bound can be written as

$$\hat{\pi}_\alpha = \inf_{t \in (0,1)} \left\{ \hat{\pi}_0(t) + \beta_\alpha \sqrt{t/(1 - t)} \right\}, \tag{11}$$

where π̂0(λ) is defined in (10), βα = a−1[E−1(1 − α) + b], E is the c.d.f. of the Gumbel distribution, a = √(2m log log m), and b = 2 log log m + 0.5 log log log m − 0.5 log 4π (where π denotes the circle constant). It is shown that under suitable conditions this bound is a consistent estimator of π0 as m → ∞. For finite m the estimator can be improved if the infimum in (11) is taken over (ε, 1 − ε) for some small ε of the order of 1/m. For the simulation study in Section 3 we set α = 0.5 such that the estimate (11) is median unbiased if the confidence bound has exact coverage probability. R-package howmany.

2.2.3 Other Estimators

Jin

As for the EfronMLE estimator, the null distribution is estimated from the data. Under the true null hypothesis, the z-scores are assumed to be N(μ0, σ0²) distributed. The parameters μ0, σ0 as well as π0 are estimated based on the empirical characteristic function of the z-transformed p-values (Jin and Cai, 2007). Under suitable conditions, if 1 − π0 is asymptotically larger than 1/√m, it is shown that the resulting estimate is consistent for m → ∞. R-function provided by the authors: http://www.stat.cmu.edu/~jiashun/Research/software/NullandProp

Location based estimator (Lbe)

Based on the mixture model (9), Dalmasso et al. (2005) construct a family of estimators

$$\hat{\pi}_0(\varphi) = \frac{\frac{1}{m}\sum_{i=1}^{m} \varphi(p_i)}{E_0(\varphi(P))}, \tag{12}$$

where φ is a real valued function and E0(φ(P)) denotes the expectation over the p-value distribution under the null hypothesis. Dalmasso et al. (2005) give conditions for φ that lead to a non-negatively biased estimate of π0 and propose the choice φ(p) = −log(1 − p). Note that the estimator (10) can be written in the form (12) with φ(p) = I(p > λ), where I(·) denotes the indicator function. Bioconductor package LBE.
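The family (12) can be sketched with a generic function that takes φ and its null expectation as arguments. The names are hypothetical. For φ(p) = −log(1 − p) the null expectation under uniform p-values is ∫ −log(1 − p) dp = 1 over [0, 1], and for φ(p) = I(p > λ) it is 1 − λ, which recovers (10).

```python
import math
from typing import Callable, Sequence


def pi0_from_phi(
    pvalues: Sequence[float],
    phi: Callable[[float], float],
    null_expectation: float,
) -> float:
    """Estimator (12): mean of phi(p_i) divided by the expectation of phi(P) under the null."""
    m = len(pvalues)
    return sum(phi(p) for p in pvalues) / m / null_expectation


def pi0_location_based(pvalues: Sequence[float]) -> float:
    """Choice phi(p) = -log(1 - p); requires all p-values to be strictly below 1."""
    return pi0_from_phi(pvalues, lambda p: -math.log(1.0 - p), 1.0)


def pi0_indicator(pvalues: Sequence[float], lam: float) -> float:
    """Choice phi(p) = I(p > lam), which reproduces estimator (10)."""
    return pi0_from_phi(pvalues, lambda p: 1.0 if p > lam else 0.0, 1.0 - lam)
```

Passing φ as a function keeps the two special cases on one code path, so that the equivalence with (10) is visible directly.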

Moment generating function (Mgf)

Based on the mixture model (9), Broberg (2005) constructs an estimator for π0 using the estimated moment generating function of the p-value distribution. The moment generating function is represented as a weighted sum of the moment generating function of the uniform distribution and the unknown p-value distribution under the alternative. The latter is estimated by a recursive algorithm. R-function p0.mom in the Bioconductor package SAGx.

2.3 Comparison of FNR estimators

By (3) each estimate of π0 leads to an estimate of the FNR. Either the theoretical null distribution F0 can be used in the estimator or an estimated null distribution F^0. We investigate the FNR estimators defined by (3) resulting from different estimators of π0 in a simulation study and report the square root of the mean squared error (RMSE) and the bias for each estimator under a wide range of scenarios. Additionally, we include the estimators LocThe, LocMLE, LocZfndr, and LocZpct0 based on local false discovery rates defined in Section 2.1.2. Note that the π0 estimators corresponding to the LocThe, LocMLE, LocZfndr, and LocZpct0 estimators are the same as for the EfronThe, EfronMLE, Zfndr, and Zpct0 estimators, respectively, and are not reported separately in the tables and figures.

For the simulation we consider the test of m null hypotheses Hi : μi = 0 for the mean of normally distributed observations with mean μi and variance σi² against the alternatives Hi′ : μi ≠ 0, i = 1, … , m, with two-sided one-sample t-tests. The critical value is determined by the BH step-up procedure at level α = 0.05 applied to elementary p-values based on the central t-distribution. Note that the BH procedure has been shown to control the FDR also under a wide range of dependency structures (Benjamini and Yekutieli, 2001). For π̂0 estimators involving tuning parameters the default values recommended by the respective authors are used except for the howmany estimator where we set α = 0.5 and for the estimators based on Efron’s locfdr procedure (EfronThe, EfronMLE, LocThe, LocMLE) where the value of the degrees of freedom for fitting the estimated density f(z) is set to 14. Additionally, we restrict the π0 and FNR estimates to the interval [0,1].

The simulations are performed for six reference scenarios. Here we assume that m = 10000, n = 20 and consider three different proportions of true null hypotheses: π0 ∈ {0.9, 0.95, 0.99}. For the alternative hypotheses we assume that the data are N(δj, 1) distributed where the δj are alternating −Δ, −3Δ/4, −Δ/2, −Δ/4 and Δ, 3Δ/4, Δ/2, Δ/4 for the (1 − π0)m alternatives. For the reference scenarios we consider Δ ∈ {1, 2}. The actual FNRs in these scenarios are given in Table 1. For the scenarios with correlated test statistics the FNRs are practically identical.

Table 1.

The actual FNRs under different scenarios applying the BH test at level α = 0.05 and assuming independent test statistics

π0 = 0.9 π0 = 0.95 π0 = 0.99
Δ = 1 0.67 0.74 0.88
Δ = 2 0.24 0.27 0.34

To investigate the impact of correlated test statistics on the properties of the FNR estimates we performed the simulations for independent test statistics, equi-correlated test statistics (as in Benjamini et al., 2006) and a block correlation structure (Storey et al., 2004). For the latter we assume that the test statistics are correlated in blocks of 10 hypotheses. Within one block, the correlation between the test statistics of hypotheses Hj and Hi is ρ, if i, j ≤ 5 or i, j > 5, and −ρ if i ≤ 5 and j > 5 or vice versa. For each block ρ is drawn from a uniform distribution on [0, 1].
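The block structure can be sketched as follows. The correlation matrix follows the description above; the sampling function is one possible construction (a single shared factor with opposite signs in the two halves of a block) and is an assumption made here, since the text does not state how the correlated values were generated. All names are hypothetical.

```python
import math
import random
from typing import List


def block_correlation_matrix(rho: float, block_size: int = 10) -> List[List[float]]:
    """Correlation rho within each half of a block and -rho between the two halves."""
    half = block_size // 2
    matrix = []
    for i in range(block_size):
        row = []
        for j in range(block_size):
            if i == j:
                row.append(1.0)
            elif (i < half) == (j < half):
                row.append(rho)
            else:
                row.append(-rho)
        matrix.append(row)
    return matrix


def sample_block(rho: float, rng: random.Random, block_size: int = 10) -> List[float]:
    """Draw one block of unit-variance normal values with the correlation structure above."""
    half = block_size // 2
    shared = rng.gauss(0.0, 1.0)  # common factor; its sign is flipped in the second half
    values = []
    for i in range(block_size):
        sign = 1.0 if i < half else -1.0
        noise = rng.gauss(0.0, 1.0)
        values.append(sign * math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * noise)
    return values


def sample_block_with_random_rho(rng: random.Random, block_size: int = 10) -> List[float]:
    """Draw rho uniformly on [0, 1] for the block, then draw the block."""
    return sample_block(rng.random(), rng, block_size)
```

Each value has variance ρ + (1 − ρ) = 1, and two values share the covariance ρ or −ρ depending on whether they lie in the same half of the block.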

3 RESULTS

3.1 Comparison of the RMSE of the FNR estimators

Table 2 shows the maximum RMSE and maximum bias of FNR^ for π0 = 0.9, 0.99 for independent and equi-correlated test statistics. The maximum is taken over the alternatives Δ in {1, 2}. Tables for all reference scenarios (RMSE and bias for π̂0 and FNR^) can be found in the supplementary material. In all considered scenarios π̂0 has much lower RMSE than FNR^. None of the FNR estimators has a uniformly lowest RMSE. Under independence, the Convest, LocThe, and Pfndr estimators have the lowest maximal RMSE across the considered scenarios. For the scenario with block correlation we observe somewhat larger RMSE than in the independent case (see supplementary material). For the equi-correlated case (with ρ = 0.5) all estimators with the exception of the LocMLE and EfronMLE estimators have distinctly larger RMSE and bias compared to the independent case. The Jin estimator also shows low RMSE in scenarios with equi-correlation but, as discussed below, its RMSE appears to depend very sensitively on the underlying parameters. Figure 1 shows the RMSE of FNR^ of the Convest, Pfndr, LocThe, LocMLE, and EfronMLE estimators when varying one of the parameters m, π0, Δ, or ρ in the reference scenario m = 10000, π0 = 0.9, ρ = 0, Δ = 2, n = 20. The figures for the remaining FNR estimators as well as for the π0 estimators are given in the supplementary material.

Table 2.

The maximum RMSE (maximum bias) of the considered FNR^ estimators for independent or equi-correlated (ρ = 0.5) test statistics for π0 = 0.9, 0.99 and m = 10000. The maximum is taken over the alternatives Δ in {1, 2}. The bias of the scenario with the largest absolute bias is reported. 5000 simulation runs per scenario

Independence Equi-correlation
π0 = 0.9 π0 = 0.99 π0 = 0.9 π0 = 0.99
Storey 0.10 (−0.08) 0.42 (−0.21) 0.52 (−0.30) 0.71 (−0.52)
LSL 0.55 (−0.54) 0.44 (−0.40) 0.53 (−0.50) 0.45 (−0.35)
Bootstrap 0.11 ( 0.05) 0.32 ( 0.21) 0.48 (−0.25) 0.70 (−0.51)
Smoother 0.16 (−0.09) 0.56 (−0.33) 0.51 (−0.27) 0.70 (−0.51)
Pfndr 0.09 (−0.08) 0.20 (−0.05) 0.51 (−0.29) 0.71 (−0.53)
Mgf 0.11 (−0.10) 0.33 (−0.15) 0.52 (−0.32) 0.71 (−0.53)
Pre 0.28 ( 0.27) 0.29 ( 0.22) 0.37 ( 0.18) 0.51 ( 0.36)
Lbe 0.31 (−0.16) 0.59 (−0.36) 0.51 (−0.28) 0.71 (−0.50)
Ppct0 0.12 (−0.11) 0.30 (−0.14) 0.52 (−0.39) 0.72 (−0.55)
Tpct0 0.22 (−0.12) 0.59 (−0.38) 0.53 (−0.42) 0.72 (−0.56)
Zfndr 0.61 (−0.60) 0.22 (−0.15) 0.39 (−0.18) 0.46 ( 0.18)
Convest 0.07 (−0.05) 0.20 ( 0.11) 0.37 (−0.11) 0.70 (−0.52)
Splosh 0.22 (−0.12) 0.47 ( 0.45) 0.55 ( 0.51) 0.65 ( 0.65)
Bum 0.14 ( 0.14) 0.22 ( 0.22) 0.25 ( 0.13) 0.66 (−0.44)
EfronMLE 0.24 (−0.22) 0.51 (−0.31) 0.15 (−0.08) 0.31 ( 0.12)
Jin 0.17 (−0.16) 0.24 (−0.12) 0.19 (−0.05) 0.33 ( 0.28)
Howmany 0.14 (−0.13) 0.25 (−0.18) 0.37 (−0.15) 0.48 (−0.17)
Zpct0 0.28 (−0.24) 0.53 (−0.30) 0.49 (−0.38) 0.74 (−0.59)
Tfndr 0.67 (−0.67) 0.33 (−0.22) 0.50 (−0.27) 0.66 (−0.47)
EfronThe 0.09 (−0.08) 0.43 (−0.22) 0.67 (−0.67) 0.88 (−0.88)
LocThe 0.08 (−0.07) 0.21 ( 0.06) 0.33 (−0.20) 0.41 (−0.16)
LocMLE 0.22 (−0.21) 0.24 ( 0.03) 0.14 (−0.08) 0.28 ( 0.22)
ETLapl 0.37 ( 0.37) 0.43 ( 0.43) 0.38 (0.27) 0.46 ( 0.24)
ETCau 0.39 ( 0.39) 0.41 ( 0.41) 0.39 (0.29) 0.42 ( 0.24)
LocZfndr 0.56 (−0.56) 0.22 (−0.16) 0.39 (−0.18) 0.47 ( 0.24)
LocZpct0 0.27 (−0.23) 0.57 (−0.35) 0.49 (−0.39) 0.72 (−0.56)

Fig. 1.


RMSE of FNR^ estimators. In the scenario with m = 10000, π0 = 0.9, Δ = 2, ρ = 0 in each of the graphs one of the parameters is varied. In the graph with varying Δ only values Δ > 0 are plotted (Δ = 0 corresponds to the case π0 = 1). The graph with varying ρ refers to the case of equi-correlated test statistics with correlation coefficient ρ. The dotted line shows the true FNR. 500 simulation runs for each value on the x-axis.

Because of the factor 1/(1 − π̂0) in the FNR estimators defined by (3), the RMSE is large for π0 close to one even though the RMSE of π̂0 decreases with π0. If π0 = 1, with a large probability π̂0 = 1 and the FNR is correctly estimated as 0. However, if the estimate π̂0 is less than 1, the FNR estimate is typically 1 (since under the global null hypothesis the BH test guarantees that with probability 1 − α no hypothesis is rejected such that R(γ) = 0). Thus, in this setting, the distribution of FNR^ is concentrated on 0 and 1 which is reflected in a large RMSE. For π0 << 1 the estimators LocMLE, EfronMLE based on estimated null distributions have a larger RMSE than the estimators based on theoretical null distributions.
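The effect of the factor 1/(1 − π̂0) can be made explicit with a short calculation based on (4), before any restriction to [0,1]. The abbreviation r = R(γ)/m is introduced here for illustration only.

$$
\widehat{\mathrm{FNR}} = 1 - \frac{r - \gamma\hat{\pi}_0}{1 - \hat{\pi}_0},
\qquad
\frac{\partial\,\widehat{\mathrm{FNR}}}{\partial \hat{\pi}_0} = -\frac{r - \gamma}{(1 - \hat{\pi}_0)^2}.
$$

For fixed r and γ, an error in π̂0 is therefore scaled by 1/(1 − π̂0)², which equals 100 at π̂0 = 0.9 and 10000 at π̂0 = 0.99. The negative sign (for r > γ) also shows why an upward bias in π̂0 translates into a downward bias in the FNR estimate.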

Increasing the effect sizes of the alternative hypotheses does not lead to a consistent decrease of the RMSE of the FNR estimators. In contrast, for all FNR estimators in Figure 1 the RMSE reaches a maximum for Δ ~ 1. For increasing m, the RMSE of most estimators slightly decreases. For increasing ρ, the (absolute) bias and RMSE for both π̂0 and FNR^ increase, with the exception of the RMSEs of EfronMLE and LocMLE which are nearly constant in ρ.

In several of the plots the Jin estimate shows an erratic behavior (see supplementary material). The estimator involves a tuning parameter γ and appears to depend quite sensitively on its value. In the simulation study we used γ = 0.1 as recommended in Jin and Cai (2007). However, for some scenarios, choosing a slightly different value has a large impact on the RMSE (data not shown). In the simulation study we applied the BH procedure which does not rely on a π0 estimate but is strictly conservative if π0 < 1. This choice guarantees that the actual FNR is the same regardless of the π0 estimator considered for the FNR estimation and allows for a better comparability of the investigated methods. Additional simulations for the Storey, Pfndr, and Convest methods, where the FDR was controlled based on the same π0 estimator as used in the FNR estimation, gave very similar RMSEs (data not shown).

Normalization

It is well known that normalization can reduce the correlation in microarray data sets. We performed further simulations for the five estimators considered above where the observations were standardized per chip such that they have mean 0 and variance 1. This standardization had hardly any impact in the independent and block correlated case. In the equi-correlated case, though, the correlation is practically removed by standardization and the resulting FNR estimators have similar RMSEs as in the independent case. Note, however, that standardization may introduce a bias if a larger fraction of genes is differentially expressed and the effect of over- and underexpressed genes does not cancel out. To demonstrate that standardization cannot remove the impact of correlation in general we performed additional simulations for the Convest, Pfndr, LocThe, LocMLE, and EfronMLE estimators under a block correlation structure with larger block size. When increasing the block size to 100, standardization has practically no impact on the RMSE. However, the maximal RMSE (across Δ = 1, 2) increases compared to the scenarios with block size 10: For π0 = 0.9 to 0.18, 0.19, 0.16 (for Convest, Pfndr, LocThe) and to 0.3, 0.36 (for LocMLE, EfronMLE); for π0 = 0.99 to 0.38, 0.38, 0.34 (for Convest, Pfndr, LocThe) and to 0.39, 0.57 (for LocMLE, EfronMLE). Note that in contrast to the equi-correlated case, for block correlations with large blocks the estimators based on the empirical null distribution show no advantage.

3.2 Assessing Correlations

The simulation study shows that the RMSE of the estimators depends on the strength of dependence of the observations across genes. While the results under block correlation and independence match closely, strong correlation (as in the equi-correlation scenario) leads to much larger RMSEs for all estimators based on theoretical null distributions. To assess the correlation between the observed expression levels of different genes one can investigate the distribution of pair-wise correlation coefficients for all pairs of genes (see, e.g., Owen, 2005; Efron, 2007a). As in Efron (2007a) we apply Fisher’s z-transformation to the correlation coefficients and standardize by the approximate theoretical standard deviation 1/√(n − 3) such that the theoretical distribution of the transformed correlation coefficients is N(0, 1) given that the true correlation is zero for all pairs of genes. The observed mean μ̂ and standard deviation σ̂ of the transformed correlation coefficients can then be compared to the theoretical values μ = 0 and σ = 1 under the assumption of independence.
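The diagnostic described above can be sketched as follows. This is an illustrative sketch only, assuming the data of one group are stored as a list of genes, each with n observations, and that the standardized value is √(n − 3) · artanh(r) for each pairwise Pearson correlation r. The function names and the simulated independent data are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def standardized_fisher_z(group):
    """Fisher z-transform of all pairwise gene correlations, scaled by sqrt(n - 3)."""
    n = len(group[0])                 # observations per gene in this group
    scale = math.sqrt(n - 3)          # inverse of the approximate standard deviation
    m = len(group)
    return [scale * math.atanh(pearson(group[i], group[k]))
            for i in range(m) for k in range(i + 1, m)]


# Hypothetical example: 200 independent genes, n = 20 observations each.
rng = random.Random(1)
group = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(200)]

z = standardized_fisher_z(group)
mu_hat = sum(z) / len(z)
sigma_hat = math.sqrt(sum((v - mu_hat) ** 2 for v in z) / (len(z) - 1))
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")  # compare with 0 and 1
```

For independent data the two summary values should lie close to 0 and 1. Clear departures of the mean or the standard deviation point to dependence between genes.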

We performed a simulation study (1000 runs) to investigate how reliably the mean and variance of the pairwise correlation coefficients can be estimated. Assuming a per-group sample size of n = 20 we estimated the mean and standard deviation of pairwise correlation coefficients under different correlation structures. For computational feasibility we consider only the pairwise correlations between the first 5000 genes in each data set. The mean correlation coefficients showed very low variability under independence and block correlation (with block sizes up to 250), where 95% of the mean z-transformed correlation coefficients were in (−0.0008, 0.0006). Under equi-correlation (with ρ = 0.2, 0.5 corresponding to z-transformed values of 0.84 and 2.26) the variation was somewhat larger and 95% of the z-transformed average correlation coefficients were in (0.45, 1.34) or (1.46, 3.25), respectively. This indicates that the order of magnitude of the mean correlation can be well estimated in the considered scenarios. Similarly, under independence the observed standard deviation of the z-transformed correlations was close to the nominal value 1. However, under equi-correlation (ρ ∈ {0.2, 0.5}) the estimated standard deviation was somewhat lower (95% of the estimates in (0.94, 0.99) or (0.83, 0.92)). Under block correlation for block sizes of 10, 20, 100, and 250, 95% of the estimated standard deviations lie in (1.01, 1.014), (1.02, 1.03), (1.08, 1.21) and (1.14, 1.59), respectively.
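The z-transformed values quoted for ρ = 0.2 and ρ = 0.5 can be checked by hand, assuming Fisher's z-transformation is standardized by the approximate standard deviation 1/√(n − 3) with n = 20:

```latex
\[
  z(\rho) \;=\; \sqrt{n-3}\,\operatorname{artanh}(\rho)
          \;=\; \frac{\sqrt{n-3}}{2}\,\ln\frac{1+\rho}{1-\rho},
  \qquad n = 20 \;\Rightarrow\; \sqrt{n-3} = \sqrt{17} \approx 4.123
\]
\[
  z(0.2) \approx 4.123 \times 0.2027 \approx 0.84,
  \qquad
  z(0.5) \approx 4.123 \times 0.5493 \approx 2.26
\]
```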

Standardizing the observations per chip as described in the previous section centers the distribution of z-values around zero (Efron, 2009). Consequently, the estimated mean correlation coefficients for the standardized data are close to zero under independence and under equi-correlation. For the estimated standard deviations the impact of standardization is more intricate. While for equi-correlated data the standard deviation becomes practically 1 with very low variation, for block correlation standardization has only a marginal impact on the standard deviation.

4 REAL DATA EXAMPLES

We estimate the FNR of two gene expression microarray experiments that were reanalyzed in Pavlidis et al. (2003). Based on a data set by Gruvberger et al. (2001) we compare gene expression measurements of breast cancer in patients with positive and negative estrogen receptor status by two-sided two-sample t-tests (m = 3389, n = 28 per group). In the second example, based on Huang et al. (2001), we compare gene expression measurements of papillary thyroid carcinoma tissue with those of normal thyroid tissue (m = 12558, n = 8 per group). Because of outliers and skewed distributions of expression values the latter comparison was performed with Wilcoxon tests. As in the simulation study, the BH method was used to control the FDR at 5%. The considered estimators give rather divergent π0 and FNR estimates for the Gruvberger data (Table 3). However, they can be classified into two groups: the estimators based on the assumption of a uniform null distribution of the p-values (Convest, LocThe, Pfndr) and the estimators based on estimated null distributions (EfronMLE, LocMLE). Within each group the results are similar.

Table 3. Estimates of π0 and the FNR for the microarray data sets from Gruvberger et al. (2001) with m = 3389, n = 28/28, and Huang et al. (2001) with m = 12558, n = 8/8, for non-standardized and standardized data.

Column groups, each with π̂0 and FNR̂: Gruvberger non-standardized (R = 163); Gruvberger standardized (R = 296); Huang non-standardized (R = 50); Huang standardized (R = 110)
Estimator π̂0 FNR̂ π̂0 FNR̂ π̂0 FNR̂ π̂0 FNR̂
Convest .77 .80 .63 .77 .93 .95 .94 .85
LocThe .78 .73 .98 .92
Pfndr .77 .80 .65 .76 .93 .95 .93 .88
LocMLE .35 .14 .91 .87
EfronMLE .96 .27 .96 .12 .95 .93 .94 .86
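The BH step-up procedure used to control the FDR at 5% in both examples can be sketched in a few lines. This is a generic illustration of the procedure of Benjamini and Hochberg (1995), not the authors' R code; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans marking the hypotheses rejected by the BH step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p-value
    n_rejected = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up: keep the largest rank whose p-value is below its threshold.
        if pvalues[idx] <= alpha * rank / m:
            n_rejected = rank
    rejected = [False] * m
    for idx in order[:n_rejected]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for ten hypotheses.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(pvals, alpha=0.05)
print(sum(decisions))  # number of rejections R
```

The number of rejections returned by such a routine plays the role of R in Table 3.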

The differences between the estimates from the two classes of estimators may be due to either a large fraction of alternative hypotheses or correlation between the test statistics of different hypotheses. As described in Section 3.2 we investigated the distribution of pair-wise correlation coefficients between pairs of genes and computed pairwise Pearson (Spearman) correlations for the Gruvberger (Huang) data set. For the Gruvberger data set the distribution of the correlation coefficients indicates much stronger dependence than for the Huang data set (Gruvberger: group 1 (2) μ̂ = 1.32 (1.44), σ̂ = 1.5 (1.73); Huang: group 1 (2) μ̂ = 0.014 (0.02), σ̂ = 1.08 (1.06)). After chip-wise standardization (see Section 3), the average of the pair-wise correlations in the Gruvberger data set is close to zero but the standard deviation is still larger than the nominal value 1 (group 1 (2) μ̂ = 0.002 (0.007), σ̂ = 1.38 (1.52)). This indicates a strong dependence of observations, of the order of magnitude observed under block correlation with block size 250 (see Section 3.2). Thus, in this data set standardization cannot remove the correlations. For the Huang data set the standardization is performed by subtracting the chip-wise median and dividing by the interquartile range. Since there is little correlation observed in the raw data, standardizing hardly changes the distribution of correlation coefficients (group 1 (2) μ̂ = 0.05 (0.04), σ̂ = 1.08 (1.05)). Note that in both examples a larger number of hypotheses can be rejected after standardization and the FNR estimates become smaller (see Table 3).

The example shows that the choice of the null distribution may be crucial for the estimation of the FNR. However, this also holds for hypothesis testing: for example, if the rejection threshold γ in the Gruvberger data set is chosen such that the FDR estimate from the locfdr package is 0.05, one can reject 111 hypotheses assuming the theoretical null distribution but only 7 when choosing the empirical null distribution.

5 CONCLUSIONS

We investigated a family of FNR estimators which is based on the estimated proportion of true null hypotheses π0 as well as estimators based on local false discovery rate estimates. For the former one can show that, given the observations are sufficiently independent across hypotheses, the FNR estimates are consistent as m → ∞ if the underlying π0 estimate is consistent. This holds, e.g., for the Storey (assuming that the tuning parameter λ approaches 1), Lbe, Howmany, and Jin estimators. However, for finite m the simulation studies show that the estimation error in estimating the FNR is considerably larger than for the estimators of π0. A reliable estimation of the FNR is difficult, especially if the number of alternative hypotheses is small. In these settings the FNP is still highly variable and the FNR estimators are unreliable.
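As an illustration of a π0 estimator with a tuning parameter λ, the following sketch implements the commonly used form of the Storey (2002) estimator, #{p > λ} / (m (1 − λ)). It is a sketch under stated assumptions rather than the exact variant used in the simulation study: the cap at 1, the default λ = 0.5, the function name and the simulated p-values are choices made here for illustration.

```python
import random


def storey_pi0(pvalues, lam=0.5):
    """Storey-type estimate of the proportion of true null hypotheses.

    Counts p-values above lam, where mostly null p-values are expected,
    and rescales by the width (1 - lam). The cap at 1 is an assumption.
    """
    m = len(pvalues)
    tail = sum(1 for p in pvalues if p > lam)
    return min(1.0, tail / (m * (1.0 - lam)))


# Hypothetical mixture: 9000 uniform null p-values and 1000 alternative
# p-values concentrated near zero (u ** 8 is an arbitrary illustrative choice).
rng = random.Random(2)
pvals = [rng.random() for _ in range(9000)] + [rng.random() ** 8 for _ in range(1000)]

pi0_hat = storey_pi0(pvals, lam=0.5)
print(f"pi0_hat = {pi0_hat:.3f}")  # true proportion in this simulation is 0.9
```

With a fixed λ, alternative p-values that exceed λ are counted as well, so the estimate tends to lie somewhat above the true π0. This contribution shrinks as λ moves towards 1, at the price of a larger variance.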

Since the proposed FNR estimators are based on univariate p-values, they can be applied to a wide range of statistical tests including multi-stage procedures for which group-sequential p-values can be defined (Victor and Hommel, 2007; Zehetmayer et al., 2005, 2008).

The Convest, LocThe, and Pfndr estimators that are based on theoretical null distributions showed the most favourable characteristics for independent test statistics in the considered scenarios but have a large RMSE under strong dependence. In contrast, the EfronMLE and LocMLE estimators based on estimated null distributions are more robust in the equi-correlated scenarios but show no advantage in the block correlated scenarios or if the proportion of alternative hypotheses is large. The latter comes from the fact that the estimation of the null distribution is based on the assumption that for most hypotheses the null hypothesis holds.

Supplementary Material

ACKNOWLEDGEMENT

We thank Peter Bauer and the two referees for many helpful suggestions.

Funding: This work was supported by the Austrian Science Fund FWF (grant numbers P18698-N15 and T 401-B12).

Footnotes

Availability: R-code (R Development Core Team, 2008) to compute all considered estimators based on p-values and supplementary material are available from the authors' web page http://statistics.msi.meduniwien.ac.at/index.php?page=pageszfnr

REFERENCES

  1. Benjamini Y, et al. Adaptive linear step-up procedures that control the false discovery rate. Biometrika. 2006;93:491–507.
  2. Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Statist. Soc. B. 1995;57:289–300.
  3. Benjamini Y, Hochberg Y. On the adaptive control of the false discovery rate in multiple testing with independent statistics. J. Educ. Behav. Stat. 2000;25:60–83.
  4. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 2001;29:1165–1188.
  5. Broberg P. A comparative review of estimates of the proportion unchanged genes and the false discovery rate. BMC Bioinformatics. 2005;6:199–219. doi: 10.1186/1471-2105-6-199.
  6. Craiu R, Sun L. Choosing the lesser evil: trade-off between false discovery rate and non-discovery rate. Stat. Sinica. 2008;18:861–879.
  7. Dalmasso C, et al. A simple procedure for estimating the false discovery rate. Bioinformatics. 2005;21:660–668. doi: 10.1093/bioinformatics/bti063.
  8. Delongchamp RR, et al. Multiple-testing strategy for analyzing cDNA array data on gene expression. Biometrics. 2004;60:774–782. doi: 10.1111/j.0006-341X.2004.00228.x.
  9. Efron B. Correlation and large-scale simultaneous significance testing. JASA. 2007a;102:93–103.
  10. Efron B. Size, power and false discovery rates. Ann. Stat. 2007b;35:1351–1377.
  11. Efron B. Correlated z-values and the accuracy of large-scale statistical estimates. Stanford University; 2009. Working Paper.
  12. Genovese C, Wasserman L. Operating characteristics and extensions of the false discovery rate procedure. J. R. Statist. Soc. B. 2002;64:499–517.
  13. Gruvberger S, et al. Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Res. 2001;61:5979–5984.
  14. Hoenig J, Heisey D. The abuse of power: The pervasive fallacy of power calculations for data analysis. Am. Stat. 2001;55:19–24.
  15. Hsueh H, et al. Comparison of methods for estimating the number of true hypotheses in multiplicity testing. J. Biopharm. Stat. 2003;13:675–689. doi: 10.1081/BIP-120024202.
  16. Huang Y, et al. Gene expression in papillary thyroid carcinoma reveals highly consistent profiles. Proc. Nat. Acad. Sci. 2001;98:15044–15049. doi: 10.1073/pnas.251547398.
  17. Jin J, Cai T. Estimating the null and the proportion of nonnull effects in large-scale multiple comparisons. JASA. 2007;102:495–506.
  18. Johnstone I, Silverman B. Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. Ann. Stat. 2004;32:1594–1649.
  19. Langaas M, et al. Estimating the proportion of true null hypotheses, with application to DNA microarray data. J. R. Statist. Soc. B. 2005;67:555–572.
  20. Meinshausen N, Rice J. Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. Ann. Stat. 2006;34:373–393.
  21. Norris AW, Kahn CR. Analysis of gene expression in pathophysiological states: Balancing false discovery and false negative rates. Proc. Nat. Acad. Sci. 2006;103:649–653. doi: 10.1073/pnas.0510115103.
  22. Owen AB. Variance of the number of false discoveries. J. R. Statist. Soc. B. 2005;67:411–426.
  23. Pavlidis P, et al. The effect of replication on gene expression microarray experiments. Bioinformatics. 2003;19:1620–1627. doi: 10.1093/bioinformatics/btg227.
  24. Pawitan Y, et al. False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics. 2005;21:3017–3024. doi: 10.1093/bioinformatics/bti448.
  25. Posch M, et al. Hunting for significance with the false discovery rate. JASA. 2009;104:836–840.
  26. Pounds S, Cheng C. Improving false discovery rate estimation. Bioinformatics. 2004;20:1737–1745. doi: 10.1093/bioinformatics/bth160.
  27. Pounds S, Morris S. Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics. 2003;19:1236–1242. doi: 10.1093/bioinformatics/btg148.
  28. R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing; Vienna, Austria: 2008. http://www.R-project.org.
  29. Sarkar SK. FDR-controlling stepwise procedures and their false negative rates. J. Stat. Plan. Infer. 2004;125:119–137.
  30. Schweder T, Spjotvoll E. Plots of p-values to evaluate many tests simultaneously. Biometrika. 1982;69:493–502.
  31. Senn S, Bretz F. Power and sample size when multiple endpoints are considered. Pharm. Stat. 2007;6:161–170. doi: 10.1002/pst.301.
  32. Storey J, Tibshirani R. Statistical significance for genomewide studies. Proc. Nat. Acad. Sci. 2003;100:9440–9445. doi: 10.1073/pnas.1530509100.
  33. Storey JD. A direct approach to false discovery rates. J. R. Statist. Soc. B. 2002;64:479–498.
  34. Storey JD, et al. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J. R. Statist. Soc. B. 2004;66:187–205.
  35. Strimmer K. A unified approach to false discovery rate estimation. BMC Bioinformatics. 2008;9:303–317. doi: 10.1186/1471-2105-9-303.
  36. Victor A, Hommel G. Combining adaptive designs with control of the false discovery rate - a generalized definition for a global p-value. Biometrical J. 2007;49:94–106. doi: 10.1002/bimj.200510311.
  37. Zehetmayer S, et al. Two-stage designs for experiments with a large number of hypotheses. Bioinformatics. 2005;21:3771–3777. doi: 10.1093/bioinformatics/bti604.
  38. Zehetmayer S, et al. Optimized multi-stage designs controlling the false discovery or the family wise error rate. Stat. Med. 2008;27:4145–4160. doi: 10.1002/sim.3300.
