Author manuscript; available in PMC: 2014 Apr 12.
Published in final edited form as: Biometrika. 2013 Mar 26;100(2):495–502. doi: 10.1093/biomet/ast001

The optimal power puzzle: scrutiny of the monotone likelihood ratio assumption in multiple testing

Hongyuan Cao 1, Wenguang Sun 2, Michael R Kosorok 3
PMCID: PMC3984571  NIHMSID: NIHMS438491  PMID: 24733954

Summary

In single hypothesis testing, power is a non-decreasing function of type I error rate; hence it is desirable to test at the nominal level exactly to achieve optimal power. The puzzle lies in the fact that for multiple testing, under the false discovery rate paradigm, such a monotonic relationship may not hold. In particular, exact false discovery rate control may lead to a less powerful testing procedure if a test statistic fails to fulfil the monotone likelihood ratio condition. In this article, we identify different scenarios wherein the condition fails and give caveats for conducting multiple testing in practical settings.

Some key words: False discovery rate, heteroscedasticity, monotone likelihood ratio, multiple testing dependence

1. Introduction

We study an important assumption that has been implicitly used in the multiple testing literature. In the context of false discovery rate analysis (Benjamini & Hochberg, 1995), we show that the assumption can be violated in many important settings. The goal of this article is to explicitly state the assumption to bridge the gap in conventional methodological development, rigorously investigate the legitimacy of the assumption in various settings, and give caveats for conducting multiple testing in practice.

To identify this assumption, it is helpful to first closely examine the framework of single hypothesis testing. Suppose we want to test H0 versus H1 based on the observed value of a continuous random variable X. A binary decision rule δ ∈ {0, 1} divides the sample space S into two regions S = S0 ∪ S1: δ = 0 when X ∈ S0 and δ = 1 when X ∈ S1. Let T(·) be a function of X, with small values indicating evidence against H0. The critical region S1 can be expressed as S1 = {x ∈ S : T(x) < t}. Correspondingly we have a testing rule δ = I{T(X) < t}, where I(·) is an indicator function and t is the rejection threshold. Denote by F0 and F1 the conditional distributions of X under H0 and H1, and by G0 and G1 the conditional distributions of T(X) under H0 and H1. The Type I and Type II error rates of δ are α(t) = prH0{T(X) < t} = G0(t) and β(t) = prH1{T(X) > t} = 1 − G1(t), respectively. Since α(t) increases in t and β(t) decreases in t, we conclude that β(t) decreases in α(t). Therefore the optimal choice of t*, which minimizes β(t) subject to α(t) ≤ α0, should satisfy α(t*) = α0. In other words, we wish to test H0 at the nominal level exactly in order to minimize the type II error rate.

Now suppose we want to test m hypotheses H1, …, Hm simultaneously based on a random vector X = (X1, …, Xm). Let θ1, …, θm be independent and identically distributed Bernoulli (p) random variables, where θi = 0 if Hi is a null and θi = 1 otherwise. Assume

$$X_i \sim (1-\theta_i)F_0 + \theta_i F_1, \tag{1}$$

where F0 and F1 are the null and non-null distributions, respectively. Let Ti(·) be a function of X for testing Hi. The solution to a multiple testing problem can be represented by a vector of binary decisions δ = (δ1, …, δm) ∈ {0, 1}^m, where δi = 1 if we reject Hi and δi = 0 otherwise. As an example, consider a testing rule which rejects Hi when Pi < t, where Pi is the p-value. Then Ti(X) = Pi and we can write δi = I(Pi < t). Denote the conditional distribution of Pi under the alternative by G1. The mixture distribution of Pi is G(t) = (1 − p)t + pG1(t), where p is the proportion of non-null hypotheses. The false discovery rate is the expected proportion of false positives among all rejections. Let x ∨ y = max(x, y). Genovese & Wasserman (2002) showed that the false discovery rate, as a function of the p-value threshold t, is

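$$\mathrm{FDR}(t) = E\left\{\frac{\sum_{i=1}^{m}(1-\theta_i)\delta_i}{\left(\sum_{i=1}^{m}\delta_i\right)\vee 1}\right\} = \frac{(1-p)t}{(1-p)t + pG_1(t)} + O(m^{-1/2}). \tag{2}$$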

The false non-discovery rate, missed discovery rate, and average power can be used to describe the power of a false discovery rate procedure:

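$$\begin{aligned}
\mathrm{FNR}(t) &= E\left[\frac{\sum_{i=1}^{m}(1-\delta_i)\theta_i}{\left\{\sum_{i=1}^{m}(1-\delta_i)\right\}\vee 1}\right] = \frac{p\{1-G_1(t)\}}{p\{1-G_1(t)\} + (1-p)(1-t)} + O(m^{-1/2}), \\
\mathrm{MDR}(t) &= E\left\{\frac{\sum_{i=1}^{m}\theta_i(1-\delta_i)}{\left(\sum_{i=1}^{m}\theta_i\right)\vee 1}\right\} = 1 - G_1(t) + O(m^{-1/2}),
\end{aligned} \tag{3}$$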

AP(t) = 1 − MDR(t). Similar to the situation in single hypothesis testing, it is often assumed in the multiple testing literature that

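$$\mathrm{FDR}(t) \text{ increases in } t \text{ and } \mathrm{FNR}(t) \text{ decreases in } t; \text{ therefore } \mathrm{FNR}(t) \text{ decreases in } \mathrm{FDR}(t). \tag{4}$$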

Consequently, to achieve the optimal power, we should control the false discovery rate at the nominal level α exactly. That is, the optimal p-value cutoff t* should solve the equation

$$\frac{(1-p)t}{(1-p)t + pG_1(t)} = \alpha. \tag{5}$$

In Genovese & Wasserman (2002), the testing rule δi = I(Pi < t*) is referred to as the oracle false discovery rate procedure. In the literature, considerable effort has been devoted to the development of data-driven methods aiming to mimic the oracle for precise false discovery rate control (Benjamini & Hochberg, 2000; Genovese & Wasserman, 2004; Benjamini et al., 2006). The tacit assumption is that the closer a test gets to the upper bound α, the more powerful the test is. However, a fundamental question is whether (4) always holds. This question yields a logical gap in methodological development. If (4) is not true, then a false discovery rate procedure at level α* < α can be more powerful than a procedure at level α. Consequently the oracle procedure (5) is not optimal and all attempts to achieve precise false discovery rate control must fail. Surprisingly, (4) can be violated in various important scenarios.
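To make equation (5) concrete, the following is a minimal sketch that solves it numerically in one simple case. It assumes, purely for illustration, one-sided p-values Pi = pr{N(0, 1) > Zi} with non-null z-values distributed as N(μ, 1), so that G1(t) = Φ{μ + Φ^{-1}(t)}; the function names and the values α = 0.10, p = 0.1 and μ = 2.5 are choices made here, not part of the original analysis. Bisection is legitimate only because the marginal false discovery rate is increasing in t in this homoscedastic example; when (4) fails, equation (5) may have several roots.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def alt_pvalue_cdf(t, mu):
    """G1(t) = pr(P < t | non-null) for P = pr{N(0,1) > Z} with Z ~ N(mu, 1)."""
    return STD_NORMAL.cdf(mu + STD_NORMAL.inv_cdf(t))


def marginal_fdr(t, p, mu):
    """Leading term of equation (2) at p-value cutoff t."""
    null_part = (1.0 - p) * t
    return null_part / (null_part + p * alt_pvalue_cdf(t, mu))


def oracle_cutoff(alpha, p, mu, tol=1e-12):
    """Solve equation (5) for t by bisection; assumes alpha < 1 - p
    and a marginal FDR that is increasing in t."""
    lo, hi = 1e-15, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if marginal_fdr(mid, p, mu) < alpha:
            lo = mid  # FDR still below the target: move the cutoff up
        else:
            hi = mid
    return 0.5 * (lo + hi)


t_star = oracle_cutoff(alpha=0.10, p=0.1, mu=2.5)
print(t_star, marginal_fdr(t_star, p=0.1, mu=2.5))
```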

2. The Monotone Likelihood Ratio Condition

Consider a decision rule δ = (δ1, …, δm), where δi = I(Ti < t). Various statistics Ti have been proposed for multiple testing in the literature, including the local false discovery rate (Efron et al., 2001), the weighted p-value (Genovese et al., 2006), the local index of significance (Sun & Cai, 2009) and t-statistics (Cao & Kosorok, 2011). Therefore it is desirable to develop a general principle which guarantees that (4) is fulfilled by different Ti. To focus on the main idea, we assume for the moment that Ti are identically distributed with G0(t) = pr(Ti < t | θi = 0) and G1(t) = pr(Ti < t | θi = 1) for i = 1, …, m. Let gj(t) = (d/dt)Gj(t)(j = 0, 1) be the corresponding conditional densities. The monotone likelihood ratio condition can be stated as

$$g_1(t)/g_0(t) \text{ is monotonically decreasing in } t. \tag{6}$$

It is commonly assumed that G1(t), the p-value distribution under the alternative, is a concave function. Such an assumption has been made in Storey (2002), Genovese & Wasserman (2002, 2004) and Kosorok & Ma (2007), among others. This concavity assumption is a special case of condition (6) if the null p-value distribution is uniform. A significant advantage of condition (6), compared to condition (4), is that it can be roughly checked in practice. For a p-value testing procedure, we can first estimate the mixture density by ĝP (t). Then ĝP (t) would be decreasing in t if the monotone likelihood ratio condition holds.
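The rough check described above can be sketched as follows. This is only an illustrative diagnostic under assumptions made here: the mixture density is estimated by an equal-width histogram on [0, 1], and the helper names, the number of bins and the `slack` tolerance are hypothetical choices rather than a recommended procedure.

```python
def histogram_density(pvalues, n_bins=20):
    """Equal-width histogram estimate of the p-value mixture density on [0, 1]."""
    counts = [0] * n_bins
    for pv in pvalues:
        idx = min(int(pv * n_bins), n_bins - 1)  # put p = 1 into the last bin
        counts[idx] += 1
    width = 1.0 / n_bins
    return [c / (len(pvalues) * width) for c in counts]


def first_increase(density, slack=0.0):
    """Index of the first bin whose density exceeds the previous bin by more
    than `slack`; None if the estimate is non-increasing up to that tolerance."""
    for k in range(1, len(density)):
        if density[k] > density[k - 1] + slack:
            return k
    return None


# Toy p-values piled up in the middle of [0, 1]: the estimate rises, so a
# decreasing mixture density is not supported.
toy = [0.3 + 0.4 * (i + 0.5) / 1000 for i in range(1000)]
print(first_increase(histogram_density(toy)))
```

In practice `slack` would need to reflect sampling variability in the bin heights; a first increase well beyond that variability is a warning sign, not a formal test.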

The dominant terms on the right hand sides of equations (2) and (3) are referred to as the marginal false discovery rate and marginal false non-discovery rate, respectively. The property of a testing rule is essentially characterized by these approximations. We mainly use these marginal measures hereinafter to simplify our discussion while still preserving the key features of the problem. The main finding is that condition (6), although not affecting the validity of a multiple testing procedure, plays an important role in optimality analysis. The next proposition shows that exact false discovery rate control leads to the most powerful test when condition (6) is fulfilled.

Proposition 1

(Sun & Cai, 2007). Consider random mixture model (1). Let Ti = T(Xi) be the test statistic and δ(T, t) = {δi : i = 1, …, m} = {I(T(Xi) < t) : i = 1, …, m} the testing rule. If Ti satisfies condition (6), then (i) mFDR(t) increases in t; (ii) mFNR(t) decreases in t; and (iii) mFNR(t) decreases in mFDR(t). In particular, results (i), (ii) and (iii) hold when T(Xi) = Pi and the p-value distribution function under the alternative is concave.
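For intuition, part (i) can be sketched as follows; this is a standard argument written out here for illustration and is not claimed to be the proof in Sun & Cai (2007). Writing mFDR(t) = (1 − p)G0(t)/{(1 − p)G0(t) + pG1(t)}, the marginal false discovery rate is non-decreasing in t exactly when G1(t)/G0(t) is non-increasing in t, and condition (6) delivers the latter:

```latex
G_1(t) = \int_{-\infty}^{t} \frac{g_1(s)}{g_0(s)}\, g_0(s)\, ds
\;\ge\; \frac{g_1(t)}{g_0(t)} \int_{-\infty}^{t} g_0(s)\, ds
= \frac{g_1(t)}{g_0(t)}\, G_0(t),
\qquad\text{so}\qquad
\frac{d}{dt}\,\frac{G_1(t)}{G_0(t)}
= \frac{g_1(t)\,G_0(t) - G_1(t)\,g_0(t)}{G_0(t)^2} \;\le\; 0 .
```

The inequality uses only that g1(s)/g0(s) ≥ g1(t)/g0(t) for s ≤ t, which is what condition (6) states.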

As pointed out by a reviewer, the monotonicity relationship is derived only for single-step thresholding procedures δ(T, t). The results in Genovese & Wasserman (2002) indicate that, in a random mixture model, a broad class of stagewise testing procedures have asymptotically equivalent versions in the family of single-step thresholding procedures. Therefore our result remains relevant when stagewise procedures such as the step-up procedure of Benjamini & Hochberg (1995) are considered.

3. Violation of the Monotone Likelihood Ratio Condition

3·1. Heteroscedastic models

This section explores several important situations where conditions (4) and (6) are violated. First consider a heteroscedastic normal mixture model

$$Z_i \mid \theta_i \sim (1-\theta_i)N(0,1) + \theta_i N(\mu,\sigma^2), \quad i = 1, \ldots, m, \tag{7}$$

where θ1, …, θm are independent Bernoulli(p) variables. The next theorem shows that the standard approach, which thresholds the z-value or, equivalently, the one-sided p-value Pi = pr{N(0, 1) > Zi}, may fail to fulfil condition (6).

Theorem 1

Consider the normal mixture model (7). Define the one-sided p-value Pi = pr{N(0, 1) > Zi}. Let δ = (δi: i = 1, …, m) be a testing rule, where δi = I(Pi < t). Then condition (6) always holds when σ ≥ 1 but fails when σ < 1.
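To see where σ = 1 enters, it can help to write the likelihood ratio on the z-value scale. The display below is a sketch under model (7) with μ > 0, where f0 and f1 denote the N(0, 1) and N(μ, σ²) densities (notation introduced here for illustration); it is not the proof given in the Supplementary Material. Because Pi is a decreasing function of Zi and the null p-value is uniform, condition (6) for the p-value amounts to f1(z)/f0(z) being increasing in z.

```latex
\log\frac{f_1(z)}{f_0(z)} = -\log\sigma + \frac{z^2}{2} - \frac{(z-\mu)^2}{2\sigma^2},
\qquad
\frac{d}{dz}\log\frac{f_1(z)}{f_0(z)} = z\left(1-\frac{1}{\sigma^2}\right) + \frac{\mu}{\sigma^2}.
```

When σ = 1 the derivative equals μ, so the ratio is increasing in z. When σ < 1 the coefficient of z is negative, so the ratio increases only up to z = μ/(1 − σ²) and decreases afterwards; with μ = 2.5 and σ = 0.5, for instance, the turning point is z ≈ 3.33.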

The heteroscedastic model (7) can arise from applications such as sign tests. Suppose we want to test whether random variable Yi has median 0 based on replicated observations Yi1, …, Yin (i = 1, …, m). Let q = pr(Yi > 0). The hypotheses can be stated as H0i: q = 0.5 versus H1i: q ≠ 0.5. The test statistic is $Z_i = n^{-1/2}\sum_{j=1}^{n}\mathrm{sign}(Y_{ij}) = n^{-1/2}\sum_{j=1}^{n}\{2I(Y_{ij}>0)-1\}$. We have E(Zi) = 2q − 1, $\mathrm{var}(Z_i) = \sigma_q^2 = 4q(1-q)$, Zi ~ N(0, 1) under H0i, and $Z_i \sim N(2q-1, \sigma_q^2)$ under H1i with $\sigma_q^2 < 1$. Therefore the sign test gives rise to a heteroscedastic model asymptotically. Next we provide a numerical example to illustrate the failure of the condition in a heteroscedastic model.

Example 1

We generate m = 2000 independent Bernoulli(p) variables θ1, …, θm with p = 0.1, and generate Zi according to model (7) with μ = 2.5. The one-sided p-value is obtained as Pi = pr{N(0, 1) > Zi}. We vary the critical value t from 1.95 to 4 and calculate the false discovery proportion FDP(t). Then FDR(t) is obtained by averaging FDP(t) over 2000 replications. The results are summarized in the first row of Fig. 1. We can see that when σ = 1, FDR(t) decreases monotonically in t. However, when σ = 0.5, FDR(t) first decreases and then increases in t. The violation of monotonicity leads to testing results that are not interpretable. For example, the right panel of the first row of Fig. 1 suggests that if we threshold at t = 3.8, the false discovery rate is 0.12, but if we threshold at t = 3.0, the false discovery rate is 0.07. In fact, a larger threshold does not necessarily control the false discovery rate at a lower level when σ < 1. The heteroscedasticity results in the violation of (4) and (6).
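The pattern in Example 1 can be approximated without simulation by evaluating the marginal false discovery rate of the rule that rejects when Zi exceeds a critical value c. The sketch below makes that substitution explicitly: it computes the leading term (1 − p)pr{N(0, 1) > c} divided by the total rejection probability under model (7), rather than the Monte Carlo average used in the example, and the function name and grid are illustrative choices.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def marginal_fdr_z(c, p, mu, sigma):
    """Marginal FDR of the rule 'reject when Z_i > c' under model (7)."""
    null_tail = (1.0 - p) * STD_NORMAL.cdf(-c)          # (1-p) pr{N(0,1) > c}
    alt_tail = p * STD_NORMAL.cdf(-(c - mu) / sigma)    # p pr{N(mu, sigma^2) > c}
    return null_tail / (null_tail + alt_tail)


cutoffs = [1.95 + 0.05 * k for k in range(42)]  # 1.95, 2.00, ..., 4.00
for sigma in (1.0, 0.5):
    curve = [marginal_fdr_z(c, p=0.1, mu=2.5, sigma=sigma) for c in cutoffs]
    decreasing = all(b < a for a, b in zip(curve, curve[1:]))
    print(f"sigma={sigma}: decreasing over the whole grid? {decreasing}")

# The two thresholds quoted in Example 1, for sigma = 0.5
for c in (3.0, 3.8):
    print(c, round(marginal_fdr_z(c, p=0.1, mu=2.5, sigma=0.5), 2))
```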

Fig. 1. The first row corresponds to heteroscedastic models with σ = 1 (left) and σ = 0.5 (right); the second row corresponds to correlated tests with weak correlation (left) and strong correlation (right).

3·2. Correlated tests

This section discusses the violation of condition (6) under dependency. An additional example on multiple testing with groups is discussed in the Supplementary Material. The dependency issue has attracted much attention in the multiple testing literature (Benjamini & Yekutieli, 2001; Efron, 2007; Wu, 2008; Sun & Cai, 2009). The next example shows that condition (6) can be violated under strong dependency.

Example 2

Suppose we observe X = (X1, …, Xm) from the model

$$X = \mu + \varepsilon, \tag{8}$$

and want to identify non-zero elements in μ = (μ1, …, μm). In many important applications such as imaging analysis and signal processing, it is commonly believed that the null cases are independent but the non-null cases are clustered (Logan et al., 2008). We consider such a setting. In our simulation, the total number of tests is m = 2000 and the proportion of non-null hypotheses is p = 0.1. Let m0 = m(1 − p). Without loss of generality, we assume that the first m0 elements X0 = (X1, …, Xm0) are null cases and the remaining m − m0 elements X1 = (Xm0+1, …, Xm) are non-null cases. Under the null, X1, …, Xm0 are independent observations from N(0, 1). Under the alternative, the vector X1 follows a multivariate normal distribution with mean $\mu_1 = \mu 1_{m-m_0}$ and equi-correlated variance-covariance matrix Σ = (1 − ρ)I + ρJ, where $1_{m-m_0}$ is a vector of ones, I is the identity matrix and J is a square matrix of ones.

We vary the critical value t from 1.95 to 4 and calculate the false discovery rate by averaging over 2000 replications. The results are summarized in the second row of Figure 1. The left and right panels consider the weakly correlated case where μ = 2.5 and ρ = 0.1 and the strongly correlated case where μ = 2.5 and ρ = 0.9. We can see that under weak correlation, the false discovery rate is monotonically decreasing in the threshold. In contrast, under strong correlation, condition (4) is violated because the false discovery rate first decreases and then increases and finally decreases in the critical value t.
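A simulation along the lines of Example 2 might be sketched as follows. Several details are assumptions made here rather than taken from the text: the rejection rule is one-sided (reject when Xi > c), the equi-correlated non-null block is generated through a shared factor, Xi = μ + ρ^{1/2}W + (1 − ρ)^{1/2}ei with W and ei independent N(0, 1), and the function name, seed and default number of replications are illustrative.

```python
import random


def fdr_curve(cutoffs, m=2000, p=0.1, mu=2.5, rho=0.9, reps=200, seed=0):
    """Monte Carlo estimate of FDR(c) for the rule 'reject when X_i > c'."""
    rng = random.Random(seed)
    m1 = int(round(m * p))  # number of non-null cases
    m0 = m - m1             # number of null cases
    totals = [0.0] * len(cutoffs)
    for _ in range(reps):
        nulls = [rng.gauss(0.0, 1.0) for _ in range(m0)]
        shared = rng.gauss(0.0, 1.0)  # common factor inducing correlation rho
        nonnulls = [
            mu + rho ** 0.5 * shared + (1.0 - rho) ** 0.5 * rng.gauss(0.0, 1.0)
            for _ in range(m1)
        ]
        for k, c in enumerate(cutoffs):
            false_rej = sum(x > c for x in nulls)
            true_rej = sum(x > c for x in nonnulls)
            # false discovery proportion, with 0/0 treated as 0
            totals[k] += false_rej / max(false_rej + true_rej, 1)
    return [total / reps for total in totals]


cutoffs = [2.0, 2.5, 3.0, 3.5, 4.0]
for rho in (0.1, 0.9):
    print(rho, [round(v, 3) for v in fdr_curve(cutoffs, rho=rho, reps=100)])
```

The shared-factor construction gives unit variances and pairwise correlation ρ, matching Σ = (1 − ρ)I + ρJ; a finer grid and more replications would be needed to trace the curves in the second row of Fig. 1.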

Inspired by a comment from a reviewer, we investigated the relationship between the marginal false discovery rate and false discovery rate under dependency. The two error measures can be very different when the tests are highly correlated. We present the results related to the false discovery rate here since it is more commonly used. See the Supplementary Material for more results on the marginal false discovery rate.

3·3. A real data example

Next we present an example from a DNA methylation study. The study was conducted by Teschendorff et al. (2010) to investigate the mechanisms of diabetic nephropathy, which often develops in patients with chronic diabetes. The data set contains 96 cases and 98 controls on 25880 markers. We are interested in identifying markers at which the proportions of methylation are different between cases and controls. A two sample t-statistic is calculated for each gene and the t-statistics are then converted to p-values.

The left panel of Figure 2 contains the histogram of p-values overlaid with the density estimate ĝ(t). The mixture distribution is G(t) = (1 − p)t + pG1(t). Condition (6) implies that G1(t) is concave. Hence a roughly decreasing pattern is expected for ĝ(t) should the monotone likelihood ratio condition hold. However, we can see that ĝ(t) first increases and then decreases, indicating that condition (6) is violated. A direct consequence is that the false discovery rate is not a monotone function of the p-value cutoff, which makes the search for an optimal threshold impossible. To see this, we apply the q-value false discovery rate approach (Storey, 2002) to estimate the non-null proportion as p̂ = 0.49. The false discovery rate for a given cutoff t can be approximately estimated as $\widehat{\mathrm{FDR}}(t) = (1-\hat{p})t / \{m^{-1}\sum_i I(P_i < t)\}$. The right panel of Figure 2 plots the false discovery rate estimates against a grid of p-value cutoffs; the estimate first decreases and then increases. The pattern is very counter-intuitive and, moreover, the results are uninterpretable, since a larger p-value may correspond to a smaller false discovery rate level in the range between 0 and 0.20. We suspect that in this data set the p-value ranking is inappropriate. In other words, small p-values do not necessarily indicate strong evidence against the null. This example shows that multiple testing results should be interpreted with caution. In particular, further investigation is required for possible effects of the normality assumption, heteroscedasticity, grouping and dependence among tests.
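The plug-in estimate used for the right panel of Figure 2 is simple to compute. The following sketch evaluates (1 − p̂)t divided by the fraction of p-values below t; the function name and the toy p-values are hypothetical, and returning 0 when nothing is rejected is a convention assumed here, not one stated in the text.

```python
def estimated_fdr(pvalues, t, p_hat):
    """Plug-in estimate (1 - p_hat) * t / {m^{-1} sum_i I(P_i < t)}."""
    m = len(pvalues)
    rejected_fraction = sum(pv < t for pv in pvalues) / m
    if rejected_fraction == 0.0:
        return 0.0  # assumed convention: no rejections, no false discoveries
    return (1.0 - p_hat) * t / rejected_fraction


# Toy illustration with made-up p-values (not the methylation data)
toy_pvalues = [0.01, 0.02, 0.03, 0.5, 0.9]
for t in (0.025, 0.05, 0.6):
    print(t, estimated_fdr(toy_pvalues, t, p_hat=0.5))
```

Scanning such estimates over a grid of cutoffs and checking whether they are non-decreasing in t gives a quick empirical look at assumption (4).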

Fig. 2. The left plot is the histogram and density of p-values; the right plot is the estimated false discovery rate.

4. Generalized Monotone Ratio Condition

Let T = (T1, …, Tm) be the test statistics and θ = (θ1, …, θm) be Bernoulli(pi) variables with pi = pr(θi = 1), i = 1, …, m. Suppose that Ti | θi ~ (1 − θi)Gi0 + θiGi1. Condition (6) requires all Gi0’s (and Gi1’s) to be identical. Now we generalize condition (6) by allowing Gi0 and Gi1 to vary across i so that we can handle a wider class of test statistics such as weighted p-values (Genovese et al., 2006) and the local index of significance (Sun & Cai, 2009). Let gi0 and gi1 be the corresponding densities. Define the following generalized monotone ratio condition

$$\frac{\sum_{i=1}^{m} p_i\, g_{i1}(t)}{\sum_{i=1}^{m} (1-p_i)\, g_{i0}(t)} \text{ is monotonically decreasing in } t. \tag{9}$$

The next theorem generalizes Proposition 1.

Theorem 2

Consider a decision rule of the form δ = {δi: i = 1, …, m} = {I(Ti < t): i = 1, …, m}. If Ti satisfies (9), then (i) mFDR(t) increases in t; (ii) mFNR(t) decreases in t; and (iii) mFNR(t) decreases in mFDR(t).

Next we propose a class of test statistics which always satisfy the generalized condition (9). Let θi ~ Bernoulli(pi). Suppose we observe X = (X1, …, Xm) from the following model

$$X = \mu + \varepsilon, \tag{10}$$

where μi | θi ~ (1 − θi)fi0(μ) + θifi1(μ) and E(ε) = 0. The use of fi0(μ) and fi1(μ) allows the null and non-null distributions to vary with i. We also assume that θ and ε follow some multivariate distribution with arbitrary covariance matrices Σθ and Σε, respectively. The next theorem derives a class of test statistics for model (10) which always obey (9).

Theorem 3

Consider model (10). Denote by Θ the collection of all model parameters pi, fi0, fi1, Σθ and Σε. Suppose an oracle knows Θ. Let $T_{OR}^i = \mathrm{pr}_\Theta(\theta_i = 0 \mid X)$ be the oracle test statistic and $T_{OR} = \{T_{OR}^i : i = 1, \ldots, m\}$. Then $T_{OR}$ satisfies condition (9).

The oracle statistic involves unknown parameters which require accurate estimation in practice. In situations where Θ and $T_{OR}$ can be estimated well, Theorem 3 can be directly applied to avoid the failure of condition (9). For example, suppose X1, …, Xm are a random sample from mixture density f(x) = (1 − p)f0(x) + pf1(x). Then condition (9) reduces to condition (6) and $T_{OR}^i$ reduces to the local false discovery rate Lfdr(Xi) = (1 − p)f0(Xi)/f(Xi), which by Theorem 3 obeys (6). Similarly, test statistics which obey (9) can be derived, for example, in hidden Markov models and the multi-group model considered by Efron (2008) and Cai & Sun (2009). In the Supplementary Material we revisit Example 1 to demonstrate an important application of Theorem 3. Theorems 2 and 3 together provide a useful framework for choosing proper test statistics in practice. However, the scope of our result is limited since strong distributional assumptions are needed and the estimation of unknown Θ can be very challenging. By revealing the interesting connection between estimation and testing in problems arising from model (10), we show that much research is still needed towards a more general estimation and testing theory in large-scale simultaneous inference.
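As a small illustration of the random-sample case, the sketch below evaluates the local false discovery rate for the two-component normal mixture of model (7). It treats p, μ and σ as known, in the spirit of the oracle, and borrows the values p = 0.1, μ = 2.5 and σ = 0.5 from Example 1; it is not a reproduction of the analysis in the Supplementary Material, and the function name is hypothetical.

```python
from statistics import NormalDist


def lfdr(x, p=0.1, mu=2.5, sigma=0.5):
    """Lfdr(x) = (1 - p) f0(x) / f(x) with f0 = N(0, 1) and f1 = N(mu, sigma^2)."""
    f0 = NormalDist(0.0, 1.0).pdf(x)
    f1 = NormalDist(mu, sigma).pdf(x)
    return (1.0 - p) * f0 / ((1.0 - p) * f0 + p * f1)


for x in (2.0, 3.3, 5.0):
    print(x, round(lfdr(x), 3))
```

With σ < 1 the statistic is not monotone in x: a very large z-value can carry a larger Lfdr than a moderate one, so thresholding Lfdr ranks the hypotheses differently from thresholding the z-value or the one-sided p-value.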

5. Discussion

The monotone likelihood ratio condition plays an important role in optimal thresholding theory for false discovery rate analysis. It guarantees that precise false discovery control leads to the most powerful test. We provide important scenarios where this seemingly reasonable assumption is violated and discuss the consequence of violation using both simulated and real data. Although our discussion primarily considers the false discovery rate, we expect that similar issues exist for other important error measures in multiple testing (Romano & Wolf, 2007). We argue that the tacit assumption (4) should be scrutinized in practice and optimal thresholds in multiple testing need to be carefully interpreted.

The failure of the monotonicity condition can result from improper model assumptions such as homoscedasticity and normality of the distributions, as well as independence and homogeneity among the tests. We discussed a possible framework for choosing test statistics to avoid the failure of the condition. However, our theory is far from solving the problem completely. Instead, the main goal is to demonstrate why one should be very careful about unknown model aspects and distributional issues in analyzing complex data sets from modern scientific applications, which commonly consist of a large number of variables with a small sample size. Our investigation reveals that, in addition to the existing list of concerns, the seemingly reasonable monotonicity assumption can be violated unexpectedly. Hence precise inference in the large p small n paradigm is very difficult and we should always proceed with caution.

Acknowledgments

We thank the editor, an associate editor and two referees for helpful suggestions that streamlined this paper. We thank Michael Wu for providing us the DNA methylation data. This research was supported by grants from the U.S. National Science Foundation and U.S. National Institutes of Health.

Footnotes

Supplementary material

Supplementary Material available at Biometrika online includes proofs of all theorems, simulation studies on grouped hypothesis testing and marginal false discovery rate analysis, and a revisit of Example 1.

Contributor Information

Hongyuan Cao, Email: hycao@uchicago.edu, Department of Health Studies, University of Chicago, Chicago, Illinois 60637, U.S.A.

Wenguang Sun, Email: wenguans@marshall.usc.edu, Department of Information and Operation Management, Marshall School of Business, University of Southern California, Los Angeles, California 90089, U.S.A.

Michael R. Kosorok, Email: kosorok@unc.edu, Department of Biostatistics and Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27514, U.S.A.

References

  1. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc B. 1995;57:289–300.
  2. Benjamini Y, Hochberg Y. On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Educational and Behavioral Statistics. 2000;25:60–83.
  3. Benjamini Y, Krieger AM, Yekutieli D. Adaptive linear step-up procedures that control the false discovery rate. Biometrika. 2006;93:491–507.
  4. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann Statist. 2001;29:1165–1188.
  5. Cai TT, Sun W. Simultaneous testing of grouped hypotheses: Finding needles in multiple haystacks. J Amer Statist Assoc. 2009;488:1467–1481.
  6. Cao H, Kosorok MR. Simultaneous critical values for t-tests in very high dimensions. Bernoulli. 2011;17:347–394. doi: 10.3150/10-BEJ272.
  7. Efron B. Correlation and large-scale simultaneous significance testing. J Amer Statist Assoc. 2007;102:93–103.
  8. Efron B. Simultaneous inference: When should hypothesis testing problems be combined? Annals of Applied Statistics. 2008;2:197–223.
  9. Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes analysis of a microarray experiment. J Amer Statist Assoc. 2001;96:1151–1160.
  10. Genovese C, Wasserman L. Operating characteristics and extensions of the false discovery rate procedure. J R Stat Soc B. 2002;64:499–517.
  11. Genovese C, Wasserman L. A stochastic process approach to false discovery control. Ann Statist. 2004;32:1035–1061.
  12. Genovese CR, Roeder K, Wasserman L. False discovery control with p-value weighting. Biometrika. 2006;93:509–524.
  13. Kosorok M, Ma S. Marginal asymptotics for the “large p, small n” paradigm: with application to microarray data. Ann Statist. 2007;35:1456–1486.
  14. Logan B, Geliazkova M, Rowe D. An evaluation of spatial thresholding techniques in fMRI analysis. Human Brain Mapping. 2008;29:1379–1389. doi: 10.1002/hbm.20471.
  15. Romano JP, Wolf M. Control of generalized error rates in multiple testing. Ann Statist. 2007;35:1378–1408.
  16. Storey JD. A direct approach to false discovery rates. J R Stat Soc B. 2002;64:479–498.
  17. Sun W, Cai TT. Oracle and adaptive compound decision rules for false discovery rate control. J Amer Statist Assoc. 2007;102:901–912.
  18. Sun W, Cai TT. Large-scale multiple testing under dependence. J R Stat Soc B. 2009;71:393–424.
  19. Teschendorff AE, Menon U, Gentry-Maharaj A, Ramus SJ, Weisenberger DJ, Shen H, Campan M, Nourshmehr H, Bell CG, Maxwell AP, Savage DA, Mueller-Holzner E, Marth C, Kocjan G, Gayther SA, Jones A, Beck S, Wagner W, Laird PW, Jacobs IJ, Widschwendter M. Age-dependent DNA methylation of genes that are suppressed in stem cells is a hallmark of cancer. Genome Research. 2010:440–446. doi: 10.1101/gr.103606.109.
  20. Wu WB. On false discovery control under dependence. Ann Statist. 2008;36:364–380.
