Author manuscript; available in PMC 2009 Sep 15. Published in final edited form as: Stat Probab Lett. 2008 Sep 15;78(13):1878–1884. doi: 10.1016/j.spl.2008.01.055

Optimality of the Holm procedure among general step-down multiple testing procedures

Alexander Y. Gordon, Peter Salzman
PMCID: PMC2583789  NIHMSID: NIHMS69192  PMID: 19759804

Abstract

We study the class of general step-down multiple testing procedures, which contains the usually considered procedures determined by a nondecreasing sequence of thresholds (we call them threshold step-down, or TSD, procedures) as a parametric subclass. We show that all procedures in this class satisfying the natural condition of monotonicity and controlling the family-wise error rate (FWER) at a prescribed level are dominated by one of them – the classical Holm procedure. This generalizes an earlier result pertaining to the subclass of TSD procedures (Lehmann and Romano, Testing Statistical Hypotheses, 3rd ed., 2005). We also derive a relation between the levels at which a monotone step-down procedure controls the FWER and the generalized FWER (the probability of k or more false rejections).

Keywords: multiple testing procedure, p-value, step-down procedure, monotone procedure, family-wise error rate, generalized family-wise error rate, Holm procedure

1 Introduction

Multiple testing procedures (MTPs) have many applications. Major applications arise in biology and medicine, such as DNA microarray experiments, clinical trials with multiple endpoints, and functional neuroimaging. The needs of these fields have largely motivated the intensive development of the theory and methods of multiple hypothesis testing over the last decade. For a brief introduction to MTPs in the context of microarray experiments, see Dudoit et al. (2003). More information can be found in Hochberg and Tamhane (1987), Shaffer (1995), Hsu (1996), and Benjamini et al. (2004).

The problem of multiple testing arises when we want to simultaneously test several, say m, hypotheses about the probability distribution from which our data are drawn. We assume that associated with each hypothesis Hi is a p-value, Pi, which is a random variable, such that 0 ≤ Pi ≤ 1 and, if Hi is true,

pr{Pi ≤ x} ≤ x,  0 ≤ x ≤ 1. (1)

The observed p-value measures the strength of the evidence against the corresponding hypothesis, typically provided by an appropriate test statistic: the smaller the p-value, the stronger the evidence. For a more detailed discussion of p-values, see, e.g., Lehmann and Romano (2005b, pp. 63–64).

From now on we will assume that hypothesis Hi is true if and only if (1) holds.

In the rest of the Introduction we will outline the notions necessary for our main result, as well as the result itself. Rigorous formulations will be given in Section 2 and Section 3.

A multiple testing procedure, say ℳ, is a decision rule that, based on the vector of observed p-values (p-vector), selects some hypotheses for rejection; the corresponding p-values are said to be ℳ-significant, or just significant, if this causes no ambiguity.1

We assume that an MTP has two basic properties: it is cutting (a significant p-value cannot exceed an insignificant one) and symmetric (permuting components of a p-vector does not affect which of them will be found significant).

Our main result pertains to monotone step-down procedures. The natural condition of monotonicity means that if some p-values are replaced by smaller ones, the number of rejected hypotheses can only increase. That ℳ is a step-down MTP essentially means that whether or not a p-value is ℳ-significant depends only on that p-value and the smaller ones.

We consider those monotone step-down procedures for which the probability of rejecting at least one true hypothesis (the family-wise error rate, or FWER) does not exceed a chosen level α (0 < α < 1), whatever the joint distribution of the p-values. Our main result states that all those MTPs are dominated by one of them – the classical Holm procedure. Domination means that whenever such a procedure ℳ rejects some hypotheses, the Holm procedure, based on the same p-values, rejects those hypotheses too.

A particular case of this result pertaining to the algorithmically defined class of what we call threshold step-down (TSD) procedures (see Subsection 2.3 below) was obtained earlier by Lehmann and Romano (2005b, Chapter 9).

The remaining part of the paper is organized as follows. In Section 2 we give the necessary definitions, mainly following Gordon (2007a, 2007b) (with some changes in terminology). In Section 3 the main result is rigorously formulated and proven. In Section 4 we apply it to the study of generalized family-wise error rates of a monotone step-down procedure.

2 Multiple testing procedures and the FWER

2.1 Multiple testing procedures

A p-value-only-based multiple testing procedure (in the sequel – multiple testing procedure, or MTP) is a Borel measurable mapping ℳ: Im → 2Km from the unit cube Im = [0, 1]m to the set 2Km of all subsets of Km = {1, 2, … ,m}. Applying ℳ to a vector p = (p1,…, pm) of observed p-values (p-vector), we obtain a subset ℳ(p) of Km; the inclusion i ∈ ℳ(p) means that, given p-values p1,…, pm, ℳ rejects hypothesis Hi, or equivalently, the ith p-value is ℳ-significant.

Example 2.1

The classical Bonferroni procedure, which we denote by Bonfα (0 < α < 1), is defined by Bonfα(p) := {i ∈ Km : pi ≤ α/m}, p ∈ Im.
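
For concreteness, here is a minimal Python sketch of the Bonferroni rule (the function name and the 0-based indexing are ours, not the paper's):

    import numpy as np

    def bonferroni(p, alpha):
        """Bonf_alpha: reject H_i iff p_i <= alpha/m (indices are 0-based here)."""
        p = np.asarray(p, dtype=float)
        m = len(p)
        return {i for i in range(m) if p[i] <= alpha / m}

    # Example with m = 4 and alpha = 0.05, so the common cutoff is alpha/m = 0.0125:
    print(bonferroni([0.001, 0.03, 0.2, 0.6], 0.05))   # {0}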

A multiple testing procedure ℳ is symmetric if for any p-vector p and any one-to-one mapping (permutation) σ: Km → Km, denoting by σ(p) the p-vector p′ such that p′σ(i) = pi for all i, we have

ℳ(σ(p)) = σ(ℳ(p)). (2)

Informally speaking, this means that if we list the hypotheses (and their p-values) in a different order, the procedure ℳ will reject the same hypotheses as before.

Procedure ℳ is cutting if it satisfies the following condition: whenever, given p-vector p, pi is ℳ-significant and pj is not, we have pi ≤ pj. In fact, in this case we even have pi < pj, provided ℳ is symmetric; indeed, symmetry implies that if pi = pj, then the corresponding hypotheses Hi and Hj are either both rejected or both accepted. (To see this, use (2) choosing the permutation σ that transposes i and j.)

All MTPs considered below are assumed to be symmetric and cutting.

Procedure ℳ determines a function sℳ: Im → Lm = {0, 1, 2,…,m} defined as follows: sℳ(p) is the number of rejected hypotheses, given p-vector p (i.e., sℳ(p) = Card ℳ(p)). The function sℳ(·) is symmetric (permuting components of the p-vector p does not affect sℳ(p), which follows from (2)) and, in turn, allows one to uniquely recover procedure ℳ: given p, ℳ rejects the hypotheses with the sℳ(p) smallest p-values. Note that the function sℳ(·), in view of its symmetry, is completely determined by its values on the simplex

Simpm = {t = (t1,…,tm) ∈ Rm : 0 ≤ t1 ≤ … ≤ tm ≤ 1};

the same is true for ℳ. The restriction of sℳ(·) to Simpm can be any Borel measurable function s(·) on Simpm with the following property:

if t ∈ Simpm and r := s(t) satisfies 1 ≤ r ≤ m − 1, then tr < tr+1.

This property is equivalent to the fact that declaring the first s(t) components of t ∈ Simpm significant will not result in equalities between significant and insignificant components.
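
A short Python sketch (the helper name apply_procedure is ours) illustrates how a symmetric cutting MTP is applied through its rejection-count function on the simplex: sort the p-values, evaluate the count on the ordered vector, and reject the hypotheses carrying the smallest p-values.

    import numpy as np

    def apply_procedure(p, s_on_simplex):
        """Apply a symmetric cutting MTP specified by its rejection count s(.)
        on the simplex Simp^m: reject the hypotheses with the s(t) smallest
        p-values, where t is the sorted p-vector."""
        p = np.asarray(p, dtype=float)
        order = np.argsort(p, kind="stable")   # positions of the sorted p-values
        r = s_on_simplex(p[order])             # number of rejections
        return set(order[:r].tolist())

    # The Bonferroni count on the simplex, as a simple test case:
    s_bonf = lambda t, alpha=0.05: int(np.sum(t <= alpha / len(t)))
    print(apply_procedure([0.2, 0.001, 0.03, 0.6], s_bonf))   # {1}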

For a more detailed discussion, see Gordon (2007a).

2.2 Comparison of procedures

Let ℳ and ℳ′ be two multiple testing procedures. Following Liu (1996), we say that procedure ℳ′ dominates procedure ℳ if for any p-vector p = (p1, p2,…,pm) we have ℳ′(p) ⊇ ℳ(p), i.e., ℳ′ rejects all hypotheses Hi rejected by ℳ (and possibly some others). In this case we write ℳ′ ⪰ ℳ.

It is easy to see that ℳ′ ⪰ ℳ if and only if sℳ′(t) ≥ sℳ(t) for all t ∈ Simpm.

Note that the partial order ⪰ between MTPs is not linear: both relations ℳ′ ⪰ ℳ and ℳ ⪰ ℳ′ may be false. Therefore, given an arbitrary class of MTPs, one should not expect that it contains a procedure that dominates every procedure in the class.

2.3 Step-down procedures

A multiple testing procedure ℳ is a step-down procedure if, given t ∈ Simpm, whether or not i ∈ ℳ(t) (i.e., sℳ(t) ≥ i) depends only on t1,…, ti. In other words, it is required that if t, t′ ∈ Simpm, t′j = tj for all j = 1, 2,…,i, and sℳ(t) ≥ i, then also sℳ(t′) ≥ i.

The threshold step-down (TSD) procedure generated by u ∈ Simpm (notation: TSDu), given a nondecreasing sequence of p-values t = (t1, t2,…,tm) ∈ Simpm, declares ti significant if and only if tj ≤ uj for all j = 1,…,i.
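
A TSD procedure admits a short generic implementation; the Python sketch below (ours, with 0-based indices) walks down the sorted p-values and stops at the first exceedance of the corresponding threshold.

    import numpy as np

    def tsd(p, u):
        """TSD_u: given nondecreasing thresholds u_1 <= ... <= u_m, the i-th
        smallest p-value is significant iff t_j <= u_j for all j <= i, i.e.
        rejection proceeds down the sorted list until the first exceedance."""
        p = np.asarray(p, dtype=float)
        u = np.asarray(u, dtype=float)
        order = np.argsort(p, kind="stable")
        exceed = np.nonzero(p[order] > u)[0]
        r = exceed[0] if exceed.size else len(p)   # number of rejected hypotheses
        return set(order[:r].tolist())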

Remark 2.1

A TSD procedure is a step-down procedure.

Example 2.2

The Bonferroni procedure Bonfα (0 < α < 1) is TSDu with ui = α/m, i = 1,…,m.

Example 2.3

The Holm (1979) procedure Holmα (0 < α < 1) is TSDu with ui = α/(m − i + 1), i = 1,…,m.

Example 2.4

The procedure introduced by Benjamini and Liu (1999) is TSDu with

ui = min(1, mq/(m − i + 1)²),  1 ≤ i ≤ m  (0 < q < 1).
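
Examples 2.2–2.4 differ only in the threshold vector u supplied to such a procedure. A small sketch (function names are ours) computing the three threshold sequences:

    import numpy as np

    def bonferroni_u(m, alpha):            # Example 2.2
        return np.full(m, alpha / m)

    def holm_u(m, alpha):                  # Example 2.3
        i = np.arange(1, m + 1)
        return alpha / (m - i + 1.0)

    def benjamini_liu_u(m, q):             # Example 2.4
        i = np.arange(1, m + 1)
        return np.minimum(1.0, m * q / (m - i + 1.0) ** 2)

    print(holm_u(5, 0.05))   # [0.01, 0.0125, 0.0167, 0.025, 0.05] (rounded)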

Given two m-dimensional vectors x and y, we write x ≤ y if xi ≤ yi, i = 1,…,m.

Remark 2.2

TSDu′ ⪰ TSDu if and only if u′ ≥ u.

Note that the class of step-down procedures is broader than that of TSD procedures.

Example 2.5

Let m = 2 and a, ε ∈ (0, 1). Let procedure ℳ act as follows. Given 0 ≤ t1t2 ≤ 1, t1 is ℳ-significant if and only if t1 ≤ ε; t2 is ℳ-significant if and only if t1 ≤ ε and at1 + (1 − a)t2 ≤ ε. Then ℳ is a well-defined MTP (see the end of Subsection 2.1); moreover, ℳ is a step-down procedure, but not a TSD one.
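
A quick numerical check (a sketch; the values a = 0.5 and ε = 0.05 are chosen only for illustration) makes the "not TSD" claim concrete: the significance of t2 depends on t1 through the weighted average, so no fixed threshold u2 can reproduce the rule.

    def example_2_5(t1, t2, a=0.5, eps=0.05):
        """Rejection count of the procedure of Example 2.5 on 0 <= t1 <= t2 <= 1."""
        if t1 > eps:
            return 0
        return 2 if a * t1 + (1 - a) * t2 <= eps else 1

    # t2 = 0.06 is significant when t1 = 0.01 but not when t1 = 0.05, although
    # t1 is significant in both cases; a TSD procedure could not distinguish them.
    print(example_2_5(0.01, 0.06), example_2_5(0.05, 0.06))   # 2 1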

This example can be modified in many ways and generalized to arbitrary m > 1.

2.4 Monotone procedures

We call a multiple testing procedure ℳ monotone if for any p, p′ ∈ Im the relation p′ ≤ p implies sℳ(p′) ≥ sℳ(p).

To show that a given procedure is monotone, it suffices to establish the above implication only for pairs of p-vectors that belong to Simpm, which is usually simpler.

Proposition 1

Procedure ℳ is monotone if and only if the relation t′ ≤ t (t, t′ ∈ Simpm) implies sℳ(t′) ≥ sℳ(t).

This statement, in a slightly different formulation, was proven in Gordon (2007b). It shows that the definition of monotonicity given in that work is equivalent to the definition given above.

Corollary 1

Any TSD procedure is monotone.

Remark 2.3

Procedure ℳ in Example 2.5 is monotone.

2.5 Family-wise error rate

The family-wise error rate (FWER), introduced by Tukey (1953), is the probability that at least one type I error occurs.

Let ℳ be a multiple testing procedure and P a probability distribution on the unit cube Im. Let P = (P1, P2,…,Pm) be a random vector with distribution P. Denote by T the set (depending on P) of all i ∈ Km = {1, 2,…,m} for which the ith hypothesis

Hi : pr{Pi ≤ x} ≤ x for all x ∈ [0, 1] (3)

is true. Let

FWER(ℳ, P) := pr{ℳ(P) ∩ T ≠ ∅}. (4)

In other words, FWER(ℳ, P) is the probability for ℳ, given a set of m random values Pi with joint distribution P, to reject at least one true hypothesis Hi.

Procedure ℳ is said to control the FWER at level α if for all probability distributions P on Im we have FWER(ℳ, P) ≤ α. This is true if and only if the quantity

FWER(ℳ) := supP FWER(ℳ, P), (5)

where the supremum is taken over all probability distributions P on Im, does not exceed α.

The quantity (5) may be viewed as the exact level at which ℳ controls the FWER.

3 Optimality of the Holm procedure

The Holm procedure Holmα (see Example 2.3), given t ∈ Simpm, finds ti significant if and only if

tj ≤ α/(m − j + 1) for all j, 1 ≤ j ≤ i.

Proposition 2

(Holm 1979). FWER(Holmα) ≤ α.

Therefore, Holmα is a monotone step-down procedure (see Remark 2.1 and Corollary 1) controlling the FWER at level α. The following statement – the main result of this work – shows that Holmα dominates any procedure with similar properties.

Theorem 1

Let ℳ be a monotone step-down multiple testing procedure with

FWER(ℳ) ≤ α < 1. (6)

Then ℳ ⪯ Holmα.

Proof

Let ℳ be a monotone step-down procedure, such that (6) is true. Suppose that, given a nondecreasing sequence of p-values

t = (t1,…,tm),  0 ≤ t1 ≤ t2 ≤ … ≤ tm ≤ 1, (7)

procedure ℳ finds the kth of them, tk, significant. How large can the number tk be?

Set l := m − k + 1, τ := min{tk, 1/l}, and

β := lτ, (8)

so that β ≤ 1.

Define l random variables

Pk, Pk+1,…, Pm (9)

as follows. First, choose at random an integer z equal to one of the numbers k, k + 1,…,m, 0 with probabilities τ, τ, …, τ, 1 − β, respectively. Then generate, independently of each other (given z), the random numbers (9), so that if z ≠ 0, Pz is uniformly distributed on the interval [0, τ], while the others are uniformly distributed on [τ, 1]; if z = 0, all Pi, i = k, k + 1,…,m, are uniformly distributed on [τ, 1].

It follows from the construction that the distribution of each random variable Pi, i = k, k + 1,…,m, is supported on [0, 1] and has a constant density on each of the two intervals [0, τ] and [τ, 1]. On the first of them the density equals 1, hence the same is true for the other (because the integral of the density is 1). Therefore,

each Pi, i = k, k + 1,…,m, is uniformly distributed on [0, 1]. (10)

Also, we have

min{Pi : k ≤ i ≤ m} ≤ τ, if z ≠ 0. (11)

Furthermore, let P1,…, Pk−1 be random variables such that 0 ≤ Pi ≤ ti, i = 1,…,k − 1. These inequalities and (11), together with τ ≤ tk, imply that the ordered random variables P(i) satisfy the inequalities

P(i) ≤ ti, i = 1, 2,…,k, if z ≠ 0. (12)

(We use the following property of nondecreasing rearrangements: given a1,…, am ∈ R and c ∈ R, the inequality a(i) ≤ c is equivalent to the existence of at least i values of j such that aj ≤ c.)

The random vector P = (P1,…, Pm), whose distribution on the unit cube [0, 1]m we denote by P, has the following properties:

  (i) According to (10), the hypotheses Hk, Hk+1,…, Hm (see (3)) are true.

  (ii) By (12), with probability at least β we have
    P(i) ≤ ti, i = 1, 2,…,k. (13)

Since procedure ℳ found the first k terms ti in (7) significant and is a step-down procedure, it will still find them significant if the remaining terms tk+1, tk+2,…, tm are replaced by 1’s. Because ℳ is monotone, the first k terms of the sequence (P(1),…, P(m)) will then also be found significant whenever (13) holds. According to (ii), this occurs with probability at least β. By (i), no more than k − 1 hypotheses Hi are false, so by rejecting k or more hypotheses procedure ℳ commits at least one type I error. We therefore have FWER(ℳ, P) ≥ β, and consequently FWER(ℳ) ≥ β. Comparing this with (6), we see that β ≤ α. In particular, β = lτ < 1, so that τ < 1/l and therefore τ = min{tk, 1/l} = tk; since β = ltk ≤ α, we obtain

tk ≤ α/(m − k + 1).

Because, given p-vector (7), procedure ℳ finds all tj, j = 1,…, k, significant, it follows that inequalities

tj ≤ α/(m − j + 1), j = 1,…,k,

are all true; this means that the kth p-value tk is Holmα-significant.

We have shown that if, for a given t ∈ Simpm, its kth component tk is ℳ-significant, then it is Holmα-significant as well. Therefore, ℳ ⪯ Holmα.
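
The construction in the proof is easy to simulate. The following Monte Carlo sketch (our illustration, not part of the paper; the numerical values are arbitrary) takes m = 5, k = 3 and a TSD procedure whose third threshold exceeds the Holm value α/(m − k + 1), builds the mixture distribution used above, and estimates the FWER, which comes out near β = lτ = 0.09, well above α = 0.05.

    import numpy as np

    rng = np.random.default_rng(0)

    def tsd_count(p, u):
        """Number of rejections of TSD_u along the sorted p-values."""
        t = np.sort(p)
        exceed = np.nonzero(t > u)[0]
        return exceed[0] if exceed.size else len(p)

    m, k, alpha = 5, 3, 0.05
    l = m - k + 1                                  # H_3, H_4, H_5 will be true
    u = np.array([0.010, 0.0125, 0.030, 0.030, 0.050])   # u_3 = 0.03 > alpha/3
    tau = min(u[k - 1], 1.0 / l)
    beta = l * tau                                 # 0.09 > alpha

    n_sim, errors = 100_000, 0
    for _ in range(n_sim):
        p = np.empty(m)
        p[:k - 1] = [0.005, 0.010]                 # the k - 1 false hypotheses: small fixed p-values
        p[k - 1:] = rng.uniform(tau, 1.0, size=l)  # P_k, ..., P_m start uniform on [tau, 1]
        z = rng.random()
        if z < beta:                               # with probability tau each, one of them is
            j = (k - 1) + int(z / tau)             # redrawn uniform on [0, tau]; marginally every
            p[j] = rng.uniform(0.0, tau)           # P_i, i >= k, is then uniform on [0, 1]
        r = tsd_count(p, u)
        order = np.argsort(p, kind="stable")
        errors += any(idx >= k - 1 for idx in order[:r])   # was some true hypothesis rejected?

    print(errors / n_sim)   # approximately 0.09 = beta, so the FWER exceeds alpha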

Corollary 2

(Lehmann and Romano 2005b, Chapter 9) Let ℳ = TSDu, where u ∈ Simpm. If FWER(ℳ) ≤ α < 1, then ui ≤ α/(m − i + 1), i = 1,…,m.

Proof

A TSD procedure is a step-down procedure; according to Corollary 1, it is monotone. Hence, Theorem 1 applies, and ℳ ⪯ Holmα. In view of Remark 2.2, this implies inequalities between respective thresholds of the two procedures.

The following example shows that the condition that ℳ is a step-down procedure cannot be removed from Theorem 1.

Example 3.1

Let ℳ be a procedure that, given a p-vector tSimpm, rejects all hypotheses, if ti ≤ α for all i, and accepts all hypotheses otherwise. This procedure, as is easy to see, is monotone and controls FWER at level α. Nevertheless, the relation ℳ ⪯ Holmα is not true: if ti = c (α/m < c < α), i = 1, 2,…,m, then ℳ rejects all hypotheses, while Holmα rejects none.
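
A short sketch (ours; the values m = 4, α = 0.05 and c = 0.03 are only an illustration) of this comparison:

    import numpy as np

    def all_or_nothing(p, alpha):
        """The procedure of Example 3.1: reject everything iff every p_i <= alpha."""
        p = np.asarray(p, dtype=float)
        return set(range(len(p))) if np.all(p <= alpha) else set()

    def holm(p, alpha):
        """Holm procedure: step-down along the sorted p-values with thresholds alpha/(m - i + 1)."""
        p = np.asarray(p, dtype=float)
        m = len(p)
        u = alpha / (m - np.arange(1, m + 1) + 1.0)
        order = np.argsort(p, kind="stable")
        exceed = np.nonzero(p[order] > u)[0]
        r = exceed[0] if exceed.size else m
        return set(order[:r].tolist())

    p = [0.03] * 4                        # alpha/m = 0.0125 < c = 0.03 < alpha = 0.05
    print(all_or_nothing(p, 0.05))        # {0, 1, 2, 3}: all four hypotheses rejected
    print(holm(p, 0.05))                  # set(): Holm rejects nothing, since 0.03 > 0.0125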

4 Generalized FWER of a step-down procedure

The concept of generalized FWER – the probability of rejecting k or more true hypotheses – was introduced by Victor (1982) and studied in the context of TSD procedures by Hommel and Hoffman (1988), Lehmann and Romano (2005a), and Gordon (2007a). Following the latter work, we use notation similar to (4) and (5): for a multiple testing procedure ℳ, a probability distribution P on the unit cube Im, and an integer k (1 ≤ k ≤ m), we set

FWERk(ℳ, P) := pr{Card(ℳ(P) ∩ T) ≥ k},

where P is a random vector with distribution P and T is the set (depending on P) of all i ∈ {1, 2,…,m} for which the hypothesis (3) is true. Furthermore, we set

FWERk(ℳ) := supP FWERk(ℳ, P).

This quantity may be interpreted as the exact level at which ℳ controls FWERk.

Obviously, FWERk(ℳ) ≤ FWER1(ℳ) ≡ FWER(ℳ) for all k (1 ≤ km). It turns out that for monotone step-down MTPs this inequality can be significantly strengthened.

Theorem 2

Let ℳ be a monotone step-down procedure such that FWER(ℳ) ≤ α (α < 1). Then for all k (1 ≤ k ≤ m)

FWERk(ℳ) ≤ Ckα, (14)

where

Ck = 4k/(k + 1)², if k is odd;  4/(k + 2), if k is even. (15)

The upper bound Ckα in (14) is sharp.

Proof

By Theorem 1, ℳ ⪯ Holmα, so that FWERk(ℳ) ≤ FWERk(Holmα). It is shown in Gordon (2007a, Example 5.1) that FWERk(Holmα) = Ckα, where Ck is given by (15). This proves (14). The upper bound Ckα cannot be replaced by a smaller one, because it is attained when ℳ = Holmα.

Remark 4.1

Note that the sharp upper bound in (14) does not depend on m (the number of hypotheses, m ≥ k). Also note that 1 = C1 = C2 > C3 > C4 > … and that Ck < 4/k, the two sides being asymptotically equivalent for large k.
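
A few lines (ours) tabulating the constants in (15) and checking the bound Ck < 4/k:

    from fractions import Fraction

    def C(k):
        """Sharp constant C_k from (15)."""
        return Fraction(4 * k, (k + 1) ** 2) if k % 2 else Fraction(4, k + 2)

    for k in range(1, 9):
        print(k, C(k), C(k) < Fraction(4, k))
    # C_1 = C_2 = 1, the constants then decrease, and every C_k stays below 4/k.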

Acknowledgements

We are grateful to Dr. Andrei Yakovlev for valuable discussions. We are also thankful to the anonymous referee whose comments and suggestions helped to improve the presentation.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

1

We call MTPs of this type p-value-only-based (see Subsection 2.1). There are other types of procedures in the literature, see, e.g., Westfall and Young (1993).

References

  1. Benjamini Y, Bretz F, Sarkar S, editors. Recent Developments in Multiple Comparison Procedures. Lecture Notes – Monograph Series, Vol. 47. Beachwood, Ohio: Institute of Mathematical Statistics; 2004.
  2. Benjamini Y, Liu W. A distribution-free multiple test procedure that controls the false discovery rate. Technical Report RP-SOR-99-3, Department of Statistics and Operations Research, Tel Aviv University; 1999.
  3. Dudoit S, Shaffer JP, Boldrick JC. Multiple hypothesis testing in microarray experiments. Statist. Science. 2003;18:71–103.
  4. Gordon AY. Explicit formulas for generalized family-wise error rates and unimprovable step-down multiple testing procedures. J. Statist. Plann. Inference. 2007a;137 (S. N. Roy Centennial Volume):3497–3512. Available online at http://www.sciencedirect.com.
  5. Gordon AY. Unimprovability of the Bonferroni procedure in the class of general step-up multiple testing procedures. Statist. Prob. Lett. 2007b;77:117–122. Available online at http://www.sciencedirect.com.
  6. Hochberg Y, Tamhane AC. Multiple Comparison Procedures. New York: Wiley; 1987.
  7. Holm S. A simple sequentially rejective multiple test procedure. Scand. J. Statist. 1979;6:65–70.
  8. Hommel G, Hoffman T. Controlled uncertainty. In: Bauer P, Hommel G, Sonnemann E, editors. Multiple Hypothesis Testing. Heidelberg: Springer; 1988. pp. 154–161.
  9. Hsu JC. Multiple Comparisons: Theory and Methods. New York: Chapman & Hall; 1996.
  10. Lehmann EL, Romano JP. Generalizations of the familywise error rate. Ann. Statist. 2005a;33:1138–1154.
  11. Lehmann EL, Romano JP. Testing Statistical Hypotheses. 3rd ed. New York: Springer; 2005b.
  12. Liu W. Multiple tests of a non-hierarchical family of hypotheses. J. Roy. Statist. Soc. Ser. B. 1996;58:455–461.
  13. Shaffer JP. Multiple hypothesis testing: A review. Annual Review of Psychology. 1995;46:561–584.
  14. Tukey JW. The problem of multiple comparisons. Unpublished manuscript; 1953. In: The Collected Works of John W. Tukey VIII. Multiple Comparisons: 1948–1983. New York: Chapman and Hall; pp. 1–300.
  15. Victor N. Exploratory data analysis and clinical research. Methods of Information in Medicine. 1982;21:53–54.
  16. Westfall PH, Young SS. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. New York: Wiley; 1993.
