Author manuscript; available in PMC: 2014 Aug 14.
Published in final edited form as: Stat Methods Med Res. 2009 Dec;18(6):543–563. doi: 10.1177/0962280209351899

Multiple testing and its applications to microarrays

Yongchao Ge 1, Stuart C Sealfon 1, Terence P Speed 2,3
PMCID: PMC4131454  NIHMSID: NIHMS610756  PMID: 20048384

Abstract

The large-scale multiple testing problems resulting from the measurement of thousands of genes in microarray experiments have received increasing interest during the past several years. This paper describes some commonly used criteria for controlling false positive errors, including familywise error rates, false discovery rates and false discovery proportion rates. Various statistical methods controlling these error rates are described. The advantages and disadvantages of these methods are discussed. These methods are applied to gene expression data from two microarray studies and the properties of these multiple testing procedures are compared.

Keywords: multiple testing, familywise error rate, false discovery rate, false discovery proportion, adjusted p-value, resampling, microarray

1 Introduction

During the past several years, researchers have increasingly used large-scale multiple testing procedures for microarray data analysis involving thousands of genes. Various statistical procedures to address the multiple testing problems associated with microarray studies have been reported. Microarray assays generate mRNA transcript concentration estimates for thousands of genes simultaneously.1–4 Some reviews of microarray data analysis can be found in the Nature Genetics Supplement,5 Speed,6 and Parmigiani et al.7 One common goal in conducting microarray experiments is to identify genes that are differentially expressed, i.e., genes whose expression values are associated with a response or covariate of interest. The response could be censored survival time or another clinical outcome; the covariates could be either categorical (e.g. treatment/control status, cell type) or continuous (e.g. dose of a drug, time of growth). A full treatment of multiple testing problems for different types of microarray data is found in Dudoit et al.8 The present paper focuses on the comparison of gene expression between two groups, as seen in the two microarray datasets9,10 used for illustration.

The expression values for m genes and n samples from a microarray experiment are typically represented by an m × n matrix X. Here m is typically several thousand or tens of thousands, and n is usually less than one hundred due to the cost of microarray experiments and the limited availability of biological material. We assume that the first n1 samples and the remaining n2 = n − n1 samples form two groups. For each gene i, we can formulate the null hypothesis that the gene is non-differentially expressed,

  • Hi: The mean expression values of gene i are the same between the two groups.

The null hypothesis Hi can be tested by a two-sample Welch t-statistic,11

$$ t_i = \frac{\bar{x}_{i2} - \bar{x}_{i1}}{\sqrt{\dfrac{s_{i2}^2}{n_2} + \dfrac{s_{i1}^2}{n_1}}}, $$

where x̄i1 and x̄i2 denote the sample average expression values of gene i in the two groups, respectively; si1² and si2² are the corresponding sample variances.

The greater ti is in absolute value, the stronger the evidence that gene i is differentially expressed. The p-values pi, i = 1, …, m, can be defined as

$$ p_i = P(|T_i| \ge |t_i| \mid H_i). $$

This probability can be computed by a resampling method (permutation or bootstrap) or from a Student t-distribution table with an appropriate number of degrees of freedom. When the Student t-distribution is applied to microarray data with small sample sizes, one needs to be comfortable with the assumption that the xij are normally distributed. In either case, the p-values obtained can be used to rank the genes, and the top-ranking genes can be examined. Whether a gene is truly differentially expressed can be confirmed by carrying out further experiments such as quantitative real-time PCR. However, one may want a quantitative assessment of the evidence concerning the possibly differentially expressed genes, in order to avoid following up genes that have little prospect of being truly differentially expressed. To address this need, one can carry out a simultaneous test of the null hypothesis, for each gene, that it is non-differentially expressed. In any such testing situation, two types of errors can occur: a false positive, or type I error, is committed when a gene is falsely declared to be differentially expressed, and a false negative, or type II error, is committed when the test fails to identify a truly differentially expressed gene. The numbers of type I and type II errors are denoted by V and D respectively in Table 1. Since a typical microarray experiment measures the expression of thousands of genes simultaneously, one is faced with an extreme multiple testing problem. Special problems arising from this multiplicity include defining an appropriate type I error rate, and devising powerful multiple testing procedures that control this error rate while incorporating the joint distribution of the test statistics T1, …, Tm.
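As a concrete illustration (a minimal sketch, not the authors' code; the matrix shape, group sizes and permutation count below are hypothetical), the Welch t-statistics and two-sided permutation p-values for an m × n expression matrix can be computed as follows:

```python
import numpy as np

def welch_t(X, n1):
    """Two-sample Welch t-statistic for every gene (row) of X;
    columns 0..n1-1 are group 1, the remaining columns are group 2."""
    g1, g2 = X[:, :n1], X[:, n1:]
    num = g2.mean(axis=1) - g1.mean(axis=1)
    den = np.sqrt(g2.var(axis=1, ddof=1) / g2.shape[1]
                  + g1.var(axis=1, ddof=1) / g1.shape[1])
    return num / den

def permutation_pvalues(X, n1, B=1000, seed=0):
    """Two-sided permutation p-values p_i = P(|T_i| >= |t_i| | H_i),
    estimated from B random permutations of the sample labels."""
    rng = np.random.default_rng(seed)
    t_obs = np.abs(welch_t(X, n1))
    exceed = np.zeros(X.shape[0])
    for _ in range(B):
        Xb = X[:, rng.permutation(X.shape[1])]  # scramble group labels
        exceed += np.abs(welch_t(Xb, n1)) >= t_obs
    return exceed / B
```

Each gene's p-value here is computed from its own marginal permutation distribution; Section 4 shows how resampling can instead be used to adjust for the multiplicity across genes.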

Table 1.

Summary table for multiple testing problems, based on Table 1 in Benjamini and Hochberg (1995).13

                          accept   reject   total
true null hypotheses        U        V       m0
false null hypotheses       D        S       m1
total                       W        R       m

This paper covers three commonly used criteria for defining type I error rates: the familywise error rates,12 the false discovery rates,13 and the false discovery proportion rates.14 Section 2 introduces the basic notions of multiple testing and gives the definitions of these three criteria. Section 3 reviews simple procedures that do not require resampling. Section 4 presents resampling-based procedures. The multiple testing procedures of Sections 3 and 4 are applied to the gene expression datasets from two published microarray studies described in Section 5. Finally, Section 6 summarizes our findings and outlines open questions.

2 Multiple testing

2.1 Set-up

We will adopt the notation of Romano and Wolf15 to describe a multiple testing problem. We first assume that the gene expression data matrix X belongs to a certain family of multivariate distributions 𝒫 = {Pθ, θ ∈ Θ}, and hypothesis Hi can be viewed as a subset ωi of Θ. Here Θ can be a nonparametric, parametric, or semiparametric model. For the microarray datasets of this paper, we can define the set for hypothesis Hi,

$$ \omega_i = \{\theta : \mu_{i1}(P_\theta) = \mu_{i2}(P_\theta)\}, $$

where μi1(Pθ) and μi2(Pθ) are the mean expression values of gene i for the two groups, respectively. Let the full set of hypothesis indices be M = {1, …, m}. For any subset K ⊆ M, define the intersection null hypothesis HK = ∩i∈K Hi to be

$$ H_K : \theta \in \bigcap_{i \in K} \omega_i. $$

Define the set of truly non-differentially expressed genes to be

$$ M_0 = M_0(\theta) = \{i : \theta \in \omega_i\}. $$

The complement of M0 is denoted by M1 = M1(θ) = {i : θ ∉ ωi}. Therefore, the full set M is decomposed into M0(θ) and M1(θ), whose cardinalities are denoted by m0 and m1, respectively, as summarized in Table 1. A multiple testing procedure can be represented by a decision function

$$ \delta = \delta(X) = (\delta_1(X), \ldots, \delta_m(X)), \qquad (1) $$

where δi(X) = 1 indicates that gene i is claimed to be differentially expressed (rejection), and δi(X) is equal to zero otherwise. For example, a common cutoff value c for all genes’ t-statistics has the decision function δ(X)(c) = (I(|T1| ≥ c), I(|T2| ≥ c), …, I(|Tm| ≥ c)), where I(·) is the indicator function. The cutoff value c can be gene-specific, and/or a data-adaptive random variable. One may also use the p-values (P1, …, Pm) to construct a decision function δ(·). For any given multiple testing procedure, the decision function δ(X) has a multivariate probability distribution associated with θ. In the rest of this paper, many functions and variables depend on the observed microarray data X and on the unknown parameter θ, both of which are omitted for simplicity unless we want to emphasize the dependency. For example, δ(X) will be abbreviated to δ, and M0(θ) to M0. For a given δ, the total number of rejected hypotheses is

$$ R = \sum_{i=1}^{m} \delta_i. $$

The number of type I errors (false positives or falsely claimed differentially expressed genes) is

$$ V = \sum_{i \in M_0(\theta)} \delta_i, \qquad (2) $$

and the number of type II errors is

$$ D = \sum_{i \in M_1(\theta)} (1 - \delta_i). $$

U, S and W in Table 1 can be defined similarly. We introduce a random variable Q, the false discovery proportion,13,14

$$ Q = \mathrm{FDP} = \frac{V}{R} \cdot I(R > 0). $$

2.2 Different criteria to control type I errors

V and D are the two types of errors we are concerned about. Only a few papers have addressed the problem of type II errors; for example, Genovese and Wasserman introduce the false non-discovery rate16 and Taylor et al. propose the ‘miss’ rate.17 The former needs the assumption that the test statistics T1, …, Tm are samples from a mixture of two distributions, and the latter gives a conservative estimate of the ‘miss’ rate. At the time of this writing, we are not aware of a general solution for controlling type II errors for dependent data. This paper focuses on controlling type I errors, as in most of the current literature and as traditionally done in single hypothesis testing. In multiple testing problems, the various ways of defining type I error rates are as follows:

  • Per-comparison error rate (PCER), the ratio of the expected number of type I errors to the number of hypotheses, i.e.,
    PCER = E(V)/m.
  • Per-family error rate (PFER). Not really a rate, the PFER is defined as the expected number of type I errors, i.e.,
    PFER = E(V).
  • Familywise error rate (FWER), the probability of committing at least k type I errors,14,18 i.e.,
    FWERk = P(V ≥ k).

    The interesting values of k tend to be much smaller than m. In particular, FWER1 is exactly the classically defined familywise error rate (FWER). The well-known Bonferroni correction controls the FWER for almost all types of datasets.

  • False discovery rate (FDR). The expectation of the false discovery proportion.13
    FDR = E(Q) = E(V/R | R > 0) · P(R > 0).

    When m0 = m, we have FDR=FWER.

  • False discovery proportion rate (FDPR), the probability that the false discovery proportion is at least γ,
    FDPRγ = P(Q ≥ γ).

    If γ is restricted to the interval (0, 1/m), then FDPRγ is equal to FWER1.

  • Proportion of false positives (PFP), the ratio of the expected false positives to the expected positives,19 i.e.,
    PFP=E(V)/E(R).

    This is another way to characterize the ratio V/R.

  • Positive false discovery rate (pFDR). If interest is in estimating an error rate when positive findings have occurred, then the pFDR proposed by Storey20 is appropriate. It is defined as the conditional expectation of the proportion of type I errors among the rejected hypotheses, given that at least one hypothesis is rejected,
    pFDR = E(V/R | R > 0).

    Storey20 gives a Bayesian interpretation of this definition. In a microarray data study involving thousands of genes, the probability that R > 0 is almost equal to one, and so the FDR and the pFDR are nearly equivalent.

Table 2 summarizes the error rates for controlling the number of false positives: PFER and FWERk; and the error rates for controlling the false discovery proportion: FDR and FDPRγ. The above four error rates give the expectations and the exceeding probabilities for V and Q. PFER is equivalent to PCER, which is the same as single hypothesis testing. The PFP and the pFDR are two different ways besides the FDR to describe the expectation of the false discovery proportion Q. The FWERk provides an overall sense of the number of false positives for confirmation studies; and the FDPRγ is useful to explore the data in finding interesting genes for further experiments.

Table 2.

Comparisons between the error rates for V and Q

random variable                   expectation    exceeding probability
number of false positives V       PFER = E(V)    FWERk = P(V ≥ k)
false discovery proportion Q      FDR = E(Q)     FDPRγ = P(Q ≥ γ)
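The distinction in Table 2 between expectations and exceeding probabilities of V and Q can be made concrete with a toy Monte Carlo sketch (the uniform nulls, beta-distributed alternatives, and all numerical settings below are hypothetical choices for illustration only, not from the paper):

```python
import numpy as np

def simulate_error_rates(m=100, m0=90, c=0.05, gamma=0.1, k=1,
                         n_sim=2000, seed=0):
    """Monte Carlo estimates of FWER_k, FDR and FDPR_gamma for the
    fixed-cutoff rule 'reject when p <= c', with m0 known by design."""
    rng = np.random.default_rng(seed)
    fwer_k = fdr = fdpr = 0.0
    for _ in range(n_sim):
        p_null = rng.uniform(size=m0)             # null p-values ~ U[0,1]
        p_alt = rng.beta(0.1, 1.0, size=m - m0)   # alternatives near zero
        V = np.sum(p_null <= c)                   # false positives
        R = V + np.sum(p_alt <= c)                # total rejections
        Q = V / R if R > 0 else 0.0               # FDP = (V/R) I(R > 0)
        fwer_k += (V >= k)                        # contributes to P(V >= k)
        fdr += Q                                  # contributes to E(Q)
        fdpr += (Q >= gamma)                      # contributes to P(Q >= gamma)
    return fwer_k / n_sim, fdr / n_sim, fdpr / n_sim
```

With 90 true nulls and an uncorrected cutoff of 0.05, P(V ≥ 1) is close to one, which illustrates why unadjusted per-gene testing is untenable at microarray scale.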

2.3 Weak control, exact control and strong control of a multiple testing procedure

A multiple testing procedure, which can be specified by nested decision functions δα (·) as in equation (1), is designed to ensure that one of the error rates defined in Section 2.2 is no greater than α, for 0 < α < 1. For example, the Bonferroni procedure can be specified by the decision function

$$ \delta_i^{\alpha} = I(p_i \le \alpha/m). $$

Here pi is the p-value for testing hypothesis Hi, and pi can be viewed as a statistic of the microarray data matrix X. It is important to note that the expectations and probabilities in the definitions of these type I error rates are determined by the specification of θ, or the probability distribution Pθ. It is always helpful to have some sense of the type I error rates under the complete null hypothesis HM (m0 = m or M0 = M), i.e., when no genes are differentially expressed. Control of an error rate under HM is called weak control. For the FDPRγ, weak control means

$$ \mathrm{FDPR}_\gamma(H_M) = \max_{\theta \in H_M} P_\theta(Q \ge \gamma) \le \alpha. $$

The definition above depends upon HM and the corresponding decision function δ of the multiple testing procedure. Weak control is a minimal requirement for a multiple testing procedure. However, under the complete null HM, we have V = R, so

$$ \mathrm{pFDR}(H_M) = \mathrm{PFP}(H_M) = 1. $$

Thus weak control of the pFDR and the PFP is impossible. Furthermore, under the complete null HM, we have Q = I(V ≥ 1) implied by V = R, so

$$ \mathrm{FWER}(H_M) = \mathrm{FDR}(H_M) = \mathrm{FDPR}_\gamma(H_M) = \max_{\theta \in H_M} P_\theta(V \ge 1), $$

where γ can be any value in the interval (0, 1).

Weak control is a useful first step in finding out whether there are any interesting genes at all in the microarray data, but it is not satisfactory if we expect that some genes are differentially expressed, which is the case for most microarray datasets. We should be concerned with error control under the set M0 of true null hypotheses. This is called exact control, i.e.,

$$ \mathrm{FDPR}_\gamma(H_{M_0}) = \max_{\theta \in H_{M_0}} P_\theta(Q \ge \gamma) \le \alpha. $$

As M0 is unknown beforehand, and finding a plausible value of M0 is the goal of the microarray experiment, we need strong control of the error rate to safeguard against all possible situations, i.e.,

$$ \mathrm{FDPR}_\gamma(\Theta) = \max_{\theta \in \Theta} P_\theta(Q \ge \gamma) \le \alpha. $$

2.4 Estimating and controlling type I error rates

In developing a multiple testing procedure, the first step is to estimate the type I error rate under some idealized situation. For example, for a fixed rejection region [0, c] of p-values, let Vc and Rc be the numbers of false positives and rejections, respectively. A direct computation of the FDR is impossible without knowing the exact value of θ, but the false discovery rate for the rejection region [0, c] can be intuitively estimated by mc/Rc, as shown in the following,

$$ \mathrm{FDR}_c = E\left(\frac{V_c}{R_c} \cdot I(R_c > 0)\right) \approx \frac{E(V_c)}{R_c} = \frac{m_0 \cdot c}{R_c} \le \frac{m \cdot c}{R_c}. $$

In particular, when c is the i-th smallest p-value p(i), the estimated FDR is

$$ \widehat{\mathrm{FDR}}(p_{(i)}) = \frac{m \cdot p_{(i)}}{i}. \qquad (3) $$

This point estimate is relatively simple, but computing its mean and variance is more challenging, especially when we do not want to impose strong assumptions on the data. The difficulty arises because the probability distribution of the FDR estimate depends upon Rc, which relies on the characteristics of the non-null hypotheses. For results on the expectation of the FDR estimate under the assumption that the null p-values are independently distributed as U[0,1], readers are referred to Theorem 1 of Storey, Taylor and Siegmund.21
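In code, the point estimate of equation (3) is a one-liner over the sorted p-values (a sketch; the capping at 1 is a common convention, not part of equation (3)):

```python
import numpy as np

def fdr_estimates(pvalues):
    """Point estimate m * p_(i) / i of the FDR at each sorted p-value,
    as in equation (3), capped at 1."""
    p = np.sort(np.asarray(pvalues, dtype=float))
    i = np.arange(1, p.size + 1)
    return np.minimum(p.size * p / i, 1.0)
```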

A point estimate of a type I error rate can be used to suggest a multiple testing procedure: reject a hypothesis when its corresponding estimate is no greater than the level α. For this suggested procedure to be valid, attention must be paid to the proof of strong control. For example, the procedure suggested by equation (3) is equivalent to the BH procedure.13,22 Benjamini and Yekutieli23 proved that the BH procedure provides strong control of the FDR when the joint distribution of the p-values (p1, p2, ···, pm) satisfies positive regression dependency on the subset M0 (PRDS),23 a much weaker assumption than independence of the null p-values. Statistics T satisfy PRDS if, for any increasing set D and for each i in M0, P(T ∈ D | Ti = t) is nondecreasing in t. A set D is said to be increasing if x ∈ D and yi ≥ xi, i = 1, ···, m, imply y ∈ D. Here T, x and y are m-dimensional vectors, with i-th components denoted by Ti, xi and yi, respectively.

When a new algorithm for a type I error rate comes out, readers should pay attention to the distinction between point estimation and strong control, and should also note the assumptions required for strong control. For example, the popular software SAM24 provides a good procedure for estimating the FDR, but proofs of strong control of the FDR by SAM are currently only available when the p-values are independent or when certain asymptotic arguments are invoked.21

3 Procedures controlling type I error rates

3.1 Assumptions

We will focus on the frequentist approach to multiple testing. There have been several explorations of multiple testing with empirical Bayes ideas; interested readers are referred to the papers by Efron and his collaborators.25,26 In this paper, we assume that the marginal distribution of the test statistic Ti, or of the p-value Pi, under the true null hypothesis can be specified. For example, the null p-values can be assumed to be uniformly distributed on the interval [0,1] for most applications. The null distribution of Ti can be estimated by resampling (permutation or bootstrap). This is the weakest assumption used for microarray data. The following lists four commonly used assumptions.

  • BAS assumption: Each null p-value has a marginal distribution as U[0,1]. It allows possible joint dependence structure among the p-values.

  • SEP assumption: In addition to the BAS assumption, it further assumes that the joint distribution of null p-values is independent of the joint distribution of non-null p-values, i.e.,
    P(Pi ≤ xi, i ∈ M0; Pj ≤ yj, j ∈ M1) = P(Pi ≤ xi, i ∈ M0) · P(Pj ≤ yj, j ∈ M1).
  • IND assumption: In addition to the SEP assumption, it further assumes that the null p-values (Pi, iM0) are independently distributed.

  • MIX assumption: A stronger assumption than IND, it further assumes that the non-null p-values are independently identically distributed as G(·). The function G(·) may be restricted to be no less than the distribution function for U[0,1] and may further be assumed to be a convex function over the interval [0,1].

The SEP assumption puts no restrictions on the dependence structure of the null p-values (Pi, i ∈ M0), nor on the joint structure of the non-null p-values (Pi, i ∈ M1). It is a much weaker assumption than IND. We tend not to use the strongest assumption, MIX, as we rarely know the distribution of the non-null test statistics, which depends on how the non-null parameters depart from the null. Furthermore, differentially expressed genes rarely behave similarly, which conflicts with the homogeneity assumption on the non-null hypotheses imposed by MIX.

For the microarray study, the IND assumption means that all non-differentially expressed genes are acting independently of each other; the SEP assumption means that the non-differentially expressed genes do not interfere with differentially expressed genes. The three assumptions (BAS, SEP, and IND) can be relaxed by applying probability inequalities in their respective proof of strong control of type I error rates. For example, the BAS assumption for most procedures can be relaxed to require that

$$ P(P_i \le x) \le x \quad \text{for } x \in [0,1]. $$

The SEP assumption for the LR-FDPi and Holm-FDR procedures in Table 3 can be relaxed to require that

Table 3.

Summary table for procedures controlling various type I error rates

error rate   procedure         critical value ci (a)            step direction   assumption
PCER                           α                                single-step      BAS
FWER         Bonferroni (b)    α/m                              single-step      BAS
             Holm (c)          α/(m − i + 1)                    step-down        BAS
             Šidák (d)         1 − (1 − α)^{1/(m−i+1)}          step-down        IND
FWERk        LR-FWER (e)       α · a_{i,k}                      step-down        BAS
             RS-FWER (f)       α · a_{i,k}/D1(k, m)             step-up          BAS
FDR          BH (g)            α · i/m                          step-up          IND
             BY (h)            α · (i/m)/Cm                     step-up          BAS
             Holm-FDR (i,j)    min{mα/(m − i + 1)², 1}          step-down        SEP
FDPRγ        LR-FDPi (e)       α · b_{i,γ}                      step-down        SEP
             LR-FDPii (e)      α · b_{i,γ}/C_{⌊γm⌋+1}           step-down        BAS
             RS-FDPi (i)       α · b_{i,γ}/D3(γ, m)             step-up          BAS
             RS-FDPii (f)      α · b_{i,γ}/D2(γ, m)             step-down        BAS
(a) a_{i,k} = k/m if i ≤ k, and a_{i,k} = k/(m − i + k) if i > k; b_{i,γ} = (⌊iγ⌋ + 1)/(m − i + ⌊iγ⌋ + 1). Here ⌊x⌋ is the greatest integer ≤ x and ⌈x⌉ is the smallest integer ≥ x; Ci = Σ_{j=1}^{i} 1/j; D1, D2 and D3 are defined in the Appendix.

(b) Bonferroni (1936),27 improved by the Holm procedure.

(c) Holm (1979)28

(d) Šidák (1967)29

(e) Lehmann and Romano (2005).18 LR-FDPi is improved by RS-FDPii; LR-FWER at k = 1 is equivalent to Holm.

(f) Romano and Shaikh (2006a)30

(g) Benjamini and Hochberg (1995)13

(h) Benjamini and Yekutieli (2001)23

(i) Romano and Shaikh (2006b)31

(j) Ge et al. (2007)32

$$ P(P_i \le x \mid P_j, j \in M_1) \le x \quad \text{for } i \in M_0,\ x \in [0,1]. $$

The IND assumption for the BH procedure can be relaxed to positive regression dependency (PRDS), as mentioned at the end of Section 2.4. Such weaker assumptions are still difficult to verify for a specific microarray dataset, and there is no clear consensus on which assumption is appropriate. We can always use procedures that rely only on the BAS assumption, but this approach may lose power to detect interesting genes, as seen in the datasets of Section 5.

3.2 Single-step, step-down, step-up

The Bonferroni27 and Holm28 procedures are two classic procedures that control the FWER. The Bonferroni procedure rejects Hi when pi ≤ α/m. It is a single-step procedure, as the rejection of a particular hypothesis Hi does not depend on the p-values of the other hypotheses. By contrast, Holm is a step-down procedure. With the p-values ordered so that p(1) ≤ ··· ≤ p(m), a general description of step-down procedures is the following.

Step-down procedure

For non-decreasing critical values 0 ≤ c1 ≤ ··· ≤ cm ≤ 1, let r be the greatest index satisfying p(1)c1, …, p(r)cr. Reject the hypotheses H(1), …, H(r) if r exists; otherwise, i.e., p(1) > c1, reject no hypothesis.

A step-down procedure begins with the smallest p-value p(1) (most significant) and continues rejecting hypotheses as long as the p-value is no greater than its corresponding critical value. Holm is a step-down procedure with the critical values

$$ c_i^{\mathrm{Holm}} = \alpha/(m - i + 1). $$

The classical procedure controlling the FDR is BH,13 a step-up procedure that gives strong control under the PRDS condition,23 a weaker assumption than IND. A step-up procedure is described as follows.

Step-up procedure

For non-decreasing critical values 0 ≤ c1 ≤ ··· ≤ cm ≤ 1, let r be the greatest index satisfying p(r) ≤ cr. Reject the hypotheses H(1), …, H(r) if r exists; otherwise, i.e., p(i) > ci for all i, reject no hypothesis.

A step-up procedure begins with the greatest p-value p(m) (least significant) and continues accepting hypotheses as long as the p-value is greater than its corresponding critical value. BH is a step-up procedure with

$$ c_i^{\mathrm{BH}} = \alpha \cdot i/m. $$

In Table 3, we list the critical values of some simple procedures that provide strong control of type I error rates under various assumptions.
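The generic step-down and step-up schemes above can be sketched as follows (illustrative code, not from the paper; `holm` and `bh` simply plug the Table 3 critical values into the generic procedures):

```python
import numpy as np

def step_down(pvalues, crit):
    """Reject H_(1),...,H_(r), where r is the largest index such that
    p_(j) <= c_j for all j <= r; reject nothing if p_(1) > c_1."""
    order = np.argsort(pvalues)
    ok = np.asarray(pvalues)[order] <= crit
    r = ok.size if ok.all() else int(np.argmin(ok))  # stop at first failure
    reject = np.zeros(len(pvalues), dtype=bool)
    reject[order[:r]] = True
    return reject

def step_up(pvalues, crit):
    """Reject H_(1),...,H_(r), where r is the largest index with
    p_(r) <= c_r; reject nothing if no such index exists."""
    order = np.argsort(pvalues)
    below = np.nonzero(np.asarray(pvalues)[order] <= crit)[0]
    r = int(below[-1]) + 1 if below.size else 0
    reject = np.zeros(len(pvalues), dtype=bool)
    reject[order[:r]] = True
    return reject

def holm(pvalues, alpha):
    """Holm: step-down with critical values alpha / (m - i + 1)."""
    m = len(pvalues)
    return step_down(pvalues, alpha / (m - np.arange(1, m + 1) + 1))

def bh(pvalues, alpha):
    """BH: step-up with critical values alpha * i / m."""
    m = len(pvalues)
    return step_up(pvalues, alpha * np.arange(1, m + 1) / m)
```

On the same p-values, the step-up BH procedure typically rejects at least as many hypotheses as the step-down Holm procedure, reflecting the weaker FDR criterion.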

4 Resampling-based methods

For microarray datasets involving thousands of genes, one may apply a procedure that requires a strong assumption (a modification of the assumptions IND or SEP), which is difficult to verify for a specific microarray dataset. If one applies procedures with the weakest assumption, BAS, then there may not be enough power to detect differentially expressed genes, as shown in Section 5.

A more fruitful approach to the multiple testing problem is to use data-adaptive cutoff values obtained from the actual microarray data by resampling the test statistics. It is beyond this paper’s scope to review resampling algorithms here; readers are referred to the papers8,33–35 for more details. We will introduce two simplified resampling algorithms in this paper. One is the k-maxT procedure to control the FWERk; k-maxT is a single-step procedure, and is a modification of maxT.12 This procedure is the same as Algorithm A* proposed by Korn et al.14 The other is the maxZ procedure36,37 to control the false discovery proportion for independent data. We will use resampling to generalize the maxZ procedure to dependent data in controlling the FDPRγ.

4.1 Permutation algorithm for the k-maxT procedure to control the FWERk

For resampling procedures to control the FWERk, it is usually more convenient to compute the adjusted p-values,12 since it is computationally intensive to compute the cutoff value for each specified α. After computing the adjusted p-values, we reject any hypothesis whose adjusted p-value is no greater than α. The adjusted p-values give us the flexibility to explore the data without having to recompute the cutoff values ci all over again. The idea of the k-maxT procedure is to estimate the FWERk under the complete null HM for any interesting rejection region (−∞, −|ti|] ∪ [|ti|, +∞) of t-statistics, i.e., the k-maxT adjusted p-value is defined as

$$ \tilde{p}_i = P\left(k\text{-}\max_{1 \le j \le m} |T_j| \ge |t_i| \,\middle|\, H_M\right), $$

where k-max1≤j≤m |Tj| denotes the k-th greatest of |T1|, ···, |Tm|. In order for these adjusted p-values to provide strong control of the FWERk, we require that HM gives the least favorable configuration in computing the adjusted p-value, i.e., for any x ≥ 0,

$$ P\left(k\text{-}\max_{1 \le j \le m} |T_j| \ge x \,\middle|\, H_M\right) \ge P_\theta\left(k\text{-}\max_{j \in M_0(\theta)} |T_j| \ge x\right) \quad \text{for any } \theta \in \Theta. \qquad (4) $$

This condition is expected to hold in many situations, as the left-hand side is based on the k-th greatest of m variables, while the right-hand side is based on the k-th greatest of a subset of those m variables. One should realize that the inequality is not true in general, as the two sides are based on two different probability-generating mechanisms. The condition in equation (4) is weaker than the monotonicity assumption required by Romano and Wolf.15 Their assumption is satisfied in many situations, as shown in their Theorem 1. For the problem of finding differentially expressed genes in microarray data, the inequality holds because the computation of the test statistic for each gene does not involve any other gene. Therefore the adjusted p-values computed under the complete null HM guarantee strong control of the FWERk. The details of the k-maxT procedure are described in Algorithm 1.

4.2 Permutation algorithm for the maxZ procedure to control the FDPRγ

The idea of the maxZ procedure is to simulate the distribution of maxc≥0 Zc under the complete null HM, where Zc is a normalized Vc,

$$ Z_c = \frac{V_c - \mu_c}{\sigma_c}. \qquad (5) $$

Here Vc = Vc(HM) = Σj∈M I(|Tj| ≥ c) is the number of false positives for a cutoff value c of the test statistics; μc and σc are the mean and standard deviation of Vc(HM), respectively. Let χ1−α be the 1 − α quantile of the probability distribution of maxc≥0 Zc | HM. A simultaneous upper prediction bound36 for the false positives Vc can be estimated by

$$ B_c^{1-\alpha} = \mu_c + \sigma_c \cdot \chi_{1-\alpha}, $$

in the sense that

$$ P\left(V_c \le B_c^{1-\alpha} \ \text{for all } c \ge 0 \,\middle|\, H_M\right) \ge 1 - \alpha. $$

In order to generalize the simultaneous bound to any θ ∈ Θ, i.e.,

$$ P_\theta\left(V_c \le B_c^{1-\alpha} \ \text{for all } c \ge 0\right) \ge 1 - \alpha, $$

we need the assumption that HM gives the least favorable configurations in computing the 1 − α quantile of maxc≥0 Zc, i.e., for any x ≥ 0,

$$ P\left(\max_{c \ge 0} Z_c \ge x \,\middle|\, H_M\right) \ge P_\theta\left(\max_{c \ge 0} Z_c \ge x\right) \quad \text{for any } \theta \in \Theta. \qquad (6) $$

This condition is expected to hold in many situations, as Zc = (Vc(HM) − μc)/σc on the left-hand side is always no smaller than Zc = (Vc(θ) − μc)/σc on the right-hand side (note that μc and σc are the same constants on both sides, but Vc depends on θ; see equation (2)). As with the condition required for the k-maxT procedure in equation (4), one must keep in mind that the two sides of the inequality are computed under two different probability-generating mechanisms.

Algorithm 1.

Permutation algorithm for single-step k-maxT to control the FWERk

For the original data matrix X, compute the test statistics t1, …, tm.
For the b-th permutation, b = 1, …, B:
  1. Permute the n columns of the data matrix X.

  2. Compute the test statistics t1^b, …, tm^b for each hypothesis.

  3. Compute |t^b|(k), the k-th greatest of |t1^b|, …, |tm^b|.

The above steps are repeated B times, and the adjusted p-values are estimated by

$$ \tilde{p}_i = \frac{\#\{b : |t^b|_{(k)} \ge |t_i|\}}{B} \quad \text{for } i = 1, \ldots, m. $$
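Algorithm 1 can be sketched in a few lines (illustrative only; the Welch statistic and the numerical settings are placeholder choices, and a production version would enumerate or stratify permutations as the study design requires):

```python
import numpy as np

def welch_t(X, n1):
    """Two-sample Welch t-statistic for each row of X."""
    g1, g2 = X[:, :n1], X[:, n1:]
    return (g2.mean(axis=1) - g1.mean(axis=1)) / np.sqrt(
        g2.var(axis=1, ddof=1) / g2.shape[1]
        + g1.var(axis=1, ddof=1) / g1.shape[1])

def k_maxt_adjusted_pvalues(X, n1, k=1, B=1000, seed=0):
    """Single-step k-maxT adjusted p-values of Algorithm 1:
    p~_i = #{b : |t^b|_(k) >= |t_i|} / B."""
    rng = np.random.default_rng(seed)
    t_obs = np.abs(welch_t(X, n1))
    count = np.zeros(X.shape[0])
    for _ in range(B):
        Xb = X[:, rng.permutation(X.shape[1])]  # step 1: permute columns
        tb = np.abs(welch_t(Xb, n1))            # step 2: recompute statistics
        kth = np.sort(tb)[-k]                   # step 3: k-th greatest |t|
        count += kth >= t_obs
    return count / B
```

Rejecting every gene whose adjusted p-value is at most α then controls the FWERk under the condition of equation (4).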

Applying Meinshausen’s technique,38 an improved bound of Vc can be obtained by

$$ V_c^{1-\alpha} = R_c - \max_{\tau \ge c}\left(R_\tau - B_\tau^{1-\alpha}\right), $$

which is smaller than or equal to Bc^{1−α}. Here Rc is the total number of rejections, equal to Σ_{j=1}^{m} I(|Tj| ≥ c). We can then obtain the simultaneous upper prediction bound for the false discovery proportion Q,

$$ Q_c^{1-\alpha} = \frac{V_c^{1-\alpha}}{\max(R_c, 1)}, $$

satisfying

$$ P_\theta\left(Q_c \le Q_c^{1-\alpha} \ \text{for all } c \ge 0\right) \ge 1 - \alpha \quad \text{for any } \theta \in \Theta. $$

Qc^{1−α} is also called a false discovery proportion confidence envelope.39 We can then reject any hypothesis whose test statistic is no less than ĉ, where

$$ \hat{c} = \min\{c \ge 0 : Q_c^{1-\alpha} \le \gamma\}. $$

The computational details of the maxZ procedure are described in Algorithm 2. This rejection procedure gives a strong control of the FDPRγ under the assumption specified by equation (6).

Remarks

  1. We use the mean and standard deviation to normalize Vc.36,37 In fact, we can apply any nondecreasing function Fc(·) to normalize Vc, i.e., Zc is redefined as Fc(Vc) and Bc^{1−α} as Fc^{−1}(χ1−α). Meinshausen38 takes Fc to be an empirical distribution of Vc, which is analogous to the quantile normalization used in microarray data analysis.40 Different choices of the function Fc may affect the power of the multiple testing procedure.

  2. Noting that m0 = maxc≥0 Vc, we can obtain a conservative 1 − α upper confidence bound for m0, namely maxc≥0 Vc^{1−α}; and m − maxc≥0 Vc^{1−α} is a conservative 1 − α lower confidence bound for m1.
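The heart of the maxZ procedure can be sketched as follows (a simplified, hypothetical helper rather than the full Algorithm 2: it takes a B × m array of statistics simulated under H_M, e.g. from permutations, and a user-chosen grid of cutoffs):

```python
import numpy as np

def maxz_bounds(null_stats, cutoffs, alpha=0.05):
    """Given |T| values simulated under the complete null H_M
    (null_stats, shape (B, m)), return (mu_c, sigma_c, B_c), where
    B_c = mu_c + sigma_c * chi_{1-alpha} is the simultaneous upper
    prediction bound for the false positives V_c."""
    # V_c for each simulated data set and each cutoff c
    Vc = np.array([(null_stats >= c).sum(axis=1) for c in cutoffs]).T
    mu = Vc.mean(axis=0)
    sigma = Vc.std(axis=0, ddof=1)
    sigma = np.where(sigma > 0, sigma, 1.0)      # guard degenerate cutoffs
    Z = (Vc - mu) / sigma                        # equation (5), per simulation
    chi = np.quantile(Z.max(axis=1), 1 - alpha)  # quantile of max_c Z_c
    return mu, sigma, mu + sigma * chi           # B_c^{1-alpha}
```

From these bounds one can form the improved bound of Vc and the FDP confidence envelope exactly as in the text, and then reject all genes with statistic at least ĉ.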

5 Microarray applications

5.1 Data sets

Apo AI knock-out experiment

The Apo AI experiment9 was carried out as part of a study of lipid metabolism and atherosclerosis susceptibility in mice. Apolipoprotein AI (Apo AI) is a gene known to play a pivotal role in HDL metabolism, and mice with the Apo AI gene knocked out have very low HDL cholesterol levels. The goal of the experiment was to identify genes with altered expression in the livers of these knock-out mice compared with inbred control mice. The treatment group consisted of eight mice with the Apo AI gene knocked out, and the control group consisted of eight wild-type C57Bl/6 mice. For the 16 microarray slides, the target cDNA came from the liver mRNA of the 16 mice, and the reference cDNA came from the pooled liver mRNA of the control mice. Among the 6 356 cDNA probes, about 200 genes were related to lipid metabolism. In the end, we obtained a 6 356 × 16 matrix with 8 columns from the controls and 8 columns from the treatments. Differentially expressed genes between the treatments and controls were identified by two-sample Welch t-statistics.

Leukemia study

One goal of Golub et al.10 was to identify genes that are differentially expressed in patients with two types of leukemias, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Gene expression levels were measured using Affymetrix high-density oligonucleotide arrays containing m = 6 817 human genes. The learning set comprises n = 38 samples, 27 ALL cases and 11 AML cases. The data were preprocessed and filtered as described in Ge et al.33 and the processed data (available at the Bioconductor package multtest) were summarized by a 3 051×38 matrix. Differentially expressed genes between ALL and AML patients were identified by two sample Welch t-statistics.

5.2 Results

Previous analyses of these two datasets, especially regarding the FWER and the FDR, can be found in Ge et al.33 To keep the results brief, we only consider k = 1 for the FWERk (or FWER) procedures and γ = 0.05 for the FDPRγ procedures. In total, we have 11 procedures for multiple testing: Holm, Šidák, RS-FWER, and k-maxT to control the FWERk; BH, BY, and Holm-FDR to control the FDR; and LR-FDPi, RS-FDPi, RS-FDPii and maxZ to control the FDPRγ. The p-values required for the simple procedures listed in Table 3 are obtained by permutations, where the Apo AI knock-out experiment takes all B = C(16, 8) = 12 870 distinct permutations and the leukemia dataset takes 100 000 random permutations. The resampling algorithms of Section 4 use exactly the same permutations as the p-value computations.

For the Apo AI dataset, the k-maxT and maxZ procedures reject 8 and 6 genes, respectively, at α = 0.05; none of the methods listed in Section 3 is able to reject any gene. This shows the extra power obtained by incorporating the dependence among the test statistics. Even though the methods in Section 3 are easy to implement, their critical values are unlikely to be improved upon.30 The loss of power of these methods is due to the critical values ci being too general.

The results for the leukemia dataset are plotted in Panels A, B and C of Figure 1. The resampling-based procedures (k-maxT and maxZ) are more powerful than the simple methods of Table 3. Among these simple methods, as expected, a method that requires stronger assumptions tends to reject more genes at the same level α. Two simple methods stand out: LR-FDPi and BH. LR-FDPi is similar to maxZ in controlling the FDPRγ; BH is comparable to the maxZ procedure when the latter controls the median FDP (see Panel D and further discussion in the next paragraph).

Figure 1.

Different multiple testing procedures for the leukemia dataset. Each point of a curve in Panels A, B and C gives the minimum type I error rate (FWERk, FDR or FDPRγ) achieved by the specific multiple testing procedure when the top 1, 2, … genes are rejected. The curve labelled p-val gives the permutation-based p-values. In Panel D, the maxZ procedure gives the simultaneous upper prediction bounds of the FDP at levels 1 − α = 0.50, 0.90, 0.95 and 0.99 when different numbers of genes are rejected; the BH and BY procedures are also plotted in Panel D for comparison with the FDP bound at 1 − α = 0.50. All results are based on the same 100 000 random permutations.

In general, we observe that FWER procedures reject fewer genes than FDR procedures, owing to the strong criterion specified by the FWER. A direct comparison of the FDR and FDPRγ is inappropriate because of the different interpretations of α. From the maxZ procedure, we obtained simultaneous upper prediction bounds of the FDP at several levels 1 − α (see Panel D of Figure 1); the level 1 − α = 0.5 gives an upper bound on the median FDP. Recall that the FDR controls the expectation of the FDP, so we can use the median FDP to represent the performance of maxZ in comparison with the FDR procedures. The maxZ procedure is much more powerful than the BY procedure, and more or less the same as the BH procedure, which requires a stronger assumption (PRDS) than the maxZ procedure. This again shows the power of incorporating the dependence within the data.

For the Apo AI dataset, the maxZ procedure at level 1 − α = 0.5 rejects 8 genes at γ = 0.05, the same genes the k-maxT procedure rejected at α = 0.05. These 8 genes have been verified to be related to the knocked-out gene. For the leukemia dataset, many more genes are expected to be differentially expressed; in fact, the maxZ procedure gives 661 genes as a 99% lower confidence bound for m1 for the leukemia dataset (6 genes for the Apo AI dataset).

6 Discussion

For a large dataset such as a microarray dataset, we have seen the extra power achieved by resampling methods such as k-maxT and maxZ, which exploit the correlation structure present in the data. The usefulness of the simple procedures of Section 3 seems very limited for a large dataset, even with the recent theoretical progress in this area led by Romano and Shaikh.30 Simple methods such as the BH and RS-FDPi procedures are useful when the test statistics satisfy the IND or SEP assumptions, but these assumptions are difficult to verify. Owen41 shows that the dependence among the hypothesis tests greatly affects the variance of the number of false positives, and may affect the validity of multiple testing procedures that require the IND assumption.

The FDR has advantages over the more stringent FWER criterion in finding interesting genes. The generalized version FWERk faces the additional difficulty of picking an appropriate k when exploring the data. We have implemented a resampling algorithm for each error rate listed in Table 2 other than the FDR. Existing resampling algorithms for the FDR, such as that of Yekutieli and Benjamini,42 still require the SEP assumption. The difficulty of developing a resampling algorithm for the FDR lies in the unknown probability distribution specified by the non-null hypotheses. The maxZ procedure gets around this problem by estimating the simultaneous upper prediction bound of Vc, which depends only on the true null hypotheses. Until a resampling algorithm for the FDR under general dependence is developed, one practical solution is to use maxZ to control the median FDP, rather than the FDR (the expectation of the FDP).

The maxZ procedure gives investigators much more insight into the data than merely controlling the median FDP. It generates simultaneous bounds on the FDP at different levels 1 − α, as shown in Panel D of Figure 1, and it also produces a confidence bound for m0. Another advantage of maxZ is that it can be applied to most data as long as the complete null HM gives the least favorable configuration for computing the distribution of maxc≥0 Zc. Even this assumption is unnecessary as long as one can find a distribution that gives a 1 − α quantile χ1−α such that

$$\min_{\theta\in\Theta} P_\theta\left(\max_{c\ge 0} Z_c \le \chi_{1-\alpha}\right) \ge 1-\alpha.$$

The closer the left-hand side is to 1 − α, the more power is achieved to detect differentially expressed genes. The disadvantages of the maxZ procedure are that it is computationally expensive and that it may require special handling for a new dataset. Different strategies for generating the null distribution of maxc≥0 Zc or for estimating the quantile χ1−α may influence the validity and the power of the analysis.

In designing multiple testing procedures, especially resampling-based ones, we need to balance the power, the running time and the assumptions required for providing strong control. k-maxT and maxZ achieve a good balance among these. The implementation details are relatively straightforward, and the running time is only a couple of minutes on a regular desktop computer for both datasets in this paper. The assumptions required for both procedures are minimal. The step-down version of k-maxT is not proposed as a general solution, as it requires a stronger assumption and the power gained from the step-down is negligible for a large dataset such as a microarray dataset. Meinshausen's procedure38 is not proposed here as it requires much more computation and may suffer from the discreteness of p-values, as seen in the minP procedure.33

Some recent work on the local false discovery rate and on sample size for multiple testing either requires a Bayesian interpretation of the data or requires the MIX (or IND) assumption; interested readers can consult the papers43,44 for details. A general solution for computing the sample size in multiple testing problems with general dependent data appears very challenging, as does developing methods to control type II error rates.

The aim of multiple testing is to draw a cutoff value for claiming which genes are differentially expressed. This is useful when not many genes are expected to be differentially expressed; it is probably not interesting to know exactly where the cutoff is drawn when many genes are expected to be differentially expressed. Rather than focusing on how to decide the cutoff value for the t-statistics or p-values, more attention should be paid to designing a powerful statistic, or a good ranking of the genes, which is very useful when investigators use microarrays as a screening tool to select the top candidate genes for further investigation.

Acknowledgments

We thank Xiaochun Li and Chi-Hong Tseng for their valuable comments on the manuscript. We thank the editors for helpful comments that have led to an improved paper. This work was supported by NIH grants U19 AI 62623, RO1 DK46943 and contract HHSN 26600500021C.

Appendix: The definitions of D1, D2, D3 in Table 3

There are no closed forms for D1, D2 and D3; the computational details are as follows.

  1. For m0 = k, k + 1, …, m, define
     $$S_1(k,m,m_0) = 1 + m_0 \sum_{1 \le j \le m_0-k} \frac{k}{k+j}\left(\frac{1}{m_0-j} - \frac{1}{m_0-j+1}\right),$$
     and let
     $$D_1 = D_1(k,m) = \max_{k \le m_0 \le m} S_1(k,m,m_0).$$

    The above definition of D1 is specific to the sequence αi = ai,k, where ai,k is defined in Table 3. For the general definition, readers are referred to Romano and Shaikh.31

  2. We will use g(j) to denote [γj] + 1. Define a sequence 0 ≤ α1 ≤ ··· ≤ αm ≤ 1, where αi is bγ,i as defined in Table 3. Let
     $$S_2(\gamma,m,m_0) = m_0\alpha_1 + m_0 \sum_{\substack{0 \le k < m-1 \\ m_0 \ge g(m-k)}} \frac{\alpha_{m-k} - \alpha_{m-k-1}}{\max(m_0-k,\, g(m-k))}$$
     and define
     $$D_2 = D_2(\gamma,m) = \max_{1 \le m_0 \le m} S_2(\gamma,m,m_0).$$
  3. Define β0 = 0,
     $$\beta_i = \frac{i}{\max\{m+i-\lceil i/\gamma\rceil+1,\, m_0\}}, \quad i = 1, \dots, [\gamma m],$$
     and $\beta_{[\gamma m]+1} = ([\gamma m]+1)/m_0$. Let
     $$N = \min\left\{[\gamma m]+1,\; m_0,\; \left[\gamma\left(\frac{m-m_0}{1-\gamma}+1\right)\right]+1\right\},$$
     and set
     $$S_3(\gamma,m,m_0) = m_0 \sum_{i=1}^{N} \frac{\beta_i - \beta_{i-1}}{i}.$$
     Finally,
     $$D_3 = D_3(\gamma,m) = \max_{1 \le m_0 \le m} S_3(\gamma,m,m_0).$$
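As a numerical check, D1 can be evaluated exactly with rational arithmetic. This sketch assumes the reading of S1 given in item 1, with the sum taken over j = 1, …, m0 − k; it is illustrative only:

```python
from fractions import Fraction

def S1(k, m, m0):
    """1 + m0 * sum_{j=1}^{m0-k} [k/(k+j)] * [1/(m0-j) - 1/(m0-j+1)];
    the empty sum (m0 = k) gives S1 = 1.  (m enters only via the range of m0.)"""
    total = Fraction(1)
    for j in range(1, m0 - k + 1):
        total += m0 * Fraction(k, k + j) * (Fraction(1, m0 - j) - Fraction(1, m0 - j + 1))
    return total

def D1(k, m):
    """D1(k, m) = max over m0 in {k, ..., m} of S1(k, m, m0)."""
    return max(S1(k, m, m0) for m0 in range(k, m + 1))
```

For example, with k = 1 one finds S1(1, m, m0) increasing in m0, so the maximum is attained at m0 = m.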

Algorithm 2.

Permutation algorithm for the maxZ procedure to control the FDPRγ

Define a very dense grid of K critical values 0 ≤ c1 < ··· < cK < ∞. Compute the test statistics of the data matrix X, and denote the ordered statistics by |t(1)| ≥ ··· ≥ |t(m)|.

1. Permute the n columns of the data matrix X B times. Compute the test statistics $t_1^b, \dots, t_m^b$ for each hypothesis, b = 1, …, B.
2. For k = 1, …, K, compute the number of rejections with the cutoff value ck,
   $$v_k^b = \sum_{i=1}^{m} I(|t_i^b| \ge c_k),$$
   and the mean and standard deviation of $(v_k^1, \dots, v_k^B)$,
   $$\mu_k = \frac{1}{B}\sum_{b=1}^{B} v_k^b, \qquad \sigma_k = \sqrt{\frac{1}{B}\sum_{b=1}^{B}\left(v_k^b - \mu_k\right)^2}.$$
3. The B samples of $\max_{c\ge 0} Z_c \mid H_M$ are obtained by
   $$z_{(1)}^b = \max_{k=1,\dots,K} \frac{v_k^b - \mu_k}{\sigma_k} \quad \text{for } b = 1, \dots, B.$$
4. For i = 1, …, m, compute the mean $\mu_{(i)}$ and standard deviation $\sigma_{(i)}$ of $(v_{|t_{(i)}|}^b,\ b = 1, \dots, B)$, where $v_{|t_{(i)}|}^b = \sum_{j=1}^{m} I(|t_j^b| \ge |t_{(i)}|)$.
5. Estimate the 1 − α quantile $\chi_{1-\alpha}$ from the B samples $(z_{(1)}^1, \dots, z_{(1)}^B)$. The simultaneous upper confidence bound of the false discovery proportion for the rejection region $(-\infty, -|t_{(i)}|] \cup [|t_{(i)}|, +\infty)$ is computed by
   $$\bar{B}_{(i)} = \mu_{(i)} + \sigma_{(i)}\,\chi_{1-\alpha}, \qquad \bar{V}_{(i)} = i - \max\Bigl(0,\; \max_{j=1,\dots,i}\bigl(j - \bar{B}_{(j)}\bigr)\Bigr), \qquad \bar{Q}_{(i)} = \bar{V}_{(i)}/i.$$
6. For a specific γ, find the maximum index i such that $\bar{Q}_{(i)} \le \gamma$ and reject the i hypotheses with the greatest t-statistics in absolute value. Use $\bar{V}_{(m)}$ (or $m - \bar{V}_{(m)}$) as an estimate of the 1 − α upper confidence bound for m0 (or the 1 − α lower confidence bound for m1).
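The steps of Algorithm 2 can be sketched in code as follows. This is our illustrative rendering, not the authors' implementation: the helper `welch_t`, the function `maxz_fdp_bounds`, the two-group layout (first n1 columns forming one group) and the ad hoc grid of critical values are all assumptions made for the sketch.

```python
import numpy as np

def welch_t(X, g1, g2):
    """Two-sample Welch t-statistics, one per row of X."""
    a, b = X[:, g1], X[:, g2]
    return (a.mean(1) - b.mean(1)) / np.sqrt(
        a.var(1, ddof=1) / len(g1) + b.var(1, ddof=1) / len(g2))

def maxz_fdp_bounds(X, n1, B=1000, K=200, alpha=0.05, seed=0):
    """Sketch of Algorithm 2: simultaneous 1 - alpha upper bounds Q(i) on the
    FDP when the i genes with the largest |t| are rejected."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    cols = np.arange(n)
    t_obs = np.abs(welch_t(X, cols[:n1], cols[n1:]))
    t_sorted = -np.sort(-t_obs)                 # |t_(1)| >= ... >= |t_(m)|

    grid = np.linspace(0.0, t_obs.max(), K)     # dense grid of critical values
    V = np.empty((B, K))                        # v_k^b: rejections at grid cutoffs
    Vt = np.empty((B, m))                       # counts at observed cutoffs |t_(i)|
    for b in range(B):                          # step 1: permute the columns
        perm = rng.permutation(n)
        tb = np.abs(welch_t(X, perm[:n1], perm[n1:]))
        V[b] = (tb[:, None] >= grid[None, :]).sum(0)       # step 2
        Vt[b] = (tb[:, None] >= t_sorted[None, :]).sum(0)  # step 4 counts
    mu, sd = V.mean(0), V.std(0)
    sd[sd == 0] = np.inf                        # cutoffs where v_k^b is constant
    z = ((V - mu) / sd).max(1)                  # step 3: samples of max_c Z_c
    chi = np.quantile(z, 1 - alpha)             # step 5: 1 - alpha quantile
    Bi = Vt.mean(0) + Vt.std(0) * chi           # bound on false positives
    i = np.arange(1, m + 1)
    Vbar = i - np.maximum(0, np.maximum.accumulate(i - Bi))
    return np.clip(Vbar, 0, None) / i           # Q(i) for i = 1, ..., m
```

For a target γ, one would then reject the largest i genes with Q(i) ≤ γ, as in step 6.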

Footnotes

*

The number of type II errors was denoted by T by Benjamini and Hochberg.13 We replace it by D to avoid confusion with the test statistic T in this paper.

We assume that the parameter space for HM consists of a single point for the sake of simplicity; otherwise, we take the parameter from HM that gives the least favorable configuration in computing the probabilities.

References

  1. DeRisi J, Penland L, Brown PO, Bittner ML, Meltzer PS, Ray M, et al. Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nature Genetics. 1996;14:457–60. doi: 10.1038/ng1296-457.
  2. Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995;270:467–70. doi: 10.1126/science.270.5235.467.
  3. Lockhart DJ, Dong HL, Byrne MC, Follettie MT, Gallo MV, Chee MS, et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotechnology. 1996;14:1675–80. doi: 10.1038/nbt1296-1675.
  4. Wodicka L, Dong H, Mittmann M, Ho MH, Lockhart DJ. Genome-wide expression monitoring in Saccharomyces cerevisiae. Nature Biotechnology. 1997;15(13):1359–67. doi: 10.1038/nbt1297-1359.
  5. Nature Genetics Editors, editor. The Chipping Forecast II. Nature Genetics Supplement. 2003;32:461–552.
  6. Speed TP, editor. Statistical analysis of gene expression microarray data. Boca Raton: Chapman & Hall/CRC; 2003.
  7. Parmigiani G, Garrett ES, Irizarry RA, Zeger SL, editors. The Analysis of Gene Expression Data. New York: Springer; 2003.
  8. Dudoit S, van der Laan MJ, Pollard KS. Multiple testing. Part I. Single-step procedures for control of general type I error rates. Statistical Applications in Genetics and Molecular Biology. 2004;3(1):Article 13. doi: 10.2202/1544-6115.1040.
  9. Callow MJ, Dudoit S, Gong EL, Speed TP, Rubin EM. Microarray expression profiling identifies genes with altered expression in HDL deficient mice. Genome Research. 2000;10(12):2022–29. doi: 10.1101/gr.10.12.2022.
  10. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–7. doi: 10.1126/science.286.5439.531.
  11. Welch BL. The significance of the difference between two means when the population variances are unequal. Biometrika. 1938;29:350–62.
  12. Westfall PH, Young SS. Resampling-based multiple testing: Examples and methods for p-value adjustment. John Wiley & Sons; 1993.
  13. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B. 1995;57:289–300.
  14. Korn EL, Troendle JF, McShane LM, Simon R. Controlling the number of false discoveries: application to high dimensional genomic data. Journal of Statistical Planning and Inference. 2004;124:379–98.
  15. Romano JP, Wolf M. Exact and approximate stepdown methods for multiple hypothesis testing. Journal of the American Statistical Association. 2005;100(469):94–108.
  16. Genovese C, Wasserman L. Operating characteristics and extensions of the false discovery rate procedure. J R Statist Soc B. 2002;64:499–517.
  17. Taylor J, Tibshirani R, Efron B. The ‘miss rate’ for the analysis of gene expression data. Biostatistics. 2005;6(1):111–7. doi: 10.1093/biostatistics/kxh021.
  18. Lehmann EL, Romano JP. Generalization of the familywise error rate. The Annals of Statistics. 2005;33:1138–54.
  19. Fernando RL, Nettleton D, Southey BR, Dekkers JCM, Rothschild MF, Soller M. Controlling the proportion of false positives in multiple dependent tests. Genetics. 2004;166:611–9. doi: 10.1534/genetics.166.1.611.
  20. Storey JD. A direct approach to false discovery rates. J R Stat Soc B. 2002;64:479–98.
  21. Storey JD, Taylor JE, Siegmund D. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J R Stat Soc B. 2004;66:187–205.
  22. Seeger P. A note on a method for the analysis of significance en masse. Technometrics. 1968;10(3):586–93.
  23. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple hypothesis testing under dependency. The Annals of Statistics. 2001;29(4):1165–88.
  24. Tusher V, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci. 2001;98:5116–21. doi: 10.1073/pnas.091062498.
  25. Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association. 2001;96(456):1151–60.
  26. Efron B. Microarrays, Empirical Bayes, and the two-groups model. Department of Statistics, Stanford University; 2006. http://www-stat.stanford.edu/~brad/papers.
  27. Bonferroni CE. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze. 1936;8:3–62.
  28. Holm S. A simple sequentially rejective multiple test procedure. Scand J Statist. 1979;6:65–70.
  29. Šidák Z. Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association. 1967;62:626–33.
  30. Romano JP, Shaikh AM. Stepup procedures for control of generalizations of the familywise error rate. Annals of Statistics. 2006a;34(4):1850–73.
  31. Romano JP, Shaikh AM. On stepdown control of the false discovery proportion. In: IMS Lecture Notes–Monograph Series: 2nd Lehmann Symposium – Optimality; 2006b. pp. 33–50.
  32. Ge Y, Sealfon SC, Tseng CH, Speed TP. A Holm-type procedure controlling the false discovery rate. Statistics and Probability Letters. 2007;77:1756–62.
  33. Ge Y, Dudoit S, Speed TP. Resampling-based multiple testing for microarray data analysis. Test. 2003;12(1):1–44. With discussion 44–77.
  34. Dudoit S, Shaffer JP, Boldrick JC. Multiple hypothesis testing in microarray experiments. Statistical Science. 2003;18(1):71–103.
  35. van der Laan MJ, Dudoit S, Pollard KS. Multiple testing. Part II. Step-down procedures for control of the family-wise error rate. Stat Appl Genet Mol Biol. 2004;3(1):Article 14. doi: 10.2202/1544-6115.1041.
  36. Ge Y, Li X, Sealfon SC, Speed TP. An upper prediction bound for the false discovery proportion. In: Proceedings of the American Statistical Association, Statistical Computing Section [CD-ROM]; 2005. pp. 2093–100.
  37. Meinshausen N, Rice J. Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. The Annals of Statistics. 2006;34(1):373–93.
  38. Meinshausen N. False discovery control for multiple tests of association under general dependence. Scandinavian Journal of Statistics. 2006;33(2):227–37.
  39. Genovese C, Wasserman L. A stochastic process approach to false discovery control. The Annals of Statistics. 2004;32:1035–61.
  40. Bolstad BM, Irizarry RA, Astrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–93. doi: 10.1093/bioinformatics/19.2.185.
  41. Owen AB. Variance of the number of false discoveries. Journal of the Royal Statistical Society, Series B. 2005;67(3):411–26.
  42. Yekutieli D, Benjamini Y. Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. Journal of Statistical Planning and Inference. 1999;82:171–96.
  43. Efron B. Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. Journal of the American Statistical Association. 2004;99(465):96–104.
  44. Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A. False discovery rate and sample size for microarray studies. Bioinformatics. 2005;21:3017–24. doi: 10.1093/bioinformatics/bti448.
