Directional false discovery rate control in large-scale multiple comparisons

Wenjuan Liang; Dongdong Xiang; Yajun Mei; Wendong Li

doi:10.1080/02664763.2024.2344260

2024 May 25;51(15):3195–3214. doi: 10.1080/02664763.2024.2344260


PMCID: PMC11536645 PMID: 39507211

Abstract

The advance of high-throughput biomedical technology makes it possible to access massive measurements of gene expression levels. An important statistical issue is identifying both under-expressed and over-expressed genes for a disease. Most existing multiple-testing procedures focus on selecting only the non-null or significant genes without further identifying their expression type. Only limited methods are designed for the directional problem, and yet they fail to separately control the numbers of falsely discovered over-expressed and under-expressed genes with only a unified index combining all the false discoveries. In this paper, based on a three-classification multiple testing framework, we propose a practical data-driven procedure to control separately the two directions of false discoveries. The proposed procedure is theoretically valid and optimal in the sense that it maximizes the expected number of true discoveries while controlling the false discovery rates for under-expressed and over-expressed genes simultaneously. The procedure allows different nominal levels for the two directions, exhibiting high flexibility in practice. Extensive numerical results and analysis of two large-scale genomic datasets show the effectiveness of our procedure.

Keywords: Gene expression, multiple testing, marginal FDR, separate control, data-driven

1. Introduction

In bioinformatics such as gene expression microarray experiments and brain image studies, it is often crucial to study each gene or brain location and decide whether it is ‘significant’ or not. In these studies, the number of genes or brain locations is very large. If each gene or brain location is associated with one hypothesis, then a large number of hypotheses are tested simultaneously, which is referred to as large-scale multiple testing in the literature. Since the pioneering work of [3] that introduced the concept of controlling the false discovery rate (FDR), there has been an increasing amount of literature on multiple testing in recent years. See, for example, [1,2,6,8,14,19,24,27] and the references therein.

Conventionally, various multiple testing methods have been proposed for dividing genes into null and non-null genes. However, in many bioinformatics applications, such an approach may be insufficient for effective gene diagnosis, and it is often necessary to make further directional decisions for the selected non-null genes. For instance, in a typical comparative genomics study with high-throughput technologies, the dataset is obtained by comparing the expression levels of genes in healthy and cancer subjects [7,12]. The goal of the study is not only to isolate significant genes related to cancer but also to distinguish between under-expressed and over-expressed genes. Another example can be found in financial markets, where monthly returns from a large number of mutual funds are routinely collected. Investors are interested in selecting skilled mutual funds with true capabilities to make profits by using the well-known alpha factor. By convention, a positive alpha indicates skilled performance, a negative alpha indicates unskilled performance, and a zero alpha indicates no fluctuations. As a result, it is important to accurately identify skilled fund managers and unskilled fund managers [28]. These examples motivate us to investigate multiple testing problems with directional alternatives.

It should be noted that conventional multiple testing methods designed for binary classification can also be applied to the considered directional problem by making directional decisions according to the sign of the test statistics. However, as pointed out by [15], the BH procedure and even some advanced FDR-control methods, such as the robust FDR method of [4], the q-value of [22,23], and the empirical Bayes estimate of [11], can result in a higher FDR in one direction when the null distribution is highly skewed. There has been some work focusing on the directional multiple testing problem [16,18,20,29]. However, these methods mainly focus on controlling the type III errors, which misclassify under-expressed genes as over-expressed or vice versa, based on a single directional FDR index. This may be inappropriate, especially when one of the two directions is more important than the other. The main reason is that even if the defined directional FDR is controlled, the separate error rate in each direction may not be well controlled. Also, such a framework only involves a unified nominal type III error, and does not allow different nominal errors for each direction separately.

In this paper, we propose a directional multiple testing approach for controlling the directional FDRs simultaneously in both directions. Based on a three-classification multiple testing framework, we first define the directional FDRs for both directions (e.g. under-expressed and over-expressed genes). Oracle and data-driven procedures are proposed to maximize the expected number of true rejections while controlling the two individual directional FDRs at nominal levels specified beforehand. Theoretically, we show that our procedure is valid and optimal for the considered directional multiple testing problem with simultaneous FDR-control. Extensive numerical results also show the effectiveness of our proposed methods for directional FDR-control, especially in cases with different nominal levels for the two directions as well as different non-null proportions and strengths.

The rest of this article is organized as follows. Section 2 formulates the directional multiple testing problem for controlling two directional FDRs simultaneously. Section 3 introduces the proposed oracle and data-driven procedures and establishes their validity and optimality for directional FDR control. Section 4 presents extensive simulation results. The analysis of two microarray datasets is given in Section 5. Section 6 concludes this article. Theoretical details are provided in the Appendix.

2. Problem formulation

Suppose $X = (X_{1}, \dots, X_{m})$ are independent and identically distributed (i.i.d.) observations from a random mixture model:

X_{i} = μ_{i} + ε_{i},

(1)

where $μ_{i}$ is the unknown parameter of interest and $ε_{i} \sim N (0, σ^{2})$ . We further assume that $μ_{i}$ s are generated from the following mixture model

\begin{matrix} μ_{i} ∣ θ_{i} \sim (1 - | θ_{i} |) δ_{0} (μ_{i}) + I (θ_{i} = 1) h_{1} (μ_{i}) + I (θ_{i} = - 1) h_{2} (μ_{i}), \\ θ_{i} \overset{i.i.d.}{\sim} Multinoulli (p_{0}, p_{1}, p_{- 1}), \quad \sum_{k = 0, 1, - 1} p_{k} = 1, \quad i = 1, \dots, m, \end{matrix}

(2)

where $θ = (θ_{1}, \dots, θ_{m}) \in \{0, 1, - 1\}^{m}$ denotes the set of states (i.e. null/over-expressed/under-expressed) of each gene with $p_{0} / p_{1} / p_{- 1}$ being the corresponding probabilities, and $h_{1} (\cdot)$ and $h_{2} (\cdot)$ are the density functions for respectively the over- and under-expressed states with supports $(0, + \infty)$ and $(- \infty, 0)$ . Also, $δ_{0} (\cdot)$ is the Dirac delta function whose value is zero everywhere except at zero, and whose integral over the entire real line is equal to one. Note that variations of model (1)–(2) have been widely applied in the field of large-scale multiple testing [5,10,19,24,25]. Conventional multiple testing problems primarily focus on the nondirectional hypotheses $H_{i, 0} : μ_{i} = 0$ vs. $H_{i, 1} : μ_{i} \neq 0$ , $i = 1, \dots, m$ , or equivalently, $H_{i, 0} : θ_{i} = 0$ vs. $H_{i, 1} : θ_{i} \neq 0, i = 1, \dots, m$ .

In this paper, in order to distinguish over-expressed genes from under-expressed ones, we consider $H_{i, 0} : μ_{i} = 0$ vs. $H_{i, 1} : μ_{i} > 0$ vs. $H_{i, - 1} : μ_{i} < 0$ , $i = 1, \dots, m$ , which is essentially equivalent to the following three-classification multiple testing problem

H_{i, 0} : θ_{i} = 0 vs . H_{i, 1} : θ_{i} = 1 vs . H_{i, - 1} : θ_{i} = - 1, i = 1, \dots, m .

(3)

Denote by $δ = (δ_{1}, \dots, δ_{m}) \in \{0, 1, - 1\}^{m}$ a set of decision rules for problem (3), where $δ_{i} = 0$, 1 or −1 implies respectively that the ith gene is claimed as null, over- or under-expressed. Based on $δ$ and $θ$ , the results of multiple testing can be summarized as in Table 1. Various indices can then be defined for error control. For instance, the total marginal false discovery rate (tmFDR) is defined as

tmFDR = \frac{E (N_{01} + N_{21} + N_{02} + N_{12})}{E (R_{1} + R_{2})} .

The quantity is equivalent to the standard marginal FDR in the binary classification problem. Similarly, the type III error rate can also be defined as $\frac{E (N_{21} + N_{12})}{E (R_{1} + R_{2})}$ .

Table 1.

Classification of tested hypotheses.

	$δ_{i} = 0$	$δ_{i} = 1$	$δ_{i} = - 1$	Total
$θ_{i} = 0$	$N_{00}$	$N_{01}$	$N_{02}$	$m_{0}$
$θ_{i} = 1$	$N_{10}$	$N_{11}$	$N_{12}$	$m_{1}$
$θ_{i} = - 1$	$N_{20}$	$N_{21}$	$N_{22}$	$m_{2}$
Total	$R_{0}$	$R_{1}$	$R_{2}$	m


In practice, we are often interested in controlling false discoveries of $H_{i, - 1}$ and $H_{i, 1}$ separately, a goal that the tmFDR and type III error fail to achieve. For instance, controlling the tmFDR can only ensure that the total error rate is controlled, and it is very possible that one of the false discoveries of $H_{i, - 1}$ and $H_{i, 1}$ is not controlled well, especially for skewed data. To solve this problem, we define respectively the marginal FDR of over- and under-expressed genes as

mFDR_{1} = \frac{E \{\sum_{i = 1}^{m} I (θ_{i} \neq 1) I (δ_{i} = 1)\}}{E \{\sum_{i = 1}^{m} I (δ_{i} = 1)\}} = \frac{E (N_{01} + N_{21})}{E (R_{1})}

and

mFDR_{- 1} = \frac{E \{\sum_{i = 1}^{m} I (θ_{i} \neq - 1) I (δ_{i} = - 1)\}}{E \{\sum_{i = 1}^{m} I (δ_{i} = - 1)\}} = \frac{E (N_{02} + N_{12})}{E (R_{2})} .

By controlling the $mFDR_{1}$ and $mFDR_{- 1}$ at satisfactory levels simultaneously, one can achieve effective error control for both $H_{i, - 1}$ and $H_{i, 1}$ .
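The counts in Table 1 and the error indices built from them can be tabulated as in the following minimal sketch. It is an illustration only: the function name `directional_error_summary` is hypothetical, and the expectations in the definitions are replaced by counts from a single replication, whereas the marginal quantities above are ratios of expectations (in a simulation, numerators and denominators would be averaged over replications before dividing).

```python
from collections import Counter


def directional_error_summary(theta, delta):
    """Plug-in versions of mFDR_1, mFDR_{-1}, tmFDR and the type III rate.

    theta, delta: equal-length sequences with entries in {0, 1, -1}
    (true state and decision). Counts from one replication stand in for
    the expectations; this is an illustrative simplification.
    """
    counts = Counter(zip(theta, delta))             # keys are (theta_i, delta_i)
    r1 = sum(1 for d in delta if d == 1)            # R_1
    r2 = sum(1 for d in delta if d == -1)           # R_2
    false_over = counts[(0, 1)] + counts[(-1, 1)]   # N_01 + N_21
    false_under = counts[(0, -1)] + counts[(1, -1)] # N_02 + N_12
    wrong_sign = counts[(-1, 1)] + counts[(1, -1)]  # N_21 + N_12

    def ratio(num, den):
        # Convention used here: an empty rejection set has error ratio 0.
        return num / den if den > 0 else 0.0

    return {
        "mFDR_1": ratio(false_over, r1),
        "mFDR_-1": ratio(false_under, r2),
        "tmFDR": ratio(false_over + false_under, r1 + r2),
        "type_III": ratio(wrong_sign, r1 + r2),
    }
```

The dictionary makes the point of this section visible: the `tmFDR` entry can be small while one of the two directional entries is large.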

To measure the power of $δ$ , we also define the total expected number of true positives (ETP) as

ETP = E \{\sum_{i = 1}^{m} [I (θ_{i} = 1) I (δ_{i} = 1) + I (θ_{i} = - 1) I (δ_{i} = - 1)]\},

which equals the expected number of tests that are correctly classified by $δ$ into each direction. Hence, our considered problem can be formulated as a constrained optimization problem:

find the δ that maximizes ETP subject to \begin{cases} mFDR_{1} \leq α_{1}, \\ mFDR_{- 1} \leq α_{- 1} \end{cases}

(4)

for given error levels $0 < α_{1}, α_{- 1} < 1$ .

Remark 2.1

Besides the proposed directional mFDR, variations of FDR can be employed to measure misclassification error. For example, the directional FDR can be defined as

$FDR_{1} = E \{\frac{N_{01} + N_{21}}{\max (R_{1}, 1)}\}, FDR_{- 1} = E \{\frac{N_{02} + N_{12}}{\max (R_{2}, 1)}\} .$

Results from [13] imply that $FDR_{1} = mFDR_{1} + O (m^{- 1 / 2})$ and $FDR_{- 1} = mFDR_{- 1} + O (m^{- 1 / 2})$ , which means that the two measures are asymptotically equivalent. Following the previous literature, the marginal FDR is used in this paper for technical convenience when obtaining optimality results.

Remark 2.2

Another advantage of problem (4) is that the different $α_{1}$ and $α_{- 1}$ enable fine control over the FDR of different directions. For example, if one direction is more important than the other, it may be necessary to choose a more stringent $α_{k}$ when classifying genes into that direction. In cases when it is unclear how to choose the $α_{k}$ s individually, it is easy to show that the optimal decision rule of problem (4) is also valid for controlling the tmFDR at level $α = max_{k} α_{k}$ .
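The tmFDR claim can be checked in one line using the notation of Table 1, assuming only that both constraints in (4) hold, i.e. $E (N_{01} + N_{21}) \leq α_{1} E (R_{1})$ and $E (N_{02} + N_{12}) \leq α_{- 1} E (R_{2})$ :

$tmFDR = \frac{E (N_{01} + N_{21}) + E (N_{02} + N_{12})}{E (R_{1}) + E (R_{2})} \leq \frac{α_{1} E (R_{1}) + α_{- 1} E (R_{2})}{E (R_{1}) + E (R_{2})} \leq \max_{k} α_{k} .$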

3. Methodology

3.1. The oracle procedure

Define $T_{k} (X_{i}) = P (θ_{i} \neq k | X_{i})$ , k = 0, 1, −1. Then under model (1)–(2), problem (4) is equivalent to maximizing

\begin{aligned} ETP & = E \{E [\sum_{i = 1}^{m} \{I (θ_{i} = 1) I (δ_{i} = 1) + I (θ_{i} = - 1) I (δ_{i} = - 1)\} ∣ X]\} \\ & = E \{\sum_{i = 1}^{m} [I (δ_{i} = 1) P (θ_{i} = 1 ∣ X_{i}) + I (δ_{i} = - 1) P (θ_{i} = - 1 ∣ X_{i})]\} \\ & = E \{\sum_{i = 1}^{m} [I (δ_{i} = 1) (1 - T_{1} (X_{i})) + I (δ_{i} = - 1) (1 - T_{- 1} (X_{i}))]\} \end{aligned}

subject to

\begin{cases} E \{\sum_{i = 1}^{m} [I (δ_{i} = 1) (T_{1} (X_{i}) - α_{1})]\} \leq 0, \\ E \{\sum_{i = 1}^{m} [I (δ_{i} = - 1) (T_{- 1} (X_{i}) - α_{- 1})]\} \leq 0. \end{cases}

By the Lagrange multiplier technique, problem (4) can be solved by minimizing the following penalized objective function

L (δ, λ) = \sum_{i = 1}^{m} \sum_{k = 1, - 1} [I (δ_{i} \neq k) (1 - T_{k} (X_{i})) + λ_{k} I (δ_{i} = k) (T_{k} (X_{i}) - α_{k})],

since any $δ$ that minimizes $L (δ, λ)$ conditionally on the observed test statistics will also minimize $E \{L (δ, λ)\}$ , where $λ = (λ_{1}, λ_{- 1})$ are the penalty parameters. Note that $λ_{1}$ and $λ_{- 1}$ can be considered as the cost of a false positive relative to a missed discovery, so that $L (δ, λ)$ can be viewed as a generalization of the loss function in the compound decision theory [24].

Given $λ_{1}, λ_{- 1} > 0$ , denote by $δ^{λ} = \{δ_{1}^{λ}, \dots, δ_{m}^{λ}\}$ the minimizer of $L (δ, λ)$ , and by

δ_{i}^{λ} = \begin{cases} k, & if R_{k, i} \leq \min (0, R_{k^{'}, i}), k, k^{'} = \pm 1, k \neq k^{'}, \\ 0, & otherwise, \end{cases}

(5)

its ith element, where $R_{k, i} = λ_{k} {T_{k} (X_{i}) - α_{k}} - {1 - T_{k} (X_{i})}$ . The following proposition describes the behavior of $δ^{λ}$ .
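Rule (5) for a single test can be sketched as follows. This is a direct transcription under illustrative naming: the function `decision_rule` and its argument names are hypothetical, and the tie $R_{1, i} = R_{- 1, i} \leq 0$ , which (5) leaves unresolved, is broken in favor of $k = 1$ as an arbitrary choice.

```python
def decision_rule(t_pos, t_neg, lam_pos, lam_neg, alpha_pos, alpha_neg):
    """Apply rule (5) to one test.

    t_pos = T_1(X_i), t_neg = T_{-1}(X_i); lam_* are the penalties
    (lambda_1, lambda_{-1}) and alpha_* the nominal levels.
    Returns 1, -1 or 0.
    """
    r_pos = lam_pos * (t_pos - alpha_pos) - (1.0 - t_pos)  # R_{1,i}
    r_neg = lam_neg * (t_neg - alpha_neg) - (1.0 - t_neg)  # R_{-1,i}
    if r_pos <= min(0.0, r_neg):
        return 1       # claimed over-expressed (ties go to k = 1 here)
    if r_neg <= min(0.0, r_pos):
        return -1      # claimed under-expressed
    return 0           # claimed null
```

Since $R_{k, i} \leq 0$ is equivalent to $T_{k} (X_{i}) \leq (α_{k} λ_{k} + 1) / (λ_{k} + 1)$ , a larger $λ_{k}$ moves the cutoff on $T_{k}$ toward $α_{k}$ and makes the rule more conservative in direction k.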

Proposition 3.1

Consider model (1)–(2) and suppose that $min_{k} p_{k} > 0$ . Then, for $δ^{λ}$ in (5):

(a)
$δ^{λ}$ minimizes $E {L (δ, λ)};$

(b)
define $N_{k} (λ) = E [I (δ_{i}^{λ} = k) {T_{k} (X_{i}) - α_{k}}]$ and
${\check{λ}}_{k, t} = \inf \{λ \leq {\check{λ}}_{k, t - 1} : N_{k} ({\check{λ}}_{k, t - 1}) \leq 0\}, k = 1, - 1,$
where $t \geq 1$ , ${\check{λ}}_{k, 0} = \infty$ , and the argument of $N_{k}$ is the vector $λ$ with $λ_{k} = λ$ and $λ_{k^{'}} = {\check{λ}}_{k^{'}, t - 1}, k^{'} \neq k .$ Suppose that $α_{1} + α_{- 1} \leq 1$ and $0 \in \{(N_{1} (λ), N_{- 1} (λ)) : λ_{1} \geq 0, λ_{- 1} \geq 0\}$ . Then the sequence $\{{\check{λ}}_{k, t}\}$ is convergent, and $N_{k} (λ^{*}) = 0$ for $k = 1, - 1,$ where $λ^{*} = (λ_{1}^{*}, λ_{- 1}^{*})$ and $λ_{k}^{*} = \lim_{t \to \infty} {\check{λ}}_{k, t} .$

Remark 3.1

The condition $\min_{k} p_{k} > 0$ in Proposition 3.1 ensures that the proportion of features in each direction is non-zero. The quantity $N_{k} (λ)$ derived from the constraint on ${mFDR}_{k}$ can be seen as a measure of the directional misclassification error left unused by $δ^{λ}$ when maximizing the number of discoveries. The condition $0 \in \{(N_{1} (λ), N_{- 1} (λ)) : λ_{1} \geq 0, λ_{- 1} \geq 0\}$ ensures that there exists some $λ$ such that ${mFDR}_{k} (δ^{λ})$ exactly attains $α_{k}$ for k = 1, −1. The condition $α_{1} + α_{- 1} \leq 1$ is mild as it allows a wide range of $α_{k}$ ; for example, it holds whenever $0 \leq α_{k} \leq 0.5$ for both k.

Now, the oracle procedure for the directional error control problem (4) can be defined. The following Theorem 3.1 gives the definition of the procedure and shows its validity and optimality for directional mFDR-control. Here, validity means that our procedure is able to control the two directional mFDRs at the desired levels, and optimality means that the ETP of our procedure is the largest among all valid procedures.

Theorem 3.1

Consider model (1)–(2) and suppose that $min_{k} p_{k} > 0$ . With $λ^{*}$ given in Proposition 3.1, define

$δ^{λ^{*}} = \{δ_{1}^{λ^{*}}, \dots, δ_{m}^{λ^{*}}\} .$

If $α_{1}$ and $α_{- 1}$ satisfy the conditions in Proposition 3.1(b), then:

(a)
${mFDR}_{k} (δ^{λ^{*}}) = α_{k}, k = 1, - 1$ ;

(b)
for any $δ$ satisfying ${mFDR}_{k} (δ) \leq α_{k}, k = 1, - 1$ ,
$ETP (δ^{λ^{*}}) \geq ETP (δ) .$

3.2. The data-driven procedure

The proposed oracle procedure cannot be applied in practice because it relies on the unknown quantities $T_{k}$ s. To solve this problem, we now develop a data-driven procedure for the three-group problem and investigate the estimation of the unknown quantities.

It is straightforward that $T_{k} (X_{i})$ can be rewritten as

\begin{aligned} T_{k} (X_{i}) & = 1 - P (θ_{i} = k | X_{i}) \\ & = 1 - \frac{g (X_{i} | θ_{i} = k) P (θ_{i} = k)}{g (X_{i})} \\ & = 1 - \frac{p_{k} g_{k} (X_{i})}{g (X_{i})}, \end{aligned}

where $g (X_{i})$ is the marginal probability density function (p.d.f.) of $X_{i}$ and $g_{k} (X_{i}) = g (X_{i} | θ_{i} = k)$ is the conditional p.d.f. of $X_{i}$ given $θ_{i} = k$ . First, the estimate of $g (X_{i})$ , denoted as $\hat{g} (X_{i})$ , can be obtained with conventional kernel-based methods [21]. For the estimation of $p_{k} g_{k} (X_{i})$ , a useful observation here is that $p_{k} g_{k} (X_{i})$ can be estimated without estimating $p_{k}$ and $g_{k} (X_{i})$ separately. Specifically, denote by

h (μ) = \frac{p_{1}}{1 - p_{0}} h_{1} (μ) + \frac{p_{- 1}}{1 - p_{0}} h_{2} (μ)

the p.d.f. of $μ_{i}$ when $θ_{i} \neq 0$ . It is then easy to check that

p_{1} g_{1} (X_{i}) = (1 - p_{0}) \int_{0}^{+ \infty} g (X_{i} | θ_{i} \neq 0, μ) h (μ) d μ

and

p_{- 1} g_{- 1} (X_{i}) = (1 - p_{0}) \int_{- \infty}^{0} g (X_{i} | θ_{i} \neq 0, μ) h (μ) d μ .

Noticing that $g (X_{i} | θ_{i} \neq 0, μ)$ is the p.d.f. of $N (μ, σ^{2})$ , once $h (μ)$ is estimated, $p_{k} g_{k} (X_{i})$ can be estimated immediately by numerical approximation. Motivated by [26], a deconvoluting kernel estimator of $h (μ)$ is given by
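The numerical approximation step can be sketched as follows, assuming a known callable `h` in place of the estimate $\hat{h}$ and computing $g$ from the model rather than by kernel smoothing. All names (`normal_pdf`, `example_h`, `weighted_alt_densities`, `t_statistics`), the trapezoid grid, and the shifted-gamma example density are illustrative assumptions, not the authors' implementation.

```python
import math


def normal_pdf(x, mean=0.0, sd=1.0):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def example_h(mu, shape=3.0, loc=1.0, scale=0.5):
    """Hypothetical symmetric non-null density: |mu| - loc ~ Gamma(shape, scale)."""
    y = abs(mu) - loc
    if y <= 0.0:
        return 0.0
    dens = y ** (shape - 1.0) * math.exp(-y / scale)
    dens /= math.gamma(shape) * scale ** shape
    return 0.5 * dens  # half of the non-null mass on each side of zero


def weighted_alt_densities(x, h, sigma=1.0, mu_max=12.0, n_grid=2400):
    """Trapezoid approximations of p_1 g_1(x)/(1-p_0) and p_{-1} g_{-1}(x)/(1-p_0)."""
    step = mu_max / n_grid
    pos = neg = 0.0
    for j in range(n_grid + 1):
        w = 0.5 if j in (0, n_grid) else 1.0
        mu = j * step
        pos += w * normal_pdf(x, mu, sigma) * h(mu)    # integral over (0, +inf)
        neg += w * normal_pdf(x, -mu, sigma) * h(-mu)  # integral over (-inf, 0)
    return pos * step, neg * step


def t_statistics(x, h, p0, sigma=1.0):
    """Return (T_1(x), T_{-1}(x)) with g(x) computed from the mixture model."""
    a_pos, a_neg = weighted_alt_densities(x, h, sigma)
    num_pos = (1.0 - p0) * a_pos   # p_1 g_1(x)
    num_neg = (1.0 - p0) * a_neg   # p_{-1} g_{-1}(x)
    g = p0 * normal_pdf(x, 0.0, sigma) + num_pos + num_neg
    return 1.0 - num_pos / g, 1.0 - num_neg / g
```

For instance, `t_statistics(4.0, example_h, p0=0.75)` returns a small first component and a second component close to one, as expected for a large positive observation.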

\hat{h} (μ) = \frac{1}{2 π (1 - {\hat{p}}_{0})} \int_{- \infty}^{+ \infty} e^{- itμ} [\hat{Ψ} (t) / Ψ_{ϵ} (t) - {\hat{p}}_{0}] Ψ_{K} (τt) d t,

where $\hat{Ψ} (t)$ is the empirical characteristic function of $X_{i}$ , $Ψ_{ϵ} (t)$ and $Ψ_{K} (t)$ are respectively the characteristic functions of the error distribution and a kernel $K (t)$ , τ is the bandwidth parameter, and ${\hat{p}}_{0}$ is the estimate of $p_{0}$ proposed by [17]. For normal errors, we choose the sinc kernel $K (t) = (πt)^{- 1} \sin t$ with $Ψ_{K} (t) = I (| t | \leq 1)$ in this paper. To select τ, the bandwidth selection method of [9] is applied to minimize the bootstrap-based approximation of the MISE of $\hat{h}$ . It should also be noted that in practical use, the standard estimator $\max \{0, \hat{h} (μ)\}$ is used instead of $\hat{h} (μ)$ .
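With the sinc kernel and normal errors, the estimator reduces to a one-dimensional integral over $| t | \leq 1 / τ$ whose imaginary part vanishes by symmetry. The sketch below evaluates it by the trapezoid rule. It is an illustration under stated assumptions: ${\hat{p}}_{0}$ and τ are passed in as given numbers (the estimator of [17] and the bandwidth selector of [9] are not implemented), $Ψ_{ϵ} (t) = e^{- σ^{2} t^{2} / 2}$ , and the name `deconv_density` is hypothetical.

```python
import math


def deconv_density(mu, xs, p0_hat, tau, sigma=1.0, n_grid=200):
    """Sinc-kernel deconvolution estimate max{0, h_hat(mu)} for N(0, sigma^2) errors.

    xs: observed X_i; p0_hat: null-proportion estimate (taken as given);
    tau: bandwidth (taken as given). Psi_K(tau * t) = I(|t| <= 1/tau).
    """
    t_max = 1.0 / tau
    step = t_max / n_grid
    m = len(xs)
    total = 0.0
    for j in range(n_grid + 1):
        t = j * step
        w = 0.5 if j in (0, n_grid) else 1.0
        # Real part of exp(-i t mu) * Psi_hat(t) is the mean of cos(t (X_j - mu)).
        ecf_real = sum(math.cos(t * (x - mu)) for x in xs) / m
        # Dividing by Psi_eps(t) multiplies by exp(sigma^2 t^2 / 2).
        integrand = ecf_real * math.exp(0.5 * (sigma * t) ** 2)
        integrand -= p0_hat * math.cos(t * mu)
        total += w * integrand
    # The real integrand is even in t, so the integral over [-t_max, t_max]
    # is twice the integral over [0, t_max].
    h_hat = 2.0 * total * step / (2.0 * math.pi * (1.0 - p0_hat))
    return max(0.0, h_hat)
```

The factor $e^{σ^{2} t^{2} / 2}$ grows quickly with $| t |$ , which is why a small τ amplifies sampling noise and why a data-driven bandwidth choice matters in practice.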

Based on the above estimates, the plug-in statistic of $T_{k} (X_{i})$ can be obtained, denoted as ${\hat{T}}_{k} (X_{i})$ . The data-driven procedure for the directional error control problem (4) can now be constructed. Similar to the oracle problem, let ${\hat{δ}}_{i}^{λ}$ be the $δ_{i}^{λ}$ defined in (5) with $T_{k}$ replaced by ${\hat{T}}_{k}$ and ${\hat{N}}_{k} (λ) = (1 / m) \sum_{i = 1}^{m} [I ({\hat{δ}}_{i}^{λ} = k) {{\hat{T}}_{k} (X_{i}) - α_{k}}]$ . Then we define

{\hat{λ}}_{k, t} = \inf \{λ \leq {\hat{λ}}_{k, t - 1} : {\hat{N}}_{k} ({\hat{λ}}_{k, t - 1}) \leq 0\}, k = 1, - 1,

where $t \geq 1$ , ${\hat{λ}}_{k, 0} = \infty$ , and the argument of ${\hat{N}}_{k}$ is the vector $λ$ with $λ_{k} = λ$ and $λ_{k^{'}} = {\hat{λ}}_{k^{'}, t - 1}, k^{'} \neq k .$ The convergence of the sequence $\{{\hat{λ}}_{k, t}, t \geq 1\}$ can be proved similarly to Proposition 3.1. Denote by ${\hat{λ}}_{k}^{*}$ the converged value. The data-driven decision rule for the directional error control problem (4) can then be defined as

{\hat{δ}}^{{\hat{λ}}^{*}} = \{{\hat{δ}}_{1}^{{\hat{λ}}^{*}}, \dots, {\hat{δ}}_{m}^{{\hat{λ}}^{*}}\},

where ${\hat{λ}}^{*} = ({\hat{λ}}_{1}^{*}, {\hat{λ}}_{- 1}^{*}) .$ An efficient algorithm is given in Algorithm 1 to obtain ${\hat{δ}}^{{\hat{λ}}^{*}}$ .
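The coordinate-wise search for ${\hat{λ}}^{*}$ described above can be sketched as follows. This is not Algorithm 1 itself but a simple bisection-based illustration under stated assumptions: the infinite starting value is replaced by a large finite cap, each infimum is approximated by bisection (relying on ${\hat{N}}_{k}$ being non-increasing in $λ_{k}$ ), and the names `decide`, `n_hat` and `search_lambda` are hypothetical.

```python
def decide(t_pos, t_neg, lam, alpha):
    """Rule (5) for one test; lam and alpha are dicts keyed by +1 and -1."""
    r_pos = lam[1] * (t_pos - alpha[1]) - (1.0 - t_pos)
    r_neg = lam[-1] * (t_neg - alpha[-1]) - (1.0 - t_neg)
    if r_pos <= min(0.0, r_neg):
        return 1
    if r_neg <= min(0.0, r_pos):
        return -1
    return 0


def n_hat(k, t_pos, t_neg, lam, alpha):
    """N_hat_k(lam): average of I(delta_i = k) * (T_hat_k(X_i) - alpha_k)."""
    total = 0.0
    for tp, tn in zip(t_pos, t_neg):
        if decide(tp, tn, lam, alpha) == k:
            total += (tp if k == 1 else tn) - alpha[k]
    return total / len(t_pos)


def search_lambda(t_pos, t_neg, alpha, cap=1e6, tol=1e-8, max_iter=50):
    """Iterate lambda_{k,t} for k = 1, -1 until both coordinates stabilize.

    cap is a finite stand-in for lambda_{k,0} = infinity (an assumption).
    """
    lam = {1: cap, -1: cap}
    for _ in range(max_iter):
        new = {}
        for k in (1, -1):
            lo, hi = 0.0, lam[k]          # search only below the previous value
            for _step in range(60):       # bisection for the smallest feasible value
                mid = 0.5 * (lo + hi)
                trial = {k: mid, -k: lam[-k]}   # other coordinate held at step t - 1
                if n_hat(k, t_pos, t_neg, trial, alpha) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            new[k] = hi
        converged = all(abs(new[k] - lam[k]) <= tol for k in (1, -1))
        lam = new
        if converged:
            break
    return lam
```

Given the returned values, the decisions are obtained by calling `decide` on each pair of estimated statistics; different entries in `alpha` give the separate nominal levels for the two directions.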


Theorem 3.2 shows that the data-driven decision rule ${\hat{δ}}^{{\hat{λ}}^{*}}$ is asymptotically valid and optimal for directional mFDR-control.

Theorem 3.2

Consider model (1)–(2) and suppose that all the conditions in Theorem 3.1 hold. Let ${\hat{p}}_{0}$ and $\hat{g}$ be the estimates of $p_{0}$ and g. Suppose that

(C1)
${\hat{p}}_{0} \leq 1$ almost surely, and $E | {\hat{p}}_{0} - p_{0} |^{2} = o (m^{- κ})$ for some $κ > 0$ .

(C2)
The non-null proportion $1 - p_{0}$ satisfies $m^{- ρ_{2}} \leq 1 - p_{0} \leq 1 - m^{- ρ_{2}}$ for some $ρ_{2} \in (0, 1)$ .

(C3)
$E ∥ \hat{g} (X_{i}) - g (X_{i}) ∥^{2} \to 0$ holds as $m \to \infty$ .

(C4)
$h (μ)$ is continuous, bounded and twice differentiable, and $\int h'' (μ) d μ$ is bounded.

(C5)
$K (t)$ is a bounded even probability density function with $\int t^{2} K (t) d t$ being bounded. Given τ, $\sup_{t} | Ψ_{K} (t) / Ψ_{ϵ} (t / τ) |$ and $\int | Ψ_{K} (t) / Ψ_{ϵ} (t / τ) | d t$ are bounded.

Then,

(a)
${mFDR}_{k} ({\hat{δ}}^{{\hat{λ}}^{*}}) = α_{k} + o (1), k = 1, - 1$ ;

(b)
$ETP ({\hat{δ}}^{{\hat{λ}}^{*}}) / ETP (δ^{λ^{*}}) = 1 + o (1) .$

Remark 3.2

Conditions (C1)–(C3) are mild and satisfied by using the aforementioned estimators. Condition (C4) is a regular one for deconvolution to ensure that $h (μ)$ is well estimated. Condition (C5) concerns the choice of the kernel function, implying that any kernel satisfying this condition can be utilized. It can be easily checked that Condition (C5) is satisfied with the sinc kernel.

4. Simulation studies

This section investigates the numerical performance of the proposed oracle and data-driven procedures. The observations are generated from m = 5000 features according to model (1)–(2). All the results are obtained based on 500 replicated simulations. For comparison purposes, the following procedures are considered:

our proposed oracle and data-driven procedures;
the Benjamini-Hochberg (BH) procedure of [3]. Note that the BH procedure is based on nondirectional hypotheses. Thus, to make it useable for the directional multiple-testing problem (3), upon rejecting $H_{i, 0}$ , we select $H_{i, - 1}$ if $X_{i} < 0$ and $H_{i, 1}$ if $X_{i} > 0$ ;
the local false discovery rates (Lfdr) procedure of [24]. Its modification for applying to the directional multiple-testing problem (3) is similar to BH;
the Bayes directional false discovery rate (BDFDR) procedure of [20]. The BDFDR procedure was designed for the directional problem, but under the assumption that the null hypothesis $H_{i, 0}$ is never true, which is often invalid in practice. Considering that the null hypothesis cannot be ignored, we redefine the posterior probabilities of negative and positive alternatives in [20] as $S_{i}^{-} = P (θ_{i} = - 1 | X_{i}) = 1 - T_{- 1} (X_{i})$ and $S_{i}^{+} = P (θ_{i} = 1 | X_{i}) = 1 - T_{1} (X_{i})$ , and then implement the BDFDR procedure.
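The directional adaptation of the BH procedure described in the list above can be sketched as follows, assuming that the $X_{i}$ are z-scores with a standard normal null so that two-sided p-values are available. The function name `bh_directional` is hypothetical.

```python
from statistics import NormalDist


def bh_directional(z, alpha):
    """BH step-up on two-sided p-values, then the sign of z as the direction.

    z: list of z-scores (assumed N(0, 1) under the null); alpha: nominal level.
    Returns decisions in {0, 1, -1}, in the original order.
    """
    m = len(z)
    std_normal = NormalDist()
    pvals = [2.0 * (1.0 - std_normal.cdf(abs(v))) for v in z]
    order = sorted(range(m), key=lambda i: pvals[i])
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            n_reject = rank          # keep the largest rank passing the step-up test
    decisions = [0] * m
    for i in order[:n_reject]:
        # A rejected statistic equal to zero has p-value 1 and is an edge case
        # only when alpha >= 1; it is mapped to -1 here by convention.
        decisions[i] = 1 if z[i] > 0 else -1
    return decisions
```

Only one nominal level enters this rule, so it cannot target different error rates in the two directions.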

The following four settings are considered:

Case I
The nominal positive and negative mFDR levels are fixed at $α_{1} = α_{- 1} = 0.1$ , the positive and negative proportions are fixed at $p_{1} = p_{- 1} = 0.125$ , and the p.d.f.s of $μ_{i}$ are chosen as $h_{1} (μ_{i}) = h_{2} (- μ_{i}) = Gam (A, 1, 0.5)$ , where $Gam (A, B, C)$ denotes the p.d.f. of the gamma distribution with shape parameter A, location parameter B and scale parameter C. The shape parameter A varies from 2 to 5.
Case II
$α_{1} = 2 α_{- 1} = 0.1$ , $p_{1} = p_{- 1} = 0.125$ and $h_{1} (μ_{i}) = h_{2} (- μ_{i}) = Gam (A, 1, 0.5)$ . The shape parameter A varies from 2 to 5.
Case III
$α_{1} = 2 α_{- 1} = 0.1$ , $p_{1} = 4 p_{- 1} = 0.2$ and $h_{1} (μ_{i}) = h_{2} (- μ_{i}) = Gam (A, 1, 0.5)$ . The shape parameter A varies from 2 to 5.
Case IV
$α_{1} = 2 α_{- 1} = 0.10$ , $h_{1} (μ_{i}) = Gam (4, 1, 0.5)$ , $h_{2} (- μ_{i}) = Gam (2, 1, 0.5)$ , $p_{1}$ varies from 0.025 to 0.225 and $p_{- 1} = 1 - p_{0} - p_{1}$ also varies.

Note that in all cases considered, the null proportion $p_{0}$ is fixed at 0.75. In Case I, the non-null signal strengths and proportions as well as the nominal mFDR levels all have symmetric structures, and at least one of them is asymmetric in Cases II–IV. The results of Cases I–IV are presented respectively in Figures 1–4. In each figure, the actual $mFDR_{1}$ , $mFDR_{- 1}$ , $tmFDR$ and $ETP$ values of the considered methods are plotted in turn. To conduct a fair comparison, in each case, after the nominal $α_{1}$ and $α_{- 1}$ are fixed in the proposed oracle and data-driven procedures, the nominal total error in the other methods is set as the induced nominal total error in our methods.
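One replication of these settings can be generated as in the sketch below. It rests on assumptions that the text does not state: the noise standard deviation is taken as $σ = 1$ , and $Gam (A, 1, 0.5)$ is read as one plus a gamma variable with shape A and scale 0.5. The function name `simulate_case` is hypothetical.

```python
import random


def simulate_case(m, p_pos, p_neg, shape_pos, shape_neg, sigma=1.0, seed=None):
    """Draw (theta, mu, x) from model (1)-(2) with shifted-gamma alternatives.

    Over-expressed: mu = 1 + Gamma(shape_pos, scale 0.5).
    Under-expressed: mu = -(1 + Gamma(shape_neg, scale 0.5)).
    sigma = 1 is an assumption; the source does not give the noise level.
    """
    rng = random.Random(seed)
    theta, mu, x = [], [], []
    for _ in range(m):
        u = rng.random()
        if u < p_pos:
            state, value = 1, 1.0 + rng.gammavariate(shape_pos, 0.5)
        elif u < p_pos + p_neg:
            state, value = -1, -(1.0 + rng.gammavariate(shape_neg, 0.5))
        else:
            state, value = 0, 0.0
        theta.append(state)
        mu.append(value)
        x.append(value + rng.gauss(0.0, sigma))
    return theta, mu, x
```

For example, `simulate_case(5000, 0.125, 0.125, 3, 3, seed=0)` corresponds to Case I with A = 3, and `simulate_case(5000, 0.2, 0.05, 3, 3)` to the proportions of Case III.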

Figure 4. Comparison of procedures for directional FDR control under Case IV.

From Figure 1, we can observe that in cases when the non-null signal strengths and proportions as well as the nominal mFDR levels all have symmetric structures, the proposed oracle and data-driven procedures, Lfdr and BDFDR procedures could all control the directional mFDR at the nominal level, while the BH procedure is conservative in the sense that its directional mFDRs are significantly smaller than the nominal level. Meanwhile, these methods perform similarly in terms of ETP, except that the BH procedure has a slightly worse performance.

Different from Case I, $α_{1}$ and $α_{- 1}$ are set to be different in Case II, which often happens in situations when the signals in one direction are more important than those in the other. The results are shown in Figure 2. From the plots, we can observe that the performance of the proposed oracle and data-driven procedures is not affected by such asymmetry in the sense that the two procedures could still control the ${mFDR}_{1}$ and ${mFDR}_{- 1}$ at their respective nominal levels. By contrast, the other considered procedures all fail to do so in the sense that their directional FDRs are all significantly larger or smaller than the nominal levels.

Figure 2. Comparison of procedures for directional FDR control under Case II.

Based on Case II, Figure 3 further presents the results of Case III with different positive and negative signal proportions. It can be observed from the plots that our proposed procedures significantly outperform the other three alternatives. Specifically, in cases when the nominal mFDR level in the direction with fewer signals is more stringent, the FDR levels in this direction of these three methods are much larger than the nominal level. Finally, Figure 4 presents the results when the two-sided signal proportions and strengths as well as the nominal mFDR levels are all different. It can be seen from the plots that our proposed oracle and data-driven procedures are still valid and optimal for directional mFDR control. The performance of the alternatives is seriously affected by the asymmetry, especially for the direction with weaker and fewer signals.

5. Applications to microarray analysis

In this section, we apply the proposed data-driven procedure to the analysis of the breast cancer and HIV microarray datasets analyzed by [10,17]. For comparison, the BH, Lfdr and BDFDR procedures in the simulation are also considered.

5.1. Analysis of breast cancer data

The breast cancer microarray data contain 15 patients diagnosed with breast cancer, among which 7 patients have the BRCA1 mutation and the other 8 have the BRCA2 mutation. The tumor of each patient was analyzed on a separate microarray, and the microarrays were reported on the same set of m = 3226 genes. For the ith gene, the two-sample t test was implemented to compare the BRCA1 responses with the BRCA2 responses and to obtain the t score, which was then transformed into a z score, denoted as $X_{i}$ .

Based on $X_{i}, i = 1, \dots, m$ , the considered procedures can be implemented. We set $α_{1} = α_{- 1} = 0.1$ for our data-driven procedure, and set the nominal total mFDR level at $α = 0.1$ for the other three procedures. The results are summarized in Table 2. From the table, it can be seen that the numbers of identified genes of our procedure and the BDFDR procedure are almost the same. Also note that all the 6 under-expressed genes from our proposed data-driven procedure are also identified by BDFDR, while all the 10 over-expressed genes from BDFDR are also identified by our data-driven procedure. Moreover, according to [17], the estimated non-null proportion is very small but nonzero (i.e. ${\hat{p}}_{0} = 0.9872$ , so $1 - {\hat{p}}_{0} = 0.0128$ ). In such cases, the Lfdr procedure reports no rejections and thus fails to identify any significant genes. Also, as claimed by [17], the Lfdr procedure reports rejections only when $α \geq 0.91.$ By contrast, the BH procedure reports over 100 significant genes in total.

Table 2.

Number of identified genes in the breast cancer data.

Method	Nominal mFDR	Over-expressed	Under-expressed
Data-driven	$α_{1} = α_{- 1} = 0.1$	11	6
Data-driven	$α_{1} = 0.08, α_{- 1} = 0.15$	3	20
BDFDR	$α = 0.1$	10	8
Lfdr	$α = 0.1$	0	0
BH	$α = 0.1$	52	55


Another advantage of the proposed procedure is that the two nominal directional mFDR levels, $α_{1}$ and $α_{- 1}$ , can be properly adjusted depending on the practical need. For instance, if breast tumors with over-expressed BRCA1 mutations compared with BRCA2 mutations are not of primary interest, i.e. under-expressed BRCA1 mutation genes are more useful for diagnosis, the nominal directional mFDR levels can be adjusted as, for example, $α_{1} = 0.08$ and $α_{- 1} = 0.15$ . The results are also shown in Table 2, from which we observe that more under-expressed genes and fewer over-expressed genes are identified in such cases. In conclusion, the above results indicate that our proposed procedure leads to reliable directional identification of differentially expressed genes.

5.2. Analysis of HIV data

The second example is the HIV microarray data. Similarly, the proposed data-driven procedure with $α_{1} = α_{- 1} = 0.05$ and the other methods with $α = 0.05$ are applied to identify under-expressed and over-expressed genes. The results are shown in Table 3. From the table, it can be observed that the proposed data-driven procedure and the BDFDR procedure report more significant genes than Lfdr and BH. Compared with BDFDR, our proposed data-driven procedure reports a few more over-expressed genes and fewer under-expressed genes. In practice, if over-expressed genes or under-expressed genes are our major concern, our data-driven procedure can be flexibly implemented by adjusting the nominal directional mFDR levels, for example, $α_{1} = 0.1, α_{- 1} = 0.01$ or $α_{1} = 0.01, α_{- 1} = 0.1$ . The corresponding results are also listed in Table 3.

Table 3.

Number of selected genes in the HIV microarray data.

Method	Nominal mFDR	Over-expressed	Under-expressed
Data-driven	$α_{1} = α_{- 1} = 0.05$	195	306
Data-driven	$α_{1} = 0.1, α_{- 1} = 0.01$	283	86
Data-driven	$α_{1} = 0.01, α_{- 1} = 0.1$	103	521
BDFDR	$α = 0.05$	177	330
Lfdr	$α = 0.05$	74	36
BH	$α = 0.05$	16	2


6. Conclusions

In this paper, we focused on the problem of directional large-scale multiple testing and proposed oracle and data-driven procedures on the basis of a three-classification multiple testing framework. The proposed procedures enable simultaneous FDR-control in both directions with possibly different nominal FDR levels, exhibiting high flexibility. We proved theoretically that the proposed procedures are valid and optimal for directional FDR-control in the sense that they maximize the ETP while controlling the two directional FDRs at their nominal levels. We ran extensive simulations and real-data analyses and showed that our procedures outperform the alternatives significantly in terms of directional FDR control.

The methods were developed under the assumption that the observations $X_{i}$ s are independent of each other. Developing directional multiple testing procedures for general dependence structures is still an open problem, which requires further research. Besides the dependence of the observations, it is assumed that the states $θ_{i}$ s are independent, which may also be invalid in practice. For instance, in microarray experiments, genes from the same biological pathway may share similar significance patterns. There has been some work to model such dependence using a hidden Markov model [25,29]. The extension to a three-state HMM should be fruitful for directional FDR-control. In addition, when domain knowledge (e.g. biological theory or prior experimental results) is available, it can be applied to weight the observed test statistics to further improve the power of the proposed procedures, which will be left for future study.

Acknowledgments

The authors want to thank the Editor, the Associate Editor, and anonymous referees for their constructive comments and suggestions that improved the quality of the paper significantly.

Appendix. Proofs.

To simplify the notation, we suppress the dependence of $T_{k} (X_{i})$ on $X_{i}$ in the proofs, and denote it as $T_{k, i}$ .

Proof of Proposition 3.1.

(a) To derive the oracle procedure that minimizes $L (δ, λ)$ , it suffices to minimize each of the terms

$\sum_{k = 1, - 1} \{I (δ_{i} \neq k) (1 - T_{k, i}) + λ_{k} I (δ_{i} = k) (T_{k, i} - α_{k})\}$

for $i = 1, \dots, m$ . It is easy to check that the minimizer is $δ_{i}^{λ}$ defined in Equation (5) for $i = 1, \dots, m$ . As a result, for any $δ \in \{- 1, 0, 1\}^{m}$ ,

$L (δ^{λ}, λ) \leq L (δ, λ),$

where $δ^{λ} = \{δ_{1}^{λ}, \dots, δ_{m}^{λ}\}$ . Taking the expectation on both sides, we have $E \{L (δ^{λ}, λ)\} \leq E \{L (δ, λ)\}$ for all $δ \in \{- 1, 0, 1\}^{m}$ .
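The per-item minimization in (a) can be checked numerically. The following is a minimal sketch, not the authors' implementation: it minimizes the displayed per-item term by brute force over the three possible decisions and compares the result with a threshold rule written in terms of the events $A_{λ_{k}}$ and $B_{λ_{k}}$ that appear in part (b). The function names, the simulated posterior probabilities, and the reading of $T_{k, i}$ as the posterior probability that $θ_{i} \neq k$ are assumptions made for illustration.

```python
import random

DIRECTIONS = (1, -1)

def brute_force_decision(T, lam, alpha):
    """Pick delta in {-1, 0, 1} minimizing the per-item Lagrangian term."""
    def cost(delta):
        return sum(
            (delta != k) * (1.0 - T[k]) + lam[k] * (delta == k) * (T[k] - alpha[k])
            for k in DIRECTIONS
        )
    return min((-1, 0, 1), key=cost)

def threshold_decision(T, lam, alpha):
    """Same decision written with the events A and B used in part (b)."""
    score = {k: lam[k] * (T[k] - alpha[k]) + T[k] for k in DIRECTIONS}
    for k in DIRECTIONS:
        in_A = T[k] <= (alpha[k] * lam[k] + 1.0) / (lam[k] + 1.0)
        in_B = score[k] < score[-k]
        if in_A and in_B:
            return k
    return 0

def random_item(rng):
    """Hypothetical item: draw posterior probabilities of the three states."""
    w = [rng.random() for _ in range(3)]
    total = sum(w)
    p_neg, _, p_pos = (x / total for x in w)
    # Assumed reading: T_k is the posterior probability that theta != k.
    return {1: 1.0 - p_pos, -1: 1.0 - p_neg}

rng = random.Random(0)
alpha = {1: 0.05, -1: 0.10}
n_items = 2000
n_agree = 0
for _ in range(n_items):
    T = random_item(rng)
    lam = {1: rng.uniform(0.0, 20.0), -1: rng.uniform(0.0, 20.0)}
    n_agree += brute_force_decision(T, lam, alpha) == threshold_decision(T, lam, alpha)
print(f"agreement on {n_agree} of {n_items} simulated items")
```

Comparing the cost of $δ_{i} = k$ with the cost of $δ_{i} = 0$ gives the event $A_{λ_{k}}$, and comparing it with the cost of the opposite direction gives $B_{λ_{k}}$, which is why the two functions are expected to agree whenever there are no exact ties.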

(b) First, we need to show that $N_{k} (λ)$ is non-increasing in $λ_{k}$ but non-decreasing in $λ_{k^{'}}, k^{'} \neq k$ . Define

$A_{λ_{k}} = \{T_{k, i} \leq \frac{α_{k} λ_{k} + 1}{λ_{k} + 1}\}$

and

$B_{λ_{k}} = \{λ_{k} (T_{k, i} - α_{k}) + T_{k, i} < \min_{k^{'} \neq k} [λ_{k^{'}} (T_{k^{'}, i} - α_{k^{'}}) + T_{k^{'}, i}]\} .$

Then we have

$N_{k} (λ) = E \{I_{A_{λ_{k}}} I_{B_{λ_{k}}} (T_{k, i} - α_{k})\} .$

Given $λ_{k}^{1} > λ_{k}^{2} > 0,$ it can be easily concluded that $A_{λ_{k}^{1}} \subseteq A_{λ_{k}^{2}}$ and $B_{λ_{k}^{1}} \subseteq B_{λ_{k}^{2}}$ . Furthermore, we have

$N_{k} (λ_{1}) - N_{k} (λ_{2}) = E \{(I_{A_{λ_{k}^{1}}} I_{B_{λ_{k}^{1}}} - I_{A_{λ_{k}^{2}}} I_{B_{λ_{k}^{2}}}) I (T_{k, i} \geq α_{k}) (T_{k, i} - α_{k})\} \leq 0,$

where $λ_{i}, i = 1, 2$ is the $λ$ with its kth component $λ_{k} = λ_{k}^{i}$ . In other words, $N_{k} (λ)$ is non-increasing in $λ_{k}$ . Similarly, we can prove that $N_{k} (λ)$ is non-decreasing in $λ_{k^{'}} .$
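The two monotonicity properties can also be observed on simulated data. The sketch below is an illustration under stated assumptions rather than part of the proof: it replaces the expectation in $N_{k} (λ)$ by a sample average, draws hypothetical posterior probabilities from a Dirichlet distribution, takes $T_{k, i}$ to be one minus the posterior probability of state k, and uses nominal levels below 0.5. The name `empirical_N` and all numerical settings are made up for the example.

```python
import numpy as np

def empirical_N(k, lam, T, alpha):
    """Sample average of I(A and B) * (T_k - alpha_k) for direction k."""
    other = -k
    score_k = lam[k] * (T[k] - alpha[k]) + T[k]
    score_other = lam[other] * (T[other] - alpha[other]) + T[other]
    in_A = T[k] <= (alpha[k] * lam[k] + 1.0) / (lam[k] + 1.0)
    in_B = score_k < score_other
    return float(np.mean((in_A & in_B) * (T[k] - alpha[k])))

rng = np.random.default_rng(0)
# Hypothetical posterior probabilities; columns are the states -1, 0, 1.
post = rng.dirichlet([1.0, 4.0, 1.0], size=5000)
T = {-1: 1.0 - post[:, 0], 1: 1.0 - post[:, 2]}
alpha = {1: 0.05, -1: 0.05}

grid = np.linspace(0.0, 50.0, 101)
# Vary lambda_1 with lambda_{-1} fixed, then lambda_{-1} with lambda_1 fixed.
N_in_own = np.array([empirical_N(1, {1: g, -1: 5.0}, T, alpha) for g in grid])
N_in_other = np.array([empirical_N(1, {1: 5.0, -1: g}, T, alpha) for g in grid])

print("largest increase along lambda_1: ", np.diff(N_in_own).max())
print("largest decrease along lambda_-1:", np.diff(N_in_other).min())
```

On such a sample, the first sequence should never increase and the second should never decrease, up to floating-point rounding, which mirrors the statement that $N_{k} (λ)$ is non-increasing in $λ_{k}$ and non-decreasing in $λ_{k^{'}}$.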

According to Lemma 1 in [30], there exists a $λ^{* *}$ such that the constructed sequences $\{{\overset{ˇ}{λ}}_{k, t}, t \geq 1\}$ satisfy the relationships ${\overset{ˇ}{λ}}_{k, 1} \geq {\overset{ˇ}{λ}}_{k, 2} \geq \dots \geq λ_{k}^{* *}$ and $N_{k} ({\overset{ˇ}{λ}}_{k, t}^{'}) = 0, k = 1, - 1, t \geq 1,$ where ${\overset{ˇ}{λ}}_{k, t}^{'}$ is the $λ$ with $λ_{k} = {\overset{ˇ}{λ}}_{k, t}$ and $λ_{k^{'}} = {\overset{ˇ}{λ}}_{k^{'}, t - 1}, k^{'} \neq k .$ Then, it follows from the monotone convergence theorem that $\{{\overset{ˇ}{λ}}_{k, t}, t \geq 1\}$ converges to a limit, denoted as $λ_{k}^{*}$ . Let ${\overset{ˇ}{λ}}_{t} = ({\overset{ˇ}{λ}}_{1, t}, {\overset{ˇ}{λ}}_{- 1, t})$ ; then we have

$N_{k} (λ^{*}) = \lim_{t \to \infty} N_{k} ({\overset{ˇ}{λ}}_{t}) = \lim_{t \to \infty} N_{k} ({\overset{ˇ}{λ}}_{k, t}^{'}) = 0, k = 1, - 1.$

Proof of Theorem 3.1.

(a) Given the fact that $X_{1}, \dots, X_{m}$ are identically distributed, for each k, $N_{k} (λ^{*}) = 0$ is equivalent to ${mFDR}_{k} (δ^{λ^{*}}) = α_{k}, k = 1, - 1$ .
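The equivalence in (a) can be spelled out in one line. The following is a sketch under two assumptions taken from the main text rather than restated in this appendix: that ${mFDR}_{k} (δ)$ is the ratio of the expected number of false discoveries in direction k to the expected number of discoveries in direction k, and that $T_{k, i} = P (θ_{i} \neq k | X_{i})$ . Conditioning on the data and using that $δ_{i}^{λ}$ is a function of the data, we get

$\begin{aligned} {mFDR}_{k} (δ^{λ}) - α_{k} & = \frac{E \{\sum_{i = 1}^{m} I (δ_{i}^{λ} = k) I (θ_{i} \neq k)\}}{E \{\sum_{i = 1}^{m} I (δ_{i}^{λ} = k)\}} - α_{k} \\ & = \frac{E \{\sum_{i = 1}^{m} I (δ_{i}^{λ} = k) (T_{k, i} - α_{k})\}}{E \{\sum_{i = 1}^{m} I (δ_{i}^{λ} = k)\}} = \frac{m N_{k} (λ)}{E \{\sum_{i = 1}^{m} I (δ_{i}^{λ} = k)\}}, \end{aligned}$

where the last step uses that the $X_{i}$'s are identically distributed. Hence, whenever the denominator is positive, ${mFDR}_{k} (δ^{λ}) = α_{k}$ holds if and only if $N_{k} (λ) = 0$ .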

(b) For any $δ$ satisfying ${mFDR}_{k} (δ) \leq α_{k}, k = 1, - 1$ , we have

$\begin{aligned} & E [\sum_{i = 1}^{m} \sum_{k = 1, - 1} \{(1 - T_{k, i}) - I (δ_{i}^{λ^{*}} = k) (1 - T_{k, i})\}] \\ & = E [\sum_{i = 1}^{m} \sum_{k = 1, - 1} \{I (δ_{i}^{λ^{*}} \neq k) (1 - T_{k, i}) + λ_{k}^{*} I (δ_{i}^{λ^{*}} = k) (T_{k, i} - α_{k})\}] \\ & = E \{L (δ^{λ^{*}}, λ^{*})\} \leq E \{L (δ, λ^{*})\} \\ & = E [\sum_{i = 1}^{m} \sum_{k = 1, - 1} \{I (δ_{i} \neq k) (1 - T_{k, i}) + λ_{k}^{*} I (δ_{i} = k) (T_{k, i} - α_{k})\}] \\ & \leq E [\sum_{i = 1}^{m} \sum_{k = 1, - 1} \{(1 - T_{k, i}) - I (δ_{i} = k) (1 - T_{k, i})\}] . \end{aligned}$

As a result, we have

$ETP (δ^{λ^{*}}) \geq ETP (δ) .$

Proof of Theorem 3.2.

(a) Define ${\tilde{N}}_{k} (λ) = \frac{1}{m} \sum_{i = 1}^{m} I (δ_{i}^{λ} = k) (T_{k, i} - α_{k})$ . By the weak law of large numbers, (a) ${\tilde{N}}_{k} ({\overset{ˇ}{λ}}_{k, t - 1}) \overset{P}{\to} N_{k} ({\overset{ˇ}{λ}}_{k, t - 1})$ holds. For each k = 1, −1, if we fix $λ_{k^{'}}, k^{'} \neq k$ , then ${\hat{N}}_{k} (λ)$ is a function of $λ_{k}$ , and its continuous version, denoted as ${\hat{N}}_{k}^{C} (λ)$ , can be constructed by linear interpolation. ${\hat{N}}_{k}^{C} (λ)$ is continuous in $λ_{k}$ and monotone. Its inverse function, denoted as ${\hat{N}}_{k}^{C, - 1} (λ)$ , is also well defined, continuous and monotone. Then (b) ${\hat{N}}_{k} ({\hat{λ}}_{k, t - 1}) - {\hat{N}}_{k}^{C} ({\hat{λ}}_{k, t - 1}) \overset{P}{\to} 0$ and (c) ${\hat{λ}}_{k, t} - {\hat{N}}_{k}^{C, - 1} ({\hat{λ}}_{k 0, t - 1}) \overset{P}{\to} 0$ hold for k = 1, −1, where ${\hat{λ}}_{k 0, t - 1}$ is the $λ$ with the kth component 0 and the other components the same as those of ${\hat{λ}}_{k, t - 1}$ . Suppose that ${\hat{λ}}_{k^{'}, t - 1} \overset{P}{\to} {\overset{ˇ}{λ}}_{k^{'}, t - 1}$ for all $k^{'} \neq k$ . Then (d) ${\hat{N}}_{k}^{C} ({\overset{ˇ}{λ}}_{k, t - 1}) - {\hat{N}}_{k}^{C} ({\hat{λ}}_{k, t - 1}) \overset{P}{\to} 0$ and (e) ${\hat{N}}_{k}^{C, - 1} ({\hat{λ}}_{k 0, t - 1}) - {\hat{N}}_{k}^{C, - 1} ({\overset{ˇ}{λ}}_{k 0, t - 1}) \overset{P}{\to} 0$ follow immediately, where ${\overset{ˇ}{λ}}_{k 0, t - 1}$ is the $λ$ with the kth component 0 and the other components the same as those of ${\overset{ˇ}{λ}}_{k, t - 1}$ .

To prove Theorem 3.2, we first establish the following two results:

(1) ${\hat{N}}_{k} ({\hat{λ}}_{k, t - 1}) \overset{P}{\to} {\tilde{N}}_{k} ({\overset{ˇ}{λ}}_{k, t - 1})$ holds for any $λ_{k} > 0$ ;

(2) ${\hat{N}}_{k}^{C, - 1} ({\overset{ˇ}{λ}}_{k 0, t - 1}) \overset{P}{\to} {\overset{ˇ}{λ}}_{k, t}$ and ${\hat{λ}}_{k, t} \overset{P}{\to} {\overset{ˇ}{λ}}_{k, t}, t \geq 1.$

To prove result (1), it suffices to prove that

$E \{I ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} = k) ({\hat{T}}_{k, i} - α_{k}) - I (δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} = k) (T_{k, i} - α_{k})\}^{2} = o (1),$

and ${\hat{N}}_{k} ({\hat{λ}}_{k, t - 1}) \overset{P}{\to} {\tilde{N}}_{k} ({\overset{ˇ}{λ}}_{k, t - 1})$ can be proved accordingly, which will be shown later. We have

$\begin{aligned} & P ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} = k, δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} \neq k) \\ & \leq P ({\hat{T}}_{k, i} \leq \frac{α_{k} λ_{k} + 1}{λ_{k} + 1}, T_{k, i} > \frac{α_{k} λ_{k} + 1}{λ_{k} + 1}) \\ & \quad + P ({\hat{T}}_{k, i} > \frac{α_{k} λ_{k} + 1}{λ_{k} + 1}, T_{k, i} \leq \frac{α_{k} λ_{k} + 1}{λ_{k} + 1}) \\ & \quad + P \{λ_{k} ({\hat{T}}_{k, i} - α_{k}) + {\hat{T}}_{k, i} \leq \min_{k^{'} \neq k} {\hat{λ}}_{k^{'}, t - 1} ({\hat{T}}_{k^{'}, i} - α_{k^{'}}) + {\hat{T}}_{k^{'}, i}, \\ & \qquad λ_{k} (T_{k, i} - α_{k}) + T_{k, i} > \min_{k^{'} \neq k} {\overset{ˇ}{λ}}_{k^{'}, t - 1} (T_{k^{'}, i} - α_{k^{'}}) + T_{k^{'}, i}\} \\ & \quad + P \{λ_{k} ({\hat{T}}_{k, i} - α_{k}) + {\hat{T}}_{k, i} > \min_{k^{'} \neq k} {\hat{λ}}_{k^{'}, t - 1} ({\hat{T}}_{k^{'}, i} - α_{k^{'}}) + {\hat{T}}_{k^{'}, i}, \\ & \qquad λ_{k} (T_{k, i} - α_{k}) + T_{k, i} \leq \min_{k^{'} \neq k} {\overset{ˇ}{λ}}_{k^{'}, t - 1} (T_{k^{'}, i} - α_{k^{'}}) + T_{k^{'}, i}\} \\ & = o (1) + o (1) = o (1) \end{aligned}$

and similarly

$P ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} \neq k, δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} = k) = o (1) .$

Then we have

$\begin{aligned} & E \{I ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} = k) ({\hat{T}}_{k, i} - α_{k}) - I (δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} = k) (T_{k, i} - α_{k})\}^{2} \\ & \leq E \{({\hat{T}}_{k, i} - T_{k, i})^{2} I ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} = k, δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} = k)\} + E \{({\hat{T}}_{k, i} - α_{k})^{2} I ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} = k, δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} \neq k)\} \\ & \quad + E \{(T_{k, i} - α_{k})^{2} I ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} \neq k, δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} = k)\} \\ & \leq E \{({\hat{T}}_{k, i} - T_{k, i})^{2}\} + P ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} = k, δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} \neq k) + P ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} \neq k, δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} = k) \\ & = o (1) + o (1) + o (1) = o (1) . \end{aligned}$

By the Cauchy–Schwarz inequality, we have

$\begin{aligned} & E \{| I ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} = k) ({\hat{T}}_{k, i} - α_{k}) - I (δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} = k) (T_{k, i} - α_{k}) |\} \\ & \leq {[E \{I ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} = k) ({\hat{T}}_{k, i} - α_{k}) - I (δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} = k) (T_{k, i} - α_{k})\}^{2}]}^{1 / 2} = o (1) \end{aligned}$

and

$\begin{aligned} & E [\{I ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} = k) ({\hat{T}}_{k, i} - α_{k}) - I (δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} = k) (T_{k, i} - α_{k})\} \\ & \quad \times \{I ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} = k) ({\hat{T}}_{k, i} - α_{k}) - I (δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} = k) (T_{k, i} - α_{k})\}] \\ & \leq {[E \{I ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} = k) ({\hat{T}}_{k, i} - α_{k}) - I (δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} = k) (T_{k, i} - α_{k})\}^{2}]}^{1 / 2} \\ & \quad \times {[E \{I ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} = k) ({\hat{T}}_{k, i} - α_{k}) - I (δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} = k) (T_{k, i} - α_{k})\}^{2}]}^{1 / 2} = o (1) . \end{aligned}$

Then the following two results can be obtained immediately:

$\begin{aligned} & | E [1 / m \sum_{i = 1}^{m} \{I ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} = k) ({\hat{T}}_{k, i} - α_{k}) - I (δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} = k) (T_{k, i} - α_{k})\}] | \\ & \leq E \{| I ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} = k) ({\hat{T}}_{k, i} - α_{k}) - I (δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} = k) (T_{k, i} - α_{k}) |\} = o (1) \end{aligned}$

and

$\begin{aligned} & \mathrm{Var} [1 / m \sum_{i = 1}^{m} \{I ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} = k) ({\hat{T}}_{k, i} - α_{k}) - I (δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} = k) (T_{k, i} - α_{k})\}] \\ & \leq (1 / m) E \{I ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} = k) ({\hat{T}}_{k, i} - α_{k}) - I (δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} = k) (T_{k, i} - α_{k})\}^{2} \\ & \quad + (1 - 1 / m) E [\{I ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} = k) ({\hat{T}}_{k, i} - α_{k}) - I (δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} = k) (T_{k, i} - α_{k})\} \\ & \quad \times \{I ({\hat{δ}}_{i}^{{\hat{λ}}_{k, t - 1}} = k) ({\hat{T}}_{k, i} - α_{k}) - I (δ_{i}^{{\overset{ˇ}{λ}}_{k, t - 1}} = k) (T_{k, i} - α_{k})\}] = o (1) . \end{aligned}$

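Since both the mean and the variance of ${\hat{N}}_{k} ({\hat{λ}}_{k, t - 1}) - {\tilde{N}}_{k} ({\overset{ˇ}{λ}}_{k, t - 1})$ are $o (1)$ , Chebyshev's inequality shows that this difference converges to 0 in probability. We can therefore conclude that ${\hat{N}}_{k} ({\hat{λ}}_{k, t - 1}) \overset{P}{\to} {\tilde{N}}_{k} ({\overset{ˇ}{λ}}_{k, t - 1})$ .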

Proof of (2): To prove result (2), it suffices to prove that

${\hat{N}}_{k}^{C} ({\overset{ˇ}{λ}}_{k, t - 1}) - {\hat{N}}_{k} ({\overset{ˇ}{λ}}_{k, t - 1}) \overset{P}{\to} 0,$

which follows from (1), (a), (b) and (d). Thus, we have

${\hat{N}}_{k}^{C, - 1} ({\overset{ˇ}{λ}}_{k 0, t - 1}) \overset{P}{\to} {\overset{ˇ}{λ}}_{k, t} .$

This result, together with (c) and (e), shows that ${\hat{λ}}_{k, t} \overset{P}{\to} {\overset{ˇ}{λ}}_{k, t}, t \geq 1.$

Now we proceed to prove Theorem 3.2. When t = 1, we have ${\hat{λ}}_{k^{'}, t - 1} = {\overset{ˇ}{λ}}_{k^{'}, t - 1} = \infty$ for $k^{'} \neq k$ , and thus ${\hat{λ}}_{k, t} \overset{P}{\to} {\overset{ˇ}{λ}}_{k, t} .$ Then, by induction on t, we obtain

${\hat{λ}}_{k, t} \overset{P}{\to} {\overset{ˇ}{λ}}_{k, t}, t \geq 1.$

Letting $t \to \infty$ on both sides, we have ${\hat{λ}}_{k}^{*} \overset{P}{\to} λ_{k}^{*} .$ Then, similar to the proof of (1), we have $P ({\hat{δ}}_{i}^{{\hat{λ}}^{*}} = k, δ_{i}^{λ^{*}} \neq k) = o (1)$ and $P ({\hat{δ}}_{i}^{{\hat{λ}}^{*}} \neq k, δ_{i}^{λ^{*}} = k) = o (1)$ . Then we have

$E \{| I ({\hat{δ}}_{i}^{{\hat{λ}}^{*}} = k) - I (δ_{i}^{λ^{*}} = k) |\} \leq P ({\hat{δ}}_{i}^{{\hat{λ}}^{*}} = k, δ_{i}^{λ^{*}} \neq k) + P ({\hat{δ}}_{i}^{{\hat{λ}}^{*}} \neq k, δ_{i}^{λ^{*}} = k) = o (1) .$

Based on the above results, it can be easily shown that

$\begin{aligned} | E \{1 / m \sum_{i = 1}^{m} (T_{k, i} - α_{k}) I ({\hat{δ}}_{i}^{{\hat{λ}}^{*}} = k)\} | & = | E [(T_{k, i} - α_{k}) \{I ({\hat{δ}}_{i}^{{\hat{λ}}^{*}} = k) - I (δ_{i}^{λ^{*}} = k)\}] | \\ & \leq E \{| I ({\hat{δ}}_{i}^{{\hat{λ}}^{*}} = k) - I (δ_{i}^{λ^{*}} = k) |\} = o (1) \end{aligned}$

and

$E \{1 / m \sum_{i = 1}^{m} I ({\hat{δ}}_{i}^{{\hat{λ}}^{*}} = k)\} = E \{1 / m \sum_{i = 1}^{m} I (δ_{i}^{λ^{*}} = k)\} + o (1) > 0,$

based on which we can derive that ${mFDR}_{k} ({\hat{δ}}^{{\hat{λ}}^{*}}) = α_{k} + o (1)$ . Meanwhile, we have

$| E [1 / m \sum_{i = 1}^{m} (1 - T_{k, i}) \{I ({\hat{δ}}_{i}^{{\hat{λ}}^{*}} = k) - I (δ_{i}^{λ^{*}} = k)\}] | \leq E \{| I ({\hat{δ}}_{i}^{{\hat{λ}}^{*}} = k) - I (δ_{i}^{λ^{*}} = k) |\} = o (1),$

based on which we can derive that $ETP ({\hat{δ}}^{{\hat{λ}}^{*}}) / ETP (δ^{λ^{*}}) = 1 + o (1) .$ This completes the proof of Theorem 3.2.

Funding Statement

This work was supported by the National Key R&D Program of China [2022YFA1003801; 2021YFA1000101; 2021YFA1000102], National Natural Science Foundation of China [12201382; 12071144; 71931004], Basic Research Project of Shanghai Science and Technology Commission [22JC1400800], Anhui Provincial Natural Science Foundation [2308085MA10], Natural Science Foundation of Anhui Provincial Universities [KJ2021A1040].

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

1. Abramovich F., Benjamini Y., Donoho D.L., and Johnstone I.M., Adapting to unknown sparsity by controlling the false discovery rate, Ann. Stat. 34 (2006), pp. 584–653.
2. Basu P., Cai T.T., Das K., and Sun W., Weighted false discovery rate control in large-scale multiple testing, J. Am. Stat. Assoc. 113 (2018), pp. 1172–1183.
3. Benjamini Y. and Hochberg Y., Controlling the false discovery rate: A practical and powerful approach to multiple testing, J. R. Stat. Soc. Ser. B (Methodol.) 57 (1995), pp. 289–300.
4. Benjamini Y. and Yekutieli D., The control of the false discovery rate in multiple testing under dependency, Ann. Stat. 29 (2001), pp. 1165–1188.
5. Cai T.T. and Sun W., Simultaneous testing of grouped hypotheses: Finding needles in multiple haystacks, J. Am. Stat. Assoc. 104 (2009), pp. 1467–1481.
6. Cai T.T. and Sun W., Large-scale global and simultaneous inference: Estimation and testing in very high dimensions, Annu. Rev. Econ. 9 (2017), pp. 411–439.
7. Cai T. and Sun W., Optimal screening and discovery of sparse signals with applications to multistage high-throughput studies, J. R. Stat. Soc. Ser. B (Methodol.) 79 (2017), pp. 197–223.
8. Cai T.T., Sun W., and Xia Y., Laws: A locally adaptive weighting and screening approach to spatial multiple testing, J. Am. Stat. Assoc. 117 (2021), pp. 1–14.
9. Delaigle A. and Gijbels I., Practical bandwidth selection in deconvolution kernel density estimation, Comput. Stat. Data Anal. 45 (2004), pp. 249–267.
10. Efron B., Large-scale simultaneous hypothesis testing: The choice of a null hypothesis, J. Am. Stat. Assoc. 99 (2004), pp. 96–104.
11. Efron B., Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Cambridge University Press, New York, 2010.
12. Efron B., Tibshirani R., Storey J.D., and Tusher V., Empirical Bayes analysis of a microarray experiment, J. Am. Stat. Assoc. 96 (2001), pp. 1151–1160.
13. Genovese C. and Wasserman L., Operating characteristics and extensions of the false discovery rate procedure, J. R. Stat. Soc. Ser. B (Methodol.) 64 (2002), pp. 499–517.
14. Genovese C. and Wasserman L., A stochastic process approach to false discovery control, Ann. Stat. 32 (2004), pp. 1035–1061.
15. Holte S.E., Lee E.K., and Mei Y., Symmetric directional false discovery rate control, Stat. Methodol. 33 (2016), pp. 71–82.
16. Javanmard A. and Javadi H., False discovery rate control via debiased lasso, Electron. J. Stat. 13 (2019), pp. 1212–1253.
17. Jin J. and Cai T.T., Estimating the null and the proportion of nonnull effects in large-scale multiple comparisons, J. Am. Stat. Assoc. 102 (2007), pp. 495–506.
18. Lewis C. and Thayer D.T., A loss function related to the FDR for random effects multiple comparisons, J. Stat. Plan. Inference 125 (2004), pp. 49–58.
19. Li W., Xiang D., Tsung F., and Pu X., A diagnostic procedure for high-dimensional data streams via missed discovery rate control, Technometrics 62 (2020), pp. 84–100.
20. Sarkar S.K. and Zhou T., Controlling Bayes directional false discovery rate in random effects model, J. Stat. Plan. Inference 138 (2008), pp. 682–693.
21. Silverman B.W., Density Estimation for Statistics and Data Analysis, Routledge, New York, 2018.
22. Storey J.D., A direct approach to false discovery rates, J. R. Stat. Soc. Ser. B (Stat. Methodol.) 64 (2002), pp. 479–498.
23. Storey J.D., The positive false discovery rate: A Bayesian interpretation and the q-Value, Ann. Stat. 31 (2003), pp. 2013–2035.
24. Sun W. and Cai T.T., Oracle and adaptive compound decision rules for false discovery rate control, J. Am. Stat. Assoc. 102 (2007), pp. 901–912.
25. Sun W. and Cai T.T., Large-scale multiple testing under dependence, J. R. Stat. Soc. Ser. B (Methodol.) 71 (2009), pp. 393–424.
26. Sun W. and McLain A.C., Multiple testing of composite null hypotheses in heteroscedastic models, J. Am. Stat. Assoc. 107 (2012), pp. 673–687.
27. Sun W., Reich B.J., Cai T.T., Guindani M., and Schwartzman A., False discovery control in large-scale spatial multiple testing, J. R. Stat. Soc. Ser. B (Methodol.) 77 (2015), pp. 59–83.
28. Wang L., Han X., and Tong X., Skilled mutual fund selection: False discovery control under dependence, preprint (2021). Available at arXiv, arXiv:2106.08511.
29. Xiang D., Li W., Tsung F., Pu X., and Kang Y., Fault classification for high-dimensional data streams: A directional diagnostic framework based on multiple hypothesis testing, Naval Res. Logist. 68 (2021), pp. 973–987.
30. Xiang D., Zhao S.D., and Cai T.T., Signal classification for the integrative analysis of multiple sequences of large-scale multiple tests, J. R. Stat. Soc. Ser. B (Methodol.) 81 (2019), pp. 707–734.
