Abstract
Sequential designs can be used to save computation time in implementing Monte Carlo hypothesis tests. The motivation is to stop resampling if the early resamples provide enough information on the significance of the p-value of the original Monte Carlo test. In this paper, we consider a sequential design called the B-value design, based on the B-value proposed by Lan and Wittes, and construct the sequential design bounding the resampling risk, the probability that the accept/reject decision differs from the decision based on complete enumeration. For the B-value design, whose exact implementation can be carried out with the algorithm proposed in Fay, Kim and Hachey, we first compare the expected resample size across designs with comparable resampling risk. We show that the B-value design yields considerable savings in expected resample size compared to a fixed resample size or simple curtailed design, and an expected resample size comparable to that of the iterative push out design of Fay and Follmann. The B-value design is more practical than the iterative push out design in that it remains tractable even for small values of the resampling risk, which was a challenge with the iterative push out design. We also propose an approximate B-value design that can be constructed without specially developed software and that provides analytic insights on the choice of parameter values in constructing the exact B-value design.
Keywords: B-Value, Bootstrap, Permutation, Sequential Design, Approximation
1 Introduction
When we implement Monte Carlo hypothesis tests (MC tests) such as bootstrap or permutation tests, computation time can often be saved by early stopping. The main idea is to stop early and reject/accept the null hypothesis if the early replications provide enough evidence for a significant/insignificant p-value of the original MC test. A simple sequential design is a curtailed design, where we stop and reject or accept the null hypothesis once the replications collected so far are enough to ensure the rejection or acceptance of the null hypothesis under a full enumeration of the MC test. Fay et al. (2007) proposed using a truncated sequential probability ratio test (tSPRT) boundary, which is minimax with respect to the resampling risk, and provided an algorithm to calculate a valid p-value after reaching the stopping boundary. As discussed in Fay et al. (2007), there is another way to determine an MC boundary: by minimizing the resampling risk over a class of possible distributions for the p-value. Fay and Follmann (2002) used the class of distributions for the p-values generated by location shifts for a standard normal test statistic and approximated the associated distributions for the p-values within this class using beta distributions. Within this class, Fay and Follmann (2002) found by numerical search the parameter values which gave the largest resampling risk for fixed boundaries, defining a "worst case" distribution. Their approach was then to determine the design for the MC test which bounded the resampling risk for that worst case distribution and would therefore bound the resampling risk for the entire class of distributions. Fay and Follmann (2002) used this approach in motivating the iterative push out (IPO) design. In Fay and Follmann (2002), the IPO boundary was shown to provide savings in the expected resample size compared to the fixed resample size design and a simple curtailed design.
Its implementation, however, is not practical for small values of resampling risk.
Fay et al. (2007) did not fully explore using the Fay and Follmann (2002) approach for bounding the resampling risk within a class of distributions for the p-value, but only plotted resampling risk as a function of fixed p-values. In this paper, we explore the Fay and Follmann (2002) approach and focus on sequential designs which bound or approximately bound the resampling risk within the previously mentioned class of distributions for the p-value. In doing so, we use the B-value boundary, which is a tSPRT boundary written in terms of the B-value proposed by Lan and Wittes (1988). The B-value is a statistic which projects the expected value of the test statistic at the end of the study, conditional on the information accumulated so far, and we make a decision based on this projected value. We construct the B-value boundary either by using the numerical algorithm introduced in Fay et al. (2007), the FKH algorithm, or by using approximations introduced in this paper. We call the latter design the approximate B-value design; it is much easier to construct, but it only provides a decision rule, without estimating the p-value. In order to numerically construct the B-value boundary, we implement the algorithm developed in Fay et al. (2007), which, unlike the IPO design, provides a tractable design to bound the resampling risk below small values (e.g., below 0.01). Considering the class of p-value distributions used in Fay and Follmann (2002), we empirically show that the B-value design can provide expected resample sizes significantly smaller than those of the fixed resample size and simple curtailed sequential designs, and an expected resample size comparable to that of the IPO design. As another way of implementing the B-value design, we propose a method to construct an approximate B-value boundary by using results developed in sequential analysis. The approximate B-value design asymptotically bounds the resampling risk, and the associated test asymptotically maintains its size. These approximations help one to understand analytic characteristics of the sequential MC test, and the approximate B-value design can be implemented without the aid of special software.
This paper is organized as follows. In Section 2, we introduce the notation and define the B-value design. We also summarize some exact properties of the B-value design, compared to the fixed resample size, curtailed and IPO designs. In Section 3, we propose the approximate B-value design, which only requires a simple numerical integration in order to make an accept/reject decision early, subject to a desirable resampling risk. Section 4 includes an example and further remarks.
2 B-value Design
2.1 Overview
Let T be a test statistic whose large values lead to the rejection of the null hypothesis, H0, and let T0 = T(d0) denote the value of T observed for the original sample. When independent replications of data, say d1, d2, …, are taken and we obtain Ti = T(di), the p-value of the MC test is defined as p(d0) = P(Ti ≥ T0 ∣ d0) over all possible replications of T. For given d0, consider p(d0) = p as fixed. With such p, our goal is to stop taking Monte Carlo samples when we are reasonably sure that either p ≤ α (i.e., the decision after an infinite resample is to reject the original null hypothesis) or p > α (i.e., the decision after an infinite resample is to fail to reject the original null hypothesis). One way to think about the Monte Carlo problem is that we have independent and identically distributed random variables, Xi = I(Ti ≥ T0), each following a Bernoulli distribution with parameter p, and we want to design a study to decide between the two possibilities for the parameter p, either p ≤ α or p > α.
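The reduction above can be sketched directly: each Monte Carlo replication is converted into the indicator Xi = I(Ti ≥ T0), so the sequential procedure only ever sees a Bernoulli(p) stream. A minimal illustration, where `resample_stat` is a hypothetical stand-in for one Monte Carlo replication of T:

```python
import random

def mc_indicator_stream(t0, resample_stat, n_max):
    """Yield X_i = I(T_i >= T_0), i.i.d. Bernoulli(p) indicators, where p is
    the complete-enumeration p-value.  `resample_stat` is a hypothetical
    callable producing one Monte Carlo replication of the test statistic T."""
    for _ in range(n_max):
        yield int(resample_stat() >= t0)

# Toy usage: a standard normal "replication" with t0 = 0 gives p = 0.5.
rng = random.Random(0)
xs = list(mc_indicator_stream(0.0, lambda: rng.gauss(0, 1), 2000))
```

The sequential designs below operate only on such a stream, regardless of how expensive each replication of T is to compute.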
Suppose that X1, X2, …, are independent and identically distributed Bernoulli random variables with parameter p, and consider the problem of testing p ≤ α versus p > α. Various sequential procedures can be applied to test p ≤ α against p > α, and in this paper we use the B-value proposed by Lan and Wittes (1988) to project the current trend to the end of a study. For our purposes, we can calculate the B-value after every Monte Carlo sample, and use these B-values to conduct the sequential testing efficiently.
Consider the statistic Zn to test whether p = α or not with n observations,

Zn = (Sn − nα) / √{nα(1 − α)},

where Sn = X1 + ⋯ + Xn. For the pre-determined maximum number of resamples, m, the B-value at tn = n/m is defined as

B(tn) = √tn Zn = (Sn − nα) / √{mα(1 − α)}.
Consider a sequential test where we stop resampling at the n-th sample and reject if B(tn) ≥ c1 for some n ≤ m (and thus do not reject the original null hypothesis H0), or stop and fail to reject if B(tn) ≤ c2 for some n ≤ m (and thus reject the original null hypothesis H0), where c1 and c2 are appropriately chosen constants. This sequential test produces parallel boundaries on plots of (n, Sn). We modify it by incorporating the idea of a simple curtailed test, which stops at the n-th sample and rejects if Sn ≥ ⌊α(m + 1)⌋ = r1, and stops and accepts if n − Sn ≥ ⌈(1 − α)(m + 1)⌉ = r0. Let

U(n) = nα + c1√{mα(1 − α)} and L(n) = nα + c2√{mα(1 − α)},

and define

BUpper = {(n, U(n)) : n̲1 ≤ n < n̄1} ∪ {(n, r1) : n̄1 ≤ n ≤ m},
BLower = {(n, L(n)) : n̲0 ≤ n < n̄0} ∪ {(n, n − r0) : n̄0 ≤ n ≤ m},

where n̲1 is the smallest value of n such that U(n) ≤ n and n̲0 is the smallest n for which L(n) ≥ 0. Also, denote the smallest value of n such that L(n) ≤ n − r0 as n̄0 and the smallest value of n such that U(n) ≥ r1 as n̄1. Then, the stopping boundary is formed by B = BUpper ∪ BLower. See Figure 1 for the boundary with c1 = −c2 = 2.58281, m = 3,620, and α = 0.05.
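Since B(tn) ≥ c1 is equivalent to Sn ≥ nα + c1√{mα(1 − α)}, and similarly for the lower crossing, the stopping thresholds on Sn can be tabulated in a few lines. The sketch below only illustrates the boundary shape: the integer rounding conventions are ours, and the exact boundary in the paper is constructed with the FKH algorithm.

```python
import math

def bvalue_boundary(m, alpha, c1, c2):
    """Stopping thresholds on S_n implied by the parallel B-value rule
    B(n/m) = (S_n - n*alpha) / sqrt(m*alpha*(1 - alpha)) combined with
    simple curtailment at counts r1 and r0.  Returns (upper, lower):
    sampling stops as soon as S_n >= upper[n] or S_n <= lower[n]."""
    sd = math.sqrt(m * alpha * (1 - alpha))
    r1 = math.floor(alpha * (m + 1))          # upper curtailment count
    r0 = math.ceil((1 - alpha) * (m + 1))     # lower curtailment count
    upper, lower = {}, {}
    for n in range(1, m + 1):
        # B(n/m) >= c1  <=>  S_n >= U(n) = n*alpha + c1*sd; capped at r1
        upper[n] = min(math.ceil(n * alpha + c1 * sd), r1)
        # B(n/m) <= c2  <=>  S_n <= L(n) = n*alpha + c2*sd; or n - S_n >= r0
        lower[n] = max(math.floor(n * alpha + c2 * sd), n - r0)
    return upper, lower
```

For the parameters of Figure 1 (m = 3,620, α = 0.05, c1 = −c2 = 2.58281) the two thresholds meet at n = m, so the boundary is closed: every sample path must stop by the m-th resample.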
Figure 1.
A B-value Boundary with m = 3,620, α = 0.05 and c1 = −c2 = 2.58281
2.2 The B-value Design Bounding the Resampling Risk
Let the stopping rule for the design be denoted by a b × 2 matrix, B, such that the j-th row denotes values of (n, s) where sampling stops, and consider a sequential MC test φ defined on the B-value boundary as follows: φ(n, s) = 0 would lead us not to reject the original null hypothesis (i.e., reject the hypothesis that p ≤ α) and φ(n, s) = 1 rejects the original null hypothesis (i.e., do not reject the hypothesis that p ≤ α). Then, the size of the sequential MC test, φ, is α*(φ, B) = EU{EB(φ ∣ P)}, where EB is the expectation over the Monte Carlo sampling design after replacing p with the random variable P, and EU represents the expectation with respect to P under the continuous uniform distribution. For the B-value design with φ defined this way, we require the design to bound the resampling risk under the distribution for P, both under the null and a class of alternative distributions, where the resampling risk,

γ(φ, F, B) = EF{E(φ I(P > α) ∣ P)} + EF{E((1 − φ) I(P ≤ α) ∣ P)},

is the expected probability that the accept/reject decision differs from complete enumeration for a distribution of P ∼ F. In other words, we require γ(φ, F, B) ≤ γ* for all F ∊ 𝓕, where 𝓕 is the class of distributions which includes the null (i.e., the uniform) and many alternative distributions, and γ* is some pre-specified bound on the resampling risk.
Fay and Follmann (2002) proposed a design that starts with a simple design and iteratively "pushes out" the boundary at the point where not stopping will decrease the resampling risk the most per added expected sample size. The pushing out algorithm will not be reviewed here, but we note that it is computationally expensive to carry out, so that IPO designs that bound the resampling risk at or below γ* = 0.01 are intractable. Fay and Follmann (2002) found distributions that met the resampling criterion by first defining a class of designs which may be expanded in some systematic way such that γ gets smaller as the designs get larger, then finding the distribution F̂P ∊ 𝓕 which appears to give the largest values of γ(φ, F̂P, B) for each member of the class of designs, and finally picking the smallest design in the class such that γ(φ, F̂P, B) ≤ γ*. They estimated FP with beta distributions and then searched for the F̂P which gave the largest resampling risk for fixed boundaries of different sizes over all possible values of β, the probability of Type II error; the numerical search showed that 1 − β ≈ .47 gave the largest resampling risk for fixed boundaries with α = 0.05. As discussed in Fay et al. (2007), the tSPRT boundary, or equivalently the B-value boundary, can then be determined by searching over the values of c1 and c2 for fixed m such that γ(φ, F̂*, B) < γ*, where F̂* is the "worst case" distribution chosen as in Fay and Follmann (2002).
In the rest of this section, we consider the B-value boundary of Section 2.1 and the test based on the valid p-value proposed in Fay et al. (2007), φFKH. Their valid p-value when (n, Sn) is a boundary point is defined as p̂v(Sn, n) = Fp̂(p̂MLE), where p̂MLE(Sn, n) = Sn/n is the maximum likelihood estimator of p for any closed boundary and Fp̂ is the cumulative distribution function of p̂MLE. Fp̂ is defined in (5.2) of Fay et al. (2007) and can be computed by using the FKH algorithm. For the B-value boundary obtained by the FKH algorithm and the test function defined as
the following tables indicate the efficiency of the B-value design. In Table 1, we compare the B-value design with various values of c = c1 = −c2 to the IPO design of Fay and Follmann (2002) for α = .05 and γ* = .025. The distributions of the p-value considered in the tables are F̂*, chosen as the "worst case" beta distribution, a uniform distribution, and two point mass distributions. Following the notation of Fay and Follmann (2002), we let Hα,1−β be a distribution such that Hα,1−β(p) = 1 − Φ{Φ−1(1 − p) − Φ−1(1 − α) + Φ−1(β)}, where Φ is the cumulative distribution function of a standard normal distribution, and let Ĥα,1−β denote the beta distribution whose mean is the same as the mean of Hα,1−β and for which Ĥα,1−β(α) = Hα,1−β(α) = 1 − β. In this numerical experiment, we use F̂* = Ĥ0.05,0.47 following Fay and Follmann (2002), and consider the case of c = c1 = −c2 and ε = 2(1 − Φ(c)). The choices of ε (or c) are rather arbitrary in Tables 1 and 2, while m is chosen to be the smallest sample size to bound the resampling risk for the given c. These values of m in Tables 1 and 2 are chosen by using the FKH algorithm, which requires repeatedly constructing the B-value boundary and computing the associated risk, and some insights on such choices are provided in Section 3 with the approximate B-value design. We see that the expected resample size, E(N), for the B-value design with ε = 0.05 is close to the E(N) from the IPO design. The E(N) for the B-value designs decreases with increasing ε (or decreasing c), and some B-value designs have lower E(N) than the IPO design for most distributions tried. Note that changing the value of ε (or c) does not change the resampling risk within the range of values tried, which will be discussed further with the approximate B-value design in the next section. In Table 2, we compare the B-value design to the fixed resample size design and the curtailed design at γ* = 0.01, a value for which the IPO design is intractable. We see that the B-value designs can give much lower values of E(N) than the other designs, indicating that the B-value design is more efficient than the fixed resample size and simple curtailed designs. In general, we see that for larger values of ε (i.e., smaller values of c), the B-value design shows smaller E(N). Thus, for any predetermined m, it would be desirable to use the design with the smallest value of c, while still bounding the resampling risk.
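The pairing of c and ε = 2(1 − Φ(c)) used to index the designs in Tables 1 and 2 is a plain normal-tail computation; a small helper (function names are ours) reproduces it:

```python
from statistics import NormalDist

def eps_from_c(c):
    """epsilon = 2(1 - Phi(c)): the two-sided standard normal tail
    probability used to index the B-value designs."""
    return 2.0 * (1.0 - NormalDist().cdf(c))

def c_from_eps(eps):
    """Inverse map: c = Phi^{-1}(1 - eps/2)."""
    return NormalDist().inv_cdf(1.0 - eps / 2.0)
```

For example, ε = 0.05 corresponds to c ≈ 1.960 and ε = 0.025 to c ≈ 2.241, matching the values of c used in the tables.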
Table 1.
E(N∣B) with α = 0.05, P ∼ B, and γ(φ, Ĥα,.47, B) = 0.025
| Ĥα,.47 | Uniform | Point mass at α | Point mass at 0.001 | |||||
|---|---|---|---|---|---|---|---|---|
| E(N) | Risk | E(N) | Risk | E(N) | Risk | E(N) | Risk | |
| B-value Design | ||||||||
| (ε, m) = (0.005, 580) | 279.637 | 0.02499 | 87.812 | 0.00719 | 536.429 | 0.52660 | 301.020 | 0 |
| (ε, m) = (0.025, 580) | 242.704 | 0.02499 | 76.056 | 0.00719 | 534.795 | 0.52660 | 240.816 | 0 |
| (ε, m) = (0.05, 580) | 221.092 | 0.02499 | 69.231 | 0.00719 | 530.807 | 0.52657 | 210.204 | 0 |
| (ε, m) = (0.1, 580) | 194.643 | 0.02500 | 61.072 | 0.00720 | 518.236 | 0.52627 | 176.531 | 0 |
| (ε, m) = (0.2, 600) | 163.118 | 0.02475 | 51.169 | 0.00713 | 494.378 | 0.52310 | 139.796 | 0 |
| IPO Design | ||||||||
| m=576 | 213.508 | 0.0250 | 62.85 | 0.0072 | 459.6047 | 0.522 | 251.7032 | 0 |
Table 2.
E(N∣B) with α = 0.05, P ∼ B, and γ(φ, Ĥα,.47, B) = 0.01
| Ĥα,.47 | Uniform | Point mass at α | Point mass at 0.06 | |||||
|---|---|---|---|---|---|---|---|---|
| E(N) | Risk | E(N) | Risk | E(N) | Risk | E(N) | Risk | |
| B-value Design | ||||||||
| (ε, m) = (0.005, 3620) | 954.475 | 0.00999 | 292.708 | 0.00289 | 3507.646 | 0.51065 | 2861.852 | 0.00426 |
| (ε, m) = (0.025, 3620) | 803.470 | 0.00999 | 245.940 | 0.00289 | 3485.032 | 0.51065 | 2575.559 | 0.00426 |
| (ε, m) = (0.05, 3620) | 723.584 | 0.00999 | 221.320 | 0.00289 | 3444.662 | 0.51063 | 2360.361 | 0.00426 |
| (ε, m) = (0.1, 3620) | 626.848 | 0.00999 | 191.519 | 0.00289 | 3334.562 | 0.51049 | 2060.366 | 0.00430 |
| (ε, m) = (0.2, 3681) | 510.256 | 0.01000 | 155.840 | 0.00289 | 3076.974 | 0.51058 | 1671.372 | 0.00452 |
| Curtailed Design | ||||||||
| m=3620 | 2371.276 | 0.00999 | 722.753 | 0.00289 | 3515.422 | 0.50913 | 3016.334 | 0.00431 |
| Fixed Design | ||||||||
| m=3620 | 3620 | 0.00999 | 3620 | 0.00289 | 3620 | 0.51065 | 3620 | 0.00426 |
3 Approximate B-value Design
Although much more tractable than the IPO design, the numerical calculation of the B-value design in Section 2.2 is still computationally expensive and requires specially developed software. We now consider an alternative design which we call the approximate B-value design. This design does not require a special program to construct the boundary, and it also provides some analytic insights on the properties of the B-value design illustrated in Section 2.2.
In this section, we use the B-value boundary defined in Section 2.1, but define the test function φ as follows: φ(n, s) = 1 if (n, s) ∊ BLower without reaching BUpper, and φ(n, s) = 0 if (n, s) ∊ BUpper without reaching BLower. Then (n, s) such that φ(n, s) = 0 would lead us not to reject the original null hypothesis, and (n, s) such that φ(n, s) = 1 leads us to reject the original null hypothesis. We also let N denote the resample size at stopping. We note that the φ of this section, which is based only on which boundary is crossed, is not exactly equivalent to the φFKH of Section 2.2, which is based on the valid p-value estimated at each stopping point. Although not reported here, we observed in the cases we tried that φFKH and φ differ only on a very small portion of the boundary, typically in the curtailed part, where the valid p-value is close to α. In this sense, we call the test proposed in this section an approximate B-value test. Now, aiming to construct an approximate B-value design bounding the resampling risk, we discuss how to determine c1, c2, and m, and thus B, such that γ(φ, F, B) ≤ γ* for a given choice of γ*. Since the approximate design does not automatically achieve the size of α, we also examine the size of the test following the approximate B-value design. Finally, we review how to estimate the expected resample size for the B-value design.
3.1 Approximate Upperbound of the Resampling Risk
In this section we show how to determine approximate bounds on the resampling risk given a B-value stopping boundary and the test function φ defined above. Suppose that a Monte Carlo sample forms a discrete stochastic process {X1, X2, …}, where Xi has a Bernoulli distribution with E(Xi) = p, and let Sn = X1 + ⋯ + Xn. For the approximate B-value test associated with the B-value boundary of Section 2.1, we can summarize the decision rule using the following two random variables denoting the minimum values of n at which each boundary is crossed:
Recall that the resampling risk is γ(φ, F, B) = EFE(φI(P > α) ∣ P) + EFE((1 − φ)I(P ≤ α) ∣ P) where P ∼ F. Then for α < p < 1,
and for 0 < p ≤ α,
In order to use asymptotic results developed for the B-values, consider the stopping boundaries formed only by the B-values without considering the truncation conditions: for each p, let
and
where , , and asymptotically behaves like a Brownian motion with drift . Then,
and
By using some asymptotic results developed in the context of sequential analysis, we obtain
| (1) |
where f is a density function of P, and
with ρ ≅ 0.583, a correction factor for a discrete process.
See the Appendix for the derivation of the upperbound in (1). Note that I1 and I2 depend only on α and m, not on c1 and c2, and the major contribution to I1 + I2 comes from those p close to α. Also, R1 + R2 is negligible compared to I1 + I2. These are expected from the formulations of I1, I2, R1 and R2 as shown on the right sides of the expressions above. Since I1 works as an approximate bound of the resampling risk associated with stopping at the lower boundary between and m, and I2 serves as an approximate bound of the resampling risk associated with stopping at the upper boundary between and m, we note that the major contribution to the resampling risk comes from decisions made on these parts of the boundary, BUpper for and BLower for , rather than those on the parallel boundary points, BUpper for and BLower for .
Since I1 and I2 decrease as m increases at a given α, we can choose m to make I1 + I2 slightly less than γ*. Then any choice of (c1, c2) making R1 + R2 ≤ γ* − I1 − I2 would asymptotically guarantee that the design bounds the resampling risk. However, the larger the chosen values of c1 and |c2|, the larger the expected resample size, and thus, for a given or predetermined m, it would be desirable to choose c1 and |c2| as small as possible while still bounding the resampling risk.
The approximate B-value design can be implemented as follows. For a given choice of α and γ* such that γ(φ, Ĥα,.47, B) ≤ γ*, we first choose m such that I1 + I2, for f ∼ Ĥα,.47, is slightly smaller than γ*. Then for such m, we find c = c1 = −c2 such that g(c) = R1 + R2 ≤ γ* − I1 − I2. For example, if we want to achieve α = 0.05 and γ(φ, Ĥ0.05,0.47, B) ≤ 0.01, we can use a simple integration to find I1 + I2 = 0.005046 + 0.004930 = 0.009976 for the choice of m = 3,620 and f ∼ Beta(a = 0.3889, b = 2.5234) ≈ Ĥ0.05,0.47 of Table 2. Finding c such that g(c) = R1 + R2 ≤ γ* − I1 − I2 = 0.000024 would then bound the resampling risk under 0.01. We now note that for the values of c in Table 2, 2.241, 1.960, and 1.645, corresponding to ε = 2(1 − Φ(c)) = 0.025, 0.05, and 0.1, respectively, g(c) = R1 + R2 increases as c decreases, with R1 + R2 < 1.9 × 10−7, < 2.1 × 10−6, and < 2.3 × 10−5, respectively. Thus the smallest value of c among these three that asymptotically bounds the resampling risk under γ* = 0.01 is c = 1.645, which matches Table 2, and a numerical search over all c less than or equal to 2.241 resulted in c = 1.635 for α = 0.05, γ* = 0.01 and m = 3,620. This indicates that for any given or predetermined m, we can first compute I1 + I2 to get an idea of γ* and then easily find the smallest design (i.e., the smallest c) satisfying the constraint on the resampling risk, while the construction of the exact boundary and the choice of m and c associated with φFKH would require rather time-consuming searches.
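As a rough cross-check on the magnitude of I1 + I2, one can integrate the normal-approximation probability that Sm falls on the wrong side of mα against the worst-case beta density. This is only a stand-in for the exact integrands of (1): it omits the discreteness correction ρ, so it gauges the order of magnitude rather than reproducing the reported values.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1), via log-gamma for stability."""
    log_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_B)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def approx_truncation_risk(m, alpha, a=0.3889, b=2.5234, grid=20000):
    """Rough stand-in for I1 + I2: probability that S_m lies on the wrong
    side of m*alpha when P ~ Beta(a, b), by the normal approximation to the
    binomial, integrated on a midpoint grid (a sketch only)."""
    total, h = 0.0, 1.0 / grid
    for i in range(grid):
        p = (i + 0.5) * h
        z = math.sqrt(m) * abs(p - alpha) / math.sqrt(p * (1 - p))
        total += norm_cdf(-z) * beta_pdf(p, a, b) * h
    return total
```

Consistent with the discussion above, this quantity depends only on α and m and shrinks as m grows, which is why m, not c, is the knob that controls the dominant part of the resampling risk.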
Remark 1. The first upperbound of the above expression can be improved by incorporating and , but these probabilities are typically negligible compared to and , respectively.
Remark 2. When the distribution of P is a point mass at α, the resampling risk is greater than 0.5 since
and and are negligible. This explains the values of the resampling risk in Tables 1 and 2 obtained for the point mass at α.
3.2 Size of the Approximate B-value Design
Based on the similar arguments used in Section 3.1, for α given and large m,
where , ρ ≅ 0.583, and g12(p, c) is obtained with η = 0, , and in Equation 3.28 of Siegmund (1985).
Then, for large m, an upperbound for the size of the test, α*(φ, B), is obtained as
where , , , and .
We note that α2 and α4 are negligible compared to α1 and α3, and also that α3 is relatively smaller than α1. The former is because is negligible with Pr(τ− < m) being small for p > α and Pr(Wm > 0) being small for p ≤ α, and the latter is expected from a positive drift of Wm when p > α. This is illustrated by the numerical integration results summarized in Table 3. In Table 3, we observe that the major contribution to α̃ comes from α1 + α3, which depends only on m, not on c, and that α̃ is usually slightly over α. Considering that and that is negligible, the approximate B-value design is expected to retain the size, at least approximately. Note that c = 0.908 in the last case was obtained as the smallest value of c bounding the approximate upperbound of the resampling risk, γ̃, under 0.01 at m = 4,999.
Table 3.
Approximate Size
| c | ε = 2(1 − Φ(c)) | γ̃ = I1 + I2 + R1 + R2 | α | m | α1 | α3 | α̃ |
|---|---|---|---|---|---|---|---|
| 2.241 | 0.025 | 0.024842 | 0.05 | 580 | 0.046756 | 0.004016 | 0.050772 |
| 2.241 | 0.025 | 0.009976 | 0.05 | 3620 | 0.048616 | 0.001509 | 0.050124 |
| 2.241 | 0.025 | 0.012958 | 0.01 | 3620 | 0.009404 | 0.000732 | 0.010135 |
| 0.908 | 0.364 | 0.009994 | 0.05 | 4999 | 0.048814 | 0.001276 | 0.050451 |
3.3 Expected Resample Size
As discussed in Section 3.1, the approximate B-value design can be determined by choosing m to make I1 + I2 smaller than the desirable resampling risk bound of γ* and then by choosing the smallest c such that R1 + R2 ≤ γ* − I1 − I2. We now introduce approximations for E(N), which would help us to study asymptotic efficiencies of B-value designs chosen to bound the resampling risk. As discussed in Sections 3.1 and 3.2, the B-value stopping time can be approximated as a stopping time of a Brownian motion process observed in discrete time, Wn, associated with a truncated parallel boundary, and thus we can use (1.15) of Samuel-Cahn (1974) or (3.17) of Siegmund (1985) with the correction suggested for the discrete time scale. Using these arguments, for Tb = min {n : Wn ≥ b}, we obtain
where , , and ρ ≅ 0.583. Using arguments similar to those in (3.37) of Siegmund (1985), we may then use
| (2) |
as an approximate lower bound of Ep[min(T, m)] for T = min(Tb, T−b). For the settings in Tables 1 and 2 with Ĥα,.47 and ε = 0.005, 0.025, 0.05, and 0.1 (or c = 2.807, 2.241, 1.960, and 1.645), we find the bound to be 291.836, 249.616, 226.213, and 197.680 for m = 580, and 965.547, 810.270, 727.441, and 629.124 for m = 3,620, respectively. These bounds based on (2) are only approximate, and also Ep(N) ≤ Ep(min(T, m)), but as discussed in Section 3.6 of Siegmund (1985), we observe that these approximations are close to the values of E(N) in Tables 1 and 2. This indicates that one can use (2) to get a quick and reasonably accurate estimate of the expected resample size without using the FKH algorithm to numerically construct the boundary and compute E(N) for each choice of m and c, which helps one to choose m and c efficiently by comparing approximate values of E(N).
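The accuracy of approximations like (2) can also be checked by brute force: simulate the walk against the parallel boundary with curtailment and average the stopping times. A sketch, assuming the symmetric choice c = c1 = −c2 and our own rounding of the curtailment counts (the paper instead evaluates E(N) exactly via the FKH algorithm):

```python
import math
import random

def expected_resamples(p, m, alpha, c, n_sims=2000, seed=1):
    """Monte Carlo estimate of E_p(N) for the parallel B-value boundary
    B(n/m) = (S_n - n*alpha)/sqrt(m*alpha*(1-alpha)) with stopping when
    |B| >= c or when simple curtailment applies.  A simulation sketch."""
    rng = random.Random(seed)
    sd = math.sqrt(m * alpha * (1 - alpha))
    r1 = math.floor(alpha * (m + 1))
    r0 = math.ceil((1 - alpha) * (m + 1))
    total = 0
    for _ in range(n_sims):
        s = 0
        for n in range(1, m + 1):
            s += rng.random() < p       # one Bernoulli(p) resample
            b = (s - n * alpha) / sd
            if b >= c or b <= -c or s >= r1 or n - s >= r0:
                break
        total += n
    return total / n_sims
```

As expected from the drift argument, the walk stops quickly when p is far from α and takes longest when p is near α, where the drift of the B-value is close to zero.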
4 Example and Discussion
Now, we illustrate the application of the B-value design to the permutation test of Kim et al. (2000), developed to determine the number of change-points in a segmented line regression model. They noted that the test statistic to test the null hypothesis that there are k0 change-points against the alternative that there are k1 change-points, for 0 ≤ k0 < k1, does not follow the classical asymptotic theory, and so suggested using a permutation distribution of the test statistic to estimate the p-value. Joinpoint, software developed by Kim et al. (2000) and available at http://srab.cancer.gov/joinpoint, sequentially conducts the permutation tests to select the number of change-points, and the sequential Monte Carlo tests proposed in Fay et al. (2007) and also in this paper are expected to improve the computational efficiency of Joinpoint. Figure 2 includes cancer incidence rates for (a) Hodgkin lymphoma and (b) Liver and bile duct cancer. These incidence rates are the age-adjusted rates for all races and both sexes combined for the SEER-9 population, which includes about 9.5% of the U.S. population (Surveillance, Epidemiology and End Results Program at the National Cancer Institute). The permutation tests of no change-point versus one change-point, implemented in R with 3,620 permuted data sets, produced p-values of 0.3944 and 0.0011 for Hodgkin lymphoma and Liver and bile duct cancer, respectively, which led us to choose the straight line regression model for Hodgkin lymphoma and the one change-point model for Liver and bile duct cancer. The two plots in the lower part of Figure 2 show the trajectories of the B-value test with m = 3,620, α = 0.05 and c = 2.58281 for these two cancer sites, where we plotted Sn, marked as a triangle at each n, and a portion of the parallel boundary in Figure 1 up to n = 1,000. Sn crossed the upper boundary at n = 93 for Hodgkin lymphoma, implying rejection of the hypothesis p ≤ α, while Sn crossed the lower boundary at n = 698 for Liver and bile duct cancer, leading us not to reject the hypothesis p ≤ α. From the discussion in Sections 2 and 3, we know that the resampling risk of this B-value test is less than or equal to 0.01 and that it is much more efficient than the simple curtailed test with r0 = 3,440 and r1 = 181.
Figure 2.
Hodgkin lymphoma and Liver and bile duct cancer
In this paper, we considered the B-value design to conduct MC tests with fewer replications while controlling the resampling risk. To control the resampling risk, we considered the class of p-value distributions used in Fay and Follmann (2002), and compared the performance of the B-value design with those of a simple curtailed design and the IPO design proposed in Fay and Follmann (2002). The exact B-value design, which can be obtained by using the FKH algorithm, is shown to be more efficient than the curtailed design in terms of the number of resamples, and more efficient than the IPO design in terms of computational complexity. We then proposed the approximate B-value design, which provides some analytic insights on the choice of parameters in numerically constructing the B-value design and which also serves as an approximate alternative that does not require a special algorithm.
There are some open problems that we leave for future research. Although the method of this paper works for a general choice of c1 and c2, we considered only the symmetric choice of c1 and c2 in the numerical examples. Such a symmetric choice of c1 and c2 is equivalent to the truncated SPRT boundary of Fay et al. (2007) with the Type I and II errors set equal, and it might be of interest to consider asymmetric choices of c1 and c2. Exploring how to construct more general designs aiming to minimize the expected resample size involves more details and requires further research. Also, the resampling risk defined in this paper gives equal weight to Type I and II errors, and one may want to consider a resampling risk that weighs the two types of errors differently. For a specific choice of weights, the method of Section 3 can be easily applied to approximate the resampling risk, size, and expected resample size, but how to choose the weights needs to be studied.
In this paper, we discussed how to construct the best design for a given or predetermined m with I1 + I2 < γ*, empirically showing that E(N) decreases as c = c(m) decreases for the given m. As indicated in the numerical examples, however, there are many designs, that is, many choices of (m, c), that satisfy the same constraints on α and γ*, so determining an optimal design over all (m, c) is an interesting problem to pursue.
In order to conduct a sequential test of p ≤ α versus p > α, there are other sequential procedures proposed in the literature. For example, one may consider the likelihood ratio boundary as follows. For , the sequential likelihood ratio test rejects if
for some m0 ≤ k ≤ mf and a > 0, where L denotes the likelihood function. This condition is equivalent to
which can be written in terms of the following stopping time:
where H(x) = x ln x + (1 − x) ln(1 − x) − x ln(α) − (1 − x) ln(1 − α). Then, some asymptotic results, as in Siegmund (1985, Chapter 5), can be used to construct an asymptotic sequential design, but it requires more complicated details than those in Section 3. We plan to pursue this in future research to compare the performance of the B-value design with other sequential boundaries such as the likelihood ratio boundary.
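The function H above is the Kullback–Leibler divergence between Bernoulli(x) and Bernoulli(α), and the stopping condition then takes the generalized likelihood-ratio form n H(Sn/n) ≥ a. A small sketch of the resulting stopping rule, assuming that form (function names are ours):

```python
import math

def H(x, alpha):
    """H(x) = x ln x + (1-x) ln(1-x) - x ln(alpha) - (1-x) ln(1-alpha),
    the Kullback-Leibler divergence between Bernoulli(x) and Bernoulli(alpha)."""
    def xlogy(u, v):
        # Convention 0 * log(0) = 0, so H is defined at x = 0 and x = 1.
        return 0.0 if u == 0.0 else u * math.log(v)
    return (xlogy(x, x) + xlogy(1.0 - x, 1.0 - x)
            - xlogy(x, alpha) - xlogy(1.0 - x, 1.0 - alpha))

def glr_stop(xs, alpha, a):
    """First n with n * H(S_n / n, alpha) >= a, or None if the likelihood
    ratio boundary is never crossed within the given sample."""
    s = 0
    for n, x in enumerate(xs, start=1):
        s += x
        if n * H(s / n, alpha) >= a:
            return n
    return None
```

Note that H(α) = 0 and H grows as the observed fraction Sn/n moves away from α, so the rule stops sooner the further p is from α, paralleling the behavior of the B-value boundary.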
Acknowledgments
Kim's research was partially supported by NIH Contract HHSN 261200700273P. The author thanks Dr. Michael P. Fay for very helpful discussions and extensive assistance in improving the presentation of the paper, and Mark Hachey for the initial construction of Tables 1 and 2.
Appendix: Approximate Upperbound for γ(φ, F, B)
Note that
and
where , and ρ ≅ 0.583. Thus,
and
Therefore, when c1 = −c2 = c, for large m, we have
Note that works as an upperbound of , which makes the most contribution to γ1 with negligible. Similarly, the major contribution to γ2 comes from and is negligible.
References
- Fay MP, Follmann DA. Designing Monte Carlo implementations of permutation or bootstrap hypothesis tests. American Statistician. 2002;56:63–70.
- Fay MP, Kim HJ, Hachey M. On using truncated sequential probability ratio test boundaries for Monte Carlo implementation of hypothesis tests. Journal of Computational and Graphical Statistics. 2007;16:946–967. doi:10.1198/106186007X257025.
- Kim HJ, Fay M, Feuer EJ, Midthune DN. Permutation tests for joinpoint regression with applications to cancer rates. Statistics in Medicine. 2000;19:335–351. doi:10.1002/(sici)1097-0258(20000215)19:3<335::aid-sim336>3.0.co;2-z.
- Lan KKG, Wittes J. The B-value: a tool for monitoring data. Biometrics. 1988;44:579–585.
- Samuel-Cahn E. Repeated significance test II, for hypotheses about the normal distribution. Communications in Statistics. 1974;3(8):711–733.
- Siegmund D. Sequential Analysis. New York: Springer-Verlag; 1985.


