Published in final edited form as: Stat Med. 2023 Nov 14;43(2):279–295. doi: 10.1002/sim.9959

MERIT: Controlling Monte-Carlo error rate in large-scale Monte-Carlo hypothesis testing

Yunxiao Li 1, Yi-Juan Hu 1, Glen A Satten 2

Abstract

The use of Monte-Carlo (MC) p-values when testing the significance of a large number of hypotheses is now commonplace. In large-scale hypothesis testing, we will typically encounter at least some p-values near the threshold of significance, which require a larger number of MC replicates than p-values that are far from the threshold. As a result, some incorrect conclusions can be reached due to MC error alone; for hypotheses near the threshold, even a very large number (eg, $10^6$) of MC replicates may not be enough to guarantee that the conclusions reached using MC p-values agree with those based on the ideal p-values. Gandy and Hahn (GH)6-8 have developed the only method that directly addresses this problem. They defined a Monte-Carlo error rate (MCER) to be the probability that any decisions on accepting or rejecting a hypothesis based on MC p-values are different from decisions based on ideal p-values; their method then makes decisions by controlling the MCER. Unfortunately, the GH method is frequently very conservative, often making no rejections at all and leaving a large number of hypotheses "undecided". In this article, we propose MERIT, a method for large-scale MC hypothesis testing that also controls the MCER but is more statistically efficient than the GH method. Through extensive simulation studies, we demonstrate that MERIT controls the MCER while making more decisions that agree with the ideal p-values than GH does. We also illustrate our method by an analysis of gene expression data from a prostate cancer study.

Keywords: bootstrap, false discovery rate, high-dimensional, permutation, reproducibility

1 | INTRODUCTION

Modern scientific studies often require testing a large number of hypotheses; in particular, modern biological and biomedical studies of -omics such as genomics, proteomics, and metabolomics typically test hundreds or thousands of hypotheses at a time. In these studies, p-values for testing individual hypotheses are first obtained and a procedure that corrects for multiplicity, such as the procedure of Benjamini and Hochberg (BH),1 is then applied to make decisions on the hypotheses. Specifically, denote the $m$ null hypotheses by $H_{1,0}, H_{2,0}, \ldots, H_{m,0}$. Ideally, the p-values are calculated based on an (accurate) analytical method or exhaustive resampling of replicates; we call these ideal p-values and denote them by $p_1^*, p_2^*, \ldots, p_m^*$. When the BH procedure with nominal false-discovery rate (FDR) $\phi$ is applied to the ideal p-values, they are sorted in ascending order $p_{(1)}^* \le p_{(2)}^* \le \cdots \le p_{(m)}^*$ and then the largest integer $k$ such that $p_{(k)}^* \le \phi \times k/m$ is found; hypotheses corresponding to $p_{(1)}^*, p_{(2)}^*, \ldots, p_{(k)}^*$ are rejected while the remaining hypotheses are accepted. Thus, we call $\tau^* \equiv \phi \times k/m$ the BH cutoff for the ideal p-values, which separates the ideal p-values for rejected and accepted hypotheses.
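To make the BH cutoff concrete, the following R sketch (ours, for illustration; not from the paper or the MERIT package) computes $\tau^*$ from a vector of p-values:

```r
# Minimal sketch: the BH cutoff tau* = phi * k / m, where k is the
# largest integer with p_(k) <= phi * k / m.
bh_cutoff <- function(p, phi = 0.10) {
  m <- length(p)
  p_sorted <- sort(p)
  ok <- which(p_sorted <= phi * seq_len(m) / m)
  if (length(ok) == 0) return(0)  # no rejections: cutoff below all p-values
  phi * max(ok) / m
}

p_star <- c(0.0001, 0.003, 0.018, 0.041, 0.09, 0.32, 0.55, 0.71, 0.88, 0.97)
tau_star <- bh_cutoff(p_star)   # here k = 3, so tau* = 0.1 * 3 / 10 = 0.03
sum(p_star <= tau_star)         # the 3 hypotheses with p* <= tau* are rejected
```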

When the ideal p-values cannot be calculated analytically, we frequently obtain p-values by resampling methods such as permutation or bootstrap. For ease of exposition, hereafter we generically refer to any resampling-based method as a Monte-Carlo (MC) method, and refer to individual resamples as MC replicates. The recommended form of an MC p-value is $\hat{p}_i = (X_i + 1)/(n + 1)$ for the $i$th hypothesis,2-4 where $n$ is the number of MC replicates (ie, MC sample size) and $X_i$ is the number of "exceedances", that is, the number of times the statistic based on an MC replicate "exceeds" the observed test statistic. When using a finite $n$, there will be a difference between the MC p-value and the ideal p-value.

It is well known that the MC sample size required to test a hypothesis having an ideal p-value near the BH cutoff $\tau^*$ is much larger than that required for a hypothesis having an ideal p-value that is far from $\tau^*$. What is less appreciated is that, when testing a large number of hypotheses, it is very likely that at least some hypotheses have ideal p-values near $\tau^*$. As a result, if we repeat the MC experiment with a different set of replicates (and so obtain a slightly different set of MC p-values), there can be noticeable variability in the list of hypotheses that are rejected. The source of this variability is that random fluctuations in the MC procedure for a hypothesis having ideal p-value near the BH cutoff can result in the MC p-value randomly crossing the BH cutoff. Because the MC sample size is finite, these lists can also differ from the list based on the ideal p-values. This is true whether the test procedure adopts a fixed stopping criterion, that is, mandating a fixed MC sample size for all hypotheses, or a sequential stopping criterion,5 that is, allowing some tests to stop early after collecting sufficient information about their decisions. Note that even for test procedures (eg, the procedure of Sandve et al.5) that always control the FDR in any list of rejected hypotheses resulting from an arbitrary set of MC replicates, there can still be variability in these lists. This failure in reproducibility is clearly undesirable to investigators. Even a large number (eg, $10^6$) of MC replicates may not be enough to guarantee a list free of MC error.

To date, only Gandy and Hahn (GH)6-8 have directly addressed this problem. They developed a method that can be coupled with any MC test method that has a sequential stopping criterion. Their method makes decisions on multiple hypotheses by controlling the family-wise Monte-Carlo error rate (MCER), which is the probability that any of their decisions on accepting or rejecting a hypothesis are different from what would have been obtained from the ideal p-values. It works in a sequential manner, giving a decision among "rejected", "accepted", and "undecided" to each test after each iteration and continuing sampling only for hypotheses that are "undecided".

Denote the decisions (between "rejected" and "accepted") made by the BH procedure applied to the ideal p-values by $D_1^*, D_2^*, \ldots, D_m^*$. Denote the decisions (among "rejected", "accepted", and "undecided") made by an MC test procedure by $\hat{D}_1, \hat{D}_2, \ldots, \hat{D}_m$. Following GH, we define the total number of MC errors to be the sum of all disagreements between the $D_i^*$s and $\hat{D}_i$s among hypotheses that are determined as "rejected" or "accepted" by the MC test procedure:

$$V = \sum_{i=1}^m I(\hat{D}_i = \text{rejected}, D_i^* = \text{accepted}) + \sum_{i=1}^m I(\hat{D}_i = \text{accepted}, D_i^* = \text{rejected}), \tag{1}$$

where $I(\cdot, \cdot)$ is an indicator function. Controlling the family-wise MCER at level $\alpha$, that is, $\Pr(V \ge 1) \le \alpha$, assures that, among those tests that the MC test procedure actually decides, the probability that we reach a conclusion that is different from the decision we would make if we had the ideal p-values is less than $\alpha$. Note that there is no error control on "undecided" hypotheses ($\hat{D}_i = \text{undecided}$), which would require more MC replicates to reach a decision.

At each iteration of the MC test procedure, the GH method forms a two-sided Robbins-Lai confidence interval (CI)9,10 for each ideal p-value that maintains joint coverage over all iterations (ie, accounting for the possible differences in MC sample size for each test); the level of each CI is determined by using the Bonferroni correction. Then, the GH method applies the BH procedure to the collection of upper CI limits and, separately, to the collection of lower CI limits. Those tests for which both lower and upper limits are rejected are determined as "rejected"; those for which both limits are accepted are determined as "accepted"; the remaining tests are determined as "undecided". The GH method controls the MCER at every iteration. However, it is extremely conservative due to the use of the Bonferroni correction, and thus tends to make no rejections at all when the MC sample size is not very large.

While the sequential stopping criterion has advantages (eg, it can greatly reduce the MC sample size for tests that are either highly insignificant or highly significant), use of a fixed, pre-determined MC sample size is more common. Compared to the sequential stopping criterion, use of a fixed MC sample size reduces the chance of making different decisions on a hypothesis when the MC experiment is repeated. Further, in many instances there may be little or no computational advantage to stopping some tests early, for example, when the cost of evaluating the stopping criterion exceeds the cost of calculating the test statistics. For these reasons, we now restrict our attention to MC procedures with fixed MC sample sizes.

In this article, we propose a method, called MERIT (Monte-Carlo Error Rate control In large-scale MC hypothesis Testing), that makes decisions on the hypotheses by controlling the same family-wise MCER as the GH method, but that works for MC test procedures with a fixed stopping criterion. We aim to maximize detection efficiency by minimizing the number of "undecided" hypotheses at a given MC sample size or, equivalently, by making "rejected" or "accepted" decisions for all hypotheses with fewer MC replicates. We present our method in Section 2, examine its properties through simulation studies in Section 3, and apply it to a study of prostate cancer with gene expression data in Section 4. Because one immediate advantage of using a fixed MC sample size is that the wide Robbins-Lai CI required by the sequential GH approach can be replaced by a much narrower single-stage CI,11 in Section 2 we also develop a fixed-MC-sample-size version of GH that uses a single-stage CI, to allow for a fairer comparison in Sections 3 and 4. Finally, we conclude with a discussion in Section 5.

2 | METHODS

Suppose that a fixed number, $n$, of MC replicates have been collected for testing all $m$ hypotheses. For testing the $i$th hypothesis $H_{i,0}$, let $s_i$ be the observed test statistic and $S_{ij}$ be the test statistic calculated from the $j$th MC replicate. The input of our method is the matrix $\mathbf{I} = (I_{ij})_{m \times n}$, where $I_{ij} = I(S_{ij} \ge s_i)$ is the exceedance indicator corresponding to a symmetric, two-sided statistic (note that $I_{ij}$ can be redefined for one-sided tests or tests with asymmetric null distributions). We frequently use the total number of exceedances for each test, $X_i = \sum_{j=1}^n I_{ij}$. Because the replicates are sampled independently from the null distribution, we have $X_i \sim \text{Bin}(n, p_i^*)$, a binomial variable with $n$ trials and "success" rate equal to the ideal p-value $p_i^*$. This is the key assumption we use to develop our method, the same assumption used in developing the GH method.
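A minimal R sketch of this input structure may help (variable names are ours and the toy null distribution is illustrative, not the paper's data):

```r
# Build the exceedance matrix I_ij = 1{S_ij >= s_i}, the counts
# X_i ~ Bin(n, p_i*), and the recommended MC p-values.
set.seed(1)
m <- 200; n <- 1000
s_obs <- abs(rnorm(m, mean = c(rep(0, 160), rep(2, 40))))  # observed |statistics|
S_mc  <- matrix(abs(rnorm(m * n)), nrow = m)               # statistics from n MC replicates
I_mat <- S_mc >= s_obs        # s_obs is recycled down columns: row i vs s_obs[i]
X     <- rowSums(I_mat)       # exceedance counts X_i
p_hat <- (X + 1) / (n + 1)    # recommended MC p-values
```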

We partition the total MC errors in (1) into the type-I and type-II MC errors, which correspond to the first and second terms in (1) and are denoted by $V_I$ and $V_{II}$, respectively. In Sections 2.1, 2.2, and 2.3, we present a procedure that makes decisions on the hypotheses between "rejected" and "non-rejected" while controlling the family-wise type-I MCER $\Pr(V_I \ge 1)$ at $\alpha_1$. In Section 2.4, we present another procedure, which is obtained by modifying the first procedure and makes decisions on the hypotheses between "accepted" and "non-accepted" while controlling the family-wise type-II MCER $\Pr(V_{II} \ge 1)$ at $\alpha_2$. Either of these procedures can be used separately, if control of only one type of MCER is desired. In Section 2.5, we show how to combine the two procedures to determine each hypothesis as "rejected", "accepted", or possibly "undecided", while controlling the overall family-wise MCER at $\alpha_1 + \alpha_2$, because $\Pr(V \ge 1) \le \Pr(V_I \ge 1) + \Pr(V_{II} \ge 1)$. In what follows, we omit the term "family-wise" for simplicity.

2.1 | A procedure that rejects original hypotheses while controlling the type-I MCER

Recall that $\tau^*$ is the BH cutoff for the ideal p-values $p_1^*, p_2^*, \ldots, p_m^*$. If $\tau^*$ is known a priori, the result of the BH procedure can be viewed as the truth of the following hypotheses:

$$H_{i,0}^*: p_i^* > \tau^* \quad \text{vs.} \quad H_{i,1}^*: p_i^* \le \tau^*, \quad i = 1, 2, \ldots, m. \tag{2}$$

Recall that a type-I MC error occurs when a hypothesis $H_{i,0}^*$ is true (ie, not rejected using the ideal p-value) but rejected using the MC p-value; this would happen when the MC p-value is smaller than its corresponding ideal p-value. Then, a test procedure based on MC p-values that controls the family-wise error rate (FWER) for testing the $H_{i,0}^*$ at $\alpha_1$ would control the type-I MCER at $\alpha_1$.

In reality, $\tau^*$ is unknown since the ideal p-values are unknown. A reasonable strategy to control the type-I MCER would be to choose some threshold $\tau_l$ satisfying $\tau_l < \tau^*$; this is intuitive, since, as we make $\tau_l$ smaller, we require smaller MC p-values to reject a hypothesis, and smaller MC p-values correspond to increased confidence in any rejections we make. Thus, we divide the problem of testing hypotheses in (2) into two sub-problems and develop a two-step procedure. In the first step, we find a lower limit $\tau_l$ for $\tau^*$ that satisfies $\tau_l < \tau^*$ with a probability of at least $1 - \tilde{\alpha}_1$, where $\tilde{\alpha}_1$ is chosen such that $0 < \tilde{\alpha}_1 < \alpha_1$. In the second step, we consider the revised hypotheses, treating $\tau_l$ as a constant:

$$\tilde{H}_{i,0}: p_i^* > \tau_l \quad \text{vs.} \quad \tilde{H}_{i,1}: p_i^* \le \tau_l, \quad i = 1, 2, \ldots, m. \tag{3}$$

We develop a test procedure that tests (3) while controlling the FWER at level $\alpha_1 - \tilde{\alpha}_1$. Note that the decision on $\tilde{H}_{i,0}$ is transferred to $H_{i,0}^*$ and $H_{i,0}$. Then, by the relationship

$$\begin{aligned} \{V_I \ge 1\} &= \{\tau_l > \tau^*, V_I \ge 1\} \cup \{\tau_l \le \tau^*, V_I \ge 1\} \\ &\subseteq \{\tau_l > \tau^*\} \cup \{\tau_l \le \tau^*, \text{at least one } H_{i,0}^* \text{ is true but rejected}\} \\ &\subseteq \{\tau_l > \tau^*\} \cup \{\tau_l \le \tau^*, \text{at least one } \tilde{H}_{i,0} \text{ is true but rejected}\}, \end{aligned}$$

the type-I MCER $\Pr(V_I \ge 1)$ is always bounded by the sum of $\Pr(\tau_l > \tau^*)$ and the FWER of the test procedure in testing (3), which is $\tilde{\alpha}_1 + (1 - \tilde{\alpha}_1)(\alpha_1 - \tilde{\alpha}_1) \le \tilde{\alpha}_1 + (\alpha_1 - \tilde{\alpha}_1) = \alpha_1$, since $\tilde{\alpha}_1 < \alpha_1$. As the default, we choose $\tilde{\alpha}_1 = \alpha_1/2$, that is, partitioning the type-I MCER equally between the two sub-problems. We show in Supplementary Materials S1 that the equal partition scheme generally yields the highest power.

2.2 | Step 1: Construct a $(1 - \tilde{\alpha}_1)$-level one-sided CI $[\tau_l, 1]$ for $\tau^*$

We now turn our attention to constructing a one-sided CI $[\tau_l, 1]$ for $\tau^*$. Note that, for any set of p-values, there is a corresponding BH cutoff; thus, we find the lower limit $\tau_l$ for $\tau^*$ from a set of p-values that are "upper limits" for the ideal p-values $p_i^*$ (inflated p-values yield a smaller BH cutoff). To this aim, we consider the following sets of p-values, with each set indexed by a positive continuous value $c$:

$$p_{c,i} = \frac{X_i + 1 + c\sqrt{X_i + 1}}{n + 1 + c\sqrt{X_i + 1}}, \quad i = 1, 2, \ldots, m. \tag{4}$$

Recall that the MC p-value $\hat{p}_i = (X_i + 1)/(n + 1)$ provides a consistent estimate of $p_i^*$. The term $\sqrt{X_i + 1}$ in (4) is an approximation to $\sqrt{n \hat{p}_i (1 - \hat{p}_i)}$, the estimated standard error of $X_i + 1$; the factor $\sqrt{1 - \hat{p}_i}$ is omitted because only the small p-values (in the neighborhood of $\tau^*$) are of interest. As $c$ goes to 0 and $n$ goes to infinity, $p_{c,i}$ converges to $p_i^*$. In addition, $p_{c,i}$ is asymptotically larger than $p_i^*$ for $c > 0$ because $p_{c,i}$ is always larger than $\hat{p}_i$.

Let $F^*(t)$ and $F_c(t)$ be the empirical distribution functions (EDF) of $p_1^*, \ldots, p_m^*$ and $p_{c,1}, \ldots, p_{c,m}$, respectively, and $\tau^*$ and $\tau_c$ be their BH cutoffs. Storey et al12 gave a general result that a BH cutoff $\tau$ can also be obtained from an EDF $F(t)$ of p-values by the formula $\tau = \max\{t : F(t) \ge t/\phi, t \in [0, 1]\}$, which means that $\tau$ is the intersection of $F(t)$ and the straight line $t/\phi$, as illustrated in Figure 1. When $t/\phi$ is constrained to $[0, 1]$ as $F(t)$ is, $t$ must be constrained to $[0, \phi]$. Then we see from Figure 1 (right) that the intersection $\tau$ must lie in $[0, \phi]$ and is fully determined by the EDF of the p-values in the range $[0, \phi]$. If $F_c(t)$ is always no greater than $F^*(t)$ over $[0, \phi]$, as shown in Figure 1 (left), we must have $\tau_c \le \tau^*$. It follows that

$$\Pr(\tau_c \le \tau^*) \ge \Pr(F_c \preceq F^*),$$

in which we use $\preceq$ to denote that one function is always no greater than the other for $t \in [0, \phi]$. Note that $\tau_c$ and $F_c$ are random because $p_{c,i}$ involves $X_i$, which is based on MC replicates; $\tau^*$ and $F^*$ are constant because we condition on the observed data. To construct a $(1 - \tilde{\alpha}_1)$-level one-sided CI $[\tau_l, 1]$ for $\tau^*$, we are interested in $c$ that satisfies

$$\Pr(F_c \preceq F^*) \ge 1 - \tilde{\alpha}_1. \tag{5}$$

FIGURE 1. An example of $F^*(t)$, $F_c(t)$, and $F'_c(t)$ and their corresponding BH cutoffs $\tau^*$, $\tau_c$, and $\tau'_c$. The black straight line is $t/\phi$, where $\phi = 10\%$. The left plot is a zoom-in view of the bottom left corner of the right plot.

The probability on the left side of (5) increases as $c$ increases (because the increase of $p_{c,i}$ leads to the decrease of $F_c(t)$ at a given $t$). To make the CI for $\tau^*$ the tightest, we wish to find the smallest $c$ that satisfies (5), denoted by $c_l$. Finally, we set the lower limit $\tau_l$ to $\tau_{c_l}$, which is the BH cutoff corresponding to the set of p-values in (4) indexed by $c_l$.
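For intuition, a BH cutoff can be computed from any EDF via Storey et al's formula; on the grid $t = \phi k/m$, the condition $F(t) \ge t/\phi$ reduces to $F(\phi k/m) \ge k/m$, and the largest qualifying grid point is exactly the BH cutoff. A hypothetical R helper (ours, not from the paper):

```r
# BH cutoff from an arbitrary EDF via tau = max{t : F(t) >= t/phi}.
bh_cutoff_from_edf <- function(Fhat, m, phi = 0.10) {
  t_grid <- phi * seq_len(m) / m      # on this grid, t/phi = k/m
  ok <- which(Fhat(t_grid) >= seq_len(m) / m)
  if (length(ok) == 0) return(0)
  t_grid[max(ok)]
}
# eg, tau_c for the inflated p-values in (4): bh_cutoff_from_edf(ecdf(p_c), m)
```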

It is difficult to find the smallest $c$ that satisfies (5) analytically because $F^*$ is unknown. Thus, we propose to obtain $c_l$ by bootstrap resampling. In the $b$th bootstrap, we sample $n$ columns from the input matrix $\mathbf{I}$ with replacement to form the bootstrap matrix $\mathbf{I}^{(b)}$ and sum over each row of $\mathbf{I}^{(b)}$ to obtain the corresponding numbers of exceedances $(X_1^{(b)}, X_2^{(b)}, \ldots, X_m^{(b)})$. The number of bootstrap replicates $B$ needs to be large, for example, $B = 10^4$. Define the bootstrap p-values indexed by $c$ as

$$p_{c,i}^{(b)} = \frac{X_i^{(b)} + 1 + c\sqrt{X_i^{(b)} + 1}}{n + 1 + c\sqrt{X_i^{(b)} + 1}}, \quad i = 1, 2, \ldots, m.$$

Let $\hat{F}$ and $F_c^{(b)}$ be the EDFs for $\hat{p}_1, \ldots, \hat{p}_m$ and $p_{c,1}^{(b)}, \ldots, p_{c,m}^{(b)}$, respectively. Then we wish to find $c_l$ to be the smallest $c$ that satisfies

$$B^{-1} \sum_{b=1}^{B} I(F_c^{(b)} \preceq \hat{F}) \ge 1 - \tilde{\alpha}_1, \tag{6}$$

which is an empirical version of (5). In fact, it is still difficult to directly find the smallest $c$ that satisfies (6). As an even simpler alternative, we find, for every $b$, the smallest $c$ that guarantees $F_c^{(b)} \preceq \hat{F}$, denoted by $c_l^{(b)}$, and then set $c_l$ to be the $(1 - \tilde{\alpha}_1)$th quantile of $(c_l^{(1)}, c_l^{(2)}, \ldots, c_l^{(B)})$. This strategy ensures that $(1 - \tilde{\alpha}_1) \times B$ of the indicator functions in (6) take the value 1 when evaluated at $c = c_l$. The procedure is summarized in Step 1 of Algorithm 1.
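A sketch of this per-bootstrap search in R may help; the helpers below (p_c, edf_below, smallest_c) are illustrative names of ours, and the bisection relies on the fact that increasing $c$ inflates the p-values in (4) and thus pushes their EDF down:

```r
# p-values of (4) indexed by c
p_c <- function(X, n, c) (X + 1 + c * sqrt(X + 1)) / (n + 1 + c * sqrt(X + 1))

# Check F_low <= F_ref everywhere on [0, phi] by evaluating at the
# jump points of F_low (both EDFs are over m points).
edf_below <- function(p_low, p_ref, phi) {
  t <- sort(p_low[p_low <= phi])
  if (length(t) == 0) return(TRUE)
  counts_ref <- ecdf(p_ref)(t) * length(p_ref)
  all(counts_ref >= seq_along(t))   # F_low(t_j) = j/m <= F_ref(t_j)
}

# Smallest c with F_c^(b) below F-hat, by bisection (the condition is
# monotone in c; enlarge 'hi' if it does not hold at the upper end).
smallest_c <- function(Xb, p_hat, n, phi, hi = 50, iters = 40) {
  lo <- 0
  for (k in seq_len(iters)) {
    mid <- (lo + hi) / 2
    if (edf_below(p_c(Xb, n, mid), p_hat, phi)) hi <- mid else lo <- mid
  }
  hi
}
# c_l is then the (1 - alpha1_tilde) quantile of smallest_c over the
# B bootstrap column-resamples of I.
```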

2.3 | Step 2: Test hypotheses in (3) given $\tau_l$ while controlling the FWER at $\alpha_1 - \tilde{\alpha}_1$

For testing multiple hypotheses while controlling the FWER, step-wise procedures (where the criterion for rejecting hypotheses becomes less stringent in subsequent steps once some hypotheses have been rejected in an earlier step) such as Holm's13 are typically more powerful than single-step procedures such as the Bonferroni correction. Thus, we develop a step-wise procedure for testing the hypotheses in (3) following the general framework of Romano and Wolf.14,15 We use $X_1, X_2, \ldots, X_m$ defined earlier as the test statistics and let $x_1, x_2, \ldots, x_m$ denote their observed values. Let $x_{(1)} \le x_{(2)} \le \cdots \le x_{(m)}$ be the ordered observed values, and $\tilde{H}_{(1),0}, \tilde{H}_{(2),0}, \ldots, \tilde{H}_{(m),0}$ be the corresponding hypotheses.

We start by testing the global null $\tilde{H}_{(1),0} \cap \tilde{H}_{(2),0} \cap \cdots \cap \tilde{H}_{(m),0}$, which means that all $p_i^*$s are greater than $\tau_l$. We use $x_{(1)}$ as the test statistic for this test. Romano and Wolf14,15 proposed to calculate the p-value under the global null in the form

$$\Pr\Big(\min_{1 \le j \le m} X_j \le x_{(1)}\Big), \tag{7}$$

where each $X_j$ is a binomial random variable with $n$ trials and some "success" rate $\theta_j$ that is under the null and determined below. We use the Bonferroni inequality to obtain an upper bound for (7), namely

$$\sum_{j=1}^m F_{\text{Bin}}(x_{(1)}; n, \theta_j), \tag{8}$$

Algorithm 1. The procedure that rejects original hypotheses while controlling the type-I MCER

Input: matrix $\mathbf{I}$, nominal FDR $\phi$, nominal type-I MCER $\alpha_1$
Step 1: construct a $(1 - \tilde{\alpha}_1)$-level one-sided CI $[\tau_l, 1]$ for $\tau^*$.
 Calculate $(X_1, X_2, \ldots, X_m)$, $(\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_m)$, and then $\hat{F}$.
 For each bootstrap replicate $b = 1, 2, \ldots, B$:
  Sample $n$ columns from $\mathbf{I}$ with replacement to form $\mathbf{I}^{(b)}$.
  Calculate $(X_1^{(b)}, X_2^{(b)}, \ldots, X_m^{(b)})$, $(p_{c,1}^{(b)}, p_{c,2}^{(b)}, \ldots, p_{c,m}^{(b)})$, and then $F_c^{(b)}$.
  Find $c_l^{(b)}$ to be the smallest $c$ that guarantees $F_c^{(b)} \preceq \hat{F}$.
 Find $c_l$, the smallest $c$ that satisfies (5), to be the $(1 - \tilde{\alpha}_1)$ quantile of $(c_l^{(1)}, c_l^{(2)}, \ldots, c_l^{(B)})$.
 Find the lower limit $\tau_l$ to be the BH cutoff of the p-values in (4) indexed by $c_l$.
Step 2: test hypotheses in (3) given $\tau_l$ while controlling the FWER at $\alpha_1 - \tilde{\alpha}_1$.
 Set $i = 1$.
 While $i \le m$:
  If (9) does not hold, accept all remaining hypotheses $\tilde{H}_{(i),0}$s and exit the loop.
  Otherwise, reject $\tilde{H}_{(i),0}$ and update $i = i + 1$.
 Whenever $\tilde{H}_{(i),0}$ is rejected, $H_{(i),0}$ is rejected.
Output: a decision for every hypothesis $H_{i,0}$ between "rejected" and "non-rejected".

where $F_{\text{Bin}}(x; n, \theta)$ denotes $\Pr(X \le x)$ for $X \sim \text{Bin}(n, \theta)$. Expression (8) can be calculated analytically, although this computational efficiency comes at the cost of losing some statistical power compared to using (7) directly. Note that the upper bound in (8) is valid under any dependence structure among the $m$ tests.

Now we consider the problem of determining the $\theta_j$s. When all $p_i^*$s are close to the boundary $\tau_l$, we should choose $\theta_j = \tau_l$ for all $j$s. Hansen16 noted that it is possible to gain more power by making adjustments for null hypotheses that are "deep in the null", that is, $p_i^*$s that are not near the boundary $\tau_l$ but instead satisfy $p_i^* \gg \tau_l$. In the presence of a large number of tests, we expect many $p_i^*$s to be "deep in the null". Intuitively, the hypotheses associated with very large $p_i^*$s can be removed and the "effective" number of tests that we need to adjust for can be considerably reduced, leading to improved power. Ideally, we should choose $\theta_j = \max(\tau_l, p_j^*)$. However, $p_j^*$ is unknown. Hansen16 proposed an estimator for $p_j^*$ that leads to the choice

$$\theta_j = \begin{cases} \tau_l, & \text{if } \sqrt{n}\,(x_j/n - \tau_l)\big/\sqrt{\tau_l(1 - \tau_l)} \le \sqrt{2 \log\log n}, \\ x_j/n, & \text{otherwise}. \end{cases}$$

Therefore, if the statistic $x_j$ is sufficiently large, $\theta_j$ is set to the sample-based estimator $x_j/n$, and otherwise it is left unchanged at $\tau_l$.

Using this $\theta_j$, if

$$\sum_{j=1}^m F_{\text{Bin}}(x_{(1)}; n, \theta_j) \le \alpha_1 - \tilde{\alpha}_1,$$

we reject $\tilde{H}_{(1),0}$ and move to the next joint null $\tilde{H}_{(2),0} \cap \cdots \cap \tilde{H}_{(m),0}$, which excludes $\tilde{H}_{(1),0}$; otherwise, we stop and declare that none of the hypotheses should be rejected. In general, for testing the joint null $\tilde{H}_{(i),0} \cap \cdots \cap \tilde{H}_{(m),0}$ for any $i$, we use $x_{(i)}$ as the test statistic. If

$$\sum_{j=i}^m F_{\text{Bin}}(x_{(i)}; n, \theta_j) \le \alpha_1 - \tilde{\alpha}_1, \tag{9}$$

we reject $\tilde{H}_{(i),0}$ and move to the next joint null, which excludes $\tilde{H}_{(i),0}$; otherwise, we stop and accept the remaining hypotheses. This step-wise procedure asymptotically controls the FWER in testing (3) at $\alpha_1 - \tilde{\alpha}_1$.15,16 Whenever $\tilde{H}_{(i),0}$ is rejected, the corresponding original hypothesis $H_{(i),0}$ is rejected. This procedure is summarized in Step 2 of Algorithm 1.

The step-wise procedure here can be understood as a combination of Holm's test and Hansen's adjustment. Without Hansen's adjustment, this procedure reduces to Holm's test. Specifically, the left-hand side of (9), with every $\theta_j$ replaced by $\tau_l$, becomes $(m - i + 1) p_{(i)}$, where $p_{(i)} = F_{\text{Bin}}(x_{(i)}; n, \tau_l)$ is the $i$th smallest p-value among all p-values, the one that tests $\tilde{H}_{(i),0}$. Thus, (9) becomes $p_{(i)} \le (\alpha_1 - \tilde{\alpha}_1)/(m - i + 1)$, which is Holm's test. By adjusting for null hypotheses that are "deep in the null", we have $F_{\text{Bin}}(x_{(i)}; n, \theta_j) \approx 0$ for very large $\theta_j$s, and (9) becomes approximately $p_{(i)} \le (\alpha_1 - \tilde{\alpha}_1)/m_0$, where $m_0$ is smaller than $m - i + 1$. Therefore, our procedure gains more power than Holm's test.
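The following R sketch (an assumed implementation of Step 2, not the authors' package) makes the step-down with Hansen's adjustment explicit:

```r
# Step-down test of the hypotheses in (3): reject in ascending order of
# exceedance counts while the Bonferroni bound (9) stays below alpha.
step2_reject <- function(X, n, tau_l, alpha) {
  stopifnot(tau_l > 0)                       # the bound is vacuous if tau_l = 0
  m <- length(X)
  # Hansen's adjustment: keep tau_l unless x_j/n is far above the boundary
  z <- sqrt(n) * (X / n - tau_l) / sqrt(tau_l * (1 - tau_l))
  theta <- ifelse(z <= sqrt(2 * log(log(n))), tau_l, X / n)
  ord <- order(X)                            # x_(1) <= x_(2) <= ... <= x_(m)
  rejected <- logical(m)
  for (i in seq_len(m)) {
    idx <- ord[i]
    # sum of F_Bin(x_(i); n, theta_j) over hypotheses still in the joint null
    bound <- sum(pbinom(X[idx], n, theta[ord[i:m]]))
    if (bound > alpha) break                 # stop; remaining hypotheses not rejected
    rejected[idx] <- TRUE
  }
  rejected                                   # TRUE transfers to rejecting H_(i),0
}
# eg, rej <- step2_reject(X, n, tau_l, alpha = alpha1 - alpha1_tilde)
```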

2.4 | A procedure that accepts original hypotheses while controlling the type-II MCER

We can readily modify the procedure in Sections 2.1–2.3 to test a set of hypotheses that flip the null and alternative hypotheses $H_{i,0}^*$ and $H_{i,1}^*$ in (2). Thus, a rejection of the new null hypothesis, which is now $H_{i,1}^*$, leads to an acceptance of the original hypothesis $H_{i,0}$. As with the previous procedure, we divide the problem into two sub-problems: the first is to construct an at-least-$(1 - \tilde{\alpha}_2)$-level one-sided CI $[0, \tau_u]$ for $\tau^*$, where $0 < \tilde{\alpha}_2 < \alpha_2$, and the second is to test the hypotheses that flip the null and alternative hypotheses $\tilde{H}_{i,0}$ and $\tilde{H}_{i,1}$ in (3), with $\tau_l$ replaced by $\tau_u$. The details of this procedure are deferred to the Appendix and are summarized in Algorithm 2, which can also be found there. One important feature of this algorithm is that $\tau_u$ is found by using a negative value of $c$ in (4), corresponding to the curve $F'_c(t)$ in Figure 1, guaranteeing that $\tau_l < \tau_u$, a result we require in the next subsection.

2.5 | Combining results of the two procedures

If $H_{i,0}$ is rejected by the first procedure and non-accepted by the second procedure, it is determined as "rejected"; if $H_{i,0}$ is accepted by the second procedure and non-rejected by the first procedure, it is determined as "accepted"; if $H_{i,0}$ is non-rejected and non-accepted, it is determined as "undecided". Note that our method precludes the possibility that $H_{i,0}$ is both rejected and accepted, as this would correspond to concluding that $p_i^* \le \tau_l$ and $p_i^* > \tau_u$ hold simultaneously, which is not possible since $\tau_l < \tau_u$. Finally, the overall MCER among the decisions that are either "rejected" or "accepted" (excluding those "undecided") is controlled at $\alpha_1 + \alpha_2$, the sum of the nominal type-I and type-II MCER levels. The whole workflow of MERIT is depicted in Figure 2.
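In code, the combination rule is a simple conditional; a sketch with illustrative names, where rej and acc are the logical outputs of the rejecting and accepting procedures (the constraint $\tau_l < \tau_u$ rules out rej and acc both being TRUE):

```r
# Map the two procedures' outputs to the three-way decision of Section 2.5.
combine_decisions <- function(rej, acc) {
  ifelse(rej, "rejected", ifelse(acc, "accepted", "undecided"))
}
```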

FIGURE 2. MERIT workflow.

2.6 | Modifying the GH method for a fixed MC sample size

Because the GH method is fully sequential while MERIT uses a pre-determined MC sample size, it is difficult to compare them directly. For this reason, we created a modified GH method with a fixed stopping rule. We also replaced each two-sided Robbins-Lai interval in GH by two one-sided Wilson intervals,17 because the Robbins-Lai interval cannot be one-sided and because the Wilson interval is more efficient.18 In Supplementary Materials S2, we illustrate, using simulated data, that the two-sided Wilson interval (which is the intersection of the two one-sided Wilson intervals) is always narrower than the two-sided Robbins-Lai interval for $0 \le p \le 0.2$, the range of p-values of interest. To complete our modification of GH, we applied the BH procedure to the upper limits of the Wilson intervals to obtain rejected hypotheses and to the lower limits to obtain accepted hypotheses, as in the original GH procedure. Arguments similar to those found in Gandy and Hahn6,7 can be used to prove that our modified GH procedure controls the type-I and type-II MCER.
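For reference, a minimal sketch of a one-sided Wilson bound for $p_i^*$ given $X_i$ exceedances in $n$ replicates (an illustrative helper of ours; in the modified GH method the per-test level would be set by the Bonferroni split across the $m$ tests, as described above):

```r
# One-sided Wilson bound at level 1 - alpha for a binomial proportion.
wilson_bound <- function(X, n, alpha, side = c("upper", "lower")) {
  side <- match.arg(side)
  z <- qnorm(1 - alpha)
  phat <- X / n
  centre <- (phat + z^2 / (2 * n)) / (1 + z^2 / n)
  half <- (z / (1 + z^2 / n)) * sqrt(phat * (1 - phat) / n + z^2 / (4 * n^2))
  if (side == "upper") pmin(1, centre + half) else pmax(0, centre - half)
}
# Per Section 2.6: BH applied to the upper limits gives candidate rejections,
# BH applied to the lower limits gives candidate acceptances; tests decided
# the same way under both are "rejected"/"accepted", the rest "undecided".
```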

3 | SIMULATION STUDIES

3.1 | Setup

We conducted extensive simulation studies to evaluate the performance of our method. We evaluated the two procedures of our method (one rejecting hypotheses and one accepting hypotheses). We also compared MERIT to our modified GH method. In addition, we evaluated the naive approach that applies the BH procedure to the MC p-values, for which we evaluated the type-I and type-II MCER separately. We refer to our method, the modified GH method, and the naive method as MERIT, GH, and NAIVE, respectively.

We considered $m = 100$, 1000, and 5000 tests. For the bulk of our simulations, we assumed that 80% of the $m$ tests are under the null and the remaining tests are under the alternative; we sampled the p-values for tests under the null independently from the uniform distribution $U[0, 1]$, and sampled the p-values for tests under the alternative independently from the Gaussian right-tailed probability model $p_i^* = 1 - \Phi(Z_i)$, where $Z_i \sim N(\beta, 1)$ and $\Phi$ is the standard normal cumulative distribution function. We set $\beta$ to 1.5, 2, and 2.5; a larger value of $\beta$ implies smaller ideal p-values under the alternative and thus higher sensitivity for rejecting those hypotheses. For each pair of values of $m$ and $\beta$, we sampled 100 different sets of ideal p-values $p_1^*, p_2^*, \ldots, p_m^*$. In addition to these simulations, we also considered a scenario in which 80% of tests are under the alternative and a worst-case scenario in which all "significant" ideal p-values (defined to be p-values that would be rejected by the BH procedure) are close to the threshold of significance. The details of these last two settings, along with their results, are provided in Supplementary Materials S3 and S4, respectively. Instead of generating actual data and calculating test statistics, we directly simulated the matrix $\mathbf{I}$ by independently drawing Bernoulli samples with "success" rate $p_i^*$. The MC sample size $n$ ranged between 5,000 and 1,000,000. The nominal type-I and type-II MCER $\alpha_1$ and $\alpha_2$ were both set to 10%; the nominal FDR $\phi$ used to define the BH threshold was also set to 10%.
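Under this design, the simulated input can be generated directly, as in this illustrative R sketch:

```r
# 80% nulls with p* ~ U[0,1]; 20% alternatives with p* = 1 - Phi(Z),
# Z ~ N(beta, 1); exceedance matrix drawn directly as Bernoulli(p_i*).
set.seed(2024)
m <- 1000; n <- 5000; beta <- 2
m1 <- round(0.2 * m)                                   # number of alternatives
p_star <- c(runif(m - m1), 1 - pnorm(rnorm(m1, mean = beta)))
I_mat <- matrix(rbinom(m * n, 1, rep(p_star, times = n)), nrow = m)
X <- rowSums(I_mat)                                    # input to MERIT/GH/NAIVE
```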

For evaluating the rejecting procedure in each method, we use two metrics, the empirical type-I MCER and the sensitivity, the latter of which is

$$\text{Sensitivity} = \frac{\sum_{i=1}^m I(\hat{D}_i = \text{rejected}, D_i^* = \text{rejected})}{\sum_{i=1}^m I(D_i^* = \text{rejected})}.$$

For evaluating the accepting procedure in each method, we also use two metrics, the empirical type-II MCER and the specificity, given by

$$\text{Specificity} = \frac{\sum_{i=1}^m I(\hat{D}_i = \text{accepted}, D_i^* = \text{accepted})}{\sum_{i=1}^m I(D_i^* = \text{accepted})}.$$

For each set of ideal p-values and MC sample size $n$, we evaluated each of these metrics using 1,000 replicates of the sampled matrix $\mathbf{I}$.
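In code, the two metrics for one replicate are simple tabulations (D_hat and D_star are illustrative names for the MC-based and ideal decisions):

```r
# Sensitivity and specificity of the MC decisions against the ideal ones.
sensitivity <- function(D_hat, D_star)
  sum(D_hat == "rejected" & D_star == "rejected") / sum(D_star == "rejected")
specificity <- function(D_hat, D_star)
  sum(D_hat == "accepted" & D_star == "accepted") / sum(D_star == "accepted")
```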

3.2 | Simulation results

Figure 3 shows the type-I MCER and sensitivity for the three methods, MERIT, GH, and NAIVE, for testing $m = 1000$ hypotheses. Each box displays the variation of the results of a method over the 100 sets of ideal p-values: the top and bottom lines represent the 75% and 25% quantiles and the middle bar indicates the median. Each point in each box represents results from 1,000 replicates. We can see that, for any combination of $n$ and $\beta$, both MERIT and GH controlled the type-I MCER below the nominal level, while NAIVE did not. In fact, most of the empirical type-I MCERs from MERIT and GH are zero, which is expected because both methods guard against the worst case in which all "significant" p-values are concentrated at the threshold of significance, while the p-values in these simulations were spread out. Nevertheless, MERIT always had higher sensitivity than GH, and the difference is more pronounced when $n$ is not very large. In fact, MERIT rejected more hypotheses than GH for every set of ideal p-values (results not shown). As expected, the type-I MCER of NAIVE was highly inflated when $n$ was low; although the error rate decreased as $n$ increased, it was still above the nominal level for many sets of ideal p-values even when $n$ was as large as one million. One reason that the sensitivity of NAIVE was so high is that it did not allow "undecided" tests, so that every test was categorized either as "rejected" or "accepted". The results for type-II MCER and specificity are displayed in Figure 4, which shows patterns similar to Figure 3. Note that the specificity remained relatively unchanged for different values of $\beta$ because specificity pertains to null hypotheses only. The proportions of undecided tests are shown in Figure S5, which demonstrates that MERIT consistently yielded a lower proportion of undecided tests than GH in all scenarios.

FIGURE 3. The empirical type-I MCER (upper panel) and sensitivity (lower panel) for $m = 1000$ tests. The gray dashed lines represent the nominal level 10%. Each point represents a different set of ideal p-values; results for each set of p-values are based on 1,000 replicates.

FIGURE 4. The empirical type-II MCER (upper panel) and specificity (lower panel) for $m = 1000$ tests.

The results for $m = 100$ are shown in Figures S6 and S7. With this small number of tests, NAIVE was more likely to control the type-I and type-II MCER because the sampled ideal p-values were less likely to be near the threshold of significance. For the same reason, MERIT had higher sensitivity and specificity than in the $m = 1000$ case (for the same values of $n$ and $\beta$). Again, MERIT and GH both controlled the type-I and type-II MCER, and MERIT always had higher sensitivity and specificity than GH. However, NAIVE still lost control of the MCER for some samples of 100 p-values, even with $10^6$ MC replicates. Note that this behavior with changing $m$ arises because our ideal p-values were assumed to be fairly uniformly spread. If we had assumed a worst-case scenario in which most, or all, "significant" ideal p-values were at or near the threshold, we would not expect to see this improved behavior when $m$ is decreased.

The results for $m = 5000$ are shown in Figures S8 and S9. This scale of testing is more commonly seen in modern biological and biomedical studies of -omics; indeed, the prostate cancer data to which we will apply our method have 6033 tests corresponding to 6033 genes. In this case, NAIVE had highly inflated type-I and type-II MCER even with a million MC replicates, implying that the naive results had very poor reproducibility. Again, MERIT and GH both controlled the type-I and type-II MCER, and MERIT always had higher sensitivity and specificity than GH. Interestingly, while the sensitivity of GH decreased from $m = 1000$ to $m = 5000$, the sensitivity of MERIT remained quite robust, which suggests an increasing advantage of MERIT over GH for very large numbers of tests.

4 | APPLICATION TO THE PROSTATE CANCER STUDY

We applied MERIT, GH, and NAIVE to detect differentially expressed genes in a prostate cancer study.19 The data contain microarray gene expression measurements for 6,033 genes and 102 subjects, comprising 52 prostate cancer patients and 50 healthy controls. We calculated the t-statistic (assuming equal variance) for each gene based on the observed data and $n$ MC replicates (obtained by permuting the case-control labels), from which we obtained the matrix $\mathbf{I}$. As in the simulation studies, we considered a wide range of values for $n$, and we set the nominal type-I and type-II MCER to 10% and the nominal FDR to 10%. To evaluate reproducibility, we obtained results for 10 different runs with different seeds for generating the MC replicates.

The results are shown in Figure 5. In all cases, MERIT yielded more rejected tests (when there were any rejections) and more accepted tests than GH; the advantage of MERIT is clearest when $n$ is between 50,000 and 500,000. Interestingly, the relative numbers of rejections and acceptances of the three methods match well with the sensitivity and specificity in the simulation study with $m = 5000$ and $\beta = 1.5$. To interpret the results in more detail, we focus on the first run with $n = 100{,}000$ MC replicates. MERIT detected 51 genes as differentially expressed and 5961 genes as non-differentially expressed, and left 21 genes "undecided"; in the same run, GH detected no genes as differentially expressed, detected 5954 as non-differentially expressed, and left 79 genes "undecided". The chance that one or more genes declared differentially expressed by MERIT (or GH) would be declared non-differentially expressed with an infinite number of MC replicates is less than 10%; we have a similar guarantee for the genes determined to be non-differentially expressed by MERIT (or GH).

FIGURE 5. Results from analysis of the prostate cancer data. The upper panel shows the number of rejected hypotheses and the lower panel shows the last two digits of the number of accepted hypotheses (the first two digits are 59; eg, the last two digits 35 correspond to the complete number 5935). For $n = 5{,}000$ and $10{,}000$, both MERIT and GH rejected no hypotheses.

Because MERIT includes a bootstrap step, it requires more run time than NAIVE. For the prostate cancer data, MERIT required ~20 min (not including the generation of MC replicates and the calculation of the t-statistics) on a single core for one run with $n = 1{,}000{,}000$. Because the bootstrap procedure in Step 1 is the most computationally intensive part of our algorithm, the run time can be substantially reduced by distributing the computation of bootstrap replicates across multiple cores. For example, one run of MERIT for the prostate cancer data with $n = 1{,}000{,}000$ required only ~2 min on 10 cores.
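One possible way to parallelize the Step-1 bootstrap is sketched below (this reuses I_mat, p_hat, and the smallest_c helper from the earlier sketches and is not the authors' implementation; mclapply forks and is unavailable on Windows, where parLapply from the same package would be used instead):

```r
library(parallel)

B <- 1e4; phi <- 0.10; alpha1_tilde <- 0.05    # assumed settings for illustration
c_l_draws <- mclapply(seq_len(B), function(b) {
  idx <- sample.int(n, n, replace = TRUE)      # resample n columns of I
  Xb <- rowSums(I_mat[, idx])
  smallest_c(Xb, p_hat, n, phi)                # helper from the Section 2.2 sketch
}, mc.cores = 10)
c_l <- quantile(unlist(c_l_draws), probs = 1 - alpha1_tilde, names = FALSE)
```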

5 | DISCUSSION

Monte-Carlo error is a threat to reproducibility in two related ways when testing hypotheses using resampling techniques such as permutation or the bootstrap. First, MC error may lead to false rejection of a hypothesis that we would not reject given the ideal p-value. Second, if we re-run an MC hypothesis testing procedure, we may obtain a different list of rejected or accepted hypotheses. Note that when the outcome "undecided" is available, these two types of MC error are distinct. The control of the MCER guaranteed by MERIT means that, even if two runs give different lists of rejected or accepted hypotheses, we are assured that the chance of reaching the wrong conclusion about any hypothesis is bounded by level $\alpha$. Thus, while successive runs of MERIT may produce different lists of rejected or accepted hypotheses, these lists will only differ by switches between "rejected" and "undecided", or between "accepted" and "undecided", with probabilities at most $\alpha_1$ and $\alpha_2$, respectively. The situation is very different for the NAIVE procedure, where the BH threshold is applied directly to the MC p-values. There, the absence of an "undecided" category means that all errors are switches between "rejected" and "accepted", even though the overall (experiment-level) false discovery rate may be controlled. Many of these false discoveries will be due to MC error alone.

MERIT (as well as our modification of GH) differs from the original GH proposal in that the type-I and type-II MCER can be separately controlled. This may be useful for users who care more about discoveries (ie, rejections) than about lists of hypotheses for which the null hypothesis is accepted. For these users, the type-I MCER may be the only error worth controlling. In this situation, we may set the type-II MCER to $\alpha_2 = 0$ so that we only obtain two possible outcomes: "rejected" or "undecided". Alternatively, a small type-II MCER $\alpha_2$ can be selected so that hypotheses that are very far from the threshold of significance can still be labeled as "accepted".

MERIT achieves control of the type-I MCER at level $\alpha_1$ by lowering the BH threshold for rejecting hypotheses. We note that we could address the chance of switches between "rejected" and "undecided" hypotheses by lowering this threshold even further. If we lowered the threshold such that the probability of observing any switches between "rejected" and "undecided" was bounded by $\alpha_3$, then we could control the probability of observing any irreproducible rejections by $\alpha_1 + \alpha_3$. This control would, however, come at the cost that some hypotheses that can be rejected by controlling the type-I MCER at level $\alpha_1$ would no longer be rejected by the new criterion. Still, it may be worth considering. Finally, the number of ideal p-values near the threshold of significance controls the number of "undecided" hypotheses in MERIT, and the distance between the MC p-value and this threshold presumably controls the likelihood of a switch between "rejected" and "undecided". In our experience, when there are a large number of "undecided" hypotheses, changing the nominal FDR level $\phi$ so that the threshold moves to a region with relatively few empirical p-values can sometimes reduce the number of "undecided" hypotheses.

When there are "undecided" hypotheses, it is tempting to run the MERIT algorithm several times using the same number of replicates, in the hope of drawing some inference about which hypotheses switch between undecided and decided (eg, rejected). Unfortunately, there does not seem to be a simple way to combine the information in several MERIT runs other than running MERIT on the pooled set of replicates. Combining the results in any other way is tantamount to developing a sequential version of MERIT. While developing a sequential version of MERIT seems worthwhile, it may be difficult. One approach might be to consider conditional inference, in which we condition on an "undecided" outcome at the previous step. However, our procedure for the second step (see Equation (8)) would need to be modified to account for this conditioning. Alternatively, we could consider an alpha spending approach,20 which is commonly used to address multiple decision endpoints in sequential experiments. However, our two-step approach is non-standard and would require custom decision boundaries. Moreover, this approach may not be ideal for practical applications, as we may find that simply using the maximum number of MC replicates we can afford gives higher detection sensitivity and fewer undecided tests.

When developing MERIT, we chose to control the family-wise MCER, rather than a percentage or rate of MC errors among all decided tests, since this MC “false discovery rate” may be misleading. The chance of an MC error for hypotheses with ideal p-values well below the BH threshold is typically fairly small; the presence of these “easily rejected” hypotheses in a false-discovery-like list of MC errors can therefore lead to fairly loose control of MC error for hypotheses that are closer to the BH threshold, even though the overall error rate is controlled. In contrast, controlling the family-wise MCER bounds the probability of any MC error, which seems to be more appropriate in the current context.

Here we have considered only MC schemes that have a fixed stopping criterion. We feel this is reasonable, as our impression is that most large-scale MC hypothesis testing reported in the biological sciences uses a fixed MC sample size. This also enables us to achieve higher efficiency, as it allows a fairly sophisticated correction for multiple testing to be applied only once, after all tests have been completed. For fairness, we also modified the original, sequential GH approach to obtain a more efficient version for use with a fixed stopping criterion. Even with this boost, we showed that MERIT outperformed GH unless the MC sample size was 1,000,000 or greater, after which the performance of the two methods was similar. In this sense, our method is the best available approach to MC hypothesis testing that features control of the MCER with a fixed stopping criterion. We have implemented our method in the R package MERIT, available on GitHub at https://github.com/yijuanhu/MERIT in formats appropriate for Macintosh or Windows.


Funding information

National Institute of General Medical Sciences, Grant/Award Numbers: R01GM116065, R01GM141074

APPENDIX A. A PROCEDURE THAT ACCEPTS ORIGINAL HYPOTHESES WHILE CONTROLLING THE TYPE-II MCER

We consider the following hypotheses that “flip” the hypotheses in (2):

$$H_{i,0}^{*\prime}: p_i^* \le \tau^* \quad \text{vs.} \quad H_{i,1}^{*\prime}: p_i^* > \tau^*, \quad i = 1, 2, \ldots, m.$$

Note that a rejection of $H_{i,0}^{*\prime}$ corresponds to an acceptance of the original hypothesis $H_{i,0}$. Recall that a type-II MC error occurs when $H_{i,0}^{*\prime}$ is true (hence $D_i^* = \text{rejected}$) but rejected using the MC p-value (hence $\hat{D}_i = \text{accepted}$). Then, a test procedure based on MC p-values that controls the FWER for testing the $H_{i,0}^{*\prime}$ at $\alpha_2$ would control the type-II MCER at $\alpha_2$. Here, we develop a two-step procedure following the same idea as in Sections 2.1–2.3. In the first step, we construct an at-least-$(1 - \tilde{\alpha}_2)$-level one-sided CI $[0, \tau_u]$ for $\tau^*$, where $0 < \tilde{\alpha}_2 < \alpha_2$. In the second step, we consider the revised hypotheses, treating $\tau_u$ as fixed:

$$\tilde{H}'_{i,0}: p_i^* \le \tau_u \quad \text{vs.} \quad \tilde{H}'_{i,1}: p_i^* > \tau_u, \quad i = 1, 2, \ldots, m. \tag{A1}$$

We develop a test procedure that tests (A1) while controlling the FWER at level $\alpha_2 - \tilde{\alpha}_2$. Then, by the relationship

$$\begin{aligned} \{V_{II} \ge 1\} &= \{\tau_u < \tau^*, V_{II} \ge 1\} \cup \{\tau_u \ge \tau^*, V_{II} \ge 1\} \\ &\subseteq \{\tau_u < \tau^*\} \cup \{\tau_u \ge \tau^*, \text{at least one } H_{i,0}^{*\prime} \text{ is true but rejected}\} \\ &\subseteq \{\tau_u < \tau^*\} \cup \{\tau_u \ge \tau^*, \text{at least one } \tilde{H}'_{i,0} \text{ is true but rejected}\}, \end{aligned}$$

the type-II MCER $\Pr(V_{II} \ge 1)$ is always bounded by the sum of $\Pr(\tau_u < \tau^*)$ and the FWER in testing (A1), which is at most $\tilde{\alpha}_2 + (\alpha_2 - \tilde{\alpha}_2) = \alpha_2$. As the default choice, we choose $\tilde{\alpha}_2 = \alpha_2/2$.

Step 1: construct a $(1 - \tilde{\alpha}_2)$-level one-sided CI $[0, \tau_u]$ for $\tau^*$

We find $\tau_u$ from a set of p-values that are "lower limits" for the ideal p-values (deflated p-values yield a larger BH cutoff). To this aim, we consider the following sets of p-values, with each set indexed by a positive continuous value $c$:

$$p'_{c,i} = \frac{X_i + 1 - c\sqrt{X_i + 1}}{n + 1 - c\sqrt{X_i + 1}}. \tag{A2}$$

Note that we impose the restriction $c < \sqrt{n + 1}$; otherwise $n + 1 - c\sqrt{X_i + 1}$ may become 0 or even negative, and then $p'_{c,i}$ is not well-defined. However, negative values of the numerator are allowed, as illustrated in Figure 1, where the curve $F'_c(t)$ has a non-zero value at $t = 0$. As $c$ goes to 0, $p'_{c,i}$ converges to $p_i^*$. When $c$ is sufficiently large, $p'_{c,i}$ is asymptotically smaller than $p_i^*$, at least for small $p_i^*$s.

Let $F'_c(t)$ be the EDF of $p'_{c,1}, \ldots, p'_{c,m}$ and $\tau'_c$ be the corresponding BH cutoff. If $F'_c(t)$ is always no less than $F^*(t)$ over $(0, \phi]$, as shown in Figure 1 (left), we must have $\tau'_c \ge \tau^*$. It follows that

$$\Pr(\tau'_c \ge \tau^*) \ge \Pr(F^* \preceq F'_c).$$

To construct a $(1 - \tilde{\alpha}_2)$-level one-sided CI $[0, \tau_u]$ for $\tau^*$, we are interested in $c$ that satisfies

$$\Pr(F^* \preceq F'_c) \ge 1 - \tilde{\alpha}_2. \tag{A3}$$

The probability on the left side of (A3) increases as $c$ increases (because the decrease of $p'_{c,i}$ leads to the increase of $F'_c(t)$ at a given $t$). Thus we wish to find the smallest $c$ that satisfies (A3), denoted by $c_u$. Finally, we set the upper limit $\tau_u$ to $\tau'_{c_u}$.

We obtain $c_u$ using the same bootstrap replicates as in Section 2.2. Define the bootstrap p-values indexed by $c$ as

$$p_{c,i}^{\prime(b)} = \frac{X_i^{(b)} + 1 - c\sqrt{X_i^{(b)} + 1}}{n + 1 - c\sqrt{X_i^{(b)} + 1}}, \quad i = 1, 2, \ldots, m.$$

Let $F_c^{\prime(b)}$ be the EDF for $(p_{c,1}^{\prime(b)}, \ldots, p_{c,m}^{\prime(b)})$. We wish to find $c_u$ to be the smallest $c$ that satisfies

$$B^{-1} \sum_{b=1}^{B} I(\hat{F} \preceq F_c^{\prime(b)}) \ge 1 - \tilde{\alpha}_2,$$

Algorithm 2. The procedure that accepts original hypotheses while controlling the type-II MCER

Input: matrix $\mathbf{I}$, nominal FDR $\phi$, nominal type-II MCER $\alpha_2$
Step 1: construct a $(1 - \tilde{\alpha}_2)$-level one-sided CI $[0, \tau_u]$ for $\tau^*$.
 Calculate $(X_1, X_2, \ldots, X_m)$, $(\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_m)$, and then $\hat{F}$.
 For each bootstrap replicate $b = 1, 2, \ldots, B$:
  Sample $n$ columns from $\mathbf{I}$ with replacement to form $\mathbf{I}^{(b)}$.
  Calculate $(X_1^{(b)}, X_2^{(b)}, \ldots, X_m^{(b)})$, $(p_{c,1}^{\prime(b)}, p_{c,2}^{\prime(b)}, \ldots, p_{c,m}^{\prime(b)})$, and then $F_c^{\prime(b)}$.
  Find $c_u^{(b)}$ to be the smallest $c$ that guarantees $\hat{F} \preceq F_c^{\prime(b)}$.
 Find $c_u$, the smallest $c$ that satisfies (A3), to be the $(1 - \tilde{\alpha}_2)$ quantile of $(c_u^{(1)}, c_u^{(2)}, \ldots, c_u^{(B)})$.
 Find the upper limit $\tau_u$ to be the BH cutoff of the p-values in (A2) indexed by $c_u$.
Step 2: test hypotheses in (A1) given $\tau_u$ while controlling the FWER at $\alpha_2 - \tilde{\alpha}_2$.
 Set $i = m$.
 While $i \ge 1$:
  If (A4) does not hold, accept all remaining hypotheses $\tilde{H}'_{(i),0}$s and exit the loop.
  Otherwise, reject $\tilde{H}'_{(i),0}$ and update $i = i - 1$.
 Whenever $\tilde{H}'_{(i),0}$ is rejected, $H_{(i),0}$ is accepted.
Output: a decision for every hypothesis $H_{i,0}$ between "accepted" and "non-accepted".

which is the empirical version of (A3). To this aim, we find, for every $b = 1, \ldots, B$, the smallest $c$ that guarantees $\hat{F} \preceq F_c^{\prime(b)}$, denoted by $c_u^{(b)}$, and then set $c_u$ to be the $(1 - \tilde{\alpha}_2)$th quantile of $(c_u^{(1)}, \ldots, c_u^{(B)})$. The procedure here is summarized in Step 1 of Algorithm 2.

Step 2: test hypotheses in (A1) given $\tau_u$ while controlling the FWER at $\alpha_2 - \tilde{\alpha}_2$

Let $\tilde{H}'_{(1),0}, \tilde{H}'_{(2),0}, \ldots, \tilde{H}'_{(m),0}$ be the ordered hypotheses in (A1) corresponding to the ordered observed test statistics $x_{(1)}, x_{(2)}, \ldots, x_{(m)}$. We start by testing the global null $\tilde{H}'_{(1),0} \cap \tilde{H}'_{(2),0} \cap \cdots \cap \tilde{H}'_{(m),0}$, which means that all $p_i^*$ are no greater than $\tau_u$. We use $x_{(m)}$ as the test statistic for this test. Following Romano and Wolf, we propose to calculate the p-value under the global null as

$$\Pr\Big(\max_{1 \le j \le m} X_j \ge x_{(m)}\Big),$$

where each $X_j$ is a binomial random variable with $n$ trials and "success" rate $\theta'_j$ that is under the null. Using the Bonferroni inequality, we obtain an upper bound on the p-value,

$$\sum_{j=1}^m F'_{\text{Bin}}(x_{(m)}; n, \theta'_j),$$

where $F'_{\text{Bin}}(x; n, \theta)$ denotes $\Pr(X \ge x)$ for $X \sim \text{Bin}(n, \theta)$. Using Hansen's adjustments for null hypotheses that are "deep in the null", we choose $\theta'_j$ to be

$$\theta'_j = \begin{cases} \tau_u, & \text{if } \sqrt{n}\,(x_j/n - \tau_u)\big/\sqrt{\tau_u(1 - \tau_u)} \ge -\sqrt{2 \log\log n}, \\ x_j/n, & \text{otherwise}. \end{cases}$$

Using this $\theta'_j$, if

$$\sum_{j=1}^m F'_{\text{Bin}}(x_{(m)}; n, \theta'_j) \le \alpha_2 - \tilde{\alpha}_2,$$

we reject $\tilde{H}'_{(m),0}$ and move to the next joint null $\tilde{H}'_{(1),0} \cap \cdots \cap \tilde{H}'_{(m-1),0}$, which excludes $\tilde{H}'_{(m),0}$; otherwise, we stop and declare that none of the hypotheses should be rejected. In general, for testing the joint null $\tilde{H}'_{(1),0} \cap \cdots \cap \tilde{H}'_{(i),0}$ for any $i$, we use $x_{(i)}$ as the test statistic. If

$$\sum_{j=1}^{i} F'_{\text{Bin}}(x_{(i)}; n, \theta'_j) \le \alpha_2 - \tilde{\alpha}_2, \tag{A4}$$

we reject $\tilde{H}'_{(i),0}$ and move to the next joint null, which excludes $\tilde{H}'_{(i),0}$; otherwise, we stop and accept the remaining hypotheses. This step-wise procedure asymptotically controls the FWER in testing (A1) at $\alpha_2 - \tilde{\alpha}_2$. Whenever $\tilde{H}'_{(i),0}$ is rejected, the corresponding original hypothesis $H_{(i),0}$ is accepted. This procedure is summarized in Step 2 of Algorithm 2.

Footnotes

SUPPORTING INFORMATION

Additional supporting information can be found online in the Supporting Information section at the end of this article.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are openly available in the R package sda in the repository CRAN.

REFERENCES

1. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B Methodol. 1995;57:289–300.
2. Davison A, Hinkley D. Bootstrap Methods and Their Application. Cambridge: Cambridge University Press; 1997.
3. Manly BF. Randomization, Bootstrap and Monte Carlo Methods in Biology. Vol 70. Boca Raton, FL: CRC Press; 2006.
4. Phipson B, Smyth GK. Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn. Stat Appl Genet Mol Biol. 2010;9(1). doi:10.2202/1544-6115.1585
5. Sandve GK, Ferkingstad E, Nygård S. Sequential Monte Carlo multiple testing. Bioinformatics. 2011;27(23):3235–3241.
6. Gandy A, Hahn G. MMCTest-a safe algorithm for implementing multiple Monte Carlo tests. Scand J Stat. 2014;41(4):1083–1101.
7. Gandy A, Hahn G. A framework for Monte Carlo based multiple testing. Scand J Stat. 2016;43(4):1046–1063.
8. Gandy A, Hahn G. QuickMMCTest: quick multiple Monte Carlo testing. Stat Comput. 2017;27(3):823–832.
9. Darling D, Robbins H. Some further remarks on inequalities for sample sums. Proc Natl Acad Sci U S A. 1968;60(4):1175.
10. Lai TL. On confidence sequences. Ann Stat. 1976;4(2):265–280.
11. Coe PR, Tamhane AC. Exact repeated confidence intervals for Bernoulli parameters in a group sequential clinical trial. Control Clin Trials. 1993;14(1):19–29.
12. Storey JD, Taylor JE, Siegmund D. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J R Stat Soc Series B Stat Methodol. 2004;66(1):187–205.
13. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat. 1979;6:65–70.
14. Romano JP, Wolf M. Exact and approximate stepdown methods for multiple hypothesis testing. J Am Stat Assoc. 2005;100(469):94–108.
15. Romano JP, Wolf M. Multiple testing of one-sided hypotheses: combining Bonferroni and the bootstrap. International Conference of the Thailand Econometrics Society. New York: Springer; 2018:78–94.
16. Hansen PR. A test for superior predictive ability. J Bus Econ Stat. 2005;23(4):365–380.
17. Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc. 1927;22(158):209–212.
18. Brown LD, Cai TT, DasGupta A. Interval estimation for a binomial proportion. Stat Sci. 2001;16:101–117.
19. Singh D, Febbo PG, Ross K, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002;1(2):203–209.
20. DeMets D, Lan K. Interim analysis: the alpha spending function approach. Stat Med. 1994;13:1341–1352.
