Abstract
The use of Monte-Carlo (MC) p-values when testing the significance of a large number of hypotheses is now commonplace. In large-scale hypothesis testing, we will typically encounter at least some p-values near the threshold of significance, which require a larger number of MC replicates than p-values that are far from the threshold. As a result, some incorrect conclusions can be reached due to MC error alone; for hypotheses near the threshold, even a very large number (eg, $10^6$) of MC replicates may not be enough to guarantee that the conclusions reached using MC p-values agree with those based on the ideal p-values. Gandy and Hahn (GH)6–8 have developed the only method that directly addresses this problem. They defined a Monte-Carlo error rate (MCER) to be the probability that any decisions on accepting or rejecting a hypothesis based on MC p-values are different from decisions based on ideal p-values; their method then makes decisions by controlling the MCER. Unfortunately, the GH method is frequently very conservative, often making no rejections at all and leaving a large number of hypotheses “undecided”. In this article, we propose MERIT, a method for large-scale MC hypothesis testing that also controls the MCER but is more statistically efficient than the GH method. Through extensive simulation studies, we demonstrate that MERIT controls the MCER while making more decisions that agree with the ideal p-values than GH does. We also illustrate our method by an analysis of gene expression data from a prostate cancer study.
Keywords: bootstrap, false discovery rate, high-dimensional, permutation, reproducibility
1 |. INTRODUCTION
Modern scientific studies often require testing a large number of hypotheses; in particular, modern biological and biomedical studies of -omics such as genomics, proteomics, and metabolomics typically test hundreds or thousands of hypotheses at a time. In these studies, p-values for testing individual hypotheses are first obtained and a procedure that corrects for multiplicity, such as the procedure of Benjamini and Hochberg (BH),1 is then applied to make decisions on the hypotheses. Specifically, denote the null hypotheses by $H_1, \ldots, H_m$. Ideally, the p-values are calculated based on an (accurate) analytical method or exhaustive resampling of replicates; we call these ideal p-values and denote them by $p_1, \ldots, p_m$. When the BH procedure with nominal false-discovery rate (FDR) $\alpha$ is applied to the ideal p-values, they are sorted in ascending order as $p_{(1)} \le \cdots \le p_{(m)}$ and then the largest integer $k$ such that $p_{(k)} \le k\alpha/m$ is found; the hypotheses corresponding to $p_{(1)}, \ldots, p_{(k)}$ are rejected while the remaining hypotheses are accepted. Thus, we call $\tau = k\alpha/m$ the cutoff for the ideal p-values, which separates the ideal p-values for rejected and accepted hypotheses.
When the ideal p-values cannot be calculated analytically, we frequently obtain p-values by resampling methods such as permutation or the bootstrap. For ease of exposition, hereafter we generically refer to any resampling-based method as a Monte-Carlo (MC) method, and refer to individual resamples as MC replicates. The recommended form of an MC p-value is $\hat p_i = (X_i + 1)/(n + 1)$ for the $i$th hypothesis,2–4 where $n$ is the number of MC replicates (ie, the MC sample size) and $X_i$ is the number of “exceedances”, that is, the number of times the statistic based on an MC replicate “exceeds” the observed test statistic. When using a finite $n$, there will be a difference between the MC p-value $\hat p_i$ and the ideal p-value $p_i$.
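To make the estimator concrete, here is a minimal R sketch (the function name is ours):

```r
## MC p-value with X exceedances among n replicates: (X + 1) / (n + 1),
## which can never be exactly zero (Phipson and Smyth, reference 4).
mc_pvalue <- function(X, n) (X + 1) / (n + 1)

mc_pvalue(X = 4, n = 999)  # 0.005
```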
It is well known that the MC sample size required to test a hypothesis having an ideal p-value near the BH cutoff $\tau$ is much larger than that required for a hypothesis having an ideal p-value that is far from $\tau$. What is less appreciated is that, when testing a large number of hypotheses, it is very likely that at least some hypotheses have ideal p-values near $\tau$. As a result, if we repeat the MC experiment with a different set of replicates (and so obtain a slightly different set of MC p-values), there can be noticeable variability in the list of hypotheses that are rejected. The source of this variability is that random fluctuations in the MC procedure for a hypothesis having an ideal p-value near the BH cutoff can result in the MC p-value randomly crossing the BH cutoff. Because the MC sample size is finite, these lists can also differ from the list based on the ideal p-values. This is true whether the test procedure adopts a fixed stopping criterion, that is, mandating a fixed MC sample size for all hypotheses, or a sequential stopping criterion,5 that is, allowing some tests to stop early after collecting sufficient information about their decisions. Note that even for test procedures (eg, the procedure of Sandve et al5) that always control the FDR in any list of rejected hypotheses resulting from an arbitrary set of MC replicates, there can still be variability in these lists. This failure of reproducibility is clearly undesirable to investigators. Even a large number (eg, $10^6$) of MC replicates may not be enough to guarantee a list free of MC error.
To date, only Gandy and Hahn (GH)6–8 have directly addressed this problem. They developed a method that can be coupled with any MC test method that has a sequential stopping criterion. Their method makes decisions on multiple hypotheses by controlling the family-wise Monte-Carlo error rate (MCER), which is the probability that any of their decisions on accepting or rejecting a hypothesis are different from what would have been obtained from the ideal p-values. It works in a sequential manner, giving a decision among “rejected”, “accepted”, and “undecided” to each test after each iteration and continuing sampling only for hypotheses that are “undecided”.
Denote the decisions (between “rejected” and “accepted”) made by the BH procedure applied to the ideal p-values by $\delta_1^0, \ldots, \delta_m^0$. Denote the decisions (among “rejected”, “accepted”, and “undecided”) made by an MC test procedure by $\delta_1, \ldots, \delta_m$. Following GH, we define the total number of MC errors to be the sum of all disagreements between the $\delta_i$ and $\delta_i^0$ among hypotheses that are determined as “rejected” or “accepted” by the MC test procedure:
$$E = \sum_{i=1}^{m} I(\delta_i = \text{rejected},\ \delta_i^0 = \text{accepted}) + \sum_{i=1}^{m} I(\delta_i = \text{accepted},\ \delta_i^0 = \text{rejected}), \qquad (1)$$
where $I(\cdot)$ is an indicator function. Controlling the family-wise MCER at level $\beta$, that is, ensuring $P(E > 0) \le \beta$, assures that, among those tests that the MC test procedure actually decides, the probability that we reach any conclusion different from the decision we would make if we had the ideal p-values is less than $\beta$. Note that there is no error control on “undecided” hypotheses, which would require more MC replicates to reach a decision.
At each iteration of the MC test procedure, the GH method forms a two-sided Robbins-Lai confidence interval (CI)9,10 for each ideal p-value that maintains joint coverage over all iterations (ie, accounting for the possible differences in MC sample size for each test); the level of each CI is determined by using the Bonferroni correction. Then, the GH method applies the BH procedure to the collection of upper CI limits and, separately, to the collection of lower CI limits. Those tests for which both the lower and upper limits are rejected are determined as “rejected”, those for which both limits are accepted are determined as “accepted”, and the remaining tests are determined as “undecided”. The GH method controls the MCER at every iteration. However, it is extremely conservative due to the use of the Bonferroni correction, and thus tends to make no rejections at all when the MC sample size is not very large.
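To make the GH decision rule concrete, the following minimal R sketch applies the BH procedure to the two collections of CI limits, assuming vectors `lower` and `upper` of per-test limits have already been computed (in GH these come from Bonferroni-corrected Robbins-Lai confidence sequences; the function name is ours):

```r
## Three-way GH decision from joint CI limits for the ideal p-values.
gh_decide <- function(lower, upper, alpha = 0.1) {
  rej_up  <- p.adjust(upper, method = "BH") <= alpha  # BH applied to upper limits
  rej_low <- p.adjust(lower, method = "BH") <= alpha  # BH applied to lower limits
  out <- rep("undecided", length(lower))
  out[rej_up & rej_low]   <- "rejected"   # rejected under both sets of limits
  out[!rej_up & !rej_low] <- "accepted"   # accepted under both sets of limits
  out
}
```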
While the sequential stopping criterion has advantages (eg, it can greatly reduce the MC sample size for tests that are either highly insignificant or highly significant), use of a fixed, pre-determined MC sample size is more common. Compared to the sequential stopping criterion, use of a fixed MC sample size reduces the chance of making different decisions on a hypothesis when the MC experiment is repeated. Further, in many instances there may be little or no computational advantage to stopping some tests early, for example, when the cost of evaluating the stopping criterion exceeds the cost of calculating the test statistics. For these reasons, we now restrict our attention to MC procedures with fixed MC sample sizes.
In this article, we propose a method, called MERIT (Monte-Carlo Error Rate control In large-scale MC hypothesis Testing), that makes decisions on the hypotheses by controlling the same family-wise MCER as the GH method, but works for MC test procedures with a fixed stopping criterion. We aim to maximize detection efficiency by minimizing the number of “undecided” hypotheses at a given MC sample size, or, equivalently, by making “rejected” or “accepted” decisions for all hypotheses with fewer MC replicates. We present our method in Section 2, examine its properties through simulation studies in Section 3, and apply it to a study of prostate cancer with gene expression data in Section 4. Because the Robbins-Lai CI required for the sequential GH approach is much wider than any single-stage CI,11 one immediate advantage of using a fixed MC sample size is that a single-stage CI can be used instead; in Section 2 we therefore also develop a fixed-MC-sample-size version of GH that uses a single-stage CI, to allow for a fairer comparison in Sections 3 and 4. Finally, we conclude with a discussion in Section 5.
2 |. METHODS
Suppose that a fixed number, $n$, of MC replicates have been collected for testing all $m$ hypotheses. For testing the $i$th hypothesis $H_i$, let $t_i$ be the observed test statistic and $t_{ij}$ be the test statistic calculated from the $j$th MC replicate. The input of our method is the matrix $Z = (z_{ij})_{m \times n}$, where $z_{ij} = I(|t_{ij}| \ge |t_i|)$ is the exceedance indicator corresponding to a symmetric, two-sided statistic (note that $z_{ij}$ can be redefined for one-sided tests or tests with asymmetric null distributions). We frequently use the total number of exceedances for each test, $X_i = \sum_{j=1}^{n} z_{ij}$. Because the replicates are sampled independently from the null distribution, we have $X_i \sim \text{Binomial}(n, p_i)$, a binomial variable with $n$ trials and “success” rate being the ideal p-value $p_i$. This is the key assumption we use to develop our method, the same as in the development of the GH method.
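A minimal R sketch of this construction, assuming `t_obs` (length $m$) and `t_mc` ($m \times n$) hold the observed and MC test statistics (the names are ours):

```r
## Exceedance indicators for a symmetric two-sided statistic; t_obs recycles
## down the columns of t_mc, so entry (i, j) compares |t_ij| with |t_i|.
Z <- abs(t_mc) >= abs(t_obs)       # m x n logical matrix of z_ij
X <- rowSums(Z)                    # exceedance counts X_i ~ Binomial(n, p_i)
p_hat <- (X + 1) / (ncol(Z) + 1)   # the MC p-values
```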
We partition the total MC errors in (1) into the type-I and type-II MC errors, which correspond to the first and second terms in (1) and are denoted by $E_1$ and $E_2$, respectively. In Sections 2.1, 2.2, and 2.3, we present a procedure that makes decisions on the hypotheses between “rejected” and “non-rejected” while controlling the family-wise type-I MCER at $\beta_1$. In Section 2.4, we present another procedure, which is obtained by modifying the first procedure and makes decisions on the hypotheses between “accepted” and “non-accepted” while controlling the family-wise type-II MCER at $\beta_2$. Either of these procedures can be used separately, if control of only one type of MCER is desired. In Section 2.5, we show how to combine the two procedures to determine each hypothesis as “rejected”, “accepted”, or possibly “undecided”, while controlling the overall family-wise MCER at $\beta_1 + \beta_2$, because $P(E_1 + E_2 > 0) \le P(E_1 > 0) + P(E_2 > 0)$. In what follows, we omit the term “family-wise” for simplicity.
2.1 |. A procedure that rejects original hypotheses while controlling the type-I MCER
Recall that $\tau$ is the BH cutoff for the ideal p-values $p_1, \ldots, p_m$. If $\tau$ were known a priori, the result of the BH procedure could be viewed as the truth of the following hypotheses:

$$\tilde H_{0i}: p_i > \tau \quad \text{versus} \quad \tilde H_{1i}: p_i \le \tau, \qquad i = 1, \ldots, m. \qquad (2)$$
Recall that a type-I MC error occurs when a hypothesis $\tilde H_{0i}$ is true (ie, $H_i$ is not rejected using the ideal p-value) but $H_i$ is rejected using the MC p-value; this would happen when the MC p-value is smaller than its corresponding ideal p-value. Then, a test procedure based on MC p-values that controls the family-wise error rate (FWER) for testing (2) at $\beta_1$ would control the type-I MCER at $\beta_1$.
In reality, $\tau$ is unknown since the ideal p-values are unknown. A reasonable strategy to control the type-I MCER would be to choose some threshold $\tau_L$ satisfying $\tau_L \le \tau$; this is intuitive, since, as we make $\tau_L$ smaller, we require smaller MC p-values to reject a hypothesis, and smaller MC p-values correspond to increased confidence in any rejections we make. Thus, we divide the problem of testing the hypotheses in (2) into two sub-problems and develop a two-step procedure. In the first step, we find a lower limit $\tau_L$ for $\tau$ that satisfies $\tau_L \le \tau$ with a probability of at least $1 - \beta_1'$, where $\beta_1'$ is chosen such that $\beta_1' < \beta_1$. In the second step, we consider the revised hypotheses, treating $\tau_L$ as a constant:
$$\tilde H_{0i}^{L}: p_i > \tau_L \quad \text{versus} \quad \tilde H_{1i}^{L}: p_i \le \tau_L, \qquad i = 1, \ldots, m. \qquad (3)$$
We develop a test procedure that tests (3) while controlling the FWER at level $\beta_1'' = \beta_1 - \beta_1'$. Note that the same decision on $\tilde H_{0i}^{L}$ is transferred to $\tilde H_{0i}$ and $H_i$. Then, by the relationship

$$P(E_1 > 0) \le P(\tau_L > \tau) + P(E_1 > 0,\ \tau_L \le \tau),$$

the type-I MCER is always bounded by the sum of $\beta_1'$ and the FWER of the test procedure in testing (3), which is $\beta_1$, since $\beta_1' + \beta_1'' = \beta_1$. As the default, we choose $\beta_1' = \beta_1'' = \beta_1/2$, that is, partitioning the type-I MCER equally into the two sub-problems. We show in Supplementary Materials S1 that the equal partition scheme generally yields the highest power.
2.2 |. Step 1: Construct a $(1-\beta_1')$-level one-sided CI for $\tau$
We now turn our attention to constructing a one-sided CI for $\tau$. Note that, for any set of p-values, there is a corresponding BH cutoff; thus, we find the lower limit $\tau_L$ for $\tau$ from a set of p-values that are “lower limits” for the ideal p-values $p_1, \ldots, p_m$. To this aim, we consider the following sets of p-values, with each set indexed by a positive continuous value $\lambda$:
$$p_i^{(\lambda)} = \frac{X_i + 1 + \lambda\sqrt{X_i + 1}}{n + 1}, \qquad i = 1, \ldots, m. \qquad (4)$$
Recall that the MC p-value $\hat p_i = (X_i + 1)/(n + 1)$ provides a consistent estimate of $p_i$. The term $\sqrt{X_i + 1}/(n + 1)$ in (4) is an approximation to $\sqrt{\hat p_i (1 - \hat p_i)/n}$, which is the estimated standard error of $\hat p_i$; the factor $1 - \hat p_i$ is omitted because only the small p-values (in the neighborhood of $\tau$) are of interest. As $\lambda$ goes to 0 and $n$ goes to infinity, $p_i^{(\lambda)}$ converges to $p_i$. In addition, $p_i^{(\lambda)}$ is asymptotically larger than $\hat p_i + \lambda\sqrt{\hat p_i (1 - \hat p_i)/n}$ for $\lambda > 0$ because $\sqrt{X_i + 1}/(n + 1)$ is always at least as large as the estimated standard error.
Let $F$ and $\hat F^{(\lambda)}$ be the empirical distribution functions (EDF) of $\{p_i\}$ and $\{p_i^{(\lambda)}\}$, respectively, and let $\tau$ and $\tau^{(\lambda)}$ be their BH cutoffs. Storey et al12 gave a general result that a BH cutoff can also be obtained from an EDF of p-values by the formula $\tau = \sup\{t : \hat F(t) \ge t/\alpha\}$, which means that $\tau$ is the intersection of the EDF and the straight line $y = t/\alpha$, as illustrated in Figure 1. When the EDF is constrained to $[0, 1]$, as any EDF must be, the intersection must be constrained to $t \in [0, \alpha]$. Then we see from Figure 1 (right) that the intersection must lie in $[0, \alpha]$ and is fully determined by the EDF of p-values in the range $[0, \alpha]$. If $\hat F^{(\lambda)}$ is always no greater than $F$ over $[0, \alpha]$, as shown in Figure 1 (left), we must have $\tau^{(\lambda)} \le \tau$. It follows that

$$P\big(\tau^{(\lambda)} \le \tau\big) \ge P\big(\hat F^{(\lambda)} \preceq F\big),$$

in which we use $\preceq$ to denote that one function is always no greater than another function for $t \in [0, \alpha]$. Note that $\hat F^{(\lambda)}$ and $\tau^{(\lambda)}$ are random because $p_i^{(\lambda)}$ involves $X_i$, which is based on MC replicates; $F$ and $\tau$ are constant because we condition on the observed data. To construct a $(1-\beta_1')$-level one-sided CI for $\tau$, we are interested in $\lambda$ that satisfies
$$P\big(\hat F^{(\lambda)} \preceq F\big) \ge 1 - \beta_1'. \qquad (5)$$
FIGURE 1.
An example of the EDFs $F$, $\hat F^{(\lambda)}$, and $\hat F^{(-\lambda)}$ and their corresponding BH cutoffs $\tau$, $\tau^{(\lambda)}$, and $\tau^{(-\lambda)}$. The black straight line is $y = t/\alpha$, where $\alpha$ is the nominal FDR. The left plot is a zoom-in view of the bottom left corner of the right plot.
The probability on the left of (5) increases as $\lambda$ increases (because increasing $\lambda$ decreases $\hat F^{(\lambda)}$ at any given $t$). To make the CI for $\tau$ the tightest, we wish to find the smallest $\lambda$ that satisfies (5), denoted by $\lambda^*$. Finally, we set the lower limit $\tau_L$ to $\tau^{(\lambda^*)}$, which is the BH cutoff corresponding to the set of p-values in (4) indexed by $\lambda^*$.
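For reference, a minimal R sketch of the BH cutoff $\tau = k\alpha/m$ defined in the Introduction, which is equivalent to the intersection of the p-value EDF with the line $t/\alpha$ (the function name is ours):

```r
## BH cutoff of a set of p-values at nominal FDR alpha; returns 0 when
## no hypothesis is rejected.
bh_cutoff <- function(p, alpha = 0.1) {
  m <- length(p)
  k <- max(c(0L, which(sort(p) <= alpha * seq_len(m) / m)))  # largest k in BH
  alpha * k / m
}
```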
It is difficult to find the smallest $\lambda$ that satisfies (5) analytically because $F$ is unknown. Thus, we propose to obtain $\lambda^*$ by bootstrap resampling. In the $b$th bootstrap, we sample $n$ columns from the input matrix $Z$ with replacement to form the bootstrap matrix $Z_b^*$ and sum up the rows of $Z_b^*$ to obtain the corresponding numbers of exceedances $X_{b,i}^*$. The number of bootstrap replicates $B$ needs to be large. Define the bootstrap p-values indexed by $\lambda$ as

$$p_{b,i}^{*(\lambda)} = \frac{X_{b,i}^* + 1 + \lambda\sqrt{X_{b,i}^* + 1}}{n + 1}.$$
Let $\hat F_b^{*(\lambda)}$ and $\hat F$ be the EDFs for $\{p_{b,i}^{*(\lambda)}\}$ and $\{\hat p_i\}$, respectively. Then we wish to find $\lambda^*$ to be the smallest $\lambda$ that satisfies
$$\frac{1}{B} \sum_{b=1}^{B} I\big(\hat F_b^{*(\lambda)} \preceq \hat F\big) \ge 1 - \beta_1', \qquad (6)$$
which is an empirical version of (5). In fact, it is still difficult to directly find the smallest $\lambda$ that satisfies (6). As an even simpler alternative, we find, for every $b$, the smallest $\lambda$ that guarantees $\hat F_b^{*(\lambda)} \preceq \hat F$, denoted by $\lambda_b$, and then set $\lambda^*$ to be the $(1-\beta_1')$th quantile of $\{\lambda_b\}$. This strategy ensures that at least a proportion $1-\beta_1'$ of the indicator functions in (6) take the value of 1 when evaluated at $\lambda^*$. The procedure is summarized in Step 1 of Algorithm 1.
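The following R sketch illustrates Step 1 under the reconstructed form of (4) above; the grid search over $\lambda$, the default settings, and the domination check are implementation assumptions, not the authors' exact choices:

```r
## Step 1 of Algorithm 1 (a sketch). For each bootstrap resample of the columns
## of Z, lambda_b is the smallest lambda on a grid whose bootstrap EDF lies
## below the EDF of the MC p-values over [0, alpha]; lambda* is an upper
## quantile of the lambda_b, and tau_L is the resulting BH cutoff.
p_lambda <- function(X, n, lambda) (X + 1 + lambda * sqrt(X + 1)) / (n + 1)

edf_dominated <- function(p_boot, p_ref, alpha) {
  sb <- sort(p_boot); sr <- sort(p_ref)
  !any(sb <= alpha & sb < sr)  # EDF of p_boot never exceeds EDF of p_ref on [0, alpha]
}

step1_tauL <- function(Z, alpha = 0.1, beta1p = 0.05,  # beta1p: half of a 10% type-I MCER
                       B = 1000, grid = seq(0, 10, by = 0.05)) {
  n <- ncol(Z); X <- rowSums(Z)
  p_hat <- p_lambda(X, n, 0)                       # MC p-values (X + 1) / (n + 1)
  lambda_b <- replicate(B, {
    Xb <- rowSums(Z[, sample(n, replace = TRUE)])  # resample MC replicates
    ok <- vapply(grid, function(l)
      edf_dominated(p_lambda(Xb, n, l), p_hat, alpha), logical(1))
    grid[which(ok)[1]]                             # smallest workable lambda
  })
  lambda_star <- quantile(lambda_b, 1 - beta1p, na.rm = TRUE, names = FALSE)
  bh_cutoff(p_lambda(X, n, lambda_star), alpha)    # tau_L, the lower CI limit
}
```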
2.3 |. Step 2: Test hypotheses in (3) given $\tau_L$ while controlling the FWER at $\beta_1''$
For testing multiple hypotheses while controlling the FWER, step-wise procedures (where the criterion for rejecting hypotheses becomes less stringent in subsequent steps once some hypotheses have been rejected in earlier steps) such as Holm’s13 are typically more powerful than single-step procedures such as the Bonferroni correction. Thus, we develop a step-wise procedure for testing the hypotheses in (3) following the general framework of Romano and Wolf.14,15 We use the $X_i$ defined earlier as the test statistics and let $x_i$ denote their observed values. Let $x_{(1)} \le x_{(2)} \le \cdots \le x_{(m)}$ be the ordered observed values, and $\tilde H_{0(1)}^{L}, \ldots, \tilde H_{0(m)}^{L}$ be the corresponding hypotheses.
We start by testing the global null $\tilde H_{0(1)}^{L} \cap \tilde H_{0(2)}^{L} \cap \cdots \cap \tilde H_{0(m)}^{L}$, which means that all the $p_i$s are greater than $\tau_L$. We use $x_{(1)}$ as the test statistic for this test. Romano and Wolf14,15 proposed to calculate the p-value under the global null in this form:

$$P\Big(\min_{1 \le i \le m} B_i \le x_{(1)}\Big), \qquad (7)$$

where each $B_i$ is a binomial random variable with $n$ trials and some “success” rate $\pi_i$ that is no larger than $p_i$ under the null and is determined below. We use the Bonferroni inequality to obtain an upper bound for (7), namely

$$\sum_{i=1}^{m} B\big(x_{(1)}; n, \pi_i\big), \qquad (8)$$
Algorithm 1.
The procedure that rejects original hypotheses while controlling type-I MCER
| Input: matrix $Z$, nominal FDR $\alpha$, nominal type-I MCER $\beta_1$ |
| Step 1: construct a $(1-\beta_1')$-level one-sided CI for $\tau$. |
| Calculate $\hat p_i$, $\hat F$, and then the BH cutoff $\hat\tau$. |
| For each bootstrap replicate $b = 1, \ldots, B$, |
| Sample $n$ columns from $Z$ with replacement to form $Z_b^*$. |
| Calculate $p_{b,i}^{*(\lambda)}$, $\hat F_b^{*(\lambda)}$, and then $\tau_b^{*(\lambda)}$. |
| Find $\lambda_b$ to be the smallest $\lambda$ that guarantees $\hat F_b^{*(\lambda)} \preceq \hat F$. |
| Find $\lambda^*$, the smallest $\lambda$ that satisfies (5), to be the $(1-\beta_1')$th quantile of $\{\lambda_b\}$. |
| Find the lower limit $\tau_L$ to be the BH cutoff of the p-values in (4) indexed by $\lambda^*$. |
| Step 2: test hypotheses in (3) given $\tau_L$ while controlling the FWER at $\beta_1''$. |
| Set $j = 1$. |
| while $j \le m$, |
| If (9) does not hold, accept all remaining hypotheses $\tilde H_{0(j)}^{L}, \ldots, \tilde H_{0(m)}^{L}$ and exit the loop. |
| Otherwise, reject $\tilde H_{0(j)}^{L}$, update $j = j + 1$. |
| Whenever $\tilde H_{0i}^{L}$ is rejected, $H_i$ is rejected. |
| Output: a decision for every hypothesis between “rejected” and “non-rejected”. |
where $B(x; n, \pi)$ denotes $P(B_i \le x)$ for $B_i \sim \text{Binomial}(n, \pi)$. Expression (8) can be calculated analytically, although this computational efficiency comes at the cost of losing some statistical power compared to using (7) directly. Note that the upper bound in (8) is valid under any dependence structure among the tests.
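A one-line R sketch of the bound (8), assuming the null rates have been stored in `pi_vec` (the function name is ours):

```r
## Bonferroni upper bound on the global-null p-value: each test contributes
## P{Binomial(n, pi_i) <= x}, computed with the binomial CDF pbinom().
bonf_bound <- function(x, n, pi_vec) sum(pbinom(x, size = n, prob = pi_vec))
```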
Now we consider the problem of determining the $\pi_i$s. When all the $p_i$ are close to the boundary $\tau_L$, we should choose $\pi_i = \tau_L$ for all $i$. Hansen16 noted that it is possible to gain more power by making adjustments for null hypotheses that are “deep in the null”, that is, $p_i$s that are not near the boundary but instead satisfy $p_i \gg \tau_L$. In the presence of a large number of tests, we expect many $p_i$ to be “deep in the null”. Intuitively, the hypotheses associated with very large $X_i$ can effectively be removed, so the “effective” number of tests that we need to adjust for can be considerably reduced, leading to improved power. Ideally, we should choose $\pi_i = p_i$. However, $p_i$ is unknown. Hansen16 proposed an estimator for $p_i$ that leads to the following choice: if the statistic $X_i$ is sufficiently large, $\pi_i$ is set to the sample-based estimator $\hat p_i$, and otherwise it is left unchanged at $\tau_L$.
Using this $\pi_i$, if

$$\sum_{i=1}^{m} B\big(x_{(1)}; n, \pi_i\big) \le \beta_1'',$$

we reject $\tilde H_{0(1)}^{L}$ and move to the next joint null that excludes $\tilde H_{0(1)}^{L}$; otherwise, we stop and declare that none of the hypotheses should be rejected. In general, for testing the joint null $\bigcap_{i=j}^{m} \tilde H_{0(i)}^{L}$ for any $j$, we use $x_{(j)}$ as the test statistic. If

$$\sum_{i=j}^{m} B\big(x_{(j)}; n, \pi_{(i)}\big) \le \beta_1'', \qquad (9)$$

we reject $\tilde H_{0(j)}^{L}$ and move to the next joint null that excludes $\tilde H_{0(1)}^{L}, \ldots, \tilde H_{0(j)}^{L}$; otherwise, we stop and accept the remaining hypotheses. This step-wise procedure asymptotically controls the FWER in testing (3) at $\beta_1''$.15,16 Whenever $\tilde H_{0(j)}^{L}$ is rejected, the corresponding original hypothesis $H_{(j)}$ is rejected. This procedure is summarized in Step 2 of Algorithm 1.
The step-wise procedure here can be understood as a combination of Holm’s test and Hansen’s adjustment. Without Hansen’s adjustment, this procedure reduces to Holm’s test. Specifically, the left-hand side of (9), with $\pi_{(i)}$ replaced by $\tau_L$, becomes $(m - j + 1)\, q_{(j)}$, where $q_{(j)} = B(x_{(j)}; n, \tau_L)$ is the $j$th smallest p-value among all $m$ p-values for testing (3). Thus, (9) becomes $q_{(j)} \le \beta_1''/(m - j + 1)$, which is Holm’s test. By adjusting for null hypotheses that are “deep in the null”, we have $B(x_{(j)}; n, \pi_{(i)}) \approx 0$ for very large $x_{(i)}$s, and (9) becomes approximately $q_{(j)} \le \beta_1''/\tilde m_j$, where $\tilde m_j$ is a number that is smaller than $m - j + 1$. Therefore, our procedure gains more power than Holm’s test.
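Putting the pieces together, a minimal R sketch of the step-down loop of Step 2; for brevity it fixes $\pi_i = \tau_L$ for every remaining test (ie, it omits the Hansen adjustment, which would only add power):

```r
## Holm-type step-down test of the hypotheses in (3), using bonf_bound() above.
step2_reject <- function(X, n, tau_L, beta1pp = 0.05) {
  ord <- order(X)                 # ascending: most significant tests first
  rejected <- integer(0)
  for (j in seq_along(ord)) {
    remaining <- ord[j:length(ord)]
    pi_vec <- rep(tau_L, length(remaining))            # no Hansen adjustment
    if (bonf_bound(X[ord[j]], n, pi_vec) > beta1pp) break  # (9) fails: stop
    rejected <- c(rejected, ord[j])                    # reject and continue
  }
  rejected  # indices of hypotheses declared "rejected"
}
```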
2.4 |. A procedure that accepts original hypotheses while controlling the type-II MCER
We can readily modify the procedure in Sections 2.1, 2.2, and 2.3 for testing a set of hypotheses that flip the null and alternative hypotheses $\tilde H_{0i}$ and $\tilde H_{1i}$ in (2). Thus, a rejection of the new null hypothesis, which is now $p_i \le \tau$, leads to an acceptance of the original hypothesis $H_i$. As with the previous procedure, we divide the problem into two sub-problems, the first of which is to construct an at-least-$(1-\beta_2')$-level one-sided CI for $\tau$, where $\beta_2' < \beta_2$, and the second of which is to test the hypotheses that flip the null and alternative hypotheses $\tilde H_{0i}^{L}$ and $\tilde H_{1i}^{L}$ in (3) as well as replace $\tau_L$ by an upper limit $\tau_U$. The details of this procedure are deferred to the Appendix and are summarized in Algorithm 2, which can also be found in that section. One important feature of this algorithm is that $\tau_U$ is found by using a negative value for $\lambda$ in (4), corresponding to the curve $\hat F^{(-\lambda)}$ in Figure 1, guaranteeing that $\tau_L \le \tau_U$, a result we require in the next subsection.
2.5 |. Combining results of the two procedures
If $H_i$ is rejected by the first procedure and non-accepted by the second procedure, it is determined as “rejected”; if $H_i$ is accepted by the second procedure and non-rejected by the first procedure, it is determined as “accepted”; if $H_i$ is non-rejected and non-accepted, it is determined as “undecided”. Note that our method precludes the possibility that $H_i$ is both rejected and accepted, as this would correspond to simultaneously concluding that both $p_i \le \tau_L$ and $p_i > \tau_U$ hold, which is not possible since $\tau_L \le \tau_U$. Finally, the overall MCER among the decisions that are either “rejected” or “accepted” (excluding those “undecided”) is controlled at the level $\beta_1 + \beta_2$, the sum of the type-I and type-II MCERs. The whole workflow of MERIT is depicted in Figure 2.
FIGURE 2.
MERIT workflow.
2.6 |. Modifying the GH method for a fixed MC sample size
Because the GH method is fully sequential while MERIT uses a pre-determined MC sample size, it is difficult to compare them directly. For this reason, we created a modified GH method with a fixed stopping rule. We also replaced each two-sided Robbins-Lai interval in GH by two one-sided Wilson intervals17 because the Robbins-Lai interval cannot be made one-sided and because the Wilson interval is more efficient.18 In Supplementary Materials S2, we illustrate, using simulated data, that the two-sided Wilson interval (which is the intersection of the two one-sided Wilson intervals) is always narrower than the two-sided Robbins-Lai interval for $0 \le p \le 0.2$, the range of p-values of interest. To complete our modification of GH, we applied the BH procedure to the upper limits of the Wilson intervals to obtain rejected hypotheses and to the lower limits to obtain accepted hypotheses, as in the original GH procedure. Arguments similar to those found in Gandy and Hahn6,7 can be used to prove that our modified GH procedure controls the type-I and type-II MCER.
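A minimal R sketch of this modified GH procedure, using one-sided Wilson limits and the `gh_decide()` helper from the Introduction; the Bonferroni split of the level across the $2m$ one-sided intervals shown in the usage line is an assumption:

```r
## One-sided Wilson limits for each ideal p-value, from X exceedances in n
## replicates; gamma is the per-interval (one-sided) error level.
wilson_limits <- function(X, n, gamma) {
  z <- qnorm(1 - gamma)
  phat <- X / n
  center <- (phat + z^2 / (2 * n)) / (1 + z^2 / n)
  halfw  <- z * sqrt(phat * (1 - phat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  list(lower = pmax(0, center - halfw), upper = pmin(1, center + halfw))
}

m <- length(X)
lim <- wilson_limits(X, n, gamma = 0.1 / (2 * m))   # Bonferroni-corrected level
decisions <- gh_decide(lim$lower, lim$upper, alpha = 0.1)
```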
3 |. SIMULATION STUDIES
3.1 |. Setup
We conducted extensive simulation studies to evaluate the performance of our method. We evaluated the two procedures of our method (one rejecting hypotheses and one accepting hypotheses). We also compared MERIT to our modified GH method. In addition, we evaluated the naive approach that applies the BH procedure to the MC p-values, for which we evaluated the type-I and type-II MCER separately. We refer to our method, the modified GH method, and the naive method as MERIT, GH, and NAIVE, respectively.
We considered $m = 100$, 1000, and 5000 tests. For the bulk of our simulations, we assumed that 80% of tests are under the null and the remaining tests are under the alternative; we sampled the p-values for tests under the null independently from the uniform distribution $U(0, 1)$, and sampled the p-values for tests under the alternative independently from the Gaussian right-tailed probability model $p_i = 1 - \Phi(Z_i)$, where $Z_i \sim N(\mu, 1)$ and $\Phi$ is the standard normal cumulative distribution function. We set $\mu$ to 1.5, 2, and 2.5; a larger value of $\mu$ implies higher sensitivity to reject the alternative hypotheses. For each pair of values of $m$ and $\mu$, we sampled 100 different sets of ideal p-values. In addition to these simulations, we also considered a scenario in which 80% of tests are under the alternative and a worst-case scenario in which all “significant” ideal p-values (defined to be p-values that would be rejected by the BH procedure) are close to the threshold of significance. The details of these last two settings, along with their results, are provided in Supplementary Materials S3 and S4, respectively. Instead of generating actual data and calculating test statistics, we directly simulated the matrix $Z$ by independently drawing Bernoulli samples with the “success” rate $p_i$. The MC sample size $n$ ranged between 5,000 and 1,000,000. The nominal type-I and type-II MCER $\beta_1$ and $\beta_2$ were both set to 10%; the nominal FDR $\alpha$ used to define the BH threshold was also set to 10%.
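A minimal R sketch of one simulation scenario under these settings (the Gaussian-model form follows our reconstruction above):

```r
## One replicate with m = 1000 tests, 80% nulls, mu = 2, and n = 10,000.
set.seed(1)
m <- 1000; m0 <- 0.8 * m; mu <- 2; n <- 10000
p_ideal <- c(runif(m0),                     # null p-values: Uniform(0, 1)
             1 - pnorm(rnorm(m - m0, mu)))  # alternatives: 1 - Phi(Z), Z ~ N(mu, 1)
## Exceedance counts drawn directly with success rate p_i, as in the text
X <- rbinom(m, size = n, prob = p_ideal)
```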
For evaluating the rejecting procedure in each method, we use two metrics, the empirical type-I MCER and the sensitivity, the latter of which is

$$\text{sensitivity} = \frac{E\big[\#\{i : \delta_i = \text{rejected and } \delta_i^0 = \text{rejected}\}\big]}{\#\{i : \delta_i^0 = \text{rejected}\}}.$$
For evaluating the accepting procedure in each method, we also use two metrics, the empirical type-II MCER and the specificity, given by

$$\text{specificity} = \frac{E\big[\#\{i : \delta_i = \text{accepted and } \delta_i^0 = \text{accepted}\}\big]}{\#\{i : \delta_i^0 = \text{accepted}\}}.$$
For each set of ideal p-values and each MC sample size $n$, we evaluated each of these metrics using 1,000 replicates of the sampled matrix $Z$.
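A sketch of how the empirical metrics can be computed for one replicate, assuming `dec` holds the MC decisions and `dec0` the ideal BH decisions (the names are ours):

```r
## Per-replicate quantities; averaging the first over replicates estimates the
## type-I MCER, and averaging the ratios estimates sensitivity/specificity.
any_type1_error <- any(dec == "rejected" & dec0 == "accepted")
sensitivity <- sum(dec == "rejected" & dec0 == "rejected") / sum(dec0 == "rejected")
specificity <- sum(dec == "accepted" & dec0 == "accepted") / sum(dec0 == "accepted")
```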
3.2 |. Simulation results
Figure 3 shows the type-I MCER and sensitivity for the three methods, MERIT, GH, and NAIVE, for testing $m = 1000$ hypotheses. Each box displays the variation of the results of a method over the 100 sets of ideal p-values: the top and bottom lines represent the 75% and 25% quantiles and the middle bar indicates the median. Each point in each box represents results from 1,000 replicates. We can see that, for any combination of $n$ and $\mu$, both MERIT and GH controlled the type-I MCER below the nominal level, while NAIVE did not. In fact, most of the empirical type-I MCERs from MERIT and GH are zero, which is expected because both methods guard against the worst case in which all “significant” p-values are concentrated at the threshold of significance, while the p-values in these simulations were spread out. Nevertheless, MERIT always had higher sensitivity than GH, and the difference is more pronounced when $n$ is not very large. In fact, MERIT rejected more hypotheses than GH for every set of ideal p-values (results not shown). As expected, the type-I MCER of NAIVE was highly inflated when $n$ was small; although the error rate decreased as $n$ was increased, it was still above the nominal level for many sets of ideal p-values even when $n$ was as large as one million. One reason that the sensitivity of NAIVE was so high is that it did not allow “undecided” tests, so that every test was categorized either as “rejected” or “accepted”. The results for the type-II MCER and specificity are displayed in Figure 4, which shows similar patterns to Figure 3. Note that the specificity remained relatively unchanged for different values of $\mu$ because specificity pertains to null hypotheses only. The proportions of undecided tests are shown in Figure S5, which demonstrates that MERIT consistently yielded a lower proportion of undecided tests than GH in all scenarios.
FIGURE 3.
The empirical type-I MCER (upper panel) and sensitivity (lower panel) for $m = 1000$ tests. The gray dashed lines represent the nominal level of 10%. Each point represents a different set of ideal p-values; results for each set of p-values are based on 1,000 replicates.
FIGURE 4.
The empirical type-II MCER (upper panel) and specificity (lower panel) for $m = 1000$ tests.
The results for $m = 100$ are shown in Figures S6 and S7. With this small number of tests, NAIVE was more likely to control the type-I and type-II MCER because the sampled ideal p-values were less likely to be near the threshold of significance. For the same reason, MERIT had higher sensitivity and specificity than in the $m = 1000$ case (for the same values of $n$ and $\mu$). Again, MERIT and GH both controlled the type-I and type-II MCER and MERIT always had higher sensitivity and specificity than GH. However, NAIVE still lost control of the MCER for some samples of 100 p-values, even with $10^6$ MC replicates. Note that this behavior with changing $m$ arises because our ideal p-values were assumed to be fairly uniformly spread. If we had assumed a worst-case scenario in which most, or all, “significant” ideal p-values were at or near the threshold, we would not expect to see this improved behavior when $m$ is decreased.
The results for $m = 5000$ are shown in Figures S8 and S9. This scale of testing is more commonly seen in modern biological and biomedical studies of omics. Indeed, the prostate cancer data that we will apply our method to has 6033 tests corresponding to 6033 genes. In this case, NAIVE had highly inflated type-I and type-II MCER even with a million MC replicates, implying that the naive results had very poor reproducibility. Again, MERIT and GH both controlled the type-I and type-II MCER and MERIT always had higher sensitivity and specificity than GH. Interestingly, while the sensitivity of GH dropped in going from $m = 1000$ to $m = 5000$, the sensitivity of MERIT stayed quite robust, which suggests an increasing advantage of MERIT over GH for very large numbers of tests.
4 |. APPLICATION TO THE PROSTATE CANCER STUDY
We applied MERIT, GH, and NAIVE to detect differentially expressed genes in a prostate cancer study.19 The data contain microarray gene expression values for 6,033 genes and 102 subjects, comprising 52 prostate cancer patients and 50 healthy controls. We calculated the t-statistic (assuming equal variance) for each gene based on the observed data and the MC replicates (obtained by permuting the case-control labels), from which we obtained the matrix $Z$. As in the simulation studies, we considered a wide range of values for $n$ and we set the nominal type-I and type-II MCER to 10% and the nominal FDR to 10%. For evaluation of reproducibility, we obtained the results for 10 different runs with different seeds for generating the MC replicates.
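A minimal R sketch of constructing $Z$ for these data; `expr` and `grp` are assumed to hold the expression matrix (genes by subjects) and the 0/1 case-control labels (eg, loaded from the sda package named in the Data Availability Statement), and a small $n$ is used for illustration only:

```r
## Equal-variance two-sample t-statistic for every gene at once.
t_equal_var <- function(M, g) {
  n1 <- sum(g == 1); n0 <- sum(g == 0)
  mu1 <- rowMeans(M[, g == 1]); mu0 <- rowMeans(M[, g == 0])
  v1 <- apply(M[, g == 1], 1, var); v0 <- apply(M[, g == 0], 1, var)
  sp <- ((n1 - 1) * v1 + (n0 - 1) * v0) / (n1 + n0 - 2)  # pooled variance
  (mu1 - mu0) / sqrt(sp * (1 / n1 + 1 / n0))
}

t_obs <- t_equal_var(expr, grp)
n <- 1000  # the actual analysis uses up to 1,000,000 permutations
Z <- replicate(n, abs(t_equal_var(expr, sample(grp))) >= abs(t_obs))
```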
The results are shown in Figure 5. In all cases, MERIT yielded more rejected tests (when there were any rejections) and more accepted tests than GH; the advantage of MERIT is clearest when $n$ is between 50,000 and 500,000. Interestingly, the relative numbers of rejections and acceptances of the three methods seem to match well with the sensitivity and specificity in the simulation study with $m = 5000$. To interpret the results in more detail, we focus on the first run of MC replicates. MERIT detected 51 genes as differentially expressed and 5961 genes as non-differentially expressed, and left 21 genes “undecided”; in the same run, GH detected no genes as differentially expressed and 5954 genes as non-differentially expressed, and left 79 genes “undecided”. The chance that one or more genes declared differentially expressed by MERIT (or GH) would be declared non-differentially expressed with an infinite number of MC replicates is less than 10%; we have a similar guarantee for the genes determined as non-differentially expressed by MERIT (or GH).
FIGURE 5.
Results from the analysis of the prostate cancer data. The upper panel shows the number of rejected hypotheses and the lower panel shows the last two digits of the number of accepted hypotheses (the first two digits are always 59; eg, the last two digits 35 correspond to the complete number 5935). For $n$ = 5000 and 10,000, both MERIT and GH rejected no hypotheses.
Because MERIT includes a bootstrap step, MERIT requires more run time than NAIVE. For the prostate cancer data, MERIT required ~20 min (not including the generation of MC replicates and the calculation of the t-statistics) on a single core for one run with n = 1,000,000. Because the bootstrap procedure in Step 1 is the most computationally intensive part of our algorithm, the run time can be significantly reduced by parallelizing the computation of bootstrap replicates over multiple cores. For example, one run of MERIT for the prostate cancer data with n = 1,000,000 required only ~2 min on 10 cores.
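A sketch of this parallelization using the base `parallel` package; `lambda_for_bootstrap()` is a hypothetical helper standing in for the per-replicate search in Algorithm 1:

```r
## Fork-based parallel bootstrap (mc.cores > 1 requires a Unix-alike system).
library(parallel)
B <- 1000
lambda_b <- unlist(mclapply(seq_len(B), function(b) {
  Xb <- rowSums(Z[, sample(ncol(Z), replace = TRUE)])  # one bootstrap resample
  lambda_for_bootstrap(Xb)  # hypothetical: smallest workable lambda for this b
}, mc.cores = 10))
lambda_star <- quantile(lambda_b, 1 - 0.05, names = FALSE)
```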
5 |. DISCUSSION
Monte-Carlo error threatens reproducibility in two related ways when testing hypotheses using resampling techniques such as permutation or the bootstrap. First, MC error may lead to false rejection of a hypothesis that we would not reject given the ideal p-value. Second, if we re-run an MC hypothesis testing procedure, we may obtain a different list of rejected or accepted hypotheses. Note that when the outcome “undecided” is available, these two types of MC error are distinct. The control of the MCER guaranteed by MERIT means that, even if two runs give different lists of rejected or accepted hypotheses, we are assured that the chance of reaching the wrong conclusion about any hypothesis is bounded by the nominal level. Thus, while successive runs of MERIT may produce different lists of rejected or accepted hypotheses, these lists will only differ by switches between “rejected” and “undecided”, or between “accepted” and “undecided”, with probabilities at most $\beta_1$ and $\beta_2$, respectively. The situation is very different for the NAIVE procedure, where the BH threshold is applied directly to the MC p-values. Here the absence of an “undecided” category means that all errors are switches between “rejected” and “accepted”, even though the overall (experiment-level) false discovery rate may be controlled. Many of these false discoveries will be due to MC error alone.
MERIT (as well as our modification of GH) differs from the original GH proposal in that the type-I and type-II MCER can be separately controlled. This may be useful for users who care more about discoveries (ie, rejections) than about lists of hypotheses for which the null hypothesis is accepted. For these users, the type-I MCER may be the only error worth controlling. In this situation, we may set the type-II MCER to zero, in effect skipping the accepting procedure, so that we only obtain two possible outcomes: “rejected” or “undecided”. Alternatively, a small type-II MCER can be selected so that hypotheses that are very far from the threshold of significance can still be labeled as “accepted”.
MERIT achieves control of the type-I MCER at level $\beta_1$ by lowering the BH threshold for rejecting hypotheses. We note that we could address the chance of switches between “rejected” and “undecided” hypotheses by lowering this threshold even further. If we lowered the threshold such that the probability of observing any switches between “rejected” and “undecided” was bounded by $\beta_1$, then we could control the probability of observing any irreproducible rejections by $\beta_1$. This control would, however, come at the cost that some hypotheses that can be rejected by controlling the type-I MCER at level $\beta_1$ would no longer be rejected by the new criterion. Still, it may be worth considering. Finally, the number of ideal p-values near the threshold of significance controls the number of “undecided” hypotheses in MERIT, and the distance between the MC p-value and this threshold presumably controls the likelihood of a switch between “rejected” and “undecided”. In our experience, when there are a large number of “undecided” hypotheses, changing the nominal FDR to a level whose threshold falls where there are relatively few empirical p-values can sometimes reduce the number of “undecided” hypotheses.
When there are “undecided” hypotheses, it is tempting to run the MERIT algorithm several times using the same number of replicates, in the hopes of drawing some inference about which hypotheses switch between undecided and decided (eg, rejected). Unfortunately, there does not seem to be a simple way to combine the information in several MERIT runs except using MERIT on the pooled set of replicates. Combining the results in any other way is tantamount to developing a sequential version of MERIT. While developing a sequential version of MERIT seems worthwhile, it may be difficult. One approach might be to consider conditional inference, in which we condition on an “undecided” outcome at the previous step. However, our bootstrap procedure for the second step (see Equation 8) would need to be modified to account for this conditioning. Alternatively, we could consider an alpha spending approach,20 which is commonly used to address multiple decision endpoints in sequential experiments. However, our two-step approach is non-standard and would require custom decision boundaries. Moreover, this approach may not be ideal for practical applications as we may find that simply using the maximum number of MC replicates we can afford gives a higher detection sensitivity and a lower number of undecided tests.
When developing MERIT, we chose to control the family-wise MCER, rather than a percentage or rate of MC errors among all decided tests, since this MC “false discovery rate” may be misleading. The chance of an MC error for hypotheses with ideal p-values well below the BH threshold is typically fairly small; the presence of these “easily rejected” hypotheses in a false-discovery-like list of MC errors can therefore lead to fairly loose control of MC error for hypotheses that are closer to the BH threshold, even though the overall error rate is controlled. In contrast, controlling the family-wise MCER bounds the probability of any MC error, which seems to be more appropriate in the current context.
Here we have considered only MC schemes that have a fixed stopping criterion. We feel this is reasonable, as our impression is that most large-scale MC hypothesis testing reported in the biological sciences uses a fixed MC sample size. This also enables us to achieve higher efficiency, as it allows a fairly sophisticated correction for multiple testing to be applied only once, after all tests have been completed. For fairness, we also modified the original, sequential GH approach to allow for a more efficient comparison when a fixed stopping criterion is used. Even with this boost, we showed that MERIT outperformed GH unless the MC sample size was 1,000,000 or greater, after which the performance of the two methods was similar. In this sense, our method is the best available approach to MC hypothesis testing that features control of the MCER with a fixed stopping criterion. We have implemented our method in the R package MERIT, available on GitHub at https://github.com/yijuanhu/MERIT in formats appropriate for Macintosh or Windows.
Supplementary Material
Funding information
National Institute of General Medical Sciences, Grant/Award Numbers: R01GM116065, R01GM141074
APPENDIX A. A PROCEDURE THAT ACCEPTS ORIGINAL HYPOTHESES WHILE CONTROLLING THE TYPE-II MCER
We consider the following hypotheses that “flip” the hypotheses in (2):

$$\bar H_{0i}: p_i \le \tau \quad \text{versus} \quad \bar H_{1i}: p_i > \tau, \qquad i = 1, \ldots, m.$$
Note that a rejection of $\bar H_{0i}$ corresponds to an acceptance of the original hypothesis $H_i$. Recall that a type-II MC error occurs when $\bar H_{0i}$ is true (hence $\delta_i^0 = \text{rejected}$) but $\bar H_{0i}$ is rejected using the MC p-value (hence $\delta_i = \text{accepted}$). Then, a test procedure based on MC p-values that controls the FWER for testing $\{\bar H_{0i}\}$ at $\beta_2$ would control the type-II MCER at $\beta_2$. Here, we develop a two-step procedure following the same idea as in Sections 2.1, 2.2, and 2.3. In the first step, we construct an at-least-$(1-\beta_2')$-level one-sided CI for $\tau$, where $\beta_2' < \beta_2$. In the second step, we consider the revised hypotheses, treating $\tau_U$ as fixed:
$$\bar H_{0i}^{U}: p_i \le \tau_U \quad \text{versus} \quad \bar H_{1i}^{U}: p_i > \tau_U, \qquad i = 1, \ldots, m. \qquad (A1)$$
We develop a test procedure that tests (A1) while controlling the FWER at level $\beta_2'' = \beta_2 - \beta_2'$. Then, by the relationship

$$P(E_2 > 0) \le P(\tau_U < \tau) + P(E_2 > 0,\ \tau_U \ge \tau),$$

the type-II MCER is always bounded by the sum of $\beta_2'$ and the FWER in testing (A1), which is $\beta_2$, since $\beta_2' + \beta_2'' = \beta_2$. As the default choice, we choose $\beta_2' = \beta_2'' = \beta_2/2$.
Step 1: construct a $(1-\beta_2')$-level one-sided CI for $\tau$
We find $\tau_U$ from a set of p-values that are “upper limits” for the ideal p-values. To this aim, we consider the following sets of p-values, with each set indexed by a positive continuous value $\lambda$:
$$p_i^{(-\lambda)} = \frac{X_i + 1 - \lambda\sqrt{X_i + 1}}{n + 1}, \qquad i = 1, \ldots, m. \qquad (A2)$$
Note that, in contrast to (4), the numerator of (A2), $X_i + 1 - \lambda\sqrt{X_i + 1}$, may become 0 or even negative, in which case $p_i^{(-\lambda)}$ is not a well-defined probability; such negative values are nevertheless allowed in the construction, as illustrated in Figure 1, where the curve $\hat F^{(-\lambda)}$ has a non-zero value at $t = 0$. As $\lambda$ goes to 0 and $n$ goes to infinity, $p_i^{(-\lambda)}$ converges to $p_i$. When $n$ is sufficiently large, $p_i^{(-\lambda)}$ is asymptotically smaller than $p_i$, at least for small $p_i$.
Let $\hat F^{(-\lambda)}$ be the EDF of $\{p_i^{(-\lambda)}\}$ and $\tau^{(-\lambda)}$ be the corresponding BH cutoff. If $\hat F^{(-\lambda)}$ is always no less than $F$ over $[0, \alpha]$, as shown in Figure 1 (left), we must have $\tau^{(-\lambda)} \ge \tau$. It follows that

$$P\big(\tau^{(-\lambda)} \ge \tau\big) \ge P\big(\hat F^{(-\lambda)} \succeq F\big).$$

To construct a $(1-\beta_2')$-level one-sided CI for $\tau$, we are interested in $\lambda$ that satisfies
$$P\big(\hat F^{(-\lambda)} \succeq F\big) \ge 1 - \beta_2'. \qquad (A3)$$
The probability on the left of (A3) increases as $\lambda$ increases (because increasing $\lambda$ decreases $p_i^{(-\lambda)}$ and thus increases $\hat F^{(-\lambda)}$ at any given $t$). Thus we wish to find the smallest $\lambda$ that satisfies (A3), denoted by $\lambda^{**}$. Finally, we set the upper limit $\tau_U$ to $\tau^{(-\lambda^{**})}$.
We obtain $\lambda^{**}$ using the same bootstrap replicates as in Section 2.2. Define the bootstrap p-values indexed by $\lambda$ as

$$p_{b,i}^{*(-\lambda)} = \frac{X_{b,i}^* + 1 - \lambda\sqrt{X_{b,i}^* + 1}}{n + 1}.$$
Let $\hat F_b^{*(-\lambda)}$ be the EDF for $\{p_{b,i}^{*(-\lambda)}\}$. We wish to find $\lambda^{**}$ to be the smallest $\lambda$ that satisfies
Algorithm 2.
The procedure that accepts original hypotheses while controlling type-II MCER
| Input: matrix $Z$, nominal FDR $\alpha$, nominal type-II MCER $\beta_2$ |
| Step 1: construct a $(1-\beta_2')$-level one-sided CI for $\tau$. |
| Calculate $\hat p_i$, $\hat F$, and then the BH cutoff $\hat\tau$. |
| For each bootstrap replicate $b = 1, \ldots, B$, |
| Sample $n$ columns from $Z$ with replacement to form $Z_b^*$. |
| Calculate $p_{b,i}^{*(-\lambda)}$, $\hat F_b^{*(-\lambda)}$, and then $\tau_b^{*(-\lambda)}$. |
| Find $\lambda_b$ to be the smallest $\lambda$ that guarantees $\hat F_b^{*(-\lambda)} \succeq \hat F$. |
| Find $\lambda^{**}$, the smallest $\lambda$ that satisfies (A3), to be the $(1-\beta_2')$th quantile of $\{\lambda_b\}$. |
| Find the upper limit $\tau_U$ to be the BH cutoff of the p-values in (A2) indexed by $\lambda^{**}$. |
| Step 2: Test hypotheses in (A1) given $\tau_U$ while controlling the FWER at $\beta_2''$. |
| Set $j = 1$. |
| While $j \le m$, |
| If (A4) does not hold, accept all remaining hypotheses $\bar H_{0(j)}^{U}, \ldots, \bar H_{0(m)}^{U}$ and exit the loop. |
| Otherwise, reject $\bar H_{0(j)}^{U}$, update $j = j + 1$. |
| Whenever $\bar H_{0i}^{U}$ is rejected, $H_i$ is accepted. |
| Output: a decision for every hypothesis between “accepted” and “non-accepted”. |
$$\frac{1}{B} \sum_{b=1}^{B} I\big(\hat F_b^{*(-\lambda)} \succeq \hat F\big) \ge 1 - \beta_2',$$

which is the empirical version of (A3). To this aim, we find, for every $b$, the smallest $\lambda$ that guarantees $\hat F_b^{*(-\lambda)} \succeq \hat F$, denoted by $\lambda_b$, and then set $\lambda^{**}$ to be the $(1-\beta_2')$th quantile of $\{\lambda_b\}$. The procedure here is summarized in Step 1 of Algorithm 2.
Step 2: test hypotheses in (A1) given $\tau_U$ while controlling the FWER at $\beta_2''$
Let $\bar H_{0(1)}^{U}, \ldots, \bar H_{0(m)}^{U}$ be the ordered hypotheses in (A1) that correspond to the ordered observed test statistics $x_{(1)} \ge x_{(2)} \ge \cdots \ge x_{(m)}$ (note that the ordering is descending here). We start by testing the global null $\bar H_{0(1)}^{U} \cap \cdots \cap \bar H_{0(m)}^{U}$, which means that all the $p_i$s are less than or equal to $\tau_U$. We use $x_{(1)}$ as the test statistic for this test. Following Romano and Wolf,14,15 we propose to calculate the p-value under the global null as

$$P\Big(\max_{1 \le i \le m} B_i \ge x_{(1)}\Big),$$

where each $B_i$ is a binomial random variable with $n$ trials and “success” rate $\pi_i$ that is no smaller than $p_i$ under the null. Using the Bonferroni inequality, we obtain an upper bound of the p-value to be

$$\sum_{i=1}^{m} \bar B\big(x_{(1)}; n, \pi_i\big),$$
where $\bar B(x; n, \pi)$ denotes $P(B_i \ge x)$ for $B_i \sim \text{Binomial}(n, \pi)$. Using Hansen’s adjustment for null hypotheses that are “deep in the null” (here, those with $p_i$ far below $\tau_U$), we choose $\pi_i$ to be the sample-based estimator $\hat p_i$ when the statistic $X_i$ is sufficiently small and to be $\tau_U$ otherwise.
Using this $\pi_i$, if

$$\sum_{i=1}^{m} \bar B\big(x_{(1)}; n, \pi_i\big) \le \beta_2'',$$

we reject $\bar H_{0(1)}^{U}$ and move to the next joint null that excludes $\bar H_{0(1)}^{U}$; otherwise, we stop and declare that none of the hypotheses should be rejected. In general, for testing the joint null $\bigcap_{i=j}^{m} \bar H_{0(i)}^{U}$ for any $j$, we use $x_{(j)}$ as the test statistic. If

$$\sum_{i=j}^{m} \bar B\big(x_{(j)}; n, \pi_{(i)}\big) \le \beta_2'', \qquad (A4)$$

we reject $\bar H_{0(j)}^{U}$ and move to the next joint null that excludes $\bar H_{0(1)}^{U}, \ldots, \bar H_{0(j)}^{U}$; otherwise, we stop and accept the remaining hypotheses. This step-wise procedure asymptotically controls the FWER in testing (A1) at $\beta_2''$. Whenever $\bar H_{0(j)}^{U}$ is rejected, the corresponding original hypothesis $H_{(j)}$ is accepted. This procedure is summarized in Step 2 of Algorithm 2.
Footnotes
SUPPORTING INFORMATION
Additional supporting information can be found online in the Supporting Information section at the end of this article.
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in the R package sda in the repository CRAN.
REFERENCES
- 1.Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B Methodol. 1995;57:289–300.
- 2.Davison A, Hinkley D. Bootstrap Methods and their Application. Cambridge: Cambridge University Press; 1997.
- 3.Manly BF. Randomization, Bootstrap and Monte Carlo Methods in Biology. Vol 70. Boca Raton, Florida: CRC Press; 2006.
- 4.Phipson B, Smyth GK. Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn. Stat Appl Genet Mol Biol. 2010;9(1). doi:10.2202/1544-6115.1585
- 5.Sandve GK, Ferkingstad E, Nygård S. Sequential Monte Carlo multiple testing. Bioinformatics. 2011;27(23):3235–3241.
- 6.Gandy A, Hahn G. MMCTest-a safe algorithm for implementing multiple Monte Carlo tests. Scand J Stat. 2014;41(4):1083–1101.
- 7.Gandy A, Hahn G. A framework for Monte Carlo based multiple testing. Scand J Stat. 2016;43(4):1046–1063.
- 8.Gandy A, Hahn G. QuickMMCTest: quick multiple Monte Carlo testing. Stat Comput. 2017;27(3):823–832.
- 9.Darling D, Robbins H. Some further remarks on inequalities for sample sums. Proc Natl Acad Sci U S A. 1968;60(4):1175.
- 10.Lai TL. On confidence sequences. Ann Stat. 1976;4(2):265–280.
- 11.Coe PR, Tamhane AC. Exact repeated confidence intervals for Bernoulli parameters in a group sequential clinical trial. Control Clin Trials. 1993;14(1):19–29.
- 12.Storey JD, Taylor JE, Siegmund D. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J R Stat Soc Series B Stat Methodology. 2004;66(1):187–205.
- 13.Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat. 1979;6:65–70.
- 14.Romano JP, Wolf M. Exact and approximate stepdown methods for multiple hypothesis testing. J Am Stat Assoc. 2005;100(469):94–108.
- 15.Romano JP, Wolf M. Multiple testing of one-sided hypotheses: combining Bonferroni and the bootstrap. International Conference of the Thailand Econometrics Society. New York: Springer; 2018:78–94.
- 16.Hansen PR. A test for superior predictive ability. J Bus Econ Stat. 2005;23(4):365–380.
- 17.Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc. 1927;22(158):209–212.
- 18.Brown LD, Cai TT, DasGupta A. Interval estimation for a binomial proportion. Stat Sci. 2001;16:101–117.
- 19.Singh D, Febbo PG, Ross K, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002;1(2):203–209.
- 20.DeMets D, Lan K. Interim analysis: the alpha spending function approach. Stat Med. 1994;13:1341–1352.