Abstract
Motivated by laboratory experiments that fail to reach significance, we developed a small sample size approach to designing a subsequent experiment that controls overall type I error and achieves sufficient conditional power. We focus on experiments with leukemia cells and use a specific example in Chronic Lymphocytic Leukemia to discuss unanticipated patient variance and difficult-to-predict interaction effect sizes. We emphasize the importance of achieving significance in the first run of an experiment, which simplifies the multiple considerations usually associated with interim analysis and decision making in adaptive clinical trials. Within the context of combination testing for an adaptive laboratory experiment, we show that a range of reasonable options for the futility cut-off, effect size estimation, and significance level for the first run provide similar power and expected overall sample size. We contrast this approach with a naive procedure in which a second unplanned experiment is run based on non-significance in the first experiment, and data are combined as if they were obtained from one run.
Keywords: Conditional error function, Conditional power, Small sample size, Sample size re-estimation
1 Introduction
Motivating examples for this methodological research were experiments on cells from patients suffering from Chronic Lymphocytic Leukemia. In one experiment, a test condition was expected to promote CD154 expression, a clinically relevant target. Based on previous experimental results of doubtful relevance, a sample size of 6 was proposed. In this type of leukemia cell experiment, each patient provides enough cells that all conditions of the experiment can be observed within each patient. Still, prior experience with these leukemia cell experiments had us concerned about variance across patients in the differences between the effects of test vs. control conditions. So we raised the issue of adding replicates, if needed, in a second run in this and other similar leukemia cell experiments. It is common for us to simultaneously plan a whole series of experiments with the scientists. Each is planned to provide a component of a coherent story about the mechanisms that control the disease, and how treatment conditions change the cells' behavior and ameliorate their proliferation. For the whole series of experiments, it is important to reach significance on almost all of them in order to confirm the proposed mechanism, and so this is another motivation for planning a method to recover from non-significance. Clearly, considering all experiments simultaneously is beyond the scope of this paper, so we focused on one of the most important experiments in the series to motivate our approach. In this leukemia cell experiment only two conditions were tested, so a comparison of means with a one-sample t-test on within-patient differences seemed appropriate. A large, scientifically meaningful effect size was assumed in order to determine a sample size with sufficient power. Planning for another run was informal at first, because the methods proposed here were not yet fully developed. We simply used a more conservative alpha and found that, after running this experiment, the one-sided p-value was p1 = 0.06. The effect size was smaller than expected, because patient variance was larger than expected. However, the results looked promising and a second run of the experiment was planned.
In adaptive designs for small sample size experiments with animals or human (patient donor) cells, considerations are obviously distinct from those of clinical trials. For example, there is typically no accrual time in these experiments, because all animals or patients' cells are available to run through an experiment simultaneously, and great emphasis is placed on reaching significance within that first run. For the reasons stated above, we do not pre-plan the sample size for the second run, and expect that a second run will rarely be needed. Instead we spend most of α on the first run, and reserve little for the second run. There is a great cost to setting up an experiment a second time, so scientists consider a repeat run an inefficient option. In other words, it would be difficult to justify to scientists a design with a pre-planned second run of an experiment when they are convinced a single efficient run should suffice. Therefore, the Chen et al (2004) and Gao et al (2008) large sample size procedures, which assume a preplanned second stage, are not directly relevant to our situation. For example, their findings that type I error can be easily controlled with a seamless test, given promising results from the first stage, cannot be compared directly to our findings.
In consideration of these and other fundamental distinctions of lab experiments, we propose that the main situation scientists wish to avoid is having no mitigation plan when significance is not achieved in the first run of an experiment. Because there is usually a great resource cost in repeating a run of an experiment, investigators invest much effort in collecting sufficient preliminary data to ensure that the design of a single full run of an experiment will provide the desired conclusions. External pilot studies and similar prior experiments (e.g. use of a familiar mouse model) provide confidence and refinement during design and planning. Some investigators go so far as to use a much larger sample size than preliminary data suggest in order to avoid running an experiment a second time. Still, when significance is not achieved but results look promising, a decision will often be made to repeat the experiment. Investigators then tend to ask statisticians to determine the number of additional replications to combine with the first run to achieve significance, with no recognition that type I error was already spent on the first test. Besides the leukemia cell experiments, we have also found that failed significance occurs in animal experiments that require a long observation period (e.g. when outcomes like mortality, disease onset, or tumor size are measured over months). Also, in leukemia cell experiments where patient cell-by-condition random effects are present, it is difficult to anticipate accurately both the variance of the response and, especially, the effect size. Even if these were anticipated correctly, power is usually set to 80%, so that 20% of experiments would fail to reach significance even when the presumed effect sizes are correct. With pre-planning within this context, we suggest an easy-to-use method that controls overall type I error. It provides confidence when deciding the size of the second run of an experiment, given that the results of the first run look promising (but failed to reach significance). Finally, our experience with more recent grant reviews indicates that thorough justification of animal numbers is becoming more critical, and attention to proper design and reporting of animal experiments is increasing (Festing, 2010). We think that the proposed method satisfies the demand for efficiency and sophistication in the planning of these experiments, while maintaining simplicity.
We are not considering the first stage as an internal pilot (Kairalla et al, 2010; Proschan, 2005) because of our common experience in which the investigator feels that he/she has developed the ideal protocol, through preliminary studies and literature, and has sufficient confidence to attempt a single conclusive experiment. The small size of laboratory experiments and the simultaneous running of all animals or tissue samples also make the use of internal pilots less applicable than for clinical trials. For example, if the experiment only needs a sample size of 6 mice per group to achieve desired power, then an internal pilot on one or two mice would not be sufficiently informative about re-designing the study. Instead an external pilot with a few mice is usually used to refine the design or intervention before settling on a final design. The same can be said about donor cell experiments. Internal pilots can be useful with somewhat larger experiments. For example, in the approach suggested by Denne and Jennison (1999), estimating the unknown variance increases power substantially. An internal pilot in larger experiments could be used to provide the first stage sample size in our approach, and then our suggestions below for planning a second run in the face of failed significance would be applicable.
Posch and Bauer (1999), Lehmacher and Wassmer (1999), Posch et al (2003), and Bretz et al (2009) showed that several adaptive designs can be formulated in terms of a simple conditional error function. We explored two combination approaches for small sample sizes, one based on the p-value product method introduced by Bauer and Kohne (1994) (BK) and the other based on a p-value inverse normal transformation (IN) discussed in Lehmacher and Wassmer (1999), Posch and Bauer (1999), and Bretz et al (2009). In both approaches we emphasize an upfront choice of α1 (the first run significance level) that is close to α, for the reasons stated above. Typically α1 is set much lower for combination tests, and sometimes it is determined as a function of other choices. A futility level (no second run) α0 must also be chosen for this approach in order to define significance criteria for the second run. For these combination tests, when the second experiment is run because the p-value from the first is between α1 and α0, the p-value from the second experiment is combined with the first, and critical values are chosen to control the overall type I error at α. Here we describe the simple steps in implementing these approaches for small sample sizes and apply them to our motivating example.
In contrast to large sample approaches that fail to control type I error adequately with small sample sizes (Posch and Bauer, 2000; Wassmer, 1998), the BK and IN methods can be easily adapted to control type I error in small sample size experiments. The IN method requires pre-specified weights for the individual runs' p-values and numerical integration, but the range of weights can be restricted by the small range we adopt from α0 to α1. In power simulations we show that imprecision of effect size estimates has little consequence for a range of mistaken assumptions about effect size and reasonable choices of α0 in our context. We were particularly interested in the effect of the small degrees of freedom associated with the two stages of the experiment, because in small sample sizes the degrees of freedom associated with the two p-values could reduce power relative to treating the data as if they came from one experiment with larger degrees of freedom. Within our context, this power loss is negligible. Our results are therefore similar to what has been shown for large sample sizes (Posch and Bauer, 2000; Wassmer, 1998). The BK and IN methods in small sample size experiments have power that is close to that of an “optimal” t-test from the seamless combination of data from the two experiments as if they were one, despite the possibility of quite different sample sizes in each stage (Bauer and Kohne, 1994). We show that a “naive” method of adding replicates and combining data from two stages seamlessly fails to control type I error, with no important gain in power over the BK or IN methods. We also studied the effect of setting α1 lower than the choice that fits our circumstances, in order to uncover any substantial effects on power or total sample size. Finally, we also studied the effects of overly optimistic effect sizes at the first run to determine the ability of the methods to recover from such mistaken assumptions.
2 Extension of combination tests for small sample sizes
2.1 Hypothesis Testing
Though our experiment on patients' cells used a one-sample test, we consider first the more common two-sample problem. We propose a combination test of the one-sided hypotheses

$$H_0: \mu_y - \mu_x \le 0 \quad \text{versus} \quad H_1: \mu_y - \mu_x > 0,$$

where μy and μx are the population means from groups y and x. Let n1 denote the proposed sample size in each group for the first run. If the p-value from the first run gives p1 ≤ α1, we stop with a rejection of the null. If p1 ≥ α0, we stop with no conclusion (presumably loss of interest in pursuing the alternative with a resource demanding second experiment). If α1 < p1 < α0, we proceed to the second run. Sample size n2 can be estimated based on data from the first run. If we proceed to the second run, the null hypothesis H0 is rejected at the final analysis if C(p1, p2) ≤ c, where p2 is the p-value based only on the second run data. C(p1, p2) is a monotonically increasing combination function of p1 and p2, and c is the critical value that controls overall type I error at α.
2.2 Choosing Parameters to Control Type I Error
Overall type I error is controlled by calculation of c, given a preplanned choice of α0 and α1:

$$\alpha_1 + \int_{\alpha_1}^{\alpha_0} \alpha(p_1)\, dp_1 = \alpha, \tag{1}$$
where α(p1) is the conditional error function defined in the Appendix. For the BK method, H0 is rejected at the end of the second run if p1 × p2 ≤ c. With α0 and α1 chosen first, c is simply obtained by considering that

$$\alpha_1 + \int_{\alpha_1}^{\alpha_0} \frac{c}{p_1}\, dp_1 = \alpha_1 + c\,\{\ln(\alpha_0) - \ln(\alpha_1)\} = \alpha. \tag{2}$$
In other references to the BK method, given the overall type I error probability (α) and the futility value α0, a c is chosen to satisfy Equation (2), which then determines the first run significance level α1. In addition, in those approaches, first run power depends on these choices. Given our need to achieve significance in the first run, our approach must be different. Instead, for small sample size experiments (SBK), we choose α0 and α1 first, so that the power of the first run is 1 − β and the second run sample size does not become unreasonably large.
For the IN method with small sample size (SIN), the weighted inverse normal combination function leads to rejection of H0 at the end of the second run if $w_1\Phi^{-1}(1-p_1) + w_2\Phi^{-1}(1-p_2) \ge \Phi^{-1}(1-c)$. In order to control type I error at α, c is chosen to satisfy the following equation:

$$\alpha_1 + \int_{\alpha_1}^{\alpha_0} \left[\,1 - \Phi\!\left(\frac{\Phi^{-1}(1-c) - w_1\Phi^{-1}(1-p_1)}{w_2}\right)\right] dp_1 = \alpha. \tag{3}$$
Given α0, α1, and pre-specified w1 and w2, c must be solved numerically.
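To make these calculations concrete, here is a minimal sketch (our illustration, not the authors' software, which is stated later to be available on request) that computes c for both methods under the choices used throughout the paper: α = 0.025, α1 = 0.02, α0 = 0.1, with the SIN weights w1 = √(1/3) and w2 = √(2/3) chosen in Section 2.5. Equation (2) gives cSBK in closed form; Equation (3) is solved numerically for cSIN using the conditional error function given in the Appendix.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm

alpha, alpha1, alpha0 = 0.025, 0.02, 0.1
w1, w2 = np.sqrt(1 / 3), np.sqrt(2 / 3)   # SIN weights chosen in Section 2.5

# SBK: rearranging Equation (2) gives c in closed form.
c_sbk = (alpha - alpha1) / (np.log(alpha0) - np.log(alpha1))
print(round(c_sbk, 4))                    # 0.0031

# SIN: solve Equation (3) for c, with the weighted inverse normal
# conditional error function (see Appendix) integrated over (alpha1, alpha0).
def overall_type1(c):
    a = lambda p1: 1 - norm.cdf((norm.ppf(1 - c) - w1 * norm.ppf(1 - p1)) / w2)
    return alpha1 + quad(a, alpha1, alpha0)[0]

c_sin = brentq(lambda c: overall_type1(c) - alpha, 1e-8, alpha)
print(round(c_sin, 4))                    # approx 0.0145
```

These values match cSBK = 0.0031 and cSIN = 0.0145 reported below for this design.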
2.3 Proposed n2 Calculation with Small n1
Let CPδ(n2, c | p1) denote the conditional probability that the p-value p2, based on n2 observations in each group, is less than α(p1), given p1, effect size (μy − μx)/σ = δ, and n1, where σ is the population standard deviation. This conditional power has been derived for a noncentral t-distribution in the Appendix. Setting CPδ(n2, c | p1) = 1 − β yields

$$G_{2(n_2-1);\,\eta}\bigl(t_{2(n_2-1),\,1-\alpha(p_1)}\bigr) = \beta, \qquad \eta = \delta\sqrt{n_2/2}, \tag{4}$$
so that the sample size for the second run, n2, should satisfy the above equality. Fixing conditional power and c, one can solve this equation iteratively to obtain n2. For combination tests with small sample size, the noncentrality parameter η (or δ) must be chosen, but an estimate from the first run will have a wide confidence interval. (Consider that an estimate of δ is needed precisely when the first run results are not significant.) In our context, the β in Equation (4) is the same as the overall β. Therefore, we explored how the limited information about δ from a small sample size performs in estimating n2. With small n1, we suggest simply using the estimated effect size from the first run, $\hat{\delta}$. We compare this to a more conservative approach, which uses an estimated lower confidence bound on δ to obtain n2. Cumming and Finch (2001) provided a confidence interval for an estimated effect size. The lower bound for η, η_L, satisfies the following equation:

$$P\bigl(T_{2(n_1-1);\,\eta_L} \ge t_{\text{obs}}\bigr) = p,$$

where p is the probability of the observed t value in the upper tail. Once the lower bound for η is found, the lower bound for δ is $\hat{\delta}_L = \eta_L\sqrt{2/n_1}$. The confidence intervals are slightly skewed (based on the noncentral t distribution). Despite this asymmetry, $\hat{\delta}$ and the 50th percentile of the confidence interval differ very little when n1 is above 4. With $\hat{\delta}_L$ in place of $\hat{\delta}$, we solve Equation (4) iteratively to obtain n2. We will compare these two choices to show the slight gain in power that is achieved with the more conservative approach, and justify our suggestion of simply using $\hat{\delta}$.
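The iteration itself is straightforward to script. Below is a minimal sketch under our assumptions (the function names n2_sbk and eta_lower are ours, not the authors'): it scans n2 upward until the noncentral t conditional power in Equation (4) reaches the target, using the SBK conditional error α(p1) = c/p1, and it also sketches a root-finding version of the Cumming and Finch style lower bound for the noncentrality parameter.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import nct, t as t_dist

def n2_sbk(p1, delta, c=0.0031, power=0.80, n_max=500):
    """Smallest per-group second-run size n2 with conditional power >= `power`
    for the two-sample SBK rule: reject at the end of run two if
    p1 * p2 <= c, i.e. if p2 <= c / p1 (Equation (4))."""
    a_p1 = c / p1                                # conditional error alpha(p1)
    for n2 in range(2, n_max + 1):
        df = 2 * (n2 - 1)
        crit = t_dist.ppf(1 - a_p1, df)          # second-run critical t value
        eta = delta * np.sqrt(n2 / 2)            # noncentrality parameter
        if 1 - nct.cdf(crit, df, eta) >= power:  # conditional power
            return n2
    raise ValueError("no n2 <= n_max reaches the requested conditional power")

def eta_lower(t_obs, df, p=0.025):
    """Lower bound eta_L for the noncentrality parameter (in the spirit of
    Cumming and Finch, 2001): the eta whose upper-tail probability at the
    observed t equals p."""
    return brentq(lambda e: (1 - nct.cdf(t_obs, df, e)) - p, -10, 50)
```

The SIN version differs only in the conditional error substituted for `a_p1`; the one-sample analogue replaces the degrees of freedom 2(n2 − 1) with n2 − 1 and the noncentrality δ√(n2/2) with δ√n2 (see the Appendix).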
The corresponding calculation of n2 for the one-sample problem (e.g. patient donor cell experiments) is provided in the Appendix. We provide n2 values in Table 4 for the SBK and SIN methods as defined in this section for the one-sample problem, using the noncentral t for small values of n1 and the first run result (p1), assuming a desired power of 80%. This table used $\hat{\delta}$ rather than the lower confidence bound $\hat{\delta}_L$, because of the conservative approach's much higher ratio of n2/n1 and its insignificant increase in power, described below. For the SBK method, the critical value cSBK = 0.0031 was calculated by solving Equation (2) given α = 0.025, α1 = 0.02, and α0 = 0.1. For the SIN method, we used a Riemann sum to approximate the integration and solved Equation (3) numerically. For α0 = 0.1, cSIN = 0.0145, given w1 = √(1/3) and w2 = √(2/3). Simple programs are available upon request from the first author.
Table 4.
The second run sample size, n2, that achieves conditional power of 80% as a function of p1 (columns) and n1, for a one-sample t-test with one-sided α = 0.025, α1 = 0.02, and α0 = 0.1. For each n1, the row labeled SBK uses the SBK method and the row labeled SIN uses the SIN method. For SBK, the critical value cSBK = .0031. For SIN, cSIN = .0145 is numerically obtained with w1 = √(1/3) and w2 = √(2/3).
| n1 | Method | .025 | .03 | .035 | .04 | .045 | .05 | .055 | .06 | .065 | .07 | .075 | .08 | .085 | .09 | .095 | .1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | SBK | 3 | 3 | 3 | 4 | 4 | 4 | 5 | 5 | 5 | 6 | 6 | 7 | 7 | 8 | 8 | 9 |
| 3 | SIN | 3 | 3 | 3 | 4 | 4 | 4 | 4 | 5 | 5 | 5 | 6 | 6 | 7 | 7 | 8 | 8 |
| 4 | SBK | 3 | 4 | 4 | 5 | 5 | 6 | 7 | 7 | 8 | 9 | 9 | 10 | 11 | 12 | 12 | 13 |
| 4 | SIN | 4 | 4 | 5 | 5 | 5 | 6 | 6 | 7 | 8 | 8 | 9 | 9 | 10 | 11 | 12 | 12 |
| 5 | SBK | 4 | 5 | 6 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
| 5 | SIN | 5 | 5 | 6 | 7 | 7 | 8 | 9 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
| 6 | SBK | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 | 15 | 16 | 17 | 18 | 20 | 21 | 23 |
| 6 | SIN | 6 | 7 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 20 | 21 |
| 7 | SBK | 6 | 7 | 9 | 10 | 11 | 12 | 14 | 15 | 16 | 18 | 19 | 21 | 22 | 24 | 26 | 27 |
| 7 | SIN | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 16 | 17 | 18 | 19 | 21 | 22 | 24 | 25 |
| 8 | SBK | 7 | 9 | 10 | 11 | 13 | 14 | 16 | 18 | 19 | 21 | 23 | 24 | 26 | 28 | 30 | 32 |
| 8 | SIN | 8 | 9 | 10 | 12 | 13 | 14 | 15 | 17 | 18 | 20 | 21 | 23 | 24 | 26 | 28 | 29 |
| 9 | SBK | 8 | 10 | 11 | 13 | 15 | 17 | 18 | 20 | 22 | 24 | 26 | 28 | 30 | 32 | 34 | 36 |
| 9 | SIN | 9 | 11 | 12 | 13 | 15 | 16 | 18 | 19 | 21 | 23 | 24 | 26 | 28 | 30 | 32 | 34 |
| 10 | SBK | 9 | 11 | 13 | 15 | 17 | 19 | 21 | 23 | 25 | 27 | 29 | 31 | 33 | 36 | 38 | 41 |
| 10 | SIN | 10 | 12 | 13 | 15 | 17 | 18 | 20 | 22 | 24 | 25 | 27 | 29 | 31 | 33 | 36 | 38 |
| 11 | SBK | 10 | 12 | 14 | 16 | 19 | 21 | 23 | 25 | 28 | 30 | 32 | 35 | 37 | 40 | 42 | 45 |
| 11 | SIN | 12 | 13 | 15 | 17 | 18 | 20 | 22 | 24 | 26 | 28 | 30 | 32 | 35 | 37 | 40 | 42 |
| 12 | SBK | 11 | 14 | 16 | 18 | 21 | 23 | 25 | 28 | 30 | 33 | 35 | 38 | 41 | 44 | 47 | 50 |
| 12 | SIN | 13 | 15 | 16 | 18 | 20 | 22 | 24 | 27 | 29 | 31 | 33 | 36 | 38 | 41 | 43 | 46 |
| 13 | SBK | 12 | 15 | 17 | 20 | 22 | 25 | 28 | 30 | 33 | 36 | 39 | 42 | 45 | 48 | 51 | 54 |
| 13 | SIN | 14 | 16 | 18 | 20 | 22 | 24 | 27 | 29 | 31 | 34 | 36 | 39 | 42 | 44 | 47 | 50 |
| 14 | SBK | 13 | 16 | 19 | 22 | 24 | 27 | 30 | 33 | 36 | 39 | 42 | 45 | 48 | 52 | 55 | 59 |
| 14 | SIN | 15 | 17 | 20 | 22 | 24 | 26 | 29 | 31 | 34 | 37 | 39 | 42 | 45 | 48 | 51 | 55 |
| 15 | SBK | 14 | 17 | 20 | 23 | 26 | 29 | 32 | 35 | 39 | 42 | 45 | 49 | 52 | 56 | 59 | 63 |
| 15 | SIN | 16 | 19 | 21 | 24 | 26 | 29 | 31 | 34 | 37 | 39 | 42 | 45 | 49 | 52 | 55 | 59 |
2.4 Choice of α0 and α1
As argued in the introduction, it is critical to reach significance with one (first stage) run of the experiment, because of set up costs. Therefore, we choose α1 close to α = 0.05, or to 0.025 for the more justifiable one-sided test. n1 is chosen to provide the desired power (e.g. 80%) for the first run, given α1, ignoring the possibility of moving to the second run. Because of the discrete values of power in small sample sizes, it turns out that for the usual choices of effect sizes that generate a small sample size, the n1 for α1 = 0.04 or 0.02 is either the same as, or one larger than, the n1 when α1 is set to 0.05 or 0.025. Adding one additional animal or patient cell sample in each group is a minor cost relative to the setup of the experiment. In contrast, setting alpha to a much lower level for the first run (and reserving more for the second run) would be unacceptable to scientists, who expect to test at the usual significance level. For example, if the p-value from the first run were 0.01 and we required something smaller for α1 to declare significance, our approach would be rejected. Still, we did explore the effects of a smaller α1 on second run sample sizes in the simulations.
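The claim that tightening α1 costs at most one subject per group is easy to verify for any particular design. A sketch under our assumptions, using the one-sample configuration of the motivating example (n1 = 6, δ = 1.6; see Section 3.1) as the illustration:

```python
import numpy as np
from scipy.stats import nct, t as t_dist

def first_run_power(n1, delta, alpha1):
    """One-sample first-run power at one-sided level alpha1."""
    crit = t_dist.ppf(1 - alpha1, n1 - 1)
    return 1 - nct.cdf(crit, n1 - 1, delta * np.sqrt(n1))

for a1 in (0.025, 0.02):
    print(a1, round(first_run_power(6, 1.6, a1), 3))
# Both levels retain at least 80% power at n1 = 6, so here the
# tighter alpha1 = 0.02 requires no increase in first run sample size.
```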
Choice of α0 is consequential, because it impacts both overall power and the potential size of n2. A larger choice of α0 allows the second run of an experiment even with an unexpectedly small $\hat{\delta}$. Given that n1 was chosen based on a scientifically meaningful δ, scientists typically lose interest in expending resources when $\hat{\delta}$ is much smaller than expected. In addition, with larger values of α0, the n2 can be so many times larger than n1 that a scientist would choose not to run a second experiment of that size. Larger α0 is also associated with a smaller value for the significance criterion c (Equation (1)). In contrast, larger α0 increases the probability of running a second experiment. This increases power, but not by much unless the true δ is much lower than our first run guess, as we show in simulations below. As a guideline, a reasonable ratio of n2 to n1 should be less than 5. For SBK, Figure 1(a) provides the ratio of n2 to n1 using different α0. The plotted n2 corresponds to p1 = α0 (the maximum n2 at each α0). Note that α0 ∈ (0.052, 0.25) for α = 0.05, α1 = 0.04, and α0 ∈ (0.026, 0.25) for α = 0.025, α1 = 0.02. For SBK, p1 × p2 ≤ p1, so it only makes sense to consider cases where α1 ≥ c (because we reject the null if p1 × p2 ≤ c). The minima of α0 = 0.052 and 0.026 were chosen to satisfy α1 ≥ c. From all the curves, α0 ∈ (0.1, 0.15) seems to keep the n2/n1 ratio reasonable. Interestingly, Figure 1(a) shows that the n2/n1 ratios are almost independent of n1. The curves correspond to n1 = 4, 6, 8, and 10 (which correspond to δ = 2.6, 1.9, 1.6, and 1.4), ordered from the bottom to the top in each group. The just noticeable ordering from bottom to top for small to large n1 is explained by subtle effects of degrees of freedom on conditional power. Figure 1(a) used $\hat{\delta}$ rather than the lower confidence bound $\hat{\delta}_L$ for calculating n2. $\hat{\delta}_L$ produces much higher, unreasonable ratios, as seen in Figure 1(b). We address the slight improvement in power achieved by using the lower bound next. With larger choices of α0, smaller values of $\hat{\delta}$ will be pursued in a second run, but power improvements seem small in most cases. This is at least partially explained by the fact that higher α0 produces smaller critical values c. For example, for α1 = 0.04 and α0 = 0.1, c = 0.011, while with an increase to α0 = 0.2, c = 0.0062. Other examples of the effect are described later in Table 2. Highly similar results could be plotted for SIN.
Fig. 1.

The effect of α0 on the ratio n2/n1. (a) used $\hat{\delta}$ for calculating n2; (b) used the lower confidence bound $\hat{\delta}_L$, with the curves from (a) simply carried over for reference. For each group of curves, from the bottom to the top are n1 = 4, 6, 8, and 10, respectively. n2 achieves conditional power of 80%.
Table 2.
Simulation results comparing the power of different procedures and different first run guesses of δ (δ1), with n1 = 4 (δ1 = 2.6 standard deviations), α = 0.025, α1 = 0.02, and α0 = 0.1, 0.15, 0.2 for the two sample problem. For SBK, the critical value cSBK = .0031 for α0 = 0.1, cSBK = .0025 for α0 = 0.15, and cSBK = .0022 for α0 = 0.2. For SIN, cSIN = .0145 for α0 = 0.1, cSIN = .0109 for α0 = 0.15, and cSIN = .0095 for α0 = 0.2 are numerically obtained with w1 = √(1/3) and w2 = √(2/3). P1→2 is the probability of a second run.
| α0 = 0.1 | α0 = 0.15 | α0 = 0.2 | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| True δ | 0.4δ1 | 0.6δ1 | 0.8δ1 | δ1 | 1.2δ1 | 0.4δ1 | 0.6δ1 | 0.8δ1 | δ1 | 1.2δ1 | 0.4δ1 | 0.6δ1 | 0.8δ1 | δ1 | 1.2δ1 | |
| power | standard (0.02) | 0.20 | 0.41 | 0.64 | 0.83 | 0.94 | 0.20 | 0.41 | 0.64 | 0.83 | 0.94 | 0.20 | 0.41 | 0.64 | 0.83 | 0.94 |
| standard (0.025) | 0.24 | 0.46 | 0.69 | 0.86 | 0.95 | 0.24 | 0.46 | 0.69 | 0.86 | 0.95 | 0.24 | 0.46 | 0.69 | 0.86 | 0.95 | |
| SBK ($\hat\delta$) | 0.43 | 0.73 | 0.92 | 0.98 | 1 | 0.52 | 0.81 | 0.95 | 0.99 | 1 | 0.59 | 0.86 | 0.97 | 1 | 1 | |
| SBK ($\hat\delta_L$) | 0.52 | 0.77 | 0.93 | 0.98 | 1 | 0.62 | 0.85 | 0.96 | 0.99 | 1 | 0.70 | 0.90 | 0.98 | 1 | 1 | |
| SIN | 0.42 | 0.74 | 0.92 | 0.98 | 1 | 0.51 | 0.81 | 0.95 | 0.99 | 1 | 0.60 | 0.86 | 0.97 | 1 | 1 | |
| Naive | 0.46 | 0.76 | 0.92 | 0.98 | 1 | 0.56 | 0.82 | 0.96 | 0.99 | 1 | 0.64 | 0.88 | 0.97 | 1 | 1 | |
| E(n2|0.02 < p1 < α0) | SBK ($\hat\delta$) | 8.02 | 7.31 | 6.74 | 6.02 | 5.51 | 12.1 | 10.3 | 9.10 | 7.55 | 6.54 | 16.9 | 13.4 | 10.7 | 8.80 | 6.78 |
| SBK ($\hat\delta_L$) | 26.0 | 22.7 | 20.0 | 16.8 | 14.7 | 55.8 | 42.9 | 35.3 | 24.9 | 19.1 | 143 | 94.2 | 60.5 | 39.2 | 21.3 | |
| SIN | 7.78 | 7.20 | 6.65 | 6.08 | 5.77 | 11.7 | 10.3 | 8.93 | 7.79 | 6.47 | 16.1 | 13.3 | 10.6 | 9.04 | 7.11 | |
| Naive | 6.80 | 6.43 | 5.89 | 5.50 | 4.93 | 9.36 | 8.55 | 7.30 | 6.43 | 5.47 | 12.5 | 10.3 | 8.50 | 6.77 | 5.81 | |
| P1→2 | SBK | 0.33 | 0.38 | 0.28 | 0.15 | 0.06 | 0.44 | 0.44 | 0.32 | 0.17 | 0.06 | 0.52 | 0.50 | 0.34 | 0.17 | 0.06 |
| SIN | 0.32 | 0.37 | 0.28 | 0.15 | 0.06 | 0.44 | 0.45 | 0.32 | 0.17 | 0.06 | 0.51 | 0.49 | 0.34 | 0.17 | 0.06 | |
| Naive | 0.30 | 0.33 | 0.23 | 0.12 | 0.04 | 0.41 | 0.40 | 0.27 | 0.13 | 0.05 | 0.49 | 0.44 | 0.29 | 0.14 | 0.04 | |
| E(n2) | SBK ($\hat\delta$) | 2.65 | 2.75 | 1.91 | 0.91 | 0.35 | 5.33 | 4.58 | 2.91 | 1.29 | 0.39 | 8.74 | 6.70 | 3.63 | 1.47 | 0.44 |
| SBK ($\hat\delta_L$) | 8.59 | 8.54 | 5.65 | 2.55 | 0.93 | 24.5 | 19.0 | 11.3 | 4.26 | 1.15 | 74.3 | 47.0 | 20.5 | 6.56 | 1.38 | |
| SIN | 2.52 | 2.64 | 1.87 | 0.91 | 0.36 | 5.11 | 4.65 | 2.84 | 1.31 | 0.41 | 8.26 | 6.52 | 3.58 | 1.52 | 0.46 | |
| Naive | 2.02 | 2.10 | 1.37 | 0.67 | 0.21 | 3.84 | 3.39 | 1.98 | 0.85 | 0.26 | 6.09 | 4.53 | 2.45 | 0.92 | 0.25 | |
| E(n) | SBK ($\hat\delta$) | 6.65 | 6.75 | 5.91 | 4.91 | 4.35 | 9.33 | 8.58 | 6.91 | 5.29 | 4.39 | 12.74 | 10.70 | 7.63 | 5.47 | 4.44 |
| SBK ($\hat\delta_L$) | 12.59 | 12.54 | 9.65 | 6.55 | 4.93 | 28.5 | 23.0 | 15.3 | 9.26 | 5.15 | 78.3 | 51.0 | 24.5 | 10.56 | 5.38 | |
| SIN | 6.52 | 6.64 | 5.87 | 4.91 | 4.36 | 9.11 | 8.65 | 6.84 | 5.31 | 4.41 | 12.26 | 10.52 | 7.58 | 5.52 | 4.46 | |
| Naive | 6.02 | 6.10 | 5.37 | 4.67 | 4.21 | 7.84 | 7.39 | 5.98 | 4.85 | 4.26 | 10.09 | 8.53 | 6.45 | 4.92 | 4.25 | |
2.5 Calculation of c for SBK and Choice of Weights for SIN
As shown above, the critical value, c, is calculated by solving the equations that control type I error at α given α0, α1, and the pre-specified conditional error function α(p1). For the SBK method, Equation (2) gives c = (α − α1)/{log(α0) − log(α1)}. For the SIN method, c can only be solved numerically. Moreover, the pre-specified weights w1 and w2 must be chosen. A typical choice in a conventional two-stage design is w1 = √{n1/(n1 + n2)} and w2 = √{n2/(n1 + n2)}, based on the pre-planned stage sample sizes. However, in our context we emphasize achieving significance in the first run, so that the second run sample size, n2, is not pre-specified. Therefore, we pre-specify w1 and w2 based on our expectation of the ratio n2/n1 within our context. We show later in Figure 1 and Table 2 that for a range of reasonable choices of α0 and α1, the expected value of n2/n1 will be around 2; with w1² + w2² = 1, this gives w1 = √(1/3) and w2 = √(2/3). If the first experiment fails to achieve significance at α1, but also produces a p-value below α0, we move on to the second run of the experiment. The n2 is then calculated based on the first run information by solving Equation (4). For the SBK method, n2 is obtained by solving $G_{2(n_2-1);\,\eta}\bigl(t_{2(n_2-1),\,1-c/p_1}\bigr) = \beta$, given the conditional power and c. For the SIN method, n2 can be obtained by solving the same equation with the conditional error function $\alpha_{IN}(p_1)$ of the Appendix in place of c/p1. Clearly, the SBK method in our specific setting is much simpler to use. Further, we show that the n2 values for achieving 80% conditional power are only slightly lower for SIN than for SBK (n2 tables are not provided, but are available on request).
2.6 Approximate Large Sample Size Method for Choosing n2
For large n1 and assuming the variance is known (normality of the test statistic instead of the noncentral t), Equation (4) for SBK for the two sample problem has a closed form for n2. With $\hat{\delta}$ from the first run,

$$n_2 = 2\left(\frac{z_{1-c/p_1} + z_{1-\beta}}{\hat{\delta}}\right)^2.$$

This corresponds to the equation for estimating n2 for the BK method in Posch and Bauer (2000). With this large sample size method of estimating n2, we of course expected n2 to be consistently smaller than our noncentral t approach when using $\hat{\delta}$ for small sample sizes, but we were not sure by how much. When α = 0.025, α1 = 0.02, α0 = 0.1, and n1 ≤ 15, 25% of the large sample n2 are the same as, and the remaining 75% are 1 smaller than, the n2 that the noncentral t method produces. For α0 = 0.15, 10% are the same, 80% are 1 smaller than, and 10% are 2 smaller than the noncentral t method. For the large sample size approach to the one-sample problem, the equation takes the same form without the factor 2, $n_2 = \{(z_{1-c/p_1} + z_{1-\beta})/\hat{\delta}\}^2$, because the noncentrality parameter is $\eta = \delta\sqrt{n_2}$ rather than $\delta\sqrt{n_2/2}$.
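As an illustrative check against the motivating example of Section 3.1 (one-sample, p1 = 0.06, $\hat{\delta}$ = 0.76, c = 0.0031): $z_{1-0.0031/0.06} \approx 1.63$ and $z_{0.80} \approx 0.84$, so $n_2 \approx \{(1.63 + 0.84)/0.76\}^2 \approx 10.6$, which rounds up to 11, one smaller than the noncentral t value of 12 in Table 4, in line with the pattern just described.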
3 Applying SBK to the motivating example and Simulations
3.1 Applying SBK to our motivating example
Using SBK in our example with α = 0.025, α1 = 0.02, and α0 = 0.1, the corresponding c = 0.0031 (Equation (2)). For the first run, with a sample size of 6 (n1), we had 80% power to detect an effect size of 1.6 standard deviations using α1 = 0.02. Because of the discrete values of power in small sample sizes, the n1 for one-sided α1 = 0.02 is the same as when α1 is set to 0.025. The p-value from the first run (based on a one sample t-test on differences) was p1 = 0.06, which was within our second run decision range. The p1 = 0.06 corresponded to an effect size estimate of 0.76 standard deviations, smaller than expected. Estimating n2 by iteratively solving the one-sample version of Equation (4), $G_{(n_2-1);\,\eta}\bigl(t_{n_2-1,\,1-0.0031/0.06}\bigr) = 0.2$, which is based on 80% conditional power, another 12 (see Table 4) patients' cell samples were needed for the second run. After we ran the second stage, the p-value was p2 = 0.000035. Combining the two runs' p-values, p1 × p2 = 0.060 × 0.000035 = 0.0000021 < 0.0031 = c, which gave us the significance we needed for confidently reporting the results.
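As a numerical re-check of this step, a minimal sketch of the one-sample iteration (Appendix notation; variable names ours, with $\hat{\delta}$ recomputed from p1 and n1 as in Table 4, since the 0.76 quoted above is rounded):

```python
import numpy as np
from scipy.stats import nct, t as t_dist

c, p1, n1 = 0.0031, 0.06, 6
delta_hat = t_dist.ppf(1 - p1, n1 - 1) / np.sqrt(n1)   # ~0.76-0.78
for n2 in range(2, 100):
    crit = t_dist.ppf(1 - c / p1, n2 - 1)              # reject run two if p2 <= c/p1
    if 1 - nct.cdf(crit, n2 - 1, delta_hat * np.sqrt(n2)) >= 0.80:
        print(n2)                                      # 12 expected (Table 4);
        break                                          # borderline rounding may shift by 1

print(0.060 * 0.000035 <= 0.0031)                      # combined BK test rejects: True
```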
3.2 Simulations for E(n2), Type I and II Error
Simulations were used to compare testing options and conditions with respect to type I and II errors, the probability of a second run, and the sample sizes n2 and total n. The simulations were based on the two sample problem. In our simulations, we estimated the type I and II errors and all other parameters using 10,000 replicates of the procedure, with a Monte Carlo standard error of at most 0.005.
3.2.1 Type I error for a “naive” Method
A naive procedure might combine n1 and n2 observations in a single seamless test when faced with non-significance in an initial experiment, making no adjustment to the significance criteria. Based on the effect observed in a first run that fails the significance test, a total sample size, ntotal, could be calculated so that n2 = ntotal − n1. This might be the approach taken by an investigator who naively expects that the seamless significance test still controls type I error. A similar procedure for calculating an n2 based on conditional power was outlined in Chen et al (2004) and Gao et al (2008) for large sample sizes. They found that the type I error is not inflated when the conditional power is above approximately 50%, given a pre-planned two-stage design; otherwise the significance criteria must be adjusted. We showed above that reasonable choices of α0 and α1 can produce an n2 that is 5 times larger than n1, so we did not expect that type I error would be as well controlled as in Chen et al. Note also that our naive approach should be more biased than the “naive” approach defined by Wittes et al (1999), where only the variance parameter drives second run sample size and stopping is not part of the plan. The results in Table 1 show that type I error is not controlled using this naive procedure, even with our choices of low values for α0. Note that a significance level of 0.025 was used in the naive procedure for the first run test. Simulated type I errors for our SBK and SIN methods are also provided simply for comparison, and are as expected. The naive procedure did not control type I error mainly because it makes no adjustment for testing twice, nor for using the first run data both to test the hypothesis and to determine n2. In the examples we ran, testing twice without adjusting alpha contributed about 40% of the inflated error rate, and capitalizing on chance by using first run data to drive n2 contributed about 60%. We also used a significance level of 0.02 at the first run for the naive procedure in order to mimic the significance level used in the SBK and SIN methods. We found that the type I error is still not controlled, though the inflation is about 20% smaller in our examples.
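For concreteness, here is a self-contained sketch of the SBK arm of this type I error simulation (two-sample, under H0; all function names and the seed are ours). Replacing the final combination test with a seamless t-test at level 0.025 on the pooled runs yields the naive procedure, whose rejection rate then exceeds 0.025, as reported in Table 1.

```python
import numpy as np
from scipy.stats import nct, t as t_dist

rng = np.random.default_rng(2024)
alpha1, alpha0, c, n1, reps = 0.02, 0.1, 0.0031, 4, 10_000

def p_and_effect(x, y):
    """One-sided p-value (H1: mu_y > mu_x) and estimated effect size."""
    sp = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
    tval = (y.mean() - x.mean()) / (sp * np.sqrt(2 / len(x)))
    return 1 - t_dist.cdf(tval, 2 * (len(x) - 1)), (y.mean() - x.mean()) / sp

def n2_for(p1, d_hat, power=0.80):
    """Equation (4): smallest n2 with conditional power >= `power` under SBK."""
    for m in range(2, 500):
        crit = t_dist.ppf(1 - c / p1, 2 * (m - 1))
        if 1 - nct.cdf(crit, 2 * (m - 1), d_hat * np.sqrt(m / 2)) >= power:
            return m
    return 500

rejections = 0
for _ in range(reps):
    x, y = rng.standard_normal(n1), rng.standard_normal(n1)   # H0: delta = 0
    p1, d_hat = p_and_effect(x, y)
    if p1 <= alpha1:
        rejections += 1
    elif p1 < alpha0:                        # promising but not significant
        n2 = n2_for(p1, d_hat)
        p2, _ = p_and_effect(rng.standard_normal(n2), rng.standard_normal(n2))
        rejections += p1 * p2 <= c
print(rejections / reps)                     # should be near 0.025 (cf. Table 1)
```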
Table 1.
Simulation results comparing the type I error of the naive procedure, SBK, and SIN using $\hat\delta$, with α = 0.025, α1 = 0.02, and α0 = 0.1, 0.15 for the two sample problem. Standard error of simulation is 0.0016. For SBK, the critical value cSBK = .0031 for α0 = 0.1 and cSBK = .0025 for α0 = 0.15. For SIN, cSIN = .0145 for α0 = 0.1 and cSIN = .0109 for α0 = 0.15 are numerically obtained with w1 = √(1/3) and w2 = √(2/3).
| Naive Procedure (0.025) | SBK | SIN | ||
|---|---|---|---|---|
| α0 = 0.1 | n1 = 4 | 0.0405 | 0.0252 | 0.0258 |
| n1 = 6 | 0.0341 | 0.0246 | 0.0235 | |
| n1 = 8 | 0.0332 | 0.0239 | 0.0249 | |
| n1 = 10 | 0.0342 | 0.0246 | 0.0251 | |
| α0 = 0.15 | n1 = 4 | 0.0402 | 0.0266 | 0.0252 |
| n1 = 6 | 0.0373 | 0.0243 | 0.0242 | |
| n1 = 8 | 0.0377 | 0.0250 | 0.0243 | |
| n1 = 10 | 0.0329 | 0.0242 | 0.0241 |
3.2.2 Effect on power of α0 for SBK when the presumed value of δ is incorrect in the first run
In Figure 2, we let the true δ range from 20% to 140% of the first run guess of effect size (δ1). The overall power was calculated and compared between the SBK procedure for small sample sizes using $\hat{\delta}$ and the conservative $\hat{\delta}_L$. Figure 2(a) provides the power of these two choices under a number of different values of δ and n1 with α0 = 0.1. Note that n1 = 4, 6, 8, and 10 are ordered from the top to the bottom in each group of curves. Figure 2(b) is for α0 = 0.15. With these choices, the SBK procedure with the conservative $\hat{\delta}_L$ (dashed lines) did not increase the overall power by much compared to the SBK procedure with $\hat{\delta}$ (solid lines), even though it increased n2 substantially (see Figure 1(b)). Figure 2(c) provides the probability of moving to the second run using SBK with $\hat{\delta}$ for α0 = 0.1 and α0 = 0.15, respectively. Of course, with α0 = 0.15 the probability of having the second run is higher than with α0 = 0.1, but not by much. The probability of having the second run is quite low when the true δ is much lower or higher than the initial guess (δ1).
Fig. 2.

The overall power and probability of a second run using SBK with α = 0.025. The horizontal axis represents the true value of δ as a % of the first run guess of δ (δ1). (a) shows the overall power with α0 = 0.1 using $\hat{\delta}$ or $\hat{\delta}_L$; (b) shows the overall power with α0 = 0.15 using $\hat{\delta}$ or $\hat{\delta}_L$; (c) shows the probability of a second run using $\hat{\delta}$ with α0 = 0.1 and α0 = 0.15. For each group of curves, from the top to the bottom are n1 = 4, 6, 8, and 10, respectively.
3.2.3 Comparison with other options
In Table 2, the power, the average sample size of n2 given a second run (E(n2 | 0.02 < p1 < α0)), the average sample size of n2 overall (including n2 = 0; E(n2)), and the average total sample size (E(n)) are compared among two SBK procedures (using $\hat{\delta}$ or $\hat{\delta}_L$), the SIN procedure with pre-specified weights, two standard tests (with significance level 0.02 or 0.025), and the naive procedure (with significance level 0.025 at the first run). Here P1→2 denotes the probability of a second run. The first run sample size is n1 = 4 for all procedures, α0 = 0.1, 0.15, and 0.2, and the true δ is 40%, 60%, 80%, 100%, or 120% of the first run guess of δ (δ1). Table 2 shows the increase in power using the adaptive approach as compared to relying only on the first run (standard test) when δ1 is incorrect. With its lack of type I error control, the naive procedure had slightly higher power than the SBK procedure, but not much higher in these cases. Compared to the SIN procedure, the SBK procedure using $\hat{\delta}$ showed similar performance, but slightly larger sample sizes for the second run when the true δ is much lower than our first run guess. The expected value of n2/n1 will be around 2 with α1 = 0.02. With larger α0, there is of course a higher probability of a second run, as seen in Figure 2(c). The average sample sizes of n2 for SBK with $\hat{\delta}_L$ were larger than those with $\hat{\delta}$, but showed only a slight power improvement. Posch et al (2004) discuss conditional power using combination tests as compared to a seamless t-test. Their simulation showed considerable loss of power for combination testing at the small sample size n1 = 2, but negligible loss for n1 = 10. In our simulation with n1 = 4, only small differences in power were observed between the biased naive test and the SBK test.
The power, the average sample size of n2 given a second run, the average total n, and the probability of a second run are also compared between the SBK and SIN procedures using $\hat{\delta}$ for different choices of α1. Although we argued that α1 should be close to α to emphasize the high importance of reaching significance in the first run (e.g. α1 = 0.02), we still wanted to compare this to lower values of α1 (e.g. α1 = 0.01, 0.001). In Figure 3, the first run sample size is n1 = 4 for all procedures. For fixed α1, the SBK and SIN procedures have similar power in Figure 3(a). (For the SBK procedure, p1 × p2 ≤ p1, so it only makes sense to consider cases where α1 ≥ c, because we reject the null if p1 × p2 ≤ c. So the critical value, c, for SBK is not available for the case of small α1 (e.g., α1 = 0.001). There is no similar limitation for SIN.) Where they can be compared, the SBK procedure produces slightly larger sample sizes than SIN for the second run when the true δ is much lower than our first run guess, which is similar to what we observed in Table 2. Surprisingly, for both methods, the power slightly decreases as α1 gets smaller. Of course, the probability of having a second run for both SBK and SIN with smaller α1 is much higher than with α1 = 0.02, especially when the true δ is close to our first run guess (Figure 3(c)), and as expected the second run sample size is smaller with smaller α1 (Figure 3(d)). However, the overall (total) expected sample size is similar for all choices of α1 in Figure 3(b): sometimes the total n is slightly larger and sometimes smaller with smaller α1. Therefore, using a smaller α1 than we suggest (thus conflicting with our emphasis on achieving significance in the first run) provides no efficiency benefit in our context. We did notice that the expected value of n2/n1 is around 1 when α1 = 0.01 or 0.001, which is smaller than the ratio for the larger α1. For this reason, we also ran the simulations for SIN with equal weights w1 = w2 = √(1/2). However, the results for power, E(n2 | second run), and P1→2 are similar to the SIN procedure with the weights w1 = √(1/3) and w2 = √(2/3). Quite similar results are found for a much larger first run sample size (n1 = 48).
Fig. 3.

Simulation results using SBK and SIN for different choices of α1 = 0.02, 0.01, 0.001. First run guess of δ (δ1 = 2.6 standard deviations), n1 = 4, α = 0.025, and α0 = 0.1. The horizontal axis represents the true value of δ as a % of the first run guess of δ (δ1). w1 = √(1/3) and w2 = √(2/3) were used for SIN. (a) shows the overall power; (b) shows the average total sample size; (c) shows the probability of a second run; (d) shows the average sample size of the second run given a second run.
4 Discussion
The extra effort of applying our proposed approach to laboratory experiments presumes that type I error control is critical for reaching unambiguous conclusions about the results of these experiments. In our experience, scientific reviewers require this level of control in both publications and the planning of grants. Thus this method provides a sufficiently sophisticated approach for such reviews, but is not so complex as to be beyond scientists' intuitive understanding of the statistical issues raised by adding replications. It also responds to calls for more efficient use of animals.
Within our context, which highly values running an experiment once to avoid the cost of repeating the experimental set up, considerations for sample size estimation for a potential second run are surprisingly simple. Based on reasonable choices of parameters for experiments and our simulations, we showed that the multi-dimensional considerations that could be complex in other contexts were not so in our context. Although conditional power could be low when the initial guess of effect size is overly optimistic, our requirement that p1 be small (corresponding to lower futility values of α0) before considering a second run, appropriately limits second run sample size and reduces the differences in power among options such as conservative choices of effect size estimation, or a naive method that does not control type I error with two tests.
The choice of weights for SIN discussed in Section 2.5 is based on an expectation of the second stage sample size, n2. If the actual second run sample size is substantially different from expected, the weights could lower the efficiency of the test. Note also that subjects in one stage will then carry more weight than those in the other stage, which has been controversial in human trials.
In planning a potential adaptive experiment with emphasis on achieving significance in the first run, the SBK method also requires choice of the first run significance level (α1), which then determines the combined results significance criterion c. Choosing α1 close to the usual significance level, rather than a much smaller value as is typical in adaptive clinical trials, is consistent with the scientists' need to avoid setting up a costly experiment a second time. In addition, this choice produces slightly more power than choosing a smaller α1, with no increase in total sample size in our context. If one wishes to use a non-parametric test to obtain p-values, the sample sizes can be obtained through the inflation factor method as outlined in PASS 2008 (NCSS LLC, Kaysville, Utah, USA).
When a second run is needed to reach significance, we know that the estimate of effect size will be biased when results from the two runs are combined (Bauer and Kohne, 1994). We suggest that only the results from the first run be used for estimation, and that the combination of data be reserved for hypothesis testing. Of course, the focus of laboratory experiments is on hypothesis testing rather than point estimation. In addition, it is unusual to use effect estimates from laboratory experiments for planning future ones, but coefficients of variation often are used for sample size calculations, and obtaining these from the first run makes most sense. Although estimation based on first stage data is unbiased, it does exclude additional data from the second stage, and this could produce a confidence interval that is inconsistent with the hypothesis test.
Table 3.
Critical values for the SBK and SIN procedures, with α = 0.025, α1 = 0.02, 0.01, 0.001, and α0 = 0.1, 0.15, 0.2 for the two sample problem.
| α1 = 0.02 | α1 = 0.01 | α1 = 0.001 | |||||||
|---|---|---|---|---|---|---|---|---|---|
| α0 = 0.1 | α0 = 0.15 | α0 = 0.2 | α0 = 0.1 | α0 = 0.15 | α0 = 0.2 | α0 = 0.1 | α0 = 0.15 | α0 = 0.2 | |
| cSBK | 0.0031 | 0.0025 | 0.0022 | 0.0065 | 0.0055 | 0.0050 | NA | NA | NA |
| cSIN | 0.0145 | 0.0109 | 0.0095 | 0.0392 | 0.0292 | 0.0248 | 0.0556 | 0.0416 | 0.0353 |
Acknowledgements
This work is partially supported by Award Number UL1RR025755 from the National Center for Research Resources. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Center for Research Resources or the National Institutes of Health. We appreciate the comments of Drs. Xiaobai Li and Lianbo Yu.
Appendix
Conditional Error Function
Posch and Bauer (1999) and Bretz et al (2009) defined a conditional error function for adaptive combination tests as

$$\alpha(p_1) = \int_0^1 1\bigl[C(p_1, x) \le c\bigr]\, dx,$$

where the indicator function 1[·] is 1 if C(p1, x) ≤ c and 0 otherwise. For example, the BK approach with Fisher's product test and the IN approach with the weighted inverse normal combination have the following conditional error functions,

$$\alpha_{BK}(p_1) = \min\bigl(1,\, c/p_1\bigr)$$

and

$$\alpha_{IN}(p_1) = 1 - \Phi\!\left(\frac{\Phi^{-1}(1-c) - w_1\Phi^{-1}(1-p_1)}{w_2}\right),$$

where w1 and w2 are pre-specified weights such that $w_1^2 + w_2^2 = 1$, and Φ denotes the cumulative distribution function of the standard normal distribution.
Conditional Power
Under normality,

$$T_2 = \frac{\bar Y_2 - \bar X_2}{S_2\sqrt{2/n_2}}, \qquad S_2^2 = \frac{S_x^2 + S_y^2}{2},$$

where $\bar X_2$ and $\bar Y_2$ are independent, and $S_2^2$ is independent of both ($S_x^2$ and $S_y^2$ are the sample variances from the second run data for groups x and y, respectively). Thus $T_2$ has a noncentral t-distribution with 2(n2 − 1) degrees of freedom and noncentrality parameter $\eta = \delta\sqrt{n_2/2}$. We can then write

$$CP_\delta(n_2, c \mid p_1) = 1 - G_{2(n_2-1);\,\eta}\bigl(t_{2(n_2-1),\,1-\alpha(p_1)}\bigr),$$

where $G_{2(n_2-1);\,\eta}(\cdot)$ denotes the cumulative distribution function of the noncentral t-distribution with 2(n2 − 1) degrees of freedom and noncentrality parameter η.
Determination of n2 for One Sample Patient Donor Cell Experiments
For our experiment on patients' cells, the null and alternative hypotheses are H0: μ ≤ 0, H1: μ > 0, and the per patient contrast has C ~ N(μ, σ²), where μ is the population mean and σ² is the population variance. The conditional power is

$$CP_\delta(n_2, c \mid p_1) = P\!\left(\frac{\bar C_2}{S_2/\sqrt{n_2}} \ge t_{n_2-1,\,1-\alpha(p_1)}\right),$$

where the effect size is δ = μ/σ, and $\bar C_2$ and $S_2^2$ are independent. Therefore, $T_2 = \bar C_2/(S_2/\sqrt{n_2})$ has a noncentral t-distribution with n2 − 1 degrees of freedom and noncentrality parameter $\eta = \delta\sqrt{n_2}$. This leads to

$$CP_\delta(n_2, c \mid p_1) = 1 - G_{(n_2-1);\,\eta}\bigl(t_{n_2-1,\,1-\alpha(p_1)}\bigr).$$

Therefore, the sample size for the second run, n2, can be calculated by solving the equation $G_{(n_2-1);\,\eta}\bigl(t_{n_2-1,\,1-\alpha(p_1)}\bigr) = \beta$. For the SBK method, n2 is obtained by solving the equation $G_{(n_2-1);\,\eta}\bigl(t_{n_2-1,\,1-c/p_1}\bigr) = \beta$.
References
- Bauer P, Kohne K. Evaluation of experiments with adaptive interim analyses. Biometrics. 1994;50:1029–1041.
- Bretz F, Koenig F, Brannath W, Glimm E, Posch M. Tutorial in biostatistics: Adaptive designs for confirmatory clinical trials. Statistics in Medicine. 2009;28:1181–1217.
- Chen YHJ, DeMets DL, Lan KKG. Increasing the sample size when the unblinded interim result is promising. Statistics in Medicine. 2004;23:1023–1038.
- Cumming G, Finch S. A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement. 2001;61:532–574.
- Denne JS, Jennison C. Estimating the sample size for a t-test using an internal pilot. Statistics in Medicine. 1999;18:1575–1585.
- Festing M. Statistics and animals in biomedical research. Significance. 2010;7:176–177.
- Gao P, Ware JH, Mehta C. Sample size re-estimation for adaptive sequential design in clinical trials. Journal of Biopharmaceutical Statistics. 2008;18:1184–1196.
- Kairalla J, Muller K, Coffey C. Combining an internal pilot with an interim analysis for single degree of freedom tests. Communications in Statistics - Theory and Methods. 2010;39:3717–3738.
- Lehmacher W, Wassmer G. Adaptive sample size calculations in group sequential trials. Biometrics. 1999;55:1286–1290.
- Posch M, Bauer P. Adaptive two stage designs and the conditional error function. Biometrical Journal. 1999;41:689–696.
- Posch M, Bauer P. Interim analysis and sample size reassessment. Biometrics. 2000;56:1170–1176.
- Posch M, Bauer P, Brannath W. Issues in designing flexible trials. Statistics in Medicine. 2003;22:953–969.
- Posch M, Timmesfeld N, Koenig F, Muller HH. Conditional rejection probabilities of Student's t-test and design adaptations. Biometrical Journal. 2004;46:389–403.
- Proschan MA. Two-stage sample size re-estimation based on a nuisance parameter: a review. Journal of Biopharmaceutical Statistics. 2005;15:559–574.
- Wassmer G. A comparison of two methods for adaptive interim analyses in clinical trials. Biometrics. 1998;54:696–705.
- Wittes J, Schabenberger O, Zucker D, Brittain E, Proschan M. Internal pilot studies I: type I error rate of the naive t-test. Statistics in Medicine. 1999;18:3481–3491.
