A practical approach to sample size calculation for fixed populations

Maurits Kaptein

doi:10.1016/j.conctc.2019.100339

2019 Feb 26;14:100339. doi: 10.1016/j.conctc.2019.100339

PMCID: PMC6403069 PMID: 30886936

Abstract

Researchers routinely compute desired sample sizes of clinical trials to control type i and type ii errors. While sample size calculations are well known for many experimental designs, they remain an active area of research. Work in this area focusses predominantly on controlling properties of the trial. In this paper we provide ready-to-use methods to compute sample sizes using an alternative objective, namely that of maximizing the outcome for a whole population. Considering the expected outcome of both the trial and the resulting guideline, we formulate and numerically analyze the expected value of the entire allocation procedure. Our approach strongly relates to theoretical work presented in the 1960s, which demonstrated that allocation procedures that incorporate population sizes when planning experiments are more effective than designs that focus solely on error rates within the trial. We add to this work by a) extending to alternative designs (mean comparisons not assuming equal variances and comparisons of proportions), b) providing easy-to-use software to compute sample sizes for multiple experimental designs, and c) presenting numerical analyses that demonstrate the efficiency of the suggested approach.

Keywords: Sample size calculation, Clinical trial, Decision policies

1. Introduction

Investigators should properly calculate sample sizes before the start of their randomized controlled trials (RCTs) and adequately describe the details in their published report(s) [21]. The landmark article by Freiman, Chalmers, and Smith [14] was one of the first to highlight the importance of sample size calculations: numerous previously reported RCTs were severely underpowered and hence their failure to identify the efficacy of the treatments under scrutiny could hardly be considered decisive evidence. Precise estimation and powerful testing are innately connected to the number of observations collected and hence a-priori sample size considerations should be an integral part of RCT planning.

Despite the fact that for many well known RCT designs (e.g., those testing for differences in means, differences in proportions, etc.) sample size calculations are well known, the accurate computation of sample sizes for complex designs is still an active area of research. Several authors have recently considered sample size computations for specific — more complex — experimental designs [10,17,23,29]. Furthermore, researchers have recently focussed on Bayesian methods for computing sample sizes [4], and have considered the embedding of the trial within its larger context [26]. In all of these cases, sample size calculations aim to control the type i (false-positive) and type ii (false-negative) error rates of the RCT over repeated executions of the trial given that the assumptions made regarding the population that entered the sample size calculations are accurate.

In this paper we examine an alternative objective to determining sample sizes in RCTs. We consider the RCT as merely the first stage in a two-stage treatment allocation policy that, ultimately, allocates one out of a set of competing treatments to all individuals suffering from a specific disease (the population). The RCT, combined with the resulting guidelines for clinical practice, jointly decide which patient in the population receives what treatment. Given this setup, sample size calculations can be motivated by a desire to maximize the expected overall outcome over all patients in a population. This alternative objective for sample size calculation has been studied before in the 60's—a literature we discuss in section 2.2—and its optimization leads to a demonstrably more effective allocation procedure than attained when planning trial sizes solely based on error rates. We hope to contribute by reviving this idea and bringing it to clinical practice by providing an easy-to-use software package to compute sample sizes according to this criterion for various designs, and by numerically examining the differences between the standard approach and the one advocated in this work.

In the remainder of this work we first formalize the problem at hand and motivate our focus on two-stage allocation procedures (an RCT resulting in a deterministic guideline). Next, we review prior work in this area and motivate how our work contributes. In section 3 we introduce the open-source and freely available [R] package ssev that allows researchers and practitioners to easily compute optimal sample sizes for various two-group comparisons. Next, we present a number of numerical results to further illustrate the impact of changing the sample size planning objective from the trial to population; we demonstrate that for small populations our current trials are often overly large, while for large populations they are overly small. Finally, we reflect on our presented results and discuss possible future extensions.

2. Problem formalization and relations to the RCT

The general problem we consider can be phrased in the language of potential outcomes [19,20]. Consider $i = 1, \dots, N$ patients in population P, each with potential outcome $y_{i} (k)$ for treatment $k = 1, \dots, K$ . We are interested in evaluating the performance of different treatment allocation policies π that allocate, for each patient i in the population, one of the K treatments. Specifically, we are interested in the performance of a subset of all possible treatment allocation policies that we coin two-stage allocation policies:

1. In Stage I a number of patients n (where often $n ≪ N$ ) is randomly selected from the population, and we randomly assign one of the K treatments to each of these patients. Thus, the probability that a patient selected in this stage receives treatment k is $p_{k}^{I} = \frac{1}{K}$ . Note that in the remainder of this article we will use the notation $n (k)$ and $\bar{y} (k)$ for the sample size and sample mean computed over all patients who received treatment k and we will use $y_{i} (\cdot)$ to denote the observed value for unit i irrespective of the treatment received.
2. In Stage II we use the data collected in Stage I to select one of the K treatments using some decision procedure δ, and we subsequently prescribe the selected treatment $k = k^{*}$ to the remaining $N - n$ patients in P. Thus, in Stage II we have $p_{k}^{II} = 1$ if $k = k^{*}$ and $p_{k}^{II} = 0$ otherwise. In practice this is done by including treatment $k^{*}$ in our guidelines.

We are interested in the performance of these two-stage allocation policies in terms of their expected outcome per unit when executed in a population of size N. Thus, we are interested in:

$$
\begin{aligned}
E (π_{N}) &= \frac{E [\sum_{i = 1}^{N} y_{i} (\cdot)]}{N} \\
&= \frac{\sum_{i = 1}^{N} \sum_{k = 1}^{K} p_{k}^{(i)} y_{i} (k)}{N} \\
&= \frac{\sum_{i = 1}^{n} \sum_{k = 1}^{K} p_{k}^{I} y_{i} (k)}{N} + \frac{\sum_{i = (n + 1)}^{N} \sum_{k = 1}^{K} p_{k}^{II} y_{i} (k)}{N} \\
&= \frac{\sum_{i = 1}^{n} \sum_{k = 1}^{K} \frac{1}{K} y_{i} (k)}{N} + \frac{\sum_{i = (n + 1)}^{N} \sum_{k = 1}^{K} Pr (k = k^{*}) y_{i} (k)}{N}
\end{aligned}
\tag{1}
$$

where the expectation is over the random sampling and allocation in Stage I and possibly over a random component of the decision procedure δ in Stage II that determines the probability that a specific treatment k is selected. In the second line of Equation (1) we use $p_{k}^{(i)}$ to denote the probability that treatment k is selected for patient i, while in the third line we split the expected value into that of the experiment and that of the resulting guideline using $p_{k}^{I}$ and $p_{k}^{II}$ respectively, since within each stage $p_{k}$ is a constant. In the last line these probabilities are provided: $p_{k}^{I} = \frac{1}{K}$ , and $p_{k}^{II} = Pr (k = k^{*})$ which, with slight abuse of notation, denotes the probability that a specific treatment is selected for inclusion into the guidelines $k = k^{*}$ . Note that for a given population P of size N, when considering a fixed number of treatments K, the value of $E (π_{N})$ depends on the choice of n and the specification of $Pr (k = k^{*})$ , i.e., the probability that the decision procedure δ selects treatment k. Hence, in this setting for a given population, $E (π_{N}) = f (n, δ)$ . Ultimately, we are interested in finding n, given the current approach to δ, such that $E (π_{N})$ is maximized.
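For the two-arm case studied later in the paper ( $K = 2$ ), Equation (1) reduces to a compact form. The following is a sketch under assumptions that are not part of the original text: $\mu_{k}$ denotes the population mean of the potential outcomes under treatment k, $q$ is shorthand for $Pr (k^{*} = 1)$ , and the Stage II term treats the selection of $k^{*}$ as approximately independent of which patients remain after the trial.

```latex
\[
\mu_{k} = \frac{1}{N} \sum_{i = 1}^{N} y_{i} (k), \qquad q = Pr (k^{*} = 1)
\]
\[
E (\pi_{N}) \approx \frac{n}{N} \cdot \frac{\mu_{1} + \mu_{2}}{2}
  + \frac{N - n}{N} \left[ q \, \mu_{1} + (1 - q) \, \mu_{2} \right]
\]
```

The first term is the average outcome during the trial and does not depend on δ. The second term rewards a decision procedure that selects the better arm with high probability, so the choice of n trades the weight $n / N$ of the trial against the quality of the selection.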

2.1. Completing the two-stage approach using current RCT practice

The two-stage allocation policy defined above provides a simplified formalization of our current practice of testing treatments using RCTs. Stage I encompasses the RCT itself, and subsequently Stage II encompasses the decision to, based on the RCT's results, adopt one of the K treatments [16]. The formalization is simplified as we do not consider the common practice of putting prospective treatments k through several rounds of testing [22,24]. Our conceptual treatment can however easily be extended to such a situation as Eq. (1) would still hold but would need to be partitioned into more than two stages. Furthermore, our formalization is simplified in the sense that we do not consider the (relatively common) situation in which new treatments are developed over time, and thus are not available for some subset of patients at some points in time (assuming the patients are treated sequentially) [18]. Finally, we assume that the population size N is known; this assumption will never be met exactly, but a reasonable estimate can often be made for diseases with known incidence rates [12,13].

To closely relate our two-stage formalization to existing RCT practice, we have to specify the decision rule δ and our choice of the sample size n; indeed, in our current practice these are intimately related. Our decision rule δ is, despite much modern work advocating other approaches [22], often based on the practice of null hypothesis significance testing: we specify a null hypothesis $H_{0}$ , and we specify acceptable levels of α and β, the probabilities of making a type i or type ii error respectively [21]. Next, we make a statement about a meaningful alternative hypothesis (e.g., the effect size of interest). Given choices for each of these we can, in many situations, compute the minimal sample size n that controls the error rates given that our assumptions regarding the hypotheses involved are correct. Next, after conducting the trial of size n it is standard practice to compute a p-value and if $p < α$ we reject the null hypothesis and accept the alternative. In practice rejecting the null hypothesis often leads researchers to select the treatment with the highest mean outcome during the trial (thus $k^{*} = \underset{k}{arg max} \bar{y} (k)$ ) while not rejecting the null often leads researchers to select the current status quo.^1 Depending on the study design and the choice of α, the probability of rejecting $H_{0}$ and the probability of selecting treatment k if $H_{a}$ is accepted are readily provided by standard power calculations. Jointly this completes the specification of the decision procedure δ and hence the specification of $p_{k}^{I}$ and $p_{k}^{II}$ necessary to evaluate Eq. (1).
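The decision procedure described above can be made concrete with a small Monte Carlo sketch in [R]. It is an illustration under stated assumptions, not a reference implementation: outcomes are drawn from normal distributions with assumed means `c(0, 0.5)` and a common standard deviation, δ is a two-sided equal-variance t-test, and the function name `simulate_two_stage` and all of its arguments are hypothetical.

```r
# Simulate one run of a two-stage allocation policy with K = 2 arms.
# Assumption: outcomes are normal with the given means and a common sd.
simulate_two_stage <- function(N, n, means, sd = 1, sig.level = 0.05, ties = 0.5) {
  # Stage I: balanced random allocation of n patients to the two arms.
  arm <- sample(rep(1:2, length.out = n))
  y_trial <- rnorm(n, mean = means[arm], sd = sd)

  # Decision procedure: null-hypothesis significance test on the trial data.
  test <- t.test(y_trial[arm == 1], y_trial[arm == 2], var.equal = TRUE)
  if (test$p.value < sig.level) {
    # H0 rejected: adopt the arm with the highest observed mean.
    k_star <- if (mean(y_trial[arm == 1]) > mean(y_trial[arm == 2])) 1 else 2
  } else {
    # H0 not rejected: pick arm 1 with probability `ties`.
    k_star <- if (runif(1) < ties) 1 else 2
  }

  # Stage II: the remaining N - n patients all receive treatment k_star.
  y_guideline <- rnorm(N - n, mean = means[k_star], sd = sd)

  # Realised mean outcome per patient over the whole population.
  (sum(y_trial) + sum(y_guideline)) / N
}

set.seed(1)
runs <- replicate(200, simulate_two_stage(N = 2000, n = 128, means = c(0, 0.5)))
mean(runs)  # Monte Carlo estimate of the expected outcome per patient
```

Averaging over repeated runs approximates $E (π_{N})$ for the chosen n; repeating the exercise for several values of n shows how the trial size changes the expected outcome.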

From the analysis above it is clear that in our current practice $E (π_{N})$ is defined by our choice of α, β, and our assumptions regarding $H_{0}$ and $H_{a}$ (or the effect size): these jointly define δ and n. However, note that this is not a necessity; even if we stick close to current practice by performing a null-hypothesis significance test we could relax our focus on controlling error rates and rather focus on maximizing $E (π_{N})$ . A simple method to generate alternative two-stage treatment allocation policies that is very close to current practice would be to keep our standard level of α, keep our standard decision procedure, but determine n such that $E (π_{N})$ is maximized. This can be done by adding to the current assumptions (e.g., $H_{0}$ and some estimate of the effect size) an informed estimate of N, the population size. After choosing N, we can, for many different designs, evaluate Eq. (1) and select n such that $E (π_{N})$ is maximized. When doing so the power, $1 - β$ , will follow from the procedure. This is the approach implemented in the package ssev we present below.
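Why the maximizing n grows with N can be seen from a first-order condition. The following is a sketch under assumptions added here for illustration: $K = 2$ , treatment 1 has the higher population mean ( $\mu_{1} > \mu_{2}$ ), $q (n)$ denotes the probability that δ selects treatment 1 after a trial of total size n, and n is treated as continuous with $q$ differentiable.

```latex
\[
\begin{aligned}
N \, E (\pi_{N}) &\approx n \, \frac{\mu_{1} + \mu_{2}}{2}
  + (N - n) \left[ q (n) \, \mu_{1} + (1 - q (n)) \, \mu_{2} \right] \\
&= N \mu_{2} + (\mu_{1} - \mu_{2}) \left[ \frac{n}{2} + (N - n) \, q (n) \right]
\end{aligned}
\]
\[
\frac{d}{dn} \left[ \frac{n}{2} + (N - n) \, q (n) \right] = 0
\quad \Longleftrightarrow \quad
(N - n) \, q' (n) = q (n) - \frac{1}{2}
\]
```

The right-hand side is the per-patient cost of keeping one more patient in the trial rather than under the guideline. The left-hand side is the gain from a better-informed choice, spread over the $N - n$ remaining patients. For a larger N the left-hand side is larger at any given n, so, where $q'$ is decreasing, the condition is met at a larger trial size.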

2.2. Prior work and a motivation for two-stage approaches

Surely, others must have considered treatment allocation policies that maximize the expected outcome of the full allocation procedure as opposed to controlling type I and type II errors within the trial? There is actually a very large literature that considers the analysis of different treatment allocation procedures and indeed focusses on the overall outcome of the procedure (often called reward in this literature). This literature on the multi-armed-bandit (MAB) problem—which formalizes the decision problem we described above as a problem in which, sequentially, a gambler selects different arms of a slot-machine, each with a potentially different pay-off, such that she maximizes her rewards—is too large to properly review; we refer the interested reader to Robins [18] or Gittins, Glazebrook and Weber [15].

In the decades that the MAB problem has been studied, we have been able to bound the expected rewards of distinct policies [5], and we have developed allocation policies that are asymptotically optimal [2,27]. We have also connected this mostly theoretical literature directly to our practice in clinical trials [3]. However, the literature on the MAB problem has primarily focussed on allocation policies other than the two-stage policies since any two-stage procedure is provably suboptimal [5]: optimal solutions to the MAB problem effectively balance exploration (learning the effects of each treatment) and exploitation (selecting the best treatment). Optimal allocation policies balance these two objectives by, in effect, decreasing $p_{k}^{(i)}$ smoothly from $\frac{1}{K}$ to 0 for all $k \neq k^{*}$ as i increases. The exact rate of the decrease depends on the observed data and the structure of the problem, but any optimal policy will have a smooth decrease as opposed to the step-wise decrease we see in two-stage policies. Effectively, two-stage policies first explore (when $i \leq n$ ) and subsequently move to exploitation (when $i > n$ ). This sudden change from exploration to exploitation does not yield an optimal reward, and hence two-stage policies (coined ε-first in the MAB literature [25]) are not considered particularly interesting.

However, despite the fact that they are not (asymptotically) optimal, two-stage treatment allocation policies have practical benefits over alternative allocation policies that constantly change $p_{k}^{(i)}$ . The two-stage policy is clearly separated into a trial in which all possible treatments are considered, and the subsequent guideline stage in which only one specific treatment needs to be considered. This means that after the trial we can inform medical professionals of the results of the trial and they do not need to consider alternatives. We can inform patients of the “best” treatment without needing to resort to complex explanations to justify changing probabilities for each patient. And, finally, we can distribute a single treatment (e.g., a medication) to all treatment locations, as opposed to distributing all possible treatments for the (often unlikely) event that a treatment is selected by the policy. These practical benefits of two-stage policies over smooth allocation policies have resulted in a slow uptake of smooth policies in practice [16]. Therefore, we focus specifically on two-stage allocation policies and study alternative methods of determining n, the main parameter that drives the step from exploration to exploitation.

Notably, even when focussing solely on two-stage decision procedures that are close to current practice, this work is not the first of its kind: in the 1960s a body of theoretical work emerged studying the required sample size when aiming to maximize the expected outcome when choosing between treatments. Initially work focussed on choosing between two treatments from normal populations with known variances [7]. The work was quickly extended to allow for multiple stages [8] or multiple treatments [11]. Researchers also examined fully sequential allocation [1,9], an approach closer to the MAB literature. The analysis was further extended to alternative decision rules such as play the winner [28] and to dichotomous outcomes [6]. All of these works convincingly demonstrate the effectiveness gains of including the population size in computations of the sample size, a message we also demonstrate in this work. We deviate from this prior work by focussing more strongly on current RCT practice (i.e., by including a null-hypothesis significance test within the decision procedure, a case not included in these prior analyses^2) and by providing easy-to-use software to compute sample sizes for comparisons of two treatments.

3. An easy to use [R] package for sample size computation

Instead of focussing on an analytical treatment of different two-stage decision procedures as has been done in prior work [7,11], we focus on creating easy-to-use software to compute sample sizes for practical RCT designs while staying close to the current null-hypothesis testing practice. Here we present the ssev [R] package that allows researchers to include population sizes in their RCT planning when setting up comparisons between two groups (i.e., $K = 2$ ) when comparing means (using t-tests with equal variances assumed or not assumed) or proportions.

The ssev package is available on CRAN, and is easily installed using the following [R] commands:

install.packages("ssev")

library(ssev)

After installing the package the compute_sample_size function is available to compute sample sizes that maximize the expected outcome of the two-stage approach described above for various cases. For example, a call to

compute_sample_size(means = c(0,.5), sds = 1, N = 500000)

computes the sample size when comparing two means which are expected to differ by $\frac{1}{2}$ , assuming equal variances, $σ^{2} = 1$ (i.e., Cohen's $d = \frac{1}{2}$ ) and a population size of $N = 500000$ . The call provides the output presented in Fig. 1 which shows that using conventional power calculations (with default choices $α = .05$ and $1 - β = .8$ ) the traditional RCT would require a sample size of 64 per group, while in this case a sample size that maximizes the expected outcome $E (π_{N})$ of a two-stage procedure would require a sample size of 261 per group. When choosing this larger sample size, the expected mean reward of the two-stage procedure over the full population would increase by more than $10\%$ . Table 1 details the arguments to the compute_sample_size function.

Fig. 1 — Example output of the ssev package.
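The two figures quoted in this example, a classical sample size of 64 per group and a gain of more than $10\%$ , can be checked with base [R] alone. The sketch below does not call ssev; it relies on `power.t.test` from the stats package, and the helper `expected_outcome` as well as the tie-break value of 0.5 are assumptions made for illustration.

```r
# Assumed inputs, taken from the example call in the text.
means <- c(0, 0.5)
N <- 500000
sig.level <- 0.05
ties <- 0.5  # assumed probability of picking the first arm when H0 is kept

# Classical per-group sample size for 80% power (base R, not ssev).
n_rct <- ceiling(power.t.test(delta = diff(means), sd = 1,
                              sig.level = sig.level, power = 0.8)$n)
n_rct  # 64

# Expected outcome per patient for a given per-group size (hypothetical helper).
expected_outcome <- function(n_group) {
  # Probability of rejecting H0 in the direction of the true effect.
  p_right <- power.t.test(n = n_group, delta = diff(means), sd = 1,
                          sig.level = sig.level)$power
  # Probability of rejecting H0 in either direction.
  p_any <- power.t.test(n = n_group, delta = diff(means), sd = 1,
                        sig.level = sig.level, strict = TRUE)$power
  # Arm 2 is the better arm here: it is selected after a correct rejection,
  # or with probability 1 - ties when H0 is not rejected.
  p_best <- p_right + (1 - ties) * (1 - p_any)
  n_total <- 2 * n_group
  (n_total * mean(means) +
     (N - n_total) * (p_best * means[2] + (1 - p_best) * means[1])) / N
}

# Relative gain (in percent) of 261 per group over the classical size.
gain <- 100 * (expected_outcome(261) / expected_outcome(n_rct) - 1)
gain
```

The gain arises almost entirely in Stage II: with 64 per group the better arm enters the guideline with a probability of roughly 0.9, whereas with 261 per group that probability is close to 1.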

Table 1.

Arguments for the ssev package to compute sample sizes.

means	A vector of length 2 containing the (assumed) means of the two groups in the case of continuous outcomes.
sds	A vector containing the (assumed) standard deviations of the two groups. When only one element is supplied equal variances are assumed.
proportions	A vector of length 2 containing the (assumed) proportions of the two groups in the case of dichotomous outcomes.
N	Estimated population size.
power	Desired power for the classical RCT (i.e. $1 - β$ ).
sig.level	Significance level of the test used (i.e., α).
ties	Probability of choosing the first group in case of a tie (i.e., in case $H_{0}$ is not rejected).
.verbose	Whether or not verbose output should be provided, default FALSE.
…	further arguments passed on to or from other methods.

The ssev package computes the desired optimal sample sizes using numerical optimization routines in combination with standard power calculations provided in earlier [R] packages (e.g., the MESS and pwr packages). The implementation is relatively straightforward: for each design a simple utility function to compute the expected value of the complete two stage procedure as a function of the sample size n is created which implements Equation (1). Computing the expected value of the RCT is straightforward for all designs included in the package (mean comparisons assuming equal or unequal variances, proportion comparisons), but the probabilities of rejecting $H_{0}$ , and subsequently the probability of selecting one of the $K = 2$ arms given that $H_{0}$ is rejected, differ; these are however readily provided using standard power calculation packages. Numerical optimization is then used to evaluate the expected value function for the desired design for values $2 \leq n \leq N$ and select the value of n that maximizes the expected outcome.
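The steps described above can be sketched in a few lines of base [R]. This is not the ssev source code: it assumes equal variances, replaces the t-test power calculation with a two-sided z-test approximation, and uses the hypothetical function names `expected_outcome` and `optimal_group_size`.

```r
# Expected outcome per patient (Equation (1) for K = 2) as a function of the
# per-group sample size. Assumption: a two-sided z-test approximates the t-test.
expected_outcome <- function(n_group, means, sd, N, sig.level = 0.05, ties = 0.5) {
  z_crit <- qnorm(1 - sig.level / 2)
  z <- abs(diff(means)) / (sd * sqrt(2 / n_group))  # standardised true difference
  p_right <- pnorm(z - z_crit)      # reject H0 with the better arm ahead
  p_wrong <- pnorm(-z - z_crit)     # reject H0 with the worse arm ahead
  p_none <- 1 - p_right - p_wrong   # H0 not rejected
  # `ties` is the probability of choosing the first group when H0 is kept.
  tie_best <- if (means[1] >= means[2]) ties else 1 - ties
  p_best <- p_right + tie_best * p_none
  n_total <- 2 * n_group
  (n_total * mean(means) +
     (N - n_total) * (p_best * max(means) + (1 - p_best) * min(means))) / N
}

# Numerically maximise the expected outcome over the per-group sample size.
optimal_group_size <- function(means, sd, N, sig.level = 0.05, ties = 0.5) {
  fit <- optimize(expected_outcome, interval = c(2, N / 2),
                  means = means, sd = sd, N = N,
                  sig.level = sig.level, ties = ties, maximum = TRUE)
  round(fit$maximum)
}

n_opt <- optimal_group_size(means = c(0, 0.5), sd = 1, N = 500000)
n_opt
```

For the example of Section 3 this sketch should return a value close to the 261 per group reported there; small differences are to be expected because of the normal approximation. Note that `optimize` assumes a single interior maximum, which may not hold for very small N.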

4. Numerical analysis when comparing 2 groups

To gain additional understanding of the effectiveness and efficiency of our proposed method we present a number of numerical evaluations. First, we examine the differences in effectiveness–in terms of expected outcomes—and sample size between the common RCT procedure and our proposed approach. Next, we examine how under- and over-estimates of the population size N affect the computed sample size n.

4.1. Efficiency over current RCT practice

Table 2 presents the difference in expected outcomes (in terms of relative gains) between the common RCT and the method outlined in this paper. We examine three differences in means $d \in \{.2, .5, .8\}$ assuming either equal variances $σ_{1}^{2} = σ_{2}^{2} = 1$ or unequal variances $σ_{1}^{2} = σ_{2}^{2} = 9$ and three differences in proportions $p \in \{.1, .2, .3\}$ for different population sizes $N \in \{10^{2}, 10^{3}, \dots, 10^{8}\}$ . It is clear from the table that in all cases, the optimal sample size leads to a higher expected outcome, $E (π_{N})$ , than current RCT practice with relative differences often exceeding $10\%$ .

Table 2.

Gain of the optimal procedure over common RCT practice in relative percentages.

	Design	d	$10^{2}$	$10^{3}$	$10^{4}$	$10^{5}$	$10^{6}$	$10^{7}$	$10^{8}$
1	Eq. Var.	0.2	5.072	11.969	5.189	10.066	10.935	11.057	11.073
2		0.5	20.578	3.163	9.539	10.811	10.994	11.018	11.021
3		0.8	2.050	6.455	9.977	10.560	10.640	10.650	10.651
4	Uneq. Var.	0.2	1.975	8.444	0.259	7.494	10.545	11.033	11.099
5		0.5	5.750	5.319	6.014	10.237	10.961	11.061	11.074
6		0.8	11.544	0.440	8.441	10.583	10.910	10.953	10.959
7	Prop.	0.1	0.439	1.704	0.359	0.909	1.018	1.034	1.036
8		0.2	3.638	0.064	1.350	1.719	1.776	1.784	1.785
9		0.3	4.363	0.744	1.939	2.178	2.212	2.217	2.217

Table 3 provides further details: the table shows the differences in the size of a single group (i.e., $n / 2$ ) between the common RCT and the optimal scheme suggested in this paper. It is clear that for small population sizes RCTs often require too large sample sizes (borrowing a term from the MAB literature, in these cases the RCT over-explores), while for large populations the sample sizes selected using common power calculations are too low (in these cases these studies over-exploit and hence too often choose the wrong treatment to end up in the subsequent guideline).

Table 3.

Difference in sample size between the choice that maximizes the expected outcome and the traditional RCT. Reported is $n_{rct} - n_{optimal}$ ; thus, positive entries indicate that the RCT would select a larger sample than the optimal procedure. Clearly, for large populations (e.g., $N > 10^{5}$ ) our current RCTs are often too small.

	Design	d	$10^{2}$	$10^{3}$	$10^{4}$	$10^{5}$	$10^{6}$	$10^{7}$	$10^{8}$
1	Eq. Var.	0.2	29	178	−303	−708	−1064	−1400	−1724
2		0.5	26	−34	−101	−159	−213	−266	−316
3		0.8	6	−25	−49	−71	−92	−112	−131
4	Uneq. Var.	0.2	31	285	193	−2169	−4093	−5836	−7493
5		0.5	28	109	−277	−595	−878	−1146	−1404
6		0.8	26	−20	−161	−278	−386	−489	−588
7	Prop.	0.1	11	200	−207	−570	−897	−1209	−1511
8		0.2	29	−13	−105	−186	−262	−335	−406
9		0.3	20	−20	−56	−88	−118	−148	−177

4.2. Robustness to population size estimation

As a final comparison to gain additional insight into the proposed procedure, Table 4 provides the difference in the number of subjects in each group for a trial comparing two means with equal variances ( $σ^{2} = 1$ ) and different effect sizes $d \in \{.2, .5, .8\}$ when the size of the population N is over-estimated or under-estimated by $10\%$ . Thus, the first entry of 1 in Table 4 indicates that when the population size of $10^{2}$ is under-estimated by $10\%$ (i.e., it is estimated at 90), versus when it is over-estimated by $10\%$ (i.e., at 110), the optimal sample size differs by only one unit per group. Clearly, as population sizes increase, the effect of a (proportional) error in estimating the population size increases and the computed group size becomes more variable. In the RCT case, in which the difference between over- and under-estimation does not depend on the population size N, the results are 160, 26, and 10 respectively. This indicates that for small population sizes the proposed optimal procedure is less sensitive to erroneous estimates of the population size than the RCT is. For larger populations the optimal procedure becomes more sensitive to errors in estimating the population size: this is however easily explained as for large populations the potential benefits of additional experimentation (e.g., a larger n) steadily increase.

Table 4.

Comparison of optimal sample sizes in terms of number of subjects per group for varying population sizes.

	Design	d	$10^{2}$	$10^{3}$	$10^{4}$	$10^{5}$	$10^{6}$	$10^{7}$	$10^{8}$
4	Optimal	.2	1	15	202	382	532	672	806
5		.5	1	25	56	80	103	125	145
6		.8	3	15	26	35	44	53	60

5. Conclusions and discussion

In this paper we discussed an alternative approach to computing sample sizes in randomized clinical trials and we have provided an easy-to-use software package to carry out the procedure. The approach we suggest here considers the trial as merely the first stage of the larger process of allocating treatments to patients which can be split up into two distinct stages: first we learn about the effectiveness of treatments during the trial, and subsequently we select and administer the treatment that was most successful in the trial to the remaining patients by including it in our clinical guidelines. We have motivated that the expected outcome of these two-stage allocation policies depends on the choice of sample size n, and the decision procedure δ that is used when moving from Stage I to Stage II. In the current planning of RCTs we often focus on properties of the first stage (in terms of type i and type ii errors), and because of this n is fixed for a given decision procedure δ. We suggest relaxing our fixation on the properties of the trial, and subsequently changing the decision procedure δ, such that we can freely choose a sample size n that maximizes the expected outcome over the full two-stage procedure. Admittedly, doing so introduces a need for informed estimates of the population size N when planning a trial. This seems cumbersome as it is something we are not generally used to. However, we would argue that for many diseases incidence and prevalence rates, which would allow us to make informed estimates of N, are available.

A lot of prior work has considered alternative allocation schemes to the traditional RCT; we have provided pointers to both the MAB literature (in which fully adaptive allocation schemes are discussed) and to earlier results demonstrating the effectiveness of the two-stage approach we propose here [7]. We are well aware that the two-stage approach we examine in this work does not actually maximize the expected outcome of the sequential allocation of treatments over all units: more flexible allocation policies that constantly change $p_{k}^{(i)}$ can achieve a higher outcome. However, we believe that two-stage approaches have sufficient practical benefits to, in some cases, be preferred over more flexible sequential allocation procedures [2]. Hence, optimizing two-stage allocation policies provides a useful addition to the current literature. Our contribution is primarily of an applied nature; we build on earlier ideas to provide an easy-to-use software package that allows for the computation of optimal sample sizes for a number of common RCT designs.

The current paper also numerically examined the differences between current RCT practice and our suggested approach. Qualitatively, the main results are intuitive: for small populations we need smaller samples, while for larger populations we need larger samples, to maximize our expected outcome. Furthermore, a willingness to make assumptions regarding N improves our robustness to choices of the clinically meaningful effect size of the treatment d when N is small. However, we have left a number of avenues unexplored: first of all we restricted ourselves to merely varying β; as α is also inherently arbitrary we might wish to also vary α when computing n in a two-stage allocation policy. Also, despite setting up the problem for arbitrary choice of K, the package ssev currently handles only a choice of $Pr (k = k^{*}) = c = \frac{1}{K}$ ; we feel this is a meaningful contribution but future work should extend the implemented methods to include more complex designs. Finally, in our treatment of the problem we currently only focus on the direct outcomes and we do not include possible differences in costs between the two stages (the trial might be more expensive to carry out than the guidelines), or plausible variable costs during the second stage: these are welcome extensions to explore in future work. However, for now we hope the current work at the very least inspires those planning RCTs to consider alternatives to the standard power calculations advocated in many introductory textbooks; easily available alternatives that are close to current practice might provide an accessible step in the direction of more flexible trial planning and sample size computation.

Footnotes

1. In our numerical analysis below we assume $Pr (k = k^{*}) = c = \frac{1}{K}$ in such cases. This default choice is motivated by the idea that prior to the study, all K arms are equally likely to be superior and hence a random choice after a failed trial seems reasonable. However, in many situations this choice might not be reasonable; e.g., it is unlikely that a placebo is adopted after a failed trial. In such cases one might want to change the ties parameter in the ssev package (see Section 3).

2. Prior work mostly uses $k^{*} = \underset{k}{arg max} \bar{y} (k)$ ; we stay closer to current RCT practice by choosing $k^{*} = \underset{k}{arg max} \bar{y} (k)$ only when $H_{0}$ is rejected.

References

1. Anscombe F. Sequential medical trials. J. Am. Stat. Assoc. 1963;58(302):365–383.
2. Auer P., Cesa-Bianchi N., Fischer P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 2002;47(2–3):235–256.
3. Bartroff J., Lai T.L., Shih M.-C. vol. 298. Springer Science & Business Media; 2012. (Sequential Experimentation in Clinical Trials: Design and Analysis).
4. Brakenhoff T., Roes K., Nikolakopoulos S. Bayesian sample size re-estimation using power priors. Stat. Methods Med. Res. 2018. doi: 10.1177/0962280218772315.
5. Bubeck S., Cesa-Bianchi N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Found. Trends® Mach. Learn. 2012;5(1):1–122.
6. Canner P.L. Selecting one of two treatments when the responses are dichotomous. J. Am. Stat. Assoc. 1970;65(329):293–306.
7. Colton T. A model for selecting one of two medical treatments. J. Am. Stat. Assoc. 1963;58(302):388–400.
8. Colton T. A two-stage model for selecting one of two treatments. Biometrics. 1965;21(1):169–180.
9. Cornfield J., Halperin M., Greenhouse S.W. An adaptive procedure for sequential clinical trials. J. Am. Stat. Assoc. 1969;64(327):759–770.
10. Cunningham T.D., Johnson R.E. Design effects for sample size computation in three-level designs. Stat. Methods Med. Res. 2016;25(2):505–519. doi: 10.1177/0962280212460443.
11. Dunnett C.W. On selecting the largest of k normal population means. J. Roy. Stat. Soc. B. 1960:1–40.
12. Dye C., Scheele S., Dolin P., Pathania V., Raviglione M.C. Global burden of tuberculosis: estimated incidence, prevalence, and mortality by country. JAMA. 1999;282(7):677–686. doi: 10.1001/jama.282.7.677.
13. Feigin V.L., Lawes C.M., Bennett D.A., Anderson C.S. Stroke epidemiology: a review of population-based studies of incidence, prevalence, and case-fatality in the late 20th century. Lancet Neurol. 2003;2(1):43–53. doi: 10.1016/s1474-4422(03)00266-7.
14. Freiman J.A., Chalmers T.C., Smith H., Jr., Kuebler R.R. The importance of beta, the type ii error and sample size in the design and interpretation of the randomized control trial: survey of 71 negative trials. N. Engl. J. Med. 1978;299(13):690–694. doi: 10.1056/NEJM197809282991304.
15. Gittins J., Glazebrook K., Weber R. John Wiley & Sons; 2011. Multi-armed Bandit Allocation Indices.
16. Kaptein M.C. 2018. Computational Personalization: Data Science Methods for Personalized Health.
17. Qiu S.-F., Poon W.-Y., Tang M.-L. Sample size determination for disease prevalence studies with partially validated data. Stat. Methods Med. Res. 2016;25(1):37–63. doi: 10.1177/0962280212439576.
18. Robbins H. Herbert Robbins Selected Papers. Springer; 1985. Some aspects of the sequential design of experiments; pp. 169–177.
19. Rubin D.B. Direct and indirect causal effects via potential outcomes. Scand. J. Stat. 2004;31(2):161–170.
20. Rubin D.B. Causal inference using potential outcomes: design, modeling, decisions. J. Am. Stat. Assoc. 2005;100(469):322–331.
21. Schulz K.F., Grimes D.A. Sample size calculations in randomised trials: mandatory and mystical. Lancet. 2005;365(9467):1348–1353. doi: 10.1016/S0140-6736(05)61034-3.
22. Sedgwick P. Phases of clinical trials. BMJ Br. Med. J. (Clin. Res. Ed.) 2011:343.
23. Shan G. Sample size calculation for agreement between two raters with binary endpoints using exact tests. Stat. Methods Med. Res. 2018;27(7):2132–2141. doi: 10.1177/0962280216676854.
24. Spiegelhalter D.J., Abrams K.R., Myles J.P. vol. 13. John Wiley & Sons; 2004. (Bayesian Approaches to Clinical Trials and Health-Care Evaluation).
25. Tran-Thanh L., Chapman A., Cote E. M. d., Rogers A., Jennings N.R. Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence. AAAI Press; 2010. ε-first policies for budget-limited multi-armed bandits; pp. 1211–1216.
26. Whitehead A.L., Julious S.A., Cooper C.L., Campbell M.J. Estimating the sample size for a pilot randomised trial to minimise the overall trial sample size for the external pilot and main trial for a continuous outcome variable. Stat. Methods Med. Res. 2016;25(3):1057–1073. doi: 10.1177/0962280215588241.
27. Whittle P. Multi-armed bandits and the gittins index. J. Roy. Stat. Soc. B. 1980:143–149.
28. Zelen M. Play the winner rule and the controlled clinical trial. J. Am. Stat. Assoc. 1969;64(325):131–146.
29. Zhu H., Zhang S., Ahn C. Sample size considerations for split-mouth design. Stat. Methods Med. Res. 2017;26(6):2543–2551. doi: 10.1177/0962280215601137.

