Summary
This paper develops Bayesian sample size formulae for experiments comparing two groups, where relevant pre-experimental information from multiple sources can be incorporated in a robust prior to support both the design and analysis. We use commensurate predictive priors for borrowing of information, and further place Gamma mixture priors on the precisions to account for preliminary belief about the pairwise (in)commensurability between parameters that underpin the historical and new experiments. Averaged over the probability space of the new experimental data, appropriate sample sizes are found according to criteria that control certain aspects of the posterior distribution, such as the coverage probability or length of a defined density region. Our Bayesian methodology can be applied to experiments that compare two normal means, proportions or event times. When nuisance parameters (such as the variance) in the new experiment are unknown, a prior distribution can further be specified based on pre-experimental data. Exact solutions are available for most of the criteria considered for Bayesian sample size determination, while a search procedure is described for cases in which there is no closed-form expression. We illustrate the application of our sample size formulae in the design of clinical trials, where pre-trial information is available to be leveraged. Hypothetical data examples, motivated by a rare-disease trial with elicited expert prior opinion, and a comprehensive performance evaluation of the proposed methodology are presented.
Keywords: Bayesian experimental designs, Historical data, Rare-disease trials, Robustness, Sample size
1. Introduction
Conventionally, sample size has often been determined to control certain aspects of the sampling distribution of a test statistic (Desu and Raghavarao, 1990). This is typically considered from a frequentist perspective, whereby operating characteristics, e.g., the type I error rate and power, should be maintained for detecting a meaningful magnitude of difference. For data that are assumed to be i.i.d. normal, sample size may also be a function of nuisance parameters such as unknown variances. Fixing such parameters to certain values may leave the determination inaccurate, or only locally optimal, as an arbitrary guess could deviate far from the true value. The Bayesian framework has been argued to be advantageous for sample size determination (SSD), since it allows uncertainty to be described in a prior for the parameters (O'Hagan and Forster, 2004). Moreover, it brings about the possibility of incorporating pre-experimental information, if available, in a prior for the parameter of interest and/or nuisance parameters. Considerable attention has thus been given to Bayesian SSD; see, for example, Clarke and Yuan (2006).
Two main kinds of methodology in the literature are ‘hybrid classical and Bayesian’ and ‘proper Bayesian’ SSD (Spiegelhalter et al., 2004). With the former, the sample size would be chosen to ensure that the predictive power, obtained by averaging the frequentist power function over the prior distribution for the unknown parameter(s), reaches a desired target level. By contrast, ‘proper Bayesian’ SSD approaches refer to those for which the final analysis of data would also be Bayesian. Joseph et al. (1995) derive formulae for binomial experiments comparing two proportions; specifically, sample sizes are sought to ensure, for example, an adequate coverage probability or width of a defined interval of the posterior ‘success’ rate. Joseph and Bélisle (1997) concentrate on normal distributions and use normal-gamma conjugate priors, for experiments that estimate either single normal means or the difference between two normal means. In the context of clinical trials, Whitehead et al. (2008) develop Bayesian methods resembling frequentist formulations of the SSD problem in exploratory trials, to demonstrate the treatment effect based on posterior interval probabilities. These fully Bayesian approaches shed light on the option of incorporating pre-experimental data into a prior used consistently for both the design and analysis.
This paper is focused on fully Bayesian SSD permitting the use of pre-experimental data from multiple sources. Our research is partly motivated by the efficient design and analysis of clinical trials that evaluate a new treatment for rare diseases (EMA, 2006), where asking for a sample size to achieve the frequentist power is often infeasible. Pre-trial information, collected from historical studies which had been conducted under similar circumstances, or elicited from expert opinion, could play an essential role. The proposed methodology is nonetheless generic: it can be applied to areas where there is a need to use pre-experimental data formally through the mechanism of specifying priors. For instance, the sample size available for environmental water quality evaluation can be limited, and borrowing strength from historical water monitoring data has been considered helpful (Duan et al., 2006).
2. Methods
2.1. Borrowing of historical information from multiple sources
Suppose there are K relevant sets of data, y1,…, yK, to specify a prior for the parameter, denoted by μΔ, underpinning a new experiment. Let θ1,…, θK denote the counterparts of μΔ, specific to each historical experiment k = 1,…, K. Following Zheng and Wason (2022), we specify K commensurate predictive distributions, one per source of information, which are formulated as conditional normal distributions with an unknown mean θk and precision νk (the variance would thus be 1/νk):
θ̃k | θk, νk ~ N(θk, 1/νk),   k = 1, …, K,   (1)
where each θ̃k is regarded as equivalent to μΔ in terms of the parameter space. More precisely, it means that the parameter space for θ̃k, as projected from a pre-experimental parameter θk, would be defined with the same or comparable set of parameter values to that of μΔ. The precision νk is sometimes referred to as a commensurate parameter. Different from the original proposal with a spike-and-slab prior on each νk, we consider a mixture of conjugate priors for analytical derivations; that is, for the predictive precision:
νk ~ wk Gamma(a01, b01) + (1 − wk) Gamma(a02, b02),   (2)
where wk is the prior mixture weight, on the interval [0, 1], to represent preliminary scepticism about how commensurate θk and μΔ would be. The hyperparameters are chosen so that the first mixture component, with a01, b01, has its density concentrated on small values of νk, while the second mixture component, with a02, b02, has its density covering larger values of νk. A large prior mixture weight allocated to either component distribution would thus result in sufficient down-weighting (with no borrowing at all as one extreme) or strong borrowing of historical information (with full pooling as the other extreme), respectively. Stipulating 0 < wk < 1 in (2) produces a compromise between the two extreme cases. The strength of this Gamma mixture prior is then tuned by wk, which can be interpreted as the prior probability of incommensurability.
Then, (θ̃k, νk) | θk has a Normal-Gamma mixture distribution. By integrating out the nuisance parameter νk, we further obtain
θ̃k | θk ~ wk t(2a01; θk, √(b01/a01)) + (1 − wk) t(2a02; θk, √(b02/a02)),   (3)
which is a two-component mixture of non-standardised (shifted and scaled) t distributions. In particular, the component t distributions have their location parameters identically equal to θk, with scale parameters √(b01/a01) and √(b02/a02), respectively. Detailed derivation of (3) and the demonstration of it being a non-standardised t mixture distribution are given in Section A of the Supporting Information. For easing the synthesis of K predictive priors later on, we approximate this unimodal t mixture distribution by a normal distribution such that
θ̃k | θk ≈ N(θk, wk b01/(a01 − 1) + (1 − wk) b02/(a02 − 1)).   (4)
This approximation is based on the first two moments of the non-standardised t mixture distribution, which are analytically available; see Section B of the Supporting Information for details. The variance of the normal approximation, wk b01/(a01 − 1) + (1 − wk) b02/(a02 − 1), takes account of the dispersion of both t mixture components. The goodness of such normal approximation to the original t mixture distribution depends on the degrees of freedom, 2a01 and 2a02, and the scale parameters, √(b01/a01) and √(b02/a02), which are of the investigators’ choice. We show the numerical accuracy of this approximation in Section C of the Supporting Information.
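As a quick numerical illustration, the following R sketch (our own, not code from the Supporting Information) simulates from the commensurate predictive distribution implied by (1)–(2) and compares the empirical moments with those implied by (4); the hyperparameter values match the illustration used later in Section 3 but are otherwise arbitrary.

```r
## Simulation check of the normal approximation (4): draw from (1)-(2) and
## compare empirical moments with the moment-matched normal approximation.
set.seed(1)
theta_k <- -0.3           # a fixed pre-experimental parameter value
w_k     <- 0.2            # prior probability of incommensurability
a01 <- 2;  b01 <- 2       # component concentrated on small precisions (down-weighting)
a02 <- 18; b02 <- 3       # component covering larger precisions (strong borrowing)

n_sim    <- 1e6
is_vague <- rbinom(n_sim, 1, w_k) == 1
nu_k     <- ifelse(is_vague,
                   rgamma(n_sim, shape = a01, rate = b01),
                   rgamma(n_sim, shape = a02, rate = b02))
theta_tilde <- rnorm(n_sim, mean = theta_k, sd = 1 / sqrt(nu_k))

var_approx <- w_k * b01 / (a01 - 1) + (1 - w_k) * b02 / (a02 - 1)  # variance in (4)
c(empirical_mean = mean(theta_tilde), approx_mean = theta_k,
  empirical_var  = var(theta_tilde),  approx_var  = var_approx)
```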
With the normal approximation given by (4), we stipulate μΔ as a linear combination of K ≥ 2 hypothetical random variables, θ̃1, …, θ̃K, projected from the pre-experimental parameters. That is, μΔ = Σk pkθ̃k, with synthesis weights pk, for k = 1,…,K. These synthesis weights p1,…,pK sum to 1, with each reflecting the relative importance of a corresponding pre-experimental dataset to constitute the collective predictive prior for μΔ. Pragmatically, one may associate these synthesis weights with the prior probabilities of commensurability, i.e., 1 − wk, so that a pre-experimental dataset thought of as more commensurate would be assigned a larger pk to derive the collective prior for μΔ. Applying the convolution operator for the sum of normal random variables, μΔ has a normal prior distribution. Suppose each pre-experimental dataset leads to an estimate of θk | yk ~ N(mk, sk²). We thus obtain a normal collective prior such that
μΔ | y1, …, yK ~ N(λ0, τ0²),   with λ0 = Σk pkλk and τ0² = Σk pk²τk²,   (5)
with λk = mk and τk² = sk² + wk b01/(a01 − 1) + (1 − wk) b02/(a02 − 1), for k = 1, …, K,
being the marginal prior means and variances. It accounts for both the variability in a pre-experimental dataset yk and the postulated level of incommensurability, wk, through the Gamma mixture prior placed on the predictive precision, νk. We give more details in Section D of the Supporting Information for this derivation. Using Bayes’ Theorem, this collective prior will be updated by the new experimental data, denoted by yK+1, to a robust posterior.
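For reference, the collective prior (5) can be computed directly from the pre-experimental summaries; the R helper below is a sketch under the notation above (the function and argument names are ours).

```r
## Collective prior (5) assembled from K pre-experimental summaries.
## m, s2: means and variances of theta_k | y_k; w: prior probabilities of
## incommensurability; p: synthesis weights (summing to 1); a01, b01, a02, b02:
## hyperparameters of the Gamma mixture prior (2).
collective_prior <- function(m, s2, w, p, a01 = 2, b01 = 2, a02 = 18, b02 = 3) {
  stopifnot(abs(sum(p) - 1) < 1e-8, length(m) == length(s2))
  tau2_k <- s2 + w * b01 / (a01 - 1) + (1 - w) * b02 / (a02 - 1)  # marginal prior variances
  c(lambda0 = sum(p * m), tau0sq = sum(p^2 * tau2_k))
}
```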
2.2. Criteria for the Bayesian sample size determination
Most Bayesian SSD criteria aim to control certain properties of the posterior, denoted by fp(μΔ | y1,…, yK, yK+1), wherein yK+1 are unobserved at the design stage. It is important to note that the uncertainty in sampling a dataset yK+1 from the entire probability space needs to be accounted for. Thus, strictly speaking, the Bayesian SSD criteria can only maintain the average properties of the posterior.
Joseph and Bélisle (1997) propose specifying a density region, R(yK+1), bounded by r and r + ℓ0, to contain possible parameter values. Here, ℓ0 is the desired interval length and r chosen so that R(yK+1) is the highest posterior density (HPD) interval; so-called HPD because this interval includes the mode of the posterior distribution. This specification can ensure the coverage probability of R(yK+1) to be at least 1 − α, when averaged over all possible samples. Formally, it requires that
∫_𝒴 { ∫_r^(r+ℓ0) fp(μΔ | y1, …, yK, yK+1) dμΔ } fd(yK+1) dyK+1 ≥ 1 − α,   (6)
where 𝒴 denotes the probability space of yK+1 and fd(yK+1) the marginal distribution of the sample, i.e., the new experimental data. Because this criterion controls the coverage probability, it is often referred to as the average coverage criterion (ACC). The posterior distribution in our context would be unimodal and symmetric about the posterior mean, as we can envisage from the collective prior given by (5). We would then simply stipulate the HPD interval as
(E(μΔ | y1, …, yK, yK+1) − ℓ0/2,   E(μΔ | y1, …, yK, yK+1) + ℓ0/2),
which coincides with the alpha-expectation tolerance region by Fraser and Guttman (1956).
An alternative to the ACC is the average length criterion (ALC), which limits the interval length to at most ℓ for a posterior interval that has a coverage probability of 1 − α0 (Joseph and Bélisle, 1997). Let ℓ′(yK+1) be the random length of the posterior credible interval dependent on the unobserved new experimental data. Targeting a fixed coverage probability of 1 − α0, one may solve for ℓ′(yK+1) to meet
∫_r^(r+ℓ′(yK+1)) fp(μΔ | y1, …, yK, yK+1) dμΔ = 1 − α0,
where r would be specified to give the HPD interval as that for the ACC above. Averaged over all possible samples, the ALC requires that
∫_𝒴 ℓ′(yK+1) fd(yK+1) dyK+1 ≤ ℓ.   (7)
The ALC may be preferred to the ACC, since Bayesian practitioners are keen to report, for example, a 95% credible interval alongside the posterior mean in the analysis.
As we can see from (6) and (7), sample sizes chosen to meet the ACC or ALC rely on the marginal, predictive distribution of yK+1; that is, fd(yK+1) = ∫ f(yK+1 | μΔ) π(μΔ) dμΔ. When fd(yK+1) also depends on nuisance parameters, say, the variance σ² being unknown, it becomes fd(yK+1) = ∫∫ f(yK+1 | μΔ, σ²) π(μΔ) π(σ²) dμΔ dσ². In our context, priors for the unknown μΔ and σ² would be specified based on pre-experimental information. The predictive distribution fd(yK+1) would thus formally be fd(yK+1 | y1,…, yK), given our π(μΔ | y1,…,yK) and π(σ² | y1,…,yK).
We consider one additional criterion relating to the moments of the posterior distribution. For practicality reasons, we focus on the second central moment only, so the criterion will be referred to as the average posterior variance criterion (APVC) hereafter. Given a fixed level of dispersion ϵ0, a suitable sample size is chosen to ensure that
∫_𝒴 Var(μΔ | y1, …, yK, yK+1) fd(yK+1) dyK+1 ≤ ϵ0.
As Adcock (1997) commented, this criterion is equivalent to using the squared error (L2-norm) loss function for inference. It is also worth noting that the literature documents many other Bayesian approaches to SSD, e.g., based on the use of utility theory (Lindley, 1997) and Bayes factors (Weiss, 1997); the latter is relevant to pursuing control of the type I error rate and power in hypothesis testing problems.
We note that the fixed values of ℓ0, α0 and ϵ0 are all positive real numbers. Unlike frequentist statistical significance levels, there is no convention for setting these thresholds. They would most likely be backed up by supporting details from the field of application; questions such as what constitutes a meaningful range of μΔ that can provide compelling evidence of a difference between A and B may be discussed with a subject-matter expert.
2.3. Sample size required for comparing two normal means
Consider the comparison of two normal means, denoted by μj, j = A, B, in a new experiment. The difference μΔ = μA − μB > 0 would indicate A is superior to B. Let Xij be the measured outcome from experimental unit i assigned to group j. We assume these measurements are independent, random samples drawn from populations with overall mean μj and a common variance σ². Letting nj be the groupwise sample sizes, the sample means X̄j follow an asymptotic normal distribution by the central limit theorem; that is, X̄j ~ N(μj, σ²/nj), for j = A, B. This further leads to X̄Δ = X̄A − X̄B ~ N(μΔ, σ²(1/nA + 1/nB)).
For cases of known variance
When the common variance σ² is known, μΔ = μA − μB has a normal prior based on pre-experimental datasets y1,…, yK, as was given in (5). Since the joint likelihood of the nA + nB measurements in the new experiment carries information about μΔ through the difference in sample means, we formulate the data likelihood in terms of X̄Δ, which is regarded as a random variable. We further derive the posterior distribution as
μΔ | y1, …, yK, x̄Δ ~ N(η, ξ²),   (8)
with η = ξ² {λ0/τ0² + nAnB x̄Δ/((nA + nB)σ²)} and ξ² = {1/τ0² + nAnB/((nA + nB)σ²)}^(−1),
where x̄Δ is the realisation of X̄Δ. The marginal distribution (unconditional on μΔ) for the difference in sample means is
X̄Δ | y1, …, yK ~ N(λ0, τ0² + σ²(1/nA + 1/nB)),
which corresponds to fd(yK+1) in Section 2.2.
As Joseph and Bélisle (1997) noted, the ACC and ALC result in the same sample size for cases where the variance is known. Hence, we illustrate using the ACC in the following. Letting the HPD interval (r, r + ℓ0) stretch symmetrically around the posterior mean η, the coverage can be computed by
Φ((r + ℓ0 − η)/ξ) − Φ((r − η)/ξ) = 2Φ(ℓ0/(2ξ)) − 1,
where Φ(·) is the cumulative distribution function of the standard normal distribution. We thus have 2Φ(ℓ0/(2ξ)) − 1 ≥ 1 − α, which can be rearranged as
nAnB/(nA + nB) ≥ σ²(4zα/2²/ℓ0² − 1/τ0²),   (9)
where zα/2 is the upper (α/2)-th quantile of the standard normal distribution, i.e., Φ−1(1 − α/2). Similarly, averaging over the entire data space, the APVC gives
nAnB/(nA + nB) ≥ σ²(1/ϵ0 − 1/τ0²).   (10)
When τ0², the prior variance for μΔ based on pre-experimental data, is so small that the right-hand side of the inequalities above becomes zero or negative, little or no information would be required to accrue from a new experiment.
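Under equal allocation nA = nB, inequalities (9) and (10) can be solved directly for the total sample size n = nA + nB, since nAnB/(nA + nB) = n/4; the R sketch below simply inverts them (the helper names are ours). For instance, with σ² = 0.35, τ0² ≈ 0.154, ℓ0 = 0.65 and α = 0.05 (the inputs used in Section 3), n_acc_known() returns approximately 41.8.

```r
## Total sample size n = nA + nB under equal allocation, obtained by inverting
## (9) and (10); note nA * nB / (nA + nB) = n / 4 when nA = nB.
n_acc_known <- function(sigma2, tau0sq, l0, alpha) {
  z <- qnorm(1 - alpha / 2)
  max(0, 4 * sigma2 * (4 * z^2 / l0^2 - 1 / tau0sq))  # zero if the prior alone suffices
}
n_apvc_known <- function(sigma2, tau0sq, eps0) {
  max(0, 4 * sigma2 * (1 / eps0 - 1 / tau0sq))
}
```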
For cases of unknown variance
When σ² is unknown, we assume that the quantity cτ0²/σ² ~ χ²(c), where χ²(c) refers to a Chi-square distribution with c degrees of freedom (Gelman et al., 2013). This is equivalent to specifying that σ² ~ Inv-Gamma(c/2, cτ0²/2); hence, the larger the value c takes, the more σ² concentrates around the prior variance τ0² for μΔ | y1,…, yK. The marginal posterior for μΔ will then be obtained by integrating out the nuisance parameter σ²:
fp(μΔ | y1, …, yK, x̄Δ) ∝ exp{−(μΔ − λ0)²/(2τ0²)} × [1 + (x̄Δ − μΔ)²/(c τ0²(1/nA + 1/nB))]^(−(c+1)/2),   (11)
that is, the posterior is proportional to the product of normal and non-standardised t kernels (Ahsanullah et al., 2014). Detailed steps for deriving (11) are given in Section E of the Supporting Information. In particular, the t density kernel (with the location and scale parameters being x̄Δ and √(τ0²(1/nA + 1/nB)), respectively) can be related to a normal kernel with the same location parameter and the variance as σ²(1/nA + 1/nB), conditional on σ². The posterior (11) can thus be further developed as
μΔ | y1, …, yK, x̄Δ, σ² ~ N(μN, σN²),   (12)
with μN = σN² {λ0/τ0² + nAnB x̄Δ/((nA + nB)σ²)} and σN² = {1/τ0² + nAnB/((nA + nB)σ²)}^(−1),
which is consistent with (8) but here with unknown σ². We can also find the distribution for X̄Δ unconditional on μΔ as
X̄Δ | y1, …, yK ~ ∫_0^∞ N(λ0, τ0² + σ²(1/nA + 1/nB)) f(σ²) dσ²;
see the derivation also in Section E of the Supporting Information. Apparently, this marginal distribution for X̄Δ relies on the prior distribution for the unknown σ², which may yield different solutions of nA and nB across the Bayesian SSD criteria considered in this paper.
Let the interval (a, a + ℓ0) be symmetric about μN given the posterior for μΔ in (12). The sample size is found by requiring 2Φ(ℓ0/(2σN)) − 1 ≥ 1 − α, based on the ACC; thus
nAnB/(nA + nB) ≥ σ²(4zα/2²/ℓ0² − 1/τ0²),
where zα/2 denotes the upper (α/2)-th quantile of the standard normal distribution. Averaging over the prior for the unknown σ², we rewrite the expression and obtain
nAnB/(nA + nB) ≥ (4zα/2²/ℓ0² − 1/τ0²) ∫_0^∞ σ² f(σ²) dσ²,   (13)
where f(σ²) is the pdf of an Inv-Gamma(c/2, cτ0²/2) distribution. The reader may compare this inequality with what was obtained for cases where σ² is known in (9).
Applying the ALC, we need to average the random credible interval length over the marginal distribution for X̄Δ, which varies with σ². According to the definition of the ALC, we obtain that
∫_0^∞ 2zα0/2 {1/τ0² + nAnB/((nA + nB)σ²)}^(−1/2) f(σ²) dσ² ≤ ℓ,   (14)
which does not have a closed-form solution. This requires a search over the integers nA and nB to find the smallest sum that satisfies the inequality. With the APVC, we would likewise remove the dependence on σ² by integration. The formula thus becomes
nAnB/(nA + nB) ≥ (1/ϵ0 − 1/τ0²) ∫_0^∞ σ² f(σ²) dσ².   (15)
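The criteria above are straightforward to evaluate numerically; the R sketch below (our own helper names, assuming equal allocation nA = nB = n/2) uses the prior mean of σ², cτ0²/(c − 2) for c > 2, to invert (13) and (15), and a simple search over even total sample sizes for the ALC in (14).

```r
## Bayesian SSD for unknown sigma^2 with sigma^2 ~ Inv-Gamma(c/2, c * tau0sq / 2),
## equal allocation assumed throughout (nA = nB = n/2).
dinvgamma <- function(x, shape, rate) {
  rate^shape / gamma(shape) * x^(-shape - 1) * exp(-rate / x)
}

## (13) and (15): closed forms via the prior mean of sigma^2, c * tau0sq / (c - 2), c > 2
n_acc_unknown <- function(c, tau0sq, l0, alpha) {
  z <- qnorm(1 - alpha / 2)
  max(0, 4 * (4 * z^2 / l0^2 - 1 / tau0sq) * c * tau0sq / (c - 2))
}
n_apvc_unknown <- function(c, tau0sq, eps0) {
  max(0, 4 * (1 / eps0 - 1 / tau0sq) * c * tau0sq / (c - 2))
}

## (14): no closed form; search over even total sample sizes for the smallest n
## whose average credible interval length does not exceed l
n_alc_unknown <- function(c, tau0sq, l, alpha0, n_max = 10000) {
  z <- qnorm(1 - alpha0 / 2)
  avg_len <- function(n) integrate(function(s2) {
    2 * z / sqrt(1 / tau0sq + n / (4 * s2)) * dinvgamma(s2, c / 2, c * tau0sq / 2)
  }, lower = 0, upper = Inf)$value
  for (n in seq(2, n_max, by = 2)) if (avg_len(n) <= l) return(n)
  NA
}
```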
3. Application
Hampson et al. (2014) present a Bayesian approach for elicitation of expert opinion on model parameters for enhanced design and analysis of rare-disease trials. An elicitation meeting (Hampson et al., 2015) was held for the MYPAN trial, which compares the efficacy of a new treatment (labelled A) relative to the standard of care (labelled B) for polyarteritis nodosa, a rare and severe inflammatory blood vessel disease. Priors were elicited from the input of 15 experts individually. Specifically, opinion was sought on (i) the probability that a patient given B would achieve disease remission within 6 months (a dichotomous event), and (ii) the log-odds ratio of remission rates. Consensus distributions for the remission rates were obtained, with the mode at 71% for A and 74% for B.
In line with the original assumptions for the MYPAN trial, we suppose the Bernoulli probability is not close to 0 or 1, so the log-odds ratio of treatment benefit, i.e., θk = log[(ρAk(1 − ρBk))/((1 − ρAk)ρBk)], would be approximately normally distributed (Agresti, 2003). Here, ρjk denotes the probability of remission for patients receiving treatment j = A, B. We regard the expert opinion as a type of pre-trial information, and further assume it had been summarised in the form θk | yk ~ N(mk, sk²). Eliciting such expert opinion is a non-trivial problem; we refer the interested reader to the literature such as Dias et al. (2017). Furthermore, Hampson et al. (2014) detailed the elicitation process for reaching a probabilistic summary for the log-odds ratio. For illustrative purposes, we assume five sets of expert opinion had been summarised as N(−0.26, 0.25), N(−0.24, 0.23), N(−0.37, 0.22), N(−0.34, 0.36) and N(−0.32, 0.26) to inform θk, k = 1, …, 5. Opinion would also be sought on wk, k = 1,…, 5, to represent the experts’ scepticism about the predictability of each pre-trial parameter θk towards the parameter μΔ, measured on the continuous scale of 0 to 1. In this example, we suppose such pre-trial information is valued about equally, with w1 = 0.15, w2 = 0.20, w3 = 0.17, w4 = 0.13, w5 = 0.20. In practice, the trial statistician could look into the levels of pairwise commensurability between the distributions through a discrepancy measure, such as the Hellinger distance (Dey and Birmiwal, 1994), to reconcile the choices of value for wk.
For reaching a collective prior for μΔ | y1,…, y5, synthesis weights p1,…, p5 need to be specified. We apply a decreasing function:
pk = exp(−wk²/s0) / Σk′ exp(−wk′²/s0),   k = 1, …, K,
with a concentration parameter s0 to transform these weights from w1,…, wK. Specifically, for s0 ≫ wk, all pk will be close to 1/K irrespective of the values of wk. Whereas, with s0 → 0+, the smallest wk would have pk → 1, meaning that the corresponding θk | yk tends to dominate the collective prior. The rationale behind this approach is that both wk and pk might be determined by some distance measure between the parameters θk and μΔ. It is an objective-directed approach, since we hope to discount pre-experimental information to a larger extent, via small values of pk, when it is believed a priori to be less commensurate (thus, large values of wk) with the new experimental data. Figures S2 − S3 (Supporting Information) visualise the impact of wk, k = 1,…, 5, and s0 on the informativeness of the collective prior. A thorough evaluation by Zheng and Wason (2022) shows this objective-directed approach has desirable properties. We generally recommend choosing a small value (relative to the magnitudes of wk) for s0, particularly because this can discern the degrees of relevance and can further lead to a heavy-tailed collective prior for cases of divergent pre-trial information. Here, we set s0 = 0.05 for illustration; consequently, p1 = 0.23, p2 = 0.16, p3 = 0.20, p4 = 0.25, p5 = 0.16. This gives a collective prior μΔ | y1,…, y5 ~ N(−0.309, 0.154), when specifying νk ~ wkGamma(2, 2) + (1 − wk)Gamma(18, 3) for our model.
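For transparency, the weight transformation and the resulting collective prior for this illustration can be checked with a few lines of R; the sketch below simply follows the formulae above, and the printed values should be close (after rounding) to those quoted in the text.

```r
## Synthesis weights p_k from w_k (concentration s0), and the collective prior (5)
## for the five elicited summaries; printed values are approximate.
synth_weights <- function(w, s0) { q <- exp(-w^2 / s0); q / sum(q) }

m  <- c(-0.26, -0.24, -0.37, -0.34, -0.32)   # elicited means of theta_k | y_k
s2 <- c( 0.25,  0.23,  0.22,  0.36,  0.26)   # elicited variances
w  <- c( 0.15,  0.20,  0.17,  0.13,  0.20)   # prior probabilities of incommensurability
a01 <- 2; b01 <- 2; a02 <- 18; b02 <- 3      # Gamma mixture components as specified above

p      <- synth_weights(w, s0 = 0.05)
round(p, 2)                                  # approx. 0.23 0.16 0.20 0.25 0.16
tau2_k <- s2 + w * b01 / (a01 - 1) + (1 - w) * b02 / (a02 - 1)
c(mean = sum(p * m), variance = sum(p^2 * tau2_k))   # approx. -0.309 and 0.154
```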
Assuming a known variance of σ² = 0.35 and that nA = nB, the total sample sizes (i.e., nA + nB) found based on the ACC and ALC criteria are both 41.8, for a 95% posterior coverage probability and a credible interval length of 0.65 on average. For cases of unknown σ², we let σ² ~ Inv-Gamma(2.500, 0.385) (i.e., setting c = 5). The ACC and ALC sample sizes become 30.7 and 24, respectively, for attaining the same posterior behaviours. Targeting ϵ0 = 0.03, the APVC sample sizes are 32.2 and 27.6 for known and unknown σ², respectively.
It may be counter-intuitive to find that the sample sizes for cases of unknown variance are smaller than those for known variance here, especially if the latter is perceived as a version of the former with infinite precision. We would reiterate that the prior specification for σ² in our methodology uses pre-trial information, via an Inv-Gamma distribution. Taking the mode, cτ0²/(c + 2), for illustration, the sample size would be proportional to this quantity, that is, the magnitude of the collective prior variance (i.e., 0.154 in this illustration) scaled by a constant relying on c. This is smaller than the fixed σ² = 0.35; so, not surprisingly, a smaller sample size would be yielded by the same criterion. We also caution that the Inv-Gamma distribution is not necessarily symmetric about the mode, and the uncertainty in σ² needs to be integrated out for the formal sample size determination.
4. Performance evaluation
4.1. Basic settings
Motivated by the MYPAN trial, we generate four base scenarios of historical data, which are configured with different levels of pairwise (in)commensurability and informativeness. Such pre-experimental information from K = 5 sources is supposed to have been summarised as θk | yk ~ N(mk, sk²). For each base scenario, two distinct sets of prior mixture weights, I and II, for robust borrowing are considered to implement the proposed approach for borrowing of information, as listed in Table 1. These weights are chosen to (a) reflect a high or low level of prior confidence in the historical data when they are consistent among themselves, or (b) designate certain sources of historical data to be more influential.
Table 1.
Configurations of hypothetical historical data, each accompanied by two sets of weights for robust borrowing of information. Pre-experimental information about θk | yk is assumed to have been summarised by a prior N(mk, sk²), for k = 1,…, 5.
| Hypothetical historical data | | k = 1 | k = 2 | k = 3 | k = 4 | k = 5 | Σpkλk | Σpk²τk² |
|---|---|---|---|---|---|---|---|---|
| Configuration 1 | mk | -0.260 | -0.240 | -0.370 | -0.340 | -0.320 | | |
| | sk² | 0.250 | 0.230 | 0.220 | 0.360 | 0.260 | | |
| Robust weights I | wk | 0.103 | 0.175 | 0.081 | 0.143 | 0.077 | -0.311 | 0.129 |
| | pk | 0.214 | 0.143 | 0.232 | 0.176 | 0.235 | | |
| Robust weights II | wk | 0.252 | 0.319 | 0.140 | 0.306 | 0.149 | -0.325 | 0.198 |
| | pk | 0.149 | 0.069 | 0.359 | 0.082 | 0.341 | | |
| Configuration 2 | mk | -0.260 | -0.240 | -0.370 | -0.340 | -0.320 | | |
| | sk² | 0.100 | 0.100 | 0.100 | 0.100 | 0.100 | | |
| Robust weights I | wk | 0.103 | 0.175 | 0.081 | 0.143 | 0.077 | -0.311 | 0.096 |
| | pk | 0.214 | 0.143 | 0.232 | 0.176 | 0.235 | | |
| Robust weights II | wk | 0.252 | 0.319 | 0.140 | 0.306 | 0.149 | -0.325 | 0.158 |
| | pk | 0.149 | 0.069 | 0.359 | 0.082 | 0.341 | | |
| Configuration 3 | mk | -0.260 | -0.170 | -0.440 | -0.150 | 0.120 | | |
| | sk² | 0.250 | 0.640 | 0.970 | 1.540 | 0.590 | | |
| Robust weights I | wk | 0.101 | 0.219 | 0.385 | 0.385 | 0.304 | -0.198 | 0.295 |
| | pk | 0.559 | 0.263 | 0.035 | 0.035 | 0.108 | | |
| Robust weights II | wk | 0.325 | 0.203 | 0.171 | 0.180 | 0.272 | -0.215 | 0.379 |
| | pk | 0.065 | 0.235 | 0.298 | 0.280 | 0.122 | | |
| Configuration 4 | mk | -0.260 | -0.170 | -0.440 | -0.150 | 0.120 | | |
| | sk² | 0.250 | 0.150 | 0.400 | 0.890 | 0.220 | | |
| Robust weights I | wk | 0.066 | 0.303 | 0.459 | 0.355 | 0.115 | -0.099 | 0.226 |
| | pk | 0.473 | 0.082 | 0.008 | 0.041 | 0.396 | | |
| Robust weights II | wk | 0.537 | 0.306 | 0.054 | 0.220 | 0.350 | -0.312 | 0.343 |
| | pk | 0.002 | 0.098 | 0.602 | 0.243 | 0.055 | | |
We compute the Hellinger distance between any two of the distributions N(mk, sk²) to describe their pairwise (in)commensurability, as visualised in Figure S4 of the Supporting Information. This is used to justify the values of wk in Table 1 for our numerical study being no greater than 0.500, as the largest Hellinger distance in Figure S4 is below 0.500. Both the Gamma mixture prior for νk and the derivation of the weights pk, for prioritising certain historical data to form a collective prior, follow our specification in Section 3. Nonetheless, we note at the outset that the Gamma component distributions can be equally essential, as their choice has an impact on the effective sample size of the collective prior (Neuenschwander et al., 2020).
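Since the Hellinger distance between two normal distributions is available in closed form, such pairwise distances can be computed directly from the summaries in Table 1; the R helper below is our own, shown purely for illustration.

```r
## Hellinger distance between two normal densities N(m1, s1sq) and N(m2, s2sq),
## used to gauge the pairwise (in)commensurability of historical summaries.
hellinger_normal <- function(m1, s1sq, m2, s2sq) {
  bc <- sqrt(2 * sqrt(s1sq * s2sq) / (s1sq + s2sq)) *
        exp(-(m1 - m2)^2 / (4 * (s1sq + s2sq)))   # Bhattacharyya coefficient
  sqrt(1 - bc)
}
hellinger_normal(-0.26, 0.25, -0.37, 0.22)   # e.g., sources k = 1 and k = 3, configuration 1
```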
We compare the sample sizes computed using the proposed Bayesian SSD formulae with those computed (a) without robustification, i.e., setting each wk = 0 for k = 1, …, 5, (b) without leveraging historical information for μΔ, i.e., setting each wk = 1, (c) from the proper Bayesian SSD approach driven by a single prior, here specified as the most informative N(mk, sk²), for example, N(−0.37, 0.22) for configuration 1, and (d) from an optimal approach as the benchmark. Specifically, the optimal approach is coupled with a perfectly commensurate prior, by equating σ² to the collective prior variance Σpk²τk². In this way, the corresponding result would serve as the benchmark referring to the scenario of perfect consistency between the collective prior and the new data, so the largest saving in sample size could be attained by the proposed methodology. For cases of unknown σ², the optimal sample sizes could be approached by setting c to a sufficiently large value.
4.2. Results
Figure 1 visualises a subset of the results, which compare the proposed Bayesian SSD formulae using robust weights I and II with the alternative approaches for cases of known and unknown σ², respectively. Here, we assume σ² = 0.35 and, if unknown, σ² ~ Inv-Gamma(1.5, 1.5Σpk²τk²) (i.e., c = 3) for illustration. We fix the posterior credible interval length ℓ0 = 0.65 to find the ACC sample sizes, so that the average coverage probability would be 95%, that is, targeting α = 0.05 in (9). Likewise, for computing the ALC sample sizes, we fix α0 = 0.05 and constrain the average length of the posterior credible interval below 0.65. When applying the APVC, sample sizes are found with the average posterior variance controlled at the level ϵ0 = 0.03.
Figure 1.
Comparison of the Bayesian SSD approaches in terms of the sample size obtained according to the ACC, ALC and APVC criteria for cases of (i) known and (ii) unknown σ². Sample sizes in subfigure (ii) for unknown σ² are computed setting c = 3, i.e., assuming that cτ0²/σ² ~ χ²(3), for fairly limited use of pre-experimental information to inform the variance σ². This figure appears in colour in the electronic version of this article, and any mention of colour refers to that version.
In all configurations 1 − 4, we see that the sample sizes computed according to the same criterion using robust weights I are smaller than those using robust weights II. This is because, under our settings, the collective prior produced by robust weights I has a smaller variance than its counterpart produced by robust weights II for each configuration. Moreover, sample sizes yielded using either robust weights I or II are always bounded by those using no robustification (wk = 0) and no borrowing (wk = 1). No robustification leads to the least conservative result from the proposed SSD formulae, since the given historical information is fully used. These results, however, are not necessarily identical to the optimal situations, where σ² is equated to the collective prior variance, or largely determined by the latter if unknown. In Figure 1, we omit the benchmark optimal sample sizes that may be obtained by using the proposed formulae with robust weights I and II for each configuration. We comment on the maximal saving that the proposed SSD approach can achieve below, alongside the other figures.
The height difference across bars of sample sizes, computed using our approach with robust weights I or II versus no borrowing (wk = 1), quantifies the benefit from leveraging pre-experimental information for μΔ. Looking across subfigures (i) and (ii), such height differences between methods are far greater for the unknown-variance case than for the known-variance case. The comparison of SSD approaches with borrowing versus no borrowing, as visualised in subfigure (ii) of Figure 1, is therefore the more objective illustration of the benefit. As mentioned, choosing c = 3 means σ² would be related to the collective prior variance to a very limited extent, as if a diffuse prior had been placed on σ². Thereby, implementing no borrowing by setting wk = 1, pre-experimental information would neither be leveraged through the robust prior for μΔ, nor through the prior for the unknown σ². Consequently, larger sample sizes would be found by the no-borrowing SSD for the unknown-variance than the known-variance cases (assuming σ² = 0.35), to retain similar properties of the posterior distribution. Focusing on the bars for robust weights I and II against no borrowing within subfigure (ii), the saving in the ACC, ALC and APVC sample sizes could be as much as two-thirds for configurations 1 and 2. Such saving is attenuated in configurations 3 and 4, where the historical information is divergent. In configuration 3, the ACC (ALC) sample size obtained from the no-borrowing approach is about twice that from the proposed approach with robust weights I, specifically, 232.2 versus 116.8 (136 versus 65), respectively. We observe a small increase in sample size by using robust weights II instead of I, because slightly higher prior probabilities of incommensurability had been allocated to certain informative historical datasets for greater down-weighting. The trend is similar for the results in configuration 4.
We then compare the proposed approach with an alternative strategy, that is, restricting the use of pre-experimental information to a single source. When the historical data are consistent (divergent) between themselves, the proposed SSD formulae lead to smaller (larger) sample sizes, as is evident in configuration 1 (configurations 3 and 4) for both cases of known and unknown σ². As one may perceive, such selection of a single source could be less robust than averaging over all available pre-experimental information. Another noteworthy finding concerns the comparison of the ACC and ALC sample sizes, particularly when σ² is unknown and we place a minimally informative prior on it (setting c = 3). As shown in Figure 1, the ALC sample size is universally smaller than the ACC sample size for all the investigated configurations.
We move on to quantify how the sample sizes vary as c changes. Focusing on approaches using pre-experimental information from multiple sources, Figure 2 displays the sample sizes exclusively for cases of unknown σ². We set c = 3, 5, 10, 20, 30, 40 and keep the target level of each SSD criterion unchanged from what we used for Figure 1. As c gets larger, the sample sizes for all approaches investigated here decrease and tend to stabilise at their own lowest possible levels. This can be explained from the perspective of the prior effective sample size, to which the variance is a key determining factor. Consider the prior placed on the inverse of the unknown variance, 1/σ² ~ Gamma(c/2, cτ0²/2), of which the mean and variance are 1/τ0² and 2/(c(τ0²)²), respectively. As c increases, the prior variance diminishes, meaning that possible values of 1/σ² are more concentrated around the prior mean obtained based on historical data. For c ≥ 20, the ACC and ALC sample sizes are nearly identical, whereas the ACC sample size is more sensitive than the ALC to small values of c, e.g., when c = 3, 5. We note that the so-called ‘no borrowing’ (by setting wk = 1) should be clarified as no borrowing in terms of the parameter μΔ. When c gets larger, the unknown variance would be more closely tied to the prior variance based on the historical data. That is, borrowing is enabled through the variance, although not directly through the parameter of inferential interest. By fixing wk = 1, historical data would not be leveraged through the robust prior for μΔ, but nevertheless could be used to inform the unknown σ², particularly when c is sufficiently large.
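To make the concentration argument concrete, the short R sketch below tabulates the prior mean and standard deviation of 1/σ² for the values of c considered here, taking τ0² = 0.154 purely as an illustrative value.

```r
## Prior mean and standard deviation of 1/sigma^2 ~ Gamma(c/2, rate = c * tau0sq / 2)
## for increasing c; tau0sq = 0.154 is used purely as an illustrative value.
tau0sq <- 0.154
c_vals <- c(3, 5, 10, 20, 30, 40)
data.frame(c          = c_vals,
           prior_mean = 1 / tau0sq,
           prior_sd   = sqrt(2 / (c_vals * tau0sq^2)))
```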
Figure 2.
The ACC, ALC and APVC sample sizes for the new trial, where the unknown σ² could be related to the collective prior variance by assuming the quantity cτ0²/σ² ~ χ²(c). The extent of borrowing for better knowledge about σ² depends on the number of degrees of freedom, c. This figure appears in colour in the electronic version of this article, and any mention of colour refers to that version.
Figure 3 illustrates how the sample size varies, for cases of unknown σ², when targeting the average coverage probability, posterior credible interval length and posterior variance at different levels. As in Figure 1, these results are obtained by setting c = 3 for a very limited use of pre-experimental information to inform σ². The optimal sample sizes are also plotted to show the maximal saving the proposed SSD formulae may achieve. Specifically, Optim I and II should be taken as the benchmarks for formulae using robust weights I and II, respectively. As expected, sample sizes obtained with robust weights I and II are always bounded by the extremes of no robustification (all wk = 0) and no borrowing (all wk = 1). Given a fixed length ℓ0 = 0.65 of the HPD interval, a larger ACC sample size would be required if increasing the desired average coverage probability, 1 − α. For example, the ACC sample size computed using robust weights I (II) rises from 78.7 to 156.5 (104.4 to 204.2) for configuration 3, had the level of 1 − α been lifted from 90% to 97.5%. The ALC sample sizes displayed in subfigure (ii) ensure a coverage probability of 95%; by relaxing the target average HPD interval length, smaller sample sizes would be needed. Likewise, the APVC sample sizes in subfigure (iii) decrease as we relax the target posterior variance. Generating such plots would be helpful in practice for balancing an economical sample size against a posterior sufficiently informative for inference, on a case-by-case basis. For example, targeting an average length of the HPD interval with 95% coverage probability of ℓ = 0.60 requires the ALC sample size to be 28 for configuration 1 using robust weights I, which may not be much different from the 23 yielded by the level ℓ = 0.65.
Figure 3.
Sample sizes required, when σ² is unknown, to retain the desired average property of the posterior distribution. The ACC and ALC sample sizes are computed by fixing the credible interval length ℓ0 = 0.65 and the coverage probability 1 − α0 = 95%, respectively. This figure appears in colour in the electronic version of this article, and any mention of colour refers to that version.
We further investigate the impact of sk², the associated levels of uncertainty inherent to the historical data k = 1,…, K, on the respective sample sizes. Configurations 1 and 2, with the robust weights kept the same, have been constructed for this purpose. From Figures 1 − 3, it is clear that Configuration 1 requires a larger sample size than Configuration 2 under the same criterion. The explanation is that Configuration 2, with smaller sample variation, leads to a more informative collective prior for μΔ, so less information (sample size) would be required from the new experiment for the inference.
We also examine how sensitive the proposed Bayesian SSD formulae are to the Gamma mixture components. Since a suitable yet least informative Gamma(a01, b01) has been chosen for down-weighting, the other component of the mixture prior, Gamma(a02, b02), determines the maximum borrowing possible. Assuming unknown σ² and setting c = 3, Figure 4 shows the Bayesian sample sizes under different choices of the hyperparameters, a02 and b02, for each criterion. As expected, a more informative Gamma(a02, b02) yields a smaller sample size given the same set of wk, k = 1,…, K. The ALC sample sizes appear to decrease the least, compared with the ACC and APVC, in this sensitivity evaluation. We also observe that the reduction in the Bayesian sample sizes is not proportional to the increase in informativeness of Gamma(a02, b02): setting the informative component as Gamma(18, 3) is not much different from Gamma(54, 3) for our illustrative examples. For practical implementation, we recommend that the component Gamma distributions be chosen to represent the two extremes of very limited borrowing and complete pooling of information, when given a full prior mixture weight of wk = 1 and wk = 0, respectively.
Figure 4.
The proposed Bayesian SSD dependent on the choice of the informative Gamma component distribution for strong borrowing. The labels on the x-axis are short for Optim I, Optim II, Robust weights I, Robust weights II, and No robustification, respectively. This figure appears in colour in the electronic version of this article, and any mention of colour refers to that version.
Finally, comprehensive simulation studies have been performed in Sections H − J of the Supporting Information to investigate (i) the average properties of the posterior for μΔ as updated by the new experimental data, (ii) the sensitivity to non-normal data, and (iii) the performance if the original priors (without normal approximation) are used for the analysis.
5. Discussion
Planning a new experiment with a sufficient sample size necessitates the use of relevant information. Bayesian methods allow for the inherent uncertainty in the estimates of model parameters, as well as a formal incorporation of any expert opinion or historical data. In this paper, we have developed Bayesian sample size formulae that use commensurate priors to leverage pre-experimental data, available from multiple sources, for the model parameter(s) of interest. While we note proposals based on the ‘two-prior’ approach (De Santis, 2007; Brutti et al., 2009), the proposed method specifies a single prior for both the design and analysis of the new experiment.
One area that deserves more investigation surrounds the specification of wk. Following Zheng and Wason (2022), we recommend these be based on some measure of distributional discrepancy, such as the Hellinger distance between any two distributions. The underlying logic is that if the new experiment, at the planning stage, is regarded as compatible with the historical experiments, then their data would also be. The levels of (in)commensurability between a pre-experimental parameter and the new experimental parameter would thus be comparable to those between the pre-experimental parameters themselves. Nevertheless, we recognise that these prior mixture weights wk cannot be correctly specified when the new experimental data are yet to be generated. Pragmatically, the new experiment could be embedded with interim analyses to enable mid-course modifications of wk. Each update of wk tends to better reflect the genuine incommensurability (Zheng and Hampson, 2020).
As noted by one reviewer, there are circumstances where pre-experimental information may be available for a single arm (say, to inform μA or μB) only. The proposed Bayesian methodology can still be useful in that the information may be represented in a commensurate predictive prior for the arm-based statistic(s). Analytical derivation of a posterior for the mean difference can follow the one presented in Section 2. This would be particularly relevant to the special topic of using historical controls in clinical trials to supplement or replace a concurrent control. However, we caution that the selection of relevant pre-trial data on one arm needs to be done carefully, since the model may introduce systematic differences between arms that would affect the inference of the difference in means.
For comparing two Bernoulli probabilities in Section 3, we used a logit transformation to consider the log-odds ratio, which is generally adequately modelled by a normal distribution. The approach of constructing a normal statistic can also be used for time-to-event data, which is elaborated upon in Section K of the Supporting Information with new formulae presented. We are aware of the limitations. For example, accurate estimation of the Bernoulli probabilities is not straightforward and the censoring assumptions in the time-to-event data are simplified. We hope this work motivates further research for sample size determination in both binomial and time-to-event data within this Bayesian context.
Throughout this paper, we supposed pre-experimental information had been available with regard to the parameter of inferential interest. Situations may be more complex in practice. For instance, historical data may have been recorded on a different measurement scale (Zheng et al., 2020) from what might be used for the new experiment under planning. This is an area towards which our future research will look. To promote the uptake of our methodology, we have summarised the necessary actions, along with the specification of key parameters, at different stages of the planning of a new experiment in Section L of the Supporting Information. As a separate note, we applied quite general criteria, such as the ACC and ALC, to control the average coverage probability or length of the HPD interval of the posterior distribution for the parameter of inferential interest throughout. In such decision frameworks, the sample size largely depends on the informativeness of the prior distribution for μΔ, as well as for σ² when using pre-experimental data to inform the variance. With each criterion concerning an average property of the posterior distribution, permitting borrowing (with 0 < wk < 1) yields a smaller sample size than the approach of no borrowing (which can be a limiting case of the proposed model with wk = 1). However, when alternative decision criteria are applied, it is not necessarily true that enabling borrowing always leads to a sample size reduction. An example is research on overcoming prior-data conflict, where the prior is in conflict with the data accrued from the new experiment. There is relevant literature addressing this issue in clinical trials, where maintaining strong control of error rates is required by regulatory agencies (EMA, 1998). Our sample size formulae according to the ACC can be closely relevant, giving a solution analogous to frequentist hypothesis testing; for example, rejection of the null hypothesis could be defined based on posterior interval probabilities with respect to a magnitude of effect deemed clinically meaningful (Whitehead et al., 2008).
Supplementary Material
Acknowledgements
This work was supported by Cancer Research UK through Dr Zheng’s Population Research Postdoctoral Fellowship [RCCPDF\100008]. JW and TJ received funding from the UK Medical Research Council (MC_UU_00002/6, MC_UU_00002/14). This report is independent research arising in part from Prof Jaki’s Senior Research Fellowship (NIHR-SRF-2015-08-001) supported by the National Institute for Health Research. The views expressed in this publication are those of the authors and not necessarily those of the NHS, the National Institute for Health Research or the Department of Health and Social Care (DHSC).
Data Availability Statement
The authors confirm that the simulated data supporting the findings of this paper are reproducible with openly available R code in the Supporting Information.
References
- Adcock CJ. Sample size determination: a review. Journal of the Royal Statistical Society: Series D (The Statistician). 1997;46:261–283.
- Agresti A. Categorical Data Analysis. Wiley Series in Probability and Statistics. Wiley; 2003.
- Ahsanullah M, Kibria B, Shakil M. Normal and Student’s t Distributions and Their Applications. Atlantis Studies in Probability and Statistics. Paris: Atlantis Press; 2014.
- Brutti P, De Santis F, Gubbiotti S. Mixtures of prior distributions for predictive Bayesian sample size calculations in clinical trials. Statistics in Medicine. 2009;28:2185–2201. doi: 10.1002/sim.3609.
- Clarke B, Yuan A. Closed form expressions for Bayesian sample size. Annals of Statistics. 2006;34:1293–1330.
- De Santis F. Using historical data for Bayesian sample size determination. Journal of the Royal Statistical Society: Series A (Statistics in Society). 2007;170:95–113.
- Desu MM, Raghavarao D. Sample Size Methodology. Statistical Modeling and Decision Science. San Diego, CA: Academic Press; 1990.
- Dey DK, Birmiwal LR. Robust Bayesian analysis using divergence measures. Statistics & Probability Letters. 1994;20:287–294.
- Dias L, Morton A, Quigley J. Elicitation: The Science and Art of Structuring Judgement. Springer; 2017.
- Duan Y, Ye K, Smith EP. Evaluating water quality using power priors to incorporate historical information. Environmetrics. 2006;17:95–106.
- EMA. Statistical Principles for Clinical Trials. London: European Medicines Agency; 1998. [Last accessed on 11 June 2020].
- EMA. Guideline on clinical trials in small populations. London: European Medicines Agency; 2006. [Last accessed on 11 June 2020].
- Fraser DAS, Guttman I. Tolerance regions. Annals of Mathematical Statistics. 1956;27:162–179.
- Gelman A, Carlin J, Stern H, Dunson D, Vehtari A, Rubin D. Bayesian Data Analysis. Third edition. Chapman & Hall/CRC Texts in Statistical Science. Boca Raton, FL: Taylor & Francis; 2013.
- Hampson LV, Whitehead J, Eleftheriou D, Brogan P. Bayesian methods for the design and interpretation of clinical trials in very rare diseases. Statistics in Medicine. 2014;33:4186–4201. doi: 10.1002/sim.6225.
- Hampson LV, Whitehead J, Eleftheriou D, et al. Elicitation of expert prior opinion: Application to the MYPAN trial in childhood polyarteritis nodosa. PLOS ONE. 2015;10:1–14. doi: 10.1371/journal.pone.0120981.
- Joseph L, Bélisle P. Bayesian sample size determination for normal means and differences between normal means. Journal of the Royal Statistical Society: Series D (The Statistician). 1997;46:209–226.
- Joseph L, Wolfson DB, Berger RD. Sample size calculations for binomial proportions via highest posterior density intervals. Journal of the Royal Statistical Society: Series D (The Statistician). 1995;44:143–154.
- Lindley DV. The choice of sample size. Journal of the Royal Statistical Society: Series D (The Statistician). 1997;46:129–138.
- Neuenschwander B, Weber S, Schmidli H, O’Hagan A. Predictively consistent prior effective sample sizes. Biometrics. 2020;76:578–587. doi: 10.1111/biom.13252.
- O’Hagan A, Forster JJ. Kendall’s Advanced Theory of Statistics, Volume 2B: Bayesian Inference. Second edition. Kendall’s Library of Statistics. London: Oxford University Press; 2004.
- Spiegelhalter D, Abrams K, Myles J. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. Statistics in Practice. New York: Wiley; 2004.
- Weiss R. Bayesian sample size calculations for hypothesis testing. Journal of the Royal Statistical Society: Series D (The Statistician). 1997;46:185–191.
- Whitehead J, Valdés-Márquez E, Johnson P, Graham G. Bayesian sample size for exploratory clinical trials incorporating historical data. Statistics in Medicine. 2008;27:2307–2327. doi: 10.1002/sim.3140.
- Zheng H, Hampson LV. A Bayesian decision-theoretic approach to incorporate preclinical information into phase I oncology trials. Biometrical Journal. 2020;62:1408–1427. doi: 10.1002/bimj.201900161.
- Zheng H, Hampson LV, Wandel S. A robust Bayesian meta-analytic approach to incorporate animal data into phase I oncology trials. Statistical Methods in Medical Research. 2020;29:94–110. doi: 10.1177/0962280218820040.
- Zheng H, Wason JMS. Borrowing of information across patient subgroups in a basket trial based on distributional discrepancy. Biostatistics. 2022;23:120–135. doi: 10.1093/biostatistics/kxaa019.