Author manuscript; available in PMC: 2012 Sep 1.
Published in final edited form as: Biometrics. 2011 Mar 1;67(3):1163–1170. doi: 10.1111/j.1541-0420.2011.01561.x

Bayesian Design of Non-Inferiority Trials for Medical Devices Using Historical Data

Ming-Hui Chen, Joseph G. Ibrahim, Peter Lam, Alan Yu, Yuanye Zhang
PMCID: PMC3136555  NIHMSID: NIHMS266678  PMID: 21361889

Summary

We develop a new Bayesian approach of sample size determination (SSD) for the design of non-inferiority clinical trials. We extend the fitting and sampling priors of Wang and Gelfand (2002) to Bayesian SSD with a focus on controlling the type I error and power. Historical data are incorporated via a hierarchical modeling approach as well as the power prior approach of Ibrahim and Chen (2000). Various properties of the proposed Bayesian SSD methodology are examined and a simulation-based computational algorithm is developed. The proposed methodology is applied to the design of a non-inferiority medical device clinical trial with historical data from previous trials.

Keywords: Fitting prior, Hierarchical model, Power prior, Sampling prior, Simulation

1. Introduction

Recently, the FDA released “Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials” (February 5, 2010, www.fda.gov/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/ucm071072.htm). This document provides guidance on statistical aspects of the design and analysis of Bayesian clinical trials for medical devices. It lays out detailed guidance on the determination of sample size in a Bayesian clinical trial. This document also provides guidance on the evaluation of the operating characteristics of a Bayesian clinical trial design. Specifically, the evaluation of a Bayesian clinical trial design should include type I error (probability of erroneously approving an ineffective or unsafe device), type II error (probability of erroneously disapproving a safe and effective device), and power (the converse of type II error: the probability of appropriately approving a safe and effective device).

Sample size determination (SSD) is a crucial aspect of clinical trial design. In this paper, we are particularly interested in the design and analysis of non-inferiority trials. There is a vast literature on frequentist methods of SSD for various non-inferiority trials, which includes, for example, D’Agostino Sr. et al. (2003), Hung et al. (2003), Rothmann et al. (2003), Hung et al. (2005), Kieser and Friede (2007), and Fleming (2008). The literature on Bayesian SSD has been growing due to recent advances in Bayesian computation and Markov chain Monte Carlo sampling. Joseph et al. (1995), Lindley (1997), Rubin and Stern (1998), Katsis and Toman (1999), and Inoue et al. (2005) are the Bayesian SSD articles cited in the FDA 2010 Guidance. An early review of Bayesian SSD is given in Adcock (1997). More recent work includes Rahme and Joseph (1998), Simon (1999), Wang and Gelfand (2002), De Santis (2007), and M’Lan et al. (2006, 2008). The existing literature on Bayesian SSD primarily focuses on simple normal, one- and two-sample binomial problems, standard normal linear regression, and generalized linear models. Although the literature on Bayesian SSD discusses a variety of performance criteria, the widely used ones include the Bayes factor (Weiss, 1997), the average posterior variance criterion (APVC) (see, for example, Wang and Gelfand, 2002), the average coverage criterion (ACC), the average length criterion (ALC), the worst outcome criterion (WOC) (e.g., Joseph et al., 1995 and Joseph and Bélisle, 1997), and the approach based on the range of equivalence (see, for instance, Spiegelhalter et al., 2004) for superiority/non-inferiority trials. Other criteria have also been considered in the literature, including Lindley (1997), Pham-Gia (1997), Lam and Lam (1997), and M’Lan et al. (2006, 2008). However, most of the aforementioned Bayesian articles do not directly address the design and analysis of non-inferiority trials, with the exception of Spiegelhalter et al. (2004).

The rest of the paper is organized as follows. In Section 2, we present the design of a non-inferiority trial with two treatment arms for evaluating the performance of a new generation of medical devices in order to motivate the methodology developed in this paper. The availability of historical data from first generation medical devices is also discussed in detail. In Section 3, we propose a general framework of Bayesian SSD for designing a non-inferiority trial. Section 4 provides a detailed development of the incorporation of historical data via the hierarchical modeling approach as well as the power prior formulation. The posterior distribution is discussed and a simulation-based computational algorithm is developed in Section 5. In Section 6, we apply the proposed methodology to sample size determination of the non-inferiority medical device trial discussed in Section 2. The proposed Bayesian SSD method is compared to frequentist SSD methods. We show that Bayesian SSD yields a substantial reduction in the sample size compared to a frequentist design. We conclude the paper with some discussion and extension of the proposed Bayesian SSD method in Section 7.

2. Design of A Non-Inferiority Trial with Two Treatment Arms for Medical Devices

We consider designing a clinical trial to evaluate the performance of a new generation of drug-eluting stent (DES) (“test device”) with a non-inferiority comparison to the first generation of DES (“control device”). Thus, the trial has two arms: test device and control device. The primary endpoint is the 12-month Target Lesion Failure (TLF) (binary) composite endpoint, which is an ischemia-driven revascularization of the target lesion (TLR), myocardial infarction (MI) (Q-wave and non-Q-wave) related to the target vessel, or (cardiac) death related to the target vessel. The secondary endpoint is the 9-month in-segment percent diameter stenosis (%DS) (continuous), which is the percentage of narrowing in the coronary artery caused by the plaque. Let $y_t^{(n_t)} = (y_{t1}, y_{t2}, \ldots, y_{tn_t})'$ and $y_c^{(n_c)} = (y_{c1}, y_{c2}, \ldots, y_{cn_c})'$ be the data corresponding to the test device and the control device, respectively, collected from this trial. Let $n = n_t + n_c$ denote the total sample size. Also, we write $y^{(n)} = ((y_t^{(n_t)})', (y_c^{(n_c)})')'$. We assume that the ratio of the two sample sizes, $r = n_c/n_t$, is fixed. Thus, $n_t = n/(1+r)$ and $n_c = rn/(1+r)$. We choose r to be small, for example, r = 1/4, so that $n_t > n_c$. The goal of the trial is to show that the test device is non-inferior to the control device.
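As a small numerical illustration of this allocation (a sketch of ours; the helper name is not from the paper), the split of a total sample size n under a fixed ratio r = nc/nt can be computed as:

```python
def split_sample(n, r):
    """Split a total sample size n into (n_t, n_c) with the ratio
    r = n_c / n_t held fixed: n_t = n/(1+r) and n_c = r*n/(1+r)."""
    n_t = round(n / (1 + r))
    return n_t, n - n_t
```

For example, r = 1/3 gives the 3:1 allocation used in Section 6: `split_sample(1200, 1/3)` returns (900, 300).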

We assume that $y_t^{(n_t)}$ and $y_c^{(n_c)}$ are two independent random samples. For the primary endpoint, we assume that $y_{ti}$ ($y_{ci}$) follows a Bernoulli distribution Ber($p_t$) (Ber($p_c$)). Let $\mu_t = \log\{p_t/(1-p_t)\}$ and $\mu_c = \log\{p_c/(1-p_c)\}$. For the secondary endpoint, we assume that $y_{ti} \sim N(\mu_t, \sigma^2)$ and $y_{ci} \sim N(\mu_c, \sigma^2)$ independently. Let θ = (μt, μc) for the primary endpoint and θ = (μt, μc, σ2) for the secondary endpoint. Then, the joint distribution of y(n) for the primary endpoint is given by

$$f(y^{(n)} \mid \theta) = \prod_{i=1}^{n_t} \frac{\exp(y_{ti}\,\mu_t)}{1+\exp(\mu_t)} \times \prod_{i=1}^{n_c} \frac{\exp(y_{ci}\,\mu_c)}{1+\exp(\mu_c)}, \quad (2.1)$$

and the joint distribution of y(n) for the secondary endpoint is given by

$$f(y^{(n)} \mid \theta) = \prod_{i=1}^{n_t} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big\{-\frac{1}{2\sigma^2}(y_{ti}-\mu_t)^2\Big\} \times \prod_{i=1}^{n_c} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big\{-\frac{1}{2\sigma^2}(y_{ci}-\mu_c)^2\Big\}. \quad (2.2)$$

The design parameter is the difference between μt and μc, namely, μt − μc, and the hypotheses for non-inferiority testing are

$$H_0: \mu_t - \mu_c \ge \delta \quad \text{versus} \quad H_1: \mu_t - \mu_c < \delta, \quad (2.3)$$

where δ is a prespecified non-inferiority margin. The trial is successful if H1 is accepted.

Historical data are available from two previous trials on the first generation of DES. The first trial conducted in 2002 evaluated the safety and effectiveness of the slow release paclitaxel-eluting stent for treatment of de novo coronary artery lesions. The second trial conducted in 2004 expanded on the first trial, studied more complex de novo lesions, and involved multiple overlapping stents and smaller and larger diameter stents. Our historical data based on lesion size matched criteria are subsets of the data published in Stone et al. (2004, 2005). A summary of the historical data is given in Table 1. In Table 1, SD stands for standard deviation.

Table 1.

Historical Data

                      12-Month TLF                   log(9-month %DS)
                      % TLF (# of failures/n0k)      mean ± SD (n0k)
Historical Trial 1    8.2% (44/535)                  3.0891 ± 0.6315 (242)
Historical Trial 2    10.9% (33/304)                 3.1849 ± 0.5811 (263)

In the next two sections, we will develop the general methodology for Bayesian SSD and elicit priors via historical data.

3. The General Methodology

We first develop a new but general method to determine Bayesian sample size for a non-inferiority trial. Denote the data associated with a sample size of n by y(n) and let θ be the vector of all the model parameters. Then, the joint distribution of y(n) and θ is written as f(y(n)|θ)π(θ), where π(θ) denotes the prior distribution. Let h(θ) be a scalar function that measures the “true” size of the treatment effect. Then, let δ denote the non-inferiority margin. Similar to Hung et al. (2003), we assume that the hypotheses for non-inferiority testing can be formulated as follows:

$$H_0: h(\theta) \ge \delta \quad \text{versus} \quad H_1: h(\theta) < \delta. \quad (3.1)$$

Consequently, we let Θ0 and Θ1 denote the parameter spaces corresponding to H0 and H1. For the hypotheses given in (2.3), h(θ) = μt − μc; Θ0 = {θ = (μt, μc): μt − μc ≥ δ} and Θ1 = {θ: μt − μc < δ} for the primary endpoint; and Θ0 = {θ = (μt, μc, σ2): μt − μc ≥ δ, σ2 > 0} and Θ1 = {θ: μt − μc < δ, σ2 > 0} for the secondary endpoint.

Following Wang and Gelfand (2002), let π(s)(θ) denote the sampling prior and let π(f)(θ) denote the fitting prior. The sampling prior, which captures the portion of the parameter space over which a specified level of performance is to be achieved, is used to generate the data, while the fitting prior is used to fit the model once the data are obtained. We note that π(f)(θ) may be improper as long as the resulting posterior, π(f)(θ|y(n)) ∝ f(y(n)|θ)π(f)(θ), is proper. Further, we let f(s)(y(n)) denote the marginal distribution of y(n) induced by the sampling prior. Now, we introduce the key quantity

$$\beta_s(n) = E_s\big[\,1\{P(h(\theta) < \delta \mid y^{(n)}, \pi^{(f)}) \ge \gamma\}\,\big], \quad (3.2)$$

where the indicator function 1{A} is 1 if A is true and 0 otherwise, γ > 0 is a prespecified quantity, the probability is computed with respect to the posterior distribution given the data y(n) and the fitting prior π(f)(θ), and the expectation is taken with respect to the marginal distribution of y(n) under the sampling prior π(s)(θ).

Now, we propose a new Bayesian SSD algorithm as follows. Let Θ̄0 and Θ̄1 denote the closures of Θ0 and Θ1. Let π0(s)(θ) denote a “sampling prior” with support ΘB = Θ̄0 ∩ Θ̄1. Also let π1(s)(θ) denote a “sampling prior” with support Θ1* ⊂ Θ1. For given α0 > 0 and α1 > 0, we compute

$$n_{\alpha_0} = \min\{n : \beta_{s_0}(n) \le \alpha_0\} \quad \text{and} \quad n_{\alpha_1} = \min\{n : \beta_{s_1}(n) \ge 1 - \alpha_1\}, \quad (3.3)$$

where βs0(n) and βs1(n), given in (3.2) with π(s) = π0(s) and π(s) = π1(s), are the Bayesian type I error and power, respectively. Then, the Bayesian sample size is given by nB = max{nα0, nα1}. According to the FDA 2010 Guidance, we choose γ ≥ 0.95. Common choices of α0 and α1 include α0 = 0.05 and α1 = 0.20, so that the Bayesian sample size nB guarantees that the type I error rate is at most 0.05 and the power is at least 0.80. In addition, for a given sample size nB, the operating characteristic curve can be constructed by varying Θ1* inside of Θ1. If h(θ) is a monotonic function of the distance between Θ1* and ΘB, then the further Θ1* is away from ΘB, the higher the power will be.

A simple illustration: i.i.d. normal case

Suppose y1, y2, …, yn are i.i.d. N(θ, τ−1), where τ is a known precision parameter. Suppose the hypotheses for non-inferiority testing are formulated as follows: H0: θ ≥ δ versus H1: θ < δ. We specify an improper uniform fitting prior for θ, i.e., π(f)(θ) ∝ 1. In addition, we specify two point-mass sampling priors for θ such that π0(s)(θ) = 1 if θ = δ and π1(s)(θ) = 1 if θ = 0. After some algebra, we can show that (i) a necessary condition for achieving a type I error rate of α0 is 1 − γ ≤ α0, and (ii) if 1 − γ ≤ α0, the Bayesian sample size is the smallest integer nB satisfying $n_B \ge \frac{1}{\tau\delta^2}\big[\Phi^{-1}(1-\alpha_1) + \Phi^{-1}(\gamma)\big]^2$, where Φ denotes the N(0, 1) cumulative distribution function. It is interesting to note that for this simple case, βs0(n) ≤ α0 always holds for all n when 1 − γ ≤ α0. We also note that the Bayesian sample size nB is identical to the classical sample size for a one-sided alternative hypothesis when α0 = 1 − γ.
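The closed-form result above is easy to check numerically. The following sketch (our code; the function names are not from the paper) computes nB and the exact Bayesian power under the point-mass sampling prior at θ = 0, using only the standard normal cdf Φ and its inverse:

```python
import math
from statistics import NormalDist

def bayesian_ssd_normal(delta, tau, alpha1, gamma):
    """Smallest integer n_B satisfying
    n_B >= (1/(tau*delta^2)) * [Phi^{-1}(1 - alpha1) + Phi^{-1}(gamma)]^2."""
    z = NormalDist().inv_cdf
    return math.ceil((z(1 - alpha1) + z(gamma)) ** 2 / (tau * delta ** 2))

def bayesian_power(n, delta, tau, gamma):
    """Exact power under the point-mass sampling prior at theta = 0: the trial
    succeeds iff P(theta < delta | y) = Phi((delta - ybar)*sqrt(n*tau)) >= gamma,
    i.e. iff ybar <= delta - Phi^{-1}(gamma)/sqrt(n*tau), where ybar ~ N(0, 1/(n*tau))."""
    nd = NormalDist()
    return nd.cdf(delta * math.sqrt(n * tau) - nd.inv_cdf(gamma))
```

With δ = 0.2, τ = 1, α1 = 0.20, and γ = 0.95, this gives nB = 155; the power at nB is just above 0.80, while the power at nB − 1 falls just below it.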

4. The Incorporation of Historical Data in Bayesian SSD

Historical data are often available only for the control medical device. Now suppose that there are K historical datasets for the control device, denoted by yc0k = (yc0k1, …, yc0kn0k)′ for k = 1, …, K. Let yc0=(yc01,,yc0K) denote all K historical datasets. We develop two approaches, namely, the hierarchical prior and the power prior, to incorporate the historical data yc0.

4.1 Hierarchical Priors

Under the hierarchical Bernoulli/normal model, we assume that yc0k follows the same model given in either (2.1) or (2.2). Let θ0 = (μc01, …, μc0K)′ (or θ0 = (μc01, …, μc0K, σ2)′) for the primary (or secondary) endpoint. Then, the joint distribution of yc0 is given by $f(y_{c0} \mid \theta_0) = \prod_{k=1}^{K}\prod_{i=1}^{n_{0k}} \frac{\exp(y_{c0ki}\,\mu_{c0k})}{1+\exp(\mu_{c0k})}$ for the primary endpoint and $f(y_{c0} \mid \theta_0) = \prod_{k=1}^{K}\prod_{i=1}^{n_{0k}} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\big\{-\frac{1}{2\sigma^2}(y_{c0ki}-\mu_{c0k})^2\big\}$ for the secondary endpoint. We further assume μc ~ N(μc0, τ2), where τ2 > 0, and independently μc0k ~ N(μc0, τ2) for k = 1, …, K.

Let θ* = (μt, μc, θ0, μc0, τ2)′. Then, the hierarchical prior for θ* is given by

$$\pi(\theta^* \mid y_{c0}) \propto f(y_{c0} \mid \theta_0)\,\varphi(\mu_c \mid \mu_{c0}, \tau^2)\prod_{k=1}^{K}\varphi(\mu_{c0k} \mid \mu_{c0}, \tau^2)\,\pi_0(\mu_t, \sigma^2, \mu_{c0}, \tau^2), \quad (4.1)$$

where φ(·|μc0, τ2) denotes the probability density function (pdf) of a N(μc0, τ2) distribution. In (4.1), π0(μt, σ2, μc0, τ2) is the initial prior, which is specified as $\pi_0(\mu_t, \sigma^2, \mu_{c0}, \tau^2) \propto \frac{1}{\sigma^2}(\tau^2)^{-(\xi_0+1)}\exp(-\eta_0/\tau^2)$, where ξ0 > 0 and η0 > 0 are two prespecified hyperparameters. The joint prior in (4.1) is improper, since an improper uniform prior is assumed for μt and the historical data are borrowed for μc and σ2 via the hierarchical model. Finally, the fitting prior is obtained after integrating out μc01, …, μc0K, μc0, and τ2 from (4.1). Specifically, we have

$$\pi^{(f)}(\theta \mid y_{c0}) \propto \int \pi(\theta^* \mid y_{c0})\, d\mu_{c01}\cdots d\mu_{c0K}\, d\mu_{c0}\, d\tau^2. \quad (4.2)$$

To specify the sampling prior π(s)(θ), we assume μt, μc, and σ2 are independent and then specify point mass priors for μt and μc and use the historical data to specify the sampling prior for σ2. Specifically, we take

$$\pi^{(s)}(\theta) = \pi^{(s)}(\mu_t)\,\pi^{(s)}(\mu_c) \quad \text{or} \quad \pi^{(s)}(\theta) = \pi^{(s)}(\mu_t)\,\pi^{(s)}(\mu_c)\,\pi^{(s)}(\sigma^2), \quad (4.3)$$

where $\pi^{(s)}(\sigma^2) \propto \int f(y_{c0} \mid \theta_0)\Big[\prod_{k=1}^{K}\varphi(\mu_{c0k} \mid \mu_{c0}, \tau^2)\Big]\frac{1}{\sigma^2}(\tau^2)^{-(\xi_0^{(s)}+1)}\exp(-\eta_0^{(s)}/\tau^2)\, d\mu_{c01}\cdots d\mu_{c0K}\, d\mu_{c0}\, d\tau^2$, and $\xi_0^{(s)} > 0$ and $\eta_0^{(s)} > 0$ are prespecified hyperparameters, which may be different from (ξ0, η0). As discussed in Section 3, the sampling prior must be proper. We can show that under very mild conditions, the sampling prior π(s)(σ2) is proper.

We note that under the normal model, the hierarchical prior (4.1) for θ* reduces to

$$\pi(\theta^* \mid y_{c0}) \propto (\sigma^2)^{-\frac{1}{2}\sum_{k=1}^{K} n_{0k} - 1}\exp\Big\{-\frac{1}{2\sigma^2}\sum_{k=1}^{K}\big[n_{0k}(\mu_{c0k} - \bar y_{c0k})^2 + (n_{0k}-1)S_{0k}^2\big]\Big\} \times (\tau^2)^{-\{\xi_0 + (K+1)/2 + 1\}}\exp\Big\{-\frac{1}{\tau^2}\Big[\eta_0 + \frac{1}{2}(\mu_c - \mu_{c0})^2 + \frac{1}{2}\sum_{k=1}^{K}(\mu_{c0k} - \mu_{c0})^2\Big]\Big\}, \quad (4.4)$$

where $\bar y_{c0k} = (1/n_{0k})\sum_{i=1}^{n_{0k}} y_{c0ki}$ and $S_{0k}^2 = [1/(n_{0k}-1)]\sum_{i=1}^{n_{0k}}(y_{c0ki} - \bar y_{c0k})^2$ for k = 1, …, K. Thus, the fitting prior and the sampling prior depend on the historical data only through the sufficient statistics {(ȳc0k, S0k2), k = 1, …, K}.
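Since (4.4) involves the historical data only through these sufficient statistics, computing them is all that is needed from each historical dataset. A minimal sketch (ours; the function name is not from the paper):

```python
from statistics import fmean, variance

def normal_sufficient_stats(y):
    """Return (ybar_0k, S_0k^2) for one historical dataset under the normal
    model; statistics.variance uses the n-1 denominator, matching S_0k^2."""
    return fmean(y), variance(y)
```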

4.2 Power Priors

We extend the power priors of Ibrahim and Chen (2000) to build the prior distribution for μc or (μc, σ2) when multiple historical datasets are available. For the primary endpoint, we consider the following normalized power prior for μc given multiple historical data yc0,

$$\pi(\mu_c \mid y_{c0}, a_0) = \frac{1}{C(a_0)}\prod_{k=1}^{K}\Big[\prod_{i=1}^{n_{0k}}\frac{\exp(y_{c0ki}\,\mu_c)}{1+\exp(\mu_c)}\Big]^{a_{0k}}\pi_0(\mu_c), \quad (4.5)$$

where a0 = (a01, …, a0K)′, 0 ≤ a0k ≤ 1 for k = 1, 2, …, K, π0(μc) is an initial prior, and $C(a_0) = \int \prod_{k=1}^{K}\big[\prod_{i=1}^{n_{0k}}\frac{\exp(y_{c0ki}\,\mu_c)}{1+\exp(\mu_c)}\big]^{a_{0k}}\pi_0(\mu_c)\, d\mu_c$. When π0(μc) ∝ 1, (4.5) reduces to

$$\pi(\mu_c \mid y_{c0}, a_0) = \frac{\exp\big\{\mu_c\sum_{k=1}^{K} a_{0k} n_{0k}\bar y_{c0k}\big\}}{B\big(\sum_{k=1}^{K} a_{0k} n_{0k}\bar y_{c0k},\ \sum_{k=1}^{K} a_{0k} n_{0k}(1-\bar y_{c0k})\big)\,[1+\exp(\mu_c)]^{n_0(a_0)}},$$

where B(·, ·) denotes the complete beta function, $n_0(a_0) = \sum_{k=1}^{K} a_{0k} n_{0k}$, and ȳc0k is defined in (4.4). For the secondary endpoint, the normalized power prior for μc and σ2 is given by

$$\pi(\mu_c, \sigma^2 \mid y_{c0}, a_0) = \frac{1}{C(a_0)}\prod_{k=1}^{K}\Big[\prod_{i=1}^{n_{0k}}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\Big\{-\frac{1}{2\sigma^2}(y_{c0ki}-\mu_c)^2\Big\}\Big]^{a_{0k}}\pi_0(\mu_c, \sigma^2), \quad (4.6)$$

where π0(μc, σ2) is an initial prior and C(a0) is the normalizing constant, which is similar to the one in (4.5). Let $\bar y_{c0}(a_0) = \frac{\sum_{k=1}^{K} a_{0k} n_{0k}\bar y_{c0k}}{n_0(a_0)}$ and $S_0^2(a_0) = \sum_{k=1}^{K} a_{0k} n_{0k}\big(\bar y_{c0k} - \bar y_{c0}(a_0)\big)^2 + \sum_{k=1}^{K} a_{0k}(n_{0k}-1)S_{0k}^2$, where ȳc0k and S0k2 are defined in (4.4). When π0(μc, σ2) ∝ 1/σ2, (4.6) reduces to
$$\pi(\mu_c, \sigma^2 \mid y_{c0}, a_0) = \Big(\frac{n_0(a_0)}{2\pi\sigma^2}\Big)^{1/2}\exp\Big\{-\frac{n_0(a_0)}{2\sigma^2}\big[\mu_c - \bar y_{c0}(a_0)\big]^2\Big\} \times \frac{[S_0^2(a_0)/2]^{[n_0(a_0)+1]/2}}{\Gamma([n_0(a_0)+1]/2)}(\sigma^2)^{-\frac{1}{2}[n_0(a_0)+1]}\exp\big\{-S_0^2(a_0)/(2\sigma^2)\big\}.$$
To complete the specification of the power prior, we assume that the a0k’s are independent and distributed as a0k ~ beta(b01, b02), where b01 > 0 and b02 > 0 are prespecified hyperparameters. We mention that the normalized power prior is also considered by Duan et al. (2006), Neuenschwander et al. (2009), and Hobbs et al. (2009).
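To make the borrowing concrete, the sketch below (our code, not the paper's) computes the quantities entering these closed forms. For the Bernoulli endpoint, a change of variables from μc to pc = eμc/(1 + eμc) shows that (4.5) with π0(μc) ∝ 1 corresponds to pc ~ Beta(Σk a0k n0k ȳc0k, Σk a0k n0k(1 − ȳc0k)); for the normal endpoint we compute n0(a0), ȳc0(a0), and S02(a0):

```python
def power_prior_beta(counts, a0):
    """Beta parameters for p_c implied by (4.5) with a flat initial prior on
    mu_c (after a change of variables to the probability scale; our derivation).
    counts = [(failures_k, n_0k)] for k = 1..K; a0 = (a_01, ..., a_0K)."""
    a = sum(w * y for w, (y, n) in zip(a0, counts))
    b = sum(w * (n - y) for w, (y, n) in zip(a0, counts))
    return a, b

def power_prior_normal_stats(stats, a0):
    """n_0(a0), ybar_c0(a0), and S_0^2(a0) entering the closed form of (4.6).
    stats = [(ybar_0k, S2_0k, n_0k)] for k = 1..K."""
    n0 = sum(w * n for w, (_, _, n) in zip(a0, stats))
    ybar = sum(w * n * yb for w, (yb, _, n) in zip(a0, stats)) / n0
    s02 = (sum(w * n * (yb - ybar) ** 2 for w, (yb, _, n) in zip(a0, stats))
           + sum(w * (n - 1) * s2 for w, (_, s2, n) in zip(a0, stats)))
    return n0, ybar, s02
```

With the Table 1 TLF counts and a0 = (0.3, 0.3), `power_prior_beta([(44, 535), (33, 304)], (0.3, 0.3))` gives Beta(23.1, 228.6), whose mean 23.1/251.7 ≈ 9.2% reproduces the pooled historical failure rate.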

Using (4.5) or (4.6), the fitting prior of θ is of the form

$$\pi^{(f)}(\theta \mid y_{c0}) \propto \Big[\int \pi(\tilde\theta \mid y_{c0}, a_0)\prod_{k=1}^{K} a_{0k}^{b_{01}-1}(1-a_{0k})^{b_{02}-1}\, da_0\Big]\pi_0(\mu_t), \quad (4.7)$$

where π(θ̃|yc0, a0) = π(μc|yc0, a0) defined in (4.5) and θ̃ = μc for the primary endpoint, and π(θ̃|yc0, a0) = π(μc, σ2|yc0, a0) defined in (4.6) and θ̃ = (μc, σ2) for the secondary endpoint. Similar to the hierarchical prior, the sampling prior π(s)(θ) under the normalized power prior is specified as follows: π(s)(θ) = π(s)(μt)π(s)(μc) or π(s)(θ) = π(s)(μt)π(s)(μc)π(s)(σ2), where π(s)(μt) and π(s)(μc) are two prespecified proper priors,

$$\pi^{(s)}(\sigma^2) \propto \int \pi(\tilde\theta \mid y_{c0}, a_{0s})\,\pi_0^{(s)}(\sigma^2)\, d\mu_c, \quad (4.8)$$

a0s is prespecified, and π0(s)(σ2) may be an improper initial prior such as π0(s)(σ2) ∝ 1/σ2.

In (4.5) or (4.6), the parameter a0k controls the influence of the kth historical dataset on π(θ̃|yc0, a0). The parameter a0k can be interpreted as a relative precision parameter for the kth historical dataset. One of the main roles of a0 is that it controls the heaviness of the tails of the prior for μc (or (μc, σ2)). As all of the a0k’s become smaller, the tails of (4.5) or (4.6) become heavier. When a0k = 1 for all k with probability 1, (4.7) corresponds to the update of π0(θ) using Bayes' theorem based on the historical data. When a0 = 0 with probability 1, the power prior does not depend on the historical data; that is, a0 = 0 is equivalent to a prior specification with no incorporation of historical data. Thus, the a0k’s control the influence of the multiple historical datasets on the current study. Such control is important in cases where there is heterogeneity among the historical studies, heterogeneity between the historical and current studies, or when the sample sizes of the historical and current studies are quite different.

We note that the use of historical data via the power priors for Bayesian sample size determination is also considered by De Santis (2007). We also note that a0 may be considered to be fixed instead of random. For ease of exposition, we consider the primary endpoint. When a0 is fixed, the fitting prior of θ = (μt, μc)′ is of the form π(f)(θ|yc0) ∝ π(μc|yc0, a0) π0(μt), where π(μc|yc0, a0) is given by (4.5), π0(μt) is an initial prior for μt, and a0 is fixed. When a0 is fixed, we know exactly how much historical data are incorporated in the new trial, and in addition, there is a theoretical connection between the power prior formulation and the hierarchical prior specification as established in Chen and Ibrahim (2006). De Santis (2006) also provides some useful comments on the fixed-a0 case as well as on power priors for the exponential family. On the other hand, when a0 is random, the amount of incorporation of historical data is determined by the data and hence not prespecified by the data analyst.

5. Posteriors and Computations

For ease of exposition, we only consider the primary endpoint. Instead of directly sampling from π(f)(θ|y(n), yc0) ∝ f(y(n)|θ)π(f)(θ|yc0), where f(y(n)|θ) is given by (2.1) and π(f)(θ|yc0) is defined in (4.2) or (4.7), we consider the augmented fitting posterior distribution of the parameters θ*, where θ* = (μt, μc, μc01, …, μc0K, μc0, τ2)′ for the hierarchical prior and θ* = (μt, μc, a0′)′ for the normalized power prior. Then, the augmented fitting posterior distribution of θ* is given by π(f)(θ*|y(n), yc0) ∝ f(y(n)|θ)π(θ*|yc0), where π(θ*|yc0) is defined in (4.1) under the hierarchical prior, and $\pi(\theta^* \mid y_{c0}) \propto \pi(\mu_c \mid y_{c0}, a_0)\big[\prod_{k=1}^{K} a_{0k}^{b_{01}-1}(1-a_{0k})^{b_{02}-1}\big]\pi_0(\mu_t)$ with π(μc|yc0, a0) defined in (4.5) under the normalized power prior. Although the posterior distribution π(f)(θ*|y(n), yc0) is analytically intractable, sampling from this distribution via the Gibbs sampler is quite straightforward, because the conditional posterior distributions of the components of θ* (except for a0) are either known distributions or log-concave. For a0, we use the localized Metropolis algorithm discussed in Chen et al. (2000) to sample from its conditional posterior distribution.

Let {θ*(m), m = 1, 2, …, M} denote a Gibbs sample from the augmented fitting posterior distribution π(f)(θ*|y(n), yc0). As θ is a subvector of θ*, let θ(m) denote the corresponding components of θ*(m) from the mth Gibbs iteration. Then, it is easy to show that {θ(m), m = 1, 2, …, M} is a Gibbs sample from the fitting posterior distribution π(f)(θ|y(n), yc0). Using this Gibbs sample, a Monte Carlo estimate of P(h(θ) < δ|y(n), π(f)) is given by

$$\hat P_f = \frac{1}{M}\sum_{m=1}^{M} 1\{h(\theta^{(m)}) < \delta\}. \quad (5.1)$$

To compute βs(n) in (3.2), we propose the following computational algorithm: Step 0: Specify nt, nc, δ, γ, and N; Step 1: Generate θ ~ π(s)(θ); Step 2: Generate y(n) ~ f(y(n)|θ); Step 3: Run the Gibbs sampler to generate a Gibbs sample {θ(m), m = 1, 2, …, M} of size M from the fitting posterior distribution π(f)(θ|y(n), yc0); Step 4: Compute P̂f via (5.1); Step 5: Check whether P̂f ≥ γ; Step 6: Repeat Steps 1–5 N times; and Step 7: Compute the proportion of runs with P̂f ≥ γ among these N runs, which gives an estimate of βs(n).
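As a runnable illustration of Steps 0–7, the sketch below simulates βs(n) for the Bernoulli endpoint. It is a deliberately simplified stand-in for the paper's procedure: it uses independent Beta(1, 1) fitting priors on pt and pc with no historical borrowing, takes point-mass sampling priors, and puts the margin δ on the risk-difference scale rather than the paper's logit scale; the function and argument names are ours.

```python
import random

def beta_s_hat(n_t, n_c, p_t, p_c, delta=0.041, gamma=0.95, N=200, M=2000, seed=1):
    """Monte Carlo estimate of beta_s(n): the probability, under point-mass
    sampling priors at (p_t, p_c), that the posterior P(p_t - p_c < delta)
    reaches gamma.  Simplified conjugate Beta/binomial sketch, no borrowing."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(N):                                     # Step 6: N design replicates
        y_t = sum(rng.random() < p_t for _ in range(n_t))  # Steps 1-2: generate y(n)
        y_c = sum(rng.random() < p_c for _ in range(n_c))
        hits = 0
        for _ in range(M):                                 # Step 3: M posterior draws
            pt = rng.betavariate(y_t + 1, n_t - y_t + 1)   # conjugate Beta posteriors
            pc = rng.betavariate(y_c + 1, n_c - y_c + 1)
            hits += (pt - pc) < delta
        successes += (hits / M) >= gamma                   # Steps 4-5: P_hat_f vs gamma
    return successes / N                                   # Step 7: estimate of beta_s(n)
```

Calling it with p_t = p_c estimates the power; calling it with p_t = p_c + 0.041 estimates the type I error.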

6. Applications to Medical Device Trials

We apply the proposed methodology to the design of the non-inferiority clinical trial for medical devices discussed in Section 2. We use the historical datasets given in Table 1 to construct our priors in Bayesian SSD. We set γ = 0.95, which implies a target type I error of 0.05. We note that the same γ value was also used in Allocco et al. (2010). In all of the computations below, N = 10,000 and M = 20,000 were used.

Bayesian SSD for TLF

For the primary endpoint, the margin was set to be δ = logit(4.1%) = log{0.041/(1 − 0.041)}. We took (ξ0, η0) = (0.01, 0.01) or (ξ0, η0) = (0.001, 0.001) for the initial prior of τ2 in the fitting prior (4.2), and π0(μc) ∝ 1 and b01 = b02 = 1 for the initial priors of μc and a0k in (4.7). We computed the powers at μt = μc and the type I error at $\frac{\exp(\mu_t)}{1+\exp(\mu_t)} = \frac{\exp(\mu_c)}{1+\exp(\mu_c)} + \frac{\exp(\delta)}{1+\exp(\delta)}$; in other words, we convert μj back to pj in the Bernoulli case. In the sampling prior (4.3), we assumed a point-mass prior at μc = logit(9.2%) for π(s)(μc), where 9.2% was the pooled proportion for the two historical control datasets, and a point-mass prior at μt = μc or $\mu_t = \mathrm{logit}\big[\frac{\exp(\mu_c)}{1+\exp(\mu_c)} + \frac{\exp(\delta)}{1+\exp(\delta)}\big]$ for π(s)(μt). We first computed the powers and the type I errors for various sample sizes based on the proposed Bayesian SSD without the incorporation of historical data. Table 2 shows the results. Table 2 also presents the powers of two frequentist methods, namely, the z-test with unpooled variances and the score test (Farrington and Manning, 1990) for non-inferiority trials. For both frequentist methods, the target type I error was 0.05. In all calculations, a margin of 0.041 on the proportion scale, pc = 9.2%, and a 3:1 sample size ratio were used. PASS 2008 (Hintze, 2008) was used for computing the powers for the two frequentist SSD methods. We see from Table 2 that the proposed Bayesian SSD without incorporation of historical data gives very similar powers compared to the score test for the frequentist SSD, while the type I errors of the Bayesian SSD are controlled at or below 5%. Both the score test and Bayesian SSD yield slightly higher powers than the z-test. In order to achieve 80% power, the z-test requires a total sample size of 1636 with nt = 1227 and nc = 409.

Table 2.

Powers and Type I Errors for 12-Month TLF

Total Sample Size                          1000    1080    1200    1280    1480
  nt                                        750     810     900     960    1110
  nc                                        250     270     300     320     370

Frequentist SSD
  z Test (Unpooled)          Power        0.617   0.646   0.685   0.710   0.764
  Score Test                 Power        0.672   0.699   0.736   0.758   0.807

Bayesian SSD
  No Borrowing               Power        0.648   0.676   0.718   0.738   0.800
  a0 = (0, 0)                Type I Error 0.049   0.048   0.048   0.050   0.044

  Hierarchical Prior         Power        0.796   0.820   0.841   0.863   0.894
  (ξ0, η0) = (0.01, 0.01)    Type I Error 0.044   0.045   0.044   0.049   0.048

  Hierarchical Prior         Power        0.839   0.860   0.882   0.900   0.922
  (ξ0, η0) = (0.001, 0.001)  Type I Error 0.038   0.042   0.039   0.040   0.041

  Power Prior                Power        0.840   0.856   0.884   0.892   0.923
  Fixed a0 = (0.3, 0.3)      Type I Error 0.030   0.027   0.028   0.030   0.032

  Power Prior                Power        0.843   0.878   0.897   0.902   0.914
  Random a0                  Type I Error 0.038   0.031   0.029   0.036   0.039

Table 2 also shows the powers and the type I errors of the Bayesian SSD procedure with hierarchical priors and power priors with fixed and random a0. The hierarchical prior with (ξ0, η0) = (0.001, 0.001) leads to higher powers than the one with (ξ0, η0) = (0.01, 0.01). In addition, the powers based on the power prior with a0 random are comparable to those based on the hierarchical prior with (ξ0, η0) = (0.001, 0.001) and the power prior with a0 fixed at a0 = (0.3, 0.3). These results imply that the power prior with random a0 and the hierarchical prior with (ξ0, η0) = (0.001, 0.001) borrow approximately 30% of the historical data. With incorporation of the historical data, a sample size of (nt, nc) = (810, 270) achieves 80% power. However, based on the frequentist SSD or the Bayesian SSD without incorporation of historical data, a sample size of 1480 with nt = 1110 and nc = 370 is required to achieve 80% power. Thus, the Bayesian SSD with incorporation of historical data leads to a substantial reduction in the sample size.

Bayesian SSD for %DS

For the secondary endpoint, the margin was set to be δ = 0.20. We computed the power at μt = μc and the type I error at μt = μc + δ. In the sampling prior (4.3), we assumed a point-mass prior at μc = 3.15 for π(s)(μc) and a point-mass prior at μt = μc or μt = μc + δ for π(s)(μt). PASS 2008 (Hintze, 2008) was used to compute the powers of the frequentist SSD based on the pooled SD = 0.607. In the Bayesian SSD procedure which does not use any historical data, we used the same pooled SD for σ in generating the data. For the hierarchical prior, we took (ξ0, η0) = (0.01, 0.01) or (ξ0, η0) = (0.001, 0.001) for the initial prior of τ2 in the fitting prior (4.2) and ξ0 = 0.01 and η0 = 0.01 for the initial prior of τ2 in the sampling prior (4.3). For the power priors, we used (4.8) with a0s = (0.05, 0.05) for the sampling prior π(s)(σ2). Using the same sampling prior, we also computed the powers and type I errors with a fixed a0 = (0.08, 0.08) in the fitting prior. The results are shown in Table 3.

Table 3.

Powers and Type I Errors for 9-Month %DS

Total Sample Size                           200     240     260     280     308
  nt                                        150     180     195     210     231
  nc                                         50      60      65      70      77

Frequentist SSD              Power        0.639   0.709   0.739   0.767   0.801

Bayesian SSD
  No Borrowing               Power        0.644   0.699   0.747   0.769   0.805
  a0 = (0, 0)                Type I Error 0.051   0.049   0.051   0.050   0.048

  Hierarchical Prior         Power        0.710   0.773   0.800   0.820   0.847
  (ξ0, η0) = (0.01, 0.01)    Type I Error 0.037   0.038   0.040   0.038   0.039

  Hierarchical Prior         Power        0.791   0.837   0.871   0.877   0.899
  (ξ0, η0) = (0.001, 0.001)  Type I Error 0.023   0.024   0.025   0.028   0.027

  Power Prior                Power        0.812   0.864   0.880   0.899   0.918
  Fixed a0 = (0.08, 0.08)    Type I Error 0.022   0.023   0.026   0.027   0.028

  Power Prior                Power        0.805   0.857   0.878   0.893   0.913
  Random a0                  Type I Error 0.013   0.014   0.017   0.015   0.015

Similar to TLF, the Bayesian SSD procedure with no incorporation of historical data yields powers similar to the frequentist SSD, with the type I errors controlled at the 5% level, and the hierarchical prior with (ξ0, η0) = (0.001, 0.001) yields higher powers than the one with (ξ0, η0) = (0.01, 0.01). From Table 3, we also see that the power prior with random a0 leads to slightly higher powers than the hierarchical prior with (ξ0, η0) = (0.001, 0.001), and the powers based on the power prior with random a0 are comparable to the power prior with a fixed a0 = (0.08, 0.08). These results imply that the hierarchical prior borrows less than 8% of the historical data, while the power prior with random a0 borrows about 8% of the historical data. Similar to TLF, the Bayesian SSD with incorporation of historical data again leads to a substantial reduction in the sample size compared to the frequentist design.

7. Discussion

In this paper, we have developed a general methodology of Bayesian SSD, which is particularly suitable for designing a non-inferiority clinical trial. We have discussed two types of priors, namely, the hierarchical prior and the normalized power prior, to incorporate historical data. We have shown that Bayesian SSD leads to a substantial reduction in the sample size compared to frequentist SSD. One unique feature of the proposed Bayesian SSD methodology is that we use the historical data only from the control device but not from the test device. This feature is desirable, since for the test device, historical data are often not available. Although we primarily focus on the Bernoulli and normal models in this paper, our methodology is applicable to other models in the exponential family. In addition, the proposed methodology can also be extended to generalized linear models (GLMs). The computational algorithm given in Section 5 for these two extensions is basically the same. However, there may be two potential complications. First, a closed-form expression of the normalized power prior under GLMs may not be available. Therefore, an efficient Markov chain Monte Carlo sampling algorithm needs to be developed to sample from the fitting posterior distribution in Step 3 of the computational algorithm in Section 5. Second, the determination of the non-inferiority margin may be more difficult for some GLMs than the situation without covariates. For example, for binomial regression models, the non-inferiority margin based on the difference in two proportions may not be easily converted to the margin on the regression coefficient corresponding to the treatment effect. However, this may not be an issue for other GLMs such as the normal linear regression model.

The proposed Bayesian SSD works best if the historical data from the control device are compatible with the data from the current trial. However, the target type I error and power may not be well maintained when the data from the historical and current trials are not compatible. For non-inferiority trials, we have empirically observed that (i) the type I errors are controlled but the powers are lower when the true proportions or means in the control devices from the current trial are greater than those in the historical data; and (ii) the type I errors tend to be larger, but the powers tend to be higher, when the true proportions or the true means for the control devices in the current trial are less than those in the historical data. For illustrative purposes, we consider n = 1200 with nt = 900 and nc = 300 for the primary endpoint TLF and n = 280 with nt = 210 and nc = 70 for the secondary endpoint %DS. For %DS, if a point-mass sampling prior at μc = 3.10 is assumed and γ = 0.95, the powers and type I errors are 0.836 and 0.047 for the hierarchical prior with (ξ0, η0) = (0.01, 0.01) and 0.936 and 0.049 for the power prior with random a0; and if a point-mass sampling prior at μc = 3.20 is assumed, the powers and type I errors are 0.792 and 0.030 for the hierarchical prior with (ξ0, η0) = (0.01, 0.01) and 0.815 and 0.004 for the power prior with random a0. In all cases, the type I errors are still controlled at 0.05. However, for TLF, the type I error is not controlled as shown in Table 4. Specifically, if a point-mass sampling prior at μc = logit(8.0%) is assumed, the type I errors are 0.068 for the hierarchical prior with (ξ0, η0) = (0.01, 0.01) and 0.070 for the power prior with a0k ~ beta(1, 1) and γ = 0.95. There are two approaches for resolving this type I error problem. One approach is to change the initial prior beta(b01, b02) for a0k in (4.7) to down-weight the historical control data, as suggested by an anonymous Associate Editor.
Another approach is to increase the value of γ, which is recommended in the FDA 2010 Guidance. As shown in Table 4 for TLF, if a point mass sampling prior at μc = logit(8.0%) is assumed, the type I error decreases in b02 when b01 is fixed at 1. When (b01, b02) = (1, 10), which gives an initial prior weight of 10% to the historical control data, the type I error is 0.053. Also, we see from Table 4 that for a fixed initial prior beta(b01, b02), the type I error decreases in γ. In particular, when (b01, b02) = (1, 1) and γ = 0.97, the type I error is 0.041 if a point mass sampling prior at μc = logit(8.0%) is assumed. A combination of these two approaches is also quite effective in controlling the type I error while maintaining good power as shown in Table 4. Further methodological approaches for controlling the type I error are currently under investigation.
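The operating characteristics discussed above are estimated by simulation: trials are generated under a point mass sampling prior, each simulated trial is analyzed under the fitting prior, and the proportion of declared successes gives the power (under a non-inferior truth) or the type I error (under a truth at the margin). The sketch below illustrates this loop for a Bernoulli non-inferiority test with flat conjugate fitting priors in place of the paper's hierarchical or power priors; the margin delta = 0.044, the rates, and the simulation sizes are all hypothetical choices made for illustration.

```python
# Monte Carlo estimate of power / type I error for a Bernoulli
# non-inferiority design. Each trial is declared a success when the
# posterior probability P(pt - pc < delta | data) reaches gamma.
import numpy as np

rng = np.random.default_rng(2011)

def success_probability(pt_true, pc_true, nt, nc, delta, gamma,
                        n_sims=1000, n_draws=2000):
    """Fraction of simulated trials declared successful."""
    successes = 0
    for _ in range(n_sims):
        # Generate one trial under the point mass sampling prior.
        yt = rng.binomial(nt, pt_true)
        yc = rng.binomial(nc, pc_true)
        # Analyze with independent beta(1, 1) fitting priors (conjugate).
        pt = rng.beta(1 + yt, 1 + nt - yt, n_draws)
        pc = rng.beta(1 + yc, 1 + nc - yc, n_draws)
        if np.mean(pt - pc < delta) >= gamma:
            successes += 1
    return successes / n_sims

# Power: simulate under a non-inferior truth (pt = pc).
power = success_probability(0.092, 0.092, 900, 300, 0.044, 0.95)
# Type I error: simulate with pt exactly at the margin (pt = pc + delta).
type1 = success_probability(0.136, 0.092, 900, 300, 0.044, 0.95)
```

Incorporating historical control data would replace the beta(1, 1) fitting prior for pc with the hierarchical or power prior, which is exactly where the incompatibility effects described above enter.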

Table 4.

Powers and Type I Errors under Three Sampling Priors for 12-Month TLF with (nt, nc) = (900, 300). Each sampling prior is a point mass at μc = logit(pc).

                                                pc = 8.0%           pc = 9.2%           pc = 10.0%
Fitting Prior                           γ     Power  Type I Err   Power  Type I Err   Power  Type I Err

Hierarchical Prior
  (ξ0, η0) = (0.01, 0.01)             0.95    0.894    0.068      0.841    0.044      0.788    0.032
                                      0.96    0.880    0.058      0.816    0.037      0.757    0.027
                                      0.97    0.854    0.046      0.782    0.027      0.714    0.020

Power Prior with a0k ~ beta(b01, b02) in (4.7)
  (b01, b02) = (1, 1)                 0.95    0.945    0.070      0.882    0.039      0.799    0.034
  (b01, b02) = (1, 5)                 0.95    0.916    0.061      0.832    0.033      0.760    0.026
  (b01, b02) = (1, 10)                0.95    0.868    0.053      0.791    0.038      0.728    0.032
  (b01, b02) = (1, 1)                 0.96    0.935    0.055      0.880    0.022      0.765    0.026
  (b01, b02) = (1, 1)                 0.97    0.917    0.041      0.848    0.015      0.719    0.009
  (b01, b02) = (1, 5)                 0.96    0.899    0.047      0.803    0.027      0.722    0.021

Finally, we briefly discuss how to determine whether the trial is successful after it is completed. The computational algorithm developed in Section 5 can still be used for this purpose. Specifically, the outcome of the trial can be determined as follows. Step 0: Use the same γ and the same fitting prior specified at the design stage. Step 1: Obtain the data y(n) at the completion of the trial. Step 2: Run the Gibbs sampler to generate a Gibbs sample {θ(m), m = 1, 2, …, M} of size M from the fitting posterior distribution π(f)(θ|y(n), yc0). Step 3: Compute f via (5.1). Step 4: Declare the trial a success if f ≥ γ.
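The final-analysis steps above can be sketched in the conjugate Bernoulli case, where independent beta fitting priors make the posterior available in closed form and direct posterior draws stand in for the Gibbs sample {θ(m)} of Section 5. This is an illustrative sketch under those assumptions, not the paper's implementation; the margin delta = 0.044 and the trial counts below are hypothetical.

```python
# Final analysis of a Bernoulli non-inferiority trial: estimate
# f = P(pt - pc < delta | data) from posterior draws and declare
# success if the estimate reaches the design threshold gamma (Step 4).
import numpy as np

rng = np.random.default_rng(7)

def trial_success(yt, nt, yc, nc, delta, gamma, n_draws=100_000):
    # Steps 2-3: posterior draws under independent beta(1, 1) fitting
    # priors play the role of the Gibbs sample from the fitting posterior.
    pt = rng.beta(1 + yt, 1 + nt - yt, n_draws)
    pc = rng.beta(1 + yc, 1 + nc - yc, n_draws)
    f_hat = np.mean(pt - pc < delta)  # Monte Carlo estimate of f
    # Step 4: success if the posterior probability reaches gamma.
    return f_hat, bool(f_hat >= gamma)

f_hat, success = trial_success(yt=72, nt=900, yc=25, nc=300,
                               delta=0.044, gamma=0.95)
```

Per Step 0, the same γ and fitting prior used at the design stage must be reused here; changing either after seeing the data would invalidate the operating characteristics established at design time.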

Acknowledgments

Dr. Chen is a statistical consultant for Boston Scientific Corporation. This research was partially supported by Boston Scientific Corporation. The conclusions in this paper are entirely those of the authors and do not necessarily represent the views of Boston Scientific Corporation. No conflict of interest exists among the authors. In addition, Dr. Chen and Dr. Ibrahim’s research was partially supported by NIH grants #GM 70335 and #CA 74015.

References

  1. Adcock CJ. Sample size determination: a review. The Statistician. 1997;46:261–283.
  2. Allocco DJ, Cannon LA, Britt A, Heil JE, Nersesov A, Wehrenberg S, Dawkins KD, Kereiakes DJ. A prospective evaluation of the safety and efficacy of the TAXUS Element paclitaxel-eluting coronary stent system for the treatment of de novo coronary artery lesions: design and statistical methods of the PERSEUS clinical program. Trials. 2010;11:1. doi: 10.1186/1745-6215-11-1. http://www.trialsjournal.com/content/11/1/1.
  3. Chen M-H, Ibrahim JG. The relationship between the power prior and hierarchical models. Bayesian Analysis. 2006;1:551–574.
  4. Chen M-H, Shao Q-M, Ibrahim JG. Monte Carlo Methods in Bayesian Computation. New York: Springer-Verlag; 2000.
  5. D’Agostino RB Sr, Massaro JM, Sullivan LM. Non-inferiority trials: design concepts and issues – the encounters of academic consultants in statistics. Statistics in Medicine. 2003;22:169–186. doi: 10.1002/sim.1425.
  6. De Santis F. Using historical data for Bayesian sample size determination. Journal of the Royal Statistical Society, Series A. 2007;170:95–113.
  7. De Santis F. Power priors and their use in clinical trials. The American Statistician. 2006;60:122–129.
  8. Duan Y, Ye K, Smith EP. Evaluating water quality using power priors to incorporate historical information. Environmetrics. 2006;17:95–106.
  9. Farrington CP, Manning G. Test statistics and sample size formulae for comparative binomial trials with null hypothesis of non-zero risk difference or non-unity relative risk. Statistics in Medicine. 1990;9:1447–1454. doi: 10.1002/sim.4780091208.
  10. Fleming TR. Current issues in non-inferiority trials. Statistics in Medicine. 2008;27:317–332. doi: 10.1002/sim.2855.
  11. Hintze J. PASS 2008. Kaysville, Utah: NCSS, LLC; 2008. www.ncss.com.
  12. Hobbs BP, Carlin BP, Mandrekar S, Sargent D. Hierarchical commensurate prior models for adaptive incorporation of historical information in clinical trials. Technical Report 2009-017. Division of Biostatistics, University of Minnesota; 2009.
  13. Hung HMJ, Wang SJ, O’Neill RT. A regulatory perspective on choice of margin and statistical inference issue in non-inferiority trials. Biometrical Journal. 2005;47:28–36. doi: 10.1002/bimj.200410084.
  14. Hung HMJ, Wang SJ, Tsong Y, Lawrence J, O’Neill RT. Some fundamental issues with non-inferiority testing in active controlled trials. Statistics in Medicine. 2003;22:213–225. doi: 10.1002/sim.1315.
  15. Ibrahim JG, Chen M-H. Power prior distributions for regression models. Statistical Science. 2000;15:46–60.
  16. Inoue LYT, Berry DA, Parmigiani G. Relationship between Bayesian and frequentist sample size determination. The American Statistician. 2005;59:79–87.
  17. Joseph L, Bélisle P. Bayesian sample size determination for normal means and differences between normal means. The Statistician. 1997;46:209–226.
  18. Joseph L, Wolfson DB, Du Berger R. Sample size calculations for binomial proportions via highest posterior density intervals. The Statistician. 1995;44:143–154.
  19. Katsis A, Toman B. Bayesian sample size calculations for binomial experiments. Journal of Statistical Planning and Inference. 1999;81:349–362.
  20. Kieser M, Friede T. Planning and analysis of three-arm non-inferiority trials with binary endpoints. Statistics in Medicine. 2007;26:253–273. doi: 10.1002/sim.2543.
  21. Lam Y, Lam CV. Bayesian double-sampling plans with normal distributions. The Statistician. 1997;46:193–207.
  22. Lindley DV. The choice of sample size. The Statistician. 1997;46:129–138.
  23. M’Lan CE, Joseph L, Wolfson DB. Bayesian sample size determination for binomial proportions. Bayesian Analysis. 2008;3:269–296.
  24. M’Lan CE, Joseph L, Wolfson DB. Bayesian sample size determination for case-control studies. Journal of the American Statistical Association. 2006;101:760–772.
  25. Neuenschwander B, Branson M, Spiegelhalter DJ. A note on the power prior. Statistics in Medicine. 2009;28:3562–3566. doi: 10.1002/sim.3722.
  26. Pham-Gia T. On Bayesian analysis, Bayesian decision theory and the sample size problem. The Statistician. 1997;46:139–144.
  27. Rahme E, Joseph L. Exact sample size determination for binomial experiments. Journal of Statistical Planning and Inference. 1998;66:83–93.
  28. Rothmann M, Li N, Chen G, Chi GYH, Temple R, Tsou HH. Design and analysis of non-inferiority mortality trials in oncology. Statistics in Medicine. 2003;22:239–264. doi: 10.1002/sim.1400.
  29. Rubin DB, Stern HS. Sample size determination using posterior predictive distributions. Sankhyâ, Series B. 1998;60:161–175.
  30. Simon R. Bayesian design and analysis of active control clinical trials. Biometrics. 1999;55:484–487. doi: 10.1111/j.0006-341x.1999.00484.x.
  31. Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian Approaches to Clinical Trials and Health-Care Evaluation. New York: Wiley; 2004.
  32. Stone GW, Ellis SG, Cannon L, et al. Comparison of a polymer-based paclitaxel-eluting stent with a bare metal stent in patients with complex coronary artery disease: a randomized controlled trial. Journal of the American Medical Association. 2005;294:1215–1223. doi: 10.1001/jama.294.10.1215.
  33. Stone GW, Ellis SG, Cox DA, et al. A polymer-based, paclitaxel-eluting stent in patients with coronary artery disease. The New England Journal of Medicine. 2004;350:221–231. doi: 10.1056/NEJMoa032441.
  34. Wang F, Gelfand AE. A simulation-based approach to Bayesian sample size determination for performance under a given model and for separating models. Statistical Science. 2002;17:193–208.
  35. Weiss R. Bayesian sample size calculations for hypothesis testing. The Statistician. 1997;46:185–191.
