Author manuscript; available in PMC: 2020 Apr 1.
Published in final edited form as: Comput Stat Data Anal. 2018 Sep 15;132:70–83. doi: 10.1016/j.csda.2018.08.018

Non-inferiority Testing for Risk Ratio, Odds Ratio and Number Needed to Treat in Three-arm Trial

Shrabanti Chowdhury 2, Ram C Tiwari 3, Samiran Ghosh 1,2,
PMCID: PMC6867690  NIHMSID: NIHMS1506867  PMID: 31749512

Abstract

Three-arm non-inferiority (NI) trials that include an experimental treatment, an active reference treatment, and a placebo, and in which the outcome of interest is binary, are considered. While the risk difference (RD) is the most common and best explored functional form for testing efficacy (or effectiveness), a recent FDA guideline suggests other measures, such as relative risk (RR), odds ratio (OR) and number needed to treat (NNT), among others, on the basis of which NI can be claimed for a binary outcome. Developing tests based on these different functions of a binary outcome is challenging, because the construction and interpretation of the NI margin for such functions are non-trivial extensions of the RD-based approach. A Frequentist test based on the traditional fraction margin approach for RR, OR and NNT is proposed first. Furthermore, a conditional testing approach is developed by incorporating the assay sensitivity (AS) condition directly into NI testing. A detailed discussion of sample size/power calculation is also put forward, which could be readily used while designing such trials in practice. A clinical trial data set is reanalyzed to demonstrate the presented approach.

Keywords: Assay Sensitivity, Binary Outcome, Fraction Margin, Non-inferiority Margin, Odds/Risk Ratio/NNT, Three-arm Trial

1. Introduction

With the steady improvements in health care technologies, standard of care, and clinical outcomes, the incremental benefits of newly developed interventions may be only marginal over existing treatments. However, in the presence of established treatments/therapies, placebo-controlled Randomized Control Trials (RCTs) are neither ethical nor clinically justified. An active-controlled NI trial is an attractive alternative in such situations, particularly when a slightly less efficacious treatment may be preferable for a group of patients in view of lower toxicity, less intensive side effects, ease of delivery and other less incapacitating factors. NI trials are intended to show whether the new intervention retains a substantial portion of the active control effect, dictated by a pre-specified margin, often termed the NI margin (δ). Such a margin must be prospectively defined and should be chosen to reflect the maximum acceptable extent of clinical non-inferiority of an experimental treatment. Further detailed discussion on the construction and desirable properties of the NI margin can be found in the regulatory guidelines (FDA (2016), ICHE9 (2009), ICHE10 (2009)) and references (e.g. Althunian et al. (2017), Schumi and Wittes (2011), Brown et al. (2006), Hung and Wang (2004)). NI trials may or may not include a placebo arm due to ethical reasons. Two-arm placebo-free NI trials make two important assumptions, regarding Assay Sensitivity (ICHE9 (2009), ICHE10 (2009)) and Constancy, and depend heavily on external validation (D’Agostino et al. (2003) and FDA (2016)) and several other limiting factors as specified in Kieser and Stucke (2016). To alleviate some of these issues, and if ethically acceptable and practically feasible, EMA (2005) recommends including a placebo arm in the current trial, resulting in a three-arm “gold-standard” design that gives greater confidence concerning AS and less concern related to external validity.

For three-arm trial in the Frequentist setup, Pigeot et al. (2003) first proposed the fraction margin approach, where NI margin is adaptively formulated as the pre-specified negative fraction of the unknown effect size of the reference treatment over placebo in the current three-arm trial. Kieser and Friede (2007) extended this approach for the binary outcome for risk difference (RD). While RD is the simplest functional form for binary outcomes, as mentioned in the recent FDA guidance (FDA (2016), Page 24) there are other functionals (e.g., risk ratio (RR), odds ratio (OR), number needed to treat (NNT), risk reduction etc.) which could also be used to test the treatment effect (Hashemi et al., 1997) and claim NI. Under the NI setup, there exists published work for odds ratio using Frequentist’s approach for two-arm trial, see for example Hilton (2010) and Rousson and Seifert (2008), but no work exists for three-arm trial. Also, to the best of our knowledge for RR and NNT (Keefe et al., 2013) type functional form, no work on NI testing exists for either two or three-arm trial. This motivates us to introduce such methods and develop NI test procedure under Frequentist approach in this article. Apart from extending popular tests based on fraction margin approach for such functionals, in this paper we also propose a new approach based on conditional principle which directly employs the AS condition under Frequentist setup. This approach shows additional gain in sample size to achieve a desired level of power under certain situations. Extensive tables are calculated for sample size for all three types of functionals.

The rest of the article is organized as follows. In Section 2, we give the NI hypothesis and the details of the NI margin, and show the non-uniqueness of the NI margin for different functionals. In Section 3, we discuss the existing method and propose a conditional Frequentist’s method for testing NI. In Section 4, we discuss the power and sample size calculation for the three functionals. In Section 5, we apply our proposed methods to a published clinical trial data set. We conclude the article with a discussion in Section 6. For brevity, all proofs are provided in the supplementary file.

2. Non-inferiority Hypothesis Testing Setup

For a three-arm trial, fraction margin approach (Pigeot et al. (2003), Kieser and Friede (2007)) is popularly used for testing NI hypothesis and finding the corresponding decision rule. We begin our illustration borrowing the notations from Kieser and Friede (2007). Denote the primary end-points from the Placebo (P), Reference (R) and the Experimental (E) treatment in the current trial by XP, XR and XE respectively, each following Bin (nl, πl), where πl is the probability of success and nl is the sample size for the lth arm, l ∈ {P, R, E}. Without loss of generality, we assume that higher response probabilities indicate greater treatment benefits. Gamalo et al. (2011) used the two-arm fixed margin approach for NI testing considering the RD as the function of interest. Kieser and Friede (2007) formulated the three-arm NI hypothesis for binary outcome under fraction margin approach, where NI hypothesis for RD in terms of NI margin δ is given by

$$H_0: \pi_E - \pi_R \le \delta \quad \text{vs.} \quad H_1: \pi_E - \pi_R > \delta. \qquad (2.1)$$

In the fraction margin approach, the construction of δ (< 0) can be mathematically expressed as δ = f(πR − πP), where f ∈ [−1, 0] is a negative fraction, assuming that the condition of assay sensitivity, that is, πR − πP > 0, holds. Figure 1(a) shows the NI region in the difference scale, which lies to the right of πR + δ.

Figure 1:


Three-arm NI Trial for (a) RD Margin with δ, (b) RR Margin 1 with δ1 and (c) ϵ-substantial non-inferiority for NNT

Before discussing the NI testing for a three-arm NI trial in terms of (1) RR, (2) OR and (3) NNT, we first reformulate the NI hypothesis using a general functional form as

$$H_0: \psi(\pi_E,\pi_R) \le g(\delta) \quad \text{vs.} \quad H_1: \psi(\pi_E,\pi_R) > g(\delta), \qquad (2.2)$$

where the decision boundary g(δ) is some function of δ, such that |δ| ∈ [0, 1], which denotes an appropriate portion of the unknown effect size of the active control over placebo. The RD hypothesis expressed in equation (2.1) can also be seen as a special case of the above. For example, consider ψ(πE, πR) = πE − πR; then the boundary is g(δ) = δ, which also happens to be the same as the NI margin itself. In the RR (or OR) scale, one can choose g(δ) = 1 + δ, implying the NI margin is δ (< 0), which we term Margin 1. Choosing a too restrictive δ, that is, one close to zero, would require a large number of subjects to claim NI, whereas a too loose choice of δ may potentially approve a substantially inferior drug. In the RR (or OR) scale we construct δ = δ1 = f(1 − ψ(πP, πR)) (< 0), where f ∈ [−1, 0] as in the RD case. Using this expression for δ1, we can write the hypothesis in (2.2) as

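$$H_{01}: \psi(\pi_E,\pi_R) \le 1 + f\,\bigl(1 - \psi(\pi_P,\pi_R)\bigr) \quad \text{vs.} \quad H_{11}: \psi(\pi_E,\pi_R) > 1 + f\,\bigl(1 - \psi(\pi_P,\pi_R)\bigr). \qquad (2.3)$$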

In practice, clinical considerations should drive the choice of f, and some reasonable values are −1/2, −1/3 and −1/5. For all tests (e.g. RD, RR, OR etc.), choosing f = 0 implies that the NI margin is zero, hence the hypotheses in (2.1) and (2.3) become the superiority test of πE over πR, while for f = −1 the active control loses its practical significance over placebo, hence the test reduces to the simple superiority test of πE over πP. This can be easily checked for all functional forms considered in this paper. Note that the construction of a margin must satisfy these two boundary conditions. However, there may be other possible mathematical forms that adhere to these, implying that such a margin may not be unique. To elucidate this fact, one can also formulate the NI hypothesis by taking g(δ) = δ and constructing $\delta = \delta_2 = (\psi(\pi_P,\pi_R))^{-f}$ (> 0), where f ∈ [−1, 0] (see Wangge et al. (2013)). In this case the NI margin is 1 − g(δ) = 1 − δ2, which we term Margin 2. Thus the NI hypothesis under Margin 2 becomes

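$$H_{02}: \psi(\pi_E,\pi_R) \le \bigl(\psi(\pi_P,\pi_R)\bigr)^{-f} \quad \text{vs.} \quad H_{12}: \psi(\pi_E,\pi_R) > \bigl(\psi(\pi_P,\pi_R)\bigr)^{-f}. \qquad (2.4)$$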

The two NI margins satisfy both boundary conditions, for f = 0 and f = −1, but lead to two slightly different NI tests. Next, we discuss the specific cases for RR, OR and NNT.
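The two boundary conditions can be checked numerically. The following is a minimal sketch, assuming the Margin 1 boundary 1 + f(1 − ψ(πP, πR)) and the Margin 2 boundary ψ(πP, πR) raised to the power −f; the function names and the response rates are illustrative and not taken from the paper.

```python
def margin1_boundary(f, psi_pr):
    """Decision boundary 1 + f * (1 - psi(pi_P, pi_R)) of Margin 1."""
    return 1.0 + f * (1.0 - psi_pr)


def margin2_boundary(f, psi_pr):
    """Decision boundary psi(pi_P, pi_R) ** (-f) of Margin 2."""
    return psi_pr ** (-f)


# Illustrative (hypothetical) response rates, RR scale: psi(pi_P, pi_R) = pi_P / pi_R.
pi_r, pi_p = 0.7, 0.1
psi_pr = pi_p / pi_r

for f in (0.0, -0.2, -0.5, -1.0):
    # f = 0 gives boundary 1 (superiority over R);
    # f = -1 gives boundary pi_P / pi_R (superiority over P).
    print(f, margin1_boundary(f, psi_pr), margin2_boundary(f, psi_pr))
```

Both boundaries agree at the two end points and differ for intermediate f, which is the non-uniqueness described above.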

2.1. Risk Ratio

Margin 1 : For RR the function $\psi(\pi_E,\pi_R) = \pi_E/\pi_R$ and $\psi(\pi_P,\pi_R) = \pi_P/\pi_R$. Thus under Margin 1, the NI hypothesis testing in (2.3) becomes:

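$$H_{01}: \frac{\pi_E}{\pi_R} \le 1 + f\left(1 - \frac{\pi_P}{\pi_R}\right) \quad \text{vs.} \quad H_{11}: \frac{\pi_E}{\pi_R} > 1 + f\left(1 - \frac{\pi_P}{\pi_R}\right). \qquad (2.5)$$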

Margin 1, i.e. δ = δ1, is constructed as a fraction of the difference between unity and the ratio of the placebo to the active control (reference) treatment effect in the current three-arm trial.

As can be seen from Figure 1 (b), the NI region is directed to the right side of the point (1+δ1). Clearly from (2.5) we see that for f = 0 and f = −1, the test satisfies two boundary conditions. Now putting θ = 1 + f and after simplification the hypothesis in equation (2.5) can be written as

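$$H_{01}: \pi_E - \theta\pi_R - (1-\theta)\pi_P \le 0 \quad \text{vs.} \quad H_{11}: \pi_E - \theta\pi_R - (1-\theta)\pi_P > 0, \qquad (2.6)$$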

where θ is the pre-specified fraction of the effect of the reference drug relative to the placebo. The test drug would be non-inferior if its efficacy relative to placebo achieves at least θ×100% of the efficacy of the reference drug compared to placebo. Although different values of θ (θ ∈ [0, 1]) are chosen for different purposes, specifically for the NI testing of the new drug, θ is restricted in [0.5,1), thus making sure that the new drug retains at least 50% effect of the active control.

Margin 2 : For $\psi(\pi_E,\pi_R) = \pi_E/\pi_R$ and $\psi(\pi_P,\pi_R) = \pi_P/\pi_R$, the NI hypothesis testing in (2.4) under RR Margin 2 can be written by taking logarithms as

$$H_{02}: \log(\pi_E) - \log(\pi_R) \le -f\,[\log(\pi_P) - \log(\pi_R)] \quad \text{vs.} \quad H_{12}: \log(\pi_E) - \log(\pi_R) > -f\,[\log(\pi_P) - \log(\pi_R)]. \qquad (2.7)$$

Taking θ = 1+f as before, the following hypothesis can be written after some simplification

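$$H_{02}: \log(\pi_E) - \theta\log(\pi_R) - (1-\theta)\log(\pi_P) \le 0 \quad \text{vs.} \quad H_{12}: \log(\pi_E) - \theta\log(\pi_R) - (1-\theta)\log(\pi_P) > 0. \qquad (2.8)$$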

Compared to the Margin 1 hypothesis in (2.6), the Margin 2 hypothesis above represents testing in the logarithm of each proportion (i.e. πl, l ∈ {P, R, E}). The interpretation of θ in terms of effect retention in the log scale is a little more involved than for Margin 1. However, since all the Frequentist tests rely on an asymptotic normal approximation, the log-transformed version of Margin 2 is expected to perform better.

To compare the NI regions we plot the boundary g(δ) as a function of θ ∈ [0, 1] for test 1 and test 2 in Figure 2 (a) and observe that both are increasing functions of θ. The areas above the two curves are the respective NI regions, which are bounded within the unit square. The vertical line corresponds to θ = 0.5 and the region to its right corresponds to the values of θ chosen for NI testing. For all values of θ, the NI region for NI hypothesis testing 2 (2.8) is bigger than that for NI testing 1 (2.6), implying that NI testing 2 is more powerful than NI testing 1. This is also depicted in Figure 3 (a), where we plot the power curves for θ = 0.8 and a total sample size of N = 300 under equal allocation, keeping πR and πP fixed at 0.7 and 0.1 respectively. Details of the derivation of the power curves are given later in Section 3. The vertical lines represent the values of πE under the null hypothesis for the respective margins.
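To make the comparison concrete, the sketch below computes the value of πE on the null boundary under each RR margin, using the boundary of (2.6) (a weighted arithmetic mean of πR and πP) and the boundary of (2.8) (the corresponding weighted geometric mean). The function names are hypothetical; the inputs are the setting quoted for Figure 3 (a).

```python
import math


def rr_null_boundary_margin1(theta, pi_r, pi_p):
    # Boundary of (2.6): pi_E = theta * pi_R + (1 - theta) * pi_P
    return theta * pi_r + (1.0 - theta) * pi_p


def rr_null_boundary_margin2(theta, pi_r, pi_p):
    # Boundary of (2.8): log(pi_E) = theta * log(pi_R) + (1 - theta) * log(pi_P)
    return math.exp(theta * math.log(pi_r) + (1.0 - theta) * math.log(pi_p))


pi_r, pi_p, theta = 0.7, 0.1, 0.8
b1 = rr_null_boundary_margin1(theta, pi_r, pi_p)  # about 0.58
b2 = rr_null_boundary_margin2(theta, pi_r, pi_p)  # about 0.474
print(b1, b2)
```

Because a weighted geometric mean never exceeds the weighted arithmetic mean with the same weights, the Margin 2 boundary lies at or below the Margin 1 boundary for every θ ∈ [0, 1], which is why its NI region is the larger one.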

Figure 2:


Comparison of NI regions for RR tests in (a) and OR tests in (b).

Figure 3:


Comparison of power curves for two margins of RR in (a) and OR in (b).

Remark 1: Since Margin 2 is formulated in the logarithm scale, the same value of θ will have a different impact in terms of margin width, thus further affecting the resulting NI test. One can show that under the same preservation level θ (or loss-of-effect f) of E and for a fixed control effect in the RR scale ($\pi_R/\pi_P$), the ratio $\pi_E/\pi_P$ has to exceed a smaller quantity under Margin 2 than under Margin 1 for E to be non-inferior. This implies that the NI test based on Margin 2 gives a slightly more relaxed margin than the test based on Margin 1, without compromising θ. Equivalently, if we fix the NI margin, that is, the quantity that $\pi_E/\pi_P$ needs to exceed for E to be non-inferior, then θ under Margin 2 has to be larger than that under Margin 1 to achieve this. Denoting θ by θ1 under Margin 1 and by θ2 under Margin 2, we can express θ1 as a function of θ2 as $\theta_1 = \dfrac{(\pi_R/\pi_P)^{\theta_2} - 1}{\pi_R/\pi_P - 1}$. Although in this paper we calculate the power and sample size under Margin 2 across different θ for all functionals, the same will be true under Margin 1 for a slightly smaller θ, which can be readily obtained from the above relation between θ1 and θ2. For example, when πR = 0.7 and πP = 0.5, giving $\pi_R/\pi_P = 1.4$, and taking θ2 = 0.8, we obtain θ1 = 0.772.
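The conversion between the two preservation levels is a one-line computation. The sketch below assumes the relation θ1 = ((πR/πP)^θ2 − 1) / (πR/πP − 1), obtained by equating the two null boundaries for πE/πP; the function name is hypothetical.

```python
def theta1_from_theta2(theta2, ratio_rp):
    """Margin 1 preservation level matching a Margin 2 level theta2.

    ratio_rp is the control effect pi_R / pi_P on the RR scale (must differ from 1).
    """
    return (ratio_rp ** theta2 - 1.0) / (ratio_rp - 1.0)


# Worked example from Remark 1: pi_R = 0.7, pi_P = 0.5, theta2 = 0.8.
theta1 = theta1_from_theta2(0.8, 0.7 / 0.5)
print(round(theta1, 3))  # 0.772
```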

Remark 2: Note that when the function of interest is risk reduction, the function ψ(πE, πR) takes the form $1 - \pi_E/\pi_R$. From (2.5) we have $H_{01}: 1 - \dfrac{\pi_E}{\pi_R} \ge -f\left(1 - \dfrac{\pi_P}{\pi_R}\right)$ vs. $H_{11}: 1 - \dfrac{\pi_E}{\pi_R} < -f\left(1 - \dfrac{\pi_P}{\pi_R}\right)$. Hence the NI hypothesis and test procedures developed for risk ratio are identical to those for risk reduction, and the latter does not need any separate derivation. Risk reduction is another possible functional form mentioned in the FDA guidance (FDA, 2016).

2.2. Odds Ratio

Margin 1 : For OR the function $\psi(\pi_E,\pi_R) = \dfrac{\pi_E/(1-\pi_E)}{\pi_R/(1-\pi_R)}$, and similarly one can define ψ(πP, πR). Thus, under Margin 1, the NI testing in (2.3) becomes the following:

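$$H_{01}: \frac{\pi_E/(1-\pi_E)}{\pi_R/(1-\pi_R)} \le 1 + f\left(1 - \frac{\pi_P/(1-\pi_P)}{\pi_R/(1-\pi_R)}\right) \quad \text{vs.} \quad H_{11}: \frac{\pi_E/(1-\pi_E)}{\pi_R/(1-\pi_R)} > 1 + f\left(1 - \frac{\pi_P/(1-\pi_P)}{\pi_R/(1-\pi_R)}\right). \qquad (2.9)$$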

Clearly from (2.9) we see that for f = 0 the margin δ1 becomes 0, and hence the above test will be a superiority test of the experimental treatment (E) over the control (R) in the current trial, since $\dfrac{\pi_E}{1-\pi_E} \le \dfrac{\pi_R}{1-\pi_R} \iff \pi_E \le \pi_R$. Again, for f = −1 the test (2.9) becomes the simple superiority test of πE over πP. Now putting θ = 1 + f, after some simplification the above test becomes

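$$H_{01}: \frac{\pi_E}{1-\pi_E} - \theta\frac{\pi_R}{1-\pi_R} - (1-\theta)\frac{\pi_P}{1-\pi_P} \le 0 \quad \text{vs.} \quad H_{11}: \frac{\pi_E}{1-\pi_E} - \theta\frac{\pi_R}{1-\pi_R} - (1-\theta)\frac{\pi_P}{1-\pi_P} > 0, \qquad (2.10)$$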

where θ holds the similar interpretation as described in the context of RR before.

Margin 2 : The NI hypothesis testing in (2.4), under Margin 2, can be written by taking logarithm as

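$$H_{02}: \log\!\left(\frac{\pi_E}{1-\pi_E}\right) - \log\!\left(\frac{\pi_R}{1-\pi_R}\right) \le -f\left(\log\!\left(\frac{\pi_P}{1-\pi_P}\right) - \log\!\left(\frac{\pi_R}{1-\pi_R}\right)\right) \quad \text{vs.} \quad H_{12}: \log\!\left(\frac{\pi_E}{1-\pi_E}\right) - \log\!\left(\frac{\pi_R}{1-\pi_R}\right) > -f\left(\log\!\left(\frac{\pi_P}{1-\pi_P}\right) - \log\!\left(\frac{\pi_R}{1-\pi_R}\right)\right). \qquad (2.11)$$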

Taking θ = 1 + f, the following test can be obtained from (2.11) after some simplification

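$$H_{02}: \log\!\left(\frac{\pi_E}{1-\pi_E}\right) - \theta\log\!\left(\frac{\pi_R}{1-\pi_R}\right) - (1-\theta)\log\!\left(\frac{\pi_P}{1-\pi_P}\right) \le 0 \quad \text{vs.} \quad H_{12}: \log\!\left(\frac{\pi_E}{1-\pi_E}\right) - \theta\log\!\left(\frac{\pi_R}{1-\pi_R}\right) - (1-\theta)\log\!\left(\frac{\pi_P}{1-\pi_P}\right) > 0. \qquad (2.12)$$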

Similar to RR, we plot the NI regions and the power curves for OR under the two tests resulting from the respective NI margins in Figure 2 (b) and Figure 3 (b) respectively. We again observe that test 2 is more powerful than test 1 for OR, since the former gives a more relaxed margin than the latter under fixed θ and fixed control effect in the OR scale. The relation between the two preservation levels under the respective margins in the OR scale becomes $\theta_1 = \dfrac{\left(\dfrac{\pi_R/(1-\pi_R)}{\pi_P/(1-\pi_P)}\right)^{\theta_2} - 1}{\dfrac{\pi_R/(1-\pi_R)}{\pi_P/(1-\pi_P)} - 1}$. Note that, as in the RR case, the logarithm transformation makes the data conform more closely to the Normal distribution, giving better asymptotic performance.

2.3. NNT

As discussed for RD the NI hypothesis in (2.1) and (2.2) will be of the following form:

$$H_0: \pi_E - (\pi_R + \delta) \le 0 \quad \text{vs.} \quad H_1: \pi_E - (\pi_R + \delta) > 0, \qquad (2.13)$$

where the boundary g(δ) = δ is constructed as the negative fraction of the unknown difference between the control and placebo in the current trial (Kieser and Friede, 2007). Since NNT is the inverse of RD, one would want the value of NNT to be as small as possible: the higher the value of NNT, the less effective the treatment (Keefe et al., 2013). The ideal case would be when all patients in the treatment arm show improvement while none in the control arm improve, giving an NNT of 1. For treatments such as a pain killer for acute pain, an effective treatment is expected to have an NNT between 2 and 5. In other situations, such as using aspirin after a heart attack, a much higher NNT (40+) would still indicate an effective therapy, while NNT can be as low as 1 for treating a sensitive bacterial infection with antibiotics (Cook and Sackett, 1995). In the context of NI testing for RD, the experimental intervention E is declared non-inferior to R if the treatment effect πE exceeds πR + δ (δ < 0) (Figure 1 (a)). Hence the NI hypothesis for NNT can be formulated from (2.13) by taking the reciprocal of both sides. However, to avoid taking the reciprocal of 0, one can formulate the NI hypothesis so that E is declared ϵ-substantially non-inferior to R if πE exceeds πR + δ + ϵ, where ϵ > 0 is a pre-chosen small constant. We write the ϵ-substantial NI hypothesis below:

$$H_0: \pi_E - (\pi_R + \delta) \le \epsilon \quad \text{vs.} \quad H_1: \pi_E - (\pi_R + \delta) > \epsilon. \qquad (2.14)$$

Figure 1 (c) shows the ϵ-substantial non-inferiority region which is to the right of the point πR + δ + ϵ. Now we formulate the NI hypothesis for NNT from (2.14) in the following with the condition that πE > πR + δ

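$$H_0: \frac{1}{\pi_E - (\pi_R + \delta)} \ge \frac{1}{\epsilon} = D \quad \text{vs.} \quad H_1: \frac{1}{\pi_E - (\pi_R + \delta)} < \frac{1}{\epsilon} = D, \qquad (2.15)$$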

where D is a positive integer denoting the additional number of patients required to declare NI of E over R (Cook and Sackett, 1995). Note that when πE < (πR + δ), NI testing of E over R does not have any practical meaning. After some simplification, putting θ = 1 + f, f ∈ [−1, 0] and δ = f(πR − πP), the NI hypothesis for NNT in (2.15) takes the form:

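$$H_0: \pi_E - \theta\pi_R - (1-\theta)\pi_P \le \epsilon \quad \text{vs.} \quad H_1: \pi_E - \theta\pi_R - (1-\theta)\pi_P > \epsilon. \qquad (2.16)$$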

This is exactly the same as the hypothesis for RD when ϵ → 0. The interpretation of θ and f remains the same as in the RD case.
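The simplification from the reciprocal (NNT) form to the linear form can be spelled out as follows. This is a short derivation under the stated condition πE > πR + δ, which makes the left-hand denominator positive, together with δ = f(πR − πP) and θ = 1 + f:

```latex
\begin{align*}
\frac{1}{\pi_E - (\pi_R + \delta)} < \frac{1}{\epsilon}
  &\iff \pi_E - \pi_R - \delta > \epsilon
  && (\text{both denominators positive}) \\
  &\iff \pi_E - \pi_R - (\theta - 1)(\pi_R - \pi_P) > \epsilon
  && (\delta = f(\pi_R - \pi_P),\ f = \theta - 1) \\
  &\iff \pi_E - \theta\,\pi_R - (1 - \theta)\,\pi_P > \epsilon .
\end{align*}
```

The null hypothesis follows in the same way with the inequalities reversed.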

3. Proposed Approach for NI Testing

For testing the NI hypothesis we follow the general guideline developed by Pigeot et al. (2003) and Kieser and Friede (2007). The MLE of the Binomial proportion πl is $\hat\pi_l = X_l/n_l$ and its variance is given by $\pi_l(1-\pi_l)/n_l$, l ∈ {E, R, P}. In the Frequentist’s approach the test statistics are based on the maximum likelihood estimators (MLE) of the parametric functions ψ(πE, πR) and ψ(πP, πR), and under asymptotic normality the statistic $\dfrac{\psi(\hat\pi_E,\hat\pi_R) - g(\hat\delta)}{\sqrt{\mathrm{Var}(\psi(\hat\pi_E,\hat\pi_R))}}$ is assumed to follow N(0, 1) under H0 in (2.2), where $g(\hat\delta)$ is some function of $\psi(\hat\pi_P,\hat\pi_R)$. Instead of the MLE one may also consider the restricted maximum likelihood estimator (RMLE) of πl subject to the constraint ψ(πE, πR) = g(δ). More specifically, for risk difference the test statistic for three-arm NI testing is $T/\sqrt{\sigma_T^2}$, where $T = \hat\pi_E - \theta\hat\pi_R - (1-\theta)\hat\pi_P$ and $\sigma_T^2 = \dfrac{\pi_E(1-\pi_E)}{n_E} + \theta^2\dfrac{\pi_R(1-\pi_R)}{n_R} + (1-\theta)^2\dfrac{\pi_P(1-\pi_P)}{n_P}$. Although Pigeot et al. (2003) explicitly mentioned that superiority of the reference over placebo (i.e. the AS condition) must be tested before one employs the fraction-margin approach for testing the NI hypothesis, in practice this key first step is often ignored. This may lead to some overestimation of the sample size. Moreover, the AS condition (either tested or assumed) is not used further in NI testing itself. In this article we first develop the traditional Frequentist’s approach of NI testing for RR, OR and NNT, closely following the marginal approach developed earlier for RD. It turns out as a common rule that the pretest of AS over NI is subordinated in the complete test procedure. As mentioned in Mielke and Munk (2009), the power of simply testing NI nearly coincides with the power of the complete (or joint) test procedure for commonly used alternatives. Thus, the focus of testing non-inferiority is almost always on the NI hypothesis itself, although the AS condition must be verified first in all practical examples. For all the examples in the current manuscript we always ensured first that the AS condition is met. However, since the two test statistics used for AS and NI, respectively, are correlated, we put forward that the NI hypothesis must be tested conditionally on the AS null hypothesis having already been rejected. Next, we propose the conditional approach of NI testing, thus incorporating the AS condition πR − πP > 0 in the Frequentist’s statistic itself.

In the following two subsections we give the Frequentist’s approach of NI testing based on the marginal and conditional approaches. We discuss NI testing for both Margin 1 and Margin 2. Note that all the Frequentist’s tests are asymptotic approximations, and the performance of such tests depends upon the accuracy of the transformation for all the functions. We note that one can also perform a score test or likelihood ratio test (Mielke, 2010; Tang et al., 2014) to conduct the NI testing. However, in our experience the performance of such tests is very close to that of the normal approximation. This is not completely surprising given the large sample size, which is often a general characteristic of many NI trials. Some additional results for the score test are included in the supplementary material. As suggested by one reviewer, these are all plausible alternatives and may enhance the performance of the test in certain situations.

3.1. Test Procedure and Sample Size: Marginal Approach

Rather than developing separate tests for RR, OR and NNT, we first write the NI hypotheses in (2.6), (2.8), (2.10), (2.12) and (2.15) in a general form as

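$$H_0: g(\pi_E) - \theta g(\pi_R) - (1-\theta) g(\pi_P) \le \epsilon \quad \text{vs.} \quad H_1: g(\pi_E) - \theta g(\pi_R) - (1-\theta) g(\pi_P) > \epsilon, \qquad (3.1)$$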

where ϵ ≥ 0. For RR and OR, ϵ = 0, while for NNT, ϵ > 0. For RR test 1, g(πl) = πl; for RR test 2, g(πl) = log(πl); for OR test 1, g(πl) = πl/(1 − πl); for OR test 2, g(πl) = log(πl/(1 − πl)); and for NNT, g(πl) = πl, l ∈ {E, R, P}. Now consider the test statistic $T = g(\hat\pi_E) - \theta g(\hat\pi_R) - (1-\theta) g(\hat\pi_P) - \epsilon$ for testing the NI hypothesis in (3.1), $\hat\pi_l$ being the MLE of πl, l ∈ {E, R, P}. The variance of T is $\sigma_T^2 = \mathrm{Var}(g(\hat\pi_E)) + \theta^2\,\mathrm{Var}(g(\hat\pi_R)) + (1-\theta)^2\,\mathrm{Var}(g(\hat\pi_P))$, where $\mathrm{Var}(g(\hat\pi_l)) \approx \mathrm{Var}(\hat\pi_l)\,(g'(\hat\pi_l))^2$ evaluated at $\hat\pi_l = \pi_l$, for l ∈ {E, R, P}. Writing $\sigma_l^2 = \mathrm{Var}(g(\hat\pi_l))$: for g(πl) = πl, $\sigma_l^2 = \pi_l(1-\pi_l)/n_l$ (RR test 1 and NNT); for g(πl) = log(πl), $\sigma_l^2 = (1-\pi_l)/(n_l\pi_l)$; for g(πl) = πl/(1 − πl), $\sigma_l^2 = \pi_l/(n_l(1-\pi_l)^3)$; and for g(πl) = log(πl/(1 − πl)), $\sigma_l^2 = 1/(n_l\pi_l(1-\pi_l))$. Under asymptotic normality we assume $Z = T/\sqrt{\sigma_T^2} \sim N(0,1)$ under H0, since μT = E(T) = g(πE) − θg(πR) − (1 − θ)g(πP) − ϵ = 0 on the boundary of H0. So the rejection criterion is Z > z1−α, where z1−α is the 100(1 − α)th percentile of the standard Normal distribution. The value of α is usually chosen to be 0.025.
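The marginal test can be sketched in a few lines. The block below is an illustration rather than the authors' implementation: it assumes the variances are estimated by plugging the unrestricted MLE into the delta-method expressions (the text also allows the RMLE), the function and dictionary names are hypothetical, and the counts are made-up data.

```python
import math
from statistics import NormalDist

# Each entry is (g, derivative of g) for one of the transformations used in (3.1).
TRANSFORMS = {
    "identity": (lambda p: p, lambda p: 1.0),                          # RR test 1, NNT
    "log": (math.log, lambda p: 1.0 / p),                              # RR test 2
    "odds": (lambda p: p / (1.0 - p), lambda p: 1.0 / (1.0 - p) ** 2),  # OR test 1
    "logit": (lambda p: math.log(p / (1.0 - p)),
              lambda p: 1.0 / (p * (1.0 - p))),                        # OR test 2
}


def marginal_ni_test(x, n, theta, scale, eps=0.0, alpha=0.025):
    """Wald-type marginal NI test.

    x, n: dicts keyed by 'E', 'R', 'P' holding successes and arm sizes.
    Returns (Z, reject). Estimated proportions of 0 or 1 are not handled.
    """
    g, dg = TRANSFORMS[scale]
    p = {arm: x[arm] / n[arm] for arm in "ERP"}
    weight = {"E": 1.0, "R": theta, "P": 1.0 - theta}
    t = g(p["E"]) - theta * g(p["R"]) - (1.0 - theta) * g(p["P"]) - eps
    # Delta method: Var(g(p_hat)) ~ p(1 - p)/n * g'(p)^2, with the MLE plugged in.
    var_t = sum(weight[a] ** 2 * p[a] * (1.0 - p[a]) / n[a] * dg(p[a]) ** 2
                for a in "ERP")
    z = t / math.sqrt(var_t)
    return z, z > NormalDist().inv_cdf(1.0 - alpha)


# Hypothetical counts: 81/90, 63/90 and 9/90 responders in arms E, R and P.
x = {"E": 81, "R": 63, "P": 9}
n = {"E": 90, "R": 90, "P": 90}
z, reject = marginal_ni_test(x, n, theta=0.8, scale="log")
print(round(z, 2), reject)
```

Passing `scale="identity"` with `eps > 0` gives the NNT version of the same statistic.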

3.1.1. Sample Size

For the sample size determination we first derive the power function of the above test procedure. We use the notation πl,1 to denote the proportion in the lth arm under the alternative hypothesis, and πl,0 to denote the same under H0. Borrowing notation from Kieser and Friede (2007), let ψ = g(πE) − θg(πR) − (1 − θ)g(πP) − ϵ and let ψ1 = g(πE,1) − θg(πR,1) − (1 − θ)g(πP,1) − ϵ for the alternative to be detected. The variance of the MLE $\hat\psi$ under H1 is $\mathrm{Var}_{H_1}(\hat\psi) = \sigma_{E,1}^2 + \theta^2\sigma_{R,1}^2 + (1-\theta)^2\sigma_{P,1}^2$, where $\sigma_{l,1}^2$ is $\sigma_l^2$ with πl replaced by πl,1, and the expressions of $\sigma_l^2$ for NNT and the different RR and OR tests are described above. Now, for simplicity, we express the sample sizes in the reference (nR) and experimental (nE) arms as multiples r1 and r2, respectively, of the sample size nP = n, say, in the placebo arm, such that nP : nR : nE = 1 : nR/nP : nE/nP = 1 : r1 : r2. Here r1 and r2 are known positive quantities that determine the allocation of the sample sizes in arms R and E, respectively, relative to arm P. The total sample size is thus N = n(1 + r1 + r2). Since $\mathrm{Var}_{H_1}(\hat\psi)$ involves nP, nR and nE, we express the latter two in terms of nP and denote $\tau_1^2 = n_P\,\mathrm{Var}_{H_1}(\hat\psi)$ under H1. Analogously, ψ0 and τ0 denote the same expressions as ψ1 and τ1 with πl,1 replaced by πl,0 under H0, satisfying the restriction g(πE,0) − θg(πR,0) − (1 − θ)g(πP,0) = ϵ, which implies ψ0 = 0. Under asymptotic normality, $Z_0 = \sqrt{n_P}\,\hat\psi/\tau_0$ and $Z_1 = \sqrt{n_P}\,(\hat\psi - \psi_1)/\tau_1$ are assumed to follow N(0, 1) under H0 and H1 in (3.1), respectively. Hence, the asymptotic expression of power is given by

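$$P_{H_1}(Z_0 \ge z_{1-\alpha}) = P_{H_1}\!\left(Z_1 > z_{1-\alpha}\frac{\tau_0}{\tau_1} - \sqrt{n_P}\,\frac{\psi_1}{\tau_1}\right) = 1 - \Phi\!\left(z_{1-\alpha}\frac{\tau_0}{\tau_1} - \sqrt{n_P}\,\frac{\psi_1}{\tau_1}\right), \qquad (3.2)$$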

where Φ(·) denotes the cumulative distribution function of N(0, 1). To achieve a power of 100(1 − β)%, the sample size nP can be obtained explicitly as

$$n_P = \left(z_{1-\alpha}\tau_0 + z_{1-\beta}\tau_1\right)^2 \frac{1}{(\psi_1)^2}. \qquad (3.3)$$
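As an illustration of (3.3), the sketch below computes the placebo-arm size for RR test 2 (g = log, ϵ = 0). It assumes that the null configuration keeps πR and πP at their design values and places πE on the boundary of H0; that choice, the function name and the default arguments are assumptions of this sketch, not statements from the paper.

```python
import math
from statistics import NormalDist


def n_placebo_rr_log(pi_e_alt, pi_r, pi_p, theta, r1=1.0, r2=1.0,
                     alpha=0.025, power=0.80):
    """Placebo-arm sample size from (3.3) for g = log and eps = 0."""

    def tau_sq(pi_e):
        # n_P * Var(psi_hat), with n_R = r1 * n_P and n_E = r2 * n_P
        return ((1.0 - pi_e) / (r2 * pi_e)
                + theta ** 2 * (1.0 - pi_r) / (r1 * pi_r)
                + (1.0 - theta) ** 2 * (1.0 - pi_p) / pi_p)

    log_null = theta * math.log(pi_r) + (1.0 - theta) * math.log(pi_p)
    pi_e_null = math.exp(log_null)           # pi_E on the boundary of H0
    psi1 = math.log(pi_e_alt) - log_null     # effect to be detected under H1
    z_a = NormalDist().inv_cdf(1.0 - alpha)
    z_b = NormalDist().inv_cdf(power)
    n_p = (z_a * math.sqrt(tau_sq(pi_e_null))
           + z_b * math.sqrt(tau_sq(pi_e_alt))) ** 2 / psi1 ** 2
    return math.ceil(n_p)


n_p = n_placebo_rr_log(0.9, 0.7, 0.1, theta=0.8)
print(n_p, 3 * n_p)  # placebo-arm and total size under 1:1:1 allocation
```

Under these assumptions the sketch gives nP = 27 (N = 81) for θ = 0.8, πE = 0.9, πR = 0.7, πP = 0.1 and equal allocation, which agrees with the marginal entry for that setting in Table 1.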

3.2. Test Procedure and Sample Size: Conditional Approach

We introduce our conditional approach for NI hypothesis testing given in (3.1) by incorporating the AS condition (i.e. πR > πP ). For finding the MLE we truncate the parameter space of (πE, πR, πP ) such that it belongs to {πE, πR, πP : πE ∈ [0, 1], πR ∈ [0, 1], πP ∈ [0, 1], πR > πP }. One may develop an LR-test based on the statistic

$$T = g(\hat\pi_E) - \theta g(\hat\pi_R) - (1-\theta) g(\hat\pi_P) - \epsilon \qquad (3.4)$$

under null hypothesis subject to the imposed condition (πR > πP ) via Wald-type test. Following Mutze et al. (2015) one can improve the convergence via the restricted maximum likelihood (RML) which requires solving under H0

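$$(\hat\pi_{E,RML}, \hat\pi_{R,RML}, \hat\pi_{P,RML}) = \underset{g(\pi_E) - \theta g(\pi_R) - (1-\theta) g(\pi_P) - \epsilon \,\le\, 0,\ \pi_R > \pi_P}{\arg\max}\ \log l(\pi_E,\pi_R,\pi_P), \qquad (3.5)$$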

where log l(πE, πR, πP) denotes the log-likelihood. This optimization problem can be solved numerically, but no closed-form expression is possible. One practical strategy to reduce the computational burden, often recommended in practice, is to work with the unrestricted MLE, $T_{ML} = g(\hat\pi_{E,ML}) - \theta g(\hat\pi_{R,ML}) - (1-\theta) g(\hat\pi_{P,ML}) - \epsilon$, while considering only the part restricted by $\hat\pi_{R,ML} > \hat\pi_{P,ML}$, that is, $T_{RML} \approx T_{ML}\, I[\hat\pi_{R,ML} - \hat\pi_{P,ML} > 0]$. This strategy (see Huang et al. (2011) and Kulldorff (1997)) has proved quite useful in many practical applications. Since working with a product of random variables is a little cumbersome, one can further show that $f(T_{RML}) \approx f(T_{ML} \mid \hat\pi_{R,ML} - \hat\pi_{P,ML} > 0) \times \Pr[\hat\pi_{R,ML} - \hat\pi_{P,ML} > 0]$. Since $0 \le \hat\pi_{R,ML}, \hat\pi_{P,ML} \le 1$ are i.i.d. random variables, it is easy to prove $\Pr[\hat\pi_{R,ML} - \hat\pi_{P,ML} > 0] = \tfrac{1}{2}$, which can be absorbed as a constant. Hence for all practical purposes one can consider the distribution of the test statistic $(T_{ML} \mid \hat\pi_{R,ML} > \hat\pi_{P,ML}) \equiv (g(\hat\pi_{E,ML}) - \theta g(\hat\pi_{R,ML}) - (1-\theta) g(\hat\pi_{P,ML}) - \epsilon \mid \hat\pi_{R,ML} > \hat\pi_{P,ML})$. For notational simplicity, from now onwards we denote the ML estimate $\hat\pi_{l,ML}$ by $\hat\pi_l$, l ∈ {E, R, P}. Note that for the specific forms of g(πl) defined above for RR, OR and NNT, g(πl) is monotone in πl, l ∈ {E, R, P}. Hence imposing the restriction πR > πP is equivalent to g(πR) > g(πP). This leads to the modified test statistic for NI testing: $W = (g(\hat\pi_E) - \theta g(\hat\pi_R) - (1-\theta) g(\hat\pi_P) - \epsilon \mid g(\hat\pi_R) > g(\hat\pi_P))$. We write W as (U − θV − ϵ | V > 0), where $U = g(\hat\pi_E) - g(\hat\pi_P)$ and $V = g(\hat\pi_R) - g(\hat\pi_P)$ are two correlated random variables. Under the asymptotic normality of W we have $\dfrac{W - \mu_w}{\sigma_w} \sim AN(0,1)$, where E[W] = μw and $V[W] = \sigma_w^2$.

Lemma 3.2.1 Under conditional normal approximation, the mean μw and variance $\sigma_w^2$ of $W = g(\hat\pi_E) - \theta g(\hat\pi_R) - (1-\theta) g(\hat\pi_P) - \epsilon \mid g(\hat\pi_R) - g(\hat\pi_P) > 0$, for ϵ ≥ 0, are given by

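$$\begin{aligned}
\mu_w &= \mu_U + \sigma_U\frac{\rho}{c}\phi(d) - \theta\left(\mu_V + \sigma_V\frac{1}{c}\phi(d)\right) - \epsilon,\\
\sigma_w^2 &= \sigma_U^2\left[1 + \frac{\rho^2}{c}\,d\,\phi(d) - \left(\frac{\rho}{c}\phi(d)\right)^2\right] + \theta^2\sigma_V^2\left[1 - \frac{\phi(d)}{c}\left(\frac{\phi(d)}{c} - d\right)\right]\\
&\quad - 2\theta\left[\sigma_U\sigma_V\frac{\rho}{c}\bigl(c + d\,\phi(d)\bigr) + \sigma_U\mu_V\frac{\rho}{c}\phi(d) + \sigma_V\mu_U\frac{1}{c}\phi(d) + \mu_U\mu_V - \left(\mu_U + \sigma_U\frac{\rho}{c}\phi(d)\right)\left(\mu_V + \sigma_V\frac{1}{c}\phi(d)\right)\right].
\end{aligned} \qquad (3.6)$$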

More specific values can be obtained for different tests as below,

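$$\begin{aligned}
&\text{RR test 1:} && \mu_U = \pi_E - \pi_P, \quad \mu_V = \pi_R - \pi_P, \quad \sigma_l^2 = \frac{\pi_l(1-\pi_l)}{n_l}, \quad l \in \{E,R,P\},\\
&\text{RR test 2:} && \mu_U = \log(\pi_E) - \log(\pi_P), \quad \mu_V = \log(\pi_R) - \log(\pi_P), \quad \sigma_l^2 = \frac{1-\pi_l}{n_l\pi_l}, \quad l \in \{E,R,P\},\\
&\text{OR test 1:} && \mu_U = \frac{\pi_E}{1-\pi_E} - \frac{\pi_P}{1-\pi_P}, \quad \mu_V = \frac{\pi_R}{1-\pi_R} - \frac{\pi_P}{1-\pi_P}, \quad \sigma_l^2 = \frac{\pi_l}{n_l(1-\pi_l)^3}, \quad l \in \{E,R,P\},\\
&\text{OR test 2:} && \mu_U = \log\!\left(\frac{\pi_E}{1-\pi_E}\right) - \log\!\left(\frac{\pi_P}{1-\pi_P}\right), \quad \mu_V = \log\!\left(\frac{\pi_R}{1-\pi_R}\right) - \log\!\left(\frac{\pi_P}{1-\pi_P}\right), \quad \sigma_l^2 = \frac{1}{n_l\pi_l(1-\pi_l)},\\
&&& \sigma_U^2 = \sigma_E^2 + \sigma_P^2, \quad \sigma_V^2 = \sigma_R^2 + \sigma_P^2, \quad \rho = \frac{\mathrm{Var}(\hat\pi_P)}{\sqrt{\mathrm{Var}(U)\,\mathrm{Var}(V)}} = \frac{\sigma_P^2}{\sqrt{\sigma_U^2\sigma_V^2}}, \quad d = -\frac{\mu_V}{\sigma_V}, \quad c = 1 - \Phi(d).
\end{aligned}$$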

Proof: See A.1 in the file of supplementary material.

As before, we denote πE under H0 by $\pi_E^{null}$ and, as a point alternative, πE under H1 by $\pi_E^{alt}$. Under H0, the expression of $\pi_E^{null}$ can be obtained by solving $g(\pi_E^{null}) = g(\pi_P) + \theta\,(g(\pi_R) - g(\pi_P)) + \epsilon$. Under H1, $\pi_E^{alt}$ satisfies $g(\pi_E^{alt}) - \theta g(\pi_R) - (1-\theta) g(\pi_P) > \epsilon \iff (g(\pi_E^{alt}) - g(\pi_P)) > \theta\,(g(\pi_R) - g(\pi_P)) + \epsilon$. Note that πE is involved in the expressions of the mean and variance of W. Hence under asymptotic normality, we have the following

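$$\frac{W - \mu_w^{null}}{\sigma_w^{null}} \sim AN(0,1) \ \text{under } H_0, \quad \text{and} \quad \frac{W - \mu_w^{alt}}{\sigma_w^{alt}} \sim AN(0,1) \ \text{under } H_1.$$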

The critical region of the test is given by W > k*, where k* is obtained by assuming a test of size α: $P_{H_0}(W > k^*) = \alpha$, implying $k^* = \mu_w^{null} + z_{1-\alpha}\,\sigma_w^{null}$, where z1−α is the 100(1 − α)th percentile point of the N(0, 1) distribution.

3.2.1. Sample Size

Using our proposed approach we can calculate the sample size for the assessment of NI to attain a desired power. The power of the test for a point alternative $\pi_E = \pi_E^{alt}$ is $P_{H_1}(W > k^*) = 1 - \Phi\!\left(\dfrac{k^* - \mu_w^{alt}}{\sigma_w^{alt}}\right)$. To obtain the power function of the test we fix πR, πP and θ and vary $\pi_E^{alt}$. The sample size nP = n (of arm P) is calculated from the following equation so that the power achieved is at least 100(1 − β)%:

$$P_{H_1}(W > k^*) \ge 1 - \beta \iff \Phi\!\left(\frac{k^* - \mu_w^{alt}}{\sigma_w^{alt}}\right) \le \beta. \qquad (3.7)$$

For example, setting β at 20%, that is, the power at 80%, n is determined from equation (3.7). In our sample size calculation we fix πR and πP and vary $\pi_E = \pi_E^{alt}$ under H1, satisfying $H_1: g(\pi_E^{alt}) - \theta g(\pi_R) - (1-\theta) g(\pi_P) > \epsilon$. Under H0, $\pi_E = \pi_E^{null}$ is obtained from $g(\pi_E^{null}) - \theta g(\pi_R) - (1-\theta) g(\pi_P) = \epsilon$. We obtain the power function by varying $\pi_E^{alt}$. Thus, we obtain nP from (3.7) for each $\pi_E^{alt}$, and then the sample sizes in the other arms, and hence the total sample size, as functions of the allocation ratios.
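Since (3.7) has no closed-form solution for nP, a simple search over nP is one way to solve it. The sketch below does this for RR test 2 (g = log, ϵ = 0) using truncated bivariate normal moments for W = U − θV given V > 0. Several points are assumptions of this sketch rather than statements from the paper: the truncation point is taken as d = −μV/σV, the conditional variance of V enters with weight θ², the null configuration keeps πR and πP fixed and moves πE to the H0 boundary, and all function names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def cond_moments(pi_e, pi_r, pi_p, theta, n_p, r1=1.0, r2=1.0):
    """Mean and variance of W = U - theta*V given V > 0, for g = log."""
    var_log = lambda pi, n: (1.0 - pi) / (n * pi)   # delta-method Var(log(pi_hat))
    s_e = var_log(pi_e, r2 * n_p)
    s_r = var_log(pi_r, r1 * n_p)
    s_p = var_log(pi_p, n_p)
    mu_u = math.log(pi_e) - math.log(pi_p)
    mu_v = math.log(pi_r) - math.log(pi_p)
    sd_u, sd_v = math.sqrt(s_e + s_p), math.sqrt(s_r + s_p)
    rho = s_p / (sd_u * sd_v)                # Cov(U, V) = Var(g(pi_P_hat))
    d = -mu_v / sd_v                         # standardised truncation point (assumed sign)
    c = 1.0 - STD.cdf(d)                     # P(V > 0)
    lam = STD.pdf(d) / c
    m_u = mu_u + sd_u * rho * lam            # E[U | V > 0]
    m_v = mu_v + sd_v * lam                  # E[V | V > 0]
    v_u = sd_u ** 2 * (1.0 + rho ** 2 * d * lam - (rho * lam) ** 2)
    v_v = sd_v ** 2 * (1.0 - lam * (lam - d))
    cov = (sd_u * sd_v * rho * (c + d * STD.pdf(d)) / c
           + sd_u * mu_v * rho * lam + sd_v * mu_u * lam
           + mu_u * mu_v - m_u * m_v)
    return m_u - theta * m_v, v_u + theta ** 2 * v_v - 2.0 * theta * cov


def cond_power(pi_e_alt, pi_r, pi_p, theta, n_p, alpha=0.025, **alloc):
    """Power 1 - Phi((k* - mu_alt) / sigma_alt) for a point alternative."""
    pi_e_null = math.exp(theta * math.log(pi_r) + (1.0 - theta) * math.log(pi_p))
    mu0, v0 = cond_moments(pi_e_null, pi_r, pi_p, theta, n_p, **alloc)
    mu1, v1 = cond_moments(pi_e_alt, pi_r, pi_p, theta, n_p, **alloc)
    k_star = mu0 + STD.inv_cdf(1.0 - alpha) * math.sqrt(v0)
    return 1.0 - STD.cdf((k_star - mu1) / math.sqrt(v1))


def cond_n_placebo(pi_e_alt, pi_r, pi_p, theta, power=0.80, n_max=100000, **alloc):
    """Smallest n_P whose conditional power reaches the target."""
    for n_p in range(2, n_max + 1):
        if cond_power(pi_e_alt, pi_r, pi_p, theta, n_p, **alloc) >= power:
            return n_p
    raise ValueError("no n_P up to n_max reaches the requested power")


print(cond_n_placebo(0.9, 0.6, 0.55, theta=0.8))
print(cond_n_placebo(0.9, 0.7, 0.1, theta=0.8))
```

When the reference effect over placebo is large relative to its standard error, c is close to 1 and the conditional moments reduce to the marginal ones, so the two approaches give nearly the same nP. When the effect is small (for example πR = 0.6, πP = 0.55), the conditioning reduces the variance and the search stops at a smaller nP than (3.3) gives, which is the direction of the differences reported in Table 1.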

4. Sample Size Tables for the Non-Inferiority Testing

Before going to the sample size calculation we generate the power curves under both the conditional and marginal approaches to get an idea about the operating characteristics of the proposed methods. In Figure 4, we plot the power curves corresponding to three different values of θ: 0.9, 0.8 and 0.7, for RR and OR. The power curves for NNT can be obtained similarly but are not shown. The three values of θ correspond to f = −0.1, −0.2 and −0.3 respectively, that is, to three choices of the NI margin. This implies that the experimental drug, in comparison to placebo, must achieve at least 90%, 80% and 70%, respectively, of the effect of the active control with respect to placebo in order to be non-inferior. From Figure 4, we observe that as θ decreases the power curve becomes steeper, which means that the proposed test is more powerful for smaller θ than for higher θ. This makes sense, as for smaller θ (or larger |f|) it is easier to declare NI of the experimental drug over the reference, since in that case the new drug has to preserve a smaller proportion of the effect of the control drug in the current trial in order to be non-inferior.

Figure 4:


Power curves for different θ under (a) RR Conditional, (b) RR marginal, (c) OR Conditional and (d) OR Marginal approaches keeping πR = 0.7 and πP = 0.1

Next we refer to Sections 3.1.1 and 3.2.1 for the sample size determination under our proposed marginal and conditional approaches, respectively. As discussed earlier, sample sizes in the placebo, reference and experimental arms are denoted by n, r1n and r2n respectively, with r1, r2 ≥ 1. To compute (nE, nR, nP), we consider three possible allocations for (P, R, E): (1 : 1 : 1), (1 : 2 : 2) and (1 : 2 : 3) of the total sample size N. Hence, for the allocation (1 : 1 : 1), r1 = r2 = 1; for (1 : 2 : 2), r1 = r2 = 2; and for (1 : 2 : 3), r1 = 2 and r2 = 3. The power expression of the proposed conditional approach does not give an explicit solution for nP, and hence an iterative process is needed. We determine the sample size under the two approaches for θ = 0.8 and 0.7 for (πR = 0.7, πP = 0.1) and (πR = 0.6, πP = 0.55). For NNT, since we are considering ϵ-substantial non-inferiority, we choose ϵ = 0.05, which is equivalent to treating an additional 20 patients in order to see the benefit of the experimental drug, that is, to declare NI of E over R. We present the sample size for RR in Table 1, for OR in Table 2 and for NNT in Table 3.

Table 1:

Sample Size for RR to Achieve a Power of 80% for θ = 0.8 and θ = 0.7, α = 0.025 and πE ∈ [0.65, 0.9] under Three Different Allocations. The simulated power (SimP) is also reported to show that the calculated sample size is adequate to guarantee 80% power, apart from minor numerical fluctuation.

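Columns: allocation (P R E), θ, πE, then (nP, N, SimP) for each of four settings in this order: πR = 0.7, πP = 0.1 Conditional; πR = 0.7, πP = 0.1 Marginal; πR = 0.6, πP = 0.55 Conditional; πR = 0.6, πP = 0.55 Marginal. The allocation and θ are printed only on the first row of each block.
P R E θ πE nP N SimP nP N SimP nP N SimP nP N SimP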
1 1 1 0.8 0.9 27 81 0.874 27 81 0.876 40 120 0.977 43 129 0.909
0.85 33 99 0.874 33 99 0.883 55 165 0.958 58 174 0.873
0.8 42 126 0.877 42 126 0.870 82 246 0.944 86 258 0.857
0.75 56 168 0.879 56 168 0.869 136 408 0.927 141 423 0.836
0.7 79 237 0.878 79 237 0.857 278 834 0.899 286 858 0.812
0.65 124 372 0.882 124 372 0.846 909 2727 0.877 915 2745 0.808

0.7 0.9 24 72 0.865 24 72 0.887 38 114 0.969 39 117 0.904
0.85 28 84 0.854 28 84 0.898 52 156 0.949 53 159 0.870
0.8 33 99 0.857 33 99 0.894 76 228 0.934 77 231 0.848
0.75 40 120 0.862 40 120 0.888 124 372 0.912 125 375 0.838
0.7 51 153 0.873 51 153 0.884 245 735 0.894 247 741 0.813
0.65 68 204 0.875 68 204 0.872 734 2202 0.871 737 2211 0.813

1 2 2 0.8 0.9 17 85 0.807 17 85 0.788 22 110 0.977 22 110 0.906
0.85 21 105 0.830 21 105 0.827 30 150 0.961 30 150 0.874
0.8 27 135 0.854 27 135 0.863 44 220 0.935 44 220 0.852
0.75 35 175 0.859 35 175 0.879 73 365 0.923 73 365 0.841
0.7 49 245 0.871 49 245 0.873 147 735 0.906 148 740 0.833
0.65 76 380 0.884 76 380 0.862 470 2350 0.876 471 2355 0.818

0.7 0.9 17 85 0.809 17 85 0.813 21 105 0.957 21 105 0.905
0.85 19 95 0.805 19 95 0.835 29 145 0.943 29 145 0.886
0.8 23 115 0.835 23 115 0.868 42 210 0.922 42 210 0.864
0.75 28 140 0.825 28 140 0.889 68 340 0.906 68 340 0.839
0.7 35 175 0.845 35 175 0.898 133 665 0.888 133 665 0.824
0.65 47 235 0.866 47 235 0.898 395 1975 0.877 395 1975 0.814

1 2 3 0.8 0.9 15 90 0.787 15 90 0.771 18 108 0.983 19 114 0.934
0.85 18 108 0.822 18 108 0.807 25 150 0.974 25 150 0.890
0.8 23 138 0.849 23 138 0.848 36 216 0.959 37 222 0.872
0.75 30 180 0.854 30 180 0.867 60 360 0.952 60 360 0.839
0.7 42 252 0.874 42 252 0.884 120 720 0.937 121 726 0.832
0.65 64 384 0.894 64 384 0.875 381 2286 0.914 382 2292 0.817

0.7 0.9 14 84 0.767 14 84 0.760 18 108 0.976 18 108 0.929
0.85 17 102 0.813 17 102 0.818 24 144 0.963 24 144 0.894
0.8 20 120 0.820 20 120 0.852 34 204 0.941 34 204 0.872
0.75 24 144 0.807 24 144 0.878 55 330 0.928 55 330 0.851
0.7 31 186 0.829 31 186 0.899 108 648 0.913 108 648 0.830
0.65 41 246 0.868 41 246 0.914 318 1908 0.897 318 1908 0.817

Table 2:

Sample Size for OR to Achieve a Power of 80% for θ = 0.8 and θ = 0.7, α = 0.025 and πE ∈ [0.65, 0.9] under Three Different Allocations. The simulated power (SimP) is also reported to show that the calculated sample size is adequate to guarantee 80% power, apart from minor numerical fluctuation.

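Columns: allocation (P R E), θ, πE, then (nP, N, SimP) for each of four settings in this order: πR = 0.7, πP = 0.1 Conditional; πR = 0.7, πP = 0.1 Marginal; πR = 0.6, πP = 0.55 Conditional; πR = 0.6, πP = 0.55 Marginal. The allocation and θ are printed only on the first row of each block.
P R E θ πE nP N SimP nP N SimP nP N SimP nP N SimP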
1 1 1 0.8 0.9 20 60 0.801 20 60 0.801 20 60 0.862 21 63 0.819
0.85 31 93 0.794 31 93 0.794 32 96 0.838 34 102 0.793
0.8 49 147 0.796 49 147 0.796 54 162 0.824 57 171 0.782
0.75 85 255 0.804 85 255 0.804 102 306 0.824 107 321 0.783
0.7 165 495 0.809 165 495 0.809 232 696 0.819 241 723 0.788
0.65 415 1245 0.805 415 1245 0.805 844 2532 0.809 853 2559 0.798

0.7 0.9 15 45 0.800 15 45 0.800 19 57 0.858 20 60 0.831
0.85 21 63 0.791 21 63 0.791 30 90 0.836 31 93 0.805
0.8 30 90 0.807 30 90 0.808 51 153 0.828 52 156 0.795
0.75 45 135 0.800 45 135 0.801 93 279 0.819 95 285 0.790
0.7 72 216 0.806 72 216 0.806 205 615 0.816 209 627 0.792
0.65 125 375 0.805 125 375 0.805 681 2043 0.806 686 2058 0.796

1 2 2 0.8 0.9 11 55 0.808 11 55 0.808 11 55 0.849 11 55 0.829
0.85 16 80 0.808 16 80 0.808 17 85 0.834 18 90 0.809
0.8 26 130 0.804 26 130 0.804 29 145 0.826 30 150 0.796
0.75 45 225 0.801 45 225 0.801 54 270 0.823 55 275 0.798
0.7 87 435 0.803 87 435 0.803 122 610 0.815 124 620 0.792
0.65 220 1100 0.806 220 1100 0.806 434 2170 0.801 437 2185 0.790

0.7 0.9 8 40 0.801 8 40 0.799 10 50 0.830 10 50 0.827
0.85 12 60 0.818 12 60 0.818 17 85 0.829 17 85 0.825
0.8 17 85 0.809 17 85 0.809 28 140 0.822 28 140 0.817
0.75 26 130 0.789 26 130 0.789 50 250 0.807 50 250 0.801
0.7 41 205 0.817 41 205 0.817 110 550 0.803 110 550 0.797
0.65 71 355 0.810 71 355 0.810 362 1810 0.799 362 1810 0.796

1 2 3 0.8 0.9 9 54 0.809 9 54 0.809 9 54 0.859 9 54 0.837
0.85 13 78 0.831 13 78 0.831 14 84 0.831 14 84 0.803
0.8 22 132 0.809 22 132 0.809 23 138 0.822 24 144 0.795
0.75 37 222 0.792 37 222 0.792 43 258 0.816 44 264 0.789
0.7 72 432 0.806 72 432 0.806 98 588 0.819 99 594 0.793
0.65 182 1092 0.804 182 1092 0.804 349 2094 0.808 352 2112 0.796

0.7 0.9 7 42 0.811 7 42 0.811 8 48 0.804 8 48 0.806
0.85 10 60 0.797 10 60 0.797 13 78 0.804 13 78 0.803
0.8 14 84 0.803 14 84 0.803 22 132 0.806 22 132 0.799
0.75 22 132 0.807 22 132 0.807 40 240 0.809 40 240 0.804
0.7 34 204 0.807 34 204 0.807 88 528 0.809 88 528 0.804
0.65 60 360 0.827 60 360 0.827 289 1734 0.801 289 1734 0.796

Table 3:

Sample Size for NNT to Achieve a Power of 80% for θ = 0.8 and θ = 0.7, ϵ = 0.05, α = 0.025 and πE ∈ [0.65, 0.9] under Three Different Allocations. The simulated power (SimP) is also reported to show that the calculated sample size is adequate to guarantee 80% power, apart from minor numerical fluctuation.

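Columns: allocation (P R E), θ, πE, then (nP, N, SimP) for each of four settings in this order: πR = 0.7, πP = 0.1 Conditional; πR = 0.7, πP = 0.1 Marginal; πR = 0.6, πP = 0.55 Conditional; πR = 0.6, πP = 0.55 Marginal. The allocation and θ are printed only on the first row of each block.
P R E θ πE nP N SimP nP N SimP nP N SimP nP N SimP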
1 1 1 0.8 0.9 35 105 0.809 35 105 0.809 38 114 0.858 41 123 0.829
0.85 55 165 0.820 55 165 0.820 61 183 0.844 65 195 0.820
0.8 95 285 0.812 95 285 0.812 109 327 0.833 115 345 0.804
0.75 195 585 0.812 195 585 0.812 238 714 0.823 247 741 0.807
0.7 584 1752 0.802 584 1752 0.802 837 2511 0.807 846 2538 0.800
0.65 7248 > 10^4 0.804 7248 > 10^4 0.804 > 10^4 > 10^4 - > 10^4 > 10^4 -

0.7 0.9 22 66 0.800 22 66 0.800 36 108 0.870 37 111 0.832
0.85 32 96 0.793 32 96 0.793 56 168 0.843 58 174 0.818
0.8 49 147 0.799 49 147 0.799 99 297 0.831 101 303 0.804
0.75 82 246 0.813 82 246 0.813 209 627 0.820 213 639 0.812
0.7 161 483 0.806 161 483 0.806 674 2022 0.805 679 2037 0.798
0.65 431 1293 0.799 431 1293 0.799 > 10^4 > 10^4 - > 10^4 > 10^4 -

1 2 2 0.8 0.9 18 90 0.813 18 90 0.813 21 105 0.866 21 105 0.825
0.85 28 140 0.800 28 140 0.800 33 165 0.852 34 170 0.832
0.8 48 240 0.808 48 240 0.808 58 290 0.831 59 295 0.811
0.75 99 495 0.813 99 495 0.813 126 630 0.823 127 635 0.807
0.7 295 1475 0.813 295 1475 0.813 432 2160 0.806 434 2170 0.795
0.65 3660 > 10^4 0.799 3660 > 10^4 0.799 > 10^4 > 10^4 - > 10^4 > 10^4 -

0.7 0.9 12 60 0.791 12 60 0.791 20 100 0.856 20 100 0.834
0.85 17 85 0.798 17 85 0.798 31 155 0.832 31 155 0.822
0.8 26 130 0.797 26 130 0.797 54 270 0.824 54 270 0.815
0.75 42 210 0.801 42 210 0.801 113 565 0.811 113 565 0.806
0.7 83 415 0.817 83 415 0.817 360 1800 0.807 360 1800 0.803
0.65 221 1105 0.805 221 1105 0.805 6848 > 10^4 0.799 6848 > 10^4 0.799

1 2 3 0.8 0.9 15 90 0.803 15 90 0.803 17 102 0.860 18 108 0.841
0.85 23 138 0.806 23 138 0.806 27 162 0.856 28 168 0.833
0.8 39 234 0.823 39 234 0.823 48 288 0.849 48 288 0.818
0.75 79 474 0.821 79 474 0.821 102 612 0.815 104 624 0.807
0.7 235 1410 0.809 235 1410 0.809 350 2100 0.811 352 2112 0.805
0.65 2903 > 10^4 0.794 2903 > 10^4 0.794 12809 > 10^4 0.801 12809 > 10^4 0.801

0.7 0.9 9 54 0.803 9 54 0.803 16 96 0.844 17 102 0.852
0.85 13 78 0.812 13 78 0.812 26 156 0.849 26 156 0.833
0.8 20 120 0.820 20 120 0.820 44 264 0.821 44 264 0.812
0.75 33 198 0.806 33 198 0.806 92 552 0.819 92 552 0.811
0.7 64 384 0.808 64 384 0.808 290 1740 0.800 291 1746 0.805
0.65 172 1032 0.801 172 1032 0.801 5508 > 10^4 0.794 5508 > 10^4 0.795

The tables report the sample size for the placebo arm only; those for arms R and E are obtained by multiplying it by the allocation ratios. The total sample size is 3nP(1) for the allocation (1 : 1 : 1), 5nP(2) for (1 : 2 : 2) and 6nP(3) for (1 : 2 : 3), where nP(1), nP(2) and nP(3) are the respective placebo-arm sample sizes under the three allocations. In all three tables, the sample size needed to achieve 80% power under the conditional approach is smaller than or equal to that under the marginal approach. Tables 1, 2 and 3 show that the two approaches behave nearly identically when πR ≫ πP. When the difference is smaller, however, the conditional approach tends to improve power for a fixed sample size. We also observe that, for fixed power, the required sample size decreases as θ decreases, which is consistent with the power curves. For NNT the required sample size is larger than for RR and OR, because we test for ϵ-substantial non-inferiority (ϵ > 0) and hence, for fixed θ, the margin allowance for NNT is smaller than that for RR and OR.

Although a balanced design is appealing at first glance, one may not want to use it in the NI context for two reasons: (i) for ethical reasons, when an effective treatment exists, the number of patients receiving placebo should be kept as small as possible, and (ii) as pointed out by Koch and Tangen (1999), the difference between E and R should be expected to be much smaller than the difference of either of them from placebo, so that the latter differences are easier to detect. As observed by Pigeot et al. (2003) for continuous outcomes, the sample size required for unbalanced allocations is remarkably smaller than for the balanced one. We observe similar results for the sample size under NNT in Table 3. For OR, Table 2 shows that the required sample size is remarkably smaller for the unbalanced allocation (1 : 2 : 2) than for the balanced design (1 : 1 : 1), and a further minor reduction is obtained for the unbalanced allocation (1 : 2 : 3) compared with (1 : 2 : 2). For RR, however, the sample sizes do not follow the same pattern with respect to the allocation, particularly when the difference between πR and πP is large, as can be seen from Table 1. Apart from the difference in the functional form, this might also be because RR, even after logarithmic transformation, still has a somewhat skewed distribution that does not conform to the normal approximation as well as OR or NNT.

5. Application

We illustrate the proposed Frequentist methods for RR, OR and NNT with a published dataset from a three-arm comparative study on major depressive disorder, described in Higuchi et al. (2009). Hida and Tango (2011) and Ghosh et al. (2016) also analyzed this dataset. Hida and Tango (2013) proposed a Frequentist version of the problem for binary outcomes, and Ghosh et al. (2018) considered the Bayesian version for the risk difference. The objective of the depression trial was to compare the efficacy and safety of duloxetine (E) with those of paroxetine (R) and placebo (P). The study was a double-blind, randomized, parallel-group, active-controlled study of a six-week treatment with the following number of patients in each arm: duloxetine (nE = 147), paroxetine (nR = 148) and placebo (nP = 145). The primary endpoint was continuous: the change in HAMD-17 total score from baseline to the end of the sixth week. Hida and Tango (2011) considered two binary outcomes for their Frequentist approach, namely Response and Remission. Response, the primary outcome, is defined as a reduction of more than 50% in the HAMD-17 total score. Remission, the secondary outcome, is defined as maintaining a HAMD-17 score of ≤ 17 at the end of 6 weeks. The data are presented in Table 4 in terms of Response and Remission. We analyze the Response and Remission outcomes separately using the proposed approach. To allow a meaningful interpretation of the effect of the experimental drug, a clinically acceptable margin reflecting the largest loss of effect is chosen to determine non-inferiority of the experimental drug over the control. Here we vary θ in the range [0.5, 0.8] to explore different possibilities. For the marginal approach, the p-value of the test for NNT and for Margin 2 of both RR and OR is calculated as

$$p\text{-value} = P_{H_0}(T > T_{obs}) = 1 - \Phi\left(\frac{\sqrt{n_P}\,T_{obs}}{\sqrt{\tau_0^2}}\right), \qquad (5.1)$$

where $T_{obs} = g(\hat{\pi}_E) - \theta g(\hat{\pi}_R) - (1 - \theta) g(\hat{\pi}_P) - \epsilon$ is the Frequentist statistic under the existing approach and $\tau_0^2 = n_P \mathrm{Var}(T)$ under the null hypothesis. The quantity ϵ is chosen to be 0.05 for the analysis under NNT, while ϵ = 0 for RR and OR. For the conditional Frequentist approach we calculate the p-value as

$$p\text{-value} = P_{H_0}(W > W_{obs}) = 1 - \Phi\left(\frac{W_{obs} - \mu_w^{\mathrm{null}}}{\sigma_w^{\mathrm{null}}}\right), \qquad (5.2)$$

where $W_{obs} = g(\hat{\pi}_E) - \theta g(\hat{\pi}_R) - (1 - \theta) g(\hat{\pi}_P) - \epsilon \mid g(\hat{\pi}_R) - g(\hat{\pi}_P) > 0$ is the Frequentist test statistic for the conditional testing, and $\mu_w^{\mathrm{null}}$ and $(\sigma_w^{\mathrm{null}})^2$ are the mean and variance of W under the null hypothesis as given in Section 3. The Frequentist p-values for RR, OR and NNT are reported in Table 5 for the Response data and in Table 6 for the Remission data.
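As a rough illustration of the marginal p-value in (5.1), the following Python sketch evaluates it for the NNT functional using the counts in Table 4. It rests on assumptions that are not taken from the paper: g is the identity on the risk scale (so the test is the ϵ-substantial risk-difference test), and Var(T) is estimated by plugging the observed proportions into the binomial variances rather than by a null-restricted estimate. The function names are hypothetical. The argument of Φ in (5.1) equals T_obs divided by the standard deviation of T, which is what the code computes.

```python
import math


def upper_tail_normal(z):
    """Return 1 - Phi(z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def marginal_p_value_nnt(x_e, n_e, x_r, n_r, x_p, n_p, theta, eps=0.05):
    """Wald-type marginal p-value on the risk scale, g(pi) = pi.

    Assumption: Var(T) uses the observed proportions (plug-in estimate),
    which need not coincide with the null variance used in the paper.
    """
    pi_e, pi_r, pi_p = x_e / n_e, x_r / n_r, x_p / n_p
    # T_obs = g(pi_E) - theta * g(pi_R) - (1 - theta) * g(pi_P) - eps
    t_obs = pi_e - theta * pi_r - (1.0 - theta) * pi_p - eps
    # Variance of the linear combination of three independent proportions.
    var_t = (
        pi_e * (1.0 - pi_e) / n_e
        + theta ** 2 * pi_r * (1.0 - pi_r) / n_r
        + (1.0 - theta) ** 2 * pi_p * (1.0 - pi_p) / n_p
    )
    return upper_tail_normal(t_obs / math.sqrt(var_t))


# Counts from Table 4 (duloxetine = E, paroxetine = R, placebo = P).
RESPONSE = dict(x_e=80, n_e=147, x_r=78, n_r=148, x_p=56, n_p=145)
REMISSION = dict(x_e=50, n_e=147, x_r=49, n_r=148, x_p=32, n_p=145)

for theta in (0.5, 0.8):
    p_resp = marginal_p_value_nnt(theta=theta, **RESPONSE)
    p_rem = marginal_p_value_nnt(theta=theta, **REMISSION)
    print(f"theta={theta}: response p={p_resp:.3f}, remission p={p_rem:.3f}")
```

Under these assumptions the sketch gives values close to the marginal NNT entries of Tables 5 and 6 (roughly 0.23 for Response and 0.38 for Remission at θ = 0.5). For RR and OR one would replace g and its variance according to the margin definitions of Section 3, which are not reproduced here.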

Table 5:

Frequentist p-values for the Response Data

RR OR NNT (ϵ = 0.05)

θ Conditional Marginal Conditional Marginal Conditional Marginal
0.5 0.047 0.047 0.041 0.041 0.227 0.227
0.55 0.059 0.059 0.054 0.055 0.272 0.272
0.6 0.075 0.075 0.072 0.073 0.320 0.321
0.65 0.094 0.094 0.094 0.095 0.372 0.374
0.7 0.119 0.119 0.122 0.123 0.426 0.428
0.75 0.149 0.150 0.155 0.157 0.479 0.482
0.8 0.186 0.187 0.193 0.195 0.532 0.535

Table 6:

Frequentist p-values for the Remission Data

RR OR NNT (ϵ = 0.05)

θ Conditional Marginal Conditional Marginal Conditional Marginal
0.5 0.085 0.085 0.080 0.080 0.379 0.38
0.55 0.101 0.101 0.099 0.099 0.424 0.426
0.6 0.121 0.121 0.121 0.121 0.470 0.473
0.65 0.146 0.146 0.147 0.148 0.516 0.519
0.7 0.174 0.175 0.177 0.179 0.559 0.564
0.75 0.207 0.209 0.212 0.215 0.601 0.606
0.8 0.245 0.248 0.245 0.254 0.640 0.645

The respective p-values are compared with α = 0.025 to reach the final decision. From both Table 5 and Table 6, we observe that the p-values decrease as θ decreases, implying a greater chance of declaring NI for smaller values of θ, since a p-value < α leads to rejection of the null hypothesis and hence to a claim of NI. We also observe that the p-values under the conditional approach are smaller than or equal to those under the marginal approach, which is consistent with the sample size calculations under all three functionals. However, since none of the p-values is smaller than α = 0.025, the null hypothesis cannot be rejected, and hence NI cannot be claimed for any of the tests, for any θ or any functional.
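The decision step can be checked mechanically. The sketch below is illustrative only: it copies the conditional p-values for the Response data from Table 5, applies the rule that NI is claimed when the p-value falls below α = 0.025, and checks that the p-values do not decrease as θ increases. The helper names `ni_claimed` and `is_nondecreasing` are hypothetical.

```python
ALPHA = 0.025
THETAS = [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8]

# Conditional p-values for the Response data, copied from Table 5.
P_RESPONSE_CONDITIONAL = {
    "RR": [0.047, 0.059, 0.075, 0.094, 0.119, 0.149, 0.186],
    "OR": [0.041, 0.054, 0.072, 0.094, 0.122, 0.155, 0.193],
    "NNT": [0.227, 0.272, 0.320, 0.372, 0.426, 0.479, 0.532],
}


def ni_claimed(p_value, alpha=ALPHA):
    """NI is claimed when the null hypothesis is rejected, i.e. p < alpha."""
    return p_value < alpha


def is_nondecreasing(values):
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


for functional, p_values in P_RESPONSE_CONDITIONAL.items():
    decisions = [ni_claimed(p) for p in p_values]
    print(functional, "NI claimed for some theta:", any(decisions),
          "| p nondecreasing in theta:", is_nondecreasing(p_values))
```

For these tabulated values no functional reaches p < 0.025 at any θ in the grid, matching the conclusion stated in the text.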

6. Discussion

In this paper we have presented fraction margin based Frequentist test procedures for the “gold standard” three-arm NI trial, which includes a placebo arm, for binary endpoints with RR, OR and NNT type functionals. This is an important methodological contribution in view of the recent FDA guideline. We also introduced a conditional test of NI. For both RR and OR, we showed the non-uniqueness of the NI margin and constructed two examples of it. We compared them to identify the one yielding better operating characteristics. Additional guidance is required from regulators if one plans to choose a unique NI margin for all situations. We also note that NI testing for NNT can be regarded as ϵ-substantial NI testing under the risk difference case. We tabulated the sample size under three different types of allocation for RR, OR and NNT, which should provide a good starting point for assessing sample size when designing such trials. We also note that the tests based on asymptotic approximation perform favorably, since in NI testing the number of patients in each treatment arm is usually moderately large. For small sample sizes, an exact method of NI testing can be developed using Fisher’s exact test following Wellek (2005), Hasselblad and Lokhnygina (2007) and Zaslavsky (2013). However, all of these articles presented the exact approach for two-arm NI testing, and most of them considered only the rate difference as the function of interest. To the best of our knowledge there exists no published work on an exact testing approach for the three-arm NI trial, and hence this should be considered important future work. Also, as suggested by one reviewer, the likelihood ratio test or the score test are potential alternatives to the asymptotic tests.

Historical information plays a substantial role in the design and analysis of an NI trial. Hence it has to be reflected in several substantive aspects of an NI trial, e.g., the choice of δ, the question of whether a placebo can be included as an additional arm of the study, assay sensitivity, etc. From the sample size tables we observed that the proposed conditional test yields the same power, and hence the same sample size, as the marginal test when the active control is substantially superior to placebo. However, when the control is only marginally superior to placebo, the proposed conditional approach yields a smaller sample size than the marginal approach for a fixed power. The analysis of our clinical trial data also suggests that both methods perform comparably in all situations; however, the p-values under the conditional approach are always smaller than or equal to those obtained under the marginal one. This supports the observation we made about the difference in sample size between the two methods. We also note that, although in this article we considered sample size allocations motivated by Pigeot et al. (2003) and Koch and Tangen (1999), one may also consider optimal allocation to treatment arms along the lines of Singer (2001) and Pigeot et al. (2003) for continuous outcomes. However, when the outcome is binary, the derivation of an optimal allocation formula for the various functional forms remains an open problem.

We note that under the fraction margin approach the fraction “f” is pre-specified, while the NI margin δ is unknown. Hence the value of δ can vary greatly depending on the estimated effect size of the reference treatment, i.e. as a function of $(g(\hat{\pi}_R), g(\hat{\pi}_P))$. Evidently, the information gained from historical trials may play a significant role in NI trial design, and a hierarchical Bayesian approach may provide an attractive framework to achieve this. In this article we restricted ourselves to the Frequentist approach, but the Bayesian approach is definitely an avenue worth exploring in future. On the other hand, in the fixed margin approach (see Hida and Tango (2013) and Ghosh et al. (2018)) with three arms, joint testing of NI and AS may be performed; this needs additional care, since under the intersection-union test it may produce a conservative test with restrictive type-I error (Chuang-Stein et al., 2007; Dmitrienko et al., 2009). Development of such a procedure for RR, OR and NNT under an alternative definition of type-I error (e.g. the average testing error of Chuang-Stein et al. (2007)) is another interesting open problem.

Table 4:

Remission and Response as Outcome in the Depression Trial of Higuchi et al. (2009)

Outcome Duloxetine Paroxetine Placebo
Remission 50 49 32
Response 80 78 56
Total nE = 147 nR = 148 nP = 145

Acknowledgements

The research of the last author is partly supported by PCORI contract number ME-1409–21410 and NIH grant number P30-ES020957.

Footnotes

Supplementary Material

For proofs and additional results please see the supplementary material.

References

  1. EMA (2005). Guideline on the choice of the noninferiority margin (Doc. Ref. EMEA/CPMP/EWP/215).
  2. FDA (2016). Non-Inferiority Clinical Trials to Establish Effectiveness: Guidance for Industry.
  3. Althunian TA, de Boer A, Klungel OH, Insani WN, and Groenwold RH (2017). Methods of defining the non-inferiority margin in randomized, double-blind controlled trials: a systematic review. Trials, 18(1):107.
  4. Brown D, Volkers P, and Day S (2006). An introductory note to CHMP guidelines: choice of the non-inferiority margin and data monitoring committees. Statistics in Medicine, 25(10):1623–1627.
  5. Chuang-Stein C, Stryszak P, Dmitrienko A, and Offen W (2007). Challenge of multiple co-primary endpoints: a new approach. Statistics in Medicine, 26(6):1181–1192.
  6. Cook R and Sackett DL (1995). The number needed to treat: a clinically useful measure of treatment effect. BMJ, 310(6977):452.
  7. D’Agostino RB, Massaro JM, and Sullivan LM (2003). Noninferiority trials: Design concepts and issues-the encounters of academic consultants in statistics. Statistics in Medicine, 22(2):169–186.
  8. Dmitrienko A, Tamhane AC, and Bretz F (2009). Multiple testing problems in pharmaceutical statistics. CRC Press.
  9. Gamalo MA, Wu R, and Tiwari RC (2011). Bayesian approach to noninferiority trials for proportions. Journal of Biopharmaceutical Statistics, 21(5):902–919.
  10. Ghosh S, Ghosh S, and Tiwari RC (2016). Bayesian approach for assessing non-inferiority in a three-arm trial with pre-specified margin. Statistics in Medicine, 35(5):695–708.
  11. Ghosh S, Tiwari RC, and Ghosh S (2018). Bayesian approach for assessing noninferiority in a three-arm trial with binary endpoint. Pharmaceutical Statistics, 17(4):342–357.
  12. Hashemi L, Nandram B, and Goldberg R (1997). Bayesian analysis for a single 2 × 2 table. Statistics in Medicine, 16:1311–1328.
  13. Hasselblad V and Lokhnygina Y (2007). Tests for 2 × 2 tables in clinical trials. Journal of Modern Applied Statistical Methods, 6(2):456–468.
  14. Hida E and Tango T (2011). On the three-arm noninferiority trial including a placebo with a prespecified margin. Statistics in Medicine, 30(3):224–231.
  15. Hida E and Tango T (2013). Three-arm noninferiority trials with a prespecified margin for inference of the difference in the proportions of binary endpoints. Journal of Biopharmaceutical Statistics, 23(4):774–789.
  16. Higuchi T, Murasaki M, and Kamijima K (2009). Clinical evaluation of duloxetine in the treatment of major depressive disorder-placebo and paroxetine-controlled double-blinded comparative study. Japanese Journal of Clinical Psychopharmacology, 12:1613–1634.
  17. Hilton JF (2010). Noninferiority trial designs for odds ratios and risk differences. Statistics in Medicine, 29:982–993.
  18. Huang L, Zalkikar J, and Tiwari RC (2011). A likelihood ratio test based method for signal detection with application to FDA’s drug safety data. Journal of the American Statistical Association, 106(496):1230–1241.
  19. Hung HMJ and Wang SJ (2004). Multiple testing of noninferiority hypotheses in active controlled trials. Journal of Biopharmaceutical Statistics, 14(2):327–335.
  20. ICHE10 (2009). ICH Harmonised Tripartite Guideline. Choice of Control Group and Related Issues in Clinical Trials. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use.
  21. ICHE9 (2009). ICH Harmonised Tripartite Guideline. Statistical Principles for Clinical Trials. International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use.
  22. Keefe R, Kraemer HC, Epstein RS, Frank E, Ginger H, Laughren TP, Mcnulty J, Reed SD, Sanchez J, and Leon AC (2013). Defining a clinically meaningful effect for the design and interpretation of randomized controlled trials. Innovations in Clinical Neuroscience, 10(5–6 Suppl A):4S.
  23. Kieser M and Friede T (2007). Planning and analysis of three-arm non-inferiority trials with binary endpoints. Statistics in Medicine, 26(2):253–273.
  24. Kieser M and Stucke K (2016). Assessing additional benefit in noninferiority trials. Biometrical Journal.
  25. Koch GG and Tangen CM (1999). Non-parametric analysis of covariance and its role in non-inferiority clinical trials. Drug Information Journal, 33:1145–1159.
  26. Kulldorff M (1997). A spatial scan statistic. Communications in Statistics - Theory and Methods, 26(6):1481–1496.
  27. Mielke M (2010). Maximum Likelihood Theory for Retention of Effect Non-Inferiority Trials. PhD thesis, Niedersächsische Staats-und Universitätsbibliothek Göttingen.
  28. Mielke M and Munk A (2009). The assessment and planning of non-inferiority trials for retention of effect hypotheses-towards a general approach. arXiv preprint arXiv:0912.4169.
  29. Mutze T, Munk A, and Friede T (2015). Design and analysis of three-arm trials with negative binomially distributed endpoints. Statistics in Medicine, 35(4):505–521.
  30. Pigeot I, Schafer J, Rohmel J, and Hauschke D (2003). Assessing noninferiority of a new treatment in a three-arm clinical trial including a placebo. Statistics in Medicine, 22(6):883–899.
  31. Rousson V and Seifert B (2008). A mixed approach for proving non-inferiority in clinical trials with binary endpoints. Biometrical Journal, 2:190–204.
  32. Schumi J and Wittes JT (2011). Through the looking glass: understanding non-inferiority. Trials, 12(2):106–118.
  33. Singer J (2001). A simple procedure to compute the sample size needed to compare two independent groups when the population variances are unequal. Statistics in Medicine, 20(7):1089–1095.
  34. Tang N-S, Yu B, and Tang M-L (2014). Testing non-inferiority of a new treatment in three-arm clinical trials with binary endpoints. BMC Medical Research Methodology, 14(1):134.
  35. Wangge G, Roes KC, de Boer A, Hoes AW, and Knol MJ (2013). The challenges of determining noninferiority margins: a case study of noninferiority randomized controlled trials of novel oral anticoagulants. Canadian Medical Association Journal, 185(3):222–227.
  36. Wellek S (2005). Statistical methods for the analysis of two-arm non-inferiority trials with binary outcomes. Biometrical Journal: Journal of Mathematical Methods in Biosciences, 47(1):48–61.
  37. Zaslavsky BG (2013). Bayesian hypothesis testing in two-arm trials with dichotomous outcomes. Biometrics, 69(1):157–163.
