Statistical Methods in Medical Research. 2023 Jul 28; 32(9): 1784–1798. doi: 10.1177/09622802231189592

Simultaneous confidence intervals for an extended Koch-Röhmel design in three-arm non-inferiority trials

Martin Scharpenberg 1, Werner Brannath 1
PMCID: PMC10540495  PMID: 37503578

Abstract

Three-arm ‘gold-standard’ non-inferiority trials are recommended for indications where only unstable reference treatments are available and the use of a placebo group can be justified ethically. For such trials, several study designs have been suggested that use the placebo group for testing ‘assay sensitivity’, that is, the ability of the trial to replicate efficacy. Should the reference fail in the given trial, non-inferiority could also be shown with an ineffective experimental treatment, and the non-inferiority claim would hence be useless. In this article, we extend the so-called Koch-Röhmel design, where a proof of efficacy for the experimental treatment is required in order to qualify for the non-inferiority test. While the efficacy of the experimental treatment is an indication of assay sensitivity, it does not guarantee that the reference is sufficiently efficacious for the non-inferiority claim to be meaningful. It has, therefore, been suggested to test non-inferiority adaptively only if the reference demonstrates superiority to placebo and otherwise to test δ-superiority of the experimental treatment over placebo, where δ is chosen in such a way that it provides proof of non-inferiority with regard to the reference’s historical effect. In this article, we extend the previous work by complementing its adaptive test with compatible simultaneous confidence intervals. Confidence intervals are commonly used and suggested by regulatory guidelines for non-inferiority trials. We show how to adapt different approaches to simultaneous confidence intervals from the literature to the setting of three-arm non-inferiority trials and compare these methods in a simulation study. Finally, we apply these methods to a real clinical trial example.

Keywords: Simultaneous confidence intervals, ‘gold-standard’ non-inferiority trial, optimal sample size, adaptive test strategy

1. Introduction

We consider three-arm trials including an experimental treatment (E), an established reference (R) and a placebo (P) (see, e.g. Koch and Röhmel 1 ). Typical indications for such trials are, for example, asthma or depression, where the use of a placebo is acceptable from an ethical point of view and where the effect of the reference is typically highly variable. New treatments are sought to obtain better effects for certain patient populations or to reduce side effects. A new treatment whose effect is not inferior to that of the reference is then regarded as successful. To assure assay sensitivity, a placebo group is included in the trial.

Denote by $\mu_i$ the effect in group $i$, where $i \in \{E, R, P\}$. We will use the following notation for the hypotheses of interest:

$$H_{EP}^{S}: \mu_E - \mu_P \le 0 \quad \text{(to show superiority of $E$ over $P$)},$$
$$H_{ER}^{N}: \mu_E - \mu_R \le -\delta_0 \quad \text{(to show non-inferiority of $E$ compared to $R$ with margin $\delta_0 > 0$)},$$
$$H_{EP}^{\delta_1}: \mu_E - \mu_P \le \delta_1 \quad \text{(to show superiority by $\delta_1 > 0$ of $E$ over $P$)}.$$

Without loss of generality, we can assume that $\mu_P = 0$. The non-inferiority margin is normally chosen as $\delta_0 = r\mu_R^h$, where $r \in (0,1)$ and $\mu_R^h$ is the historical reference effect, that is, the effect observed in earlier studies. A common choice is $r = 1/2$ or smaller. Note that any other choice of $\delta_0$, for example one based on clinical considerations, is also possible and can be represented by $r\mu_R^h$ for some $r$. We will come to the choice of $\delta_1$ shortly.

The gold-standard for testing in three-arm trials was proposed by Koch and Röhmel 1 and consists of a hierarchical procedure rejecting first $H_{EP}^S$ to assure assay sensitivity and then rejecting $H_{ER}^N$ to confirm efficacy of the new treatment. There are several possibilities to proceed after the second rejection. An important issue is the uncertainty about the strength of the reference in the study. If actually $\mu_R = 0$, then the rejection of $H_{ER}^N$ is both easy and worthless (see Hauschke and Pigeot 2 ). Since it is unsatisfying to spend level on showing superiority of the reference over placebo, an idea is to consider the hypothesis $H_{EP}^{\delta_1}$ instead. This targets the clinical relevance of the new treatment and implicitly assures non-inferiority to the historical reference effect. We propose to choose $\delta_1 = (1-r)\mu_R^h$. If the effect of the reference in the study corresponds to the historical effect, that is, $\mu_R = \mu_R^h$, then non-inferiority implies $\mu_E > \mu_R - \delta_0 = (1-r)\mu_R^h$. On the other hand, if the reference in the study is weak, then the rejection of $H_{EP}^{\delta_1}$ likewise assures that $\mu_E > \delta_1 = (1-r)\mu_R^h$. Hence, non-inferiority to the (historical) reference with margin $r\mu_R^h = \delta_0$ is shown in either case.
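
For illustration, take the values that reappear in the scenarios of Section 3 (this worked example is ours): with $\mu_R^h = 1$ and $r = 1/2$,

$$\delta_0 = r\,\mu_R^h = 0.5, \qquad \delta_1 = (1-r)\,\mu_R^h = 0.5,$$

so rejecting $H_{ER}^N$ when the reference retains its historical effect, or rejecting $H_{EP}^{\delta_1}$ when it does not, both guarantee $\mu_E > \mu_R^h - \delta_0 = 0.5$.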

In Brannath et al., 3 a flexible extension of the Koch-Röhmel design was proposed, which is in the spirit of the above argumentation. A hierarchical test of the sequence $H_{EP}^S$, $H_{ER}^N$, $H_{EP}^{\delta_1}$ was introduced, and success of the study was interpreted depending on the reference strength. If ‘R > P’, then success is reached if $H_{ER}^N$ is rejected. If ‘R ≤ P’, then success means rejecting $H_{EP}^{\delta_1}$. The test of the hypothesis $H_{RP}^S: \mu_R - \mu_P \le 0$ served as a filter (‘R > P’ or ‘R ≤ P’) for the interpretation of the study results. It was shown that the overall probability of erroneously declaring success can be controlled at level $\alpha$ without a confirmatory test of the reference effect.

It is well known that confidence intervals provide more information than hypothesis testing. They are particularly important for non-inferiority trials. For example, we can quantify the amount by which $\mu_E$ is superior to $\mu_P$, or even superior to $\mu_P + \delta_1$. We may also be able to sharpen the non-inferiority claim with respect to the reference or even show superiority, that is, prove that $\mu_E - \mu_R > 0$. In this article, we extend the idea of Brannath et al. 3 by introducing three sets of simultaneous confidence intervals (SCIs) for the effects $\mu_E - \mu_P$ and $\mu_E - \mu_R$. The first set of proposed SCIs is based on the stepwise confidence intervals introduced by Hsu and Berger. 4 We will find an intrinsic filter describing the strength of the reference, which is perfectly compatible with the interpretation of these simultaneous confidence bounds. Thus, we obtain an intuitive procedure to decide about the success of the study while obtaining additional information via an SCI. Another set of SCIs will be based on the work of Schmidt and Brannath, 5 who introduced informative confidence intervals in hierarchical testing. Finally, we will compare these two sets of SCIs to the single-step SCIs which can easily be derived for the aforementioned effects.

The article is organized as follows. We introduce the SCIs in Section 2 and give numerical results on their performance in Section 3. We apply the different intervals to data from a clinical trial in Section 4 and end with a discussion in Section 5.

2. Description of the SCIs

2.1. SCIs with intersection union filter

We assume that all observations are normally distributed with common standard deviation $\sigma > 0$. Denote by $n_i$ the sample size of group $i \in \{E, R, P\}$. Then the observed mean in group $i$ is $X_i \sim N(\mu_i, \sigma^2/n_i)$. Define the univariate confidence bounds

$$\ell_{EP} = X_E - X_P - z_\alpha\,\sigma\sqrt{n_E^{-1} + n_P^{-1}} \quad\text{and}\quad \ell_{ER} = X_E - X_R - z_\alpha\,\sigma\sqrt{n_E^{-1} + n_R^{-1}}$$

where $z_\alpha = \Phi^{-1}(1-\alpha)$ is the $(1-\alpha)$-quantile of the standard normal distribution. According to Hsu and Berger, 4 simultaneous lower confidence bounds for $\theta = (\theta_1, \theta_2) = (\mu_E - \mu_P, \mu_E - \mu_R)$ with coverage probability $1-\alpha$ are given by

$$L_{EP} = \begin{cases} \ell_{EP} & \text{if } \ell_{EP} < 0 \\ 0 & \text{if } \ell_{EP} \ge 0,\ \ell_{ER} < -\delta_0 \\ L_{\min} := \min\{\ell_{EP},\, \ell_{ER} + \delta_0\} & \text{if } \ell_{EP} \ge 0,\ \ell_{ER} \ge -\delta_0 \end{cases} \tag{1}$$

and

$$L_{ER} = \begin{cases} -\infty & \text{if } \ell_{EP} < 0 \\ \ell_{ER} & \text{if } \ell_{EP} \ge 0,\ \ell_{ER} < -\delta_0 \\ L_{\min} - \delta_0 = \min\{\ell_{EP} - \delta_0,\, \ell_{ER}\} & \text{if } \ell_{EP} \ge 0,\ \ell_{ER} \ge -\delta_0 \end{cases} \tag{2}$$

These SCIs are compatible with the rejection decisions of the Koch-Röhmel design, that is, with a hierarchical test of $H_{EP}^S$ and $H_{ER}^N$ as in Maurer et al. 6 The first line of (1) and (2) corresponds to the case where the hypothesis $H_{EP}^S$ cannot be rejected. In this case, the study is considered a failure. The second line is the case where $H_{EP}^S$ is rejected, but $H_{ER}^N$ is not. We also consider this case a failure, even though non-inferiority is not of interest in the case of a weak reference; however, if the reference fails, then $H_{ER}^N$ should be easy to reject anyway. The most important situation occurs in the third line of (1) and (2), where both hypotheses $H_{EP}^S$ and $H_{ER}^N$ are rejected. The confidence bounds $L_{EP}$ and $L_{ER}$ are then both determined either by information on the effect $\mu_E - \mu_P$ via $\ell_{EP}$ or by information on the effect $\mu_E - \mu_R$ via $\ell_{ER}$, where the boundary between these two options can easily be calculated. Under the last condition of (1) and (2):

$$L_{\min} = \ell_{ER} + \delta_0 \iff X_R - X_P \ge z_\alpha\,\sigma\left(\sqrt{n_E^{-1} + n_P^{-1}} - \sqrt{n_E^{-1} + n_R^{-1}}\right) + \delta_0 \tag{3}$$

It appears unnatural at first sight that $L_{ER}$, which is a lower bound for $\theta_2$, is in some cases given by an estimate related to $\theta_1$, and vice versa. However, there is a transparent way of interpreting the bounds: If the reference is ‘strong’ in the sense of (3), then the confidence bound for the effect of E versus R is of interest, which is reflected by the fact that $L_{\min} = \ell_{ER} + \delta_0$, and hence the level is fully exploited for the parameter $\mu_E - \mu_R$. In the other case, where the reference is ‘weak’ so that the inequality in (3) is not satisfied, $L_{\min} = \ell_{EP}$, that is, the level is exploited for $\mu_E - \mu_P$. This makes sense because the difference of the new treatment to a weak reference is not of interest, and non-inferiority to the historical reference is targeted indirectly via a bound for the effect of E versus P. In summary, the study can be judged as successful if either the reference is strong and $L_{ER} \ge -\delta_0 = -r\mu_R^h$, or the reference is weak and $L_{EP} \ge \delta_1 = (1-r)\mu_R^h$. This inference can be made without needing to test the reference effect directly. Thus, equation (3) serves as an intrinsic filter for the interpretation of the study results.

Formally, the definition of $L_{\min}$ is an application of nested intersection union (IU) tests of the hypotheses $H_\Lambda^\vartheta: \Lambda \le \vartheta$ at level $\alpha$ for the parameter $\Lambda = \min\{\mu_E - \mu_P,\ \mu_E - \mu_R + \delta_0\}$. The value of $\Lambda$ equals $\mu_E - \mu_P$ if and only if $\mu_R \le \mu_P + \delta_0$; otherwise it quantifies $\mu_E - \mu_R + \delta_0$. This corresponds to the philosophy of a weak versus strong reference. Since the SCIs (1) and (2) and the filter (3) are the application of nested IU tests, we will call the filter the IU filter.
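
To make the construction concrete, the following Python sketch computes the bounds (1) and (2) and reports whether the IU filter (3) is satisfied. It is a minimal illustration of the formulas above under the common-variance normal model, not code from the authors; all function and variable names are ours.

    import numpy as np
    from scipy.stats import norm

    def iu_sci(xe, xr, xp, ne, nr, np_, sigma, delta0, alpha=0.025):
        """Hsu-Berger-type simultaneous lower bounds (L_EP, L_ER) of formulas (1)-(3)."""
        z = norm.ppf(1 - alpha)
        l_ep = xe - xp - z * sigma * np.sqrt(1 / ne + 1 / np_)  # univariate bound for mu_E - mu_P
        l_er = xe - xr - z * sigma * np.sqrt(1 / ne + 1 / nr)   # univariate bound for mu_E - mu_R
        if l_ep < 0:                        # H_EP^S not rejected: first line of (1) and (2)
            return l_ep, -np.inf, "failure"
        if l_er < -delta0:                  # H_ER^N not rejected: second line of (1) and (2)
            return 0.0, l_er, "failure"
        l_min = min(l_ep, l_er + delta0)    # nested IU bound for Lambda: third line
        strong_ref = l_er + delta0 <= l_ep  # IU filter (3): 'R > P'
        return l_min, l_min - delta0, "R>P" if strong_ref else "R<=P"

    # Values of the first row of Table 2(a) in Section 3.2 (sigma = 2, delta0 = 0.5):
    # returns approximately (0.205, -0.295, 'R>P')
    print(iu_sci(1.0, 1.0, 0.0, 356, 348, 145, sigma=2.0, delta0=0.5))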

The procedure of defining SCIs and applying them to define the success of the study can be illustrated graphically using the notation of Schmidt and Brannath. 5 This is done in Figure 1. The test of the hypothesis family $(H_{EP}^\vartheta)_{\vartheta \le 0}$ (at level $\alpha$) leads to a confidence bound $L_{EP} = \ell_{EP} < 0$ if $H_{EP}^S$ cannot be rejected, or $L_{EP} = 0$ otherwise. In the first case, the procedure stops with $L_{ER} = -\infty$, while in the latter case, the family of hypotheses $(H_{ER}^\vartheta)_{\vartheta \le -\delta_0}$ is tested (at level $\alpha$), resulting in a confidence bound $-\infty < L_{ER} \le -\delta_0$. If $L_{ER} = -\delta_0$, that is, if $H_{ER}^N$ is rejected, then the level $\alpha$ is shifted further to the hypothesis family $(H_\Lambda^\vartheta)_{\vartheta > 0}$, improving the confidence bounds $L_{EP}$ and $L_{ER}$ simultaneously.

Figure 1.

Formal description of the simultaneous confidence intervals. The hypotheses are $H_{EP}^\vartheta: \mu_E - \mu_P \le \vartheta$, $H_{ER}^\vartheta: \mu_E - \mu_R \le \vartheta$ and $H_\Lambda^\vartheta: \Lambda \le \vartheta$ for $\Lambda = \min\{\mu_E - \mu_P,\ \mu_E - \mu_R + \delta_0\}$. The intersection union (IU) filter ‘R > P’ is given in (3).

This procedure modifies the hierarchical test design introduced by Brannath et al., 3 where the hypothesis $H_{EP}^{\delta_1}$ is tested after rejection of $H_{EP}^S$ and $H_{ER}^N$. Here, more information about the size of both effects $\mu_E - \mu_P$ and $\mu_E - \mu_R$ can be gained via a confidence interval for $\Lambda$. Another difference is that the filter ‘R > P’ is now given via (3), while it was a level-$\alpha$ test of the hypothesis $H_{RP}: \mu_R - \mu_P \le 0$ in Brannath et al. 3 The new filter has a remarkable advantage. Indeed, it was discussed by Brannath et al. 3 that the rather complex strategy in Figure 1 can be replaced by an intuitive strategy for many situations. With the new filter, we can actually give an intuitive picture for all situations. The corresponding graph is shown in Figure 2. After rejection of the gatekeeper $H_{EP}^S$, it is decided via the filter whether to look at the lower bound for $\mu_E - \mu_R$ or the lower bound for $\mu_E - \mu_P$. In the first case, ‘R > P’, the study is successful if $L_{ER} := \ell_{ER} \ge -\delta_0$; in the second case, ‘R ≤ P’, it is successful if $L_{EP} := \ell_{EP} \ge \delta_1$. Additionally, one can report $L_{EP} = \max\{0, L_{ER} + \delta_0\}$ in the first case, or $L_{ER} = L_{EP} - \delta_0$ in the second case. Although the procedure in Figure 2 is data-driven via the filter, one can show that it maintains the given coverage probability $1-\alpha$, because it is equivalent to the procedure in Figure 1, which is based on the intervals of Hsu and Berger 4 (see the Appendix). Therefore, the probability of erroneously declaring success of the study is also controlled.

Figure 2.

Intuitive graphical description and interpretation of the simultaneous confidence intervals. Notation as in Figure 1.

2.2. Informative SCIs

Next, we present another way to calculate simultaneous lower confidence bounds for $(\mu_E - \mu_P, \mu_E - \mu_R)$ with coverage probability $1-\alpha$, employing the method of Schmidt and Brannath. 5 They derived so-called informative SCIs in hierarchical testing. A confidence bound is informative if, in case the null hypothesis is rejected, it does not stick to the boundary of the null hypothesis with positive probability.

The procedure for deriving the simultaneous lower confidence bounds, denoted by $L_{EP}^{inf}$ and $L_{ER}^{inf}$ in the following, is illustrated in Figure 3 and works as follows: First, fix a value $0 < q < 1$. Then, $H_{EP}^S: \mu_E - \mu_P \le 0$ is tested at full level $\alpha$. If $H_{EP}^S$ cannot be rejected, the procedure stops and reports $L_{EP}^{inf} = \ell_{EP}$ and $L_{ER}^{inf} = -\infty$. If $H_{EP}^S$ can be rejected, the full level $\alpha$ is passed to testing $H_{ER}^N$. If $H_{ER}^N$ cannot be rejected, the procedure stops and reports $L_{EP}^{inf} = 0$ and $L_{ER}^{inf} = \ell_{ER}$. If, however, $H_{ER}^N$ is also rejected, we test each $H_{ER}^\vartheta: \mu_E - \mu_R \le \vartheta$ at level $q^{\vartheta + \delta_0}\alpha$. The lower confidence bound $L_{ER}^{inf}$ for $\mu_E - \mu_R$ is then the smallest $\vartheta$ for which $H_{ER}^\vartheta$ cannot be rejected at the respective level. It can be shown that this bound is the unique solution to

$$1 - \Phi\left\{\frac{X_E - X_R - \vartheta}{\sigma\sqrt{n_E^{-1} + n_R^{-1}}}\right\} = q^{\vartheta + \delta_0}\,\alpha.$$

Figure 3.

Graphical description of the algorithm to determine informative SCIs for $\mu_E - \mu_P$ and $\mu_E - \mu_R$, based on the notation and procedure of Schmidt and Brannath. 5

In the next step, we pass the remaining $\alpha$-level, $(1 - q^{L_{ER}^{inf} + \delta_0})\,\alpha$, to the hypotheses $H_{EP}^\vartheta$ and determine the lower confidence bound for $\mu_E - \mu_P$ as $L_{EP}^{inf} = \max\{0, \tilde{\ell}_{EP}\}$, where

$$\tilde{\ell}_{EP} = X_E - X_P - z_{(1 - q^{L_{ER}^{inf} + \delta_0})\,\alpha}\;\sigma\sqrt{n_E^{-1} + n_P^{-1}}$$

Note that in the cases where $H_{EP}^S$ or $H_{ER}^N$ cannot be rejected, these SCIs coincide with those presented in the previous section. Only in the case where E is shown to be superior to P and non-inferior to R do the two procedures for calculating SCIs differ.
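
A minimal numerical sketch of this construction is given below (our illustration, not code from the authors, assuming the common-variance normal model of Section 2.1; the defining equation for $L_{ER}^{inf}$ is solved by root-finding):

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import brentq

    def informative_sci(xe, xr, xp, ne, nr, np_, sigma, delta0, alpha=0.025, q=0.01):
        """Informative simultaneous lower bounds (L_EP^inf, L_ER^inf) as sketched in Section 2.2."""
        z = norm.ppf(1 - alpha)
        se_ep = sigma * np.sqrt(1 / ne + 1 / np_)
        se_er = sigma * np.sqrt(1 / ne + 1 / nr)
        l_ep, l_er = xe - xp - z * se_ep, xe - xr - z * se_er
        if l_ep < 0:                      # gatekeeper H_EP^S not rejected
            return l_ep, -np.inf
        if l_er < -delta0:                # H_ER^N not rejected
            return 0.0, l_er
        # L_ER^inf solves 1 - Phi((X_E - X_R - theta)/se_er) = q**(theta + delta0) * alpha
        f = lambda t: (1 - norm.cdf((xe - xr - t) / se_er)) - q ** (t + delta0) * alpha
        l_er_inf = brentq(f, -delta0, xe - xr + 10 * se_er)
        # the remaining level (1 - q**(L_ER^inf + delta0)) * alpha is spent on mu_E - mu_P
        rem = (1 - q ** (l_er_inf + delta0)) * alpha
        l_ep_inf = max(0.0, xe - xp - norm.ppf(1 - rem) * se_ep)
        return l_ep_inf, l_er_inf

    # Reproduces (up to rounding) the first row of Table 2(b): approximately (0.561, -0.340)
    print(informative_sci(1.0, 1.0, 0.0, 356, 348, 145, sigma=2.0, delta0=0.5, q=0.01))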

Note that the intervals introduced here are not informative in the sense indicated at the beginning of this subsection. This is due to the fact that if $H_{EP}^S$ is rejected but $H_{ER}^N$ is not, which happens with positive probability, the confidence bound for $\mu_E - \mu_P$ is 0 by construction. In this case, the study fails in the sense that it was not able to show a sufficient treatment effect of the experimental treatment. Since the SCIs introduced here are based on the informative SCIs of Schmidt and Brannath, 5 we will nevertheless refer to them as informative SCIs in the following.

While the SCIs introduced in Section 2.1 offer an intrinsic filter to determine whether ‘R > P’ or ‘R ≤ P’, this is not the case for the SCIs introduced in this section. For the interpretation of the confidence intervals, respectively the study results, we will use the superiority filter presented by Brannath et al., 3 which declares ‘R > P’ if

$$\frac{X_R - X_P}{\sigma\sqrt{n_R^{-1} + n_P^{-1}}} \ge z_\alpha \tag{4}$$

that is, if $H_{RP}^S$ can be rejected. The interpretation then works similarly to the case of Section 2.1. After rejection of $H_{EP}^S$, we can declare a successful study if either ‘R > P’ (i.e. $H_{RP}^S$ can be rejected) and $L_{ER}^{inf} > -\delta_0$, or ‘R ≤ P’ (i.e. $H_{RP}^S$ cannot be rejected) and $L_{EP}^{inf} > \delta_1$. Furthermore, $L_{EP}^{inf}$, respectively $L_{ER}^{inf}$, can additionally be reported.

2.3. Single-step SCIs

We now briefly introduce simultaneous single-step confidence intervals for the effects considered. To this end, we consider the test statistics

$$T_{EP} = \frac{X_E - X_P}{\sigma\sqrt{n_E^{-1} + n_P^{-1}}}, \qquad T_{ER}^N = \frac{X_E - X_R + \delta_0}{\sigma\sqrt{n_E^{-1} + n_R^{-1}}}$$

For large sample sizes $n_E$, $n_R$ and $n_P$, and for $\mu_E = \mu_P$ and $\mu_E - \mu_R = -\delta_0$ (i.e. $H_{EP}^S \cap H_{ER}^N$ is true), the vector of test statistics $(T_{EP}, T_{ER}^N)$ is asymptotically multivariate normally distributed with mean vector $\mu = (0,0)$ and covariance matrix
$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \quad\text{with}\quad \rho = \sqrt{\frac{c_P\,c_R}{(1+c_P)(1+c_R)}}$$
for $c_R = n_R/n_E$ and $c_P = n_P/n_E$. Let $d_\alpha$ be the equicoordinate $1-\alpha$ quantile of this distribution. With this definition, it is well known (cf. Hsu 7 ) that

$$L_{EP}^S = X_E - X_P - d_\alpha\,\sigma\sqrt{n_E^{-1} + n_P^{-1}} \quad\text{and}\quad L_{ER}^S = X_E - X_R - d_\alpha\,\sigma\sqrt{n_E^{-1} + n_R^{-1}}$$

give simultaneous lower confidence bounds for $(\mu_E - \mu_P, \mu_E - \mu_R)$ with coverage probability $1-\alpha$. As for the intervals introduced in Section 2.2, we need a filter for the interpretation of study results when using the single-step SCIs introduced here. We again use the superiority filter (4) and declare success of the study if either ‘R > P’ (i.e. $H_{RP}^S$ can be rejected) and $L_{ER}^S > -\delta_0$, or ‘R ≤ P’ (i.e. $H_{RP}^S$ cannot be rejected) and $L_{EP}^S > \delta_1$.
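
The equicoordinate quantile $d_\alpha$ has no closed form, but it can be obtained numerically from the bivariate normal distribution. The sketch below (ours, using scipy; not the authors' code) illustrates one way to do this:

    import numpy as np
    from scipy.stats import multivariate_normal, norm
    from scipy.optimize import brentq

    def equicoordinate_quantile(rho, alpha=0.025):
        """Solve P(Z1 <= d, Z2 <= d) = 1 - alpha for a standard bivariate normal with correlation rho."""
        mvn = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])
        f = lambda d: mvn.cdf([d, d]) - (1 - alpha)
        # the quantile lies between the unadjusted and the Bonferroni-adjusted normal quantile
        return brentq(f, norm.ppf(1 - alpha), norm.ppf(1 - alpha / 2))

    def single_step_sci(xe, xr, xp, ne, nr, np_, sigma, alpha=0.025):
        """Single-step simultaneous lower bounds (L_EP^S, L_ER^S)."""
        c_r, c_p = nr / ne, np_ / ne
        rho = np.sqrt(c_p * c_r / ((1 + c_p) * (1 + c_r)))
        d = equicoordinate_quantile(rho, alpha)
        l_ep = xe - xp - d * sigma * np.sqrt(1 / ne + 1 / np_)
        l_er = xe - xr - d * sigma * np.sqrt(1 / ne + 1 / nr)
        return l_ep, l_er

    # First row of Table 2(c) in Section 3.2: approximately (0.560, -0.337)
    print(single_step_sci(1.0, 1.0, 0.0, 356, 348, 145, sigma=2.0))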

3. Properties of SCIs and probability of success

3.1. Optimal sample sizes

Brannath et al. 3 presented a method to calculate the minimal sample size necessary to obtain a certain probability of success (i.e. the probability of correctly declaring the study a ‘success’), which is based on a procedure described by Schlömer and Brannath. 8 By ‘success’ we mean that we can show either non-inferiority of the new treatment to the reference in the case where the filter is satisfied (denoted by ‘Success ER’ in the following), or $\delta_1$-superiority of the new treatment over placebo in the case where the filter is not satisfied (denoted by ‘Success EP’). The overall success probability (‘Total PoS’) is the sum of Success ER and Success EP (for further explanation see Section 3.2). The algorithm to determine the optimal sample sizes is based on the observation that the hypotheses $H_{EP}^\vartheta: \mu_E - \mu_P \le \vartheta$, $H_{ER}^\vartheta: \mu_E - \mu_R \le \vartheta$ and $H_{RP}^\vartheta: \mu_R - \mu_P \le \vartheta$ can be tested using test statistics which are multivariate normally distributed. Given specific sample size allocations (specified via the ratios $c_R = n_R/n_E$ and $c_P = n_P/n_E$), a targeted probability of success, a significance level $\alpha$ and assumptions on the effects of E, R and P, the total sample size needed can then be derived via the cumulative distribution function of the respective multivariate normal distribution. In order to obtain the optimal sample sizes, the required total sample size N is minimized over the sample size allocations.
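
To make the quantity being optimized concrete, the following sketch estimates the total probability of success by straightforward Monte Carlo simulation for a fixed allocation, treating ‘Success ER’ and ‘Success EP’ exactly as defined in the preceding paragraph. It is our simplified stand-in for the multivariate-normal computation of Brannath et al. 3 and Schlömer and Brannath, 8 not their implementation, and it ignores further details of the exact multiple-testing procedure; the numerical inputs are the planning assumptions of scenario 1 below.

    import numpy as np
    from scipy.stats import norm

    def total_pos(n_e, n_r, n_p, mu_e, mu_r, mu_p, sigma, delta0, delta1,
                  alpha=0.025, n_sim=100_000, seed=1):
        """Monte Carlo estimate of the total probability of success with superiority filter (4)."""
        rng = np.random.default_rng(seed)
        z = norm.ppf(1 - alpha)
        xe = rng.normal(mu_e, sigma / np.sqrt(n_e), n_sim)
        xr = rng.normal(mu_r, sigma / np.sqrt(n_r), n_sim)
        xp = rng.normal(mu_p, sigma / np.sqrt(n_p), n_sim)
        se_ep = sigma * np.sqrt(1 / n_e + 1 / n_p)
        se_er = sigma * np.sqrt(1 / n_e + 1 / n_r)
        se_rp = sigma * np.sqrt(1 / n_r + 1 / n_p)
        rej_eps = (xe - xp) / se_ep >= z              # H_EP^S rejected (gatekeeper)
        rej_ern = (xe - xr + delta0) / se_er >= z     # H_ER^N rejected
        rej_epd = (xe - xp - delta1) / se_ep >= z     # H_EP^delta1 rejected
        filt = (xr - xp) / se_rp >= z                 # superiority filter (4): 'R > P'
        success_er = rej_eps & rej_ern & filt         # Success ER
        success_ep = rej_eps & rej_epd & ~filt        # Success EP
        return (success_er | success_ep).mean()

    # Scenario 1 allocation of Table 1 (superiority filter column); should be close to the targeted 90%
    print(total_pos(345, 350, 102, mu_e=1.0, mu_r=1.0, mu_p=0.0,
                    sigma=2.0, delta0=0.5, delta1=0.5))

Searching over allocations $(c_R, c_P)$ and, for each allocation, over the total sample size until this probability reaches the target then mimics the kind of optimization described above.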

The method presented by Brannath et al. 3 can also be applied to our setup with the IU filter (3). We want to compare the optimal sample sizes derived for this setup to those derived for the superiority filter (4). To compare the two approaches, we make the following assumptions:

  • desired power 90% for the success of the study,

  • one-sided FWER α=2.5% ,

  • independent normal observations with common standard deviation σ=2 ,

  • the effect of the new treatment is equal to the historical reference effect, $\mu_E - \mu_P = 1$.

We consider three different scenarios for the effect of the reference over the placebo where we assume a non-inferiority margin of δ0=0.5 , which corresponds to half of the historical reference effect:

  • Scenario 1: the reference is as good as observed historically, $\mu_R - \mu_P = 1$.

  • Scenario 2: the reference is only half as good as observed historically, $\mu_R - \mu_P = 0.5$.

  • Scenario 3: the reference fails and is as good as placebo, $\mu_R - \mu_P = 0$.

Furthermore, we consider the setup of scenario 1 but we assume a non-inferiority margin of δ0=0.25 , which corresponds to 25% of the historical reference effect:

  • Scenario 4: the reference is as good as observed historically, $\mu_R - \mu_P = 1$, with $\delta_0 = 0.25$.

Scenarios 2 and 3 are particularly relevant if the reference treatment is assumed to be ‘unstable’. Unstable reference effects are observed in indications such as depression, where there is a rather large uncertainty about the mean treatment effect to be expected in a specific trial, inter alia due to individual variability in treatment effects (see also EMA 9 and Kaiser et al. 10 ).

Table 1 and Figure 4 show the optimal sample sizes for the approach derived by Brannath et al. 3 (‘superiority filter’) and for the procedures involving SCIs as introduced in Section 2. As we can see, the informative SCIs require only slightly larger sample sizes than the procedure of Brannath et al. 3 throughout all scenarios. The single-step SCIs lead to the largest sample sizes for most values of v (the ratio of the assumed reference effect to the historical reference effect; see Figure 4). For the scenario in which the reference shows the full historical effect, the sample size needed for the IU filter is higher than that needed for the informative SCIs. For the other scenarios, in which the reference effect is assumed smaller than the historical reference effect, the required sample size for the IU filter is smaller than that of the informative SCIs.

Table 1.

Optimal sample sizes for the flexible non-inferiority design comparing the SCIs introduced in Section 2 to the procedure of Brannath et al. 3 without SCIs (see text for details).

Scenario Superiority filter IU filter Informative Single step
1 nE 345 356 349 402
nR 350 348 348 406
nP 102 145 104 100
N 797 849 801 908
2 nE 185 227 159 134
nR 182 75 216 253
nP 303 285 313 323
N 670 587 688 710
3 nE 341 306 348 397
nR 44 33 52 44
nP 339 325 346 399
N 724 661 746 840
4 nE 1352 1355 1345 1587
nR 1354 1354 1362 1593
nP 107 123 108 106
N 2813 2832 2815 3286

SCIs: simultaneous confidence intervals; IU: intersection union.

Figure 4.

Optimal sample sizes for the flexible non-inferiority design comparing the simultaneous confidence intervals (SCIs) introduced in Section 2 to the procedure of Brannath et al. 3 without SCIs (see text for details). The x-axis is the ratio v = reference effect/historical reference effect. Because of numerical instabilities, the values for the informative SCIs were smoothed. The grey line at v = 1 indicates the scenario where the reference is assumed to be as strong as historically observed.

For all methods, we see that the sample size for the reference group mostly increases in v, while the sample size for the placebo group decreases in v. This is due to the fact that for increasing v, the focus of the procedures shifts from the superiority comparison of experimental treatment versus placebo (which is more relevant for small v, i.e. a weak reference) to the non-inferiority comparison of experimental versus reference treatment (which is more relevant for large v, i.e. a strong reference).

We can see in the lower left panel of Figure 4 that there is a ‘bump’ in the sample size of the reference group for v around 0.5 for the method without SCIs as well as for the single-step and informative SCIs. This is due to the fact that in this region, the superiority filter, which is used for the interpretation of the study results with these procedures, begins to declare ‘R > P’, which shifts the focus from testing $H_{EP}$ towards $H_{ER}$. Hence, a higher sample size in the reference group is beneficial in these situations.

Usually, we are interested in keeping the placebo group as small as possible. We can see from the lower right panel of Figure 4 that the single-step intervals require substantially larger placebo group sample sizes than the other procedures for small values of v. For values of v larger than 0.6, we observe that the IU intervals require larger sample sizes in the placebo group than the other methods. The informative intervals seem to offer a good compromise, with sample sizes near those of the procedure of Brannath et al., 3 while offering the benefit of SCIs.

Comparing the results of scenario 4 to those of scenario 1, we observe that choosing a smaller non-inferiority margin has the expected effect of inflating the required sample sizes substantially. However, it can be seen that in this setup, too, the additional sample size required for the informative and IU intervals compared to the procedure without SCIs is negligible.

Note that usually we would only consider values of $v \le 1$, that is, in the sample size planning the reference would be considered at most as good as historically observed. This seems natural in the setup of possibly unstable references. One of the reviewers made us aware of the fact that values of $v > 1$ could still be considered in a setting with equivalence tests. Therefore, Figure 4 also shows values of $v > 1$. In the given scenario, where the effect of the new treatment is assumed to equal the historical reference effect, sample sizes increase substantially for such v, since showing non-inferiority to the reference is harder for larger reference effects.

As seen in Table 1 and Figure 4, the optimal sample size is very sensitive to the reference effect assumed for the current trial. Since there might be some uncertainty about the ratio v of the reference effect in the study to the historical reference effect, Brannath et al. 3 proposed a ‘hybrid’ Bayesian-frequentist approach to the sample size calculation. In this approach, several ratios v are considered and weighted with an assumed probability of their occurrence. More generally, denote the density of the probability distribution of v on [0,1] by f(v). Then the weighted success probability is given by

$$S = \int_0^1 S_v\, f(v)\, dv \tag{5}$$

where S is the targeted success probability and $S_v$ is the success probability if v is the true ratio. Brannath et al. 3 stated that S is often referred to as ‘probability of success’ (cf. Kunzmann et al. 11 ) or ‘assurance’ (cf. Stallard et al. 12 ). A very simple application of this approach is to define a grid of values for v and assign probabilities to each of the values. To illustrate this approach, we use a simple example: Since we do not expect the reference to fail completely, we expect v = 1 with some probability p and assume v = 3/4 with probability 1 − p. The probability of success in (5), which is targeted at 90% as before, is then given by

$$S = p\,S_1 + (1-p)\,S_{3/4} \tag{6}$$
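
For instance, with hypothetical values $S_1 = 0.93$, $S_{3/4} = 0.85$ and $p = 0.6$ (numbers chosen only to illustrate the arithmetic in (6), not taken from the simulations),

$$S = 0.6 \cdot 0.93 + 0.4 \cdot 0.85 = 0.898,$$

so such an allocation would just miss the 90% target and the total sample size would have to be increased.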

Figure 5 shows the optimal sample sizes obtained with this approach for $p \in [0,1]$. We see that for all values of p, the informative intervals yield sample sizes close to those of the procedure of Brannath et al. 3 We can also see that the IU intervals lead to a substantially larger placebo group compared to the other methods.

Figure 5.

Optimal sample sizes for the flexible non-inferiority design comparing the simultaneous confidence intervals (SCIs) introduced in Section 2 to the procedure of Brannath et al. 3 without SCIs (see text for details). The x-axis is the probability p that v = 1; v = 3/4 is assumed with probability 1 − p. Because of numerical instabilities, the values for the informative SCIs were smoothed.

Brannath et al. 3 indicated that ‘in practical applications, the choice of the values for v and p should be driven by knowledge from prior studies. I.e. the values chosen for v should reflect true ratios observed in prior studies and p should be driven by the beliefs as to how likely the respective values for v will be observed in the planned trial’.

3.2. Example

To illustrate the simultaneous confidence intervals proposed in Sections 2.1 to 2.3, as well as their interpretation in terms of our notion of ‘success’, we give an example with different outcomes. We assume the setting of scenario 1 of the previous section, resulting in the following optimal sample sizes for the IU filter: $n_E = 356$, $n_R = 348$ and $n_P = 145$.

Without loss of generality, the observed placebo mean is $X_P = 0$. Consider Table 2(a), where the lower confidence bounds $L_{EP}$ and $L_{ER}$ are calculated via (1) and (2) for different observed means $X_E$ and $X_R$. Furthermore, it is indicated whether the filter (3) is satisfied. Note that with the given assumptions, ‘R > P’ can be stated if and only if $X_R > 0.591$. If this filter is satisfied, $L_{ER}$ is of major interest and $L_{EP}$ is reported only additionally (in parentheses in the table). If the filter is not satisfied, then $L_{EP}$ is of major interest. In Table 2(b) and (c), the informative and single-step confidence bounds are calculated. Furthermore, it is indicated whether the filter (4) is satisfied. With the given assumptions, ‘R > P’ can be stated if and only if $X_R > 0.387$. For the informative SCIs, $q = 0.01$ was chosen. For more information on the choice of q, see Section 3.3.
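
The two thresholds quoted here follow directly from (3) and (4) with $X_P = 0$; a short sketch of the computation (ours):

    import numpy as np
    from scipy.stats import norm

    z = norm.ppf(0.975)              # one-sided alpha = 2.5%
    sigma, delta0 = 2.0, 0.5
    n_e, n_r, n_p = 356, 348, 145

    # IU filter (3): 'R > P' iff X_R - X_P exceeds this threshold
    iu_threshold = z * sigma * (np.sqrt(1/n_e + 1/n_p) - np.sqrt(1/n_e + 1/n_r)) + delta0
    # Superiority filter (4): 'R > P' iff X_R - X_P exceeds this threshold
    sup_threshold = z * sigma * np.sqrt(1/n_r + 1/n_p)

    print(round(iu_threshold, 3), round(sup_threshold, 3))   # approximately 0.591 and 0.387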

Table 2.

Example for the determination of the SCI with different filters and the resulting success.

(a) IU filter with corresponding SCIs. Success ER is satisfied if the IU filter holds and $L_{ER} \ge -0.5$. Success EP is satisfied if the IU filter does not hold and $L_{EP} > 0.5$. See text for further explanations.

XE     XR     IU filter   ℓEP     ℓER      LEP       LER       Success EP   Success ER
1.00   1.00   Yes         0.614   -0.295   (0.205)   -0.295    No           Yes
1.00   0.50   No          0.614   0.205    0.614     (0.114)   Yes          No
1.00   0.30   No          0.614   0.404    0.614     (0.114)   Yes          No
0.80   0.30   No          0.414   0.205    0.414     (-0.086)  No           No

(b) Superiority filter with informative SCIs ($q = 0.01$). Success ER is satisfied if the superiority filter holds and $L_{ER}^{inf} \ge -0.5$. Success EP is satisfied if the superiority filter does not hold and $L_{EP}^{inf} > 0.5$. See text for further explanations.

XE     XR     Superiority filter   LEP^inf   LER^inf   Success EP   Success ER
1.00   1.00   Yes                  0.561     -0.340    No           Yes
1.00   0.50   Yes                  0.607     0.063     No           Yes
1.00   0.30   No                   0.611     0.228     Yes          No
0.80   0.30   No                   0.407     0.063     No           No

(c) Superiority filter and single-step SCIs. Success ER is satisfied if the superiority filter holds and $L_{ER}^S \ge -0.5$. Success EP is satisfied if the superiority filter does not hold and $L_{EP}^S > 0.5$. See text for further explanations.

XE     XR     Superiority filter   LEP^S   LER^S    Success EP   Success ER
1.00   1.00   Yes                  0.560   -0.337   No           Yes
1.00   0.50   Yes                  0.560   0.163    No           Yes
1.00   0.30   No                   0.560   0.363    Yes          No
0.80   0.30   No                   0.360   0.163    No           No

SCIs: simultaneous confidence intervals; IU: intersection union.

In the first three rows of Table 2(a), the observed mean in the experimental group equals the historical reference effect. In the first row, $X_R > 0.591$, hence the IU filter is satisfied and the focus is on the bound $L_{ER} = \ell_{ER}$. Success is assessed via the column named Success ER in the table, namely whether $L_{ER} \ge -\delta_0$, which is the case here. In the second row, the IU filter is not satisfied; therefore, success is determined by $L_{EP} = \ell_{EP}$ and is defined via Success EP, namely $L_{EP} \ge \delta_1$. One has proven that the new treatment has a relevant effect, although the reference is only half as strong as historically observed. In the fourth row, the filter is also not satisfied. Now the effect of the experimental treatment is lower, and therefore no success can be attested. This rule is stricter than the simple Koch-Röhmel design, where non-inferiority of E versus R would be sufficient. Here one could even show superiority of E versus R via $\ell_{ER}$; however, this has no merit, because the reference is weak and $L_{ER}$ is only a shifted version of $L_{EP}$ and thus gives no additional information.

In the first two rows of Table 2(b) and (c), the superiority filter declares ‘R > P’ and the respective confidence bounds $L_{ER}^{inf}$ and $L_{ER}^S$ are both larger than $-\delta_0$, leading to a successful outcome (Success ER). In the last two rows, the observed mean $X_R$ is fairly small, leading the superiority filter to declare ‘R ≤ P’. In the third row, $L_{EP}^{inf}$ and $L_{EP}^S$ are both larger than $\delta_1 = 0.5$, leading to a Success EP. In the last row, $L_{EP}^{inf}$ and $L_{EP}^S$ both fall below that threshold, resulting in no claim of success. While both the informative and the single-step confidence intervals come to the same conclusions, we can see that they differ in size. In this example, we always have $L_{EP}^{inf} > L_{EP}^S$. This is particularly useful in the third row, because superiority over placebo is the main conclusion there. Note that there is no such trend for the bounds for the comparison of E and R.

3.3. Probability of success

We now compare the probability of success between the SCIs presented in this article and the method of Brannath et al. 3 using the superiority filter (4) but without calculating SCIs. The comparison to the latter procedure lets us assess the ‘cost’ (in terms of probability of success) of the additional information the SCIs provide. We ran a simulation with 100,000 repetitions of observations in the settings of scenarios 1 and 2 of Section 3.1. We counted how often the filters were satisfied and, in each case, how often the relevant success occurred (i.e. proof of non-inferiority if the filter is valid, proof of $\delta_1$-superiority if the filter is not valid) and what the median of the confidence bounds is. Note that the mean of $L_{ER}$ cannot be reported because this bound is minus infinity whenever the gatekeeper (superiority of E vs. P) fails to be shown. The mean of $L_{EP}$, however, did not differ strongly from its median, so we show here only the median as a measure of central tendency of the confidence bounds. Failure of the gatekeeper was, in any case, almost never a problem. For the informative SCIs, we investigated different choices of q. Here, we only present the results for $q = 0.01$, since it yielded the best results.

The results of the simulation are presented in Table 3 and Figures 6 and 7. We see that there is only very little loss in probability of success when comparing the informative SCIs, interpreted with the help of the superiority filter, to the procedure of Brannath et al. 3 using the superiority filter without SCIs. The maximum loss in probability of success observed in our simulations was about 1%, which is quite low considering the benefit in interpretation the SCIs offer. Throughout all simulations, we observed that the single-step confidence intervals interpreted with the help of the superiority filter yielded a lower probability of success than the procedure of Brannath et al. 3 or the informative SCIs. The loss in probability of success compared to the informative SCIs amounted to up to 10% in some scenarios. Comparing the SCIs with the IU filter to the procedures using the superiority filter, we see that in scenario 1, where the sample size is optimized under the assumption that the reference is as strong as historically observed, the IU filter offers substantially less probability of success for most reference effects (ratio v = reference effect/historical reference effect > 0.2). The loss in probability of success was as high as 13% for v = 0.65. In scenario 2, however, where the sample size is optimized under the assumption that the reference will only show half its historical effect in the present trial, the SCIs with the IU filter show the best results in terms of probability of success. However, the loss observed for the superiority filter in this scenario is at most around 6%, which is smaller than the loss the IU filter procedure showed in scenario 1. From this we can conclude that if small reference effects are expected, and the sample size is planned accordingly, the SCIs based on the IU filter should be chosen. In case of large expected reference effects, the informative SCIs offer the best performance.

Table 3.

Characteristics of simultaneous confidence intervals for the settings of scenarios 1 and 2 of Section 3.1 for different ratios of v =reference effect/historical reference effect. Without SCI is the probability of success for the procedure introduced by Brannath et al., 3 using the superiority filter but no SCIs. For the informative and single-step intervals, the superiority filter was used for interpretation of study results.

(a) Scenario 1 (sample size planning with v=1 ). Sample size: N=nE+nR+nP=356+348+145=849 .
Filter satisfied Probability of success
v IU Superiority Without SCI IU Informative Single step
1.50 100.0 100.0 2.5 2.5 2.5 1.3
1.25 100.0 100.0 37.9 37.9 37.9 28.1
1.00 98.1 99.9 91.2 89.5 91.2 86.0
0.75 79.2 96.7 96.9 85.5 96.9 96.6
0.50 32.7 71.6 82.2 73.2 81.9 78.8
0.25 4.3 24.2 72.4 71.7 71.9 63.1
0.00 0.1 2.5 72.0 72.0 71.8 62.1
(b) Scenario 2 (sample size planning with v=0.5 ). Sample size: N=nE+nR+nP=227+75+285=587 .
Filter satisfied Probability of success
v IU Superiority Without SCI IU Informative Single step
1.50 100.0 100.0 2.5 2.5 2.5 1.3
1.25 100.0 99.8 15.3 15.4 15.3 9.6
1.00 99.5 97.1 45.7 46.8 45.6 34.6
0.75 95.0 82.4 74.6 78.4 73.8 64.2
0.50 75.0 49.0 83.0 88.4 81.4 75.0
0.25 38.4 15.9 81.3 84.5 79.8 73.3
0.00 10.5 2.5 80.6 81.2 79.6 72.1

SCIs: simultaneous confidence intervals; IU: intersection union.

Figure 6.

Probability of success, success with valid filter (Power ER) and success without valid filter (Power EP) from simulations in scenario 1 (left column) and scenario 2 (right column). The x-axis is the ratio v = reference effect/historical reference effect. The vertical line corresponds to the ratio v for which the sample size was calculated.

Figure 7.

Median lower confidence limits from simulations in scenario 1 (left column) and scenario 2 (right column). The x-axis is the ratio v = reference effect/historical reference effect.

Note that the probability of success decreases for values of v larger than those assumed in the sample size calculation (vertical line in Figure 6). This is due to the fact that for increasing v, it becomes harder to prove non-inferiority of the experimental treatment to the (increasingly efficacious) reference. If we wanted to protect ourselves against the potential power loss associated with values of v larger than assumed during the planning of the trial, we could think of using adaptive designs, which would allow for a recalculation of the sample size during the course of the trial. However, this is not in the scope of this article. For values of v which exceed 1.5, the experimental treatment is inferior to the reference by more than the pre-specified non-inferiority margin. Therefore, a claim of ‘success’ would in this case constitute a type I error. We can see that in our simulations, the probability of this happening was bounded by 2.5%. This is in line with the fact, known from theoretical derivations, that the FWER is controlled by all of the presented procedures. Following a reviewer’s suggestion, we simulated the FWER for several scenarios under the null hypothesis. The results are provided in the Supplemental Material. The FWER was controlled in all setups considered.

4. Example – Application to trial data

Next, we apply the introduced confidence intervals to data from a three-arm study on major depressive disorder by Higuchi et al., 13 which is also discussed by Hida and Tango 14 and Brannath et al. 3 We will first present the example in its original version, as reported by Higuchi et al., 13 as well as the application of the adaptive test strategy of Brannath et al. 3 After that, we will apply the IU and the informative ($q = 0.01$) confidence intervals to the same data.

Higuchi et al. 13 compared the efficacy and safety of a 6-week treatment with duloxetine (E) to those of paroxetine (R) and placebo (P) in a double-blind, randomized, active-controlled, parallel-group study. Its primary endpoint was the HAM-D17 change from baseline at 6 weeks, and the statistical analysis was planned to test the superiority of duloxetine over placebo and the non-inferiority of duloxetine compared to paroxetine in hierarchical order. The non-inferiority margin was set to $\delta_0 = 2.5$. The following mean decreases in the HAM-D17 were observed: 10.2 ± 6.1 (mean ± SD) in the duloxetine group, 9.4 ± 6.9 in the paroxetine group and 8.3 ± 5.8 in the placebo group, with sample sizes $n_E = 147$, $n_R = 148$ and $n_P = 145$.

The resulting (unadjusted) one-sided 97.5% confidence interval for the difference in means between duloxetine and placebo is in this case given by $(0.53, \infty)$ (i.e. $\ell_{EP} = 0.53$), indicating superiority of duloxetine over placebo. The (unadjusted) one-sided 97.5% confidence interval for the difference in means between duloxetine and paroxetine can be calculated to be $(-0.69, \infty)$ (i.e. $\ell_{ER} = -0.69$) and excludes $-\delta_0 = -2.5$, indicating non-inferiority of duloxetine compared to paroxetine. However, the superiority of paroxetine over placebo could not be established, because the corresponding one-sided 97.5% confidence interval is given by $(-0.37, \infty)$ and therefore includes 0. Higuchi et al. 13 concluded that the non-inferiority of duloxetine compared to paroxetine did not have assay sensitivity.
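
These unadjusted bounds can be reproduced (up to rounding) from the reported summary statistics. The sketch below is our reconstruction using pairwise-pooled standard deviations and two-sample t quantiles; it need not match the exact analysis of Higuchi et al. 13 in every detail.

    import numpy as np
    from scipy.stats import t

    means = {"E": 10.2, "R": 9.4, "P": 8.3}   # mean HAM-D17 decrease at 6 weeks
    sds   = {"E": 6.1,  "R": 6.9, "P": 5.8}
    ns    = {"E": 147,  "R": 148, "P": 145}

    def lower_bound(a, b, alpha=0.025):
        """Unadjusted one-sided 97.5% lower confidence bound for mean(a) - mean(b)."""
        df = ns[a] + ns[b] - 2
        pooled_var = ((ns[a] - 1) * sds[a] ** 2 + (ns[b] - 1) * sds[b] ** 2) / df
        se = np.sqrt(pooled_var) * np.sqrt(1 / ns[a] + 1 / ns[b])
        return means[a] - means[b] - t.ppf(1 - alpha, df) * se

    print(round(lower_bound("E", "P"), 2))   # approximately  0.53
    print(round(lower_bound("E", "R"), 2))   # approximately -0.69
    print(round(lower_bound("R", "P"), 2))   # approximately -0.37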

Brannath et al. 3 applied their adaptive test strategy to the data, employing the superiority filter (4) to determine whether paroxetine was sufficiently efficacious. They used $\delta_1 = 2.5$ for the investigation of superiority of E versus P. As seen above, duloxetine can be shown to be superior to placebo (i.e. $H_{EP}^S$ is rejected). However, using the superiority filter (4), ‘R > P’ cannot be concluded, which is why, as a next step, they tested for $\delta_1$-superiority of duloxetine over placebo ($H_{EP}^{\delta_1}$). Since the one-sided 97.5% confidence interval for this comparison, $(0.53, \infty)$, contains $\delta_1 = 2.5$, $H_{EP}^{\delta_1}$ could not be rejected, and the study was not successful in the sense that there was not enough evidence to declare duloxetine sufficiently efficacious.

For illustrative purposes, Brannath et al. 3 investigated how their procedure would have performed if the observed mean change in the duloxetine group had been 12.2, with all other parameters unchanged. Under this assumption, duloxetine can be shown to be superior to placebo (i.e. $H_{EP}^S$ is rejected) and the superiority filter still concludes ‘R ≤ P’. However, one now has $\ell_{EP} = 2.53$, leading to a claim of $\delta_1$-superiority of duloxetine over placebo, which in this case can be interpreted as a successful study. Brannath et al. 3 noted, however, that the confidence intervals reported, ‘while appropriate for deriving the test decision in the hierarchical test, are no simultaneous confidence intervals, and therefore do not have simultaneous coverage probability’.

We now apply the simultaneous confidence intervals introduced in this manuscript to the original data of this study. Applying the intuitive approach outlined in Figure 2, we note that $H_{EP}: \mu_E - \mu_P \le 0$ can be rejected. The IU and the superiority filter both conclude ‘R ≤ P’, and the IU simultaneous confidence bounds are given as $L_{EP} = \ell_{EP} = 0.53$ and $L_{ER} = \ell_{EP} - \delta_0 = -1.97$. Since $L_{EP}$ falls below the superiority margin $\delta_1 = 2.5$, we cannot establish ‘success’ of the study. The informative confidence bounds are given as $L_{EP}^{inf} = 0.528$ and $L_{ER}^{inf} = -1.67$. Since $L_{EP}^{inf}$ also falls below the superiority margin $\delta_1 = 2.5$, we cannot establish ‘success’ of the study with the informative SCIs either. However, the simultaneous confidence bounds $L_{EP}$ and $L_{ER}$, respectively $L_{EP}^{inf}$ and $L_{ER}^{inf}$, allow for an interpretation of the magnitude of the treatment effects, which was not possible before.

Applying the confidence intervals to the hypothetical case where the observed mean change in the duloxetine group is assumed to be 12.2, with all other parameters unchanged, we obtain $\ell_{EP} = 2.53$ and $\ell_{ER} = 0.69$. Again, $H_{EP}: \mu_E - \mu_P \le 0$ can be rejected, and the IU and the superiority filter both conclude ‘R ≤ P’. The IU simultaneous confidence bounds are now given as $L_{EP} = \ell_{EP} = 2.53$ and $L_{ER} = \ell_{EP} - \delta_0 = 0.03$. Since $L_{EP} > \delta_1$, we now conclude $\delta_1$-superiority of duloxetine over placebo and, therefore, a ‘success’ of the study. For the informative SCIs, we obtain $L_{EP}^{inf} = 2.53$ and $L_{ER}^{inf} = 0.59$. Since $L_{EP}^{inf} > \delta_1$, we also conclude $\delta_1$-superiority of duloxetine over placebo and, therefore, a ‘success’ of the study with the informative SCIs. While Brannath et al. 3 came to the same conclusions with their significance tests, the proposed confidence bounds offer the advantage that we can derive conclusions about the size of the effects of duloxetine compared to placebo and paroxetine.

5. Summary and discussion

Brannath et al. 3 introduced a flexible extension to the three-arm ‘gold-standard’ non-inferiority design, in which it is possible to declare success of the study, even if the reference fails. We complemented the previous work with compatible simultaneous confidence intervals, thus providing more information than hypothesis testing alone. We introduced three methods to calculate confidence intervals in the given design: stepwise confidence intervals based on those introduced by Hsu and Berger, 4 ‘informative’ SCIs based on the work of Schmidt and Brannath 5 and single-step confidence intervals. While the informative and the single-step intervals rely on a ‘filter’ for the interpretation of study results, the intervals based on Hsu and Berger 4 offer an ‘intrinsic’ filter (IU filter) guaranteeing FWER control for all sample sizes.

We performed multiple simulations in which we compared the different procedures for calculating SCIs, as well as the procedure by Brannath et al., 3 in different scenarios. We saw that substantial power losses may occur with the IU filter compared to the superiority filter. However, when the reference effect is anticipated to be smaller than historically observed (and the study is planned accordingly), the SCIs with the IU filter may perform better than the superiority filter. We observed only very little loss in the probability of success for the informative SCIs compared to the procedure by Brannath et al., 3 which does not yield SCIs at all.

Finally, we applied the different SCIs to data from a real clinical trial, exemplifying the gain from calculating SCIs, which lies in the possibility to make statements about the size of the effects.

As mentioned in Section 3.3, an extension of the presented methods to adaptive designs might be desirable to protect against the potential loss in power if the reference effect is larger than expected at the time of the sample size calculation. A further extension of the proposed methods, also not discussed in this article, is the application to sequential designs to allow for interim testing.

Supplemental Material

sj-pdf-1-smm-10.1177_09622802231189592 - Supplemental material for Simultaneous confidence intervals for an extended Koch-Röhmel design in three-arm non-inferiority trials

Supplemental material, sj-pdf-1-smm-10.1177_09622802231189592 for Simultaneous confidence intervals for an extended Koch-Röhmel design in three-arm non-inferiority trials by Martin Scharpenberg and Werner Brannath in Statistical Methods in Medical Research

Appendix: Equivalence between formal and intuitive procedure

We will show that the confidence bounds $L_{EP}$ and $L_{ER}$ that are defined by the intuitive procedure of Figure 2 are identical to the bounds defined in Figure 1, that is, by formulas (1) and (2). Note that the quantities $\ell_{EP}$ and $\ell_{ER}$ are equivalent to the test statistics for $H_{EP}^S$ and $H_{ER}^N$, respectively, in the sense that $H_{EP}^S$ is rejected if and only if $\ell_{EP} \ge 0$, and $H_{ER}^N$ is rejected if and only if $\ell_{ER} \ge -\delta_0$. We will discuss all cases.

If $\ell_{EP} < 0$, then $H_{EP}^S$ is not rejected and $L_{EP} = \ell_{EP} < 0$, $L_{ER} = -\infty$ with both the formal and the intuitive design, because the first family of hypotheses $(H_{EP}^\vartheta)_{\vartheta \le 0}$ is not rejected completely and both procedures stop.

If $\ell_{EP} \ge 0$ and $\ell_{ER} < -\delta_0$, then $H_{EP}^S$ is rejected and $H_{ER}^N$ is accepted, so that $L_{EP} = 0$ and $L_{ER} = \ell_{ER}$ with the formal design. In the intuitive design, we can – informally written – see that ‘R > E > P’, that is, the filter is satisfied. More formally, we obtain from the definitions of $\ell_{EP}$ and $\ell_{ER}$ that

$$X_R - X_P = \ell_{EP} - (\ell_{ER} + \delta_0) + z_\alpha\,\sigma\left(\sqrt{n_E^{-1} + n_P^{-1}} - \sqrt{n_E^{-1} + n_R^{-1}}\right) + \delta_0.$$

Because $\ell_{EP} \ge 0$ and $\ell_{ER} + \delta_0 < 0$, it follows that the filter (3) holds. Therefore, $L_{ER} = \ell_{ER}$ and $L_{EP} = \max\{L_{ER} + \delta_0, 0\} = \max\{\ell_{ER} + \delta_0, 0\} = 0$, as in the formal design.

If $\ell_{EP} \ge 0$ and $\ell_{ER} \ge -\delta_0$, then we reject both $H_{EP}^S$ and $H_{ER}^N$ with the formal procedure. The confidence bounds are $L_{EP} = \ell_{ER} + \delta_0$, $L_{ER} = \ell_{ER}$ if the filter (3) holds, and $L_{EP} = \ell_{EP}$, $L_{ER} = \ell_{EP} - \delta_0$ if the filter does not hold. Consider now the intuitive design. If the filter holds, then $L_{ER} = \ell_{ER}$ and $L_{EP} = \max\{L_{ER} + \delta_0, 0\} = \ell_{ER} + \delta_0$. If the filter does not hold, then $L_{EP} = \ell_{EP}$ and $L_{ER} = L_{EP} - \delta_0 = \ell_{EP} - \delta_0$. All bounds correspond to those of the formal design in the respective cases.

Footnotes

The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship and/or publication of this article.

ORCID iDs: Martin Scharpenberg https://orcid.org/0000-0003-1584-0056

Werner Brannath https://orcid.org/0000-0002-8622-3904

Supplemental material: Supplemental materials for this article are available online.

References

  • 1. Koch A, Röhmel J. Hypothesis testing in the “gold standard” design for proving the efficacy of an experimental treatment relative to placebo and a reference. J Biopharm Statist 2004; 14: 315–325.
  • 2. Hauschke D, Pigeot I. Establishing efficacy of a new experimental treatment in the ‘gold standard’ design. Biom J 2005; 47: 782–786.
  • 3. Brannath W, Scharpenberg M, Schmidt S. A flexible design for non-inferiority studies. Stat Med 2022; 41: 5033–5045.
  • 4. Hsu JC, Berger RL. Stepwise confidence intervals without multiplicity adjustment for dose-response and toxicity studies. J Am Stat Assoc 1999; 94: 468–482.
  • 5. Schmidt S, Brannath W. Informative simultaneous confidence intervals in hierarchical testing. Methods Inf Med 2014; 53: 278–283.
  • 6. Maurer W, Hothorn L, Lehmacher W. Multiple comparisons in drug clinical trials and preclinical assays: a-priori ordered hypotheses. In: Vollmar J (ed.) Biometrie in der chemisch-pharmazeutischen Industrie. Stuttgart: Fischer Verlag, 1995, pp. 3–18.
  • 7. Hsu J. Multiple comparisons: theory and methods. Boca Raton: Chapman & Hall/CRC, 1996.
  • 8. Schlömer P, Brannath W. Group sequential designs for three-arm ‘gold standard’ non-inferiority trials with fixed margin. Stat Med 2013; 32: 4875–4889.
  • 9. EMA. Reflection paper on the need for active control in therapeutic areas where use of placebo is deemed ethical and one or more established medicines are available. European Medicines Agency, London (Draft, EMA/759784/2010), 2010.
  • 10. Kaiser T, Volkmann C, Karyotaki E, et al. Heterogeneity of treatment effects in trials on psychotherapy of depression. APA PsycArticles: Advance online publication, 2022. doi: 10.1037/cps0000079.
  • 11. Kunzmann K, Grayling M, Lee K, et al. A review of Bayesian perspectives on sample size derivation for confirmatory trials. Am Stat 2021; 75: 424–432.
  • 12. Stallard N, Posch M, Friede T, et al. Optimal choice of the number of treatments to be included in a clinical trial. Stat Med 2009; 28: 1321–1338.
  • 13. Higuchi T, Murasaki M, Kamijima K. Clinical evaluation of duloxetine in the treatment of major depressive disorder: placebo- and paroxetine-controlled double-blind comparative study. Jpn J Clin Psychopharmacol 2009; 12: 1613–1634.
  • 14. Hida E, Tango T. On the three-arm non-inferiority trial including a placebo with a prespecified margin. Stat Med 2011; 30: 224–231.
