Summary
For regulatory approval of a biosimilar product, extensive evaluations should be performed through rigorous clinical trials to establish the similarity between the reference product and the proposed biosimilar in terms of both efficacy and safety. Existing designs for biosimilar trials often use a single primary efficacy endpoint in trial monitoring and then separately evaluate the safety of the biosimilar product in a secondary analysis at trial completion. However, ignoring the safety endpoint and the correlation between safety and efficacy in trial monitoring may lead to a high false positive rate, or it may forgo the opportunity to terminate the trial early when dissimilarity in safety could be detected at an interim analysis. We propose a Bayesian optimal design for biosimilar trials that incorporates both safety and efficacy endpoints in a unified framework. Based on a Bayesian joint model for safety and efficacy, we sequentially use a so-called Bayesian biosimilar probability to make go/no-go decisions. We calibrate the Bayesian design to maximize the statistical power while maintaining the frequentist type I error rate at the nominal level. Extensive simulation studies show that the design has desirable performance in terms of the false positive rate and the average sample size. We also apply the proposed design to a biosimilar trial evaluating a ranibizumab product.
Keywords: Bayesian optimal design, Biosimilar, Co-primary endpoints, Power, Sequential design
1 ∣. INTRODUCTION
Biological products are substances mainly derived from living organisms and include vaccines, blood components, gene therapies, tissues, and proteins such as monoclonal antibodies1, which vary widely in size and structural complexity. As a result, the manufacturing process for innovator biological products is quite complex, leading to high development costs2. A biosimilar is a biological product that is “highly similar” to an approved reference biological product in terms of efficacy, safety, and quality3. In contrast to innovator biological products, biosimilars can avoid duplicating expensive clinical trials and have lower development risks and costs. Hence, research on biosimilars is crucial to the pharmaceutical industry because of the cost savings for healthcare systems and consumers. Recent years have witnessed a booming biosimilar market, especially in Asia, because low-cost biosimilars provide a desirable solution to the pricing hurdle4. In addition, the growing number of patents on approved biological products that are set to expire has accelerated the development of biosimilars.
For regulatory approval, the analytic similarity between the biosimilar and the reference product (i.e., biosimilarity) should be established through well-designed clinical trials. Regulatory agencies such as the US Food and Drug Administration (FDA) and the European Medicines Agency (EMA) have published guidelines for the development and approval of biosimilars3,5. Generally speaking, the definition of biosimilarity does not vary much among regulators. According to the FDA guideline3, biosimilarity means that “the biological product is highly similar to the reference product notwithstanding minor differences in clinically inactive components,” and that “there are no clinically meaningful differences between the biological product and the reference product in terms of the safety, purity, and potency of the product.” It is worth noting that these definitions all emphasize the importance of the safety and efficacy of biosimilars. In other words, a manufacturer must demonstrate that its proposed biosimilar product has no clinically or statistically meaningful differences from the reference product in terms of safety and efficacy.
In practice, it is more complex and challenging to characterize biological products6, because they differ fundamentally from small-molecule drugs in aspects such as structure, safety, and pharmacological mechanism7. It is almost impossible for a biosimilar to have exactly the same active ingredients as the reference biological product. Hence, researchers cannot directly apply the conventional methods used to evaluate bioequivalence for small-molecule drugs to establish biosimilarity. Many statistical methods have been developed for conducting biosimilar clinical trials. Chow and Liu8 proposed a two-group parallel design for a bridging study based on pharmacokinetic (PK), pharmacodynamic (PD), and efficacy endpoints. Chiu et al.9 used a Bayesian method to borrow historical data on approved biological products as prior information for biosimilars. Pan et al.10 proposed a Bayesian group sequential design that integrates historical information through a calibrated power prior approach. Uozumi and Hamada11 developed an adaptive seamless PK and efficacy design that combines the tests of PK parameters and efficacy. Weiss et al.12 proposed to incorporate prior information from historical data on the reference product and from in vitro and phase I PK/PD studies of the biosimilar. Mielke et al.13 introduced a hybrid Bayesian-frequentist approach for incorporating historical data while controlling the type I error rate. Psioda et al.14 proposed a general Bayesian design framework to test biosimilarity in several indications simultaneously. Belay et al.15 proposed a two-stage Bayesian adaptive design for biosimilar trials with time-to-event endpoints.
Existing works typically focus on a single primary efficacy endpoint and seldom take the correlation between safety and efficacy into account. However, biological products are likely to cause systemic adverse effects, which are closely associated with the efficacy of the product16. For example, both prospective and retrospective analyses in non-small cell lung cancer patients have shown an association between the onset of immune-related adverse events (irAEs) and the efficacy of anti-PD-1 and anti-PD-L1 antibodies17. Patients who experienced irAEs had superior progression-free survival and overall survival compared with those who did not. Negative correlations between safety and efficacy have also been observed for biological products2. Because of such safety–efficacy interactions, it is possible for a biosimilar product to be similar to the reference product in efficacy while differing substantially in safety profile. In this case, using efficacy as the sole primary endpoint may result in a false positive conclusion, or in a waste of resources if the dissimilarity cannot be detected early.
In this paper, we propose a Bayesian optimal design for biosimilar trials (BOB) that simultaneously uses both safety and efficacy endpoints to evaluate the biosimilarity between a test biosimilar product and its reference product. Considering both safety and efficacy endpoints poses challenges for the statistical design. For example, the conclusions based on the efficacy and safety endpoints separately may point in different directions, making the biosimilarity difficult to interpret. In addition, the correlation between efficacy and safety is typically unknown, and misspecifying it may reduce the study power or inflate the type I error rate. To address these challenges, we build a joint model for efficacy and safety under the Bayesian framework. Based on the joint posterior distribution of the efficacy and safety parameters, we sequentially evaluate the biosimilarity to make interim go/no-go decisions. We propose calibration steps that maximize the power of this Bayesian design while maintaining the frequentist type I error rate at the prespecified nominal level. Our design is flexible in that clinicians can adapt the design settings and control the trial stringency according to their understanding of the biosimilar to be tested. Simulation studies show that the proposed design yields power comparable to that of competing designs and performs better in controlling the type I error rate under various scenarios.
The remainder of this paper is organized as follows. In Section 2, we describe our joint probability model and propose a Bayesian optimal design considering both safety and efficacy endpoints. In Section 3, we conduct simulation studies to evaluate the performance of the new design. In Section 4, we exemplify the proposed design based on the reported data from a randomized clinical trial that evaluates a proposed ranibizumab biosimilar product. Finally, we conclude this paper with a discussion in Section 5. The R code to implement the proposed BOB design can be found at https://github.com/xiaohanchi/BOB_Design.
2 ∣. METHODS
2.1 ∣. Probability Model
In a biosimilar trial that tests the biosimilarity between a reference (R) product and a test (T) biosimilar, we consider a continuous efficacy response and a binary safety indicator as the co-primary efficacy and safety endpoints, respectively. In this paper, we assume equal randomization, so that the numbers of patients are equal between arms T and R. The proposed design is not tied to the equal randomization assumption and can be adapted to accommodate various randomization schemes, including any fixed-ratio randomization or outcome-adaptive randomization.
Assume that n patients have been enrolled and treated in each arm so far, so that a pair (Xki, Yki) is available for each patient i in arm k, i = 1, … , n and k = T, R, where Xki represents a continuous efficacy outcome (for example, the original or log-transformed efficacy measurement), and Yki indicates whether the patient has experienced an adverse event (AE), with Yki = 1 if a toxicity is observed during follow-up and Yki = 0 otherwise. We use a marginal-conditional approach to jointly model the bivariate safety–efficacy outcomes. In particular, we assume that Yki marginally follows a Bernoulli distribution and that, given Yki, Xki follows a conditional normal distribution whose mean depends on the value of Yki. That is,
$$Y_{ki} \sim \text{Bernoulli}(p_k), \qquad X_{ki} \mid Y_{ki} = y \sim N\big(\mu_{k1}\, y + \mu_{k0}(1-y), \ \sigma_k^2\big), \quad y \in \{0, 1\}, \tag{2.1}$$
where pk is the probability of a patient having an AE (i.e., the toxicity probability), μk1 and μk0 are the conditional means of Xki for patients with an AE (Yki = 1) and without an AE (Yki = 0), respectively, and σk2 is the conditional variance. For simplicity, we assume a common conditional variance for Xki ∣ Yki = 1 and Xki ∣ Yki = 0; this equal-variance assumption can be relaxed straightforwardly. For arm k = T or R, the unconditional mean (μk, the population treatment effect in terms of efficacy) and variance (τk2) of Xki, and the correlation (ρk) between Xki and Yki, are respectively
$$\mu_k = p_k\mu_{k1} + (1-p_k)\mu_{k0}, \qquad \tau_k^2 = \sigma_k^2 + p_k(1-p_k)(\mu_{k1}-\mu_{k0})^2, \qquad \rho_k = \frac{(\mu_{k1}-\mu_{k0})\sqrt{p_k(1-p_k)}}{\tau_k}. \tag{2.2}$$
Clearly, the correlation between safety and efficacy in arm k is determined by the values of μk0 and μk1, as well as the toxicity probability pk, and it reduces to zero in the special case μk0 = μk1.
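To make the moment relations in (2.2) concrete, the following R sketch (illustrative only; the parameter values and variable names are ours, not taken from the paper or its accompanying code) computes the unconditional mean, variance, and safety–efficacy correlation implied by a set of conditional parameters for one arm, and checks them against a large simulated sample.

```r
## Illustrative sketch: moments implied by the marginal-conditional model (2.1)-(2.2).
## The parameter values below are hypothetical examples.
set.seed(1)
p_k   <- 0.5   # toxicity probability
mu_k0 <- 0.0   # conditional mean of X given Y = 0 (no AE)
mu_k1 <- 0.5   # conditional mean of X given Y = 1 (AE)
sig_k <- 0.8   # common conditional standard deviation

## Closed-form unconditional moments from (2.2)
mu_k   <- p_k * mu_k1 + (1 - p_k) * mu_k0
tau2_k <- sig_k^2 + p_k * (1 - p_k) * (mu_k1 - mu_k0)^2
rho_k  <- (mu_k1 - mu_k0) * sqrt(p_k * (1 - p_k)) / sqrt(tau2_k)

## Monte Carlo check of the three formulas
n <- 1e6
y <- rbinom(n, 1, p_k)
x <- rnorm(n, mean = ifelse(y == 1, mu_k1, mu_k0), sd = sig_k)
c(mu_k, mean(x))      # unconditional mean
c(tau2_k, var(x))     # unconditional variance
c(rho_k, cor(x, y))   # safety-efficacy correlation
```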
We base the decision-making process in biosimilar trials on Bayesian posterior probabilities18. Under the Bayesian paradigm, we assume conjugate prior distributions for the unknown parameters of the form
$$p_k \sim \text{Beta}(a_k, b_k), \qquad \mu_{k0}, \mu_{k1} \mid \sigma_k^2 \ \overset{\text{i.i.d.}}{\sim}\ N\!\left(m_k, \ \sigma_k^2/\upsilon_k\right), \qquad \sigma_k^2 \sim \text{IG}(\alpha_k, \beta_k),$$
where “i.i.d.” means independent and identically distributed; ak, bk, mk, υk, αk, and βk are prespecified hyperparameters for arm k = T, R, whose values need not be the same between treatment arms; and IG(α, β) denotes an inverse gamma distribution with shape parameter α and scale parameter β.
Given the observed data Dn = {(XR1, YR1), … , (XRn, YRn); (XT1, YT1), … , (XTn, YTn)} of the first 2n patients in arms T and R, the conjugacy of the priors yields a closed-form posterior for (pk, μk0, μk1, σk2): pk ∣ Dn follows a beta distribution, σk2 ∣ Dn follows an inverse gamma distribution, and μk0, μk1 ∣ σk2, Dn follow normal distributions, with hyperparameters updated by the standard conjugate formulas.
Based on simple algebra, after integrating out σk2, the marginal posterior distributions of μk0 and μk1 are t-distributions, where tν(μ, σ2) denotes a t-distribution with ν degrees of freedom, location parameter μ, and scale parameter σ2. The posterior distributions of μk and τk2 are then derived from (2.2) based on the marginal posterior distributions of pk, μk0, μk1, and σk2, which can be easily obtained using the Markov chain Monte Carlo method.
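The following R sketch illustrates exact posterior sampling for a single arm under the conjugate structure described above. The explicit hyperparameter updates are the standard beta and normal–inverse-gamma results; they are our own derivation, consistent with the t marginals stated above, and are not necessarily identical in notation to the authors' implementation. Both AE groups are assumed non-empty.

```r
## Illustrative sketch of posterior sampling for one arm (notation ours).
## Prior: p ~ Beta(a, b); mu_g | sigma2 ~ N(m, sigma2 / v) for g = 0, 1; sigma2 ~ IG(alpha, beta).
post_draws <- function(x, y, a = 1, b = 1, m = 0, v = 0.001,
                       alpha = 0.001, beta = 0.001, ndraw = 10000) {
  n  <- length(y); n1 <- sum(y); n0 <- n - n1
  x0 <- x[y == 0]; x1 <- x[y == 1]          # assumes n0 >= 1 and n1 >= 1
  ss <- sum((x0 - mean(x0))^2) + sum((x1 - mean(x1))^2)
  beta_n <- beta + 0.5 * (ss +
              v * n0 / (v + n0) * (mean(x0) - m)^2 +
              v * n1 / (v + n1) * (mean(x1) - m)^2)
  p      <- rbeta(ndraw, a + n1, b + n0)                              # p | Dn
  sigma2 <- 1 / rgamma(ndraw, shape = alpha + n / 2, rate = beta_n)   # sigma2 | Dn
  mu0 <- rnorm(ndraw, (v * m + n0 * mean(x0)) / (v + n0), sqrt(sigma2 / (v + n0)))
  mu1 <- rnorm(ndraw, (v * m + n1 * mean(x1)) / (v + n1), sqrt(sigma2 / (v + n1)))
  ## transform to the unconditional quantities in (2.2)
  mu  <- p * mu1 + (1 - p) * mu0
  tau <- sqrt(sigma2 + p * (1 - p) * (mu1 - mu0)^2)
  data.frame(p = p, mu = mu, tau = tau)
}
```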
In some cases, because manufacturing methods for biological products evolve rapidly, historical information may not be available for the current trial. In addition, because of a lack of clinical knowledge about new biosimilars, researchers are unlikely to have enough prior information about the tested biological products. Under such circumstances, to avoid misleading inference caused by the inappropriate use of prior information, we adopt non-informative priors by taking ak = bk = 1 and υk, αk, βk → 0. Of note, we have additionally investigated the use of half-Cauchy or uniform distributions as the prior on σR, and found that the proposed design is not sensitive to the choice of the prior on σR2 (or σR) as long as it is non-informative.
In other situations, useful historical information may be accessible for long-established biological products, in which case we can incorporate information from former studies as informative priors for the reference product. For example, we can choose the hyperparameter values for arm R based on clinicians’ understanding of the reference product and the historical studies, while keeping non-informative priors for arm T. Alternatively, to avoid prior-data conflict, we can consider approaches that adaptively incorporate historical information, such as the calibrated power prior approach10 and robust meta-analytic priors19. However, investigating the performance of the proposed design with informative priors is beyond the scope of this paper and will be examined in separate work.
2.2 ∣. Bayesian Optimal Design
Denote δp = pT – pR as the difference in toxicity probabilities between products T and R, and δμ = (μT – μR)/τR as the scaled difference in treatment effects. Biological products, including biosimilars, usually exhibit high within-subject variability in efficacy20. Rather than using the unadjusted difference μT – μR, we hereby adopt δμ, following the scaled average bioequivalence/biosimilarity criterion (SABE)21 advocated by the FDA advisory committee for pharmaceutical science and clinical pharmacology, to assess the interchangeability in efficacy between the reference and test products. Formally, we propose to evaluate the biosimilarity between T and R through testing the following hypotheses:
$$H_0: \delta_p \notin (\delta_p^L, \delta_p^U) \ \text{ or } \ \delta_\mu \notin (\delta_\mu^L, \delta_\mu^U) \qquad \text{versus} \qquad H_1: \delta_p \in (\delta_p^L, \delta_p^U) \ \text{ and } \ \delta_\mu \in (\delta_\mu^L, \delta_\mu^U), \tag{2.3}$$
where δpL and δpU (or δμL and δμU) are regulatory lower and upper biosimilarity limits for the safety endpoint (or the efficacy endpoint), respectively. The null hypothesis H0 means that δp or δμ (or both) does not meet its predefined criterion for biosimilarity, i.e., T and R are not similar in terms of safety or efficacy. The alternative hypothesis H1 means that both parameters satisfy their respective criteria, i.e., T and R are similar in both the toxicity probability and the scaled mean efficacy. In other words, biosimilarity is established if H0 is rejected. The values of (δpL, δpU) or (δμL, δμU) can be specified by the investigators to reflect a clinically meaningful and regulatorily reasonable indifference margin for safety or efficacy, respectively. In particular, (δμL, δμU) can be specified according to the reference-scaled average bioequivalence approach22,23. For example, if the unscaled biosimilarity limits are specified as ±0.2, the scaled limits (δμL, δμU) can be defined as ±0.2/σW0, where σW0 is a benchmark value predefined by the regulatory agency, say σW0 = 0.25. Furthermore, mixed scaling limits can be used to relax the criterion for establishing biosimilarity when the reference variability exceeds the benchmark value σW0.
In our design, we adopt a so-called Bayesian biosimilar probability (BBP) to evaluate the biosimilar hypotheses (2.3) under the Bayesian framework,
$$\mathrm{BBP} = \Pr\!\left(\delta_p^L < \delta_p < \delta_p^U \ \text{ and } \ \delta_\mu^L < \delta_\mu < \delta_\mu^U \ \middle|\ D_n\right), \tag{2.4}$$
which can be numerically calculated based on the joint posterior distribution of (pT, pR, μT, μR, τR) ∣ Dn derived in Section 2.1. Here, BBP is a probability measure that quantifies the degree of similarity between products T and R in terms of safety and efficacy simultaneously. The correlation between safety and efficacy is also naturally incorporated in this probability. The larger the BBP, the smaller the difference between the two products in terms of safety and efficacy.
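Given posterior draws from both arms, the BBP in (2.4) can be estimated as the proportion of joint draws that fall inside the biosimilarity region. The sketch below is illustrative: post_draws() is the hypothetical helper from the previous sketch, and the margins shown are those used later in Section 3, not universal defaults.

```r
## Illustrative sketch: Monte Carlo estimate of the BBP in (2.4).
compute_bbp <- function(xT, yT, xR, yR,
                        dp_lim = c(-0.20, 0.20), dmu_lim = c(-0.40, 0.40)) {
  dT <- post_draws(xT, yT)
  dR <- post_draws(xR, yR)
  dp  <- dT$p - dR$p                # posterior draws of delta_p
  dmu <- (dT$mu - dR$mu) / dR$tau   # posterior draws of the scaled efficacy difference
  mean(dp > dp_lim[1] & dp < dp_lim[2] & dmu > dmu_lim[1] & dmu < dmu_lim[2])
}
```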
We propose to evaluate the biosimilarity between products T and R in a group sequential manner and use BBP to make interim go/no-go decisions. Specifically, let J denote the total number of analyses, including J – 1 interim analyses and one final evaluation; and let nj denote the number of patients in each arm accumulated up to the jth analysis, j = 1, 2, … , J. The maximum number of patients to be enrolled in the trial is 2nJ. The steps for making the go/no-go decision rules in the proposed BOB design are described as follows:
- Initialization: Enroll and equally randomize the first 2n1 patients to arms T and R.
- Interim analyses: Given the jth interim data Dnj = {(XT1, YT1), … , (XTnj, YTnj); (XR1, YR1), … , (XRnj, YRnj)}, j = 1, 2, … , J – 1, we update the BBP, denoted as BBPj, and make go/no-go decisions based on the following rules.
- If BBPj < C(nj), terminate the trial early and conclude that products T and R are not similar.
- If BBPj ≥ C(nj), continue to enroll and randomize patients to arms T and R until the (j + 1)th interim analysis.
- Final analysis: If the trial has not been terminated early, then a final analysis will be conducted based on BBPJ, which is calculated using all observed data of a total of 2nJ patients.
- If BBPJ < C(nJ), accept H0 and conclude that products T and R are not similar.
- If BBPJ ≥ C(nJ), reject H0 and conclude that products T and R are similar in terms of both efficacy and safety.
Following Zhou et al.24, we assume that the probability cutoff C(nj) is a function of the interim sample size nj with the following form
$$C(n_j) = \lambda \left(\frac{n_j}{n_J}\right)^{\gamma}, \tag{2.5}$$
where λ > 0 and γ > 0 are tuning parameters that control the operating characteristics of the design, and nJ is the maximum sample size in each arm. Note that the value of BBP depends on the sample size of Dn. When the sample size is small, the joint posterior distribution of δp and δμ tends to have large variability, leading to a moderate BBP value even under the alternative hypothesis. We require γ > 0 so that C(nj) increases monotonically with the interim sample size nj, or equivalently, with the accumulated information. The probability of making a “no-go” decision is therefore smaller at the early stages of the trial. The purpose of such a choice is to make as few mistakes as possible at the beginning of the trial when information is sparse. As the trial proceeds and information accumulates, the increasing value of C(nj) imposes a more stringent rule for making a “go” decision. In the proposed design, λ and γ are carefully calibrated using simulated data so that the frequentist type I error rate is maintained at the nominal level and the study power is maximized. Specific steps for calibrating λ and γ are described in Section 2.4.
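A minimal sketch of the cutoff (2.5) and the resulting interim go/no-go rule is given below; the values of lambda and gamma are placeholders standing in for the calibrated tuning parameters.

```r
## Illustrative sketch: cutoff (2.5) and an interim go/no-go check.
cutoff <- function(n_j, n_J, lambda, gamma) lambda * (n_j / n_J)^gamma

interim_decision <- function(bbp_j, n_j, n_J, lambda = 0.95, gamma = 1) {
  if (bbp_j < cutoff(n_j, n_J, lambda, gamma)) "no-go: stop, T and R not similar"
  else "go: continue (or claim similarity at the final analysis)"
}

## e.g., at the second of four equally spaced looks with n_J = 160 per arm
interim_decision(bbp_j = 0.62, n_j = 80, n_J = 160)
```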
In this paper, we only terminate the trial early when the interim data indicate that the biosimilar drug is different from the reference product in either the toxicity probability or the scaled treatment effect. It is also technically possible to generalize the BOB design by additionally including an early stopping rule for strong evidence of biosimilarity. This can be done by introducing an upper cutoff of BBPj, in addition to the lower cutoff C(nj). However, a caveat of such a practice is that early stopping due to biosimilarity tends to result in a smaller sample size of the trial, thus leading to insufficient evidence in terms of other perspectives, such as PK/PD or immunogenicity analyses.
2.3 ∣. Frequentist Operating Characteristics
Following the FDA guidance on complex innovative clinical trial designs25, we propose to control the frequentist operating characteristics of the proposed Bayesian design in a “hybrid” sense. In our framework, the test of the null hypothesis H0 is an intersection–union test (IUT) consisting of tests for the respective safety and efficacy endpoints. H0 can be rejected if and only if each of the individual hypotheses for both endpoints in H0 can be rejected. In the aforementioned Bayesian decision framework, rejection of H0 requires large BBPj’s throughout the trial. Let ℛ denote the event of rejecting the null hypothesis H0. In the sequential BOB design, ℛ = {BBPj ≥ C(nj) for all j = 1, … , J}.
Given a pair of true values (δp, δμ), the probability of rejecting H0, denoted as π(δp, δμ), is π(δp, δμ) = Pr(ℛ ∣ δp, δμ).
Denote α as the prespecified level of the type I error rate. To calibrate the proposed Bayesian design, we suggest two procedures to control the overall type I error rate: one corresponds to a stringent control and the other to a loose control. Because H0 is a composite null hypothesis, the overall type I error rate for the test of the composite hypothesis (2.3) is the largest value of π(δp, δμ) over the region Θ0 defined by H0, i.e., the supremum of π(δp, δμ) over Θ0.
More specifically, the first procedure controls the maximum type I error rate over Θ0 in a point-wise way. In other words, we propose to identify the parameters (λ, γ) such that the maximum type I error rate over (δp, δμ) ∈ Θ0 is controlled at the prespecified level α, that is, sup(δp, δμ)∈Θ0 π(δp, δμ) ≤ α.
The maximum type I error rate can be identified using a grid search algorithm by numerically enumerating values of (δp, δμ) in Θ0. We denote the proposed BOB design based on such a stringent type I error controlling procedure as BOBs. Noticing that the combinations with δp fixed at δpL or δpU (or δμ fixed at δμL or δμU) represent the least favorable scenarios in Θ0, we can use the following simplified procedure for type I error rate control,
$$\max_{1 \le i \le 4} \ \sup_{(\delta_p,\, \delta_\mu)\in\Theta_{0i}} \pi(\delta_p, \delta_\mu) \le \alpha, \tag{2.6}$$
where Θ01 = {δp = δpL, δμL ≤ δμ ≤ δμU}, Θ02 = {δp = δpU, δμL ≤ δμ ≤ δμU}, Θ03 = {δpL ≤ δp ≤ δpU, δμ = δμL}, and Θ04 = {δpL ≤ δp ≤ δpU, δμ = δμU} denote the four boundary segments of Θ0 corresponding to the least favorable scenarios.
The second procedure controls, in a Bayesian way, the maximum of four average type I error rates under the four null scenarios in which δp or δμ is fixed at its lower or upper biosimilarity limit, respectively. Given a parameter set Θ, denote the average probability of rejecting H0 as
$$\bar{\pi}(\Theta) = \int \pi(\delta_p, \delta_\mu)\, f(\delta_p, \delta_\mu \mid \Theta)\, d\delta_p\, d\delta_\mu,$$
where f(δp, δμ ∣ Θ) is the joint density function of (δp, δμ) given Θ. We additionally propose to control the type I error rate in a loose sense, that is,
$$\max_{1 \le i \le 4} \ \bar{\pi}(\Theta_{0i}) \le \alpha, \tag{2.7}$$
where, for each boundary segment Θ0i defined above, (δp, δμ) is taken to be uniformly distributed over Θ0i when computing π̄(Θ0i).
We denote the proposed BOB design based on such a control of the average type I error rate as BOBavg. Note that the uniform distribution in Θ0i serves as the weighting function for calculating the average type I error rate, and it can be replaced by other distributions such as the truncated normal distribution. The choice of an appropriate weighting function is crucial to the design calibration and it determines the power of the proposed design. Clinicians and statisticians should choose reasonable settings in consideration of the characteristics of the tested products and available resources.
In general, BOBs provides a more stringent control of the type I error rate than BOBavg because the former controls the maximum type I error rate point-wisely, whereas the latter controls the average rate. In both procedures, we fix the value of δp or δμ at its lower or upper biosimilarity limit while varying the value of the other, reflecting the fact that two-sided tests are considered in our design. Analogously to the type I error control, we calibrate the proposed design to maximize the empirical power under a prespecified alternative parameter set Θ1 within the region defined by H1, so that the empirical power can be expressed as π̄(Θ1).
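The two control procedures can be sketched as follows. Here reject_prob(dp, dmu, lambda, gamma) is a hypothetical placeholder that estimates π(δp, δμ) by simulating BOB trials under a given cutoff; the boundary segments and the uniform weighting mirror the description above, and the margins are those used in Section 3.

```r
## Illustrative sketch contrasting the BOBs and BOBavg calibration targets.
dp_lim  <- c(-0.20, 0.20)
dmu_lim <- c(-0.40, 0.40)
boundary <- rbind(                      # grid over the four least favorable segments
  expand.grid(dp = dp_lim, dmu = seq(dmu_lim[1], dmu_lim[2], length.out = 11)),
  expand.grid(dp = seq(dp_lim[1], dp_lim[2], length.out = 11), dmu = dmu_lim))

## BOBs: point-wise maximum type I error over the boundary grid, as in (2.6)
max_t1e <- function(lambda, gamma)
  max(mapply(reject_prob, boundary$dp, boundary$dmu,
             MoreArgs = list(lambda = lambda, gamma = gamma)))

## BOBavg: average over each boundary segment under a uniform weighting, then take
## the maximum of the four averages, as in (2.7)
avg_t1e <- function(lambda, gamma, nmc = 200) {
  seg_avg <- function(dp, dmu)
    mean(mapply(reject_prob, dp, dmu, MoreArgs = list(lambda = lambda, gamma = gamma)))
  max(seg_avg(dp_lim[1], runif(nmc, dmu_lim[1], dmu_lim[2])),
      seg_avg(dp_lim[2], runif(nmc, dmu_lim[1], dmu_lim[2])),
      seg_avg(runif(nmc, dp_lim[1], dp_lim[2]), dmu_lim[1]),
      seg_avg(runif(nmc, dp_lim[1], dp_lim[2]), dmu_lim[2]))
}
```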
2.4 ∣. Design Calibration
In our proposed design, the parameters λ and γ introduced in Section 2.2 should be calibrated through a simulation-based approach. Our strategy to optimize λ and γ is based on a grid search: first we find the values of (λ, γ) that yield the desirable empirical type I error rate, then select the one that produces high empirical power and small expected sample size as the optimal design parameters. Specifically, we calibrate the parameters λ and γ according to the following steps:
Step 1: Elicit the null hypothesis H0 and the alternative hypothesis H1 from investigators, and specify the level of the type I error rate α and the maximum sample size. To make sure the BOB design can have adequate power under a reasonable sample size, the endpoint with a larger variance usually requires a wider biosimilarity margin.
Step 2: Perform a grid search to find all possible combinations of (λ, γ) for the stopping cutoff C(nj) that lead to a type I error rate of less than or equal to the prespecified level α, using either criterion (2.6) or (2.7).
Step 3: Among the admissible set of (λ, γ) identified in Step 2, find the maximum value of the empirical power under Θ1, denoted by πmax, and further find the values of (λ, γ) from the admissible set that lead to empirical power no less than πmax – ϵ, where ϵ is a small constant used to maintain high power and is usually set to 0.01.
Step 4: Select the value of (λ, γ) that yields the minimum expected sample size from the set identified in Step 3 as the optimal design parameters.
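A sketch of the grid search in Steps 2–4 is given below. type1_error(), power_alt(), and expected_n() are hypothetical helpers that estimate the corresponding operating characteristics by simulation for a given (λ, γ); the grid ranges, α, and ϵ are illustrative defaults.

```r
## Illustrative sketch of the simulation-based calibration of (lambda, gamma).
calibrate <- function(lambdas = seq(0.80, 0.99, by = 0.01),
                      gammas  = seq(0.5, 2.0, by = 0.1),
                      alpha = 0.05, eps = 0.01) {
  grid <- expand.grid(lambda = lambdas, gamma = gammas)
  grid$t1e <- mapply(type1_error, grid$lambda, grid$gamma)   # Step 2: type I error rate
  adm <- grid[grid$t1e <= alpha, ]                           # admissible set
  adm$power <- mapply(power_alt, adm$lambda, adm$gamma)      # Step 3: power under H1
  adm <- adm[adm$power >= max(adm$power) - eps, ]
  adm$en <- mapply(expected_n, adm$lambda, adm$gamma)        # Step 4: expected sample size
  adm[which.min(adm$en), c("lambda", "gamma")]               # optimal design parameters
}
```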
Given a prespecified maximum sample size, the power of the resulting design may be excessively low or high; in this case, one can further calibrate the sample size to obtain the desired power. The calibration procedure is time-consuming, especially in Step 2, because the BBP is calculated through MCMC and the type I error rate must be computed across a grid of null scenarios. To address this problem, we propose to calibrate the proposed design approximately by using asymptotic posterior distributions of δp and δμ to calculate BBP. Details of this approximation are given in the Supplementary Material. Note that the asymptotic posterior distributions are used only in calibration, whereas the exact joint posterior distribution of δp and δμ based on MCMC is used in trial implementation and simulation. The computation time required for a calibration with a sample size of 160 provides a striking example: the exact method based on MCMC requires about 909.2 core hours, while the approximate method requires just 10.4 core hours, about 1/90 of the computational resources. Moreover, the calibration results of the two methods are very similar, as shown in Tables S1 and S2 of the Supplementary Material.
The calibration procedure also requires specification of the correlation between safety and efficacy, ρ. In practice, the value of ρ can be obtained from historical studies. When it is challenging to specify the value of ρ precisely at the design stage, we propose to simply use ρ = 0 in the calibration procedure to reduce the burden of parameter calibration.
Based on our extensive simulation studies, we find that the proposed design calibrated based on ρ = 0 is very robust and can maintain a good control of the type I error rate. Further details about such an issue are provided in the sensitivity analyses of Section 3.2.
3 ∣. NUMERICAL STUDIES
In this section, we report a simulation study to assess the performance of the proposed design. We use the following three metrics to evaluate the operating characteristics of a design: (1) empirical power, the probability that the test rejects the null hypothesis H0 when the alternative hypothesis H1 is true; (2) empirical type I error rate, the probability that the test rejects H0 when H0 is true; and (3) expected sample size, the average sample size summarized over all simulated trials.
3.1 ∣. Simulation Settings
In our simulation study, we take non-informative priors for the proposed Bayesian design by specifying the hyperparameters as ak = bk = 1, mk = 0, υk = 0.001, and αk = βk = 0.001. We consider the sample size nJ per arm (T and R) to be 100, 160, and 220, resulting in total sample sizes of 200, 320, and 440, respectively. Biosimilarity is established if −0.20 < δp < 0.20 and −0.40 < δμ < 0.40 hold at the same time. We assume a common correlation coefficient ρ between Xki and Yki in arms T and R and consider three scenarios: no correlation (ρ = 0), a medium correlation (ρ = 0.3), and a high correlation (ρ = 0.5). Scenarios with a negative ρ are omitted because the results are similar to those under a positive ρ by symmetry.
We generate Xki and Yki from model (2.1) with pR = 0.5, μR = 0, and τR = 0.8. Under these configurations for the reference product, products T and R are similar if both pT ∈ (0.30, 0.70) and μT ∈ (−0.32, 0.32) are satisfied. When comparing the power of different designs, we set pT to be 0.42, 0.50, or 0.58, and μT to be −0.20, −0.10, 0, 0.10, or 0.20. When examining the type I error rates, we consider the two least favorable classes of scenarios, the first concerning safety and the second concerning efficacy. In scenario 1, we fix pT at pR – 0.20 = 0.30 or pR + 0.20 = 0.70 and vary μT from −0.32 to 0.32 in steps of 0.064; similarly, in scenario 2, we fix μT at μR – 0.32 = −0.32 or μR + 0.32 = 0.32 and vary pT from 0.30 to 0.70 in steps of 0.04. Technically speaking, when both pT and μT fall outside their respective similarity regions, the type I error rate of the proposed design is smaller than that evaluated under these least favorable scenarios. The prespecified nominal level α is set at 0.05. The simulation results are summarized over 20,000 replicated trials for each pair of (pk, μk), k = T or R.
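As an illustration of the data-generating step, the R sketch below inverts the relations in (2.2): given target values of (pk, μk, τk, ρk) for an arm, it solves for the conditional parameters and simulates paired outcomes. The function name and the example call are ours.

```r
## Illustrative sketch: simulate one arm with a target marginal mean, SD, and correlation.
sim_arm <- function(n, p, mu, tau, rho) {
  delta <- rho * tau / sqrt(p * (1 - p))   # mu1 - mu0 implied by the target correlation
  mu1   <- mu + (1 - p) * delta
  mu0   <- mu - p * delta
  sigma <- tau * sqrt(1 - rho^2)           # common conditional SD, from (2.2)
  y <- rbinom(n, 1, p)
  x <- rnorm(n, mean = ifelse(y == 1, mu1, mu0), sd = sigma)
  data.frame(x = x, y = y)
}

## e.g., a reference arm with p_R = 0.5, mu_R = 0, tau_R = 0.8, and rho = 0.3
ref <- sim_arm(160, p = 0.5, mu = 0, tau = 0.8, rho = 0.3)
```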
We compare the performance of the proposed BOB designs (BOBs and BOBavg) to those of some commonly used methods including
FE Frequentist fixed design considering a single efficacy endpoint. FE is a frequentist fixed-sample design that adopts a two-sample t-test approach for the scaled average bioequivalence test21,26 to evaluate biosimilarity in the efficacy endpoint.
FS Frequentist fixed design considering a single safety endpoint. FS is a frequentist fixed-sample design that performs the frequentist two one-sided tests (TOST) procedure27 against both the lower and upper margins to test the safety endpoint.
BAE/BAS Bayesian adaptive design considering a single efficacy or safety endpoint. BAE and BAS are Bayesian group-sequential designs that respectively consider efficacy and safety as the single primary endpoint. The probability cutoffs in BAE and BAS are optimized in the same way as in the proposed BOB design.
FES Frequentist fixed design considering both efficacy and safety endpoints. FES is a frequentist fixed-sample design that combines the FE and FS designs to test both efficacy and safety endpoints. The significance level of FES is controlled using the intersection-union method28.
In the fixed-sample designs, the final analysis takes place when the number of patients per arm reaches 100, 160, or 220. In the adaptive designs, we perform three interim analyses when nJ/4, nJ/2, and 3nJ/4 patients have been enrolled in each arm, and a final analysis when all 2nJ patients have been treated. Although comparing the univariate designs (i.e., FE, FS, BAE, and BAS) with the bivariate designs (i.e., BOB and FES) is not entirely fair because they target different estimands, we still include the former to comprehensively quantify the gains as well as the limitations of the proposed design with bivariate endpoints. Through the simulation study we show that ignoring one important endpoint in trial monitoring may lead to an inflated type I error rate and infrequent early termination for unpromising products. We search for pairs of (λ, γ) for the Bayesian designs under sample sizes nJ of 100, 160, and 220, assuming a correlation of ρ = 0 between efficacy and safety, to control the type I error rate while maintaining the power. The calibrated values of the tuning parameters are provided in Table S3 of the Supplementary Materials.
3.2 ∣. Simulation Results
Type I error rate control
Figure 1 shows the type I error rates and the expected sample sizes under H0 for the considered designs when the maximum sample size nJ = 160 and the correlation between safety and efficacy ρ = 0. We also select and present some representative values of the type I error rates and expected sample sizes in Table S4 of the Supplementary Materials.
FIGURE 1.
Type I error rate and expected sample size of seven designs with the maximum sample size per arm nJ = 160 and the correlation between efficacy and safety ρ = 0. Panel (a), type I error rate when pT is fixed at 0.30 or 0.70. Panel (b), expected sample size when pT is fixed at 0.30 or 0.70. Panel (c), type I error rate when μT is fixed at ±0.32. Panel (d), expected sample size when μT is fixed at ±0.32. Because FE, FS, and FES are fixed designs without any interim analysis, their expected sample sizes are identical as shown by the coincident dashed lines.
In panels (a) and (b) of Figure 1, the value of pT is fixed at 0.30 or 0.70 and the value of μT varies from −0.32 to 0.32. Given that (pR, μR) = (0.5, 0), the test product is similar to the reference product in efficacy but not in safety. Under this scenario, the univariate designs using a single safety endpoint, such as FS and BAS, as well as the bivariate designs including FES and BOBs, control the type I error rate at the prespecified level of 0.05. By contrast, the univariate designs based on a single efficacy endpoint, such as FE and BAE, fail to incorporate the information from the safety data, so their type I error rates are severely inflated, especially when μT is close to μR. For example, in Table S4 of the Supplementary Material, when δμ = 0, the type I error rates of the FS, BAS, FES, and BOBs designs are well controlled below 5%, while the BAE design has a type I error rate of 93.6%.
The BOBavg design has a slightly larger type I error rate than the BOBs design because of its looser control of the type I error rate. As shown in Figure 1, although the BOBavg design maintains good control of the type I error rate in most scenarios, the rate is inflated in some circumstances where μT and μR are particularly close; for example, the type I error rate of the BOBavg design reaches 8.8% when μT is closest to μR and decreases to 0.5% as the absolute scaled difference ∣δμ∣ approaches its biosimilarity limit. Furthermore, the FS and BAS designs do not take the efficacy data into account, so they have almost constant empirical type I error rates of about 0.05 when δp is fixed and δμ is varied. On the other hand, the type I error rates of the bivariate designs, such as FES, BOBs, and BOBavg, decrease as the absolute difference ∣δμ∣ increases, exhibiting a better ability to detect dissimilarity.
In terms of the expected sample size under the null scenarios, the BOBs and BOBavg designs perform uniformly better than the competing designs because of their ability to terminate early at interim analyses and their incorporation of both efficacy and safety endpoints. For example, in Table S4 of the Supplementary Materials, the BOBs design saves about 60 patients relative to the fixed-sample designs and about 5 patients relative to the BAS design. Given that both BOBs and BOBavg maintain (or roughly maintain) the type I error rate with much smaller sample sizes, the two proposed BOB designs are more efficient than the other designs in identifying differences between the reference and biosimilar products, leading to savings in sample size.
Likewise, the above conclusions remain unchanged in panels (c) and (d) of Figure 1, where the value of μT is fixed at ±0.32 and that of pT varies from 0.30 to 0.70, indicating that the test product is similar to the reference product in safety but not in efficacy. Specifically, the FS and BAS designs that use only the safety endpoint have severely inflated type I error rates, whereas the other designs control (or roughly control, in the case of the BOBavg design) the type I error rate at the prespecified level of 0.05. The sample size savings of the BOBs and BOBavg designs are more prominent when there is a considerable difference in safety between products T and R.
To summarize, BOBs and BOBavg designs control the type I error rates well with both endpoints and greatly reduce the required sample sizes at the same time. More importantly, BOBs and BOBavg designs are more likely to detect the dissimilarity and terminate the trial early when there is a considerable difference between products T and R, thus avoiding a waste of resources.
Empirical power
Table 1 reports the empirical power of different designs under a sample size per arm of nJ = 160 and a correlation between efficacy and safety of ρ = 0. Because the univariate designs, including FE, FS, BAE, and BAS, use only one endpoint in decision making, their power is not affected by the value of the other endpoint. As a result, they yield almost the same power when the value of the considered endpoint is fixed and that of the other varies. For example, in Table 1, when δμ = 0, the BAE design yields the same power of 93.6% under δp = −0.08, 0, and 0.08. Likewise, the BAS design yields the same power of 68.9% under δp = −0.08 regardless of the value of δμ.
TABLE 1.
Comparison of power (%) and expected sample size (EN) using seven designs under the scenarios with the maximum sample size per arm nJ = 160 and the correlation between efficacy and safety ρ = 0.
| δp | δμ |  | FE | FS | BAE | BAS | FES | BOBs | BOBavg |
|---|---|---|---|---|---|---|---|---|---|
| −0.08 | −0.250 | Power (%) | 38.0 | 70.0 | 37.9 | 68.9 | 26.5 | 25.8 | 35.8 |
|  |  | EN | 160 | 160 | 136.6 | 152.3 | 160 | 126.7 | 128.0 |
|  | −0.125 | Power (%) | 79.1 | 70.0 | 77.9 | 68.9 | 55.2 | 54.3 | 64.5 |
|  |  | EN | 160 | 160 | 153.4 | 152.3 | 160 | 143.2 | 144.3 |
|  | 0 | Power (%) | 94.4 | 70.0 | 93.6 | 68.9 | 65.6 | 65.9 | 74.4 |
|  |  | EN | 160 | 160 | 157.7 | 152.3 | 160 | 148.2 | 148.9 |
|  | 0.125 | Power (%) | 78.8 | 70.0 | 77.7 | 68.9 | 54.9 | 53.9 | 64.4 |
|  |  | EN | 160 | 160 | 153.5 | 152.3 | 160 | 143.4 | 144.3 |
|  | 0.250 | Power (%) | 37.9 | 70.0 | 37.3 | 68.9 | 26.2 | 25.2 | 35.0 |
|  |  | EN | 160 | 160 | 137.0 | 152.3 | 160 | 126.8 | 128.3 |
| 0 | −0.250 | Power (%) | 38.0 | 94.7 | 37.9 | 94.0 | 36.1 | 34.8 | 45.2 |
|  |  | EN | 160 | 160 | 136.6 | 158.4 | 160 | 133.8 | 135.0 |
|  | −0.125 | Power (%) | 79.1 | 94.7 | 77.9 | 94.0 | 74.9 | 72.7 | 80.5 |
|  |  | EN | 160 | 160 | 153.4 | 158.4 | 160 | 150.7 | 151.2 |
|  | 0 | Power (%) | 94.4 | 94.7 | 93.6 | 94.0 | 89.3 | 88.0 | 92.0 |
|  |  | EN | 160 | 160 | 157.7 | 158.4 | 160 | 155.2 | 155.5 |
|  | 0.125 | Power (%) | 78.8 | 94.7 | 77.7 | 94.0 | 74.5 | 72.6 | 80.5 |
|  |  | EN | 160 | 160 | 153.5 | 158.4 | 160 | 150.7 | 151.2 |
|  | 0.250 | Power (%) | 37.9 | 94.7 | 37.3 | 94.0 | 35.8 | 34.4 | 44.6 |
|  |  | EN | 160 | 160 | 137.0 | 158.4 | 160 | 134.2 | 135.5 |
| 0.08 | −0.250 | Power (%) | 38.0 | 70.0 | 37.9 | 68.9 | 26.6 | 22.6 | 32.4 |
|  |  | EN | 160 | 160 | 136.6 | 152.6 | 160 | 126.3 | 127.7 |
|  | −0.125 | Power (%) | 79.1 | 70.0 | 77.9 | 68.9 | 55.4 | 50.9 | 62.1 |
|  |  | EN | 160 | 160 | 153.4 | 152.6 | 160 | 143.1 | 144.2 |
|  | 0 | Power (%) | 94.4 | 70.0 | 93.6 | 68.9 | 66.2 | 64.2 | 73.3 |
|  |  | EN | 160 | 160 | 157.7 | 152.6 | 160 | 148.0 | 148.9 |
|  | 0.125 | Power (%) | 78.8 | 70.0 | 77.7 | 68.9 | 55.5 | 51.2 | 62.3 |
|  |  | EN | 160 | 160 | 153.5 | 152.6 | 160 | 143.3 | 144.3 |
|  | 0.250 | Power (%) | 37.9 | 70.0 | 37.3 | 68.9 | 26.7 | 21.8 | 32.0 |
|  |  | EN | 160 | 160 | 137.0 | 152.6 | 160 | 126.7 | 128.3 |
FE/FS, frequentist fixed design considering a single efficacy or safety endpoint; BAE/BAS, Bayesian adaptive design considering a single efficacy or safety endpoint; FES, frequentist fixed design considering both efficacy and safety endpoints; BOBs, the proposed Bayesian optimal biosimilar trial design for both efficacy and safety endpoints under stringent type I error control; BOBavg, the proposed design that controls the average type I error rate. δp is the difference between the toxicity probabilities of arms T and R, and δμ is the scaled difference in treatment effects.
The power of the bivariate designs, such as FES, BOBs, and BOBavg, is lower than that of the univariate designs. This is because the incorporation of co-primary endpoints makes it more difficult to claim similarity between products T and R. Among the bivariate designs, BOBs produces power comparable to that of FES and performs better under a relatively small sample size of nJ = 100 (Tables S9–S12 in the Supplementary Materials). Between the two BOB designs, as expected, the BOBavg design yields higher power than BOBs because of its relatively loose control of the type I error rate. For example, when δp = 0 and δμ = 0, BOBavg yields about a 4% improvement in power over the BOBs design.
It is worth noting that the expected sample sizes of the BOBs and BOBavg designs are both smaller than those of the competing designs. For example, Table 1 shows that the BOBs design needs 5–34 fewer patients than the FES design while maintaining a similar power level, and the BOBavg design achieves higher power than the FES design while saving at least 5 patients per trial. Together with the comparable power, these results demonstrate the advantage of the BOB designs in trial cost-effectiveness.
Sensitivity analyses
We conducted additional simulation studies to evaluate the effect of the correlation ρ on the performance of our designs. Figure S1 in the Supplementary Materials shows the type I error rates and the expected sample sizes under a sample size per arm of nJ = 160 and a correlation between efficacy and safety of ρ = 0.5, and Tables S6–S8 in the Supplementary Materials report the empirical power of different designs under nJ = 160 and ρ = 0.3, 0.5, and −0.5. Negative correlations are not reported in detail because the simulation results with −ρ are very similar to those with ρ due to symmetry. For example, when nJ = 160, the results based on ρ = −0.5 in Table S8 of the Supplementary Materials exhibit a similar pattern to those in Table S7 with ρ = 0.5.
As expected, the absolute value of the correlation ρ affects the power (or type I error rate) function of the bivariate designs such as FES, BOBs, and BOBavg. Specifically, π(δp, δμ) increases as ∣ρ∣ increases when the sign of δpδμ is the same as that of ρ, and decreases as ∣ρ∣ increases when the sign of δpδμ differs from that of ρ. As a consequence, when ∣ρ∣ > 0, the maximum type I error rates of the BOBs and BOBavg designs, which are calibrated assuming ρ = 0, are slightly inflated. Even so, the proposed BOB designs are very robust and still maintain good control of the type I error rate. For example, as shown in Figure S1 in the Supplementary Materials, the maximum type I error rate of the BOBs design is about 5.2%, quite close to the nominal level of 5%. More intuitively, as shown in Figure 2, the inflation of the maximum type I error rates of the BOB designs remains below 1% when ∣ρ∣ > 0, and is nearly negligible when ∣ρ∣ < 0.4. This slight inflation of the type I error rate caused by calibrating the design with ρ = 0 alleviates the burden of parameter calibration, because it is often not feasible to have an appropriate initial guess about ρ. Nonetheless, the concern about type I error rate inflation can be avoided by choosing a larger ∣ρ∣ in the design calibration procedure, such as ρ = 0.5. However, such a stringent choice is unnecessary in most scenarios, as it would sacrifice the power of the design.
FIGURE 2.
Maximum type I error rate of the three bivariate designs under various correlations between efficacy and safety ρ with the maximum sample size per arm nJ = 160.
Furthermore, we have also examined other sample sizes, nJ = 100 and 220, for the type I error rate and power. The results of these additional numerical studies are provided in Figures S2–S5 and Tables S9–S16 of the Supplementary Material, respectively. These findings show that the conclusions regarding power (or type I error rate) and sample size savings are not affected by the different sample sizes.
4 ∣. TRIAL APPLICATION
Neovascular age-related macular degeneration (nAMD) is a leading cause of irreversible blindness among people 50 years of age or older. Ranibizumab, a recombinant monoclonal antibody fragment, was approved more than ten years ago by the FDA and EMA for treating nAMD29. However, the high cost of ranibizumab puts it out of reach for many patients, encouraging researchers to develop biosimilars. Woo et al.30 conducted a two-arm, randomized, parallel-group phase III trial that demonstrated the biosimilarity of SB11 (the test product) to ranibizumab (the reference product). In this study, a total of 704 participants were randomized 1:1 to receive SB11 (N = 351) or ranibizumab (N = 353).
One of the primary efficacy endpoints is the change from baseline in central subfield thickness (CST) at week 4, a continuous endpoint measuring macular thickness. The mean change (and standard error) from baseline at week 4 in CST in the full analysis set (FAS) was −108 μm (5) in the SB11 group and −100 μm (5) in the ranibizumab group (i.e., μT = −108 and μR = −100). The reference value of the efficacy endpoint borrowed from the historical data is −110 μm. We adopt the scaled equivalence margin of ±36/τR = ±0.38, where the unscaled margin of ±36 μm was predefined in the original study as the biosimilarity limit for the difference in the change in CST, and we use τR as the scaling factor for illustration. Note that the efficacy test is effectively simplified to an unscaled test, because we use the true value of τR for the reference product as the scaling factor in the margin. This study also reported detailed adverse events (AEs). We focus on treatment-emergent AEs (TEAEs) of moderate or severe intensity and exclude mild TEAEs. The rates of moderate and severe TEAEs were 34.76% in the SB11 group and 33.43% in the ranibizumab group (i.e., pT = 34.76% and pR = 33.43%). No reference value or biosimilarity margin for the safety endpoint was given in the original study; thus, for illustrative purposes, we take 33.43% as the reference toxicity rate and adopt a margin of (−15.0%, 15.0%) for the difference in this safety endpoint.
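As a quick arithmetic check (our own back-calculation from the reported summary statistics, not part of the original analysis), the implied scaling factor and the observed differences can be compared with the margins as follows.

```r
## Illustrative check of the observed differences against the redesign margins.
tau_R <- 36 / 0.38                 # implied scaling factor, approx. 94.7
d_mu  <- (-108 - (-100)) / tau_R   # scaled efficacy difference, approx. -0.084
d_p   <- 0.3476 - 0.3343           # difference in moderate/severe TEAE rates, 0.0133
c(abs(d_mu) < 0.38, abs(d_p) < 0.15)   # both observed differences lie within the margins
```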
Given that the efficacy and safety endpoints were reported separately in this study, it is impossible to obtain the estimated correlation between them based only on summary statistics. To redesign the trial, we consider three scenarios: no correlation between safety and efficacy (ρ = 0), a medium correlation (ρ = 0.3), and a high correlation (ρ = 0.5). Under the three scenarios, the BOBs and BOBavg designs described in Section 2.3 are applied to test biosimilarity. We generate efficacy and safety data in the same way as in Section 3.1, with μT = −108, μR = −100, pT = 34.76%, and pR = 33.43%. We conduct two interim analyses when a total of 304 and 504 patients (i.e., 43.2% and 71.6% of the total sample size) have been enrolled, respectively. As described in Section 3.1, we calibrate the value of (λ, γ) using the reference values of the efficacy and safety endpoints, −110 and 33.43%, under the assumption of ρ = 0. For each scenario and setting, we apply the BOBs and BOBavg designs with the resulting (λ, γ) and calculate the power and expected sample size; the results are displayed in Table 2. For comparison, we also provide the power of the frequentist equivalence tests on efficacy and safety. In this trial, single tests of the efficacy and safety endpoints yield high power of 99.0% and 98.4%, respectively. When combining safety and efficacy, the BOB designs still yield comparable power: the BOBs and BOBavg designs yield power of about 98%, very close to the single frequentist tests. In addition, the BOB designs yield similar power under different values of ρ for the same settings, indicating that the proposed designs are robust.
TABLE 2.
Application results of BOB to the biosimilar trial of SB11.
|  | BOBs |  |  | BOBavg |  |  | Efficacy | Safety |
|---|---|---|---|---|---|---|---|---|
| ρ | 0 | 0.3 | 0.5 | 0 | 0.3 | 0.5 |  |  |
| (λ, γ) | (0.9502, 1.08) |  |  | (0.9212, 1.02) |  |  |  |  |
| Power (%) | 97.3 | 97.9 | 98.4 | 98.3 | 98.8 | 99.1 | 98.7 | 98.4 |
| EN | 351.7 | 352.3 | 352.4 | 351.7 | 352.7 | 352.4 | 353 | 353 |
Note: The Efficacy and Safety columns report the power of the frequentist equivalence tests on the efficacy and safety endpoints, respectively. The expected sample size EN is calculated for arm R.
We further consider a range of scenarios to obtain a global view of the power (or type I error rate) function and the expected sample size of the BOB designs. Specifically, we fix the endpoints of the ranibizumab group as above and vary the efficacy value of the SB11 group from −143.2 to −56.8 and its safety endpoint from 15.4% to 51.4%; the results with ρ = 0 are shown in Figure 3. It can be seen that the BOBs design requires at least 345 patients per arm to obtain a power greater than 80%. If the two products are not similar in safety or efficacy, the BOBs design can save about 100 more patients per arm than the fixed designs. As expected, the BOBavg design shows a pattern similar to that of the BOBs design but provides larger values of the rejection probability. The results with ρ = 0.3 and 0.5 are shown in Figures S6 and S7 in the Supplementary Materials. As shown, the power varies with the value of ∣ρ∣, which aligns with the discussion of the role of ρ in Section 3.2.
FIGURE 3.
Contour plots of the power (or type I error rate) function (%) and expected sample size of the BOB designs when ρ = 0 under different combinations of δμ and δp. Panel (a), power of the BOBs design. Panel (b), expected sample size of the BOBs design. Panel (c), power of the BOBavg design. Panel (d), expected sample size of the BOBavg design. The exact value observed in the SB11 trial is marked with a red dot in all four panels.
5 ∣. CONCLUDING REMARKS
We have proposed a Bayesian optimal design for biosimilar trials that simultaneously considers safety and efficacy endpoints and jointly tests the two endpoints in a unified framework. We calibrate the proposed BOB design through a simulation-based approach to control the frequentist type I error rate, maximize the power, and save sample size. We also introduce a more flexible way to control the overall type I error rate of the proposed design. Investigators can adjust the settings of the BOB design when dealing with different biological products to obtain sound operating characteristics. The simulation results show that the BOB design requires fewer patients while providing power comparable to the frequentist bivariate design. The saving in sample size is most meaningful when products T and R are not similar, so that the trial can be terminated as early as possible to avoid a waste of resources. Our framework is quite general, and the performance of the proposed BOB design is not sensitive to the choice of biosimilarity margins. As a result, BOB can readily be generalized to incorporate other criteria, such as the scaled criterion for drug interchangeability31. In reality, however, the determination of the biosimilarity margin is a complicated process, requiring close collaboration among biostatisticians, clinical/preclinical investigators, and regulatory agencies. As an open question for future studies, Bayesian approaches may be used to advance the determination of the biosimilarity margin by exploiting information from (historical) reference data or preclinical data.
Because of the use of the Bayesian biosimilar probability in trial monitoring, the proposed design can be readily extended to more than two endpoints. In addition, under the Bayesian framework, it is natural to incorporate historical information through informative prior distributions. Several novel approaches for adaptive information borrowing are available in the literature, such as the calibrated power prior approach10 and the robust meta-analytic prior19, and it is of interest to investigate which information-borrowing approach yields better operating characteristics in our setting. Furthermore, as another useful avenue for future research, an extension of the proposed method to biosimilar trials with time-to-event endpoints is warranted.
Supplementary Material
ACKNOWLEDGEMENTS
We would like to thank the Editor, the Associate Editor, and the reviewer for their valuable comments and suggestions, with special thanks to the reviewer whose dedicated and meticulous effort has led to a much improved version of our paper. Lin’s research was partially supported by grants from the National Cancer Institute (5P30CA016672 and 1R01CA261978).
References
- 1. Challand R, Gorham H, Constant J. Biosimilars: where we were and where we are. J Biopharm Stat. 2014;24(6):1154–1164.
- 2. Ingrasciotta Y, Cutroneo PM, Marcianò I, Giezen T, Atzeni F, Trifirò G. Safety of biologics, including biosimilars: perspectives on current status and future direction. Drug Saf. 2018;41(11):1013–1022.
- 3. U.S. Food and Drug Administration (FDA). Guidance for industry: scientific considerations in demonstrating biosimilarity to a reference product. https://www.fda.gov/media/82647/download. Published April 2015. Accessed June 16, 2022.
- 4. Tsai WC. Update on biosimilars in Asia. Curr Rheumatol Rep. 2017;19(8):47.
- 5. European Medicines Agency (EMA). Guideline on similar biological medicinal products containing biotechnology-derived proteins as active substance: non-clinical and clinical issues. https://www.ema.europa.eu/en/similar-biological-medicinal-products-containing-biotechnology-derived-proteins-active-substance-non. Published January 9, 2015. Accessed June 16, 2022.
- 6. Weise M. From bioequivalence to biosimilars: how much do regulators dare? Z Evid Fortbild Qual Gesundhwes. 2019;140:58–62.
- 7. Chow SC, Wang J, Endrenyi L, Lachenbruch PA. Scientific considerations for assessing biosimilar products. Stat Med. 2013;32(3):370–381.
- 8. Chow SC, Liu JP. Statistical assessment of biosimilar products. J Biopharm Stat. 2010;20(1):10–30.
- 9. Chiu ST, Liu JP, Chow SC. Applications of the Bayesian prior information to evaluation of equivalence of similar biological medicinal products. J Biopharm Stat. 2014;24(6):1254–1263.
- 10. Pan H, Yuan Y, Xia J. A calibrated power prior approach to borrow information from historical data with application to biosimilar clinical trials. J R Stat Soc Ser C Appl Stat. 2017;66(5):979–996.
- 11. Uozumi R, Hamada C. Adaptive seamless design for establishing pharmacokinetic and efficacy equivalence in developing biosimilars. Ther Innov Regul Sci. 2017;51(6):761–769.
- 12. Weiss RE, Xia X, Zhang N, Wang H, Chi E. Bayesian methods for analysis of biosimilar phase III trials. Stat Med. 2018;37(20):2938–2953.
- 13. Mielke J, Schmidli H, Jones B. Incorporating historical information in biosimilar trials: challenges and a hybrid Bayesian-frequentist approach. Biometrical J. 2018;60(3):564–582.
- 14. Psioda MA, Hu K, Zhang Y, Pan J, Ibrahim JG. Bayesian design of biosimilars clinical programs involving multiple therapeutic indications. Biometrics. 2020;76(2):630–642.
- 15. Belay SY, Mu R, Xu J. A Bayesian adaptive design for biosimilar trials with time-to-event endpoint. Pharm Stat. 2021;20(3):597–609.
- 16. Schellekens H, Smolen JS, Dicato M, Rifkin RM. Safety and efficacy of biosimilars in oncology [published correction appears in Lancet Oncol. 2017 Mar;18(3):e134]. Lancet Oncol. 2016;17(11):e502–e509.
- 17. Das S, Johnson DB. Immune-related adverse events and anti-tumor efficacy of immune checkpoint inhibitors. J Immunother Cancer. 2019;7(1):306.
- 18. Thall PF, Simon R. Practical Bayesian guidelines for phase IIB clinical trials. Biometrics. 1994;50(2):337–349.
- 19. Schmidli H, Gsteiger S, Roychoudhury S, O’Hagan A, Spiegelhalter D, Neuenschwander B. Robust meta-analytic-predictive priors in clinical trials with historical control information. Biometrics. 2014;70(4):1023–1032.
- 20. Chow SC. Analytical Similarity Assessment in Biosimilar Product Development. New York, NY: Chapman and Hall/CRC Press; 2018.
- 21. Tothfalusi L, Endrenyi L, Arieta AG. Evaluation of bioequivalence for highly variable drugs with scaled average bioequivalence. Clin Pharmacokinet. 2009;48(11):725–743.
- 22. Davit BM, Chen ML, Conner DP, et al. Implementation of a reference-scaled average bioequivalence approach for highly variable generic drug products by the US Food and Drug Administration. AAPS J. 2012;14(4):915–924.
- 23. Tothfalusi L, Endrenyi L. An exact procedure for the evaluation of reference-scaled average bioequivalence. AAPS J. 2016;18(2):476–489.
- 24. Zhou H, Lee JJ, Yuan Y. BOP2: Bayesian optimal design for phase II clinical trials with simple and complex endpoints. Stat Med. 2017;36(21):3302–3314.
- 25. U.S. Food and Drug Administration (FDA). Guidance for industry: interacting with the FDA on complex innovative trial designs for drugs and biological products. https://www.fda.gov/media/130897/download. Published December 2020. Accessed June 16, 2022.
- 26. Wellek S. Testing Statistical Hypotheses of Equivalence and Noninferiority. 2nd ed. New York, NY: Chapman and Hall/CRC Press; 2010.
- 27. Schuirmann DJ. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. J Pharmacokinet Biopharm. 1987;15(6):657–680.
- 28. Berger RL, Hsu JC. Bioequivalence trials, intersection-union tests and equivalence confidence sets. Stat Sci. 1996;11(4):283–319.
- 29. Rosenfeld PJ, Brown DM, Heier JS, et al. Ranibizumab for neovascular age-related macular degeneration. N Engl J Med. 2006;355(14):1419–1431.
- 30. Woo SJ, Veith M, Hamouz J, et al. Efficacy and safety of a proposed ranibizumab biosimilar product vs a reference ranibizumab product for patients with neovascular age-related macular degeneration: a randomized clinical trial. JAMA Ophthalmol. 2021;139(1):68–76.
- 31. Chow SC, Xu H, Endrenyi L, Song FY. A new scaled criterion for drug interchangeability. Chinese J Pharm Anal. 2015;35(5):844–848.