Summary
For regulatory approval of a biosimilar product, extensive evaluations should be performed through rigorous clinical trials to establish the similarity between the reference product and the proposed biosimilar in terms of both efficacy and safety. Existing designs for biosimilar trials often use a single primary efficacy endpoint in trial monitoring and then separately evaluate the safety of the biosimilar product in a secondary analysis at trial completion. However, ignoring the safety endpoint and the correlation between safety and efficacy in trial monitoring may lead to a high false positive rate, or it may forgo the opportunity to terminate the trial early when dissimilarity in safety could be detected at an interim analysis. We propose a Bayesian optimal design for biosimilar trials that incorporates both safety and efficacy endpoints in a unified framework. Based on a Bayesian joint model for safety and efficacy, we sequentially use a so-called Bayesian biosimilar probability to make go/no-go decisions. We calibrate the Bayesian design to maximize the statistical power while maintaining the frequentist type I error rate at the nominal level. Extensive simulation studies show that the design has desirable performance in terms of the false positive rate and the average sample size. We also apply the proposed design to a biosimilar trial evaluating a ranibizumab product.
Keywords: Bayesian optimal design, Biosimilar, Co-primary endpoints, Power, Sequential design
1 ∣. INTRODUCTION
Biological products are substances mainly derived from living organisms and include vaccines, blood components, gene therapies, tissues, and proteins such as monoclonal antibodies1, which vary widely in size and structural complexity. As a result, the manufacturing process for innovator biological products is quite complex, leading to high development costs2. A biosimilar is a biological product that is “highly similar” to an approved reference biological product in terms of efficacy, safety, and quality3. In contrast to innovator biological products, biosimilars can avoid duplicating expensive clinical trials and have lower development risks and costs. Hence, research on biosimilars is crucial to the pharmaceutical industry because of the cost savings for healthcare systems and consumers. Recent years have witnessed a booming biosimilar market, especially in Asia, because low-cost biosimilars provide a desirable solution to the pricing hurdle4. In addition, the growing number of patents on approved biological products that are set to expire has accelerated the development of biosimilars.
For regulatory approval, the analytic similarity between the biosimilar and the reference product (i.e., biosimilarity) should be established through well-designed clinical trials. Regulatory agencies such as the US Food and Drug Administration (FDA) and the European Medicines Agency (EMA) have published guidelines for the development and approval of biosimilars3,5. Generally speaking, the definition of biosimilarity does not vary much among regulators. According to the FDA guideline3, biosimilarity means that “the biological product is highly similar to the reference product notwithstanding minor differences in clinically inactive components,” and that “there are no clinically meaningful differences between the biological product and the reference product in terms of the safety, purity, and potency of the product.” It is worth noting that these definitions all emphasize the importance of the safety and efficacy of biosimilars. In other words, a manufacturer must demonstrate that its proposed biosimilar product has no clinically or statistically meaningful differences from the reference product in terms of safety and efficacy.
In practice, it is more complex and challenging to characterize biological products6, because they differ fundamentally from small-molecule drugs in aspects such as structure, safety, and pharmacological mechanism7. It is almost impossible for a biosimilar to have exactly the same active ingredients as the reference biological product. Hence, researchers cannot directly apply the conventional methods used to evaluate bioequivalence for small-molecule drugs to establish biosimilarity. Many statistical methods have been developed for conducting biosimilar clinical trials. Chow and Liu8 proposed a two-group parallel design for a bridging study based on pharmacokinetic (PK), pharmacodynamic (PD), and efficacy endpoints. Chiu et al.9 used a Bayesian method to borrow historical data on approved biological products as prior information for biosimilars. Pan et al.10 proposed a Bayesian group sequential design that integrates historical information through a calibrated power prior approach. Uozumi and Hamada11 developed an adaptive seamless PK and efficacy design that combines the tests of PK parameters and efficacy. Weiss et al.12 proposed to incorporate prior information from historical data on the reference product and from in vitro and phase I PK/PD studies of the biosimilar. Mielke et al.13 introduced a hybrid Bayesian-frequentist approach for incorporating historical data while controlling the type I error rate. Psioda et al.14 proposed a general Bayesian design framework to test biosimilarity in several indications simultaneously. Belay et al.15 proposed a two-stage Bayesian adaptive design for biosimilar trials with time-to-event endpoints.
Existing works typically focus on a single primary efficacy endpoint and seldom take the correlation between safety and efficacy into account. However, biological products are likely to cause systemic adverse effects, which are closely associated with the efficacy of the product16. For example, both prospective and retrospective analyses in non-small cell lung cancer patients have shown an association between the onset of immune-related adverse events (irAEs) and the efficacy of anti-PD-1 and anti-PD-L1 antibodies17. Patients who experienced irAEs had superior progression-free survival and overall survival compared with those who did not. Negative correlations between safety and efficacy have also been observed for biological products2. Because of such safety–efficacy interactions, it is possible for a biosimilar product to be similar to the reference product in efficacy while differing substantially in safety profile. In this case, using efficacy as the sole primary endpoint may result in a false positive conclusion, or in a waste of resources if the dissimilarity cannot be detected early.
In this paper, we propose a Bayesian optimal design for biosimilar trials (BOB) that simultaneously uses both safety and efficacy endpoints to evaluate the biosimilarity between a test biosimilar product and its reference product. Considering both safety and efficacy endpoints poses challenges for the statistical design. For example, the conclusions based on the efficacy and safety endpoints separately may point in different directions, making the biosimilarity difficult to interpret. In addition, the correlation between efficacy and safety is typically unknown, and misspecifying it may reduce the study power or inflate the type I error rate. To address these challenges, we build a joint model for efficacy and safety under the Bayesian framework. Based on the joint posterior distribution of the efficacy and safety parameters, we sequentially evaluate the biosimilarity to make interim go/no-go decisions. We propose calibration steps that maximize the power of this Bayesian design while maintaining the frequentist type I error rate at the prespecified nominal level. Our design is flexible in that clinicians can adapt the design settings and control the trial stringency according to their understanding of the biosimilar to be tested. Simulation studies show that the proposed design yields power comparable to that of competing designs and performs better in controlling the type I error rate under various scenarios.
The remainder of this paper is organized as follows. In Section 2, we describe our joint probability model and propose a Bayesian optimal design considering both safety and efficacy endpoints. In Section 3, we conduct simulation studies to evaluate the performance of the new design. In Section 4, we exemplify the proposed design based on the reported data from a randomized clinical trial that evaluates a proposed ranibizumab biosimilar product. Finally, we conclude this paper with a discussion in Section 5. The R code to implement the proposed BOB design can be found at https://github.com/xiaohanchi/BOB_Design.
2 ∣. METHODS
2.1 ∣. Probability Model
In a biosimilar trial that tests the biosimilarity between a reference (R) product and a test (T) biosimilar, we consider a continuous efficacy response and a binary safety indicator as the co-primary efficacy and safety endpoints, respectively. In this paper, we assume equal randomization, so that the numbers of patients are equal between arms T and R. The proposed design is not tied to the equal randomization assumption and can be adapted to accommodate various randomization schemes, including any fixed-ratio randomization or outcome-adaptive randomization.
Assume that n patients have been enrolled and treated in each arm so far, so that a pair (Xki, Yki) is available for each patient i in arm k, i = 1, … , n and k = T, R, where Xki represents a continuous efficacy outcome (for example, the original or log-transformed efficacy measurement), and Yki indicates whether the patient has experienced an adverse event (AE), with Yki = 1 if a toxicity is observed during follow-up and Yki = 0 otherwise. We use a marginal-conditional approach to jointly model the bivariate safety–efficacy outcomes. In particular, we assume that Yki marginally follows a Bernoulli distribution and that, given Yki, Xki follows a conditional normal distribution whose mean depends on the value of Yki. That is,
$$Y_{ki} \sim \text{Bernoulli}(p_k), \qquad X_{ki} \mid Y_{ki} = y \sim N\big(\mu_{k1}\, y + \mu_{k0}(1-y), \ \sigma_k^2\big), \quad y \in \{0, 1\}, \tag{2.1}$$
where pk is the probability of a patient having an AE (i.e., the toxicity probability), μk1 and μk0 are the conditional means of Xki for patients with an AE (Yki = 1) and without an AE (Yki = 0), respectively, and σk2 is the conditional variance. For simplicity, we assume a common conditional variance for Xki ∣ Yki = 1 and Xki ∣ Yki = 0; this equal-variance assumption can be relaxed straightforwardly. For arm k = T or R, the unconditional mean (μk, the population treatment effect in terms of efficacy) and variance (τk2) of Xki, and the correlation (ρk) between Xki and Yki, are respectively
$$\mu_k = p_k\mu_{k1} + (1-p_k)\mu_{k0}, \qquad \tau_k^2 = \sigma_k^2 + p_k(1-p_k)(\mu_{k1}-\mu_{k0})^2, \qquad \rho_k = \frac{(\mu_{k1}-\mu_{k0})\sqrt{p_k(1-p_k)}}{\tau_k}. \tag{2.2}$$
Clearly, the correlation between safety and efficacy in arm k is determined by the values of μk0 and μk1, as well as the toxicity probability pk, and it reduces to zero in the special case μk0 = μk1.
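To make the moment relations in (2.2) concrete, the following R sketch (illustrative only; the parameter values and variable names are ours, not taken from the paper or its accompanying code) computes the unconditional mean, variance, and safety–efficacy correlation implied by a set of conditional parameters for one arm, and checks them against a large simulated sample.

```r
## Illustrative sketch: moments implied by the marginal-conditional model (2.1)-(2.2).
## The parameter values below are hypothetical examples.
set.seed(1)
p_k   <- 0.5   # toxicity probability
mu_k0 <- 0.0   # conditional mean of X given Y = 0 (no AE)
mu_k1 <- 0.5   # conditional mean of X given Y = 1 (AE)
sig_k <- 0.8   # common conditional standard deviation

## Closed-form unconditional moments from (2.2)
mu_k   <- p_k * mu_k1 + (1 - p_k) * mu_k0
tau2_k <- sig_k^2 + p_k * (1 - p_k) * (mu_k1 - mu_k0)^2
rho_k  <- (mu_k1 - mu_k0) * sqrt(p_k * (1 - p_k)) / sqrt(tau2_k)

## Monte Carlo check of the three formulas
n <- 1e6
y <- rbinom(n, 1, p_k)
x <- rnorm(n, mean = ifelse(y == 1, mu_k1, mu_k0), sd = sig_k)
c(mu_k, mean(x))      # unconditional mean
c(tau2_k, var(x))     # unconditional variance
c(rho_k, cor(x, y))   # safety-efficacy correlation
```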
We base the decision-making process in biosimilar trials on Bayesian posterior probabilities18. Under the Bayesian paradigm, we assume conjugate prior distributions for the unknown parameters of the form
$$p_k \sim \text{Beta}(a_k, b_k), \qquad \mu_{k0}, \mu_{k1} \mid \sigma_k^2 \ \overset{\text{i.i.d.}}{\sim}\ N\!\left(m_k, \ \sigma_k^2/\upsilon_k\right), \qquad \sigma_k^2 \sim \text{IG}(\alpha_k, \beta_k),$$
where “i.i.d.” means independent and identically distributed; ak, bk, mk, υk, αk, and βk are prespecified hyperparameters for arm k = T, R, whose values need not be the same between treatment arms; and IG(α, β) denotes an inverse gamma distribution with shape parameter α and scale parameter β.
Given the observed data Dn = {(XR1, YR1), … , (XRn, YRn); (XT1, YT1), … , (XTn, YTn)} of the first 2n patients in arms T and R, the conjugacy of the priors yields a closed-form posterior for (pk, μk0, μk1, σk2): pk ∣ Dn follows a beta distribution, σk2 ∣ Dn follows an inverse gamma distribution, and μk0, μk1 ∣ σk2, Dn follow normal distributions, with hyperparameters updated by the standard conjugate formulas.
Based on simple algebra, after integrating out σk2, the marginal posterior distributions of μk0 and μk1 are t-distributions, where tν(μ, σ2) denotes a t-distribution with ν degrees of freedom, location parameter μ, and scale parameter σ2. The posterior distributions of μk and τk2 are then derived from (2.2) based on the marginal posterior distributions of pk, μk0, μk1, and σk2, which can be easily obtained using the Markov chain Monte Carlo method.
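The following R sketch illustrates exact posterior sampling for a single arm under the conjugate structure described above. The explicit hyperparameter updates are the standard beta and normal–inverse-gamma results; they are our own derivation, consistent with the t marginals stated above, and are not necessarily identical in notation to the authors' implementation. Both AE groups are assumed non-empty.

```r
## Illustrative sketch of posterior sampling for one arm (notation ours).
## Prior: p ~ Beta(a, b); mu_g | sigma2 ~ N(m, sigma2 / v) for g = 0, 1; sigma2 ~ IG(alpha, beta).
post_draws <- function(x, y, a = 1, b = 1, m = 0, v = 0.001,
                       alpha = 0.001, beta = 0.001, ndraw = 10000) {
  n  <- length(y); n1 <- sum(y); n0 <- n - n1
  x0 <- x[y == 0]; x1 <- x[y == 1]          # assumes n0 >= 1 and n1 >= 1
  ss <- sum((x0 - mean(x0))^2) + sum((x1 - mean(x1))^2)
  beta_n <- beta + 0.5 * (ss +
              v * n0 / (v + n0) * (mean(x0) - m)^2 +
              v * n1 / (v + n1) * (mean(x1) - m)^2)
  p      <- rbeta(ndraw, a + n1, b + n0)                              # p | Dn
  sigma2 <- 1 / rgamma(ndraw, shape = alpha + n / 2, rate = beta_n)   # sigma2 | Dn
  mu0 <- rnorm(ndraw, (v * m + n0 * mean(x0)) / (v + n0), sqrt(sigma2 / (v + n0)))
  mu1 <- rnorm(ndraw, (v * m + n1 * mean(x1)) / (v + n1), sqrt(sigma2 / (v + n1)))
  ## transform to the unconditional quantities in (2.2)
  mu  <- p * mu1 + (1 - p) * mu0
  tau <- sqrt(sigma2 + p * (1 - p) * (mu1 - mu0)^2)
  data.frame(p = p, mu = mu, tau = tau)
}
```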
In some cases, because manufacturing methods for biological products evolve rapidly, historical information may not be available for the current trial. In addition, because of a lack of clinical knowledge about new biosimilars, researchers are unlikely to have enough prior information about the tested biological products. Under such circumstances, to avoid misleading inference caused by the inappropriate use of prior information, we adopt non-informative priors by taking ak = bk = 1 and υk, αk, βk → 0. Of note, we have additionally investigated the use of half-Cauchy or uniform distributions as the prior on σR, and found that the proposed design is not sensitive to the choice of the prior on σR2 (or σR) as long as it is non-informative.
In other situations, useful historical information may be accessible for long-established biological products, in which case we can incorporate information from former studies as informative priors for the reference product. For example, we can choose the hyperparameter values for arm R based on clinicians’ understanding of the reference product and the historical studies, while keeping non-informative priors for arm T. Alternatively, to avoid prior-data conflict, we can consider approaches that adaptively incorporate historical information, such as the calibrated power prior approach10 and robust meta-analytic priors19. However, investigating the performance of the proposed design with informative priors is beyond the scope of this paper and will be examined in separate work.
2.2 ∣. Bayesian Optimal Design
Denote δp = pT – pR as the difference in toxicity probabilities between products T and R, and δμ = (μT – μR)/τR as the scaled difference in treatment effects. Biological products, including biosimilars, usually exhibit high within-subject variability in efficacy20. Rather than using the unadjusted difference μT – μR, we hereby adopt δμ, following the scaled average bioequivalence/biosimilarity criterion (SABE)21 advocated by the FDA advisory committee for pharmaceutical science and clinical pharmacology, to assess the interchangeability in efficacy between the reference and test products. Formally, we propose to evaluate the biosimilarity between T and R through testing the following hypotheses:
$$H_0: \delta_p \notin (\delta_p^L, \delta_p^U) \ \text{ or } \ \delta_\mu \notin (\delta_\mu^L, \delta_\mu^U) \qquad \text{versus} \qquad H_1: \delta_p \in (\delta_p^L, \delta_p^U) \ \text{ and } \ \delta_\mu \in (\delta_\mu^L, \delta_\mu^U), \tag{2.3}$$
where δpL and δpU (or δμL and δμU) are regulatory lower and upper biosimilarity limits for the safety endpoint (or the efficacy endpoint), respectively. The null hypothesis H0 means that δp or δμ (or both) does not meet its predefined criterion for biosimilarity, i.e., T and R are not similar in terms of safety or efficacy. The alternative hypothesis H1 means that both parameters satisfy their respective criteria, i.e., T and R are similar in both the toxicity probability and the scaled mean efficacy. In other words, biosimilarity is established if H0 is rejected. The values of (δpL, δpU) or (δμL, δμU) can be specified by the investigators to reflect a clinically meaningful and regulatorily reasonable indifference margin for safety or efficacy, respectively. In particular, (δμL, δμU) can be specified according to the reference-scaled average bioequivalence approach22,23. For example, if the unscaled biosimilarity limits are specified as ±0.2, the scaled limits (δμL, δμU) can be defined as ±0.2/σW0, where σW0 is a benchmark value predefined by the regulatory agency, say σW0 = 0.25. Furthermore, mixed scaling limits can be used to relax the criterion for establishing biosimilarity when the reference variability exceeds the benchmark value σW0.
In our design, we adopt a so-called Bayesian biosimilar probability (BBP) to evaluate the biosimilar hypotheses (2.3) under the Bayesian framework,
$$\mathrm{BBP} = \Pr\!\left(\delta_p^L < \delta_p < \delta_p^U \ \text{ and } \ \delta_\mu^L < \delta_\mu < \delta_\mu^U \ \middle|\ D_n\right), \tag{2.4}$$
which can be numerically calculated based on the joint posterior distribution of (pT, pR, μT, μR, τR) ∣ Dn derived in Section 2.1. Here, BBP is a probability measure that quantifies the degree of similarity between products T and R in terms of safety and efficacy simultaneously. The correlation between safety and efficacy is also naturally incorporated in this probability. The larger the BBP, the smaller the difference between the two products in terms of safety and efficacy.
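Given posterior draws from both arms, the BBP in (2.4) can be estimated as the proportion of joint draws that fall inside the biosimilarity region. The sketch below is illustrative: post_draws() is the hypothetical helper from the previous sketch, and the margins shown are those used later in Section 3, not universal defaults.

```r
## Illustrative sketch: Monte Carlo estimate of the BBP in (2.4).
compute_bbp <- function(xT, yT, xR, yR,
                        dp_lim = c(-0.20, 0.20), dmu_lim = c(-0.40, 0.40)) {
  dT <- post_draws(xT, yT)
  dR <- post_draws(xR, yR)
  dp  <- dT$p - dR$p                # posterior draws of delta_p
  dmu <- (dT$mu - dR$mu) / dR$tau   # posterior draws of the scaled efficacy difference
  mean(dp > dp_lim[1] & dp < dp_lim[2] & dmu > dmu_lim[1] & dmu < dmu_lim[2])
}
```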
We propose to evaluate the biosimilarity between products T and R in a group sequential manner and use BBP to make interim go/no-go decisions. Specifically, let J denote the total number of analyses, including J – 1 interim analyses and one final evaluation; and let nj denote the number of patients in each arm accumulated up to the jth analysis, j = 1, 2, … , J. The maximum number of patients to be enrolled in the trial is 2nJ. The steps for making the go/no-go decision rules in the proposed BOB design are described as follows:
- Initialization: Enroll and equally randomize the first 2n1 patients to arms T and R.
- Interim analyses: Given the jth interim data Dnj = {(XT1, YT1), … , (XTnj, YTnj); (XR1, YR1), … , (XRnj, YRnj)}, j = 1, 2, … , J – 1, we update the BBP, denoted as BBPj, and make go/no-go decisions based on the following rules.
- If BBPj < C(nj), terminate the trial early and conclude that products T and R are not similar.
- If BBPj ≥ C(nj), continue to enroll and randomize patients to arms T and R until the (j + 1)th interim analysis.
- Final analysis: If the trial has not been terminated early, then a final analysis will be conducted based on BBPJ, which is calculated using all observed data of a total of 2nJ patients.
- If BBPJ < C(nJ), accept H0 and conclude that products T and R are not similar.
- If BBPJ ≥ C(nJ), reject H0 and conclude that products T and R are similar in terms of both efficacy and safety.
Following Zhou et al.24, we assume that the probability cutoff C(nj) is a function of the interim sample size nj with the following form
$$C(n_j) = \lambda \left(\frac{n_j}{n_J}\right)^{\gamma}, \tag{2.5}$$
where λ > 0 and γ > 0 are tuning parameters that control the operating characteristics of the design, and nJ is the maximum sample size in each arm. Note that the value of BBP depends on the sample size of Dn. When the sample size is small, the joint posterior distribution of δp and δμ tends to have large variability, leading to a moderate BBP value even under the alternative hypothesis. We require γ > 0 so that C(nj) increases monotonically with the interim sample size nj, or equivalently, with the accumulated information. The probability of making a “no-go” decision is therefore smaller at the early stages of the trial. The purpose of such a choice is to make as few mistakes as possible at the beginning of the trial when information is sparse. As the trial proceeds and information accumulates, the increasing value of C(nj) imposes a more stringent rule for making a “go” decision. In the proposed design, λ and γ are carefully calibrated using simulated data so that the frequentist type I error rate is maintained at the nominal level and the study power is maximized. Specific steps for calibrating λ and γ are described in Section 2.4.
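A minimal sketch of the cutoff (2.5) and the resulting interim go/no-go rule is given below; the values of lambda and gamma are placeholders standing in for the calibrated tuning parameters.

```r
## Illustrative sketch: cutoff (2.5) and an interim go/no-go check.
cutoff <- function(n_j, n_J, lambda, gamma) lambda * (n_j / n_J)^gamma

interim_decision <- function(bbp_j, n_j, n_J, lambda = 0.95, gamma = 1) {
  if (bbp_j < cutoff(n_j, n_J, lambda, gamma)) "no-go: stop, T and R not similar"
  else "go: continue (or claim similarity at the final analysis)"
}

## e.g., at the second of four equally spaced looks with n_J = 160 per arm
interim_decision(bbp_j = 0.62, n_j = 80, n_J = 160)
```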
In this paper, we only terminate the trial early when the interim data indicate that the biosimilar drug is different from the reference product in either the toxicity probability or the scaled treatment effect. It is also technically possible to generalize the BOB design by additionally including an early stopping rule for strong evidence of biosimilarity. This can be done by introducing an upper cutoff of BBPj, in addition to the lower cutoff C(nj). However, a caveat of such a practice is that early stopping due to biosimilarity tends to result in a smaller sample size of the trial, thus leading to insufficient evidence in terms of other perspectives, such as PK/PD or immunogenicity analyses.
2.3 ∣. Frequentist Operating Characteristics
Following the FDA guidance on complex innovative clinical trial designs25, we propose to control the frequentist operating characteristics of the proposed Bayesian design in a “hybrid” sense. In our framework, the test of the null hypothesis H0 is an intersection–union test (IUT) consisting of tests for the respective safety and efficacy endpoints. H0 can be rejected if and only if each of the individual hypotheses for both endpoints in H0 can be rejected. In the aforementioned Bayesian decision framework, rejection of H0 requires large BBPj’s throughout the trial. Let ℛ denote the event of rejecting the null hypothesis H0. In the sequential BOB design, ℛ = {BBPj ≥ C(nj) for all j = 1, … , J}.
Given a pair of true values (δp, δμ), the probability of rejecting H0, denoted as π(δp, δμ), is π(δp, δμ) = Pr(ℛ ∣ δp, δμ).
Denote α as the prespecified level of the type I error rate. To calibrate the proposed Bayesian design, we suggest two procedures to control the overall type I error rate: one corresponds to a stringent control and the other to a loose control. Because H0 is a composite null hypothesis, the overall type I error rate for the test of the composite hypothesis (2.3) is the largest value of π(δp, δμ) over the region Θ0 defined by H0, i.e., the supremum of π(δp, δμ) over Θ0.
More specifically, the first procedure controls the maximum type I error rate over Θ0 in a point-wise way. In other words, we propose to identify the parameters (λ, γ) such that the maximum type I error rate over (δp, δμ) ∈ Θ0 is controlled at the prespecified level α, that is, sup(δp, δμ)∈Θ0 π(δp, δμ) ≤ α.
The maximum type I error rate can be identified using a grid search algorithm by numerically enumerating values of (δp, δμ) in Θ0. We denote the proposed BOB design based on such a stringent type I error controlling procedure as BOBs. Noticing that the combinations with δp fixed at δpL or δpU (or δμ fixed at δμL or δμU) represent the least favorable scenarios in Θ0, we can use the following simplified procedure for type I error rate control,
$$\max_{1 \le i \le 4} \ \sup_{(\delta_p,\, \delta_\mu)\in\Theta_{0i}} \pi(\delta_p, \delta_\mu) \le \alpha, \tag{2.6}$$
where Θ01 = {δp = δpL, δμL ≤ δμ ≤ δμU}, Θ02 = {δp = δpU, δμL ≤ δμ ≤ δμU}, Θ03 = {δpL ≤ δp ≤ δpU, δμ = δμL}, and Θ04 = {δpL ≤ δp ≤ δpU, δμ = δμU} denote the four boundary segments of Θ0 corresponding to the least favorable scenarios.
The second procedure controls, in a Bayesian way, the maximum of four average type I error rates under the four null scenarios in which δp or δμ is fixed at its lower or upper biosimilarity limit, respectively. Given a parameter set Θ, denote the average probability of rejecting H0 as
$$\bar{\pi}(\Theta) = \int \pi(\delta_p, \delta_\mu)\, f(\delta_p, \delta_\mu \mid \Theta)\, d\delta_p\, d\delta_\mu,$$
where f(δp, δμ ∣ Θ) is the joint density function of (δp, δμ) given Θ. We additionally propose to control the type I error rate in a loose sense, that is,
$$\max_{1 \le i \le 4} \ \bar{\pi}(\Theta_{0i}) \le \alpha, \tag{2.7}$$
where, for each boundary segment Θ0i defined above, (δp, δμ) is taken to be uniformly distributed over Θ0i when computing π̄(Θ0i).
We denote the proposed BOB design based on such a control of the average type I error rate as BOBavg. Note that the uniform distribution in Θ0i serves as the weighting function for calculating the average type I error rate, and it can be replaced by other distributions such as the truncated normal distribution. The choice of an appropriate weighting function is crucial to the design calibration and it determines the power of the proposed design. Clinicians and statisticians should choose reasonable settings in consideration of the characteristics of the tested products and available resources.
In general, BOBs provides a more stringent control of the type I error rate than BOBavg because the former controls the maximum type I error rate point-wisely, whereas the latter controls the average rate. In both procedures, we fix the value of δp or δμ at its lower or upper biosimilarity limit while varying the value of the other, reflecting the fact that two-sided tests are considered in our design. Analogously to the type I error control, we calibrate the proposed design to maximize the empirical power under a prespecified alternative parameter set Θ1 within the region defined by H1, so that the empirical power can be expressed as π̄(Θ1).
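The two control procedures can be sketched as follows. Here reject_prob(dp, dmu, lambda, gamma) is a hypothetical placeholder that estimates π(δp, δμ) by simulating BOB trials under a given cutoff; the boundary segments and the uniform weighting mirror the description above, and the margins are those used in Section 3.

```r
## Illustrative sketch contrasting the BOBs and BOBavg calibration targets.
dp_lim  <- c(-0.20, 0.20)
dmu_lim <- c(-0.40, 0.40)
boundary <- rbind(                      # grid over the four least favorable segments
  expand.grid(dp = dp_lim, dmu = seq(dmu_lim[1], dmu_lim[2], length.out = 11)),
  expand.grid(dp = seq(dp_lim[1], dp_lim[2], length.out = 11), dmu = dmu_lim))

## BOBs: point-wise maximum type I error over the boundary grid, as in (2.6)
max_t1e <- function(lambda, gamma)
  max(mapply(reject_prob, boundary$dp, boundary$dmu,
             MoreArgs = list(lambda = lambda, gamma = gamma)))

## BOBavg: average over each boundary segment under a uniform weighting, then take
## the maximum of the four averages, as in (2.7)
avg_t1e <- function(lambda, gamma, nmc = 200) {
  seg_avg <- function(dp, dmu)
    mean(mapply(reject_prob, dp, dmu, MoreArgs = list(lambda = lambda, gamma = gamma)))
  max(seg_avg(dp_lim[1], runif(nmc, dmu_lim[1], dmu_lim[2])),
      seg_avg(dp_lim[2], runif(nmc, dmu_lim[1], dmu_lim[2])),
      seg_avg(runif(nmc, dp_lim[1], dp_lim[2]), dmu_lim[1]),
      seg_avg(runif(nmc, dp_lim[1], dp_lim[2]), dmu_lim[2]))
}
```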
2.4 ∣. Design Calibration
In our proposed design, the parameters λ and γ introduced in Section 2.2 should be calibrated through a simulation-based approach. Our strategy to optimize λ and γ is based on a grid search: first we find the values of (λ, γ) that yield the desirable empirical type I error rate, then select the one that produces high empirical power and small expected sample size as the optimal design parameters. Specifically, we calibrate the parameters λ and γ according to the following steps:
Step 1: Elicit the null hypothesis H0 and the alternative hypothesis H1 from investigators, and specify the level of the type I error rate α and the maximum sample size. To make sure the BOB design can have adequate power under a reasonable sample size, the endpoint with a larger variance usually requires a wider biosimilarity margin.
Step 2: Perform a grid search to find all possible combinations of (λ, γ) for the stopping cutoff C(nj) that lead to a type I error rate of less than or equal to the prespecified level α, using either criterion (2.6) or (2.7).
Step 3: Among the admissible set of (λ, γ) identified in Step 2, find the maximum value of the empirical power under Θ1, denoted by πmax, and further find the values of (λ, γ) from the admissible set that lead to empirical power no less than πmax – ϵ, where ϵ is a small constant used to maintain high power and is usually set to 0.01.
Step 4: Select the value of (λ, γ) that yields the minimum expected sample size from the set identified in Step 3 as the optimal design parameters.
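A sketch of the grid search in Steps 2–4 is given below. type1_error(), power_alt(), and expected_n() are hypothetical helpers that estimate the corresponding operating characteristics by simulation for a given (λ, γ); the grid ranges, α, and ϵ are illustrative defaults.

```r
## Illustrative sketch of the simulation-based calibration of (lambda, gamma).
calibrate <- function(lambdas = seq(0.80, 0.99, by = 0.01),
                      gammas  = seq(0.5, 2.0, by = 0.1),
                      alpha = 0.05, eps = 0.01) {
  grid <- expand.grid(lambda = lambdas, gamma = gammas)
  grid$t1e <- mapply(type1_error, grid$lambda, grid$gamma)   # Step 2: type I error rate
  adm <- grid[grid$t1e <= alpha, ]                           # admissible set
  adm$power <- mapply(power_alt, adm$lambda, adm$gamma)      # Step 3: power under H1
  adm <- adm[adm$power >= max(adm$power) - eps, ]
  adm$en <- mapply(expected_n, adm$lambda, adm$gamma)        # Step 4: expected sample size
  adm[which.min(adm$en), c("lambda", "gamma")]               # optimal design parameters
}
```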
Given a prespecified maximum sample size, the power of the resulting design may be excessively low or high; in this case, one can further calibrate the sample size to obtain the desired power. The calibration procedure is time-consuming, especially in Step 2, because the BBP is calculated through MCMC and the type I error rate must be computed across a grid of null scenarios. To address this problem, we propose to calibrate the proposed design approximately by using asymptotic posterior distributions of δp and δμ to calculate BBP. Details of this approximation are given in the Supplementary Material. Note that the asymptotic posterior distributions are used only in calibration, whereas the exact joint posterior distribution of δp and δμ based on MCMC is used in trial implementation and simulation. The computation time required for a calibration with a sample size of 160 provides a striking example: the exact method based on MCMC requires about 909.2 core hours, while the approximate method requires just 10.4 core hours, about 1/90 of the computational resources. Moreover, the calibration results of the two methods are very similar, as shown in Tables S1 and S2 of the Supplementary Material.
The calibration procedure also requires specification of the correlation between safety and efficacy, ρ. In practice, the value of ρ can be obtained from historical studies. When it is challenging to specify the value of ρ precisely at the design stage, we propose to simply use ρ = 0 in the calibration procedure to reduce the burden of parameter calibration.
Based on our extensive simulation studies, we find that the proposed design calibrated based on ρ = 0 is very robust and can maintain a good control of the type I error rate. Further details about such an issue are provided in the sensitivity analyses of Section 3.2.
3 ∣. NUMERICAL STUDIES
In this section, we report a simulation study to assess the performance of the proposed design. We use the following three metrics to evaluate the operating characteristics of a design: (1) empirical power, the probability that the test rejects the null hypothesis H0 when the alternative hypothesis H1 is true; (2) empirical type I error rate, the probability that the test rejects H0 when H0 is true; and (3) expected sample size, the average sample size summarized over all simulated trials.
3.1 ∣. Simulation Settings
In our simulation study, we take non-informative priors for the proposed Bayesian design by specifying the hyperparameters as ak = bk = 1, mk = 0, υk = 0.001, and αk = βk = 0.001. We consider the sample size nJ per arm (T and R) to be 100, 160, and 220, resulting in total sample sizes of 200, 320, and 440, respectively. Biosimilarity is established if −0.20 < δp < 0.20 and −0.40 < δμ < 0.40 hold at the same time. We assume a common correlation coefficient ρ between Xki and Yki in arms T and R and consider three scenarios: no correlation (ρ = 0), a medium correlation (ρ = 0.3), and a high correlation (ρ = 0.5). Scenarios with a negative ρ are omitted because the results are similar to those under a positive ρ by symmetry.
We generate Xki and Yki from model (2.1) with pR = 0.5, μR = 0, and τR = 0.8. Under these configurations for the reference product, products T and R are similar if both pT ∈ (0.30, 0.70) and μT ∈ (−0.32, 0.32) are satisfied. When comparing the power of different designs, we set pT to be 0.42, 0.50, or 0.58, and μT to be −0.20, −0.10, 0, 0.10, or 0.20. When examining the type I error rates, we consider the two least favorable classes of scenarios, the first concerning safety and the second concerning efficacy. In scenario 1, we fix pT at pR – 0.20 = 0.30 or pR + 0.20 = 0.70 and vary μT from −0.32 to 0.32 in steps of 0.064; similarly, in scenario 2, we fix μT at μR – 0.32 = −0.32 or μR + 0.32 = 0.32 and vary pT from 0.30 to 0.70 in steps of 0.04. Technically speaking, when both pT and μT fall outside their respective similarity regions, the type I error rate of the proposed design is smaller than that evaluated under these least favorable scenarios. The prespecified nominal level α is set at 0.05. The simulation results are summarized over 20,000 replicated trials for each pair of (pk, μk), k = T or R.
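As an illustration of the data-generating step, the R sketch below inverts the relations in (2.2): given target values of (pk, μk, τk, ρk) for an arm, it solves for the conditional parameters and simulates paired outcomes. The function name and the example call are ours.

```r
## Illustrative sketch: simulate one arm with a target marginal mean, SD, and correlation.
sim_arm <- function(n, p, mu, tau, rho) {
  delta <- rho * tau / sqrt(p * (1 - p))   # mu1 - mu0 implied by the target correlation
  mu1   <- mu + (1 - p) * delta
  mu0   <- mu - p * delta
  sigma <- tau * sqrt(1 - rho^2)           # common conditional SD, from (2.2)
  y <- rbinom(n, 1, p)
  x <- rnorm(n, mean = ifelse(y == 1, mu1, mu0), sd = sigma)
  data.frame(x = x, y = y)
}

## e.g., a reference arm with p_R = 0.5, mu_R = 0, tau_R = 0.8, and rho = 0.3
ref <- sim_arm(160, p = 0.5, mu = 0, tau = 0.8, rho = 0.3)
```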
We compare the performance of the proposed BOB designs (BOBs and BOBavg) to those of some commonly used methods including
FE Frequentist fixed design considering a single efficacy endpoint. FE is a frequentist fixed-sample design that adopts a two-sample t-test approach for the scaled average bioequivalence test21,26 to evaluate biosimilarity in the efficacy endpoint.
FS Frequentist fixed design considering a single safety endpoint. FS is a frequentist fixed-sample design that performs the frequentist two one-sided tests (TOST) procedure27 against both the lower and upper margins to test the safety endpoint.
BAE/BAS Bayesian adaptive design considering a single efficacy or safety endpoint. BAE and BAS are Bayesian group-sequential designs that respectively consider efficacy and safety as the single primary endpoint. The probability cutoffs in BAE and BAS are optimized in the same way as in the proposed BOB design.
FES Frequentist fixed design considering both efficacy and safety endpoints. FES is a frequentist fixed-sample design that combines the FE and FS designs to test both efficacy and safety endpoints. The significance level of FES is controlled using the intersection-union method28.
In the fixed-sample designs, the final analysis takes place when the number of patients per arm reaches 100, 160, or 220. In the adaptive designs, we perform three interim analyses when nJ/4, nJ/2, and 3nJ/4 patients have been enrolled in each arm, and a final analysis when all 2nJ patients have been treated. Although comparing the univariate designs (i.e., FE, FS, BAE, and BAS) with the bivariate designs (i.e., BOB and FES) is not entirely fair because they target different estimands, we still include the former to comprehensively quantify the gains as well as the limitations of the proposed design with bivariate endpoints. Through the simulation study we show that ignoring one important endpoint in trial monitoring may lead to an inflated type I error rate and infrequent early termination for unpromising products. We search for pairs of (λ, γ) for the Bayesian designs under sample sizes nJ of 100, 160, and 220, assuming a correlation of ρ = 0 between efficacy and safety, to control the type I error rate while maintaining the power. The calibrated values of the tuning parameters are provided in Table S3 of the Supplementary Materials.
3.2 ∣. Simulation Results
Type I error rate control
Figure 1 shows the type I error rates and the expected sample sizes under H0 for the considered designs when the maximum sample size nJ = 160 and the correlation between safety and efficacy ρ = 0. We also select and present some representative values of the type I error rates and expected sample sizes in Table S4 of the Supplementary Materials.
FIGURE 1.
Type I error rate and expected sample size of seven designs with the maximum sample size per arm nJ = 160 and the correlation between efficacy and safety ρ = 0. Panel (a), type I error rate when pT is fixed at 0.30 or 0.70. Panel (b), expected sample size when pT is fixed at 0.30 or 0.70. Panel (c), type I error rate when μT is fixed at ±0.32. Panel (d), expected sample size when μT is fixed at ±0.32. Because FE, FS, and FES are fixed designs without any interim analysis, their expected sample sizes are identical as shown by the coincident dashed lines.
In panels (a) and (b) of Figure 1, the value of pT is fixed at 0.30 or 0.70 and the value of μT varies from −0.32 to 0.32. Given that (pR, μR) = (0.5, 0), the test product is similar to the reference product in efficacy but not in safety. Under this scenario, the univariate designs using a single safety endpoint, such as FS and BAS, as well as the bivariate designs including FES and BOBs, control the type I error rate at the prespecified level of 0.05. By contrast, the univariate designs based on a single efficacy endpoint, such as FE and BAE, fail to incorporate the information from the safety data, so their type I error rates are severely inflated, especially when μT is close to μR. For example, in Table S4 of the Supplementary Material, when δμ = 0, the type I error rates of the FS, BAS, FES, and BOBs designs are well controlled below 5%, while the BAE design has a type I error rate of 93.6%.
The BOBavg design has a slightly larger type I error rate than the BOBs design because of its looser control of the type I error rate. As shown in Figure 1, although the BOBavg design maintains good control of the type I error rate in most scenarios, the rate is inflated in some circumstances where μT and μR are particularly close; for example, the type I error rate of the BOBavg design reaches 8.8% when μT is closest to μR and decreases to 0.5% as the absolute scaled difference ∣δμ∣ approaches its biosimilarity limit. Furthermore, the FS and BAS designs do not take the efficacy data into account, so they have almost constant empirical type I error rates of about 0.05 when δp is fixed and δμ is varied. On the other hand, the type I error rates of the bivariate designs, such as FES, BOBs, and BOBavg, decrease as the absolute difference ∣δμ∣ increases, exhibiting a better ability to detect dissimilarity.
In terms of the expected sample size under the null scenarios, the BOBs and BOBavg designs perform uniformly better than the competing designs because of their ability to terminate early at interim analyses and their incorporation of both efficacy and safety endpoints. For example, in Table S4 of the Supplementary Materials, the BOBs design saves about 60 patients relative to the fixed-sample designs and about 5 patients relative to the BAS design. Given that both BOBs and BOBavg maintain (or roughly maintain) the type I error rate with much smaller sample sizes, the two proposed BOB designs are more efficient than the other designs in identifying differences between the reference and biosimilar products, leading to savings in sample size.
Likewise, the above conclusions remain unchanged in panels (c) and (d) of Figure 1, where the value of μT is fixed at ±0.32 and that of pT varies from 0.30 to 0.70, indicating that the test product is similar to the reference product in safety but not in efficacy. Specifically, the FS and BAS designs that use only the safety endpoint have severely inflated type I error rates, whereas the other designs control (or roughly control, in the case of the BOBavg design) the type I error rate at the prespecified level of 0.05. The sample size savings of the BOBs and BOBavg designs are more prominent when there is a considerable difference in safety between products T and R.
To summarize, BOBs and BOBavg designs control the type I error rates well with both endpoints and greatly reduce the required sample sizes at the same time. More importantly, BOBs and BOBavg designs are more likely to detect the dissimilarity and terminate the trial early when there is a considerable difference between products T and R, thus avoiding a waste of resources.
Empirical power
Table 1 reports the empirical power of different designs under a sample size per arm of nJ = 160 and a correlation between efficacy and safety of ρ = 0. Because the univariate designs, including FE, FS, BAE, and BAS, use only one endpoint in decision making, their power is not affected by the value of the other endpoint. As a result, they yield almost the same power when the value of the considered endpoint is fixed and that of the other varies. For example, in Table 1, when δμ = 0, the BAE design yields the same power of 93.6% under δp = −0.08, 0, and 0.08. Likewise, the BAS design yields the same power of 68.9% under δp = −0.08 regardless of the value of δμ.
TABLE 1.
Comparison of power (%) and expected sample size (EN) using seven designs under the scenarios with the maximum sample size per arm nJ = 160 and the correlation between efficacy and safety ρ = 0.
| δp | δμ |  | FE | FS | BAE | BAS | FES | BOBs | BOBavg |
|---|---|---|---|---|---|---|---|---|---|
| −0.08 | −0.250 | Power (%) | 38.0 | 70.0 | 37.9 | 68.9 | 26.5 | 25.8 | 35.8 |
|  |  | EN | 160 | 160 | 136.6 | 152.3 | 160 | 126.7 | 128.0 |
|  | −0.125 | Power (%) | 79.1 | 70.0 | 77.9 | 68.9 | 55.2 | 54.3 | 64.5 |
|  |  | EN | 160 | 160 | 153.4 | 152.3 | 160 | 143.2 | 144.3 |
|  | 0 | Power (%) | 94.4 | 70.0 | 93.6 | 68.9 | 65.6 | 65.9 | 74.4 |
|  |  | EN | 160 | 160 | 157.7 | 152.3 | 160 | 148.2 | 148.9 |
|  | 0.125 | Power (%) | 78.8 | 70.0 | 77.7 | 68.9 | 54.9 | 53.9 | 64.4 |
|  |  | EN | 160 | 160 | 153.5 | 152.3 | 160 | 143.4 | 144.3 |
|  | 0.250 | Power (%) | 37.9 | 70.0 | 37.3 | 68.9 | 26.2 | 25.2 | 35.0 |
|  |  | EN | 160 | 160 | 137.0 | 152.3 | 160 | 126.8 | 128.3 |
| 0 | −0.250 | Power (%) | 38.0 | 94.7 | 37.9 | 94.0 | 36.1 | 34.8 | 45.2 |
|  |  | EN | 160 | 160 | 136.6 | 158.4 | 160 | 133.8 | 135.0 |
|  | −0.125 | Power (%) | 79.1 | 94.7 | 77.9 | 94.0 | 74.9 | 72.7 | 80.5 |
|  |  | EN | 160 | 160 | 153.4 | 158.4 | 160 | 150.7 | 151.2 |
|  | 0 | Power (%) | 94.4 | 94.7 | 93.6 | 94.0 | 89.3 | 88.0 | 92.0 |
|  |  | EN | 160 | 160 | 157.7 | 158.4 | 160 | 155.2 | 155.5 |
|  | 0.125 | Power (%) | 78.8 | 94.7 | 77.7 | 94.0 | 74.5 | 72.6 | 80.5 |
|  |  | EN | 160 | 160 | 153.5 | 158.4 | 160 | 150.7 | 151.2 |
|  | 0.250 | Power (%) | 37.9 | 94.7 | 37.3 | 94.0 | 35.8 | 34.4 | 44.6 |
|  |  | EN | 160 | 160 | 137.0 | 158.4 | 160 | 134.2 | 135.5 |
| 0.08 | −0.250 | Power (%) | 38.0 | 70.0 | 37.9 | 68.9 | 26.6 | 22.6 | 32.4 |
|  |  | EN | 160 | 160 | 136.6 | 152.6 | 160 | 126.3 | 127.7 |
|  | −0.125 | Power (%) | 79.1 | 70.0 | 77.9 | 68.9 | 55.4 | 50.9 | 62.1 |
|  |  | EN | 160 | 160 | 153.4 | 152.6 | 160 | 143.1 | 144.2 |
|  | 0 | Power (%) | 94.4 | 70.0 | 93.6 | 68.9 | 66.2 | 64.2 | 73.3 |
|  |  | EN | 160 | 160 | 157.7 | 152.6 | 160 | 148.0 | 148.9 |
|  | 0.125 | Power (%) | 78.8 | 70.0 | 77.7 | 68.9 | 55.5 | 51.2 | 62.3 |
|  |  | EN | 160 | 160 | 153.5 | 152.6 | 160 | 143.3 | 144.3 |
|  | 0.250 | Power (%) | 37.9 | 70.0 | 37.3 | 68.9 | 26.7 | 21.8 | 32.0 |
|  |  | EN | 160 | 160 | 137.0 | 152.6 | 160 | 126.7 | 128.3 |
FE/FS, frequentist fixed design considering a single efficacy or safety endpoint; BAE/BAS, Bayesian adaptive design considering a single efficacy or safety endpoint; FES, frequentist fixed design considering both efficacy and safety endpoints; BOBs, the proposed Bayesian optimal biosimilar trial design for both efficacy and safety endpoints under stringent type I error control; BOBavg, the proposed design that controls the average type I error rate. δp is the difference between the toxicity probabilities of arms T and R, and δμ is the scaled difference in treatment effects.
The power of the bivariate designs, such as FES, BOBs, and BOBavg, is lower than that of the univariate designs. This is because the incorporation of co-primary endpoints makes it more difficult to claim similarity between products T and R. Among the bivariate designs, BOBs produces power comparable to that of FES and performs better under a relatively small sample size of nJ = 100 (Tables S9–S12 in the Supplementary Materials). Between the two BOB designs, as expected, the BOBavg design yields higher power than BOBs because of its relatively loose control of the type I error rate. For example, when δp = 0 and δμ = 0, BOBavg yields about a 4% improvement in power over the BOBs design.
It is worth noting that the expected sample sizes of the BOBs and BOBavg designs are both smaller than those of the competing designs. For example, Table 1 shows that the BOBs design needs 5–34 fewer patients than the FES design while maintaining a similar power level, and the BOBavg design achieves higher power than the FES design while saving at least 5 patients per trial. Together with the comparable power, these results demonstrate the advantage of the BOB designs in trial cost-effectiveness.
Sensitivity analyses
We conducted additional simulation studies to evaluate the effect of the correlation ρ on the performance of our designs. Figure S1 in the Supplementary Materials shows the type I error rates and the expected sample sizes under a sample size per arm of nJ = 160 and a correlation between efficacy and safety of ρ = 0.5, and Tables S6–S8 in the Supplementary Materials report the empirical power of different designs under nJ = 160 and ρ = 0.3, 0.5, and −0.5. Negative correlations are not reported in detail because the simulation results with −ρ are very similar to those with ρ due to symmetry. For example, when nJ = 160, the results based on ρ = −0.5 in Table S8 of the Supplementary Materials exhibit a similar pattern to those in Table S7 with ρ = 0.5.
As expected, the absolute value of the correlation ρ affects the power (or type I error rate) function of the bivariate designs such as FES, BOBs, and BOBavg. Specifically, π(δp, δμ) increases as ∣ρ∣ increases when the sign of δpδμ is the same as that of ρ, and decreases as ∣ρ∣ increases when the sign of δpδμ differs from that of ρ. As a consequence, when ∣ρ∣ > 0, the maximum type I error rates of the BOBs and BOBavg designs, which are calibrated assuming ρ = 0, are slightly inflated. Even so, the proposed BOB designs are very robust and still maintain good control of the type I error rate. For example, as shown in Figure S1 in the Supplementary Materials, the maximum type I error rate of the BOBs design is about 5.2%, quite close to the nominal level of 5%. More intuitively, as shown in Figure 2, the inflation of the maximum type I error rates of the BOB designs remains below 1% when ∣ρ∣ > 0, and is nearly negligible when ∣ρ∣ < 0.4. This slight inflation of the type I error rate caused by calibrating the design with ρ = 0 alleviates the burden of parameter calibration, because it is often not feasible to have an appropriate initial guess about ρ. Nonetheless, the concern about type I error rate inflation can be avoided by choosing a larger ∣ρ∣ in the design calibration procedure, such as ρ = 0.5. However, such a stringent choice is unnecessary in most scenarios, as it would sacrifice the power of the design.
FIGURE 2.
Maximum type I error rate of the three bivariate designs under various correlations between efficacy and safety ρ with the maximum sample size per arm nJ = 160.
Furthermore, we have also examined other sample sizes, nJ = 100 and 220, for the type I error rate and power. The results of these additional numerical studies are provided in Figures S2–S5 and Tables S9–S16 of the Supplementary Material, respectively. These findings show that the conclusions regarding power (or type I error rate) and sample size savings are not affected by the different sample sizes.
4 ∣. TRIAL APPLICATION
Neovascular age-related macular degeneration (nAMD) is a leading cause of irreversible blindness among people 50 years of age or older. Ranibizumab, a recombinant monoclonal antibody fragment, was approved more than ten years ago by the FDA and EMA for treating nAMD29. However, the high cost of ranibizumab puts it out of reach for many patients, encouraging researchers to develop biosimilars. Woo et al.30 conducted a two-arm, randomized, parallel-group phase III trial that demonstrated the biosimilarity of SB11 (the test product) to ranibizumab (the reference product). In this study, a total of 704 participants were randomized 1:1 to receive SB11 (N = 351) or ranibizumab (N = 353).
One of the primary efficacy endpoints is the change from baseline in central subfield thickness (CST) at week 4, a continuous endpoint measuring macular thickness. The mean change (and standard error) from baseline at week 4 in CST in the full analysis set (FAS) was −108 μm (5) in the SB11 group and −100 μm (5) in the ranibizumab group (i.e., μT = −108 and μR = −100). The reference value of the efficacy endpoint borrowed from the historical data is −110 μm. We adopt the scaled equivalence margin of ±36/τR = ±0.38, where the unscaled margin of ±36 μm was predefined in the original study as the biosimilarity limit for the difference in the change in CST, and we use τR as the scaling factor for illustration. Note that the efficacy test is effectively simplified to an unscaled test, because we use the true value of τR for the reference product as the scaling factor in the margin. This study also reported detailed adverse events (AEs). We focus on treatment-emergent AEs (TEAEs) of moderate or severe intensity and exclude mild TEAEs. The rates of moderate and severe TEAEs were 34.76% in the SB11 group and 33.43% in the ranibizumab group (i.e., pT = 34.76% and pR = 33.43%). No reference value or biosimilarity margin for the safety endpoint was given in the original study; thus, for illustrative purposes, we take 33.43% as the reference toxicity rate and adopt a margin of (−15.0%, 15.0%) for the difference in this safety endpoint.
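As a quick arithmetic check (our own back-calculation from the reported summary statistics, not part of the original analysis), the implied scaling factor and the observed differences can be compared with the margins as follows.

```r
## Illustrative check of the observed differences against the redesign margins.
tau_R <- 36 / 0.38                 # implied scaling factor, approx. 94.7
d_mu  <- (-108 - (-100)) / tau_R   # scaled efficacy difference, approx. -0.084
d_p   <- 0.3476 - 0.3343           # difference in moderate/severe TEAE rates, 0.0133
c(abs(d_mu) < 0.38, abs(d_p) < 0.15)   # both observed differences lie within the margins
```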
Given that the efficacy and safety endpoints were reported separately in this study, it is impossible to obtain the estimated correlation between them based only on summary statistics. To redesign the trial, we consider three scenarios: no correlation between safety and efficacy (ρ = 0), a medium correlation (ρ = 0.3), and a high correlation (ρ = 0.5). Under the three scenarios, the BOBs and BOBavg designs described in Section 2.3 are applied to test biosimilarity. We generate efficacy and safety data in the same way as in Section 3.1, with μT = −108, μR = −100, pT = 34.76%, and pR = 33.43%. We conduct two interim analyses when a total of 304 and 504 patients (i.e., 43.2% and 71.6% of the total sample size) have been enrolled, respectively. As described in Section 3.1, we calibrate the value of (λ, γ) using the reference values of the efficacy and safety endpoints, −110 and 33.43%, under the assumption of ρ = 0. For each scenario and setting, we apply the BOBs and BOBavg designs with the resulting (λ, γ) and calculate the power and expected sample size; the results are displayed in Table 2. For comparison, we also provide the power of the frequentist equivalence tests on efficacy and safety. In this trial, single tests of the efficacy and safety endpoints yield high power of 99.0% and 98.4%, respectively. When combining safety and efficacy, the BOB designs still yield comparable power: the BOBs and BOBavg designs yield power of about 98%, very close to the single frequentist tests. In addition, the BOB designs yield similar power under different values of ρ for the same settings, indicating that the proposed designs are robust.
TABLE 2.
Application results of BOB to the biosimilar trial of SB11.
|  | BOBs |  |  | BOBavg |  |  | Efficacy | Safety |
|---|---|---|---|---|---|---|---|---|
| ρ | 0 | 0.3 | 0.5 | 0 | 0.3 | 0.5 |  |  |
| (λ, γ) | (0.9502, 1.08) |  |  | (0.9212, 1.02) |  |  |  |  |
| Power (%) | 97.3 | 97.9 | 98.4 | 98.3 | 98.8 | 99.1 | 98.7 | 98.4 |
| EN | 351.7 | 352.3 | 352.4 | 351.7 | 352.7 | 352.4 | 353 | 353 |
Note: The Efficacy and Safety columns report the power of the frequentist equivalence tests on the efficacy and safety endpoints, respectively. The expected sample size EN is calculated for arm R.
We further consider a range of scenarios to obtain a global view of the power (or type I error rate) function and the expected sample size of the BOB designs. Specifically, we fix the endpoints of the ranibizumab group as above and vary the efficacy value of the SB11 group from −143.2 to −56.8 and its safety endpoint from 15.4% to 51.4%; the results with ρ = 0 are shown in Figure 3. It can be seen that the BOBs design requires at least 345 patients per arm to obtain a power greater than 80%. If the two products are not similar in safety or efficacy, the BOBs design can save about 100 more patients per arm than the fixed designs. As expected, the BOBavg design shows a pattern similar to that of the BOBs design but provides larger values of the rejection probability. The results with ρ = 0.3 and 0.5 are shown in Figures S6 and S7 in the Supplementary Materials. As shown, the power varies with the value of ∣ρ∣, which aligns with the discussion of the role of ρ in Section 3.2.
FIGURE 3.
Contour plots of the power (or type I error rate) function (%) and expected sample size of the BOB designs when ρ = 0 under different combinations of δμ and δp. Panel (a), power of the BOBs design. Panel (b), expected sample size of the BOBs design. Panel (c), power of the BOBavg design. Panel (d), expected sample size of the BOBavg design. The exact value observed in the SB11 trial is marked with a red dot in all four panels.
5 ∣. CONCLUDING REMARKS
We have proposed a Bayesian optimal design for biosimilar trials that simultaneously considers safety and efficacy endpoints and jointly tests the two endpoints in a unified framework. We calibrate the proposed BOB design through a simulation-based approach to control the frequentist type I error rate, maximize the power, and save sample size. We also introduce a more flexible way to control the overall type I error rate of the proposed design. Investigators can adjust the settings of the BOB design when dealing with different biological products to obtain sound operating characteristics. The simulation results show that the BOB design requires fewer patients while providing power comparable to the frequentist bivariate design. The saving in sample size is most meaningful when products T and R are not similar, so that the trial can be terminated as early as possible to avoid a waste of resources. Our framework is quite general, and the performance of the proposed BOB design is not sensitive to the choice of biosimilarity margins. As a result, BOB can readily be generalized to incorporate other criteria, such as the scaled criterion for drug interchangeability31. In reality, however, the determination of the biosimilarity margin is a complicated process, requiring close collaboration among biostatisticians, clinical/preclinical investigators, and regulatory agencies. As an open question for future studies, Bayesian approaches may be used to advance the determination of the biosimilarity margin by exploiting information from (historical) reference data or preclinical data.
Because of the use of the Bayesian biosimilar probability in trial monitoring, the proposed design can be readily extended to more than two endpoints. In addition, under the Bayesian framework, it is natural to incorporate historical information through informative prior distributions. Several novel approaches for adaptive information borrowing are available in the literature, such as the calibrated power prior approach10 and the robust meta-analytic prior19, and it is of interest to investigate which information-borrowing approach yields better operating characteristics in our setting. Furthermore, as another useful avenue for future research, an extension of the proposed method to biosimilar trials with time-to-event endpoints is warranted.
Supplementary Material
ACKNOWLEDGEMENTS
We would like to thank the Editor, the Associate Editor, and the reviewer for their valuable comments and suggestions, with special thanks to the reviewer whose dedicated and meticulous effort has led to a much improved version of our paper. Lin’s research was partially supported by grants from the National Cancer Institute (5P30CA016672 and 1R01CA261978).
References
- 1. Challand R, Gorham H, Constant J. Biosimilars: where we were and where we are. J Biopharm Stat. 2014;24(6):1154–1164.
- 2. Ingrasciotta Y, Cutroneo PM, Marcianò I, Giezen T, Atzeni F, Trifirò G. Safety of biologics, including biosimilars: perspectives on current status and future direction. Drug Saf. 2018;41(11):1013–1022.
- 3. U.S. Food and Drug Administration (FDA). Guidance for industry: scientific considerations in demonstrating biosimilarity to a reference product. https://www.fda.gov/media/82647/download. Published April 2015. Accessed June 16, 2022.
- 4. Tsai WC. Update on biosimilars in Asia. Curr Rheumatol Rep. 2017;19(8):47.
- 5. European Medicines Agency (EMA). Guideline on similar biological medicinal products containing biotechnology-derived proteins as active substance: non-clinical and clinical issues. https://www.ema.europa.eu/en/similar-biological-medicinal-products-containing-biotechnology-derived-proteins-active-substance-non. Published January 9, 2015. Accessed June 16, 2022.
- 6. Weise M. From bioequivalence to biosimilars: how much do regulators dare? Z Evid Fortbild Qual Gesundhwes. 2019;140:58–62.
- 7. Chow SC, Wang J, Endrenyi L, Lachenbruch PA. Scientific considerations for assessing biosimilar products. Stat Med. 2013;32(3):370–381.
- 8. Chow SC, Liu JP. Statistical assessment of biosimilar products. J Biopharm Stat. 2010;20(1):10–30.
- 9. Chiu ST, Liu JP, Chow SC. Applications of the Bayesian prior information to evaluation of equivalence of similar biological medicinal products. J Biopharm Stat. 2014;24(6):1254–1263.
- 10. Pan H, Yuan Y, Xia J. A calibrated power prior approach to borrow information from historical data with application to biosimilar clinical trials. J R Stat Soc Ser C Appl Stat. 2017;66(5):979–996.
- 11. Uozumi R, Hamada C. Adaptive seamless design for establishing pharmacokinetic and efficacy equivalence in developing biosimilars. Ther Innov Regul Sci. 2017;51(6):761–769.
- 12. Weiss RE, Xia X, Zhang N, Wang H, Chi E. Bayesian methods for analysis of biosimilar phase III trials. Stat Med. 2018;37(20):2938–2953.
- 13. Mielke J, Schmidli H, Jones B. Incorporating historical information in biosimilar trials: challenges and a hybrid Bayesian-frequentist approach. Biometrical J. 2018;60(3):564–582.
- 14. Psioda MA, Hu K, Zhang Y, Pan J, Ibrahim JG. Bayesian design of biosimilars clinical programs involving multiple therapeutic indications. Biometrics. 2020;76(2):630–642.
- 15. Belay SY, Mu R, Xu J. A Bayesian adaptive design for biosimilar trials with time-to-event endpoint. Pharm Stat. 2021;20(3):597–609.
- 16. Schellekens H, Smolen JS, Dicato M, Rifkin RM. Safety and efficacy of biosimilars in oncology [published correction appears in Lancet Oncol. 2017 Mar;18(3):e134]. Lancet Oncol. 2016;17(11):e502–e509.
- 17. Das S, Johnson DB. Immune-related adverse events and anti-tumor efficacy of immune checkpoint inhibitors. J Immunother Cancer. 2019;7(1):306.
- 18. Thall PF, Simon R. Practical Bayesian guidelines for phase IIB clinical trials. Biometrics. 1994;50(2):337–349.
- 19. Schmidli H, Gsteiger S, Roychoudhury S, O’Hagan A, Spiegelhalter D, Neuenschwander B. Robust meta-analytic-predictive priors in clinical trials with historical control information. Biometrics. 2014;70(4):1023–1032.
- 20. Chow SC. Analytical Similarity Assessment in Biosimilar Product Development. New York, NY: Chapman and Hall/CRC Press; 2018.
- 21. Tothfalusi L, Endrenyi L, Arieta AG. Evaluation of bioequivalence for highly variable drugs with scaled average bioequivalence. Clin Pharmacokinet. 2009;48(11):725–743.
- 22. Davit BM, Chen ML, Conner DP, et al. Implementation of a reference-scaled average bioequivalence approach for highly variable generic drug products by the US Food and Drug Administration. AAPS J. 2012;14(4):915–924.
- 23. Tothfalusi L, Endrenyi L. An exact procedure for the evaluation of reference-scaled average bioequivalence. AAPS J. 2016;18(2):476–489.
- 24. Zhou H, Lee JJ, Yuan Y. BOP2: Bayesian optimal design for phase II clinical trials with simple and complex endpoints. Stat Med. 2017;36(21):3302–3314.
- 25. U.S. Food and Drug Administration (FDA). Guidance for industry: interacting with the FDA on complex innovative trial designs for drugs and biological products. https://www.fda.gov/media/130897/download. Published December 2020. Accessed June 16, 2022.
- 26. Wellek S. Testing Statistical Hypotheses of Equivalence and Noninferiority. 2nd ed. New York, NY: Chapman and Hall/CRC Press; 2010.
- 27. Schuirmann DJ. A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. J Pharmacokinet Biopharm. 1987;15(6):657–680.
- 28. Berger RL, Hsu JC. Bioequivalence trials, intersection-union tests and equivalence confidence sets. Stat Sci. 1996;11(4):283–319.
- 29. Rosenfeld PJ, Brown DM, Heier JS, et al. Ranibizumab for neovascular age-related macular degeneration. N Engl J Med. 2006;355(14):1419–1431.
- 30. Woo SJ, Veith M, Hamouz J, et al. Efficacy and safety of a proposed ranibizumab biosimilar product vs a reference ranibizumab product for patients with neovascular age-related macular degeneration: a randomized clinical trial. JAMA Ophthalmol. 2021;139(1):68–76.
- 31. Chow SC, Xu H, Endrenyi L, Song FY. A new scaled criterion for drug interchangeability. Chinese J Pharm Anal. 2015;35(5):844–848.