Abstract
A two-stage enrichment design is a type of adaptive design, which extends a stratified design with a futility analysis on the marker negative cohort at the first stage, and the second stage can be either a targeted design with only the marker positive stratum, or still the stratified design with both marker strata, depending on the result of the interim futility analysis. In this paper we consider the situation where the marker assay and the classification rule are possibly subject to error. We derive the sequential tests for the global hypothesis as well as the component tests for the overall cohort and the marker-positive cohort. We discuss the power analysis with the control of the type-I error rate and show the adverse impact of the misclassification on the powers. We also show the enhanced power of the two-stage enrichment over the one-stage design, and illustrate with examples of the recent successful development of immunotherapy in non-small-cell lung cancer.
Keywords: Enrichment Design, Predictive Biomarker, Sensitivity and Specificity, Composite Hypothesis
1 |. INTRODUCTION
With the advancement of new genomic and proteomic technologies, biomarkers are becoming increasingly important in drug discovery and development. Clinical trials that use biomarkers to select patients for targeted therapies have been appearing in the literature over the past decade. Generally speaking, marker-based trial designs can be broadly classified as retrospective or prospective, and as sequential or non-sequential. Recently Renfro et al.1 gave a review and provided many references and several examples. Shih and Lin2,3 considered hypotheses and relative efficiency of stratified designs and precision medicine designs, and showed that the stratified design is much more efficient than the so-called marker-based strategy designs. Wang et al.4,5 introduced a two-stage adaptive enrichment design, which (prospectively) randomizes an unselected patient population initially to either an experimental or a control group. If the experimental treatment effect does not pass a futility threshold in the marker-negative cohort at the interim analysis, accrual of the marker-negative cohort is stopped and the remaining sample size is re-allocated to the marker-positive cohort only. Diagrams 1 and 2 in Sections 3 and 4 depict this design in detail. As shown, the trial starts with a stratified randomization design at stage 1, and in the case of mid-trial enrichment (case IIB), the trial continues with a targeted design for stage 2; otherwise, it continues with the stratified design (case IIA). Simulation studies have shown that this design with possible mid-trial enrichment is uniformly more powerful than the “adaptive signature design” of Freidlin and Simon6 in detecting a subgroup-specific treatment effect4.
However, unlike traditional stratified designs that use patients’ baseline characteristics such as gender, age, race, or ECOG performance status, the biomarker status is prone to errors due to imperfect assays and/or classification rules. Several papers have also considered misclassification problems with biomarkers, but in different contexts. For example, Krisam and Kieser7 investigated decision rules for selection of target populations and procedures for calculating the sample size required to achieve a specified selection probability. Wang et al.8 focused on misclassification issues in non-inferiority trials. Wang and Li9 applied a machine learning algorithm to validate a biomarker’s predictive performance in assessing subpopulation-specific treatment effects. In this paper, we extend the methods of Wang et al.4,5 by considering the marker’s sensitivity and specificity. We discuss the hypotheses, critical values, interim analysis, type-I error rates and powers of the tests of treatment effects on the (unselected) entire study patient population and on the adaptively enriched patient subset. In Section 2, we first give the hypotheses of interest in enrichment designs. In Section 3, we set the foundation of a one-stage stratified design with marker classification errors. In Section 4, we detail the enrichment design with a futility analysis and the path of the second stage, and derive sequential tests of the hypotheses of interest. In Section 5, we give numerical examples from the recent successful development of immunotherapy in non-small-cell lung cancer. Section 6 provides final comments and discussions.
2 |. PARAMETERS AND HYPOTHESES
Consider a two-arm randomized controlled clinical trial to assess whether the effect of a test treatment is superior to control (standard of care). Suppose some baseline genomic biomarker (a single biomarker or a composite of markers) may be prognostic or predictive of a treatment effect. For simplicity, assume that the marker status is dichotomized (positive versus negative, or present versus absent). Let D be the true marker status, M be the marker-appeared status, and A be the treatment assignment. Let μij and σ²ij denote the mean and variance of the response variable Y of treatment group i (i = T or C) for patients in marker cohort j (j = 0 or 1; j = 0 being marker-negative and j = 1 being marker-positive). The marker assay and classification accuracy are measured by sensitivity λsen = P(M = 1|D = 1) and specificity λspec = P(M = 0|D = 0). The prevalence rate of true marker-positives is p = P(D = 1). Assume λsen, λspec, and p are known as parts of the trial design parameters. The randomization ratio is r to the test treatment and 1 − r to the control group. This randomization ratio applies to the stratified design in both stage 1 and stage 2. We now define the treatment effect parameters and label the corresponding null hypotheses as follows.
Denote the treatment effect for the marker-negative stratum by δ− = μT0 − μC0; the corresponding null hypothesis is H0− : δ− = 0. Denote the treatment effect for the marker-positive stratum by δ+ = μT1 − μC1; the corresponding null hypothesis is H0+ : δ+ = 0. The overall treatment effect is δ = w1δ+ + w0δ−, where w1(≥ 0) and w0(≥ 0) are some weights for treatment effects; the corresponding null hypothesis is H0a : δ = 0. An obvious choice of weights is the prevalence of the corresponding marker-defined cohorts; i.e., w1 = p and w0 = 1 − p. The resulting quantity δ = pδ+ + (1 − p)δ− is interpreted as the “treatment’s utility effect”2. Another choice is w1 = w0 = 1, which leads to the “absolute treatment effect” δ = δ+ + δ−.
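The weighted definition of the overall effect can be illustrated with a small numerical sketch. The function names below are hypothetical, and the inputs (p = 0.40, δ = 0.15, δ+ = 0.50) are borrowed from the numerical example in Section 5 under the assumption that the utility weights w1 = p and w0 = 1 − p apply there. Under that assumption, the sketch backs out the marker-negative effect implied by a given pair (δ, δ+) and compares the utility and absolute effects.

```python
def overall_effect(delta_pos, delta_neg, w1, w0):
    """Weighted overall effect: delta = w1 * delta_pos + w0 * delta_neg."""
    if w1 < 0 or w0 < 0:
        raise ValueError("weights must be non-negative")
    return w1 * delta_pos + w0 * delta_neg


def implied_negative_effect(delta, delta_pos, w1, w0):
    """Solve delta = w1 * delta_pos + w0 * delta_neg for delta_neg (needs w0 > 0)."""
    if w0 <= 0:
        raise ValueError("w0 must be positive to solve for delta_neg")
    return (delta - w1 * delta_pos) / w0


# Illustrative inputs (assumed, taken from the Section 5 example).
p = 0.40
delta_neg = implied_negative_effect(delta=0.15, delta_pos=0.50, w1=p, w0=1 - p)

# Utility effect (w1 = p, w0 = 1 - p) and absolute effect (w1 = w0 = 1).
utility = overall_effect(0.50, delta_neg, w1=p, w0=1 - p)
absolute = overall_effect(0.50, delta_neg, w1=1, w0=1)

print(f"implied delta-: {delta_neg:.4f}")
print(f"utility effect: {utility:.4f}, absolute effect: {absolute:.4f}")
```

With these inputs the implied marker-negative effect is slightly negative, which is one way to see why the overall cohort can have low power even when the marker-positive effect is large.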
For the overall patient population, we are interested in testing H0a : δ = 0 versus H1a : δ > 0.
For the marker-positive cohort, we are interested in testing H0+ : δ+ = 0 versus H1+ : δ+ > 0.
The marker-positive cohort is of special interest when δ << δ+, and/or when, because of sample size or other feasibility limitations, there may not be enough power to test H0a versus H1a. The goal of an enrichment design is to increase the efficiency of a study in the presence of a pre-identified subset of patients who may especially respond to the new therapy.
2.1 |. Composite and Individual Hypotheses
Since we are pursuing either δ > 0 for the unselected patient population, or δ+ > 0 for the marker-positive subset, we are testing the composite hypothesis H0 : δ = 0 and δ+ = 0 versus H1 : δ > 0 or δ+ > 0.
Throughout the remainder of this paper, we will work with one-sided tests and strongly control the (experiment-wise) type-I error rate α (e.g., at 0.025). The individual hypotheses H0a and H0+ will also be tested with assigned type-I error rates.
2.2 |. Positive and Negative Predictive Values
In the following, we first consider the stratified design of stage 1 for testing the marker-specific treatment effect hypotheses H− and H+. Denote . The positive predictive value PPV = P(D = 1|M = 1) is
Let
(1) |
and
(2) |
for i = T, C, where Δi = μi1 − μi0 is the true marker’s effect in treatment group i.
The negative predictive value NPV = P(D = 0|M = 0) is
Let
(3) |
and
(4) |
for i = T, C. See derivations of (2) and (4) in Shih and Lin3. Notice that under .
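As a numerical companion to the predictive values defined in this section, the sketch below applies Bayes' rule to the design parameters p, λsen and λspec. The function names are hypothetical, and the inputs (p = 0.40, λsen = λspec = 0.80) are the values used later in Section 5; it is an illustration under those assumptions rather than the authors' implementation.

```python
def apparent_prevalence(p, sens, spec):
    """P(M = 1): true positives detected plus true negatives misclassified."""
    return sens * p + (1.0 - spec) * (1.0 - p)


def ppv(p, sens, spec):
    """Positive predictive value P(D = 1 | M = 1) by Bayes' rule."""
    return sens * p / apparent_prevalence(p, sens, spec)


def npv(p, sens, spec):
    """Negative predictive value P(D = 0 | M = 0) by Bayes' rule."""
    return spec * (1.0 - p) / (spec * (1.0 - p) + (1.0 - sens) * p)


# Illustrative inputs (the Section 5 values).
p, sens, spec = 0.40, 0.80, 0.80
q = apparent_prevalence(p, sens, spec)
print(f"P(M=1) = {q:.3f}, PPV = {ppv(p, sens, spec):.3f}, NPV = {npv(p, sens, spec):.3f}")
```

With a perfect assay (λsen = λspec = 1) both predictive values equal 1 and the apparent prevalence equals p; any misclassification mixes true positives and true negatives within each marker-appeared cohort.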
3 |. STRATIFIED DESIGN
The following shows the diagram of the stratified design:
With the stratified design, the sample means are unbiased estimators of for M = 1 and for M = 0). The sample variances are unbiased estimators of , where Nij are the sample sizes for the set with (A = i, M = j). Note that , where and are the sample sizes for the marker-appeared positive cohort and the marker-appeared negative cohort after stratification, respectively.
3.1 |. Z test statistics
Following Shih and Lin3, the unbiased estimate of the true mean of the marker positive cohort is
(5) |
With its variance
(6) |
and variance estimator
(7) |
Z-statistic under H0+. If variances are not known, the Wald statistic AN(0, 1) under H0+ is readily formed from Eq. (5) and Eq. (7).
Unbiased estimator of the treatment effect in the marker negative cohort is
(8) |
with its variance
(9) |
and variance estimator
(10) |
Z-statistic AN(0, 1) under H0−. If variances are not known, the Wald test AN(0, 1) under H0− is readily formed from Eq. (8) and Eq. (10).
From Eqs. (5) and (8), an unbiased estimator of δ can be obtained by
(11) |
where w = w1 + w0 with its variance
(12) |
and variance estimator
(13) |
Wald test under H0. If variances are not known, the Wald test statistic under H0 is readily formed from Eqs. (11) and (13).
Notice that the and are correlated, with
Hence, we have
Next we consider the correlation between the test statistics and . We have
Similarly
Note that, in the above equations involving variance and covariance, we have defined:
3.2 |. Type I error and Power
When designing a trial with in mind, there are global and individual levels of the type-I error and power to be considered. The global type I error rate is associated with testing the composite hypothesis H0:
The critical value c1 is obtained by allocating an alpha for the test of H0a. Then c2 can be solved for testing H0+ in the above equation. The corresponding global power for testing the composite hypothesis H0 is
With c1 and c2 solved, the power for testing the individual hypothesis H1a and H1+ are as follows, respectively:
and
4 |. TWO-STAGE ENRICHMENT DESIGN
The above discussion was a one-stage design without enrichment consideration. Now we consider the two-stage enrichment design. At Stage I, we stratify tN patients the same as shown in the diagram in Section 3. The tests are all the same as in Section 3.1 with sample size tN instead of N. If H0 is rejected, we stop the trial and claim that treatment is effective either for the overall (unselected) patient population or for the marker-positive patients. The tests of the individual hypotheses H1a and H1+ would shed more light on the conclusion. However, if H0 is not rejected, the trial will then continue to accrue (1 − t)N more patients in Stage II with two different scenarios, IIA or IIB, depending on the futility criterion (see Section 4.3) for the marker-negative patients, as shown in the following diagram.
In the following, we will discuss the details of scenarios IIA and IIB, and the tests of the hypotheses in each scenario. First, let
for i = T, C and j = 0, 1, where is the sample size at the kth stage, , and . Then we have
If the trial continues with IIA, i.e., treatment is not futile on marker-negatives, then
If the trial continues with IIB, i.e., terminating the marker-appeared negatives and enriching the marker-appeared positives, then
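The two Stage II paths can be made concrete with a rough sketch of expected cell sizes. It assumes that marker-appeared positives occur with probability P(M = 1) = λsen·p + (1 − λspec)(1 − p), that tN patients are enrolled at Stage I and (1 − t)N at Stage II, and that under IIB all Stage II patients are marker-appeared positives. The function name and the example inputs are hypothetical, and the counts are expectations that ignore sampling variability.

```python
def expected_stage_sizes(N, t, r, p, sens, spec, enrich):
    """Expected sample sizes keyed by (stage, arm, M).

    enrich=False corresponds to scenario IIA (both marker-appeared cohorts continue);
    enrich=True corresponds to scenario IIB (only marker-appeared positives continue).
    """
    q = sens * p + (1.0 - spec) * (1.0 - p)  # apparent prevalence P(M = 1)
    n_stage1 = t * N
    n_stage2 = (1.0 - t) * N
    q_stage2 = 1.0 if enrich else q          # under IIB every Stage II patient has M = 1
    sizes = {}
    for arm, frac in (("T", r), ("C", 1.0 - r)):
        sizes[(1, arm, 1)] = frac * n_stage1 * q
        sizes[(1, arm, 0)] = frac * n_stage1 * (1.0 - q)
        sizes[(2, arm, 1)] = frac * n_stage2 * q_stage2
        sizes[(2, arm, 0)] = frac * n_stage2 * (1.0 - q_stage2)
    return sizes


# Hypothetical design: N = 600, interim at t = 0.5, 1:1 randomization.
iia = expected_stage_sizes(600, 0.5, 0.5, p=0.3, sens=0.8, spec=0.8, enrich=False)
iib = expected_stage_sizes(600, 0.5, 0.5, p=0.3, sens=0.8, spec=0.8, enrich=True)
print("IIA, stage 2, test arm, M=1:", round(iia[(2, "T", 1)], 1))
print("IIB, stage 2, test arm, M=1:", round(iib[(2, "T", 1)], 1))
```

In this hypothetical setting, enrichment raises the expected Stage II marker-appeared positive count per arm from 57 to 150, which is the source of the efficiency gain for the marker-positive test.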
4.1 |. Z test statistics
For Stage I, formulas for treatment effect estimates and tests are similar to Section 3.1 with sample size changed to tN. For clear notation and completeness, they are:
and the Z-statistics for testing the individual and composite hypotheses are:
Going to the second stage, under scenario IIA, we continue the trial with all (marker-unselected) patients. The estimates of the treatment effects combine the data from both stages as follows:
The Z-statistics for testing the individual and composite hypotheses are:
Under scenario IIB, we continue the trial with enrichment of the marker-appeared positive patients. The estimates of the treatment effects are
The Z-statistic for H0+ is:
4.2 |. Correlation of the tests
To control the type I error rate and to estimate the power, we need to calculate the correlation matrix of for scenario IIA and that of for scenario IIB. In Appendices A.1 and A.2, it is shown that the covariance matrix of the Z-statistics for scenario IIA is:
(14) |
And the covariance matrix of the Z-statistics for scenario IIB is
(15) |
Hence we have
4.3 |. Type I error and critical values
For the two-stage design, unlike the one-stage design in Section 3.2, we first need to split the overall alpha (e.g., α = 0.025) between the two stages. In Stage I, we spend only a fraction of the overall alpha, α1, on testing the global hypothesis H0 :
As before, the critical value c1 is obtained by allocating a portion of α1, α1a, for testing H0a. Then c2 can be solved for testing H0+ in the above equation.
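The two-step solution for (c1, c2) can be sketched numerically. The code below assumes that, under H0, the two Stage I statistics are jointly standard bivariate normal with a known correlation `rho`; in the paper that correlation follows from the covariance formulas of Section 3.1, whereas here it is simply a hypothetical input. The function names, the Simpson-rule integration and the bisection search are illustrative choices, not the authors' R implementation.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal


def joint_cdf(a, b, rho, lower=-8.0, n=1000):
    """P(X <= a, Y <= b) for a standard bivariate normal with correlation rho.

    Uses P = integral over x <= a of phi(x) * Phi((b - rho*x) / sqrt(1 - rho^2)),
    approximated by composite Simpson's rule on [lower, a] (n must be even).
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must be strictly between -1 and 1")
    scale = sqrt(1.0 - rho * rho)
    h = (a - lower) / n

    def integrand(x):
        return STD.pdf(x) * STD.cdf((b - rho * x) / scale)

    total = integrand(lower) + integrand(a)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * integrand(lower + k * h)
    return total * h / 3.0


def solve_c1_c2(alpha_stage, alpha_a, rho):
    """c1 from the alpha allocated to H0a; c2 so that P(Za > c1 or Z+ > c2) = alpha_stage."""
    c1 = STD.inv_cdf(1.0 - alpha_a)
    lo, hi = 0.0, 8.0
    for _ in range(60):  # bisection: joint_cdf is increasing in its second argument
        mid = 0.5 * (lo + hi)
        if joint_cdf(c1, mid, rho) < 1.0 - alpha_stage:
            lo = mid
        else:
            hi = mid
    return c1, 0.5 * (lo + hi)


# alpha1 = 0.01 and alpha1a = 0.005 as in Table 1; rho = 0.5 is a hypothetical value.
c1, c2 = solve_c1_c2(alpha_stage=0.01, alpha_a=0.005, rho=0.5)
print(f"c1 = {c1:.6f}, c2 = {c2:.6f}")
```

The value of c1 depends only on α1a, which is why it is constant across Table 1. A positive correlation between the two statistics lets c2 fall below the value it would take if they were independent, so less of the Stage I alpha is wasted on joint rejections.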
For Stage II, the remaining alpha, α − α1, is to be controlled across the mutually exclusive scenarios IIA and IIB. Suppose we allocate a fraction α2 < α − α1 for the tests in scenario IIA and the rest for scenario IIB as follows:
A futility criterion for marker-negative patients will be used to determine α2 and . This can be done through a pre-specified threshold value c0 for the test statistic , or through the futility probability ; see Eq. (16). For example, if we want the futility probability to be 75% (50%), then from .
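The mapping between the futility probability and the threshold c0 can be checked with a short sketch. It assumes that the futility statistic is standard normal under the null hypothesis of no effect in the marker-negative cohort, so that c0 is the corresponding normal quantile; this assumption reproduces the c0 column of Table 1. The function name is hypothetical.

```python
from statistics import NormalDist


def futility_threshold(futility_prob):
    """c0 such that P(Z <= c0) = futility_prob for a standard normal Z."""
    if not 0.0 < futility_prob < 1.0:
        raise ValueError("futility probability must be strictly between 0 and 1")
    return NormalDist().inv_cdf(futility_prob)


c0_50 = futility_threshold(0.50)
c0_75 = futility_threshold(0.75)
print(f"futility 50%: c0 = {c0_50:.6f}")
print(f"futility 75%: c0 = {c0_75:.6f}")
```

A larger futility probability corresponds to a stricter threshold, so the marker-negative cohort is dropped more often at the interim analysis.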
In case of IIA, the test treatment passes the pre-defined futility threshold value c0, i.e., . We continue with both marker-status cohorts and test H0 at Stage II. The alpha is controlled by
That is, the critical value b1 is obtained by allocating α2a = πα2, a portion π of α2, for testing H0a. Then b2 can be solved for testing H0+ in the above equation.
In case of IIB, the test treatment is futile on marker-negatives, i.e., . We continue with enriching marker-appeared positives and test only at Stage IIB. The alpha is controlled by
with the critical value b3 solved numerically using the correlation matrix.
The strategy of allocating alphas to either IIA or IIB is an important design consideration. Since the trial will only be in one scenario or the other, it would be ideal to maximize the alpha in either scenario. Toward this end, we can first rewrite α2 and , respectively, as
and
Next, if we split α − α1 into IIA and IIB with the same proportion as the odds of , i.e.,
(16) |
then we have
and
This indicates that, with the above alpha allocation strategy, the corresponding conditional type I error is α − α1 for either IIA or IIB. When the odds of non-futility versus futility is pre-determined, the critical values b1 and b2 are calculated such that
for scenario IIA based on the joint distribution of , and b3 is calculated such that
for scenario IIB based on the joint distribution of .
In summary, assuming that information for p, λsen, λspec, are available from earlier data, with specification of the design parameters α, α1, α1a, c0, α2a, r, t, w1, and w0, we can find the critical values c1, c2, b1, b2, b3 for the study based on the above formulas. Notice that α2 is determined by α − α1 and c0.
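The alpha bookkeeping summarized above can be sketched as follows. The code assumes that the remaining alpha α − α1 is split between IIA and IIB in proportion to the null probabilities of non-futility and futility, with the futility probability taken as Φ(c0); this reading is consistent with the numbers quoted in Section 5 but is stated here as an assumption. The function and key names are hypothetical.

```python
from statistics import NormalDist


def allocate_alpha(alpha, alpha1, c0, pi):
    """Split the Stage II alpha between scenarios IIA and IIB.

    alpha  : overall one-sided type-I error rate
    alpha1 : alpha spent at Stage I
    c0     : futility threshold for the marker-negative statistic
    pi     : portion of the IIA alpha assigned to the overall-cohort test (H0a)
    """
    futility_prob = NormalDist().cdf(c0)        # assumed P(futile) under the null
    remaining = alpha - alpha1
    alpha2 = remaining * (1.0 - futility_prob)  # scenario IIA
    alpha2_iib = remaining * futility_prob      # scenario IIB
    return {
        "futility_prob": futility_prob,
        "alpha2_IIA": alpha2,
        "alpha2_IIB": alpha2_iib,
        "alpha2a": pi * alpha2,                 # H0a within IIA
        "alpha2_pos": (1.0 - pi) * alpha2,      # H0+ within IIA
    }


# alpha = 0.025, alpha1 = 0.01, c0 = 0 (50% futility), equal split within IIA.
plan = allocate_alpha(alpha=0.025, alpha1=0.01, c0=0.0, pi=0.5)
for name, value in plan.items():
    print(f"{name}: {value:.5f}")
```

With a 50% futility probability the two scenarios each receive 0.0075, and an equal split within IIA assigns 0.00375 to each of the overall-cohort and marker-positive tests.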
Table 1 provides the critical values for some commonly considered design parameters α = 0.025, α1 = 0.01, α1a = 0.005, = 0.5, and r = 0.5 with various p, λsen, λspec, and t, assuming , for w1 = p, w0 = 1 − p. As noted in Section 2.2, under . In many cases we do not expect opposite marker effects in the test versus control treatment groups when there is no treatment effect, thus setting ΔT = −ΔC = 0 is reasonable.
TABLE 1.
t | p | Futility prob. | (λsen, λspec) | c0 | c1 | c2 | b1 | b2 | b3 |
---|---|---|---|---|---|---|---|---|---|
0.25 | 0.3 | 0.50 | (1, 1) | 0.000000 | 2.575829 | 2.532415 | 2.594235 | 2.328787 | 2.157164 |
0.25 | 0.3 | 0.50 | (0.8, 1) | 0.000000 | 2.575829 | 2.542259 | 2.590169 | 2.287361 | 2.185600 |
0.25 | 0.3 | 0.50 | (1, 0.8) | 0.000000 | 2.575829 | 2.549772 | 2.585196 | 2.239521 | 2.399992 |
0.25 | 0.3 | 0.50 | (0.8, 0.8) | 0.000000 | 2.575829 | 2.558969 | 2.571783 | 2.143252 | 2.405538 |
0.25 | 0.3 | 0.75 | (1, 1) | 0.674490 | 2.575829 | 2.532415 | 2.745093 | 2.313190 | 2.157104 |
0.25 | 0.3 | 0.75 | (0.8, 1) | 0.674490 | 2.575829 | 2.542259 | 2.734727 | 2.239030 | 2.173154 |
0.25 | 0.3 | 0.75 | (1, 0.8) | 0.674490 | 2.575829 | 2.549772 | 2.720949 | 2.155968 | 2.256922 |
0.25 | 0.3 | 0.75 | (0.8, 0.8) | 0.674490 | 2.575829 | 2.558969 | 2.688348 | 1.999910 | 2.257962 |
0.25 | 0.4 | 0.50 | (1, 1) | 0.000000 | 2.575829 | 2.513759 | 2.584926 | 2.304964 | 2.153711 |
0.25 | 0.4 | 0.50 | (0.8, 1) | 0.000000 | 2.575829 | 2.528913 | 2.578815 | 2.256674 | 2.192592 |
0.25 | 0.4 | 0.50 | (1, 0.8) | 0.000000 | 2.575829 | 2.533966 | 2.575764 | 2.234588 | 2.387809 |
0.25 | 0.4 | 0.50 | (0.8, 0.8) | 0.000000 | 2.575829 | 2.549772 | 2.558687 | 2.132960 | 2.403011 |
0.25 | 0.4 | 0.75 | (1, 1) | 0.674490 | 2.575829 | 2.513759 | 2.725928 | 2.287686 | 2.153730 |
0.25 | 0.4 | 0.75 | (0.8, 1) | 0.674490 | 2.575829 | 2.528913 | 2.709532 | 2.198006 | 2.175594 |
0.25 | 0.4 | 0.75 | (1, 0.8) | 0.674490 | 2.575829 | 2.533966 | 2.701773 | 2.158765 | 2.253450 |
0.25 | 0.4 | 0.75 | (0.8, 0.8) | 0.674490 | 2.575829 | 2.549772 | 2.663237 | 1.991353 | 2.257476 |
0.25 | 0.5 | 0.50 | (1, 1) | 0.000000 | 2.575829 | 2.491965 | 2.573627 | 2.278894 | 2.150032 |
0.25 | 0.5 | 0.50 | (0.8, 1) | 0.000000 | 2.575829 | 2.513759 | 2.564055 | 2.223744 | 2.200070 |
0.25 | 0.5 | 0.50 | (1, 0.8) | 0.000000 | 2.575829 | 2.513759 | 2.563934 | 2.223621 | 2.368886 |
0.25 | 0.5 | 0.50 | (0.8, 0.8) | 0.000000 | 2.575829 | 2.538317 | 2.543658 | 2.117890 | 2.398080 |
0.25 | 0.5 | 0.75 | (1, 1) | 0.674490 | 2.575829 | 2.491965 | 2.704109 | 2.261000 | 2.150036 |
0.25 | 0.5 | 0.75 | (0.8, 1) | 0.674490 | 2.575829 | 2.513759 | 2.680187 | 2.155551 | 2.178152 |
0.25 | 0.5 | 0.75 | (1, 0.8) | 0.674490 | 2.575829 | 2.513759 | 2.679966 | 2.155500 | 2.246676 |
0.25 | 0.5 | 0.75 | (0.8, 0.8) | 0.674490 | 2.575829 | 2.538317 | 2.635724 | 1.977861 | 2.256011 |
0.5 | 0.3 | 0.50 | (1, 1) | 0.000000 | 2.575829 | 2.532415 | 2.564274 | 2.279878 | 2.142149 |
0.5 | 0.3 | 0.50 | (0.8, 1) | 0.000000 | 2.575829 | 2.542259 | 2.563423 | 2.218527 | 2.187922 |
0.5 | 0.3 | 0.50 | (1, 0.8) | 0.000000 | 2.575829 | 2.549772 | 2.561436 | 2.144792 | 2.349097 |
0.5 | 0.3 | 0.50 | (0.8, 0.8) | 0.000000 | 2.575829 | 2.558969 | 2.552805 | 1.987892 | 2.375525 |
0.5 | 0.3 | 0.75 | (1, 1) | 0.674490 | 2.575829 | 2.532415 | 2.744433 | 2.254123 | 2.142182 |
0.5 | 0.3 | 0.75 | (0.8, 1) | 0.674490 | 2.575829 | 2.542259 | 2.737397 | 2.146473 | 2.168536 |
0.5 | 0.3 | 0.75 | (1, 0.8) | 0.674490 | 2.575829 | 2.549772 | 2.726401 | 2.020355 | 2.229198 |
0.5 | 0.3 | 0.75 | (0.8, 0.8) | 0.674490 | 2.575829 | 2.558969 | 2.695093 | 1.803399 | 2.237989 |
0.5 | 0.4 | 0.50 | (1, 1) | 0.000000 | 2.575829 | 2.513759 | 2.556849 | 2.253368 | 2.134343 |
0.5 | 0.4 | 0.50 | (0.8, 1) | 0.000000 | 2.575829 | 2.528913 | 2.554990 | 2.179122 | 2.194796 |
0.5 | 0.4 | 0.50 | (1, 0.8) | 0.000000 | 2.575829 | 2.533966 | 2.553545 | 2.144605 | 2.323219 |
0.5 | 0.4 | 0.50 | (0.8, 0.8) | 0.000000 | 2.575829 | 2.549772 | 2.541300 | 1.976017 | 2.365879 |
0.5 | 0.4 | 0.75 | (1, 1) | 0.674490 | 2.575829 | 2.513759 | 2.728359 | 2.224014 | 2.134356 |
0.5 | 0.4 | 0.75 | (0.8, 1) | 0.674490 | 2.575829 | 2.528913 | 2.715450 | 2.090076 | 2.169073 |
0.5 | 0.4 | 0.75 | (1, 0.8) | 0.674490 | 2.575829 | 2.533966 | 2.708449 | 2.029105 | 2.217337 |
0.5 | 0.4 | 0.75 | (0.8, 0.8) | 0.674490 | 2.575829 | 2.549772 | 2.670015 | 1.760296 | 2.233793 |
0.5 | 0.5 | 0.50 | (1, 1) | 0.000000 | 2.575829 | 2.491965 | 2.547819 | 2.225199 | 2.126321 |
0.5 | 0.5 | 0.50 | (0.8, 1) | 0.000000 | 2.575829 | 2.513759 | 2.543463 | 2.137472 | 2.201825 |
0.5 | 0.5 | 0.50 | (1, 0.8) | 0.000000 | 2.575829 | 2.513759 | 2.543303 | 2.137357 | 2.294858 |
0.5 | 0.5 | 0.50 | (0.8, 0.8) | 0.000000 | 2.575829 | 2.538317 | 2.360600 | 1.931625 | 2.355433 |
0.5 | 0.5 | 0.75 | (1, 1) | 0.674490 | 2.575829 | 2.491965 | 2.708278 | 2.193302 | 2.126300 |
0.5 | 0.5 | 0.75 | (0.8, 1) | 0.674490 | 2.575829 | 2.513759 | 2.687160 | 2.031972 | 2.169525 |
0.5 | 0.5 | 0.75 | (1, 0.8) | 0.674490 | 2.575829 | 2.513759 | 2.687205 | 2.031861 | 2.203476 |
0.5 | 0.5 | 0.75 | (0.8, 0.8) | 0.674490 | 2.575829 | 2.538317 | 2.640342 | 1.744658 | 2.228230 |
Futility prob. = futility probability at first stage
4.4 |. Global and marginal power
Under alternatives we have
The global power is:
where
(17) |
(18) |
(19) |
Power for testing the treatment effect in the overall cohort is:
where and P2a is defined in Eq. (17).
Power for testing the treatment effect in the marker-positive is:
where
and as defined in Eq. (19). Notice p2+ has the term Z ≤ b1 but does not.
Figures 1, 2 and 3 show the contour plots of power surfaces for the global (testing H0), overall cohort (testing H0a) and marker-positive cohort (testing H0+) hypotheses, respectively, across 0.10 ≤ δ ≤ 0.40 and 0.20 ≤ δ+ ≤ 0.55 for α = 0.025 by N, p, λsen and λspec assuming ΔC = 0, , c0 = 0, r = 0.5 and t = 0.5. Note that . From Figure 1, we see that the global power increases in both δ and δ+. The power also increases as N, p, λsen or λspec increases. The impact of the imperfect assay and classification rule through λsen and λspec is obvious when contrasting the power figures with the same p and N. For example, with p = 0.4 and N = 500, δ = 0.15 and δ+ = 0.5, Figure 1 A with λsen = λspec = 1 shows power > 90%; however, Figure 1 D with λsen = λspec = 0.8 shows power of only about 70%. We also see that λsen has less impact than λspec in terms of global power. Figure 3 shows that the power for testing the marker-positive cohort effect increases with increasing δ+, but decreases with increasing δ for fixed δ+. Similar to the global power, λsen has less impact than λspec on the power for the marker-positive cohort effect. Figure 2 shows, as expected, that the power for testing the overall cohort effect increases with increasing δ for fixed δ+, but λsen and λspec have similar impact on the power for the overall cohort effect.
4.5 |. Sample size determination
Given the type I error rate and power, we can use the information in Sections 4.3 and 4.4 to determine the sample size needed for the study. Tables B1 and B2 in Appendix B show the total sample size based on the marker-positive and global powers of 80% and 90% with various p, δ, δ+, λsen, λspec and information time t for α = 0.025 (one-sided test) assuming .
Table B1 is useful when the study plan targets sufficient power (e.g., 80% or 90%) for testing H0+ in the marker-positive cohort. It shows the required total sample size N, the resulting global power, and the power for testing H0a in the whole population. In this case, the global power is always higher, since it pertains to the composite hypothesis H0. The power for the whole population would be low when the emphasis is on the more promising marker-positive subset with a fairly likely enrichment (futility probability = 0.5). This is reflected by displaying δ = 0.15 while δ+ = 0.3 to 0.5 in this table. In these situations, although the power is low for testing the overall treatment effect, the design has sufficient power to detect the treatment effect in the marker-positive subset. This indicates that fewer patients are needed when the main objective of the study is to show a treatment effect in the marker-positive subset. Clearly this is the main advantage of the two-stage enrichment design. Contrasting the case of λsen = λspec = 0.8 with the remaining columns, for the same t and p, we can see the large impact of an imperfect assay and classification rule on the sample size requirement for a study.
Table B2 is useful when the objective of the trial is set on testing the composite hypothesis H0. The sample size is calculated to ensure that there is enough power to find significance on either the whole population or the marker-positive subset under H1. As shown, the total sample size monotonically decreases as either δ or δ+ increases. We also see the adverse impact on the sample size requirement with imperfect marker assay and classification rule. Moreover, λspec has more impact on the total sample sizes than λsen. The required total sample size is larger for later interim analysis time t since the enrichment would be delayed.
5 |. NUMERICAL EXAMPLES
Immunotherapy is a new paradigm for the treatment of non-small-cell lung cancer (NSCLC), and targeting the PD-1/PD-L1 pathway is a promising therapeutic option. Pembrolizumab is a new immunotherapy that blocks the PD-1 pathway, restoring the body’s immune response so that the immune system can recognize and kill cancer cells. We use the KEYNOTE-010 trial as an example to illustrate our method. This was a (phase 2/3) randomized trial to study Pembrolizumab versus Docetaxel for previously treated, PD-L1-positive, advanced NSCLC patients10. This trial stratified qualified subjects by the biomarker’s TPS (tumor proportion score ≥ 50% vs 1 – 49%), which measures the extent of PD-L1 expression, then randomized subjects with a 1:1:1 ratio to three treatment groups within each of the high and low TPS strata. The companion diagnostic assay for PD-L1 expression was the Dako EnVision FLEX+HRP-Polymer kit using the 22C3 antibody clone, which was validated in the phase 1 KEYNOTE-001 trial11. Here we use the phase 1 KEYNOTE-001 data as the basis to “re-design” KEYNOTE-010 as an “imaginary” two-stage enrichment trial to illustrate our method. For illustration, and because there was no significant difference between the two test doses of Pembrolizumab, we only look at Pembrolizumab 2 mg/kg (Pem) versus Docetaxel (Dox), which is the control/standard-of-care.
From the phase 1 KEYNOTE-001 trial (Figs S3A and S4)11, the prevalence was about 0.39 for TPS < 1%, 0.38 for TPS = 1–49%, and 0.23 for TPS ≥ 50%; λsen = λspec ≈ 0.80. Thus, we estimate the prevalence rate of the PD-L1 true “strongly positives” (TPS ≥ 50%) among the PD-L1 positive (TPS ≥ 1%) NSCLC patients being p ≈ 0.40; the appeared PD-L1 “strongly positive” prevalence is . Hence, , and .
The real phase 2/3 KEYNOTE-010 trial had both overall survival and progression-free survival (PFS) as primary endpoints. We only use PFS for the “imaginary” trial for illustration purposes. Suppose that the overall (one-sided) α = 0.025 and an interim analysis is planned at t = 0.5, with α1 = 0.01 allocated for Stage I. As stated in Herbst et al.10, the study aimed to show a benefit of Pem over Dox in PFS in patients with TPS ≥ 50% as well as in the whole TPS ≥ 1% cohort, so we allocate α1a = α1b = α1/2 = 0.005. Moreover, assume that we set the futility probability for the PD-L1 “weakly positive” (TPS 1–49%) subset to be 50% (implying c0 = 0). Then from Eq. (16), α2 = (0.025 – 0.01)/2 = 0.0075 for Stage IIA, and an equal amount of 0.0075 for Stage IIB. In case of Stage IIA, we further allocate α2a = 0.00375 for testing the overall cohort, and equally the remaining 0.00375 for the PD-L1 strongly positive subset. The log-rank test on the hazard ratios of PFS assuming an exponential distribution justifies . From Table 1, these design parameters lead to the critical values (c1, c2, b1, b2, b3) = (2.5758, 2.5498, 2.5413, 1.9760, 2.3659).
To illustrate power calculation we take the treatment effect information from Herbst et al.10. Let δ be the log-hazard ratio (Dox vs Pem) = 0.15 for the overall cohort. With a total of N = 1,000 patients and prevalence rate p = 0.40, we expect to have 190 PD-L1 strongly positive patients enrolled for the first stage and perform an interim analysis at t = 0.5. Assume δ+ = log-hazard ratio (Dox vs Pem) = 0.50 for the PD-L1 strongly positive subset. The power for testing the global hypothesis H0 is 96% (> 95% from Fig 1 D), the power for testing H0+ is 90% (> 85% from Fig 3 D) for the PD-L1 strongly positive subset, and the power for testing H0a is 26% (about 25% from Fig 2 D) for the overall cohort. With N = 500 patients, power for testing the global hypothesis H0 is 75%, power for testing H0+ is 69% for the PD-L1 strongly positive subset, and power for testing H0a is 14% for the overall cohort.
To illustrate sample size calculation, we use the same alpha allocation, λsen = λspec = 0.80, and treatment effects δ = 0.15 and δ+ = 0.50 as above. Trials usually aim for a power of either 80% or 90% for the marker-positive subset. Assuming a target of 90% power for testing H0+, Table B1 shows that a total sample size of N = 1004 is needed. The global power for the composite hypothesis is 96%, and the power for the overall cohort is 26%. If the target power is 80% for testing H0+, then N = 671 is needed. The global power for the composite hypothesis is 86%, and the power for the overall cohort is 18.5%.
6 |. DISCUSSION
First, the two-stage enrichment design discussed in this paper should not be confused with the so-called “targeted design”, which is usually a one-stage design but is sometimes also called an “enriched design” or “enrichment design” by some authors. Gao et al.12 proposed a multistage biomarker-directed targeted design, not an enrichment design. The targeted design does not include the biomarker-negative subset; hence only the hypothesis H0+ can be tested, not the hypothesis H0a on the overall population. The targeted design is suitable when there is already evidence that the treatment may have a beneficial effect only on the marker-positive subset, not on the marker-negative subset. On the other hand, the sequential two-stage enrichment design is suitable for the situation where such evidence is lacking. It is interesting to note that the KEYNOTE-010 trial, which included all TPS ≥ 1% NSCLC patients to receive the test treatment as their second-line therapy, showed a significant treatment effect of Pembrolizumab over the SOC control for the PD-L1 strongly positive subset, but not in the overall cohort10. With this evidence as background, the next trial, KEYNOTE-02413, for Pembrolizumab to treat NSCLC patients as their first-line therapy, used a “targeted design”, enrolled only the PD-L1 strongly positive cohort (TPS ≥ 50%), and was a successful study that received the FDA’s fast-track approval.
The two-stage enrichment design, in which enrichment occurs at the second stage by stopping the enrollment of marker-negatives as indicated by a futility analysis and continuing enrollment of only the marker-positives, is more efficient than the conventional stratified design without enrichment in testing both H0+ and H0. For example, as illustrated in Section 5, for overall α = 0.025 with α1 = 0.01 allocated to testing H0 at stage I and the other alpha allocations therein, and assuming sensitivity = specificity = 0.8 to detect treatment effects δ = 0.15 and δ+ = 0.50, a target power of 80% for testing H0+ requires N = 671 with the proposed two-stage enrichment design (as shown in Table B1). The global power for the composite hypothesis is 86%, and the power for the overall cohort is 18.5%. In contrast, without such enrichment, following simply the one-stage stratified design in Section 3, the required sample size would be larger, N = 714. The global power for the composite hypothesis would be 82%, and the power for the overall cohort would be 27.5%. The two-stage enrichment design thus enhances the power for testing the (more hopeful) marker-positive subset and the global hypothesis with a smaller sample size, at the cost of reduced power for the (less hopeful) overall cohort.
In addition, Figure 4 compares the global power and sample size of the 2-stage and single-stage designs for various δ, δ+, λsen and λspec when α = 0.025, with α1 = 0.005, α2 = 0.02, r = 0.5, p = 0.3 for the single-stage design, and α1a = 0.005, α1b = 0.005, r = 0.5, p = 0.3, t = 0.5 for the 2-stage design. The top row of the graph compares the powers of the 2-stage and single-stage designs with the same total sample size; the reference line is the line on which the horizontal and vertical values are equal. With the same sample size, the global power of the 2-stage design is larger than that of the single-stage design, since all the curves lie above the reference line. The bottom row of the graph shows the ratio of the sample sizes needed for the 2-stage design versus the single-stage design in order to achieve the same global power, from 60% to 90%. The sample size needed for the 2-stage design is much smaller than that for the single-stage design at the same global power.
Finally, in addition to the tables and graphs provided as a part of the illustration in this paper, we make our R programs available upon request for readers to calculate power and sample size for their enrichment designs as needed. The current research is limited to trial designs with two treatment groups and a fixed treatment allocation ratio r, as shown in the design schema. Extension to more than two treatment groups and application of the method of Tang and Zhou14 for optimal allocation in the enrichment setting are potential topics for future research.
ACKNOWLEDGEMENTS
The authors are grateful to the two referees and the associate editor for their helpful comments, which improved the presentation. The research of YL, WJS and SL was partially supported by NIH/NCI CCSG Grant 3P30CA072720.
APPENDIX
A. DERIVATIONS OF EQUATIONS (14) AND (15)
A.1. Correlation matrix of the Z-statistics for scenario IIA
Between the Z-statistics of stage 1, we have from Section 3.1:
Between the Z-statistics of Stage I and Stage II, since
we have
Furthermore, from
we have
In summary, the covariance matrix of the Z-statistics for scenario IIA is:
A.2. Correlation matrix of the Z-statistics for scenario IIB
For Scenario IIB, since
we have
In summary, the covariance matrix of the Z-statistics for scenario IIB is
B. TOTAL SAMPLE SIZE TABLES FOR SOME COMMON SCENARIOS
TABLE B1.
Columns are indexed by t and (λsen, λspec). Rows give the total sample size N, the global power (Global) and the power for H1a.

80% power for testing H0+

p | δ+ | Quantity | t = 0.25, (1, 1) | t = 0.25, (0.8, 1) | t = 0.25, (1, 0.8) | t = 0.25, (0.8, 0.8) | t = 0.5, (1, 1) | t = 0.5, (0.8, 1) | t = 0.5, (1, 0.8) | t = 0.5, (0.8, 0.8)
---|---|---|---|---|---|---|---|---|---|---
0.3 | 0.3 | N | 1519 | 2261 | 9109 | 16831 | 1709 | 2778 | 4636 | 8418
0.3 | 0.3 | Global | 0.914 | 0.967 | 1.000 | 1.000 | 0.938 | 0.988 | 1.000 | 1.000
0.3 | 0.3 | H1a | 0.489 | 0.644 | 0.919 | 0.994 | 0.569 | 0.770 | 0.920 | 0.994
0.3 | 0.4 | N | 670 | 880 | 1236 | 7589 | 697 | 931 | 1367 | 3882
0.3 | 0.4 | Global | 0.845 | 0.870 | 0.905 | 1.000 | 0.853 | 0.886 | 0.932 | 0.999
0.3 | 0.4 | H1a | 0.199 | 0.267 | 0.369 | 0.838 | 0.223 | 0.309 | 0.447 | 0.841
0.3 | 0.5 | N | 374 | 479 | 702 | 1162 | 380 | 483 | 714 | 1290
0.3 | 0.5 | Global | 0.824 | 0.836 | 0.855 | 0.909 | 0.826 | 0.842 | 0.866 | 0.941
0.3 | 0.5 | H1a | 0.102 | 0.133 | 0.193 | 0.313 | 0.113 | 0.149 | 0.220 | 0.396
0.4 | 0.3 | N | 934 | 1226 | 1480 | 3152 | 987 | 1368 | 1664 | 4627
0.4 | 0.3 | Global | 0.856 | 0.891 | 0.912 | 0.992 | 0.867 | 0.917 | 0.941 | 1.000
0.4 | 0.3 | H1a | 0.284 | 0.374 | 0.438 | 0.67 | 0.328 | 0.458 | 0.539 | 0.905
0.4 | 0.4 | N | 441 | 552 | 712 | 1135 | 453 | 568 | 721 | 1223
0.4 | 0.4 | Global | 0.821 | 0.836 | 0.844 | 0.895 | 0.823 | 0.843 | 0.853 | 0.923
0.4 | 0.4 | H1a | 0.116 | 0.149 | 0.189 | 0.296 | 0.131 | 0.172 | 0.217 | 0.37
0.4 | 0.5 | N | 259 | 315 | 436 | 674 | 264 | 322 | 433 | 671
0.4 | 0.5 | Global | 0.813 | 0.82 | 0.823 | 0.85 | 0.812 | 0.822 | 0.827 | 0.861
0.4 | 0.5 | H1a | 0.065 | 0.079 | 0.104 | 0.158 | 0.072 | 0.091 | 0.117 | 0.185
0.5 | 0.3 | N | 678 | 849 | 958 | 1555 | 699 | 895 | 981 | 1842
0.5 | 0.3 | Global | 0.827 | 0.852 | 0.852 | 0.922 | 0.831 | 0.863 | 0.863 | 0.960
0.5 | 0.3 | H1a | 0.184 | 0.235 | 0.261 | 0.399 | 0.211 | 0.283 | 0.306 | 0.539
0.5 | 0.4 | N | 341 | 412 | 506 | 764 | 348 | 426 | 502 | 776
0.5 | 0.4 | Global | 0.812 | 0.821 | 0.821 | 0.849 | 0.810 | 0.823 | 0.822 | 0.862
0.5 | 0.4 | H1a | 0.080 | 0.098 | 0.116 | 0.175 | 0.090 | 0.116 | 0.132 | 0.210
0.5 | 0.5 | N | 206 | 248 | 318 | 482 | 212 | 258 | 314 | 476
0.5 | 0.5 | Global | 0.807 | 0.812 | 0.811 | 0.827 | 0.807 | 0.812 | 0.811 | 0.829
0.5 | 0.5 | H1a | 0.048 | 0.057 | 0.067 | 0.098 | 0.053 | 0.066 | 0.075 | 0.114

90% power for testing H0+

p | δ+ | Quantity | t = 0.25, (1, 1) | t = 0.25, (0.8, 1) | t = 0.25, (1, 0.8) | t = 0.25, (0.8, 0.8) | t = 0.5, (1, 1) | t = 0.5, (0.8, 1) | t = 0.5, (1, 0.8) | t = 0.5, (0.8, 0.8)
---|---|---|---|---|---|---|---|---|---|---
0.3 | 0.3 | N | 5208 | 9303 | 13151 | 21695 | 2949 | 4680 | 6581 | 10848
0.3 | 0.3 | Global | 1.000 | 1.000 | 1.000 | 1.000 | 0.995 | 1.000 | 1.000 | 1.000
0.3 | 0.3 | H1a | 0.751 | 0.905 | 0.973 | 0.999 | 0.750 | 0.905 | 0.973 | 0.999
0.3 | 0.4 | N | 1036 | 1477 | 4736 | 11536 | 1042 | 1529 | 2594 | 5771
0.3 | 0.4 | Global | 0.943 | 0.969 | 1.000 | 1.000 | 0.946 | 0.974 | 0.997 | 1.000
0.3 | 0.4 | H1a | 0.293 | 0.405 | 0.663 | 0.941 | 0.311 | 0.451 | 0.659 | 0.941
0.3 | 0.5 | N | 570 | 761 | 1058 | 5662 | 557 | 750 | 1082 | 2886
0.3 | 0.5 | Global | 0.922 | 0.936 | 0.954 | 1.000 | 0.923 | 0.939 | 0.961 | 0.999
0.3 | 0.5 | H1a | 0.147 | 0.200 | 0.273 | 0.669 | 0.151 | 0.214 | 0.309 | 0.667
0.4 | 0.3 | N | 1914 | 2668 | 5881 | 13200 | 1487 | 2447 | 3117 | 6603
0.4 | 0.3 | Global | 0.985 | 0.994 | 1.000 | 1.000 | 0.958 | 0.991 | 0.997 | 1.000
0.4 | 0.3 | H1a | 0.479 | 0.59 | 0.754 | 0.970 | 0.452 | 0.66 | 0.755 | 0.970
0.4 | 0.4 | N | 646 | 862 | 1030 | 4791 | 636 | 853 | 1037 | 2555
0.4 | 0.4 | Global | 0.92 | 0.936 | 0.944 | 1.000 | 0.92 | 0.939 | 0.948 | 0.998
0.4 | 0.4 | H1a | 0.161 | 0.219 | 0.257 | 0.613 | 0.173 | 0.243 | 0.293 | 0.618
0.4 | 0.5 | N | 370 | 480 | 600 | 993 | 362 | 460 | 588 | 1004
0.4 | 0.5 | Global | 0.911 | 0.919 | 0.923 | 0.952 | 0.91 | 0.919 | 0.923 | 0.96
0.4 | 0.5 | H1a | 0.085 | 0.111 | 0.134 | 0.215 | 0.09 | 0.119 | 0.149 | 0.259
0.5 | 0.3 | N | 969 | 1337 | 1404 | 7692 | 972 | 1371 | 1429 | 3879
0.5 | 0.3 | Global | 0.926 | 0.952 | 0.952 | 1.000 | 0.927 | 0.959 | 0.957 | 1.000
0.5 | 0.3 | H1a | 0.250 | 0.341 | 0.352 | 0.813 | 0.278 | 0.401 | 0.413 | 0.814
0.5 | 0.4 | N | 468 | 597 | 685 | 1116 | 467 | 591 | 675 | 1144
0.5 | 0.4 | Global | 0.910 | 0.920 | 0.919 | 0.951 | 0.909 | 0.920 | 0.919 | 0.960
0.5 | 0.4 | H1a | 0.102 | 0.132 | 0.147 | 0.234 | 0.112 | 0.150 | 0.167 | 0.291
0.5 | 0.5 | N | 279 | 349 | 424 | 660 | 650 | 348 | 415 | 648
0.5 | 0.5 | Global | 0.906 | 0.911 | 0.911 | 0.927 | 0.928 | 0.910 | 0.909 | 0.928
0.5 | 0.5 | H1a | 0.057 | 0.071 | 0.081 | 0.123 | 0.146 | 0.081 | 0.091 | 0.145
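As a reading aid for Table B1, the cost of misclassification can be expressed as an inflation factor relative to a perfect assay. The sketch below uses one row of the table only (p = 0.4, δ+ = 0.5, t = 0.5, 80% power for testing H0+); the variable names are illustrative and the calculation is plain division, not part of the authors' R programs.

```python
# Total sample sizes N from Table B1 for p = 0.4, delta+ = 0.5, t = 0.5,
# 80% power for testing H0+, keyed by (sensitivity, specificity).
n_required = {
    (1.0, 1.0): 264,
    (0.8, 1.0): 322,
    (1.0, 0.8): 433,
    (0.8, 0.8): 671,
}

n_perfect = n_required[(1.0, 1.0)]

# Inflation factor: N under an imperfect assay divided by N under a perfect one.
inflation = {key: n / n_perfect for key, n in n_required.items()}

for (sen, spec), ratio in inflation.items():
    n = n_required[(sen, spec)]
    print(f"sen={sen}, spec={spec}: N={n}, inflation={ratio:.2f}")
```

For this row, reducing both sensitivity and specificity to 0.8 raises the required N by a factor of roughly 2.5, and reduced specificity alone (433) costs more than reduced sensitivity alone (322).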
TABLE B2.
Columns are indexed by t and (λsen, λspec). Entries are total sample sizes N.

80% power for testing H0

p | δ | δ+ | t = 0.25, (1, 1) | t = 0.25, (0.8, 1) | t = 0.25, (1, 0.8) | t = 0.25, (0.8, 0.8) | t = 0.5, (1, 1) | t = 0.5, (0.8, 1) | t = 0.5, (1, 0.8) | t = 0.5, (0.8, 0.8)
---|---|---|---|---|---|---|---|---|---|---
0.3 | 0.1 | 0.2 | 2269 | 2513 | 2973 | 3438 | 2354 | 2589 | 2958 | 3309
0.3 | 0.1 | 0.3 | 984 | 1176 | 1635 | 2186 | 1006 | 1185 | 1612 | 2094
0.3 | 0.1 | 0.4 | 507 | 611 | 954 | 1376 | 514 | 608 | 932 | 1321
0.3 | 0.2 | 0.3 | 833 | 861 | 948 | 1024 | 860 | 887 | 935 | 975
0.3 | 0.2 | 0.4 | 568 | 630 | 745 | 864 | 588 | 649 | 741 | 831
0.3 | 0.2 | 0.5 | 369 | 432 | 554 | 699 | 380 | 441 | 550 | 672
0.3 | 0.3 | 0.4 | 409 | 415 | 448 | 476 | 424 | 426 | 440 | 451
0.3 | 0.3 | 0.5 | 331 | 349 | 395 | 434 | 343 | 361 | 389 | 415
0.3 | 0.3 | 0.6 | 253 | 281 | 332 | 387 | 262 | 289 | 330 | 372
0.4 | 0.1 | 0.2 | 1756 | 1987 | 2351 | 2877 | 1812 | 2047 | 2337 | 2790
0.4 | 0.1 | 0.3 | 711 | 834 | 1134 | 1577 | 731 | 855 | 1119 | 1524
0.4 | 0.1 | 0.4 | 374 | 437 | 644 | 939 | 385 | 451 | 633 | 904
0.4 | 0.2 | 0.3 | 732 | 773 | 847 | 937 | 754 | 795 | 839 | 902
0.4 | 0.2 | 0.4 | 439 | 499 | 589 | 724 | 453 | 514 | 585 | 701
0.4 | 0.2 | 0.5 | 269 | 317 | 402 | 539 | 277 | 325 | 398 | 521
0.4 | 0.3 | 0.4 | 378 | 387 | 415 | 447 | 385 | 397 | 410 | 428
0.4 | 0.3 | 0.5 | 276 | 301 | 355 | 387 | 286 | 310 | 335 | 374
0.4 | 0.3 | 0.6 | 195 | 223 | 262 | 325 | 202 | 230 | 261 | 315
0.5 | 0.1 | 0.2 | 1413 | 1616 | 1874 | 2401 | 1448 | 1664 | 1858 | 2341
0.5 | 0.1 | 0.3 | 574 | 667 | 857 | 1206 | 590 | 694 | 847 | 1190
0.5 | 0.1 | 0.4 | 308 | 360 | 485 | 706 | 319 | 378 | 478 | 694
0.5 | 0.2 | 0.3 | 638 | 688 | 744 | 853 | 653 | 705 | 738 | 835
0.5 | 0.2 | 0.4 | 354 | 408 | 469 | 606 | 362 | 420 | 465 | 591
0.5 | 0.2 | 0.5 | 215 | 254 | 307 | 428 | 221 | 262 | 304 | 418
0.5 | 0.3 | 0.4 | 343 | 359 | 382 | 418 | 349 | 366 | 376 | 403
0.5 | 0.3 | 0.5 | 232 | 259 | 285 | 343 | 238 | 266 | 283 | 334
0.5 | 0.3 | 0.6 | 158 | 184 | 209 | 274 | 161 | 189 | 207 | 267

90% power for testing H0

p | δ | δ+ | t = 0.25, (1, 1) | t = 0.25, (0.8, 1) | t = 0.25, (1, 0.8) | t = 0.25, (0.8, 0.8) | t = 0.5, (1, 1) | t = 0.5, (0.8, 1) | t = 0.5, (1, 0.8) | t = 0.5, (0.8, 0.8)
---|---|---|---|---|---|---|---|---|---|---
0.3 | 0.1 | 0.2 | 3225 | 3576 | 3995 | 4526 | 3262 | 3600 | 3977 | 4398
0.3 | 0.1 | 0.3 | 1459 | 1774 | 2227 | 2941 | 1439 | 1737 | 2192 | 2835
0.3 | 0.1 | 0.4 | 760 | 947 | 1289 | 1852 | 732 | 899 | 1256 | 1765
0.3 | 0.2 | 0.3 | 1181 | 1175 | 1247 | 1329 | 1153 | 1191 | 1236 | 1279
0.3 | 0.2 | 0.4 | 806 | 896 | 1002 | 1135 | 816 | 902 | 996 | 1104
0.3 | 0.2 | 0.5 | 537 | 635 | 756 | 931 | 538 | 632 | 747 | 904
0.3 | 0.3 | 0.4 | 549 | 558 | 586 | 616 | 545 | 565 | 578 | 589
0.3 | 0.3 | 0.5 | 457 | 483 | 521 | 565 | 466 | 490 | 518 | 547
0.3 | 0.3 | 0.6 | 359 | 399 | 446 | 508 | 363 | 402 | 444 | 494
0.4 | 0.1 | 0.2 | 2487 | 2864 | 3177 | 3831 | 2507 | 2867 | 3156 | 3733
0.4 | 0.1 | 0.3 | 1009 | 1236 | 1521 | 2117 | 999 | 1203 | 1497 | 2045
0.4 | 0.1 | 0.4 | 523 | 641 | 856 | 1249 | 518 | 621 | 839 | 1203
0.4 | 0.2 | 0.3 | 1009 | 1064 | 1126 | 1227 | 1022 | 1079 | 1120 | 1192
0.4 | 0.2 | 0.4 | 622 | 719 | 796 | 963 | 627 | 719 | 790 | 939
0.4 | 0.2 | 0.5 | 384 | 467 | 542 | 723 | 383 | 459 | 535 | 730
0.4 | 0.3 | 0.4 | 509 | 524 | 547 | 582 | 516 | 530 | 543 | 563
0.4 | 0.3 | 0.5 | 385 | 421 | 451 | 510 | 391 | 426 | 449 | 497
0.4 | 0.3 | 0.6 | 277 | 322 | 354 | 432 | 279 | 322 | 352 | 421
0.5 | 0.1 | 0.2 | 1949 | 2290 | 2511 | 3207 | 1963 | 2293 | 2490 | 3134
0.5 | 0.1 | 0.3 | 777 | 933 | 1132 | 1604 | 783 | 934 | 1119 | 1567
0.5 | 0.1 | 0.4 | 412 | 494 | 638 | 932 | 420 | 502 | 629 | 914
0.5 | 0.2 | 0.3 | 874 | 950 | 995 | 1126 | 885 | 960 | 989 | 1097
0.5 | 0.2 | 0.4 | 488 | 577 | 629 | 810 | 491 | 578 | 623 | 791
0.5 | 0.2 | 0.5 | 294 | 359 | 409 | 571 | 295 | 358 | 404 | 558
0.5 | 0.3 | 0.4 | 464 | 487 | 505 | 549 | 470 | 492 | 501 | 547
0.5 | 0.3 | 0.5 | 320 | 362 | 382 | 459 | 324 | 365 | 379 | 449
0.5 | 0.3 | 0.6 | 217 | 260 | 280 | 366 | 219 | 260 | 277 | 344
References
- 1. Renfro LA, Mallick H, An MW, Sargent DJ, Mandrekar SJ. Clinical trial designs incorporating predictive biomarkers. Cancer Treatment Reviews. 2016;43:74–82.
- 2. Shih WJ, Lin Y. On study designs and hypotheses for clinical trials with predictive biomarkers. Contemporary Clinical Trials. 2017;:235–394.
- 3. Shih WJ, Lin Y. Relative efficiency of precision medicine designs for clinical trials with predictive biomarkers. Statistics in Medicine. 2018;54(3):411–424.
- 4. Wang SJ, O’Neill RT, Hung HMJ. Approaches to evaluation of treatment effect in randomized clinical trials with genomic subset. Pharmaceutical Statistics. 2007;6(3):227–244.
- 5. Wang SJ, Hung HMJ, O’Neill RT. Adaptive patient enrichment designs in therapeutic trials. Biometrical Journal. 2009;51:358–374.
- 6. Freidlin B, Simon R. Adaptive signature design: an adaptive clinical trial design for generating and prospectively testing a gene expression signature for sensitive patients. Clinical Cancer Research. 2005;11:7872–7878.
- 7. Krisam J, Kieser M. Decision rules for subgroup selection based on a predictive biomarker. Journal of Biopharmaceutical Statistics. 2014;24:188–202.
- 8. Wang SJ, Hung HMJ, O’Neill RT. Genomic classifier for patient enrichment: misclassification and type I error issues in pharmacogenomics noninferiority trial. Statistics in Biopharmaceutical Research. 2011;3(2):310–319.
- 9. Wang SJ, Li MC. Impacts of predictive genomic classifier performance on subpopulation-specific treatment effects assessment. Statistics in Biosciences. 2016;8:129–158.
- 10. Herbst RS, Baas P, Kim DW, et al. Pembrolizumab versus docetaxel for previously treated, PD-L1-positive, advanced non-small-cell lung cancer (KEYNOTE-010): a randomised controlled trial. The Lancet. 2016;387(10027):1540–1550.
- 11. Garon EB, Rizvi NA, Hui R, et al. Pembrolizumab for the treatment of non-small-cell lung cancer. New England Journal of Medicine. 2015;372:2018–2028.
- 12. Gao Z, Roy A, Tan M. Multistage adaptive biomarker-directed targeted design for randomized clinical trials. Contemporary Clinical Trials. 2015;42:119–131.
- 13. Reck M, Rodriguez-Abreu D, Robinson AG, et al. Pembrolizumab versus chemotherapy for PD-L1-positive non-small-cell lung cancer. New England Journal of Medicine. 2016;375:1823–1833.
- 14. Tang L, Zhou XH. A general framework of marker design with optimal allocation to assess clinical utility. Statistics in Medicine. 2013;32(4):620–630.