Abstract
Prospective randomized clinical trials addressing biomarkers are time-consuming and costly, but are necessary for regulatory agencies to approve new therapies with predictive biomarkers. For this reason, many trial designs have recently been discussed and proposed in the literature, along with comparisons of their efficiency. We compare the statistical efficiency of the marker-stratified design and the marker-based precision medicine design in testing/estimating four hypotheses/parameters of clinical interest, namely: the treatment effects in the marker-positive and marker-negative cohorts, the marker-by-treatment interaction, and the marker's clinical utility. As may be expected, the stratified design is more efficient than the precision medicine design. However, it is perhaps surprising how low the relative efficiency of the precision medicine design can be. We quantify the relative efficiency as a function of design factors including the marker-positive prevalence rate, the marker assay and classification sensitivity and specificity, and the treatment randomization ratio, and we examine how the relative efficiency varies with these design parameters when testing the different hypotheses. We advocate using the stratified design over the precision medicine design in clinical trials with predictive biomarkers.
Keywords: Efficiency, Biomarker, Clinical trial, Precision medicine, Stratification, Clinical utility
1. Introduction
With the rapid development and use of biomarkers for personalized medicine, there is an increasing trend toward considering biomarkers in the design and analysis of modern clinical trials. Sargent and Allegra,1 Freidlin et al2 and Hayes,3 among others, discussed design issues in clinical trials with biomarkers. Prospective randomized clinical trials addressing biomarkers are time-consuming and costly, but are necessary for regulatory agencies to approve new therapies with predictive biomarkers.3 For this reason, many trial designs have recently been discussed and proposed in the literature, along with comparisons of their efficiency. Considering a biomarker or a panel of biomarkers that classifies patients as marker positives or negatives, Simon and Maitournam4 and Maitournam and Simon5 proposed “targeted (enrichment or selected) designs”, which restrict accrual to marker-positive patients only, as opposed to the usual all-comers “untargeted designs”. Sargent et al6 proposed the “biomarker-by-treatment interaction design”, which is the “marker-stratified design” that prospectively randomizes patients within each marker-status-defined stratum. Stratified randomization designs are classical designs satisfying the orthogonality condition in the sense of a two-way ANOVA and are used widely in clinical trials. When the design stratification is based on biomarker status, however, we need to consider possible misclassification due to an imperfect assay and classification rule.3,7 An example of such a marker-stratified design is the pembrolizumab versus docetaxel for previously treated, PD-L1-positive, advanced non-small-cell lung cancer (KEYNOTE-010) trial.8 The stratification biomarker was TPS (tumour proportion score ≥50% vs 1 - 49%), which measures the extent of PD-L1 expression. A diagram of the marker-stratified design is presented in Section 3.1.
Considering the case of two treatment groups, test (T) versus control/standard of care (C), Sargent and Allegra,1 and Mandrekar and Sargent9–11 proposed a new family of “modified biomarker-based-strategy designs”, which extended the “usual biomarker-based-strategy design” originally proposed by Hayes.12 The “modified” version is also called the “augmented” marker-based strategy design.7 Young et al13 labeled the “usual” and “modified” versions as “biomarker-based strategy designs I and II”, respectively. Since all biomarker clinical trial designs, even the targeted/enrichment designs, involve some kind of strategy based on biomarkers, Shih and Lin14 recently unified these multifarious terms and renamed the family of “biomarker-based-strategy designs” as the “precision medicine designs” for better clarity, because designs of this type appear mostly in oncology journals studying precision medicines. We follow the nomenclature of Shih and Lin14 in this paper.
The precision medicine design first randomizes patients into a marker-dependent arm and a marker-independent arm; see the diagram in Section 3.2. For the marker-dependent arm, marker-positive patients all receive the test treatment and marker-negative patients all receive the control treatment. For the marker-independent arm, with the “usual” strategy, patients always receive the standard of care C. With the “modified/augmented” strategy, patients follow the untargeted design, i.e., they are further randomized to either treatment group T or C without biomarker information. Sargent and Allegra1 discussed several advantages of the “modified” strategy over the “usual” strategy. Mandrekar and Sargent11 reported a trial by Potti and Nevins15 that adopted a modified biomarker-based strategy. Patients were first randomized to the genomics guided arm (marker-dependent) or the unguided arm (marker-independent). In that trial, patients in the marker-independent arm were further randomized to either adriamycin cyclophosphamide or docetaxel cyclophosphamide. These and other designs and generalizations to multiple markers and treatment groups have been reviewed recently by Renfro et al16 with examples. See more descriptions in later sections.
In considering the efficiency of these biomarker trial designs,4–7,9–11,13 most of the literature assessed relative efficiency either ambiguously, without reference to the hypothesis of interest, or inappropriately, with tests of different hypotheses or estimators of different parameters. For example, Simon and Maitournam4 and Maitournam and Simon5 reported that targeted (or enrichment) designs are more efficient than untargeted (or all-comers) designs. Obviously, this is not an appropriate comparison of efficiency, since the treatment effect in a targeted design is limited to the marker-positive cohort only, while for the untargeted design the treatment effect refers to the overall unselected population. Comparison of efficiency has to be based on testing the same hypothesis or estimating the same parameter of interest. Moreover, although Mandrekar and Sargent9–11 suggested that the stratified design has greater efficiency than the precision medicine design, their suggestion was based on the sum of the sample sizes required to test the hypotheses of treatment effects in the marker-positive and marker-negative cohorts for the stratified design, against the sample size required to compare the marker-independent arm versus the marker-dependent arm for the precision medicine design. These are not compatible hypotheses for a meaningful comparison of efficiency. The relative efficiency issues are complex and not well understood, since many factors are involved and analytic considerations are needed. In this paper, the relative efficiency of the stratified design versus the precision medicine design is analyzed with explicit formulas. We choose these two families of designs because they have been considered in almost all of the recent literature on biomarker trial designs. Moreover, Shih and Lin14 showed that, despite their different structures and emphases, both stratified designs and precision medicine designs are capable of estimating/testing the same set of parameters/hypotheses of interest, including the treatment effects in each marker-defined stratum, the marker-by-treatment interaction, and the marker's clinical utility. Therefore, we may (and should) compare the relative efficiency of the stratified design and the precision medicine design with respect to these hypotheses or associated parameters. In this paper, we carry out the efficiency comparisons in the context of the mean difference of a continuous endpoint Y for hypotheses or parameters of special clinical interest. The conclusions should be applicable to binary and survival endpoints as well.
The rest of the paper is structured as follows. In Section 2, we lay out notation for the design parameters and the hypotheses of interest. Sections 3, 4 and 5 address the four hypotheses/parameters of interest: Section 3 the treatment effects in the marker-status-specific sub-cohorts; Section 4 the biomarker-by-treatment interaction; Section 5 the marker's clinical utility. The sub-sections are devoted to the two designs: first the stratified design, then the precision medicine design, and subsequently the resulting relative efficiency. Section 6 illustrates the methods with two real clinical trials. Section 7 provides a summary and further discussion. Detailed derivations and considerations on the robustness issue are given in the appendices.
2. Parameters and Hypotheses
Let D be the true marker status, M be the marker-appeared status, and A be the treatment assignment. Let μij ≡ E(Y|A = i, D = j) and σ²ij ≡ Var(Y|A = i, D = j) denote the mean and variance of the response of treatment group i (i = T or C) for patients in marker cohort j (j = 0 or 1; j = 0 being marker-negative and j = 1 being marker-positive). The marker assay and classification accuracy are measured by sensitivity λsen ≡ P(M = 1|D = 1) and specificity λspec ≡ P(M = 0|D = 0). The prevalence rate of true marker-positives is p ≡ P(D = 1). The randomization ratio is r ≡ P(A = T) to the test treatment and 1 – r to the control group for the stratified design in both marker-defined strata. The corresponding randomization ratio is denoted by f : (1 – f) for the marker-independent arm of the precision medicine design in general, where treatment randomization is conducted without marker information. We will justify later that it is reasonable to consider f = r for the purpose of assessing the relative efficiency of the two designs when f ≠ 0. Notice that the precision medicine design is a renaming of the “modified/augmented” marker-based strategy design, which includes the so-called “usual” marker-based strategy design as a special case with f = 0. We now define the parameters and label the corresponding null hypotheses as follows.
The treatment effect for the marker-negative stratum is δ0 = μT0 – μC0; the corresponding null hypothesis is H− : δ0 = 0. The treatment effect for the marker-positive stratum is δ1 = μT1 – μC1; the corresponding null hypothesis is H+ : δ1 = 0. The marker-by-treatment interaction is θ ≡ (μT1 – μC1) – (μT0 – μC0); the corresponding null hypothesis is HI : θ = 0. The notion of clinical utility of the biomarker is defined by Hayes3 as “whether the use of the biomarker supports a treatment decision that produces a better outcome for the patient than if the marker results were not available”. The use of the biomarker for the treatment decision is reflected by the marker-dependent arm. Treatment decision without the biomarker is reflected by the marker-independent arm. Thus, the marker's clinical utility is derived directly from the precision medicine design by comparing the mean of the marker-dependent arm with the mean of the marker-independent arm, μIN, as shown in Shih and Lin14: Uf(λsen, λspec, p) = p(λsen – f)δ1 + (1 – p)(1 – λspec – f)δ0.
Also see Section 5 later. The corresponding hypothesis is HU : Uf(λsen, λspec, p) = 0. Shih and Lin14 highlighted several differences between the marker's clinical utility Uf and the marker-by-treatment interaction θ. In the following, we test each of the hypotheses H−, H+, HI, and HU with the stratified design and the precision medicine design and compare their efficiencies. We quantify the relative efficiencies as functions of design factors including the prevalence rate, sensitivity, specificity, and the treatment randomization ratio, and examine the trends over wide ranges of scenarios for these design factors.
3. Test Marker-Specific Treatment Effect H− and H+ and Relative Efficiency
In the following, we consider the stratified design and the precision medicine design for testing the marker-specific treatment effect hypotheses H− and H+. For both designs, due to the imperfection in assay and classification, the marker-appeared (positive and negative) means are mixtures of the true marker-positive and marker-negative means, as seen in the following expressions (1) and (3). The mixture fractions are the positive predictive and negative predictive values, respectively.
Denote q ≡ P(M = 1) = pλsen + (1 – p)(1 – λspec). The positive predictive value (PPV) is
τ ≡ P(D = 1|M = 1) = pλsen/q.
By simple algebra, the mean of the marker-appeared positives is
μi* ≡ E(Y|A = i, M = 1) = τμi1 + (1 – τ)μi0,  (1)
and the variance is
σ²i* ≡ Var(Y|A = i, M = 1) = τ(1 – τ)Δi² + τσ²i1 + (1 – τ)σ²i0,  (2)
for i = T, C, where Δi ≡ μi1 – μi0. Note that ΔT (ΔC) is the marker effect in treatment group T (C).
The negative predictive value (NPV) is
η ≡ P(D = 0|M = 0) = (1 – p)λspec/(1 – q).
The corresponding mean and variance for the marker-appeared negatives are
μiϕ ≡ E(Y|A = i, M = 0) = ημi0 + (1 – η)μi1,  (3)
and
σ²iϕ ≡ Var(Y|A = i, M = 0) = η(1 – η)Δi² + ησ²i0 + (1 – η)σ²i1,  (4)
for i = T, C.
For testing the hypotheses regarding the treatment effects, the treatment-by-marker interaction, and the marker's clinical utility, we will need to solve Eqs. (1) and (3) for the true means. Notice that in Eqs. (2) and (4), the variances all involve the same quantities (Δi², σ²i1, σ²i0), showing that the way they depend on the treatment group is the same for both marker cohorts. Moreover, under the conventional assumption of homogeneous variances σ²ij = σ², we can see that in Eqs. (2) and (4), σ² dominates the magnitude, since the first term is at most Δi²/4, which is much smaller.
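As a concrete illustration of the mixture relationships in Eqs. (1)-(4) (as reconstructed above), the following minimal sketch computes q, the predictive values, and the marker-appeared means and variances from assumed true means and variances. All numerical inputs (prevalence, sensitivity, specificity, true means) are hypothetical and chosen only for illustration.

```python
# Hypothetical design parameters (illustration only)
p, lam_sen, lam_spec = 0.3, 0.9, 0.9                     # prevalence, sensitivity, specificity
mu = {("T", 1): 1.45, ("T", 0): 1.0,                      # true means mu_ij, i = T/C, j = 1/0
      ("C", 1): 1.0,  ("C", 0): 1.0}
sigma2 = {key: 1.0 for key in mu}                         # true variances (homogeneous here)

q = p * lam_sen + (1 - p) * (1 - lam_spec)                # P(M = 1)
tau = p * lam_sen / q                                     # PPV = P(D = 1 | M = 1)
eta = (1 - p) * lam_spec / (1 - q)                        # NPV = P(D = 0 | M = 0)

def appeared_positive(i):
    """Mean and variance of Y among marker-appeared positives, treatment i (cf. Eqs. (1)-(2))."""
    delta_i = mu[(i, 1)] - mu[(i, 0)]                     # marker effect in treatment group i
    m = tau * mu[(i, 1)] + (1 - tau) * mu[(i, 0)]
    v = tau * (1 - tau) * delta_i**2 + tau * sigma2[(i, 1)] + (1 - tau) * sigma2[(i, 0)]
    return m, v

def appeared_negative(i):
    """Mean and variance of Y among marker-appeared negatives, treatment i (cf. Eqs. (3)-(4))."""
    delta_i = mu[(i, 1)] - mu[(i, 0)]
    m = eta * mu[(i, 0)] + (1 - eta) * mu[(i, 1)]
    v = eta * (1 - eta) * delta_i**2 + eta * sigma2[(i, 0)] + (1 - eta) * sigma2[(i, 1)]
    return m, v

for i in ("T", "C"):
    print(i, "appeared+ (mean, var):", appeared_positive(i),
          "appeared- (mean, var):", appeared_negative(i))
```

Note how, with homogeneous variance 1 and a marker effect of 0.45 in group T, the mixture terms change the appeared-stratum variances only slightly, consistent with the remark that σ² dominates.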
For both designs, assume the total sample size of N to start the process.
3.1. Stratified Design
A summary of the stratified design is shown in the following layout. We annotate the essential quantities defined previously in the diagram for easy reference when reading the formula derivations in the text. Notice that the randomization ratio r : (1 – r) to the treatment groups is best set according to the variance ratio of the treatment groups to maximize efficiency within each marker cohort. In practice r is usually the same for both marker cohorts. This is justified because, as seen from Eqs. (2) and (4), the way the variances depend on the treatment groups is the same for both marker cohorts.
[Diagram: marker-stratified design. Patients are classified by marker-appeared status (M = 1 or M = 0) and randomized r : (1 – r) to T or C within each stratum.]
With the stratified design, the sample means are unbiased estimators of E(Y|A = i, M = j) (≡ μi* for M = 1 and ≡ μiϕ for M = 0). The sample variances are unbiased estimators of Var(Y|A = i, M = j) (≡ σ²i* for M = 1 and ≡ σ²iϕ for M = 0), where Nij are the sample sizes for the set with (A = i, M = j). Note that NTj = rNj, where N1 ∼ Binomial(N, q) is the (random) sample size of the marker-appeared positive stratum and N0 = N – N1.
Solving from Eqs. (1) and (3), the unbiased estimators of the true means of the marker-positive cohort, μi1, are
μ̂i1 = [ηȲi* – (1 – τ)Ȳiϕ]/(τ + η – 1), i = T, C,
where Ȳi* and Ȳiϕ denote the sample means of the marker-appeared positive and negative strata, respectively, in treatment group i. Hence, the unbiased estimator of the treatment effect in the marker-positive cohort is
δ̂1 = [η(ȲT* – ȲC*) – (1 – τ)(ȲTϕ – ȲCϕ)]/(τ + η – 1).  (5)
It is shown in Appendix A that the variance of the unbiased estimator is
| (6) |
and this variance, with either approximation, can be estimated by
| (7) |
Wald test under H+ is readily formed from Eq. (5) and Eq. (7).
For the marker-negative cohort, the unbiased estimators of μi0 are
μ̂i0 = [τȲiϕ – (1 – η)Ȳi*]/(τ + η – 1), i = T, C.
Hence, the unbiased estimator of the treatment effect in the marker-negative cohort is
δ̂0 = [τ(ȲTϕ – ȲCϕ) – (1 – η)(ȲT* – ȲC*)]/(τ + η – 1).  (8)
It is shown in Appendix A.3 that
| (9) |
and the variance can be estimated by
| (10) |
Wald test under H− is readily formed from Eq. (8) and Eq. (10).
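A minimal sketch of the stratified-design point estimators (5) and (8) and their Wald statistics is given below, treating the observed stratum sample sizes as fixed. The plug-in variance used here follows algebraically from the fact that δ̂1 and δ̂0 are linear combinations of the two independent within-stratum treatment contrasts; it is one form consistent with the structure of Eqs. (6)-(10) but may differ in presentation. All summary statistics and the values of τ and η are hypothetical.

```python
import numpy as np
from scipy import stats

def stratified_wald(ybar, s2, n, tau, eta):
    """
    ybar, s2, n: dicts keyed by (treatment, appeared status), e.g. ("T", 1) is the
    marker-appeared-positive stratum treated with T.  tau = PPV, eta = NPV.
    Returns (delta1_hat, z1, delta0_hat, z0).
    """
    k = tau + eta - 1.0
    # observed treatment contrasts within the appeared-positive / appeared-negative strata
    d_pos = ybar[("T", 1)] - ybar[("C", 1)]
    d_neg = ybar[("T", 0)] - ybar[("C", 0)]
    v_pos = s2[("T", 1)] / n[("T", 1)] + s2[("C", 1)] / n[("C", 1)]
    v_neg = s2[("T", 0)] / n[("T", 0)] + s2[("C", 0)] / n[("C", 0)]
    # de-mixed treatment effects in the true marker cohorts (cf. Eqs. (5) and (8))
    delta1 = (eta * d_pos - (1 - tau) * d_neg) / k
    delta0 = (tau * d_neg - (1 - eta) * d_pos) / k
    var1 = (eta**2 * v_pos + (1 - tau)**2 * v_neg) / k**2
    var0 = (tau**2 * v_neg + (1 - eta)**2 * v_pos) / k**2
    return delta1, delta1 / np.sqrt(var1), delta0, delta0 / np.sqrt(var0)

# hypothetical summary statistics (roughly matching p = 0.3, sensitivity = specificity = 0.9)
ybar = {("T", 1): 1.42, ("C", 1): 1.05, ("T", 0): 1.02, ("C", 0): 1.00}
s2   = {key: 1.0 for key in ybar}
n    = {key: 150 for key in ybar}
d1, z1, d0, z0 = stratified_wald(ybar, s2, n, tau=0.79, eta=0.95)
print(f"delta1_hat={d1:.3f} (z={z1:.2f}), delta0_hat={d0:.3f} (z={z0:.2f})")
print("two-sided p-value for H+:", 2 * stats.norm.sf(abs(z1)))
```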
3.2. Precision Medicine Design
A summary of the precision medicine design is shown in the following layout. We also annotate the essential quantities in the diagram for easy reference when reading the formula derivations in the text.
[Diagram: precision medicine design. Patients are randomized t : (1 – t) to the marker-independent arm (further randomized f : (1 – f) to T or C) and the marker-dependent arm (M = 1 receives T; M = 0 receives C).]
With the precision medicine design, patients are first randomized to two arms with t : (1 – t) proportions. The marker-independent arm (sample size Nt) follows the untargeted design; patients are further randomized with f : (1 – f) proportions, without marker involvement, to the treatment (sample size Ntf) or control (sample size Nt(1 – f)) group. Notice that the randomization ratio f : (1 – f) is best set according to the ratio of the variances σ²i# in Eq. (14). The way that σ²i# depends on the treatment group i in Eq. (14) is the same as in Eqs. (2) and (4) for the randomization ratio r in the stratified design for both marker cohorts. Therefore, it is justified to set f = r in the sequel. Notice also that neither r nor f may be 0 in order to estimate the true treatment effects for both the marker-positive and marker-negative cohorts. The special case of f = 0, i.e., the “usual” marker-based strategy, is simply not permissible.
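To make the allocation concrete, the short sketch below tabulates the expected sample size in each cell of the precision medicine design for assumed values of N, t, f, and the marker parameters (all values are hypothetical; the cell structure follows the layout above and the fact that the marker-dependent arm has Binomial(N(1 − t), q) appeared-positives in expectation).

```python
def pm_expected_cells(N, t, f, p, lam_sen, lam_spec):
    """Expected cell sizes of the precision medicine design."""
    q = p * lam_sen + (1 - p) * (1 - lam_spec)   # P(M = 1)
    return {
        "marker-independent arm, T":      N * t * f,
        "marker-independent arm, C":      N * t * (1 - f),
        "marker-dependent arm, M=1 -> T": N * (1 - t) * q,
        "marker-dependent arm, M=0 -> C": N * (1 - t) * (1 - q),
    }

for cell, size in pm_expected_cells(N=600, t=0.5, f=0.5, p=0.3,
                                    lam_sen=0.9, lam_spec=0.9).items():
    print(f"{cell}: {size:.0f}")
```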
Let be the response of a patient who is randomized to treatment i, and be the sample mean and sample variance (i = T, C) in the marker-independent arm. Let be the response of a patient in the marker-dependent arm with treatment i, and be the sample mean and sample variance (i = T, C), where marker appeared-positives receive treatment T and marker appeared-negatives receive treatment C.
It is shown in Appendix B.2 that, for the marker positive cohort, the unbiased estimate of the treatment effect can be obtained by
| (11) |
and its variance is (with f = r ≠ 0)
| (12) |
which, with either approximation, can be estimated by
| (13) |
where, for i = T or C,
| (14) |
is the variance of a response in the marker-independent arm with treatment i, and
| (15) |
and
| (16) |
are the variances of a response in the marker-dependent arm with treatments T and C respectively.
Wald test under H+ is readily formed from Eq. (11) and Eq. (13).
Similarly, we show in Appendix B.3 for marker negative cohort that the unbiased estimate of the treatment effect is
| (17) |
and its variance is (with f = r ≠ 0)
| (18) |
which, with either approximation, can be estimated by
| (19) |
Wald test under H− is readily formed from Eq. (17) and Eq. (19).
3.3. Relative Efficiency with Respect to Testing H−, H+
To assess the relative efficiency (RE) in testing H− or estimating δ0, the treatment effect in the marker-negative cohort, we compare the variance of δ̂0 under the stratified design (c.f. Eq. (9)) to that under the precision medicine design (c.f. Eq. (18)).
To assess RE in testing H+ or estimating δ1, the treatment effect in the marker-positive cohort, we compare the variance of δ̂1 under the stratified design (c.f. Eq. (6)) to that under the precision medicine design (c.f. Eq. (12)).
As shown in the formulas above, RE depends on many factors in a complicated fashion. In the following assessment we set σ²ij = σ² for all i = T, C and j = 0, 1. We examine the REs with respect to a wide range of prevalence rates, proportions randomized to the marker-independent arm for the precision medicine design, sensitivities and specificities, and marker effects. With homogeneous variances across treatment groups, we examine the REs with r(= f) = 1/2.
For testing H+, Fig 1A depicts RE versus the marker-positive prevalence rate in the range 0.1 ≤ p ≤ 0.9, assuming λsen = λspec = 0.9 and t = 0.5, where t is the randomization ratio to the marker-independent arm for the precision medicine design. Fig 1B depicts RE versus 0.1 ≤ t ≤ 0.9 when p = 0.8. In each figure, we vary the marker effect for treatment group T, ΔT = μT1 – μT0, from 0 to 0.75 while setting ΔC = 0. As seen, RE < 0.5 in all cases, indicating that the precision medicine design is much less efficient than the stratified design in testing H+. The precision medicine design loses efficiency because it includes patients treated with the same treatment on both the marker-independent and marker-dependent arms. It is perhaps expected that this overlap leads to efficiency loss for the precision medicine design; however, the extent of the loss is somewhat surprising. Also surprising is that, as seen in Fig 1A, RE increases with increasing marker-positive prevalence rate. Fig 1B shows that RE is quadratic in t and reaches its maximum when t is near 0.55. The marker effect ΔT does not affect RE much in any case, as expected, since the first term in Eqs. (2), (4) and (14) is dominated by σ². Thus, in Fig 1C we pick ΔT = 0.45 to depict RE versus 0.5 ≤ λsen = λspec ≤ 1, for p = 0.2 to 0.8 and t = 0.5. Interestingly, the RE-versus-sensitivity trend is increasing for prevalence p > 0.4, but decreasing for prevalence p < 0.4. Thus, whether a better marker assay helps the stratified design more (or less) than the precision medicine design depends on the marker-positive prevalence rate.
Figure 1. Relative Efficiency of Test for Treatment Effect on Marker Positive and Negative Cohorts.
For testing H−, in contrast to testing H+, the trend of RE versus the marker-positive prevalence rate p is reversed in Fig 1D. The pattern in Fig 1E (RE versus randomization ratio t) is the same as above, but the RE magnitude is much lower. The pattern in Fig 1F (RE versus sensitivity/specificity λsen = λspec) is also reversed from that above; the RE-versus-sensitivity trend is positive for prevalence p < 0.4, but turns negative for prevalence p > 0.4.
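The relative-efficiency patterns reported above can be spot-checked by simulation without the closed-form variance expressions: generate data under each design, form the de-mixed estimator of δ1 by solving the corresponding mean equations, and compare empirical variances across replications. The rough sketch below does this for H+ under one set of hypothetical parameter values (normal outcomes, homogeneous variance); it illustrates the mechanics only and is not a reproduction of Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)
p, sen, spec, sigma = 0.5, 0.9, 0.9, 1.0                     # hypothetical marker parameters
mu = {("T", 1): 1.45, ("T", 0): 1.0, ("C", 1): 1.0, ("C", 0): 1.0}  # hypothetical true means
N, r, t, f, B = 1000, 0.5, 0.5, 0.5, 1000                    # sizes and replication count

q = p * sen + (1 - p) * (1 - spec)                           # P(M = 1)
tau, eta = p * sen / q, (1 - p) * spec / (1 - q)             # PPV, NPV

def simulate_marker(n):
    D = rng.random(n) < p                                    # true marker status
    M = np.where(D, rng.random(n) < sen, rng.random(n) >= spec)   # appeared status
    return D, M

def outcomes(A, D):
    means = np.array([mu[(a, int(d))] for a, d in zip(A, D)])
    return means + sigma * rng.standard_normal(len(D))

def delta1_stratified():
    D, M = simulate_marker(N)
    A = np.where(rng.random(N) < r, "T", "C")                # r : (1-r) allocation, marker-free
    Y = outcomes(A, D)
    ybar = {(a, m): Y[(A == a) & (M == m)].mean() for a in "TC" for m in (0, 1)}
    d_pos = ybar["T", 1] - ybar["C", 1]
    d_neg = ybar["T", 0] - ybar["C", 0]
    return (eta * d_pos - (1 - tau) * d_neg) / (tau + eta - 1)      # cf. Eq. (5)

def delta1_precision_medicine():
    D, M = simulate_marker(N)
    indep = rng.random(N) < t                                # marker-independent arm indicator
    A = np.where(indep,
                 np.where(rng.random(N) < f, "T", "C"),      # randomized without the marker
                 np.where(M, "T", "C"))                      # marker-guided assignment
    Y = outcomes(A, D)
    yT_ind = Y[indep & (A == "T")].mean()
    yC_ind = Y[indep & (A == "C")].mean()
    yT_dep = Y[~indep & M].mean()                            # appeared-positives on T
    yC_dep = Y[~indep & ~M].mean()                           # appeared-negatives on C
    # solve the mean equations for the true-cohort means (cf. Eqs. (B.9a)-(B.9d))
    muT1, _ = np.linalg.solve([[p, 1 - p], [tau, 1 - tau]], [yT_ind, yT_dep])
    muC1, _ = np.linalg.solve([[p, 1 - p], [1 - eta, eta]], [yC_ind, yC_dep])
    return muT1 - muC1

sf = np.array([delta1_stratified() for _ in range(B)])
pm = np.array([delta1_precision_medicine() for _ in range(B)])
print("empirical RE for delta_1 (Var_SF / Var_PM):", round(sf.var() / pm.var(), 3))
```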
In addition, we show supplementary graphs in Appendix C on the robustness of the variances to the design parameter values. Figures 3A to 3F indicate that, in testing H− or H+, the variances of the tests are more stable/robust for the stratified design (solid lines) than for the precision medicine design (dashed lines) with respect to the varying marker-positive prevalence rate (Figs 3A and 3D), marker sensitivity (Figs 3B and 3E) and specificity (Figs 3C and 3F).
Figure 3. Variances in Testing for Treatment Effect on Marker Positive and Negative Cohorts (solid line for Stratified design, and dashed line for Precision Medicine Design).
4. Test Marker-by-Treatment Interaction HI and Relative Efficiency
The marker-by-treatment interaction is θ = (μT1 – μC1) – (μT0 – μC0) = δ1 – δ0; the corresponding null hypothesis is HI : θ = 0. It indicates how the marker status modifies the treatment effect. Derivations follow straightforwardly from the previous section. Again, as for testing H− and H+, we require f = r ≠ 0 in testing HI.
4.1. Stratified Design
For the stratified design, from Eqs. (5) and (8), the unbiased estimator of θ can be obtained by
θ̂SF = δ̂1 – δ̂0,  (20)
with its variance
| (21) |
and variance estimator
| (22) |
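Continuing the hypothetical stratified-design sketch from Section 3.1, the interaction estimate and a plug-in Wald statistic can be obtained as below. Note that δ̂1 − δ̂0 simplifies algebraically to (d_pos − d_neg)/(τ + η − 1), so its variance is (Var(d_pos) + Var(d_neg))/(τ + η − 1)²; this is one plug-in form implied by the estimator and is not necessarily the exact presentation of Eqs. (21)-(22).

```python
def stratified_interaction(ybar, s2, n, tau, eta):
    """Interaction estimate theta_hat = delta1_hat - delta0_hat (cf. Eq. (20)) and its
    plug-in Wald statistic, reusing the de-mixing logic of stratified_wald() above."""
    k = tau + eta - 1.0
    d_pos = ybar[("T", 1)] - ybar[("C", 1)]
    d_neg = ybar[("T", 0)] - ybar[("C", 0)]
    v_pos = s2[("T", 1)] / n[("T", 1)] + s2[("C", 1)] / n[("C", 1)]
    v_neg = s2[("T", 0)] / n[("T", 0)] + s2[("C", 0)] / n[("C", 0)]
    theta = (d_pos - d_neg) / k              # equals delta1_hat - delta0_hat algebraically
    var_theta = (v_pos + v_neg) / k**2
    return theta, theta / var_theta**0.5

theta, z = stratified_interaction(ybar, s2, n, tau=0.79, eta=0.95)
print(f"theta_hat={theta:.3f}, z={z:.2f}")
```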
4.2. Precision Medicine Design
For the precision medicine design, from Eqs. (11) and (17), the unbiased estimator of θ can be obtained by
θ̂PM = δ̂1 – δ̂0,  (23)
with its variance
| (24) |
and variance estimator
| (25) |
4.3. Relative Efficiency with Respect to Testing HI
To assess the relative efficiency in testing HI or estimating θ, the treatment-by-marker interaction, we compare Var(θ̂SF) in Eq. (21) of the stratified design to Var(θ̂PM) in Eq. (24) of the precision medicine design. We use the same setup for the parameters as in Section 3.3. In Fig 2A, setting λsen = λspec = 0.9 and t = 0.5, we see that the precision medicine design is at most 25% as efficient as the stratified design and that the RE trend is quadratic, reaching its maximum when the marker-positive prevalence rate is p = 0.5. Similarly, Fig 2B shows that RE is quadratic in t and reaches its maximum when t is near 0.5. The marker effect ΔT makes a minuscule difference in the RE trends. Thus, in Fig 2C we pick ΔT = 0.45 to depict RE versus 0.5 ≤ λsen = λspec ≤ 1, for p = 0.2 to 0.8 and t = 0.5. As shown, the RE trend is similar for p = (0.2 or 0.8) and (0.4 or 0.6). Different from testing H+, a better marker assay always helps the stratified design more than the precision medicine design in testing HI. In addition, we show supplementary graphs in Appendix C on the robustness of the variances to the design parameter values. Figures 4A to 4C indicate that, in testing HI, the variances of the tests are also more stable/robust for the stratified design (solid lines) than for the precision medicine design (dashed lines) with respect to the varying marker-positive prevalence rate (Fig 4A), marker sensitivity (Fig 4B) and specificity (Fig 4C).
Figure 2. Relative Efficiency for Tests of Interaction and Marker Utility.
Figure 4. Variances in Tests of Interaction and Marker Utility (solid line for Stratified design, and dashed line for Precision Medicine Design).
5. Test Marker's Clinical Utility HU and Relative Efficiency
The notion of the marker's clinical utility is not so explicit from the stratified design at first glance, but it can be derived directly from the precision medicine design by comparing the mean of the marker-dependent arm with the mean of the marker-independent arm, μIN.
Hence,
| (26) |
The corresponding hypothesis is HU : Uf(λsen, λspec, p) = 0. Shih and Lin14 discussed the differences between the marker's clinical utility and the marker-by-treatment interaction. Both the third and the last equations above show that Uf is a weighted sum of δ1 and δ0, which will be used for the stratified design to estimate the marker's clinical utility by replacing f by r (see the next section for justification). The last equation further indicates an interpretation of Uf as a weighted sum of treatment effects when the biomarker is used to support the treatment decision, discounting the treatment effects when the marker results are not available and the choice of treatment is made by randomization with ratio f. Different from testing H−, H+, and HI, f may be 0 in testing HU. When f = 0, the marker's clinical utility is the “marker-directed treatment effect” relative to the “always control-treated effect in the absence of the biomarker”. For f ≠ 0, from the third equation above, with the corresponding estimators of δ1 and δ0,
Ûf(λsen, λspec, p) = p(λsen – f)δ̂1 + (1 – p)(1 – λspec – f)δ̂0.  (27)
5.1. Stratified Design
Eq. (26) reveals that the marker's clinical utility for the stratified design is a weighted sum of δ1 and δ0. In the stratified design, the randomization ratio r : (1 – r) is the same (and r ≠ 0) for each marker cohort regardless of marker status, i.e., it is marker-independent. Therefore, it matches the meaning of f : (1 – f), the randomization ratio in the marker-independent arm of the precision medicine design. Hence we take r = f (≠ 0) for the stratified design in Eqs. (26) and (27), and the estimated marker's clinical utility is, from Eqs. (5) and (8):
| (28) |
with its variance
| (29) |
where
and
5.2. Precision Medicine Design
For the marker-dependent arm, we have the direct unbiased estimator:
with its variance
| (30) |
where
from Eq. (2), and
from Eq. (4).
For the marker-independent arm, we have the direct unbiased estimator
with its variance
| (31) |
where
and
from Eq. (B.2).
Hence, the estimated marker's clinical utility is
with its variance
| (33) |
Notice that, unlike the situation of testing H−, H+, and HI, where f ≠ 0 is necessary, it is permissible in testing HU that f = 0 here. Moreover, as commented previously, since σ² dominates the variance, the variance in Eq. (33) above is not sensitive to f. This can be seen from the supplementary graphs provided in Appendix D (Fig 5), where in each graph (A to H) all the dashed lines (f = 0 to 0.8 for the precision medicine design) overlap.
Figure 5. Variances in Tests of Marker Utility for varying f (solid line for Stratified design with r = 0.5, and dashed line for Precision Medicine Design with t = 0.5 (except Figure B)).
5.3. Relative Efficiency with Respect to Testing HU
To assess the relative efficiency in testing HU or estimating Uf, the marker's utility, we compare the variance in Eq. (29) of the stratified design versus that in Eq. (33) of the precision medicine design. We use the same setup for the parameters as in Sections 3.3 and 4.3, in particular, the usual homogeneous variance σ²ij = σ² for i = T, C; j = 0, 1. This leads to examining the REs with the optimal r = 0.5 for the stratified design. Also, as commented above, the variance of the precision medicine design is insensitive to f; thus we set f = 0.5 as well. We vary the other design parameters over wide ranges as before. In Fig 2D, setting λsen = λspec = 0.9 and t = 0.5, we vary the marker-positive prevalence rate. In Fig 2E, we vary t. In Fig 2F, we vary λsen.
From Fig 2D we see that the precision medicine design is at most 60% as efficient as the stratified design and that the RE trend is a quadratic function of the marker-positive prevalence rate. But different from testing HI, RE reaches its minimum (25%) when the marker-positive prevalence rate is p = 0.5 in testing HU. This finding shows two important facts: First, it verifies the assertion of Shih and Lin14 on the differences between the biomarker-treatment interaction and the biomarker's clinical utility; Second, contrary to the suggestion in the previous literature,9–11 it is not true that a higher prevalence of marker-positives always leads to more loss of efficiency for the precision medicine design. The relative efficiency trend is mostly hypothesis-specific.
Fig 2E shows that, similar to testing the interaction, RE is quadratic in t and reaches its maximum (around 40%) when t is about 0.5. Again, the marker effect ΔT makes a minuscule difference in the RE trends. Thus, in Fig 2F we pick ΔT = 0.45 to depict RE versus 0.5 ≤ λsen = λspec ≤ 1, for p = 0.2 to 0.8 and t = f = 0.5. As shown, the RE trend with respect to the marker's sensitivity is similar for p = (0.2 or 0.8) and (0.4 or 0.6). Different from testing HI, a better marker assay helps the precision medicine design more than the stratified design in testing HU.
In addition, we also address the question of how the two designs are likely to perform if the true parameter values differ from the assumed values. Toward this end, we examine how the variances of the two designs behave with varying design parameter values in the supplementary graphs in Appendix C (Figs 4D-F). It is interesting to see that, in testing HU, the variance of the test is more stable/robust for the precision medicine design (dashed lines) than for the stratified design (solid lines) with respect to the varying marker-positive prevalence rate (Fig 4D), marker sensitivity (Fig 4E) and specificity (Fig 4F). This is the opposite of what we see in the tests for H−, H+, and HI. However, this robustness does not overturn the efficiency comparison: in all of Figs 4D-F, the variances of the precision medicine design are larger than the corresponding variances of the stratified design over the entire design parameter region. This means that the precision medicine design cannot attain better efficiency in testing HU because an assumed parameter value (such as the prevalence rate, sensitivity, or specificity) differs from the true value, or vice versa, even though its variance is more stable/robust. For example, examining the variances versus λsen in Fig 4E, with the other design parameters as specified in the heading, the solid curve labeled (prevalence) p = 0.8 clearly changes from low (≈ 0.0038 at λsen = 0.6) to high (≈ 0.006 at λsen = 0.9) for the stratified design, while the corresponding dashed line is quite flat (≈ 0.0142) for the precision medicine design. Suppose one assumes the marker's sensitivity is λsen = 0.9 but the true sensitivity is λsen = 0.6; then the true relative efficiency is RE ≈ 0.0038/0.0142 = 0.27 rather than the assumed RE ≈ 0.006/0.0142 = 0.42. Both are quite low. Similar observations and comments can be made with Fig 4F for RE with respect to the specificity λspec.
6. Examples
Example 1 is a phase III trial in non-small-cell lung cancer using quantitative excision repair cross-complementing 1 (ERCC1) mRNA expression as a biomarker to customize cisplatin-based treatment.18 We use this example to illustrate issues in the precision medicine design used in this trial and the potential advantages of a stratified design had it been adopted. Corresponding to the diagram of the precision medicine design in Section 3.2, this trial first randomized patients with a 1:2 ratio into the marker-independent arm and the marker-dependent arm (i.e., t = 1/3). In the marker-independent arm, patients always received the control/standard of care (SOC), docetaxel followed by cisplatin (i.e., f = 0). In the marker-dependent arm, patients with low levels of ERCC1 mRNA expression were also treated with the SOC, while the high genotypic group always received the test treatment, docetaxel followed by gemcitabine. Since there was no test treatment group in the marker-independent arm (f = 0), the true treatment effects in the marker cohorts (δj, j = 0, 1) and the true marker effects in each treatment group (Δi, i = T, C) could not be unbiasedly estimated/tested, nor could the treatment-by-marker interaction. However, the primary goal of this study, to assess the utilization of ERCC1 expression to guide the selection of platinum chemotherapy for lung cancer patients, could be achieved as follows. The overall response rate (ORR) was 39.3% (out of 141 subjects) in the marker-independent arm, and 51.2% (out of 225 subjects) in the marker-dependent (genotypic) arm; c.f. Table 2 of Cobo et al.18 Thus, the marker's utility was estimated to be, in terms of ORR, a gain of about 12% with the use of ERCC1 mRNA expression to guide the selection of platinum therapy versus not using the biomarker and always treating patients with the SOC. The marker-positive (high level of ERCC1 mRNA expression) prevalence was about p ≈ 96/225 = 43%.
Even though this trial used f = 0, according to Section 5 and Appendix D (Figs 5A to 5H) the variance of the marker's utility estimate is insensitive to f for the precision medicine design. The relative efficiency results are therefore applicable to designs with f > 0 (say, f = 0.5) as well. Moreover, even though there was no explicit information on the sensitivity and specificity of the ERCC1 mRNA assay and classification rule and no estimate of the marginal marker effect, we may still entertain the following question: If the same study had employed a marker-stratified design instead, say with r = 0.5, how much would it have gained in efficiency in testing the hypothesis of the marker's clinical utility? Figs 2D-F can shed light on the answer. First, as commented before, Figs 2D and 2E show that the marker-effect difference (ΔT – ΔC) makes a minuscule difference in the RE trends. Figs 2E and 2F show that the maximum RE occurs at randomization t ≈ 0.5 and that RE increases with increasing sensitivity and specificity. Given an optimistic scenario in which sensitivity = specificity = 0.90, and checking the RE for the case of prevalence p ≈ 43%, both Figs 2D and 2F show RE ≈ 0.25. This implies that if the trial had used a stratified design, the efficiency in testing HU could have increased about fourfold, or equivalently the required sample size could have been cut to about a quarter. In addition, with a stratified design, the other hypotheses H+, H−, and HI could also have been tested.
Example 2 is the pembrolizumab versus docetaxel for previously treated, PD-L1-positive, advanced non-small-cell lung cancer (NSCLC) trial.8 This trial stratified qualified subjects by the biomarker TPS (tumour proportion score ≥50% vs 1 - 49%), which measures the extent of PD-L1 expression, then randomized subjects with a 1:1:1 ratio to three treatment groups within each of the high and low TPS strata. For illustration, and since there was no significant difference between the two test doses of pembrolizumab, we only look at pembrolizumab 2 mg (Pem) versus docetaxel (Dox), the control/SOC. The high TPS (“strongly positive” PD-L1) stratum (TPS+) enrolled 442 patients and the low TPS stratum (TPS-) 592 patients. Thus the prevalence rate of PD-L1 strong positives among PD-L1-positive NSCLC patients was p ≈ 44%. According to Merck & Co., Inc.,19 66% of NSCLC patients are PD-L1 positive (TPS ≥ 1%). Thus, the prevalence of PD-L1 strong positives is p ≈ 30% among the NSCLC patient population. The PD-L1 IHC 22C3 assay and classification was shown to have high sensitivity and specificity and was approved by the FDA.20 The overall survival results were the following: In the TPS+ cohort, 58 deaths out of 139 patients in the Pem group (42%) and 86 deaths out of 152 patients in the Dox group (57%); in the TPS- cohort, 114 deaths out of 205 patients in the Pem group (56%) and 107 deaths out of 191 patients in the Dox group (56%). With λsen = λspec ≈ 0.95, the positive predictive value is τ = 0.89 and the negative predictive value is η = 0.98. From Eq. (5) we obtain δ̂1 = [−0.98(0.42 – 0.57) + (1 – 0.89)(0.56 – 0.56)]/(1 – 0.89 – 0.98) = −0.169, and from Eq. (8) we have δ̂0 = [(1 – 0.98)(0.42 – 0.57) – 0.89(0.56 – 0.56)]/(1 – 0.89 – 0.98) = 3.45 × 10−3. The marker's clinical utility with respect to overall survival can be directly calculated from Eq. (27):
Ûr(λsen, λspec, p) = p(λsen – r)δ̂1 + (1 – p)(1 – λspec – r)δ̂0 ≈ 0.3(0.95 – 0.5)(−0.169) + 0.7(1 – 0.95 – 0.5)(3.45 × 10−3) = −2.39 × 10−2. This means that, in practice, we estimate that about 2.4% more lives are saved by treating advanced NSCLC patients with pembrolizumab or docetaxel using the marker-guided strategy (i.e., treat with Pem if TPS+, otherwise treat with Dox) versus equal randomization without the marker's information. If one were to use the precision medicine design with t = f = 0.5 for this study, about a 3-fold sample size would be required to show this result, since RE ≈ 0.30 according to Figs 2D and 2E.
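The arithmetic in Example 2 can be reproduced directly. The short sketch below simply re-evaluates Eqs. (5), (8), and (27) with the reported death rates and the accuracy values quoted in the text; it introduces no new data.

```python
# Reported death rates (Pem vs Dox) by TPS stratum, KEYNOTE-010 (pembrolizumab 2 mg arm)
d_pos = 0.42 - 0.57          # observed contrast among marker-appeared positives (TPS >= 50%)
d_neg = 0.56 - 0.56          # observed contrast among marker-appeared negatives (TPS 1-49%)
sen = spec = 0.95            # assumed assay sensitivity/specificity
p, r = 0.30, 0.5             # prevalence of strong PD-L1 positivity; randomization ratio
tau, eta = 0.89, 0.98        # positive / negative predictive values quoted in the text

k = tau + eta - 1
delta1 = (eta * d_pos - (1 - tau) * d_neg) / k                    # Eq. (5):  about -0.169
delta0 = (tau * d_neg - (1 - eta) * d_pos) / k                    # Eq. (8):  about  0.0034
U = p * (sen - r) * delta1 + (1 - p) * (1 - spec - r) * delta0    # Eq. (27): about -0.024
print(f"delta1_hat={delta1:.4f}, delta0_hat={delta0:.4f}, U_hat={U:.4f}")
```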
7. Discussion
As pointed out by Hayes,3 a past President of the American Society of Clinical Oncology (ASCO), prospective randomized clinical trials (PRCT) addressing biomarkers are precious and few because they are time-consuming and costly. But PRCT are necessary for regulatory agencies to approve new therapies with predictive biomarkers. Therefore, statistical efficiency is a critical issue when considering trial designs. In this paper, we compared the efficiency of two families of designs: the marker-stratified designs and the precision medicine designs. We focused on these two designs because they are popular; almost all of the literature on biomarker trials discusses them. Stratified designs are classical, but with a biomarker we need to consider the sensitivity and specificity of the assay and classification rule. On the other hand, the precision medicine designs (a renaming of the so-called “marker-based strategy designs”) are a relatively new proposal in the modern precision medicine era. More importantly, these two designs are capable of addressing the same set of hypotheses of interest, namely, H− (treatment effect in the marker-negative cohort), H+ (treatment effect in the marker-positive cohort), HI (the marker-by-treatment interaction), and HU (the marker's clinical utility), as shown in this paper. Of course, the composite hypothesis HC = H− ⋂ H+ (overall treatment effect in the unselected population) is also of interest and testable with either of these two designs. It is sufficient and clearer to consider the relative efficiency in testing the individual hypotheses rather than multiple tests or a composite of these hypotheses. We emphasize that our focus is on the relative efficiency of these two designs, not on how to control the type-I error rate in testing a family of hypotheses. For the latter problem, we refer to Millen et al.21
It is clear that the relative efficiency of designs has to be based on testing/estimating the same hypothesis/parameter of interest. We showed that, in addressing each of H−, H+, HI, and HU, the stratified design is more efficient than the precision medicine design. The precision medicine design loses efficiency largely because it includes patients treated with the same treatment on both the marker-independent and marker-dependent arms, as suggested by Mandrekar and Sargent.9–11 However, unlike what was previously suggested by Mandrekar and Sargent,9–11 our results (Fig 1 and Fig 2, Panels A and D) showed that the relative efficiency exhibits very different trends as a function of the marker-positive prevalence, depending on which hypothesis the two designs are compared on. Fig 1A (with respect to testing H+) and Fig 1D (with respect to testing H−) showed linear functions, but with opposite (increasing/decreasing) trends. Fig 2A (with respect to testing the interaction hypothesis) and Fig 2D (with respect to testing the marker utility hypothesis) showed quadratic functions, also with opposite (concave/convex) directions. Moreover, the extent to which the precision medicine design is less efficient than the stratified design is quite surprising in our investigation. Our evaluations (Figs 1 and 2) showed the patterns of the relative efficiency under wide ranges of design parameter values. The patterns of the REs vary and are hypothesis-specific as expected, but the conclusion is consistent: the marker-stratified design is much more efficient than the precision medicine design for all the hypotheses of interest. The examples discussed in Section 6 illustrate how much efficiency the stratified design gains over the precision medicine design in testing/estimating the biomarker's clinical utility.
Finally, based on the relative efficiency considered in this paper, we recommend marker-stratified designs over precision medicine designs. In practice, when considering a marker-stratified design, information on design parameters such as the sensitivity, specificity, and prevalence of biomarker-positives needs to be assumed, together with the true treatment effect and variance, for sample size calculations. Information on these design parameters should be obtained from other studies. As uncertainty always exists when using information from other studies, a prudent researcher would contemplate various scenarios.
A. Derivations of Equations (6), (7), (9) and (10)
A.1. Approximation method
In the following when calculating
we need to consider . Since sample size needs to be at least 2 to calculate variance, in practice we consider and have
Liu et al17 approximated N1 by its expected value, E(N1) = Nq, and replaced . Or we may treat N1 as given, then
A.2. Derivations of (6) and (7)
Using the method shown in Section A.1, we have
| (A.1) |
Since our goal here is to compare the relative efficiency between designs, the approximation is consistently applied to both designs. Similarly,
| (A.2) |
Equation (6) is the sum of Equations (A.1) and (A.2).
Since the variances in (A.1) and (A.2) can be estimated by
Equation (7) is the sum of these two estimated variances.
A.3. Derivations of (9) and (10)
Since
| (A.3) |
and
| (A.4) |
Equation (9) is the sum of Equations (A.3) and (A.4).
Since the variances in (A.3) and (A.4) can be estimated by
Equation (10) is the sum of these two estimated variances.
B. Derivations of Equations (11) to (19)
B.1. Unbiased estimators of μij
For the marker-independent arm, the means and variances for treatment group i (= T or C) are
| (B.1) |
and
| (B.2) |
respectively.
Unbiased estimators of μi# can be obtained for i = T, C respectively by the sample means:
and
The variances of the estimators are
| (B.3) |
and
| (B.4) |
where , can be estimated by the sample variances and , respectively.
For the marker-dependent arm, the means for the two treatment groups are
| (B.5) |
and
| (B.6) |
The variances for the two treatment groups are
and
Let ND1 be the sample size of the marker-appeared positive group (M = 1, treated with T) and ND0 that of the marker-appeared negative group (M = 0, treated with C) in the marker-dependent arm. Note that ND1 ∼ Binomial(N(1 − t), q) and ND0 = N(1 − t) − ND1. The unbiased estimators of the means in this arm can be obtained by
and
respectively.
Using the expected sample size E(ND1) = N(1 − t)q to approximate ND1, or treating ND1 as given, the variances of the estimators are, respectively,
| (B.7) |
and
| (B.8) |
where and can be estimated by the sample variances and , respectively. Note that and by definition.
The unbiased estimators of μij can be derived by solving the above Eqs. (B.1), (B.5) and (B.6); we have
| (B.9a) |
| (B.9b) |
| (B.9c) |
| (B.9d) |
B.2. Derivations of (11), (12) and (13)
From Equations B.9a to B.9d, for the marker positive cohort, the unbiased estimate of the treatment effect, i.e., Equation (11), can be obtained by
From Equations B.3, B.4, B.7 and B.8 we have (Equation (12))
which, with either approximation, can be estimated by (Equation (13))
B.3. Derivations of (17), (18) and (19)
Similarly from Equations B.9a to B.9d, for the marker negative cohort, the unbiased estimate of the treatment effect, i.e., Equation (17), can be obtained by
From Equations B.3, B.4, B.7 and B.8 , we have (Equation (18))
which, with either approximation, can be estimated by (Equation (19))
C. Robustness of Variances to Design Parameters
In this appendix, we address the following question (raised by the Associate Editor): Is it possible that, regarding efficiency, one design may be overly sensitive to (i.e., not sufficiently robust against) potential deviations from the assumed parameter values? Toward this end, we plot the variances of the tests for each of the hypotheses (H−, H+, HI, HU) versus the design parameters (marker-positive prevalence rate, sensitivity, and specificity).
In testing H−, we plot the variance of the stratified design (c.f. Eq. (9)) together with that of the precision medicine design (c.f. Eq. (18)). In testing H+, we plot the variance of the stratified design (c.f. Eq. (6)) together with that of the precision medicine design (c.f. Eq. (12)). In testing H− or H+, the variances of the tests are more stable/robust for the stratified design (solid lines) than for the precision medicine design (dashed lines) with respect to the varying marker-positive prevalence rate (Figs 3A and 3D), marker sensitivity (Figs 3B and 3E), and specificity (Figs 3C and 3F).
Similarly, in testing HI, we plot the test variances Var(θ̂SF) (c.f. Eq. (21)) of the stratified design and Var(θ̂PM) (c.f. Eq. (24)) of the precision medicine design. It is seen that the variance of the test for HI is also more stable/robust for the stratified design (solid lines) than for the precision medicine design (dashed lines) with respect to the varying marker-positive prevalence rate (Fig 4A), marker sensitivity (Fig 4B) and specificity (Fig 4C).
In testing HU, we also examine how the variance of the stratified design (c.f. Eq. (29)) and that of the precision medicine design (c.f. Eq. (33)) behave with varying design parameters. It is interesting to see that the variance of the test is more stable/robust for the precision medicine design (dashed lines) than for the stratified design (solid lines) with respect to the varying marker-positive prevalence rate (Fig 4D), marker sensitivity (Fig 4E) and specificity (Fig 4F). This is the opposite of what we see in the tests for H−, H+, and HI. Nevertheless, in all graphs, the variances of the precision medicine design are larger, indicating lower efficiency, than the corresponding variances of the stratified design.
D. Sensitivity to f in Testing Marker's Clinical Utility
In this appendix, we demonstrate that, under the conventional setting of homogeneous variances, the variance (c.f. Eq. (33)) of the precision medicine design in testing HU is insensitive to the parameter f, the randomization ratio in the marker-independent arm. Since f only applies to the precision medicine design (dashed lines), not the stratified design (solid lines), we see several dashed lines for varying f (0 to 0.8) but only one solid line (for the stratified design, r = 0.5) in all the graphs in Fig 5. In all the graphs, the dashed lines are almost indistinguishable, meaning that the variance, as a function of the other parameters such as the marker-positive prevalence rate, the proportion in the marker-independent arm, sensitivity, and specificity, is insensitive to f. For this reason, it is justified to set f = 0.5 without loss of generality in Section 5.3.
References
1. Sargent D, Allegra C. Issues in clinical trial designs for tumor marker studies. Seminars in Oncology. 2002;29:222–230. doi:10.1053/sonc.2002.32898.
2. Freidlin B, McShane LM, Korn EL. Randomized clinical trials with biomarkers: design issues. J Natl Cancer Inst. 2010;102:152–160. doi:10.1093/jnci/djp477.
3. Hayes DF. A bad tumor marker test is as bad as a bad drug: the case for more consistent regulation of cancer diagnostics. ASCO Connection. 2016 Dec 22. http://connection.asco.org/blogs.
4. Simon R, Maitournam A. Evaluating the efficiency of targeted designs for randomized clinical trials. Clinical Cancer Research. 2004;10:6759–6763. doi:10.1158/1078-0432.CCR-04-0496.
5. Maitournam A, Simon R. On the efficiency of targeted clinical trials. Statistics in Medicine. 2005;24:329–339. doi:10.1002/sim.1975.
6. Sargent DJ, Conley BA, Allegra C, Collette L. Clinical trial designs for predictive marker validation in cancer treatment trials. J Clin Oncol. 2005;23:2020–2027. doi:10.1200/JCO.2005.01.112.
7. Hoering A, LeBlanc M, Crowley JJ. Randomized phase III clinical trial designs for targeted agents. Clin Cancer Res. 2008;14:4358–4367. doi:10.1158/1078-0432.CCR-08-0288.
8. Herbst RS, Baas P, Kim DW, et al. Pembrolizumab versus docetaxel for previously treated, PD-L1-positive, advanced non-small-cell lung cancer (KEYNOTE-010): a randomised controlled trial. Lancet. Published online December 19, 2015. doi:10.1016/S0140-6736(15)01281-7.
9. Mandrekar SJ, Sargent DJ. Clinical trial designs for predictive biomarker validation: one size does not fit all. J Biopharm Stat. 2009;19:530–542. doi:10.1080/10543400902802458.
10. Mandrekar SJ, Sargent DJ. Clinical trial designs for predictive biomarker validation: theoretical considerations and practical challenges. J Clin Oncol. 2009;27:4027–4034. doi:10.1200/JCO.2009.22.3701.
11. Mandrekar SJ, Sargent DJ. Predictive biomarker validation in practice: lessons from real trials. Clinical Trials. 2010;7:567–573. doi:10.1177/1740774510368574.
12. Hayes DF, Trock B, Harris AL. Assessing the clinical impact of prognostic factors: when is statistically significant clinically useful? Breast Cancer Res Treat. 1998;52:305–319. doi:10.1023/a:1006197805041.
13. Young KY, Laird A, Zhou XH. The efficiency of clinical trial designs for predictive biomarker validation. Clinical Trials. 2010;7:557–566. doi:10.1177/1740774510370497.
14. Shih WJ, Lin Y. On study designs and hypotheses for clinical trials with predictive biomarkers. Contemporary Clinical Trials. 2017;62:140–145. doi:10.1016/j.cct.2017.08.014.
15. Potti A, Nevins JR. Utilization of genomic signatures to direct use of primary chemotherapy. Curr Opin Genet Develop. 2008;18(1):62–67. doi:10.1016/j.gde.2008.01.018.
16. Renfro LA, Mallick H, An MW, Sargent DJ, Mandrekar SJ. Clinical trial designs incorporating predictive biomarkers. Cancer Treatment Reviews. 2016;43:74–82. doi:10.1016/j.ctrv.2015.12.008.
17. Liu C, Liu A, Hu J, Yuan V, Halabi S. Adjusting for misclassification in stratified biomarker clinical trial. Statistics in Medicine. 2014;33:3100–3113. doi:10.1002/sim.6164.
18. Cobo M, Isla D, Massuti B, et al. Customizing cisplatin based on quantitative excision repair cross-complementing 1 mRNA expression: a phase III trial in non-small-cell lung cancer. J Clin Oncol. 2007;25:2747–2754. doi:10.1200/JCO.2006.09.7915.
19. Merck & Co., Inc. KEYTRUDA: Start Informed With PD-L1 Expression in mNSCLC. https://www.keytruda.com/static/pdf/keytruda-pd-l1-expression-testing-guide.pdf. ONCO-1212896-0000 05/17.
20. Agilent Pathology Solutions. PD-L1 IHC 22C3 pharmDx is FDA-approved for in vitro diagnostic use. https://www.agilent.com/cs/library/usermanuals/public/29158_pd-l1-ihc-22C3-pharmdx-nsclc-interpretation-manual.pdf.
21. Millen BA, Dmitrienko A, Ruberg S, Shen L. A statistical framework for decision making in confirmatory multipopulation tailoring clinical trials. Drug Information Journal. 2012;46:647–656.