Author manuscript; available in PMC: 2014 Aug 15.
Published in final edited form as: Stat Med. 2014 Apr 14;33(18):3100–3113. doi: 10.1002/sim.6164

Adjusting for Misclassification in a Stratified Biomarker Clinical Trial

Chunling Liu 1, Aiyi Liu 2,*, Jiang Hu 3, Vivian Yuan 4, Susan Halabi 5
PMCID: PMC4107031  NIHMSID: NIHMS578974  PMID: 24733510

Abstract

Clinical trials utilizing predictive biomarkers have become a research focus in personalized medicine. We investigate the effects of biomarker misclassification on the design and analysis of stratified biomarker clinical trials. For a variety of inference problems including marker-treatment interaction in particular, we show that marker misclassification may have profound adverse effects on the coverage of confidence intervals, power of the tests, and required sample sizes. For each inferential problem we propose methods to adjust for the classification errors.

Keywords: Biomarkers, classification error, correction for error, personalized medicine, power and sample size, prevalence, randomized controlled clinical trials, sensitivity and specificity

1. Introduction

Advances in understanding the genetics and biology of certain cancers have led to the successful development of novel therapies that target specific pathways. A convincing example is given in [1] that reported a statistically significant overall hazard ratio estimate from a randomized clinical trial in which women with ovarian cancer were treated with either pegylated liposomal doxorubicin or topotecan. The authors further reported that among patients with platinum-sensitive disease, a more significant hazard ratio was found. However, among patients with platinum-refractory disease, the hazard ratio was not significant. The study results showed an evident interaction between treatment (pegylated liposomal doxorubicin or topotecan) and a biomarker (platinum).

When biomarker-treatment interaction is the primary research interest in a clinical trial, the stratified biomarker design is commonly used because it takes full advantage of randomization and can address various questions of interest; see, among others, [2–7]. In renal cell carcinoma, novel therapies that target the vascular endothelial growth factor (VEGF) and mammalian target of rapamycin (mTOR) pathways have been identified and are being used as treatment options for patients [8, 9].

As another example, data from several centers have shown that retinoblastoma function may help determine whether the androgen signaling pathway is viable. The loss of retinoblastoma status plays a critical role in cell regulation and suppresses androgen receptor expression and activity. It is estimated that 30%–40% of prostate cancers will be androgen positive [10–12]. Investigators are interested in whether patients with advanced prostate cancer respond to treatment differently according to their retinoblastoma status.

Predictive markers for response have been shown to be important in patients with advanced renal cell carcinoma. Furthermore, it has been reported that inhibition of the VEGF pathway improves clinical outcomes, such as objective response, progression-free survival and overall survival. A statistically significant interleukin 6 (IL-6) by treatment interaction in predicting progression-free survival (PFS) was observed in patients with metastatic renal cell carcinoma (p-value = 0.009) [13]. In patients with high IL-6, the median PFS was 33 weeks and 10 weeks in patients treated with pazopanib and placebo, respectively. On the other hand, the median PFS was 42 weeks and 24 weeks in low IL-6 patients treated with pazopanib and placebo, respectively [13].

We consider a two-arm trial (treatment versus standard) with T being the treatment indicator, where T = 1 if treatment and T = 0 if standard. We confine attention to a dichotomous predictive biomarker whose status is denoted by G (=1 if positive and =0 if negative). The prevalence of the biomarker is denoted by ξG = Pr (G = 1). Then in a stratified biomarker design, patients with the same biomarker status are randomized into treatment arm or standard arm, as shown in the following figure:

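$$\text{True marker status}\;\begin{cases}\text{positive } (G=1) & \begin{cases}\text{treatment } (T=1),\\ \text{standard } (T=0);\end{cases}\\[4pt] \text{negative } (G=0) & \begin{cases}\text{treatment } (T=1),\\ \text{standard } (T=0).\end{cases}\end{cases}$$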

The primary interest of a stratified biomarker design is to investigate the marker-treatment interaction on a clinical endpoint, denoted by Y. Other questions that can be answered from the trial employing such a design include whether the treatments are different within the same marker status, or whether the clinical outcomes within the same treatment are different between marker status. These questions all involve inference on some function of the marker-by-treatment means of the clinical outcomes:

$$E(Y \mid G=g, T=t) = \mu_{gt}, \qquad \mathrm{VAR}(Y \mid G=g, T=t) = \sigma_{gt}^2.$$

Define $\delta_g = \mu_{g1} - \mu_{g0}$ to be the mean outcome difference between treatments in the population with marker status $G = g$, and $\Delta_t = \mu_{1t} - \mu_{0t}$ to be the mean outcome difference between positive and negative marker status in the same treatment, as a measure of the marker effect in treatment arm $T = t$. We are interested in testing, separately or simultaneously, the null hypotheses $H_0: \delta_g = 0$ $(g = 0, 1)$, or $H_0: \Delta_t = 0$ $(t = 0, 1)$. The null hypothesis of no marker-by-treatment interaction is then $H_0: \gamma = 0$, where

$$\gamma = \delta_1 - \delta_0 = \Delta_1 - \Delta_0.$$

Because of the independence between the test statistics, the simultaneous null hypotheses can be tested by separately testing each individual null hypothesis with adequate allocation of the overall type I error rate, as demonstrated below for testing H0.

Let Wg be a standardized test statistic for testing H0g : δg = 0. The null hypothesis H0 = H00 ∩ H01 is rejected if |Wg| > cg for g = 0 or 1, where cg are properly chosen critical values.

Assuming that W0 and W1 are independent, which is the case in most trial settings, the power of the test is given by

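$$\omega(\delta_0,\delta_1) = \Pr(|W_0| > c_0 \text{ or } |W_1| > c_1) = \omega_0(\delta_0) + \omega_1(\delta_1) - \omega_0(\delta_0)\,\omega_1(\delta_1),$$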

where ωg(δg) = Pr (|Wg| > cg) is the power of the test that rejects H0g if |Wg| > cg. The type I error rate is thus given by ω(0, 0) = ω0(0) + ω1(0) − ω0(0)ω1(0) where ωg(0) is the type I error rate for testing H0g.

With significance level $\alpha$ to test the null hypothesis $H_0$ and power $1 - \beta$ to detect marker-specific treatment differences $\delta_0$ and $\delta_1$, one can allocate the type I error rate and the power between the two null hypotheses, $H_{00}$ and $H_{01}$, and test them separately. Suppose the allocation for $H_{0g}$ is $\alpha_g$ for type I error and $1 - \beta_g$ for power at $\delta_g$; then these allocated errors must satisfy $\alpha = \alpha_0 + \alpha_1 - \alpha_0\alpha_1$ and $\beta = \beta_0\beta_1$. In practice one can assign smaller error rates to the more important hypotheses, e.g. $H_{01}$, which concerns the treatment difference in the marker-positive group. With equal allocation of type I error rates and power, we have $\alpha_0 = \alpha_1 = 1 - (1 - \alpha)^{1/2}$ and $\beta_0 = \beta_1 = \beta^{1/2}$. The null hypotheses $H_0: \Delta_t = 0$ $(t = 0, 1)$ can be dealt with similarly.
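The equal-allocation rule can be checked numerically. The following is a minimal sketch; the function name `equal_allocation` and the values α = 0.05, β = 0.10 are illustrative assumptions, not taken from the paper.

```python
def equal_allocation(alpha, beta):
    """Split an overall type I error alpha and type II error beta equally
    across two independent tests, so that
    alpha = a0 + a1 - a0*a1 and beta = b0*b1 (the constraints in the text)."""
    alpha_g = 1.0 - (1.0 - alpha) ** 0.5   # per-hypothesis type I error
    beta_g = beta ** 0.5                   # per-hypothesis type II error
    return alpha_g, beta_g


# Illustrative (assumed) overall error rates.
alpha_g, beta_g = equal_allocation(alpha=0.05, beta=0.10)

# Recombine the per-hypothesis rates to confirm the overall rates are recovered.
overall_alpha = alpha_g + alpha_g - alpha_g * alpha_g
overall_beta = beta_g * beta_g
print(f"alpha_g={alpha_g:.4f}, beta_g={beta_g:.4f}")
print(f"overall alpha={overall_alpha:.4f}, overall beta={overall_beta:.4f}")
```

Each individual test therefore runs at a level slightly above α/2, which is marginally less conservative than a Bonferroni split.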

In the present article, we investigate, both analytically and numerically, the adverse effects of biomarker classification errors on the design of a stratified biomarker clinical trial. For a variety of inference problems including marker-treatment interaction, we show that marker misclassification may have profound adverse effects on the coverage of confidence intervals, power of the tests, and required sample sizes. For each inference problem we propose methods to adjust for the classification errors. Sample size calculations adjusting for misclassification are presented in particular for testing marker-treatment interactions.

The paper is organized as follows. In Section 2, we present notations and preliminary results concerning the design of a stratified biomarker trial in the presence of marker misclassification. We then discuss the effects of misclassification on estimating treatment means in each marker stratum, and present a method to correct for misclassification in Section 3. We investigate the effects of misclassification on estimating treatment differences in each marker stratum in Section 4, followed by a method to correct for misclassification. We evaluate the effects of misclassification on marker differences in each treatment arm in Section 5, with a method to correct for marker misclassification. In Section 6, we address the marker-treatment interaction, starting with the investigation of the effects on power and sample size of misclassification, followed by a method to correct for misclassification, an approach to compute sample sizes to warrant adequate power to detect potential interaction, and an example. We discuss the findings in Section 7.

2. The Design in Presence of Misclassification

We assume that a gold standard exists to determine the true status G of the biomarker, with G = 1 being positive and 0 if otherwise. Due to reasons such as cost, ethics or administration, an imperfect assay is used, resulting in classification errors in determining the biomarker status. This is common in assaying a diagnostic biomarker; see, among others, [14–16]. Wang et al. [16] demonstrated that misclassification can inflate type I error rates in a noninferiority trial with binary outcomes.

Let M be the observed status of G, with sensitivity π1 = Pr (M = 1 | G = 1) and specificity π0 = Pr (M = 0 | G = 0). For the biomarker to be practically useful, we assume that 1/2 < π0, π1 ≤ 1. It thus follows that the probability that the observed status of the marker is positive for a patient is

$$\xi_M = \pi_1\xi_G + (1-\pi_0)(1-\xi_G). \tag{1}$$

We refer to ξM as the observed prevalence, which is bounded by 1 − π0 and π1 because 0 ≤ ξG ≤ 1 and π0 + π1 > 1.

The actual stratified design is carried out according to the figure with the observed marker status M replacing the true status G.

Suppose that a total of N patients are enrolled into the trial. Let Yi be the observed clinical outcome of the ith (i = 1, …, N) patient with observed marker status Mi(= 0, 1), in treatment arm Ti(= 0, 1).

Let N1 be the number of patients with observed marker status being positive. Note that N1 is a random variable following a binomial distribution with size N and success probability ξM; thus E(N1) = NξM. Write N0 = N − N1, the number of patients with observed marker status being negative. Let Nmt = λmtNm be the number of patients in the subgroup with M = m and T = t, where the allocation proportions λmt ∈ [0, 1] are usually pre-specified, and λm1 + λm0 = 1. The allocation ratio of treatment to standard in the M = m group is then λm1/λm0. Equal allocation between treatments in the M = m group corresponds to λmt = 1/2. The targeted biomarker-strategy designs correspond to an extreme allocation with λ01 = 0; see, e.g., [4] and [16].

To simplify the notation, we assume that all tests have significance level α and all confidence intervals have confidence level 1 − α. We refer to procedures with no adjustment for classification errors as “naive” procedures, and to those that adjust for misclassification as “error-adjusted” procedures. Wherever there is no ambiguity, we will omit these distinctions.

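The naive estimators of $\mu_{gt}$ and $\sigma_{gt}^2$ are given by

$$\hat\mu_{gt} = \frac{1}{N_{gt}} \sum_{\{i:\,M_i=g,\,T_i=t\}} Y_i, \qquad \hat\sigma_{gt}^2 = \frac{1}{N_{gt}-1} \sum_{\{i:\,M_i=g,\,T_i=t\}} (Y_i - \hat\mu_{gt})^2.$$

The naive confidence limits of $\mu_{gt}$ are calculated as

$$\hat\mu_{gt} \pm \hat\sigma_{gt} Z_{\alpha/2} / N_{gt}^{1/2}, \tag{2}$$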

where throughout Zr is the rth upper quantile of the standard normal distribution, that is, Φ(Zr) = 1 − r, where Φ denotes the standard normal distribution function.

The naive testing procedure rejects the null hypothesis H0g if

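$$|\hat\delta_g| / s_g > Z_{\alpha/2}, \tag{3}$$

where

$$\hat\delta_g = \hat\mu_{g1} - \hat\mu_{g0}, \qquad s_g^2 = \frac{1}{N_{g1}}\hat\sigma_{g1}^2 + \frac{1}{N_{g0}}\hat\sigma_{g0}^2. \tag{4}$$

Similarly, the null hypothesis $\Delta_t = 0$ is rejected if

$$|\hat\Delta_t| / s_t > Z_{\alpha/2},$$

where

$$\hat\Delta_t = \hat\mu_{1t} - \hat\mu_{0t}, \qquad s_t^2 = \frac{1}{N_{1t}}\hat\sigma_{1t}^2 + \frac{1}{N_{0t}}\hat\sigma_{0t}^2.$$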

If there are no classification errors, then the aforementioned estimates are unbiased, and, if N is large enough, the tests have significance level α and the confidence intervals have coverage probability 1 − α. In the presence of misclassification, however, these claims need to be carefully examined and corrections need to be made to account for classification error whenever necessary.

Throughout, unless stated otherwise, distributions of estimators and their characteristics are unconditional, taking the randomness of the observed sample sizes Nm (m = 0, 1) into account. Such an unconditional approach allows us to investigate the effects of the marker’s prevalence ξG as well. Conditional inference given Nm can be obtained in the derivation by replacing N with N1/ξM, where ξM is given in (1). To adjust for classification errors, we assume that the marker’s prevalence ξG, sensitivity π1, and specificity π0 are known; this implies that the marker’s positive and negative predictive values are also known, because of the well-known relationships:

$$\tau_1 = \pi_1\xi_G/\xi_M, \qquad \tau_0 = \pi_0(1-\xi_G)/(1-\xi_M). \tag{5}$$
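Equations (1) and (5) translate directly into code. The sketch below is illustrative only; the function names and the input values (ξG = 0.4, π1 = π0 = 0.90) are assumptions chosen for demonstration.

```python
def observed_prevalence(xi_g, sens, spec):
    """Eq. (1): probability that the imperfect assay reports a positive marker."""
    return sens * xi_g + (1.0 - spec) * (1.0 - xi_g)


def predictive_values(xi_g, sens, spec):
    """Eq. (5): positive (tau_1) and negative (tau_0) predictive values."""
    xi_m = observed_prevalence(xi_g, sens, spec)
    ppv = sens * xi_g / xi_m                    # tau_1 = Pr(G = 1 | M = 1)
    npv = spec * (1.0 - xi_g) / (1.0 - xi_m)    # tau_0 = Pr(G = 0 | M = 0)
    return ppv, npv


# Illustrative (assumed) inputs: 40% true prevalence, 90% sensitivity and specificity.
xi_m = observed_prevalence(0.4, 0.90, 0.90)
ppv, npv = predictive_values(0.4, 0.90, 0.90)
print(f"xi_M={xi_m:.4f}, tau_1={ppv:.4f}, tau_0={npv:.4f}")
```

With these inputs the observed prevalence lies between 1 − π0 and π1, as the bound stated after Eq. (1) requires.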

3. Estimating Stratum-Specific Treatment Means μgt

3.1. Effects of Misclassification

If the true marker status of the ith patient is Gi, then by the conditional expectations arguments we have

$$\zeta_{mt} = E(Y_i \mid M_i=m, T_i=t) = \sum_{g=0}^{1} \mu_{gt} \Pr(G_i=g \mid M_i=m),$$

noting that the treatments play no role in determining the marker’s status.

This leads to

$$\zeta_{1t} = \tau_1\mu_{1t} + (1-\tau_1)\mu_{0t}, \qquad \zeta_{0t} = \tau_0\mu_{0t} + (1-\tau_0)\mu_{1t},$$

where τ1 = Pr (Gi = 1 | Mi = 1) and τ0 = Pr (Gi = 0 | Mi = 0) are the marker’s positive predictive value and negative predictive value, respectively. Similarly we have

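$$\nu_{mt}^2 = \mathrm{VAR}(Y_i \mid M_i=m, T_i=t) = \sum_{g=0}^{1} (\mu_{gt}^2 + \sigma_{gt}^2)\Pr(G_i=g \mid M_i=m) - \zeta_{mt}^2,$$

and thus

$$\nu_{1t}^2 = \tau_1(\mu_{1t}^2+\sigma_{1t}^2) + (1-\tau_1)(\mu_{0t}^2+\sigma_{0t}^2) - \zeta_{1t}^2, \tag{6}$$

$$\nu_{0t}^2 = \tau_0(\mu_{0t}^2+\sigma_{0t}^2) + (1-\tau_0)(\mu_{1t}^2+\sigma_{1t}^2) - \zeta_{0t}^2. \tag{7}$$

Taking the marker classification errors into account, we have $E(\hat\mu_{gt}) = \zeta_{gt}$ and $E(\hat\sigma_{gt}^2) = \nu_{gt}^2$. The (unconditional) variances of the mean estimates are given by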

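$$\mathrm{VAR}(\hat\mu_{1t}) = E\{\mathrm{VAR}(\hat\mu_{1t} \mid N_{1t})\} + \mathrm{VAR}\{E(\hat\mu_{1t} \mid N_{1t})\} = E\!\left(\frac{\nu_{1t}^2}{N_{1t}}\right) \approx \frac{\nu_{1t}^2}{\lambda_{1t}\xi_M N} \tag{8}$$

and

$$\mathrm{VAR}(\hat\mu_{0t}) = E\!\left(\frac{\nu_{0t}^2}{N_{0t}}\right) \approx \frac{\nu_{0t}^2}{\lambda_{0t}(1-\xi_M)N}, \tag{9}$$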

noting that N1/N is a consistent estimator of ξM.

Therefore, in the presence of misclassification, the naive estimators $\hat\mu_{gt}$ and $\hat\sigma_{gt}^2$ are no longer unbiased for the parameters they estimate (i.e., $\mu_{gt}$ and $\sigma_{gt}^2$). The biases of the mean estimates are given, respectively, by

$$E(\hat\mu_{1t}) - \mu_{1t} = -(1-\tau_1)\Delta_t, \qquad E(\hat\mu_{0t}) - \mu_{0t} = (1-\tau_0)\Delta_t. \tag{10}$$

If we assume that, in the same treatment group, larger clinical outcomes are more likely to occur in patients with positive marker status, then the treatment mean will be underestimated for marker positive patients, but overestimated for the marker negative patients.

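For large samples, $\hat\mu_{1t}$ and $\hat\mu_{0t}$ are asymptotically normally distributed with $N_{gt}^{1/2}(\hat\mu_{gt} - \zeta_{gt}) \sim N(0, \nu_{gt}^2)$, where, throughout, “~” reads as “is distributed as”. Then the coverage probability of the naive confidence interval of $\mu_{1t}$ in (2) is approximately

$$\Pr\!\left(\hat\mu_{1t} - \hat\sigma_{1t}Z_{\alpha/2}/N_{1t}^{1/2} \le \mu_{1t} \le \hat\mu_{1t} + \hat\sigma_{1t}Z_{\alpha/2}/N_{1t}^{1/2}\right) \approx \Phi(c_{1t} + Z_{\alpha/2}) - \Phi(c_{1t} - Z_{\alpha/2}), \tag{11}$$

where

$$c_{1t} = \frac{(1-\tau_1)\Delta_t(\lambda_{1t}N\xi_M)^{1/2}}{\nu_{1t}}.$$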

The coverage probability, as a function of c1t, strictly increases in (−∞, 0] and decreases in [0, ∞). Therefore, when the true marker status can be correctly classified, (11) gives a coverage probability of approximately 100(1 − α)%. Otherwise, the asymptotic coverage probability of the naive confidence interval in (2) is always smaller than the nominal level of 1 − α. Indeed, the coverage can be substantially reduced; a particularly interesting observation is that the coverage probability approaches zero as the sample size N gets larger.
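The decay of the naive coverage with N can be illustrated numerically from Eq. (11). The sketch below assumes a common within-arm standard deviation σ, so that ν1t² = σ² + τ1(1 − τ1)Δt² (the simplification used later in Section 4.1); the function name and all input values (Δt = 0.5, τ1 = 0.36/0.42, ξM = 0.42) are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def naive_mean_ci_coverage(n_total, delta_t, ppv, xi_m,
                           sigma=1.0, lam_1t=0.5, alpha=0.05):
    """Approximate coverage of the naive CI for mu_1t, following Eq. (11)."""
    z = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # Z_{alpha/2}
    # ASSUMPTION: sigma_1t = sigma_0t = sigma, so nu_1t^2 has the closed form below.
    nu_1t = (sigma ** 2 + ppv * (1.0 - ppv) * delta_t ** 2) ** 0.5
    c_1t = (1.0 - ppv) * delta_t * (lam_1t * n_total * xi_m) ** 0.5 / nu_1t
    return STD_NORMAL.cdf(c_1t + z) - STD_NORMAL.cdf(c_1t - z)


# Hypothetical inputs: tau_1 and xi_M for 90% sensitivity/specificity, 40% prevalence.
sample_sizes = (200, 800, 3200)
coverages = [naive_mean_ci_coverage(n, 0.5, 0.36 / 0.42, 0.42) for n in sample_sizes]
for n, cov in zip(sample_sizes, coverages):
    print(f"N={n}: naive coverage ~ {cov:.3f}")
```

When Δt = 0 the bias term c1t vanishes and the nominal coverage is recovered regardless of N.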

3.2. Correction for Classification Error

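From (10), unbiased estimators $\mu_{gt}^*$ of $\mu_{gt}$ can be derived by solving the equations:

$$\begin{cases}\hat\mu_{1t} = \tau_1\mu_{1t} + (1-\tau_1)\mu_{0t},\\ \hat\mu_{0t} = (1-\tau_0)\mu_{1t} + \tau_0\mu_{0t}.\end{cases}$$

We have

$$\mu_{1t}^* = \frac{\tau_0\hat\mu_{1t} - (1-\tau_1)\hat\mu_{0t}}{\tau_1\tau_0 - (1-\tau_1)(1-\tau_0)}, \qquad \mu_{0t}^* = \frac{\tau_1\hat\mu_{0t} - (1-\tau_0)\hat\mu_{1t}}{\tau_1\tau_0 - (1-\tau_1)(1-\tau_0)}.$$

It follows from (8) and (9) that the variances of the unbiased estimators are

$$\mathrm{VAR}(\mu_{1t}^*) \approx \frac{\tau_0^2\nu_{1t}^2/(\lambda_{1t}\xi_M N) + (1-\tau_1)^2\nu_{0t}^2/\{\lambda_{0t}(1-\xi_M)N\}}{\{\tau_1\tau_0 - (1-\tau_1)(1-\tau_0)\}^2}$$

and

$$\mathrm{VAR}(\mu_{0t}^*) \approx \frac{\tau_1^2\nu_{0t}^2/\{\lambda_{0t}(1-\xi_M)N\} + (1-\tau_0)^2\nu_{1t}^2/(\lambda_{1t}\xi_M N)}{\{\tau_1\tau_0 - (1-\tau_1)(1-\tau_0)\}^2}.$$

Recall that $E(\hat\sigma_{gt}^2) = \nu_{gt}^2$, where the $\hat\sigma_{gt}^2$ are defined in Section 2. Consistent estimates $\widehat{\mathrm{VAR}}(\mu_{1t}^*)$ of $\mathrm{VAR}(\mu_{1t}^*)$ and $\widehat{\mathrm{VAR}}(\mu_{0t}^*)$ of $\mathrm{VAR}(\mu_{0t}^*)$ are given by

$$\frac{\tau_0^2\hat\sigma_{1t}^2/N_{1t} + (1-\tau_1)^2\hat\sigma_{0t}^2/N_{0t}}{\{\tau_1\tau_0 - (1-\tau_1)(1-\tau_0)\}^2}, \qquad \frac{\tau_1^2\hat\sigma_{0t}^2/N_{0t} + (1-\tau_0)^2\hat\sigma_{1t}^2/N_{1t}}{\{\tau_1\tau_0 - (1-\tau_1)(1-\tau_0)\}^2},$$

respectively.

Note that in large samples $(\mu_{gt}^* - \mu_{gt})/\{\widehat{\mathrm{VAR}}(\mu_{gt}^*)\}^{1/2} \sim N(0,1)$. Therefore, if $\lambda_{gt} \to$ constant when $N \to \infty$, then the error-adjusted confidence interval of $\mu_{gt}$ with limits $\mu_{gt}^* \pm Z_{\alpha/2}\{\widehat{\mathrm{VAR}}(\mu_{gt}^*)\}^{1/2}$ has asymptotic coverage probability of $1 - \alpha$.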
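The correction of Section 3.2 amounts to inverting a 2×2 linear system. The following sketch, with hypothetical function names and made-up inputs, applies the closed-form solution and its plug-in variance estimates, then checks that feeding in the expected naive means recovers the true means.

```python
def corrected_means(mu_hat_1t, mu_hat_0t, ppv, npv):
    """Error-adjusted estimators mu*_1t and mu*_0t from the naive means."""
    det = ppv * npv - (1.0 - ppv) * (1.0 - npv)   # equals tau_1 + tau_0 - 1
    mu_star_1t = (npv * mu_hat_1t - (1.0 - ppv) * mu_hat_0t) / det
    mu_star_0t = (ppv * mu_hat_0t - (1.0 - npv) * mu_hat_1t) / det
    return mu_star_1t, mu_star_0t


def corrected_mean_variances(var_hat_1t, var_hat_0t, n_1t, n_0t, ppv, npv):
    """Plug-in variance estimates for mu*_1t and mu*_0t."""
    det_sq = (ppv * npv - (1.0 - ppv) * (1.0 - npv)) ** 2
    v1 = (npv ** 2 * var_hat_1t / n_1t + (1.0 - ppv) ** 2 * var_hat_0t / n_0t) / det_sq
    v0 = (ppv ** 2 * var_hat_0t / n_0t + (1.0 - npv) ** 2 * var_hat_1t / n_1t) / det_sq
    return v1, v0


# Hypothetical true means and predictive values, used only for a round-trip check.
true_mu_1t, true_mu_0t = 2.0, 1.0
ppv, npv = 0.36 / 0.42, 0.54 / 0.58

# Expected values of the naive means under misclassification (zeta_1t, zeta_0t).
zeta_1t = ppv * true_mu_1t + (1.0 - ppv) * true_mu_0t
zeta_0t = npv * true_mu_0t + (1.0 - npv) * true_mu_1t

recovered = corrected_means(zeta_1t, zeta_0t, ppv, npv)
variances = corrected_mean_variances(1.1, 1.05, n_1t=42, n_0t=58, ppv=ppv, npv=npv)
print(recovered, variances)
```

The variance estimates are inflated relative to the naive ones by the factor 1/(τ1 + τ0 − 1)², which is the price paid for removing the bias.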

4. Inference on Marker-Specific Treatment Differences

4.1. Effects of Misclassification

We confine our attention to the marker positive group G = 1. The marker negative group can be dealt with similarly. Consider testing the null hypothesis H01 based on the statistics in (3). Taking misclassification into consideration, we have

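$$E(\hat\delta_1) = \zeta_{11} - \zeta_{10} = \tau_1\delta_1 + (1-\tau_1)\delta_0, \qquad \mathrm{VAR}(\hat\delta_1) \approx \frac{\nu_{11}^2}{\lambda_{11}\xi_M N} + \frac{\nu_{10}^2}{\lambda_{10}\xi_M N}. \tag{12}$$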

In large sample, δ̂1 asymptotically follows a normal distribution. Note that, under the simultaneous null hypothesis H0 : δ1 = δ0 = 0, E(δ̂1) = 0. The actual type I error rate is then given by

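$$\Pr(|\hat\delta_1| > s_1 Z_{\alpha/2}) = \Pr\left\{\frac{|\hat\delta_1|}{\left(\frac{\nu_{11}^2}{\lambda_{11}\xi_M N} + \frac{\nu_{10}^2}{\lambda_{10}\xi_M N}\right)^{1/2}} > \frac{s_1 Z_{\alpha/2}}{\left(\frac{\nu_{11}^2}{\lambda_{11}\xi_M N} + \frac{\nu_{10}^2}{\lambda_{10}\xi_M N}\right)^{1/2}}\right\} \approx 2\{1 - \Phi(Z_{\alpha/2})\} = \alpha, \tag{13}$$

utilizing the fact that $s_1^2$ defined in (4) is a consistent estimate of $\nu_{11}^2/(\lambda_{11}\xi_M N) + \nu_{10}^2/(\lambda_{10}\xi_M N)$.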

Therefore, under simultaneous null hypothesis H0, the naive tests maintain the type I error at the nominal level, regardless of the marker misclassification. However, unlike the cases when there is no classification error, the type I error rate of the test for the individual hypothesis H01 : δ1 = 0 depends on δ0, and thus is no longer controlled at the nominal level. Indeed, the power of the test at δ1 > 0 is given by

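$$\Phi\left\{\frac{\tau_1\delta_1 + (1-\tau_1)\delta_0}{\left(\frac{\nu_{11}^2}{\lambda_{11}\xi_M N} + \frac{\nu_{10}^2}{\lambda_{10}\xi_M N}\right)^{1/2}} - Z_{\alpha/2}\right\} + \Phi\left\{-\frac{\tau_1\delta_1 + (1-\tau_1)\delta_0}{\left(\frac{\nu_{11}^2}{\lambda_{11}\xi_M N} + \frac{\nu_{10}^2}{\lambda_{10}\xi_M N}\right)^{1/2}} - Z_{\alpha/2}\right\} \tag{14}$$

as compared to

$$\Phi\left\{\frac{\delta_1}{\left(\frac{\sigma_{11}^2}{\lambda_{11}\xi_G N} + \frac{\sigma_{10}^2}{\lambda_{10}\xi_G N}\right)^{1/2}} - Z_{\alpha/2}\right\},$$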

when there is no classification error.

The type I error rate follows by setting δ1 = 0 and is given by

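$$\Phi\left\{\frac{(1-\tau_1)\delta_0}{\left(\frac{\nu_{11}^2}{\lambda_{11}\xi_M N} + \frac{\nu_{10}^2}{\lambda_{10}\xi_M N}\right)^{1/2}} - Z_{\alpha/2}\right\},$$

which can be substantially inflated, and indeed approaches 1 when N → ∞ and δ0 > 0.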
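The inflation can be made concrete by evaluating the rejection probability of the naive test of H01 when δ1 = 0 but δ0 ≠ 0. This sketch uses both tail terms of Eq. (14) with δ1 = 0 and treats ν11 and ν10 as known inputs; the function name and the values δ0 = 0.5, τ1 = 0.36/0.42, ξM = 0.42 are assumptions for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def naive_h01_type1_error(n_total, delta_0, ppv, xi_m,
                          nu_11=1.0, nu_10=1.0,
                          lam_11=0.5, lam_10=0.5, alpha=0.05):
    """Rejection probability of the naive test of H01 when delta_1 = 0,
    obtained from Eq. (14) with delta_1 set to zero (both tails kept)."""
    z = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    # Approximate standard error of delta_hat_1, Eq. (12).
    se = (nu_11 ** 2 / (lam_11 * xi_m * n_total)
          + nu_10 ** 2 / (lam_10 * xi_m * n_total)) ** 0.5
    shift = (1.0 - ppv) * delta_0 / se   # standardized bias of delta_hat_1
    return STD_NORMAL.cdf(shift - z) + STD_NORMAL.cdf(-shift - z)


# Hypothetical scenario: treatment works only in (true) marker-negative patients.
sample_sizes = (200, 2000, 20000, 1000000)
rates = [naive_h01_type1_error(n, 0.5, 0.36 / 0.42, 0.42) for n in sample_sizes]
for n, rate in zip(sample_sizes, rates):
    print(f"N={n}: type I error of naive test ~ {rate:.3f}")
```

With δ0 = 0 the function returns the nominal α, matching the statement that the simultaneous null is still controlled.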

Reduction in power due to misclassification may also be sizable. The loss of power is attributable to the following observations. First, if we assume that marker-positive patients benefit more from the treatment than marker-negative patients, that is, δ1 > δ0, then δ1 > τ1δ1 + (1 − τ1)δ0. Second, assume that σ1t = σ0t = σt, that is, the variation of the outcomes in the same treatment arm is not affected by the marker status. Then from (6) and (7) we have

$$\nu_{1t}^2 = \sigma_t^2 + \tau_1(1-\tau_1)\Delta_t^2 > \sigma_t^2, \qquad \nu_{0t}^2 = \sigma_t^2 + \tau_0(1-\tau_0)\Delta_t^2 > \sigma_t^2$$

if Δt ≠ 0. Third, it is possible that τ1δ1 + (1 − τ1)δ0 ≈ 0, which may occur when only patients with marker-positive status benefit from the treatment, that is, δ1 > 0 > δ0.

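The classification error can also substantially affect the coverage probability of the naive confidence interval $\hat\delta_1 \pm s_1 Z_{\alpha/2}$. Similar to the derivation of (14), we obtain

$$\Pr(\hat\delta_1 - s_1 Z_{\alpha/2} \le \delta_1 \le \hat\delta_1 + s_1 Z_{\alpha/2}) = \Phi(\Upsilon_1 + Z_{\alpha/2}) - \Phi(\Upsilon_1 - Z_{\alpha/2}),$$

where

$$\Upsilon_1 = \frac{(1-\tau_1)\gamma}{\left(\frac{\nu_{11}^2}{\lambda_{11}\xi_M N} + \frac{\nu_{10}^2}{\lambda_{10}\xi_M N}\right)^{1/2}}.$$

Again, in the presence of classification error, the coverage probability is always smaller, and often substantially so, than the nominal level of 1 − α; it approaches zero if N → ∞.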

4.2. Correction for Classification Error

Similar to (12), we can show that

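$$E(\hat\delta_0) = (1-\tau_0)\delta_1 + \tau_0\delta_0, \qquad \mathrm{VAR}(\hat\delta_0) \approx \frac{\nu_{01}^2}{\lambda_{01}(1-\xi_M)N} + \frac{\nu_{00}^2}{\lambda_{00}(1-\xi_M)N}.$$

Therefore, unbiased estimates $\delta_g^*$ of $\delta_g$ can be obtained by solving the following equations:

$$\begin{cases}\hat\delta_1 = \tau_1\delta_1 + (1-\tau_1)\delta_0,\\ \hat\delta_0 = (1-\tau_0)\delta_1 + \tau_0\delta_0.\end{cases}$$

It follows that

$$\delta_1^* = \frac{\tau_0\hat\delta_1 - (1-\tau_1)\hat\delta_0}{\tau_1\tau_0 - (1-\tau_1)(1-\tau_0)}, \qquad \delta_0^* = \frac{\tau_1\hat\delta_0 - (1-\tau_0)\hat\delta_1}{\tau_1\tau_0 - (1-\tau_1)(1-\tau_0)}.$$

The variance $\mathrm{VAR}(\delta_1^*)$ of the unbiased estimator $\delta_1^*$ is approximately

$$\frac{\tau_0^2\{\nu_{11}^2/(\lambda_{11}\xi_M N) + \nu_{10}^2/(\lambda_{10}\xi_M N)\} + (1-\tau_1)^2[\nu_{01}^2/\{\lambda_{01}(1-\xi_M)N\} + \nu_{00}^2/\{\lambda_{00}(1-\xi_M)N\}]}{\{\tau_1\tau_0 - (1-\tau_1)(1-\tau_0)\}^2},$$

which can be estimated consistently by

$$\widehat{\mathrm{VAR}}(\delta_1^*) = \frac{\tau_0^2(\hat\sigma_{11}^2/N_{11} + \hat\sigma_{10}^2/N_{10}) + (1-\tau_1)^2(\hat\sigma_{01}^2/N_{01} + \hat\sigma_{00}^2/N_{00})}{\{\tau_1\tau_0 - (1-\tau_1)(1-\tau_0)\}^2}.$$

Note that in large samples $(\delta_1^* - \delta_1)/\{\widehat{\mathrm{VAR}}(\delta_1^*)\}^{1/2} \sim N(0,1)$. Therefore, the error-adjusted confidence interval of $\delta_1$ with limits $\delta_1^* \pm Z_{\alpha/2}\{\widehat{\mathrm{VAR}}(\delta_1^*)\}^{1/2}$ has asymptotic coverage probability of $1 - \alpha$. Furthermore, the error-adjusted test that rejects $H_{01}: \delta_1 = 0$ if $|\delta_1^*|/\{\widehat{\mathrm{VAR}}(\delta_1^*)\}^{1/2} > Z_{\alpha/2}$ has type I error approximately $\alpha$, regardless of the value of $\delta_0$. The power is given by $\Phi[\delta_1/\{\mathrm{VAR}(\delta_1^*)\}^{1/2} - Z_{\alpha/2}]$.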

5. Inference on Treatment-Specific Marker Effects

5.1. Effects of Misclassification

Consider the naive test procedure given in Section 2. Taking the classification errors into account we have

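$$E(\hat\Delta_t) = (\tau_0+\tau_1-1)\Delta_t, \qquad \mathrm{VAR}(\hat\Delta_t) \approx \frac{\nu_{1t}^2}{\lambda_{1t}\xi_M N} + \frac{\nu_{0t}^2}{\lambda_{0t}(1-\xi_M)N}. \tag{15}$$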

Therefore Δ̂t is no longer unbiased for Δt. Indeed, it always underestimates Δt if Δt > 0 and overestimates Δt if Δt < 0. In large sample, Δ̂t asymptotically follows a normal distribution. Similar to the derivations of (13) and (14) we conclude that the naive test asymptotically maintains the type I error at the nominal level, and the power of the test at some Δt > 0 is given by

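$$\Pr(|\hat\Delta_t| > s_t Z_{\alpha/2}) \approx \Phi\left[\frac{(\tau_0+\tau_1-1)\Delta_t}{\left\{\frac{\nu_{1t}^2}{\lambda_{1t}\xi_M N} + \frac{\nu_{0t}^2}{\lambda_{0t}(1-\xi_M)N}\right\}^{1/2}} - Z_{\alpha/2}\right],$$

which can be substantially smaller than

$$\Phi\left[\frac{\Delta_t}{\left\{\frac{\sigma_{1t}^2}{\lambda_{1t}\xi_G N} + \frac{\sigma_{0t}^2}{\lambda_{0t}(1-\xi_G)N}\right\}^{1/2}} - Z_{\alpha/2}\right],$$

the power when there is no classification error.

Furthermore, the coverage probability of the naive confidence interval $\hat\Delta_t \pm s_t Z_{\alpha/2}$ of $\Delta_t$ is given by

$$\Pr(\hat\Delta_t - s_t Z_{\alpha/2} \le \Delta_t \le \hat\Delta_t + s_t Z_{\alpha/2}) = \Phi(\Upsilon_t + Z_{\alpha/2}) - \Phi(\Upsilon_t - Z_{\alpha/2}) \le 1 - \alpha \quad (\text{and} \to 0 \text{ if } N \to \infty),$$

where

$$\Upsilon_t = \frac{(2-\tau_0-\tau_1)\Delta_t}{\left\{\frac{\nu_{1t}^2}{\lambda_{1t}\xi_M N} + \frac{\nu_{0t}^2}{\lambda_{0t}(1-\xi_M)N}\right\}^{1/2}}.$$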

5.2. Correction for Classification Error

Correction for misclassification follows from the fact that

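$$\Delta_t^* = \frac{\hat\Delta_t}{\tau_0+\tau_1-1}$$

is an unbiased estimator of $\Delta_t$. The variance and its consistent estimator are given respectively by

$$\mathrm{VAR}(\Delta_t^*) \approx \frac{\nu_{1t}^2/(\lambda_{1t}\xi_M N) + \nu_{0t}^2/\{\lambda_{0t}(1-\xi_M)N\}}{(\tau_0+\tau_1-1)^2}, \qquad \widehat{\mathrm{VAR}}(\Delta_t^*) = \frac{\hat\sigma_{1t}^2/N_{1t} + \hat\sigma_{0t}^2/N_{0t}}{(\tau_0+\tau_1-1)^2}.$$

In large samples $(\Delta_t^* - \Delta_t)/\{\widehat{\mathrm{VAR}}(\Delta_t^*)\}^{1/2} \sim N(0,1)$. Assume that $\lambda_{gt} \to$ constant when $N \to \infty$. Then the error-adjusted confidence interval of $\Delta_t$ with limits $\Delta_t^* \pm Z_{\alpha/2}\{\widehat{\mathrm{VAR}}(\Delta_t^*)\}^{1/2}$ has asymptotic coverage probability of $1 - \alpha$. The error-adjusted test that rejects the null hypothesis $\Delta_t = 0$ if $|\Delta_t^*|/\{\widehat{\mathrm{VAR}}(\Delta_t^*)\}^{1/2} > Z_{\alpha/2}$ is equivalent to the naive test.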

6. Inference on Marker-Treatment Interaction

6.1. Effects of Misclassification

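Recall that the marker-treatment interaction effect is measured by $\gamma = \Delta_1 - \Delta_0$. It follows from (15) that the naive estimate of the interaction, $\hat\gamma = \hat\Delta_1 - \hat\Delta_0$, has mean and variance given respectively by $E(\hat\gamma) = (\tau_0+\tau_1-1)\gamma$ and $\mathrm{VAR}(\hat\gamma) \approx \theta_1^2/N$, where

$$\theta_1^2 = \frac{\nu_{10}^2}{\lambda_{10}\xi_M} + \frac{\nu_{00}^2}{\lambda_{00}(1-\xi_M)} + \frac{\nu_{11}^2}{\lambda_{11}\xi_M} + \frac{\nu_{01}^2}{\lambda_{01}(1-\xi_M)}.$$

Therefore the naive estimator of the marker-treatment interaction is biased and under-(over-)estimates the interaction if $\gamma > (<)\,0$. The naive test for interaction rejects the null hypothesis $H_0: \gamma = 0$ if

$$|\hat\gamma|/(s_0^2 + s_1^2)^{1/2} > Z_{\alpha/2}.$$

The power of the test at some $\gamma > 0$ is given by

$$\Pr\{|\hat\gamma| > Z_{\alpha/2}(s_0^2+s_1^2)^{1/2}\} \approx \Phi\left\{\frac{(\tau_0+\tau_1-1)\gamma N^{1/2}}{\theta_1} - Z_{\alpha/2}\right\}. \tag{16}$$

It follows from (16) that the naive test maintains the type I error rate at the nominal level of $\alpha$, regardless of the classification errors. However, the power of the test can be substantially adversely affected as compared to the power of the test with no misclassification, that is, $\Phi(\gamma N^{1/2}/\theta_0 - Z_{\alpha/2})$, where $\theta_0$ is such that

$$\theta_0^2 = \frac{\sigma_{10}^2}{\lambda_{10}\xi_G} + \frac{\sigma_{00}^2}{\lambda_{00}(1-\xi_G)} + \frac{\sigma_{11}^2}{\lambda_{11}\xi_G} + \frac{\sigma_{01}^2}{\lambda_{01}(1-\xi_G)}.$$

The coverage probability of the naive confidence interval $\hat\gamma \pm Z_{\alpha/2}(s_0^2+s_1^2)^{1/2}$ of $\gamma$ is given by

$$\Pr\{\hat\gamma - Z_{\alpha/2}(s_0^2+s_1^2)^{1/2} \le \gamma \le \hat\gamma + Z_{\alpha/2}(s_0^2+s_1^2)^{1/2}\} = \Phi\left\{\frac{(2-\tau_0-\tau_1)\gamma N^{1/2}}{\theta_1} + Z_{\alpha/2}\right\} - \Phi\left\{\frac{(2-\tau_0-\tau_1)\gamma N^{1/2}}{\theta_1} - Z_{\alpha/2}\right\},$$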

which can be substantially lower (approaching 0 if N → ∞) than the nominal level of 1 − α.

For α = 0.05, σgt = 1, λmt = 1/2, γ = 0.936, and selected values of π0, π1, ξG, and N, Table 1 presents the coverage probability of the naive confidence interval and the power of the naive test. In all cases the actual coverage probability is smaller than the nominal level of 0.95, in several cases by more than 25%. The actual power is also substantially lower than that with no classification errors, with reductions of up to about 50%. The coverage probability and the power increase as the classification accuracy improves. An increased sample size yields increased power but decreased coverage probability. For example, with 90% sensitivity, 90% specificity, and 40% marker prevalence, the naive coverage probability is 0.90 and the power is 0.71 if the sample size is N = 200. These two measures change to 0.84 and 0.95, respectively, when the sample size doubles.

Table 1.

Coverage probability of the naive confidence interval and power of the naive test for marker-treatment interaction (entries are coverage/power).

(N = 200, ξG = 0.4)†
             π1 = 0.80    0.85         0.90         0.95
π0 = 0.80    0.74/0.46    0.78/0.52    0.82/0.58    0.85/0.64
     0.85    0.80/0.53    0.83/0.59    0.86/0.65    0.89/0.70
     0.90    0.85/0.61    0.88/0.66    0.90/0.71    0.91/0.76
     0.95    0.90/0.69    0.91/0.73    0.93/0.78    0.94/0.82

(N = 200, ξG = 0.6)†
             π1 = 0.80    0.85         0.90         0.95
π0 = 0.80    0.74/0.46    0.80/0.53    0.85/0.61    0.90/0.69
     0.85    0.78/0.52    0.83/0.59    0.88/0.66    0.91/0.73
     0.90    0.82/0.58    0.86/0.65    0.90/0.71    0.93/0.78
     0.95    0.85/0.64    0.89/0.70    0.91/0.76    0.94/0.82

(N = 400, ξG = 0.4)††
             π1 = 0.80    0.85         0.90         0.95
π0 = 0.80    0.54/0.75    0.61/0.81    0.68/0.86    0.75/0.91
     0.85    0.65/0.82    0.71/0.87    0.76/0.91    0.82/0.94
     0.90    0.75/0.88    0.80/0.92    0.84/0.95    0.88/0.97
     0.95    0.85/0.93    0.88/0.96    0.90/0.97    0.92/0.98

(N = 400, ξG = 0.6)††
             π1 = 0.80    0.85         0.90         0.95
π0 = 0.80    0.54/0.75    0.65/0.82    0.75/0.88    0.85/0.93
     0.85    0.61/0.81    0.71/0.87    0.80/0.92    0.88/0.96
     0.90    0.68/0.86    0.76/0.91    0.84/0.95    0.90/0.97
     0.95    0.75/0.91    0.82/0.94    0.88/0.97    0.92/0.98

† power = 0.90 if no misclassification;
†† power = 0.99 if no misclassification.
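The coverage and power formulas of Section 6.1 can be evaluated numerically. The paper does not state which arm-specific marker effects Δ0 and Δ1 were used for Table 1; the sketch below assumes Δ1 = γ and Δ0 = 0, together with a common σ so that νgt² = σ² + τg(1 − τg)Δt². Under that assumption the output appears to agree with the tabulated entries for π0 = π1 = 0.90 and ξG = 0.4, but this choice is an inference, not something the source confirms. The function name is hypothetical.

```python
from statistics import NormalDist


def naive_interaction_coverage_and_power(n_total, gamma, xi_g, sens, spec,
                                         sigma=1.0, lam=0.5, alpha=0.05):
    """Coverage of the naive CI and power of the naive test for gamma."""
    nd = NormalDist()
    z = nd.inv_cdf(1.0 - alpha / 2.0)
    xi_m = sens * xi_g + (1.0 - spec) * (1.0 - xi_g)      # Eq. (1)
    ppv = sens * xi_g / xi_m                              # tau_1, Eq. (5)
    npv = spec * (1.0 - xi_g) / (1.0 - xi_m)              # tau_0, Eq. (5)
    # ASSUMPTION (not stated in the paper): Delta_1 = gamma and Delta_0 = 0.
    marker_effect = {0: 0.0, 1: gamma}
    theta1_sq = 0.0
    for t in (0, 1):
        nu1_sq = sigma ** 2 + ppv * (1.0 - ppv) * marker_effect[t] ** 2
        nu0_sq = sigma ** 2 + npv * (1.0 - npv) * marker_effect[t] ** 2
        theta1_sq += nu1_sq / (lam * xi_m) + nu0_sq / (lam * (1.0 - xi_m))
    scale = (n_total / theta1_sq) ** 0.5                  # N^{1/2} / theta_1
    power = nd.cdf((ppv + npv - 1.0) * gamma * scale - z)  # Eq. (16)
    shift = (2.0 - ppv - npv) * gamma * scale
    coverage = nd.cdf(shift + z) - nd.cdf(shift - z)
    return coverage, power


cov_200, pow_200 = naive_interaction_coverage_and_power(200, 0.936, 0.4, 0.90, 0.90)
cov_400, pow_400 = naive_interaction_coverage_and_power(400, 0.936, 0.4, 0.90, 0.90)
print(f"N=200: coverage={cov_200:.2f}, power={pow_200:.2f}")
print(f"N=400: coverage={cov_400:.2f}, power={pow_400:.2f}")
```

Setting `sens=spec=1.0` removes the misclassification and returns the nominal coverage together with the power quoted in the table footnotes.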

6.2. Correction for Classification Error

An unbiased estimator of the interaction effect γ can be given by

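$$\gamma^* = \frac{\hat\gamma}{\tau_0+\tau_1-1}.$$

The variance and its consistent estimator are given respectively by

$$\mathrm{VAR}(\gamma^*) \approx \frac{\theta_1^2}{N(\tau_0+\tau_1-1)^2}, \qquad \widehat{\mathrm{VAR}}(\gamma^*) = \frac{\hat\theta_1^2}{N(\tau_0+\tau_1-1)^2},$$

where

$$\hat\theta_1^2 = \frac{\hat\sigma_{10}^2}{\lambda_{10}\xi_M} + \frac{\hat\sigma_{00}^2}{\lambda_{00}(1-\xi_M)} + \frac{\hat\sigma_{11}^2}{\lambda_{11}\xi_M} + \frac{\hat\sigma_{01}^2}{\lambda_{01}(1-\xi_M)}.$$

In large samples $(\gamma^* - \gamma)/\{\widehat{\mathrm{VAR}}(\gamma^*)\}^{1/2} \sim N(0,1)$. Hence, the error-adjusted confidence interval of $\gamma$ with limits $\gamma^* \pm Z_{\alpha/2}\{\widehat{\mathrm{VAR}}(\gamma^*)\}^{1/2}$ has asymptotic coverage probability of $1 - \alpha$. The error-adjusted test that rejects $H_0: \gamma = 0$ if $|\gamma^*|/\{\widehat{\mathrm{VAR}}(\gamma^*)\}^{1/2} > Z_{\alpha/2}$ is equivalent to the naive test.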

6.3. Sample Size Adjustment

For the stratified biomarker design, the sample size N needs to be sufficiently large to ensure adequate power of 1 − β to detect a meaningful marker-treatment interaction γ. From (16) the sample size is given by

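$$N = \frac{(Z_{\alpha/2}+Z_\beta)^2\,\theta_1^2}{(\tau_0+\tau_1-1)^2\gamma^2}. \tag{17}$$

On the other hand, in the absence of misclassification the required sample size is

$$N = \frac{(Z_{\alpha/2}+Z_\beta)^2\,\theta_0^2}{\gamma^2}.$$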

It follows from (1) and (5) that

$$\frac{1}{\tau_0+\tau_1-1} = \frac{\xi_M(1-\xi_M)}{\xi_G(1-\xi_G)(\pi_0+\pi_1-1)} \ge 1.$$

Furthermore, as pointed out in Section 4.1, the variance ν is usually larger than its counterpart σ. Therefore, a much larger sample size may be required to achieve the desirable power when classification errors exist.

Under the same specifications of parameters’ values (except for N) used for Table 1, Table 2 presents the actual sample size needed and its ratio to the sample size when there is no classification error. It shows that the sample size can be more than twice that required when there is no misclassification of the marker status.

Table 2.

Required sample size and its ratio to the sample size (= 200) when there is no misclassification (entries are sample size/ratio).

ξG = 0.4
             π1 = 0.80    0.85        0.90        0.95
π0 = 0.80    612/3.06     522/2.61    449/2.24    388/1.94
     0.85    508/2.54     442/2.21    386/1.93    339/1.69
     0.90    423/2.11     373/1.87    331/1.65    295/1.47
     0.95    350/1.75     314/1.57    283/1.41    255/1.27

ξG = 0.6
             π1 = 0.80    0.85        0.90        0.95
π0 = 0.80    612/3.06     508/2.54    423/2.11    350/1.75
     0.85    522/2.61     442/2.21    373/1.87    314/1.57
     0.90    449/2.24     386/1.93    331/1.65    283/1.41
     0.95    388/1.94     339/1.69    295/1.47    255/1.27
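Equation (17) can be evaluated in the same way. As with Table 1, the arm-specific marker effects behind Table 2 are not stated, so this sketch again assumes Δ1 = γ and Δ0 = 0 with a common σ; the target power of 0.90 is inferred from the Table 1 footnote rather than given explicitly for Table 2. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def interaction_sample_size(gamma, xi_g, sens, spec, power=0.90,
                            sigma=1.0, lam=0.5, alpha=0.05):
    """Total sample size from Eq. (17) for detecting interaction gamma."""
    nd = NormalDist()
    z_sum = nd.inv_cdf(1.0 - alpha / 2.0) + nd.inv_cdf(power)  # Z_{alpha/2} + Z_beta
    xi_m = sens * xi_g + (1.0 - spec) * (1.0 - xi_g)           # Eq. (1)
    ppv = sens * xi_g / xi_m                                   # tau_1
    npv = spec * (1.0 - xi_g) / (1.0 - xi_m)                   # tau_0
    # ASSUMPTION (not stated in the paper): Delta_1 = gamma and Delta_0 = 0.
    marker_effect = {0: 0.0, 1: gamma}
    theta1_sq = 0.0
    for t in (0, 1):
        nu1_sq = sigma ** 2 + ppv * (1.0 - ppv) * marker_effect[t] ** 2
        nu0_sq = sigma ** 2 + npv * (1.0 - npv) * marker_effect[t] ** 2
        theta1_sq += nu1_sq / (lam * xi_m) + nu0_sq / (lam * (1.0 - xi_m))
    return z_sum ** 2 * theta1_sq / ((ppv + npv - 1.0) ** 2 * gamma ** 2)


n_no_error = interaction_sample_size(0.936, 0.4, 1.0, 1.0)
n_90_90 = interaction_sample_size(0.936, 0.4, 0.90, 0.90)
n_80_80 = interaction_sample_size(0.936, 0.4, 0.80, 0.80)
for label, n in (("no error", n_no_error), ("90/90", n_90_90), ("80/80", n_80_80)):
    print(f"{label}: N = {math.ceil(n)}, ratio = {n / n_no_error:.2f}")
```

Rounding up to the next integer is a design choice made here; the paper does not say how the tabulated values were rounded.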

6.4. Example

We sought to design a phase III trial in which patients with metastatic renal cell carcinoma are randomized to sunitinib (standard of care) or sunitinib plus an experimental drug, stratified by IL-6 status. The primary endpoint is the progression-free survival (PFS) rate at 6 months. IL-6 is a continuous variable, with high IL-6 status defined as a value greater than or equal to 13 pg/mL; this cut-point is based on the observed median reported in one study [13]. Based on observed data, the 6-month PFS rate in low and high IL-6 patients treated with sunitinib is 48% and 18%, respectively. The hypothesized rate in low and high IL-6 patients treated with the experimental drug is 66% and 59%, respectively. The assay has 95% sensitivity and 90% specificity. Assume equal allocation and a 40% prevalence of high IL-6, and assume further that a power of 0.85 is desired to detect a marker-treatment interaction effect of γ = (0.59 − 0.18) − (0.66 − 0.48) = 0.23 in PFS rates. Using equation (17), the required sample size is about 1,020, or 255 patients in each IL-6-by-treatment stratum. If, on the other hand, the prevalence of high IL-6 status is 30%, then the required sample size is larger, about 1,244, or 311 patients in each stratum. In contrast, the sample sizes are about 177 and 202 per stratum, respectively, for the two scenarios when there are no classification errors. Note that, similar to the comparison of two independent proportions, in the calculation the stratum-specific variances $\sigma_{gt}^2$ are set to be $\bar\mu(1 - \bar\mu)$, where $\bar\mu$ is the average of the stratum-specific rates, that is,

$$\bar\mu = (\mu_{11} + \mu_{01} + \mu_{10} + \mu_{00})/4 \;\; (= (0.59 + 0.66 + 0.18 + 0.48)/4 = 0.4775),$$

yielding $\sigma_{gt}^2 \approx 0.25$.
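The example can be retraced with Eq. (17). This is a sketch under stated assumptions: a two-sided α = 0.05 (the example does not state the significance level), the common variance μ̄(1 − μ̄) described above, and νgt² = σ² + τg(1 − τg)Δt² with Δt taken from the hypothesized rates. The function and variable names are hypothetical. Under these assumptions the totals come out close to, though not exactly at, the figures quoted in the text.

```python
from statistics import NormalDist


def example_sample_size(rates, xi_g, sens, spec, power=0.85, alpha=0.05, lam=0.5):
    """Total N from Eq. (17); rates[(g, t)] is the PFS rate for true marker
    status g (1 = high IL-6) and arm t (1 = experimental)."""
    nd = NormalDist()
    z_sum = nd.inv_cdf(1.0 - alpha / 2.0) + nd.inv_cdf(power)
    gamma = (rates[(1, 1)] - rates[(1, 0)]) - (rates[(0, 1)] - rates[(0, 0)])
    mu_bar = sum(rates.values()) / 4.0
    sigma_sq = mu_bar * (1.0 - mu_bar)                   # common variance
    xi_m = sens * xi_g + (1.0 - spec) * (1.0 - xi_g)     # Eq. (1)
    ppv = sens * xi_g / xi_m                             # tau_1
    npv = spec * (1.0 - xi_g) / (1.0 - xi_m)             # tau_0
    theta1_sq = 0.0
    for t in (0, 1):
        delta_t = rates[(1, t)] - rates[(0, t)]          # marker effect in arm t
        nu1_sq = sigma_sq + ppv * (1.0 - ppv) * delta_t ** 2
        nu0_sq = sigma_sq + npv * (1.0 - npv) * delta_t ** 2
        theta1_sq += nu1_sq / (lam * xi_m) + nu0_sq / (lam * (1.0 - xi_m))
    return z_sum ** 2 * theta1_sq / ((ppv + npv - 1.0) ** 2 * gamma ** 2)


pfs_rates = {(1, 1): 0.59, (0, 1): 0.66, (1, 0): 0.18, (0, 0): 0.48}
n_prev40 = example_sample_size(pfs_rates, 0.40, 0.95, 0.90)
n_prev30 = example_sample_size(pfs_rates, 0.30, 0.95, 0.90)
n_prev40_perfect = example_sample_size(pfs_rates, 0.40, 1.0, 1.0)
n_prev30_perfect = example_sample_size(pfs_rates, 0.30, 1.0, 1.0)
print(round(n_prev40), round(n_prev30), round(n_prev40_perfect), round(n_prev30_perfect))
```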

7. Discussion

In the present paper we demonstrated both analytically and numerically that the misclassified biomarker status can have profound negative impact on various inference problems in a stratified biomarker trial. The methods developed are based on asymptotic theory and are suitable for most biomarker stratified trials that usually require relatively large sample sizes; however, caution needs to be taken for small-size trials.

It is worth noting that, as a result of the randomization, the naive test for marker-treatment interaction maintains the required type I error rates, but suffers considerably from loss of power due to misclassification, which in turn, results in larger sample sizes required for the trial.

Our investigation assumes that the marker’s prevalence ξG, sensitivity π1, and specificity π0 are all known. When the N patients are a representative sample of the targeted population, $\hat\xi_M = \sum_{i=1}^{N} M_i/N$ is an unbiased estimate of ξM. Then from (1), it follows that

$$\hat\xi_G = \frac{\hat\xi_M - (1-\pi_0)}{\pi_0+\pi_1-1}$$

is an unbiased estimate of the marker’s prevalence ξG. If sensitivity π1 and specificity π0 are unknown, then a preliminary study can be conducted to estimate π0 and π1.
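The prevalence estimate is a one-line inversion of Eq. (1). The sketch below uses a hypothetical function name and made-up data (42 observed positives among 100 patients, with π1 = π0 = 0.90) purely for illustration.

```python
def estimate_true_prevalence(observed_statuses, sens, spec):
    """Estimate xi_G from observed marker calls M_i (0/1) by inverting Eq. (1)."""
    xi_m_hat = sum(observed_statuses) / len(observed_statuses)  # observed prevalence
    return (xi_m_hat - (1.0 - spec)) / (sens + spec - 1.0)


# Hypothetical data: 42 of 100 patients are called marker-positive by the assay.
observed = [1] * 42 + [0] * 58
xi_g_hat = estimate_true_prevalence(observed, sens=0.90, spec=0.90)
print(f"estimated true prevalence: {xi_g_hat:.3f}")
```

In small samples the estimate can fall outside [0, 1]; truncating it to that range is a common practical step that the paper does not discuss.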

The technical developments employed in the present paper can be readily extended to other biomarker-driven designs, for example, the biomarker enrichment strategy design in which only marker positive patients are randomized to receive treatments. However, as shown in the developments, data from all marker by treatment strata are needed to adjust for classification errors. For a review of useful biomarker based clinical designs, see, e.g. [3, 6, 18,19]. Although the choice of these various designs depends on the trial aims, the impact of biomarker misclassification can be substantial in each design, and needs further evaluations. For example, some designs involve testing multiple hypotheses concerning the various aspects of the marker-treatment effects. It is then important to investigate how the classification errors adversely affect the allocation of type I error rates and the power of the study. Such investigation is also warranted for adaptive and Bayesian biomarker designs.

Throughout, testing marker-treatment effects is formulated based on stratum-specific means, e.g. means of normal distributions or proportions of a dichotomous endpoint. The methods developed in the present paper could be generalized, with some tedious algebraic manipulations, to ordinal/categorical and longitudinal/repeated endpoints with stratum means as the primary interest. We are currently working on extending the method for time-to-event endpoints and longitudinally measured endpoints with hazards ratio and rates of change as the primary comparison, respectively. As can be expected, these types of endpoint require different and more complicated technical handling of the assumptions.

Increased advances in understanding the roles of molecular and genetic pathways in carcinogenesis are leading to the development of novel therapies that target the disease pathways. As a result of these advances, the landscape for performing clinical trials with biomarkers in cancer is evolving and becoming complex. Despite the large sample size required for the stratified biomarker design, we believe that this approach is realistic and worth it as it accounts for misclassification errors.

Acknowledgments

Research of A. Liu was supported by the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development of the National Institutes of Health. Research of S. Halabi was supported by grants R01-CA155296 and U01-CA157703.

Appendix: Some Technical Details

Proof of Eq. (1)

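$$
\begin{aligned}
\xi_M &= \Pr(M=1) \\
&= \Pr(M=1, G=1) + \Pr(M=1, G=0) \\
&= \Pr(M=1 \mid G=1)\Pr(G=1) + \Pr(M=1 \mid G=0)\Pr(G=0) \\
&= \pi_1 \xi_G + (1-\pi_0)(1-\xi_G).
\end{aligned}
$$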
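As a numerical illustration of Eq. (1), the following minimal Python sketch computes the observed marker-positive rate from the true prevalence and the classification accuracies. It reads π1 = Pr(M=1 | G=1) as the sensitivity and π0 = Pr(M=0 | G=0) as the specificity, which is what the last step of the proof implies. The function and argument names are illustrative assumptions, and the second function is simply the algebraic inversion of Eq. (1), not a method taken from the paper.

```python
def observed_positive_rate(sensitivity, specificity, true_prevalence):
    """Eq. (1): xi_M = pi_1 * xi_G + (1 - pi_0) * (1 - xi_G).

    sensitivity     -- pi_1 = Pr(M=1 | G=1)
    specificity     -- pi_0 = Pr(M=0 | G=0)
    true_prevalence -- xi_G = Pr(G=1)
    """
    return (sensitivity * true_prevalence
            + (1.0 - specificity) * (1.0 - true_prevalence))


def true_prevalence_from_observed(observed_rate, sensitivity, specificity):
    """Solve Eq. (1) for xi_G (hypothetical helper, plain algebra).

    Requires sensitivity + specificity != 1; otherwise the observed
    rate carries no information about the true prevalence.
    """
    denom = sensitivity + specificity - 1.0
    if denom == 0.0:
        raise ValueError("sensitivity + specificity must differ from 1")
    return (observed_rate + specificity - 1.0) / denom


if __name__ == "__main__":
    xi_m = observed_positive_rate(0.9, 0.8, 0.3)
    print(xi_m)
    print(true_prevalence_from_observed(xi_m, 0.9, 0.8))
```

With a perfect assay (sensitivity and specificity both equal to 1), the observed rate coincides with the true prevalence, as Eq. (1) requires.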

Proof of Eq. (11)

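$$
\begin{aligned}
&\Pr\left(\hat\mu_{1t} - \hat\sigma_{1t} Z_{\alpha/2}/N_{1t}^{1/2} \le \mu_{1t} \le \hat\mu_{1t} + \hat\sigma_{1t} Z_{\alpha/2}/N_{1t}^{1/2}\right) \\
&\quad\approx \Pr\left(\hat\mu_{1t} - \nu_{1t} Z_{\alpha/2}/N_{1t}^{1/2} \le \mu_{1t}\right) - \Pr\left(\hat\mu_{1t} + \nu_{1t} Z_{\alpha/2}/N_{1t}^{1/2} < \mu_{1t}\right) \\
&\quad= E\left\{\Phi\left(\frac{N_{1t}^{1/2}(1-\tau_1)\Delta_t}{\nu_{1t}} + Z_{\alpha/2}\right)\right\} - E\left\{\Phi\left(\frac{N_{1t}^{1/2}(1-\tau_1)\Delta_t}{\nu_{1t}} - Z_{\alpha/2}\right)\right\} \\
&\quad\approx \Phi(c_{1t} + Z_{\alpha/2}) - \Phi(c_{1t} - Z_{\alpha/2}).
\end{aligned}
$$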

Note that in the third expression, the expectation is taken with respect to the random stratum size N1t.
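To see how the final approximation in the proof of Eq. (11) behaves, the minimal Python sketch below evaluates Φ(c + Zα/2) − Φ(c − Zα/2) for a given value of c. Here c stands for the quantity c1t defined with Eq. (11) in the main text; the sketch does not compute it and expects the caller to supply it. Zα/2 is assumed to be the upper α/2 quantile of the standard normal distribution, and the function name is an illustrative assumption.

```python
from statistics import NormalDist


def naive_ci_coverage(c, alpha=0.05):
    """Approximate coverage of the unadjusted (1 - alpha) interval.

    Evaluates Phi(c + z) - Phi(c - z), where z is the upper alpha/2
    standard normal quantile and c is the standardized bias term
    (c_1t in the text), supplied by the caller.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z = std_normal.inv_cdf(1.0 - alpha / 2.0)
    return std_normal.cdf(c + z) - std_normal.cdf(c - z)


if __name__ == "__main__":
    # c = 0 corresponds to no bias from misclassification,
    # so the nominal level is recovered.
    for c in (0.0, 0.5, 1.0, 2.0):
        print(c, naive_ci_coverage(c))
```

The expression is symmetric in c and is largest at c = 0, where it equals the nominal level 1 − α. Any nonzero bias term therefore lowers the coverage of the unadjusted interval.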

References

  1. Gordon AN, Tonda M, Sun S, Rackoff W; Doxil Study 30–49 Investigators. Long-term survival advantage for women treated with pegylated liposomal doxorubicin compared with topotecan in a phase 3 randomized study of recurrent and refractory epithelial ovarian cancer. Gynecologic Oncology. 2004;95:1–8. doi: 10.1016/j.ygyno.2004.07.011.
  2. Simon R, Maitournam A. Evaluating the efficiency of targeted designs for randomized clinical trials. Clinical Cancer Research. 2004;10:6759–6763. doi: 10.1158/1078-0432.CCR-04-0496.
  3. Mandrekar SJ, Sargent DJ. Clinical trial designs for predictive biomarker validation: theoretical considerations and practical challenges. Journal of Clinical Oncology. 2009;27:4027–4034. doi: 10.1200/JCO.2009.22.3701.
  4. Freidlin B, McShane LM, Korn EL. Randomized clinical trials with biomarkers: design issues. Journal of the National Cancer Institute. 2010;102:152–160. doi: 10.1093/jnci/djp477.
  5. Joo J, Geller NL, French B, Kimmel SE, Rosenberg Y, Ellenberg JH. Prospective alpha allocation in the clarification of optimal anticoagulation through genetics (COAG) trial. Clinical Trials. 2010;7:597–604. doi: 10.1177/1740774510381285.
  6. Simon R. Clinical trials for predictive medicine: new challenges and paradigms. Clinical Trials. 2010;7:516–524. doi: 10.1177/1740774510366454.
  7. Lai TL, Lavori PW. Innovative clinical trial designs toward a 21st-century health care system. Statistics in Bioscience. 2011;3:145–168. doi: 10.1007/s12561-011-9042-5.
  8. Motzer RJ, Hutson TE, Tomczak P, Michaelson MD, Bukowski RM, Rixe O, Oudard S, Negrier S, Szczylik C, Kim ST, Chen I, Bycott PW, Baum CM, Figlin RA. Sunitinib versus interferon alfa in metastatic renal-cell carcinoma. New England Journal of Medicine. 2007;356:115–124. doi: 10.1056/NEJMoa065044.
  9. Rini BI, Halabi S, Rosenberg JE, Stadler WM, Vaena DA, Ou SS, Archer L, Atkins JN, Picus J, Czaykowski P, Dutcher J, Small EJ. Bevacizumab plus interferon-alpha versus interferon-alpha monotherapy in patients with metastatic renal cell carcinoma: results of CALGB 90206. Journal of Clinical Oncology. 2008;26:5422–5428. doi: 10.1200/JCO.2008.16.9847.
  10. Bosco EE, Wang Y, Xu H, Zilfou JT, Knudsen KE, Aronow BJ, Lowe SW, Knudsen ES. The retinoblastoma tumor suppressor modifies the therapeutic response of breast cancer. Journal of Clinical Investigation. 2007;117:218–228. doi: 10.1172/JCI28803.
  11. Sharma A, Comstock CE, Knudsen ES, Cao KH, Hess-Wilson JK, Morey LM, Barrera J, Knudsen KE. Retinoblastoma tumor suppressor status is a critical determinant of therapeutic response in prostate cancer cells. Cancer Research. 2007;67:6192–6203. doi: 10.1158/0008-5472.CAN-06-4424.
  12. Sharma A, Yeow WS, Ertel A, Coleman I, Clegg N, Thangavel C, Morrissey C, Zhang X, Comstock CE, Witkiewicz AK, Gomella L, Knudsen ES, Nelson PS, Knudsen KE. The retinoblastoma tumor suppressor controls androgen signaling and human prostate cancer progression. Journal of Clinical Investigation. 2010;120:4478–4492. doi: 10.1172/JCI44239.
  13. Tran HT, Liu Y, Zurita AJ, Lin Y, Baker-Neblett KL, Martin AM, Figlin RA, Hutson TE, Sternberg CN, Amado RG, Pandite LN, Heymach JV. Prognostic or predictive plasma cytokines and angiogenic factors for patients treated with pazopanib for metastatic renal-cell cancer: a retrospective analysis of phase 2 and phase 3 trials. Lancet Oncology. 2012;13:827–837. doi: 10.1016/S1470-2045(12)70241-3.
  14. Abecasis GR, Cherny SS, Cardon LR. The impact of genotyping error on family-based analysis of quantitative traits. European Journal of Human Genetics. 2001;9:130–134. doi: 10.1038/sj.ejhg.5200594.
  15. Hao K, Li C, Rosenow C, Wong WH. Estimation of genotype error rate using samples with pedigree information: an application on the GeneChip Mapping 10K array. Genomics. 2004;84:623–630. doi: 10.1016/j.ygeno.2004.05.003.
  16. Wang SJ, Hung HMJ, O’Neill RT. Genomic classifier for patient enrichment: misclassification and type I error issues in pharmacogenomics noninferiority trial. Statistics in Biopharmaceutical Research. 2011;3:310–319.
  17. Maitournam A, Simon R. On the efficiency of targeted clinical trials. Statistics in Medicine. 2005;24:329–339. doi: 10.1002/sim.1975.
  18. Freidlin B, Korn EL. Biomarker-adaptive clinical trial designs. Pharmacogenomics. 2010;11:1679–1682. doi: 10.2217/pgs.10.153.
  19. Gosho M, Nagashima K, Sato Y. Study designs and statistical analyses for biomarker research. Sensors. 2012;12:8966–8986. doi: 10.3390/s120708966.
