Author manuscript; available in PMC: 2016 Apr 10.
Published in final edited form as: Biometrics. 2015 Sep 8;72(1):30–38. doi: 10.1111/biom.12392

Augmented case-only designs for randomized clinical trials with failure time endpoints

James Y Dai, Xinyi Cindy Zhang, Ching-Yun Wang, Charles Kooperberg
PMCID: PMC4808468  NIHMSID: NIHMS728072  PMID: 26347982

Summary

Under suitable assumptions and by exploiting the independence between inherited genetic susceptibility and treatment assignment, the case-only design yields efficient estimates for subgroup treatment effects and gene-treatment interaction in a Cox model. However, it cannot provide estimates of the genetic main effect and baseline hazards, which are necessary to compute the absolute disease risk. For two-arm, placebo-controlled trials with rare failure time endpoints, we consider augmenting the case-only design with random samples of controls from both arms, as in the classical case-cohort sampling scheme, or with a random sample of controls from the active treatment arm only. The latter design is motivated by vaccine trials for cost-effective use of resources and specimens so that host genetics and vaccine-induced immune responses can be studied simultaneously in a bigger set of participants. We show that these designs can identify all parameters in a Cox model and that the efficient case-only estimator can be incorporated in a two-step plug-in procedure. Results in simulations and a data example suggest that incorporating case-only estimators in the classical case-cohort design improves the precision of all estimated parameters; sampling controls only in the active treatment arm attains a similar level of efficiency.

Keywords: Case-cohort design, Case-only estimator, Gene-treatment interaction, Nested case-control design, Pharmacogenetics

1. Introduction

Individuals respond differently to treatment or prevention modalities, depending on their genetic background, environmental exposures, and clinical characteristics (Charlab and Zhang, 2013). In clinical trials, there is growing interest in discovering and characterizing individual or subgroup treatment responses, supplementing primary intent-to-treat analyses. For instance, emerging pharmacogenetics research aims to identify genetic susceptibility that contributes to inter-individual variability of treatment efficacy and safety, on scales ranging from several candidate genes to the whole genome (Evans and McLeod, 2003; Weinshilboum and Wang, 2004). These studies underscore the potential of personalized medicine, and may also elucidate mechanisms of treatment effect.

To this end, this article pertains to sampling designs for characterizing the influence of pre-treatment biomarkers, for example a panel of genetic variants, on treatment effects in randomized clinical trials. Ancillary studies of this nature are increasingly common in the genomic era. However, biomarkers can be expensive to measure. To study the association of biomarkers with relatively uncommon study outcomes, including HIV infection, most cancers, and some cardiovascular events, it is cost-effective to adopt some form of outcome-dependent sampling. Popular outcome-dependent sampling schemes in cohort studies include the nested case-control design and the case-cohort design (Thomas, 1977; Prentice and Breslow, 1978; Prentice, 1986). Stratified versions of the two sampling designs, which oversample certain groups, have also been developed for better efficiency (Borgan et al., 2000; Langholz and Borgan, 1995). The properties and utility of the two designs in cohort studies have been discussed (Self and Prentice, 1988; Langholz and Thomas, 1990).

Consider a two-arm, placebo-controlled randomized prevention trial with a rare failure event. The unique feature is that there is unequivocal design-imposed independence between the treatment assignment and pre-treatment biomarkers, e.g. germline genotypes. Exploiting this independence and assuming that censoring is non-informative and independent of randomization arm, case-only methods are more efficient than the two aforementioned designs for estimating gene-treatment interactions and subgroup treatment effects on a rare disease endpoint (Vittinghoff and Bauer, 2006; Dai et al., 2012). These assumptions are better suited to phase III prevention trials where adverse effects are not a concern. Though computed from a logistic model, case-only estimators have the interpretation of hazard ratios in the Cox proportional hazards model. Sensitivity of case-only estimators to violations of these assumptions has been investigated (Vittinghoff and Bauer, 2006). In recent years, case-only methods have begun to spread in prevention trials; see, for example, trials in the Women’s Health Initiative and the HIV Vaccine Trials Network (Prentice et al., 2010; Dai et al., 2014; Li et al., 2014).

The case-only design, however, does not allow estimation of the full set of parameters in a Cox model. Specifically, neither the genotype main effect nor the cumulative baseline hazard function is estimable from cases alone. These parameters are needed to study the absolute risk of the endpoint for genotype groups in each arm. This limitation hinders interpretation and utility of the estimated gene-treatment interaction, because the estimate of individual absolute risk when treated or when not treated will inform medical counseling and guide treatment selection (Gail et al., 1989; Janes et al., 2011). On the other hand, the traditional case-cohort or nested case-control sampling provides estimates of the baseline hazard and absolute risk, but does not incorporate gene-treatment independence. Leveraging the strengths of both types of designs, we consider augmenting the case-only design to enable estimation of the full set of Cox model parameters. In particular, we focus on variations of the case-cohort design in this paper, because it is easy to plan ahead a random subcohort for time-invariant genotypes in clinical trials, and because it has the advantage of accommodating multiple outcomes that may arise in an ancillary study.

Specifically, we consider two scenarios of adding controls to the case-only design, for both of which we can incorporate the case-only estimators in two-step plug-in estimation procedures:

  • Scenario I: Classical case-cohort design with controls drawn from both arms. In essence, this is one way of adding controls to the case-only design. In this scenario we propose a novel two-step estimation procedure for the classical case-cohort design: the case-only estimator is first used to estimate the gene-treatment interaction and the treatment main effect, and these estimates are then plugged into established case-cohort estimation methods as offsets. This method allows the widely used case-cohort sampling to take advantage of efficient case-only estimators. Our contributions also include an explicit formula for the variance estimates of this two-step procedure.

  • Scenario II: Augmented case-only (ACO) design with controls drawn from the active treatment arm only. This is a novel design motivated by vaccine trials, as we will elaborate next. Although primarily driven by scientific rationale, this design is of statistical interest since only three of the four strata formed by case-control status and randomization arm are sampled. It violates the critical identifiability assumption of non-zero sampling probability for all strata in two-phase sampling (Robins et al., 1994; Breslow et al., 2003). The orthogonality between genotype and randomization arm has to be exploited in order to remedy this anomaly. We show that a similar two-step estimation procedure as for Scenario I will identify all parameters in a Cox model, and we show in the simulations that the estimators are nearly as efficient as those in Scenario I.

Scientifically, the motivation for selecting controls only from the active treatment arm (the ACO design in Scenario II) comes from studies on host genetics and immune correlates in HIV vaccine trials. It is common to study vaccine-specific immune responses in a pre-specified sample of trial participants in the vaccine arm, as no vaccine-induced immune responses are generated in the placebo arm. Case-cohort sampling is commonly used in this setting (McElrath et al., 2008). Take Li et al. (2014), for example: if a genotype in the FcγR gene is associated with varying vaccine protection in the RV144 trial, it is useful to investigate whether specific vaccine-induced immune responses are associated with that genotype, in order to understand functionally why the vaccine effect varies by host genetics. Such a relationship can only be studied in the vaccine arm. In this sense, concentrating controls in the vaccine arm is cost-effective when a pharmacogenetic study is a component of a systematic approach for understanding treatment effect. A similar rationale applies to high-throughput biomarker studies aimed at better understanding hormone effects in clinical trials in the Women’s Health Initiative (Pitteri et al., 2009).

This article is organized as follows. In Section 2.1 and Section 2.2, we review case-cohort sampling and case-only estimators, respectively. The latter section brings new insights on the assumptions required for case-only estimators. In Section 2.3, we show that case-only estimators can be built into a two-step estimation procedure for the case-cohort design. The main parameter of interest we illustrate throughout the paper is the genetic main effect. The asymptotic covariance matrix of estimators resulting from the two-step procedure is derived. Extending the results from Section 2.3, we show in Section 2.4 that sampling controls only in the active treatment arm is adequate to estimate all Cox model parameters. For completeness we briefly address the alternative ACO design and nested case-control sampling in Section 2.4.1. In Section 3 we compare the efficiency of the proposed designs and estimation methods in simulations, where the standard estimation procedure for a case-cohort design with the same sample size is treated as the benchmark. We present in Section 4 a data example with standard case-cohort sampling, and we compare standard error estimates resulting from the proposed estimation procedures to those from the original case-cohort methods. We close with a discussion of the utility of the ACO design and some future work.

2. Method

Consider a two-arm, placebo-controlled randomized prevention trial in which participants are followed to evaluate the treatment effect on the time to a failure event. Let Z denote a binary treatment indicator taking the value 1 if the participant is assigned to the active treatment arm, and 0 if assigned to the placebo arm. Let G denote the baseline biomarker of interest, say an inherited genetic variant, and let V be a set of pre-treatment variables to be adjusted for in the risk association. Denote by Y and C the failure time and the right-censoring time since randomization, respectively. Given Z, G, and V, C is assumed to be independent of Y. Data consist of independent and identically distributed (iid) vectors (Ti, Δi, Zi, Gi, Vi), where Ti = min(Yi, Ci), Δi = I(Yi ≤ Ci) for i = 1, …, n, and I(·) is the indicator function. Cases are defined to be the participants who experienced the failure event (Δ = 1) during follow-up. Using the usual counting process notation, we define Ni(t) = I(Ti ≤ t, Δi = 1) and Ri(t) = I(Ti ≥ t).

Let λ(t;G, Z, V) denote the hazard of the failure event occurring at time t for a subject with covariates (G, Z, V). Consider a proportional hazards model with gene-treatment interaction (Cox, 1972)

$$\lambda(t; G, Z, V) = \lambda_0(t)\exp(\beta_1 G_i + \beta_2 Z_i + \beta_3 G_i Z_i + \beta_4 V_i), \tag{1}$$

where λ0(t) is a baseline hazard function. Denote by Xi = (Gi, Zi, GiZi, Vi)^T the vector of baseline covariates included in the regression model and by β = (β1, β2, β3, β4) the vector of regression coefficients of interest. For the full cohort data, an estimate of β can be obtained by solving the usual Cox model score equation (Cox, 1975),

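$$U(\beta, t) = \sum_{i=1}^{n} \int_0^t \{X_i - \bar{X}(\beta, t)\}\, dN_i(t), \tag{2}$$

where X̄(β, t) is the weighted mean of covariates at t, expressed as

$$\bar{X}(\beta, t) = \frac{\sum_{i=1}^{n} R_i(t)\exp(\beta^T X_i)\, X_i}{\sum_{i=1}^{n} R_i(t)\exp(\beta^T X_i)}. \tag{3}$$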

2.1 Standard case-cohort sampling design and estimation

The originally described case-cohort sampling design draws a random subcohort and all additional participants who experienced the clinical outcome (Prentice, 1986; Self and Prentice, 1988). With little loss of efficiency, the estimation procedures for the case-cohort design modify (3) using a subset of the entire cohort. Denote by 𝒮 the random subcohort in the case-cohort design. The Self-Prentice estimator used at-risk participants in the subcohort (Self and Prentice, 1988),

$$\hat{X}(\beta, t) = \frac{\sum_{i \in \mathcal{S}} R_i(t)\exp(\beta^T X_i)\, X_i}{\sum_{i \in \mathcal{S}} R_i(t)\exp(\beta^T X_i)}, \tag{4}$$

while the original Prentice estimator included one more observation, the event occurring at t. The difference between the two estimators is negligible when the sample size is large. Other approximations to (3) are available, such as the inverse-probability weighting (IPW) method from survey sampling, which can improve the efficiency of case-cohort estimators (Binder, 1992; Barlow, 1994; Borgan et al., 2000).

2.2 Case-only estimator for gene-treatment interaction and subgroup effects

The treatment effect parameters β2 and β3 in model (1) can be estimated from data in cases only (Vittinghoff and Bauer, 2006; Dai et al., 2012), under suitable assumptions about event rate and censoring mechanism. In our notation, the probability of the treatment being z given an event occurring at time t conditional on covariates (G, V) can be expressed as

$$\Pr(Z = z \mid T = t, \Delta = 1, G, V) = \frac{\lambda(t \mid Z = z, G, V)\,\Pr(T \ge t \mid Z = z, G, V)\,\Pr(C \ge t \mid Z = z, G, V)\,\Pr(Z = z)}{\sum_{l} \lambda(t \mid Z = l, G, V)\,\Pr(T \ge t \mid Z = l, G, V)\,\Pr(C \ge t \mid Z = l, G, V)\,\Pr(Z = l)}. \tag{5}$$

Detailed derivation can be found in Dai et al. (2012). Equation (5) holds because of the independent censorship and the orthogonality between Z and (G, V). If the event is rare, i.e. Pr(T ≥ t | Z, G, V) ≈ 1 for all t, and if the censoring time C is independent of treatment Z given (G, V), it follows that a simple logistic regression with an offset can estimate the treatment effect parameters,

$$\log\left\{\frac{\Pr(Z = 1 \mid T = t, \Delta = 1, G, V)}{\Pr(Z = 0 \mid T = t, \Delta = 1, G, V)}\right\} = \log\left(\frac{p}{1 - p}\right) + \gamma_1 + \gamma_2 G, \tag{6}$$

where γ1 ≈ β2, γ2 ≈ β3, and p is the probability of a trial participant being randomized to the treatment arm.
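For a binary genotype and no adjustment covariates V, model (6) is saturated, so its maximum likelihood fit has a closed form in the four case counts by arm and genotype. The following is a minimal sketch under those simplifying assumptions; the function name, argument names, and the example counts are hypothetical and not taken from the paper.

```python
import math


def case_only_estimates(n_trt_g0, n_pbo_g0, n_trt_g1, n_pbo_g1, p=0.5):
    """Closed-form fit of the logistic model (6) for binary G and no V.

    Arguments are counts of CASES by arm (trt/pbo) and genotype (G = 0/1);
    p is the probability of randomization to the active arm.
    Returns (gamma1, gamma2, var_gamma1, var_gamma2).
    """
    offset = math.log(p / (1.0 - p))              # known offset log(p / (1 - p))
    log_odds_g0 = math.log(n_trt_g0 / n_pbo_g0)   # log odds of Z = 1 among cases, G = 0
    log_odds_g1 = math.log(n_trt_g1 / n_pbo_g1)   # same, among cases with G = 1
    gamma1 = log_odds_g0 - offset                 # approximates beta_2 in model (1)
    gamma2 = log_odds_g1 - log_odds_g0            # approximates beta_3 in model (1)
    # Wald variances of the log odds and of the log odds ratio
    var_gamma1 = 1.0 / n_trt_g0 + 1.0 / n_pbo_g0
    var_gamma2 = var_gamma1 + 1.0 / n_trt_g1 + 1.0 / n_pbo_g1
    return gamma1, gamma2, var_gamma1, var_gamma2


# Hypothetical case counts, for illustration only
g1, g2, v1, v2 = case_only_estimates(30, 60, 40, 40, p=0.5)
print(g1, g2, v1, v2)
```

With covariates V or a non-binary G the closed form no longer applies, and (6) would be fitted by ordinary logistic regression software with log(p/(1 − p)) supplied as an offset.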

The models (1) and (6) can be more general than what is presented here. For example, the interaction between Z and V can be added into (1), perhaps also the interaction between Z and t, a time-varying hazard ratio. Similar derivations will lead to addition of V and t into (6) correspondingly.

Inspection of (5) and (6) offers new insights into the assumptions. Arguably, the requirements that the disease endpoint be rare and that censoring be independent of treatment can be restrictive. The former assumption requires that the cumulative probability of the event, not just the event probability in any given risk set, be nearly zero. In the context of binary outcomes and logistic regression, analysis of asymptotic bias suggested that the disease prevalence needs to be smaller than the order of n^(−1/2) (Tchetgen and Robins, 2010). This may work for prevention trials with endpoints such as HIV infection and most cancers, but perhaps not for therapeutic endpoints such as tumor response rates. In the failure time setting, the cumulative disease probability increases gradually over time, and so the rare disease assumption is perhaps not as stringent as in the setting of binary outcomes. This may explain why, in the extensive simulations conducted by Vittinghoff and Bauer (2006), case-only estimators perform surprisingly well even with 20% cumulative event probability. Strictly speaking, what is truly required for the derivation of (6) is

$$\frac{\Pr(T \ge t \mid Z = 1, G, V)}{\Pr(T \ge t \mid Z = 0, G, V)} \approx 1,$$

which is indeed less restrictive than the rare disease assumption. Moreover, if G has no association with the failure time T and Pr(T ≥ t | Z = 1, G, V)/Pr(T ≥ t | Z = 0, G, V) is not a function of G, then estimation of the interaction will still work even if the disease is not so rare, because in this case the intercept of (6) is affected but not the slope.

Violation of the assumption that censoring is independent of treatment given covariates is more amenable to correction. One can directly estimate Pr(C ≥ t | Z = 1, G, V)/Pr(C ≥ t | Z = 0, G, V) as a function of (t, G, V) from the data, assuming independent censorship. This quantity can be plugged into (6) as an offset to remove any bias induced by differential censoring. Furthermore, what is truly required for deriving (6) is

$$\frac{\Pr(C \ge t \mid Z = 1, G, V)}{\Pr(C \ge t \mid Z = 0, G, V)} \approx 1.$$

As long as G is not associated with C conditional on Z and V, so that Pr(C ≥ t | Z = 1, G, V)/Pr(C ≥ t | Z = 0, G, V) is not a function of G, estimation of the interaction parameter in (6) is not affected.
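The role of the two ratio conditions can be made explicit by taking the ratio of (5) at z = 1 and z = 0 and using model (1), under which λ(t | Z = 1, G, V)/λ(t | Z = 0, G, V) = exp(β2 + β3G). The display below is a worked restatement rather than a new result; it reads the survival and censoring terms in (5) as Pr(T ≥ t | ·) and Pr(C ≥ t | ·), which is an assumption about the intended inequality.

```latex
\log\left\{\frac{\Pr(Z = 1 \mid T = t, \Delta = 1, G, V)}{\Pr(Z = 0 \mid T = t, \Delta = 1, G, V)}\right\}
  = \log\left(\frac{p}{1 - p}\right) + \beta_2 + \beta_3 G
  + \log\left\{\frac{\Pr(T \ge t \mid Z = 1, G, V)}{\Pr(T \ge t \mid Z = 0, G, V)}\right\}
  + \log\left\{\frac{\Pr(C \ge t \mid Z = 1, G, V)}{\Pr(C \ge t \mid Z = 0, G, V)}\right\}
```

When both log-ratio terms are approximately zero, the right-hand side reduces to (6) with γ1 ≈ β2 and γ2 ≈ β3. When they are nonzero but free of G, they are absorbed into the intercept and leave the coefficient of G unchanged.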

2.3 Scenario I: Case-cohort estimation incorporating the case-only estimators

Suppose a case-cohort sample has been drawn for measuring genetic factors, including controls from both arms and possibly stratified by arm and other baseline covariates. Let β̂2co and β̂3co denote the case-only estimators derived from (6). Plugging the case-only estimators into the Cox model (1) gives

$$\lambda(t; Z, G, V) = \lambda_0(t)\exp(\beta_1 G + \hat{\beta}_2^{co} Z + \hat{\beta}_3^{co} G Z + \beta_4 V). \tag{7}$$

The usual case-cohort estimation can be adapted to obtain estimates of β1 and β4. For example, the estimator in Self and Prentice (1988) can be obtained by adapting the coxph function in R as exemplified in Therneau and Li (1999), and adding the estimated offset β̂2coZ + β̂3coGZ. The estimated λ0(t) can be obtained from the Breslow estimator based on estimates of the regression parameters as previously described (Prentice, 1986; Self and Prentice, 1988).

The variance estimate of the resulting β̂1 and β̂4 has to account for the fact that β̂2co and β̂3co were estimated first by the data in cases. The derivation entails modification of the Murphy-Topel variance estimate of two-step estimators widely used in the econometrics literature (Murphy and Topel, 1985). Here we provide a brief sketch, starting from asymptotic expansions of (β̂2co, β̂3co) and (β̂1, β̂4).

Let γ = (β2, β3) and let βg = (β1, β4). Let U1 = ∑ U1i be the estimating equation for γ̂ = (β̂2co, β̂3co) based on the logistic model (6), where U1i is the iid contribution from the ith case. For all controls, we set U1i = 0. Suppose A1 = lim −(1/n)(∂U1/∂γ). By a first-order Taylor expansion at γ,

$$\frac{1}{\sqrt{n}} U_1 = A_1 \sqrt{n}(\hat{\gamma} - \gamma) + o_p(1).$$

The asymptotic linear expansion of the case-cohort estimator β̂g after plugging in γ̂ requires some algebra, an example of which is provided in Section 2.3.1 next. Suppose U2 is the estimating equation for β̂g, which can be written as its asymptotically equivalent term ∑Wi, the sum of the iid score contributions. We define Wi = 0 for controls that are not included in the case-cohort sample. Let A2 = lim −(1/n)(∂U2/∂βg) and A3 = lim −(1/n)(∂U2/∂γ). The first-order Taylor expansion of U2 at both βg and γ yields

$$\frac{1}{\sqrt{n}} U_2 = \frac{1}{\sqrt{n}} \sum_i W_i + o_p(1) = A_2 \sqrt{n}(\hat{\beta}_g - \beta_g) + A_3 \sqrt{n}(\hat{\gamma} - \gamma) + o_p(1).$$

By the central limit theorem and under mild regularity conditions, $\sqrt{n}(\hat{\gamma} - \gamma) \to_d \mathcal{N}(0, \Sigma_1)$, where Σ1 is its asymptotic variance matrix $E(A_1^{-1} U_{1i} U_{1i}^T A_1^{-1})$, the robust variance estimator. Similarly, $\sqrt{n}(\hat{\beta}_g - \beta_g) \to_d \mathcal{N}(0, \Sigma_2)$, where the asymptotic variance matrix is

$$\Sigma_2 = E\left\{A_2^{-1}\left(W_i W_i^T + A_3 A_1^{-1} U_{1i} U_{1i}^T A_1^{-1} A_3^T - A_3 A_1^{-1} U_{1i} W_i^T - W_i U_{1i}^T A_1^{-1} A_3^T\right) A_2^{-1}\right\}.$$
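The form of Σ2 follows from combining the two Taylor expansions above. The display below sketches that step, assuming A1 and A2 are invertible and that the summands have mean zero; it is a restatement of the argument, not an additional result.

```latex
\begin{aligned}
\sqrt{n}(\hat{\gamma} - \gamma)
  &= A_1^{-1} \frac{1}{\sqrt{n}} \sum_i U_{1i} + o_p(1), \\
\sqrt{n}(\hat{\beta}_g - \beta_g)
  &= A_2^{-1} \left\{ \frac{1}{\sqrt{n}} \sum_i W_i - A_3 \sqrt{n}(\hat{\gamma} - \gamma) \right\} + o_p(1) \\
  &= A_2^{-1} \frac{1}{\sqrt{n}} \sum_i \left( W_i - A_3 A_1^{-1} U_{1i} \right) + o_p(1).
\end{aligned}
```

Taking the variance of the iid summand W_i − A_3 A_1^{-1} U_{1i} and expanding the outer product gives the four terms inside Σ2: the usual case-cohort term, the contribution of the first-step estimator, and two cross terms that arise because both steps use the cases.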

2.3.1 The Self-Prentice estimator

The asymptotic linear expansions for various case-cohort estimators were presented in the respective works (Lin and Wei, 1989; Binder, 1992; Lin and Ying, 1993; Lin, 2000; Borgan et al., 2000). Here we write out the expressions for the Self-Prentice estimator and leave the survey estimator using IPW to the Appendix. Both estimators are implemented in R, in the cch function of the survival package and in the survey package, two popular tools for analyzing case-cohort data.

In our notation, the estimating function for the Self-Prentice estimator after plugging in the case-only estimators (β̂2co, β̂3co) can be written as

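$$U_2 = \sum_i U_{2i} = \sum_i \Delta_i \left\{X_{2i} - \frac{S^{(1)}(\beta_g, T_i; \hat{\gamma})}{S^{(0)}(\beta_g, T_i; \hat{\gamma})}\right\}, \tag{8}$$

where X2i = (Gi, Vi)^T, X1i = (Zi, GiZi)^T, and

$$S^{(r)}(\beta_g, t; \hat{\gamma}) = \frac{1}{n_{sc}} \sum_{i \in \mathcal{S}} R_i(t)\exp(\hat{\gamma}^T X_{1i} + \beta_g^T X_{2i})\, X_{2i}^{r} \tag{9}$$

for r = 0, 1, where nsc is the sample size of the random subcohort.

The asymptotic linearization of the Self-Prentice estimator can be expressed as $A_2^{-1} W_i$ (Lin and Wei, 1989; Lin and Ying, 1993), where A2 = lim −(1/n)(∂U2/∂βg) and

$$W_i = U_{2i} - \sum_{l=1}^{n} \frac{\Delta_l R_i(T_l)\, I(i \in \mathcal{S}) \exp(\hat{\gamma}^T X_{1i} + \beta_g^T X_{2i})}{n S^{(0)}(\beta_g, T_l; \hat{\gamma})} \left\{X_{2i} - \frac{S^{(1)}(\beta_g, T_l; \hat{\gamma})}{S^{(0)}(\beta_g, T_l; \hat{\gamma})}\right\}.$$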
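To make the plug-in pseudo-score (8) with the risk-set sums (9) concrete, the following minimal sketch evaluates it for a scalar genetic main effect β1 (no V) and solves it by bisection. It is an illustration under stated assumptions, not the authors' implementation: the record layout, the function names, the bisection bounds, and the toy data are all hypothetical, and event times with no subcohort member at risk are simply skipped.

```python
import math


def sp_score(b, data, gamma):
    """Self-Prentice pseudo-score (8) for scalar beta_1 = b.

    gamma = (g1, g2) are the plugged-in case-only estimates, entering as the
    offset g1*Z + g2*G*Z. The factor 1/n_sc in (9) cancels in S1/S0.
    """
    g1, g2 = gamma
    subcohort = [r for r in data if r["in_sc"]]
    score = 0.0
    for case in data:
        if case["delta"] != 1:
            continue
        s0 = s1 = 0.0
        for r in subcohort:
            if r["T"] >= case["T"]:  # at-risk indicator R_i(t) = I(T_i >= t)
                w = math.exp(g1 * r["Z"] + g2 * r["G"] * r["Z"] + b * r["G"])
                s0 += w
                s1 += w * r["G"]
        if s0 > 0.0:  # assumption: skip event times with an empty subcohort risk set
            score += case["G"] - s1 / s0
    return score


def solve_beta1(data, gamma, lo=-10.0, hi=10.0, tol=1e-10):
    """Bisection root of the pseudo-score, which is non-increasing in b."""
    if sp_score(lo, data, gamma) < 0 or sp_score(hi, data, gamma) > 0:
        raise ValueError("score does not change sign on [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sp_score(mid, data, gamma) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def toy_data(z):
    """Hypothetical toy sample: 4 subcohort controls and 4 cases outside it."""
    controls = [{"T": 10.0, "delta": 0, "Z": z, "G": g, "in_sc": True}
                for g in (1, 1, 0, 0)]
    cases = [{"T": float(k + 1), "delta": 1, "Z": z, "G": g, "in_sc": False}
             for k, g in enumerate((1, 1, 1, 0))]
    return controls + cases


beta1_no_offset = solve_beta1(toy_data(z=0), gamma=(0.0, 0.0))
beta1_with_offset = solve_beta1(toy_data(z=1), gamma=(0.2, 0.5))
print(beta1_no_offset, beta1_with_offset)
```

In the toy data the subcohort risk set is the same at every event time, so the root is available by hand: without the offset it solves e^b/(e^b + 1) = 3/4, and with everyone in the active arm the interaction offset shifts the root by −0.5 while the treatment main-effect offset cancels in the ratio.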

2.4 Scenario II: Augmented case-only (ACO) design for vaccine trials

In the proposed ACO sampling design, the genotype is ascertained for a random subcohort from the active treatment arm only, denoted by 𝒮1, and for all additional participants who developed the clinical outcome outside of 𝒮1. The set of sampled participants is therefore defined by {i : Δi = 1 or i ∈ 𝒮1}. Though the controls from the placebo arm are not sampled, we show next that a similar plug-in procedure, carried out in three steps, estimates all parameters in the Cox model (1).
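The sampled set {i : Δi = 1 or i ∈ 𝒮1} can be sketched in a few lines. This is a hypothetical illustration: the record keys, the function name, and the convention that `fraction` is a fraction of the active arm (rather than of the whole cohort) are assumptions made here.

```python
import random


def draw_aco_sample(cohort, fraction, seed=0):
    """Return (subcohort S1, sampled set) for the ACO design.

    cohort: list of dicts with keys "Z" (1 = active arm) and "delta" (1 = case).
    fraction: share of the ACTIVE arm drawn into the subcohort (assumption).
    """
    rng = random.Random(seed)
    active = [i for i, r in enumerate(cohort) if r["Z"] == 1]
    n_sub = round(fraction * len(active))
    subcohort = set(rng.sample(active, n_sub))        # S1: active arm only
    cases = {i for i, r in enumerate(cohort) if r["delta"] == 1}
    return subcohort, cases | subcohort               # {i: delta_i = 1 or i in S1}
```

Placebo-arm participants enter the sample only if they are cases, which is why only three of the four strata formed by case status and arm are observed.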

First, case-only estimators of β2 and β3 are obtained from (6). Second, based on the case-cohort sample in the active treatment arm, we estimate the parameters α = (α1, α2) in the Cox model

$$\lambda(t; Z = 1, G, V) = \lambda_0^{*}(t)\exp(\alpha_1 G + \alpha_2 V) \tag{10}$$

by standard case-cohort methods (Prentice, 1986; Self and Prentice, 1988; Lin and Ying, 1993; Barlow, 1994), where the parameterization in models (1) and (10) dictates that α1 = β1 + β3, α2 ≡ β4, and λ0*(t) = λ0(t)exp(β2). The estimated λ0*(t) can be obtained from the Breslow estimator based on the estimate of α as previously described (Prentice, 1986; Self and Prentice, 1988). Third, we compute the estimators of (β1, λ0(t)) using the estimators obtained in the previous steps as follows:

$$\hat{\beta}_1 = \hat{\alpha}_1 - \hat{\beta}_3,$$
$$\hat{\lambda}_0(t) = \hat{\lambda}_0^{*}(t)\exp(-\hat{\beta}_2).$$

The full set of parameters in a Cox model is therefore estimated.

Since the estimation of (α̂1, α̂2) and (β̂2co, β̂3co) both used data from cases in the active treatment arm, the two sets of estimators are correlated. We estimate the covariance matrix by general estimating equation theory. As shown in Section 2.3, both estimators can be written as asymptotically linear estimators (Newey and Powell, 1990; Robins et al., 1994), that is, after centering and scaling by √n each can be expressed as $n^{-1/2}\sum_{i=1}^{n} B_i + o_p(1)$, where Bi = A−1Ui is the iid influence function from each subject, A is the expected information matrix and Ui is the iid estimating function. Expressions of A and Ui for (α̂1, α̂2) and (β̂2co, β̂3co) follow those in Section 2.3. Suppose the influence function for the case-cohort estimator (α̂1, α̂2) is B1, and suppose the influence function for the case-only estimator (β̂2, β̂3) is B2. Then by the central limit theorem, the covariance between (α̂1, α̂2) and (β̂2, β̂3) can be estimated from the cross-products $B_{1i}^{T} B_{2i}$ of the estimated influence functions.
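The third step and the effect of the correlation on the variance of the genetic main effect can be sketched as follows. The variance line uses only the standard identity Var(a − b) = Var(a) + Var(b) − 2Cov(a, b); the function name, argument names, and the numbers in the usage example are hypothetical.

```python
import math


def aco_step_three(alpha1, beta2_co, beta3_co,
                   var_alpha1, var_beta3, cov_alpha1_beta3,
                   baseline_star):
    """Combine step-one and step-two estimates into (beta_1, lambda_0).

    baseline_star: list of (time, value) pairs for the estimated lambda_0^*(t),
    i.e. the baseline hazard in the active arm (hypothetical input layout).
    """
    beta1 = alpha1 - beta3_co                       # since alpha_1 = beta_1 + beta_3
    # Var(a - b) = Var(a) + Var(b) - 2 Cov(a, b)
    var_beta1 = var_alpha1 + var_beta3 - 2.0 * cov_alpha1_beta3
    # lambda_0(t) = lambda_0^*(t) * exp(-beta_2)
    baseline = [(t, h * math.exp(-beta2_co)) for t, h in baseline_star]
    return beta1, var_beta1, baseline


# Hypothetical inputs, for illustration only
beta1, var_beta1, baseline = aco_step_three(
    alpha1=0.9, beta2_co=-0.4, beta3_co=0.4,
    var_alpha1=0.08, var_beta3=0.11, cov_alpha1_beta3=0.03,
    baseline_star=[(0.5, 0.010), (1.0, 0.020)],
)
print(beta1, var_beta1, baseline)
```

Ignoring the covariance term would misstate the variance of the main-effect estimate, which is why the influence-function cross-products are needed.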

2.4.1 The alternative ACO design and nested case-control design

For completeness, we briefly address an alternative ACO design, in which the subcohort is instead taken only from the placebo arm. Such a design may be of merely theoretical interest, serving as an efficiency comparison for the ACO design discussed in Section 2.4, as we will show in simulations. The estimation is simplified: the first step is the same, and in the second step β1, β4 and λ0 are directly estimated using the case-cohort data in the placebo arm.

Frequently used in biomarker research, the nested case-control design randomly selects a fraction of controls in the risk set at each failure time. This design is particularly suitable for time-varying biomarkers. Little added efficiency can be realized when selecting more than 5 controls per case (Breslow et al., 1983). Our augmented case-only design can be similarly modified by using all cases plus a nested case-control sample from one of the two arms. The estimation procedure closely follows that in Section 2.2, but uses conditional logistic regression or the IPW partial likelihood method in the second step (Goldstein and Langholz, 1992; Samuelsen, 1997), possibly with stratification (Langholz and Borgan, 1995). The asymptotic linearization of these estimators required for estimating the covariance matrix can be found in the respective literature.
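Risk-set sampling for the nested case-control design can be sketched as below. This is a simplified illustration with hypothetical names: it draws up to m controls per case from everyone still at risk at the case's failure time, without restricting to one arm and without any matching or stratification.

```python
import random


def nested_case_control(cohort, m, seed=0):
    """For each case, sample up to m controls from its risk set.

    cohort: list of dicts with keys "T" (observed time) and "delta" (1 = case).
    Returns a list of (case_index, [control_indices]).
    """
    rng = random.Random(seed)
    sampled_sets = []
    for i, case in enumerate(cohort):
        if case["delta"] != 1:
            continue
        # Risk set at the failure time: everyone else with T >= case time
        risk_set = [j for j, r in enumerate(cohort)
                    if j != i and r["T"] >= case["T"]]
        controls = rng.sample(risk_set, min(m, len(risk_set)))
        sampled_sets.append((i, controls))
    return sampled_sets
```

A participant who later becomes a case can serve as a control for an earlier case, which is the usual convention in risk-set sampling.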

3. Simulation

The performance of the proposed estimation and designs is evaluated in a simulation study with 1000 simulated datasets, using the standard estimation for the case-cohort design and the full cohort analysis as benchmarks. For Scenario I, where standard case-cohort sampling has taken place, the interest is in evaluating how much efficiency is gained for the genetic main effect when we incorporate the case-only estimators into case-cohort estimation. For Scenario II, the interest is in comparing the efficiency of the main effect estimator under different designs, adding controls in the active treatment arm only or adding controls in other ways, all of which use a similar hybrid estimation procedure that incorporates the case-only estimator.

Across all scenarios the sample size in the trial is 3000, and the participants are randomized in a 1:1 ratio to the active treatment arm or the placebo arm. The genotype G is assumed to depend on V, a baseline covariate following a Bernoulli distribution with rate 0.5, through logit{Pr(G = 1)} = −1.6 + 1.4V. The rate of the variant allele is around 0.3. The cumulative probability of incident cases is set to be around 0.05. The event time is exponentially distributed, with a constant baseline hazard function of λ0(t) = 1. The true regression parameters associated with the set of covariates are listed in Tables 1 and 2, with the parameter associated with V set to log(1.5). The censoring time is exponentially distributed with mean 1, independent of the event time. Administrative censoring is set such that the cumulative event rate is around 5%. The ACO designs consist of a random subcohort of varying sample sizes in the active treatment arm only or in the placebo arm only, plus all cases outside the subcohort. The standard case-cohort design was devised to have almost the same sample size as the ACO designs in each simulated dataset, though the subcohort is drawn randomly from the entire trial population.
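One simulated trial under the setup just described can be generated along the following lines. This is a sketch rather than the authors' simulation code: the administrative censoring time `tau = 0.04`, the independent coin-flip randomization (instead of exact 1:1 allocation), the seed, and all names are assumptions chosen here so that the cumulative event rate comes out near 5% under the null setting.

```python
import math
import random


def simulate_trial(n=3000, beta=(0.0, 0.0, 0.0, math.log(1.5)),
                   tau=0.04, seed=1):
    """Simulate (T, delta, Z, G, V) for one trial under Cox model (1).

    beta = (beta1, beta2, beta3, beta4); baseline hazard lambda_0(t) = 1.
    tau is a hypothetical administrative censoring time.
    """
    rng = random.Random(seed)
    b1, b2, b3, b4 = beta
    data = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0          # 1:1 randomization (coin flip)
        v = 1 if rng.random() < 0.5 else 0          # V ~ Bernoulli(0.5)
        p_g = 1.0 / (1.0 + math.exp(-(-1.6 + 1.4 * v)))
        g = 1 if rng.random() < p_g else 0          # logit Pr(G = 1) = -1.6 + 1.4 V
        hazard = math.exp(b1 * g + b2 * z + b3 * g * z + b4 * v)
        y = rng.expovariate(hazard)                 # exponential event time
        c = min(rng.expovariate(1.0), tau)          # random plus administrative censoring
        data.append({"T": min(y, c), "delta": int(y <= c),
                     "Z": z, "G": g, "V": v})
    return data


trial = simulate_trial()
event_rate = sum(r["delta"] for r in trial) / len(trial)
g_rate = sum(r["G"] for r in trial) / len(trial)
print(event_rate, g_rate)
```

Averaging the genotype model over V gives a marginal Pr(G = 1) of about 0.31, in line with the stated variant rate of around 0.3.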

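Table 1.

Small-sample properties of estimators for the proposed designs and estimations in the Cox model (1) with a varying subcohort fraction

Setting: β1 = β2 = β3 = 0

| SC Fraction | Statistic | β1a | β1b | β1c | β2 | β3 |
|---|---|---|---|---|---|---|
| 10% | Bias | 0 | 0.005 | −0.004 | −0.006 | −0.012 |
| 10% | Var | 0.081 | 0.084 | 0.083 | 0.043 | 0.123 |
| 10% | Var (estimated) | 0.083 | 0.086 | 0.083 | 0.041 | 0.124 |
| 10% | CP | 0.954 | 0.960 | 0.947 | 0.957 | 0.959 |
| 15% | Bias | −0.009 | −0.009 | −0.014 | −0.003 | −0.005 |
| 15% | Var | 0.077 | 0.082 | 0.079 | 0.041 | 0.127 |
| 15% | Var (estimated) | 0.076 | 0.080 | 0.077 | 0.041 | 0.125 |
| 15% | CP | 0.947 | 0.950 | 0.946 | 0.959 | 0.957 |
| 20% | Bias | −0.012 | −0.010 | −0.014 | −0.002 | −0.003 |
| 20% | Var | 0.076 | 0.081 | 0.078 | 0.041 | 0.131 |
| 20% | Var (estimated) | 0.073 | 0.077 | 0.074 | 0.041 | 0.125 |
| 20% | CP | 0.945 | 0.946 | 0.944 | 0.958 | 0.951 |
| 25% | Bias | −0.010 | −0.009 | −0.009 | 0.002 | −0.005 |
| 25% | Var | 0.073 | 0.078 | 0.074 | 0.041 | 0.130 |
| 25% | Var (estimated) | 0.071 | 0.075 | 0.072 | 0.041 | 0.125 |
| 25% | CP | 0.954 | 0.950 | 0.947 | 0.956 | 0.954 |

Setting: β1 = −β2 = β3 = log 1.5

| SC Fraction | Statistic | β1a | β1b | β1c | β2 | β3 |
|---|---|---|---|---|---|---|
| 10% | Bias | 0.016 | 0.025 | 0.010 | 0.001 | −0.017 |
| 10% | Var | 0.073 | 0.076 | 0.074 | 0.058 | 0.110 |
| 10% | Var (estimated) | 0.071 | 0.076 | 0.071 | 0.054 | 0.112 |
| 10% | CP | 0.954 | 0.954 | 0.940 | 0.947 | 0.961 |
| 15% | Bias | 0.010 | 0.013 | 0.003 | 0.006 | −0.014 |
| 15% | Var | 0.066 | 0.071 | 0.069 | 0.056 | 0.110 |
| 15% | Var (estimated) | 0.065 | 0.069 | 0.065 | 0.054 | 0.112 |
| 15% | CP | 0.960 | 0.955 | 0.938 | 0.951 | 0.964 |
| 20% | Bias | 0.006 | 0.011 | 0.002 | 0.006 | −0.011 |
| 20% | Var | 0.065 | 0.071 | 0.067 | 0.057 | 0.112 |
| 20% | Var (estimated) | 0.061 | 0.066 | 0.062 | 0.054 | 0.112 |
| 20% | CP | 0.953 | 0.947 | 0.939 | 0.955 | 0.960 |
| 25% | Bias | 0.010 | 0.015 | 0.008 | 0.011 | −0.018 |
| 25% | Var | 0.062 | 0.067 | 0.063 | 0.055 | 0.108 |
| 25% | Var (estimated) | 0.059 | 0.064 | 0.060 | 0.054 | 0.112 |
| 25% | CP | 0.956 | 0.945 | 0.948 | 0.958 | 0.960 |

Setting: β1 = −β2 = β3 = log 2

| SC Fraction | Statistic | β1a | β1b | β1c | β2 | β3 |
|---|---|---|---|---|---|---|
| 10% | Bias | 0.020 | 0.034 | 0.012 | 0.001 | −0.014 |
| 10% | Var | 0.070 | 0.072 | 0.071 | 0.074 | 0.116 |
| 10% | Var (estimated) | 0.069 | 0.074 | 0.068 | 0.072 | 0.121 |
| 10% | CP | 0.953 | 0.956 | 0.945 | 0.959 | 0.965 |
| 15% | Bias | 0.016 | 0.023 | 0.008 | 0.003 | −0.010 |
| 15% | Var | 0.064 | 0.069 | 0.065 | 0.074 | 0.117 |
| 15% | Var (estimated) | 0.063 | 0.068 | 0.062 | 0.072 | 0.121 |
| 15% | CP | 0.954 | 0.949 | 0.943 | 0.956 | 0.964 |
| 20% | Bias | 0.011 | 0.019 | 0.006 | 0.005 | −0.011 |
| 20% | Var | 0.062 | 0.068 | 0.063 | 0.074 | 0.118 |
| 20% | Var (estimated) | 0.059 | 0.065 | 0.059 | 0.072 | 0.120 |
| 20% | CP | 0.949 | 0.943 | 0.939 | 0.950 | 0.961 |
| 25% | Bias | 0.013 | 0.021 | 0.010 | 0.013 | −0.017 |
| 25% | Var | 0.059 | 0.065 | 0.060 | 0.069 | 0.113 |
| 25% | Var (estimated) | 0.057 | 0.063 | 0.057 | 0.072 | 0.120 |
| 25% | CP | 0.954 | 0.943 | 0.947 | 0.954 | 0.961 |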

Notation: SC Fraction, subcohort sampling fraction of the entire cohort; CP, 95% coverage probability; β1a, β1b and β1c, the Self-Prentice estimator used to estimate β1 in the standard case-cohort design incorporating the case-only estimators, the augmented case-only design while sampling controls from the active treatment arm only, or the augmented case-only design with controls from the placebo arm only.

Table 2.

The efficiency of the proposed designs and estimations in the Cox model (1) with varying subcohort fraction, when compared to the full cohort analysis

Settings: (A) β1 = β2 = β3 = β4 = 0; (B) β1 = −β2 = β3 = log 1.5; (C) β1 = −β2 = β3 = log 2.

| SC Fraction | Design and estimation | β1 (A) | β2 (A) | β3 (A) | β1 (B) | β2 (B) | β3 (B) | β1 (C) | β2 (C) | β3 (C) |
|---|---|---|---|---|---|---|---|---|---|---|
| 10% | Case-cohort | 0.659 | 0.678 | 0.644 | 0.626 | 0.744 | 0.625 | 0.605 | 0.787 | 0.641 |
| 10% | Case-cohort + case-only | 0.812 | 0.995 | 1.001 | 0.775 | 0.992 | 1.017 | 0.761 | 0.997 | 1.025 |
| 10% | ACO active | 0.780 | 0.995 | 1.001 | 0.743 | 0.992 | 1.017 | 0.735 | 0.997 | 1.025 |
| 10% | ACO placebo | 0.788 | 0.995 | 1.001 | 0.760 | 0.992 | 1.017 | 0.751 | 0.997 | 1.025 |
| 15% | Case-cohort | 0.767 | 0.765 | 0.738 | 0.737 | 0.806 | 0.707 | 0.723 | 0.834 | 0.726 |
| 15% | Case-cohort + case-only | 0.894 | 0.993 | 0.993 | 0.874 | 0.986 | 1.003 | 0.865 | 0.990 | 1.013 |
| 15% | ACO active | 0.839 | 0.993 | 0.993 | 0.813 | 0.986 | 1.003 | 0.797 | 0.990 | 1.013 |
| 15% | ACO placebo | 0.869 | 0.993 | 0.993 | 0.846 | 0.986 | 1.003 | 0.843 | 0.990 | 1.013 |
| 20% | Case-cohort | 0.820 | 0.856 | 0.789 | 0.808 | 0.883 | 0.777 | 0.792 | 0.908 | 0.793 |
| 20% | Case-cohort + case-only | 0.920 | 0.994 | 0.984 | 0.907 | 0.982 | 0.989 | 0.899 | 0.988 | 1.002 |
| 20% | ACO active | 0.866 | 0.994 | 0.984 | 0.836 | 0.982 | 0.989 | 0.816 | 0.988 | 1.002 |
| 20% | ACO placebo | 0.897 | 0.994 | 0.984 | 0.885 | 0.982 | 0.989 | 0.881 | 0.988 | 1.002 |
| 25% | Case-cohort | 0.871 | 0.882 | 0.839 | 0.854 | 0.906 | 0.827 | 0.839 | 0.930 | 0.840 |
| 25% | Case-cohort + case-only | 0.948 | 0.998 | 0.984 | 0.931 | 0.983 | 0.990 | 0.926 | 0.990 | 1.004 |
| 25% | ACO active | 0.883 | 0.998 | 0.984 | 0.855 | 0.983 | 0.990 | 0.839 | 0.990 | 1.004 |
| 25% | ACO placebo | 0.924 | 0.998 | 0.984 | 0.913 | 0.983 | 0.990 | 0.914 | 0.990 | 1.004 |

Notation: SC Fraction, subcohort sampling fraction of the entire cohort; Case-cohort, the standard case-cohort design and estimation with controls from both arms; Case-cohort + case-only, the case-cohort design incorporating the case-only estimators; ACO active, the augmented case-only design with controls from the active arm only; ACO placebo, the augmented case-only design with controls from the placebo arm only.

Table 1 shows the small-sample properties of the proposed hybrid estimation procedures for incorporating case-only estimators into the case-cohort estimation (β1a), and for the ACO design sampling controls from the active arm only (β1b) or from the placebo arm only (β1c). The Self-Prentice estimator was used in the second step for all three designs. The IPW estimator performs very similarly to the Self-Prentice estimator and thus is omitted from this table and Table 2. Under the null hypothesis and under the moderate effect size for all three parameters, all estimators appear to be consistent, as the biases of the estimators are all small relative to their empirical variability. The ACO design with controls from the active arm tends to have larger bias under the alternative. The estimated variances agree well with the empirical variances, and the coverage probabilities of the 95% confidence intervals behave properly. Similar performance was observed for the simulations with a qualitative interaction model (Supplementary materials). We conclude that the hybrid estimation procedures detailed in Sections 2.3 and 2.4 work well in the simulated datasets.

Table 2 shows the efficiency of the standard case-cohort design (with or without incorporating the case-only estimators) and the two ACO designs, relative to the full cohort analysis. The relative efficiency is calculated as the ratio of the sample variances of the parameter estimates from the two analyses being compared. The results suggest that all designs incorporating case-only estimators lead to a major efficiency gain for all three parameters relative to the standard case-cohort estimation. The efficiency gains on β2 and β3 are not surprising, as they are the case-only estimators (Vittinghoff and Bauer, 2006). More interestingly, an efficiency gain of over 10% is realized for the genetic main effect β1 when the case-only estimators are incorporated into the case-cohort analysis (Scenario I), or when the random sampling fraction is 5–15% in the ACO designs (Scenario II). Sampling controls from both arms appears to outperform sampling controls from one arm only in estimating the genetic main effect: compared to the standard case-cohort design and analysis, the ACO design with controls from the active arm gains 5–15% efficiency due to the use of the case-only estimator, depending on the subcohort fraction; an additional 5–10% can be gained by allocating the controls to both arms, a design benefit given that the case-only estimator has been exploited in all these designs. The ACO design with controls from the placebo arm only outperforms the ACO design with controls from the active arm only, presumably because the former has a simpler estimation procedure. However, when potential systematic studies such as immune correlates are of interest, the ACO design with controls from the active treatment arm may still be more cost-effective.
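The relative efficiency entries can be computed from the replicated estimates as a ratio of sample variances. The sketch below assumes the full-cohort variance goes in the numerator, which is consistent with the sub-sampling designs in Table 2 having values below 1; the function name and the toy inputs are hypothetical.

```python
import statistics


def relative_efficiency(full_cohort_estimates, design_estimates):
    """Sample variance of full-cohort estimates over that of a design's estimates.

    Each argument holds one estimate of the same parameter per simulated dataset.
    Values below 1 mean the design is less precise than the full cohort analysis.
    """
    return (statistics.variance(full_cohort_estimates)
            / statistics.variance(design_estimates))


# Toy replicate estimates, for illustration only
re_toy = relative_efficiency([1.0, 2.0, 3.0], [0.0, 2.0, 4.0])
print(re_toy)
```

In the simulation study each list would hold 1000 estimates, one per simulated dataset.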

4. Data application

We show a pedagogical example in HIV vaccine trials with a standard case-cohort sampling scheme. We analyzed the case-cohort data in four ways: standard case-cohort estimation, case-only estimation, case-cohort estimation incorporating the case-only estimators (Section 2.3), and augmenting case-only data with controls from the vaccine arm only (Section 2.4). The estimators of the genetic main effect are compared to the standard case-cohort analysis in terms of standard errors.

The Step trial, a test-of-concept study that evaluated the protection of a cell-mediated immune vaccine, was prematurely terminated because the risk of HIV infection was evidently elevated in the vaccine arm compared to the placebo arm (Buchbinder et al., 2008). In order to understand this disappointing result, a host genetics study was conducted to assess the association of several immune genes, namely GM, KM and FcγR, with HIV infection and the vaccine effect. A case-cohort sample including 25% of study participants was pre-selected for storing blood samples, together with blood samples taken later from incident cases, and for measuring immunogenicity (McElrath et al., 2008). The analysis of this genetic study has been reported elsewhere, and no significant gene-treatment interaction was found, possibly due to the small sample size (Pandey et al., 2013). Following the strategy in Pandey et al. (2013), all four analyses were restricted to white males and adjusted for the same set of covariates.

Table 3 shows the estimates for various designs and estimation procedures for the genetic variant in FcγR-2, coded by an additive genetic score (0/1/2). In the last three analyses, β2 and β3 were estimated by the case-only method, and thus all have smaller standard errors than the standard case-cohort estimates. The genetic main effect β1 was estimated using the Self-Prentice estimator in all analyses that estimate it. The standard error of the ACO estimate using only 60 controls from the vaccine arm is similar to that from the standard case-cohort estimation with all 169 controls. The standard error of the case-cohort analysis incorporating case-only estimators is even smaller because more controls were used.

Table 3.

Comparison of various study designs and estimation procedures in a case-cohort genetic study in the STEP trial.

| Design | # cases | # controls | β̂1 (SE) | β̂2 (SE) | β̂3 (SE) |
|---|---|---|---|---|---|
| Case-cohort | 56 | 169 | 0.17 (0.40) | 1.05 (0.66) | −0.32 (0.51) |
| Case-only | 56 | 0 | - | 0.63 (0.49) | −0.26 (0.38) |
| Case-cohort + case-only | 56 | 169 | 0.19 (0.31) | 0.63 (0.49) | −0.26 (0.38) |
| ACO active | 56 | 60 | 0.24 (0.38) | 0.63 (0.49) | −0.26 (0.38) |

Notation: Case-cohort + case-only, the case-cohort design incorporating the case-only estimators; ACO active, the augmented case-only design using controls from the vaccine arm
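To make the standard-error comparison for the genetic main effect concrete, the sketch below squares the ratio of the Table 3 standard errors, taking the standard case-cohort analysis as the reference. This single-dataset variance ratio is only a rough analogue of the simulation-based relative efficiency and is not a quantity reported in the paper; the helper name `variance_ratio` is hypothetical.

```python
# Standard errors of the genetic main effect (beta1), copied from Table 3.
se_beta1 = {
    "Case-cohort": 0.40,
    "Case-cohort + case-only": 0.31,
    "ACO active": 0.38,
}


def variance_ratio(se_reference, se_other):
    """Squared SE ratio; values above 1 mean a smaller variance than the reference."""
    return (se_reference / se_other) ** 2


reference = se_beta1["Case-cohort"]
ratios = {name: variance_ratio(reference, se) for name, se in se_beta1.items()}

for name, ratio in ratios.items():
    print(f"{name}: variance ratio vs. standard case-cohort = {ratio:.2f}")
```

Note that the ACO active analysis reaches a ratio close to 1 while using 60 rather than 169 controls, which is the point made in the text.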

From a scientific standpoint, it is more cost-effective to store biological specimens from participants in the vaccine arm. If any genotype showed a significant interaction with the vaccine, investigators could directly correlate the genotype with immune responses using samples in the vaccine arm (McElrath et al., 2008), to further explore the mechanism of differential vaccine protection.

5. Discussion

In prevention clinical trials with failure time endpoints, we investigated several ways of augmenting the case-only sampling design for studying the influence of pre-treatment biomarkers, such as genotypes, on treatment effect and the risk of the failure event. The goal is to be able to estimate all parameters in a Cox model, not just subgroup effects and the interaction, so that absolute risk can be computed. One way is to incorporate the case-only estimators into case-cohort estimation (Scenario I). We showed that such estimators and their variance estimates can be obtained by a hybrid estimation procedure. Motivated by vaccine trials, we also proposed an augmented case-only design which builds on the efficient case-only method and adds controls from the active treatment arm only (Scenario II). Following a similar hybrid procedure, we showed that all parameters in a Cox model and the absolute risk can be estimated. Simulation results showed a sizable efficiency gain in estimating the genetic main effect over the standard case-cohort design, owing to the incorporation of case-only estimators and the exploitation of gene-treatment independence. It is worthwhile to reiterate that Scenario II is motivated by the scientific and cost-effective use of vaccine trial resources, not by estimation efficiency, as the simulations showed that allocating controls to both arms achieved better efficiency.
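As an illustration of why all Cox model parameters are needed for absolute risk, the sketch below evaluates 1 − exp{−Λ0(t) exp(β1 G + β2 Z + β3 GZ)} for each genotype and arm. Several things here are assumptions: the model is written without the adjustment covariates, β2 is read as the treatment main effect and β3 as the gene-treatment interaction, and the cumulative baseline hazard `LAMBDA0` is a made-up value, not an estimate from the trial. The coefficients are the "Case-cohort + case-only" point estimates from Table 3.

```python
import math


def absolute_risk(cum_baseline_hazard, beta1, beta2, beta3, g, z):
    """Cumulative risk by time t under a Cox model with G, Z and G*Z terms.

    cum_baseline_hazard: Lambda_0(t), the cumulative baseline hazard at t.
    g: genotype score (0/1/2); z: treatment indicator (1 = active arm).
    """
    linear_predictor = beta1 * g + beta2 * z + beta3 * g * z
    return 1.0 - math.exp(-cum_baseline_hazard * math.exp(linear_predictor))


LAMBDA0 = 0.03  # hypothetical cumulative baseline hazard, for illustration only
BETAS = {"beta1": 0.19, "beta2": 0.63, "beta3": -0.26}  # Table 3 point estimates

for g in (0, 1, 2):
    for z in (0, 1):
        risk = absolute_risk(LAMBDA0, g=g, z=z, **BETAS)
        print(f"G={g}, Z={z}: absolute risk = {risk:.4f}")
```

A case-only analysis alone supplies β2 and β3 but neither β1 nor Λ0(t), so the risks for G = 1 or 2 in either arm could not be evaluated from it.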

The assumptions required by case-only estimators can be restrictive. The rare-disease assumption and the assumption of nondifferential censoring between arms may exclude many cancer therapeutic trials. The applications we consider in this article are primarily phase III prevention trials with a rare endpoint, for example HIV vaccine trials. These trials enroll healthy participants and evaluate the preventive effect of a modality that should not induce severe adverse effects.
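To show how these assumptions enter, here is a minimal sketch of a case-only calculation in the simplest setting: a binary genotype, a known randomization probability, a rare endpoint, gene-treatment independence, and nondifferential censoring. Under those assumptions the odds of active treatment among cases with genotype g is approximately [p/(1 − p)] exp(β2 + β3 g), so both parameters can be read off a 2×2 table of cases. The counts and the function name are hypothetical, and reading β2 as the treatment effect in non-carriers and β3 as the interaction is an assumption about the notation.

```python
import math


def case_only_estimates(n00, n01, n10, n11, p_active=0.5):
    """Case-only estimates from counts of cases n[g][z] (g = genotype, z = arm).

    Returns (beta2_hat, se2, beta3_hat, se3) using large-sample log-odds
    formulas; valid only under the assumptions listed in the lead-in.
    """
    log_randomization_odds = math.log(p_active / (1.0 - p_active))
    # Treatment log hazard ratio among non-carriers (g = 0).
    beta2_hat = math.log(n01 / n00) - log_randomization_odds
    se2 = math.sqrt(1.0 / n00 + 1.0 / n01)
    # Interaction: log odds ratio between genotype and arm among cases.
    beta3_hat = math.log((n11 * n00) / (n10 * n01))
    se3 = math.sqrt(1.0 / n00 + 1.0 / n01 + 1.0 / n10 + 1.0 / n11)
    return beta2_hat, se2, beta3_hat, se3


# Hypothetical case counts, for illustration only.
beta2_hat, se2, beta3_hat, se3 = case_only_estimates(n00=20, n01=30, n10=10, n11=12)
print(f"beta2 = {beta2_hat:.3f} (SE {se2:.3f})")
print(f"beta3 = {beta3_hat:.3f} (SE {se3:.3f})")
```

If the endpoint were common, or censoring differed by arm, the treatment odds among cases would no longer track the hazard ratio, which is why such trials fall outside the scope described above.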

The benefit of concentrating controls in the active treatment arm is to have a bigger pool of active treatment recipients for systematically measuring both genotypes and a comprehensive profile of biological mediators. Ancillary studies of this nature are increasingly common in clinical trials (Pitteri et al., 2009; Li et al., 2014). The controls can be sampled from a random subcohort or sampled repeatedly from risk sets as in nested case-control sampling. For the ACO design, the efficiency of the estimators of the genetic main effect and the covariate can be further improved by using more efficient estimators of α1 and α2 in (10), for example, the method of efficient score equations (Nan, 2004), though the variance of the resulting two-step estimators can be difficult to derive.

In the same vein, ACO designs can be devised for clinical trials with binary endpoints, where logistic regression models are used for analysis. For rare endpoints, the two-step estimation procedure proposed in this article applies with minor modification. The efficiency comparison with standard logistic regression under a case-control sampling design is more complicated, since gene-treatment independence can also be exploited in the maximum likelihood estimation (Dai et al., 2009). Furthermore, it is not clear whether the ACO design can be extended to common endpoints, where the case-only estimator is no longer applicable. Use of additive interaction is another topic of interest, as it may better inform the public health impact of gene-treatment interaction and the biological inter-dependence. Gene-environment independence has been exploited to improve efficiency under an additive risk model (Han et al., 2012), but it is not clear whether one can estimate parameters in an additive interaction model using an ACO design. We will pursue these topics in future work.


Acknowledgements

This work was supported by the National Institutes of Health grants P01 CA53996, R01 HL114901, R01 HG006164, R01 ES017030 and R21 HL121347. The authors thank two reviewers and the Associate Editor for their constructive comments.

Appendix

The asymptotic linear expansion of the IPW estimator for case-cohort sampling

For computing the expected covariate values at each event time, those at-risk cases occurring outside of the random subcohort can be used with proper sampling weights. Specifically, in estimating function (8), the average term (9) is replaced by

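$$S^{(r)}(\beta_g, T_i) = \frac{1}{n} \sum_{i \in \mathcal{S} \cup \mathcal{D}} \frac{1}{\pi_i} R_i(t) \exp\left(\hat{\gamma}^T X_{1i} + \beta_g^T X_{2i}\right) X_{2i}^{\otimes r},$$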

where

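$$\pi_i = \begin{cases} \dfrac{\sum_j I(\Delta_j = 0,\ j \in \mathcal{S})}{\sum_j I(\Delta_j = 0,\ Z_j = 1)} & \text{if } i \in \mathcal{S} \text{ and } \Delta_i = 0, \\[2ex] 1 & \text{if } i \in \mathcal{D}, \end{cases}$$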

and 𝒟 is the set of cases.

The asymptotic expansion for the survey estimator using the inverse probability weights is modified as $B_{2i} = A_2^{-1} W_i$, where $A_2 = \lim -(1/n)(\partial U_2 / \partial \beta_g)$, and

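$$W_i = U_{2i} - \sum_{l \in \mathcal{S} \cup \mathcal{D}} \frac{1}{\pi_l} \Delta_l \, \frac{Y_i(T_l)\, I(i \in \mathcal{S} \cup \mathcal{D}) \exp\left(\hat{\gamma}^T X_{1i} + \beta_g^T X_{2i}\right)}{n S^{(0)}(\beta_g, T_l)} \left\{ X_{2i} - \frac{S^{(1)}(\beta_g, T_l; \hat{\gamma})}{S^{(0)}(\beta_g, T_l; \hat{\gamma})} \right\}.$$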

The computation of the covariance matrix of βg follows similarly as in Section 2.3.
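The weighted risk-set averages used by the IPW estimator can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: X2 is taken to be scalar, the at-risk indicator is read as "observed time at or after t", n is taken to be the full cohort size, the control sampling probability is read as the number of sampled controls divided by the number of controls in the active arm, and the function names are hypothetical.

```python
import math


def sampling_probabilities(delta, in_subcohort, z):
    """Return pi_i: 1 for cases, the sampling fraction for sampled controls,
    and None for unsampled controls (they do not enter the weighted sums)."""
    n_sampled_controls = sum(1 for d, s in zip(delta, in_subcohort) if d == 0 and s)
    n_active_controls = sum(1 for d, arm in zip(delta, z) if d == 0 and arm == 1)
    fraction = n_sampled_controls / n_active_controls
    probabilities = []
    for d, s in zip(delta, in_subcohort):
        if d == 1:
            probabilities.append(1.0)  # all cases are used, inside or outside S
        elif s:
            probabilities.append(fraction)
        else:
            probabilities.append(None)
    return probabilities


def weighted_risk_set_averages(t, times, pi, linear_predictor, x2, n):
    """Inverse-probability-weighted S0 and S1 at time t for scalar x2."""
    s0 = 0.0
    s1 = 0.0
    for time_i, pi_i, eta_i, x_i in zip(times, pi, linear_predictor, x2):
        if pi_i is None or time_i < t:
            continue  # not sampled, or no longer at risk at time t
        weight = math.exp(eta_i) / pi_i
        s0 += weight
        s1 += weight * x_i
    return s0 / n, s1 / n


# Toy cohort of six participants (hypothetical data).
delta = [1, 1, 0, 0, 0, 0]                # 1 = case
z = [1, 0, 1, 1, 1, 1]                    # 1 = active arm
in_subcohort = [False, False, True, True, False, False]
times = [2.0, 3.0, 5.0, 5.0, 5.0, 5.0]    # observed follow-up times
x2 = [1.0, 0.0, 2.0, 0.0, 1.0, 1.0]       # scalar covariate, e.g. genotype
eta = [0.0] * 6                           # linear predictor set to zero

pi = sampling_probabilities(delta, in_subcohort, z)
s0, s1 = weighted_risk_set_averages(2.0, times, pi, eta, x2, n=len(times))
print(pi, s0, s1, s1 / s0)
```

The ratio `s1 / s0` is the weighted covariate average that enters the estimating function at each event time; it does not depend on the choice of n.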

Footnotes

Supplementary Materials

Web Table referenced in Section 3 and the sample code to implement the estimation method are available with this paper at the Biometrics website on Wiley Online Library.

Contributor Information

James Y. Dai, Email: jdai@fredhutch.org, Fred Hutchinson Cancer Research Center and University of Washington, Seattle, Washington.

Xinyi Cindy Zhang, Email: xzhan2@fredhutch.org, Fred Hutchinson Cancer Research Center, Seattle, Washington.

Ching-Yun Wang, Email: cywang@fredhutch.org, Fred Hutchinson Cancer Research Center and University of Washington, Seattle, Washington.

Charles Kooperberg, Email: clk@fredhutch.org, Fred Hutchinson Cancer Research Center and University of Washington, Seattle, Washington.

References

  1. Barlow WE. Robust variance estimation for the case-cohort design. Biometrics. 1994;50:1064–1072.
  2. Binder DA. Fitting Cox’s proportional hazards models from survey data. Biometrika. 1992;79(1):139–147.
  3. Borgan O, Langholz B, Samuelsen SO, Goldstein L, Pogoda J. Exposure stratified case-cohort designs. Lifetime Data Anal. 2000;6:39–58. doi: 10.1023/a:1009661900674.
  4. Breslow NE, Lubin JH, Marek P, Langholz B. Multiplicative models and cohort analysis. J. Amer. Statist. Assoc. 1983;78:1–12.
  5. Breslow NE, Robins JM, Wellner JA. Large sample theory for semiparametric regression models with two-phase, outcome-dependent sampling. Annals of Statistics. 2003;31:1110–1139.
  6. Buchbinder SP, Mehrotra DV, Duerr A, Fitzgerald DW, Mogg R, Li D, Gilbert PB, et al. Efficacy assessment of a cell-mediated immunity HIV-1 vaccine (the Step Study): a double-blind, randomised, placebo-controlled, test-of-concept trial. Lancet. 2008;372(9653):1881–1893. doi: 10.1016/S0140-6736(08)61591-3.
  7. Charlab R, Zhang L. Pharmacogenomics: historical perspective and current status. Methods Mol Biol. 2013;1015:3–22. doi: 10.1007/978-1-62703-435-7_1.
  8. Cox DR. Regression models and life tables (with discussion). J. R. Statist. Soc. B. 1972;34:187–220.
  9. Cox DR. Partial likelihood. Biometrika. 1975;62:269–276.
  10. Dai JY, Kooperberg C, LeBlanc M, Prentice RL. Two-stage testing procedures with independent filtering for genome-wide gene-environment interaction. Biometrika. 2012;99:929–944. doi: 10.1093/biomet/ass044.
  11. Dai JY, LeBlanc M, Kooperberg C. Semiparametric estimation exploiting covariate independence in two-phase randomized clinical trials. Biometrics. 2009;65:178–187. doi: 10.1111/j.1541-0420.2008.01046.x.
  12. Dai JY, Li SS, Gilbert PB. Case-only methods for competing risks models with application to assessing differential vaccine efficacy by viral and host genetics. Biostatistics. 2014;15(1):196–203. doi: 10.1093/biostatistics/kxt018.
  13. Dai JY, Logsdon BA, Huang Y, Hsu L, Reiner AP, Prentice RL, Kooperberg C. Simultaneously testing for marginal genetic association and gene-environment interaction. American Journal of Epidemiology. 2012;176:164–173. doi: 10.1093/aje/kwr521.
  14. Evans WE, McLeod HL. Pharmacogenomics-drug disposition, drug targets, and side effects. N Engl J Med. 2003;348(6):538–549. doi: 10.1056/NEJMra020526.
  15. Gail MH, Brinton LA, Byar DP, Corle DK, Green SB, Schairer C, Mulvihill JJ. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. J Natl Cancer Inst. 1989;81(24):1879–1886. doi: 10.1093/jnci/81.24.1879.
  16. Goldstein L, Langholz B. Asymptotic theory for nested case-control sampling in the Cox regression model. Annals of Statistics. 1992;20:1903–1928.
  17. Han SS, Rosenberg PS, Garcia-Closas M, Figueroa JD, Silverman D, Chanock SJ, Rothman N, Chatterjee N. Likelihood ratio test for detecting gene (G)-environment (E) interactions under an additive risk model exploiting G-E independence for case-control data. Am J Epidemiol. 2012;176:1060–1067. doi: 10.1093/aje/kws166.
  18. Janes H, Pepe MS, Bossuyt PM, Barlow WE. Measuring the performance of markers for guiding treatment decisions. Ann Intern Med. 2011;154(4):253–259. doi: 10.1059/0003-4819-154-4-201102150-00006.
  19. Langholz B, Borgan Y. Counter-matching: a stratified nested case-control sampling method. Biometrika. 1995;82:69–79.
  20. Langholz B, Thomas DC. Nested case-control and case-cohort methods of sampling from a cohort: a critical comparison. American Journal of Epidemiology. 1990;131(1):169–176. doi: 10.1093/oxfordjournals.aje.a115471.
  21. Li SS, Gilbert PB, Tomaras GD, Kijak G, Ferrari G, Thomas R, Pyo C, et al. FCGR2C polymorphisms associate with HIV-1 vaccine protection in RV144 trial. J Clin Invest. 2014;124(9):3879–3890. doi: 10.1172/JCI75539.
  22. Lin DY. On fitting Cox’s proportional hazards models to survey data. Biometrika. 2000;87:37–47.
  23. Lin DY, Wei LJ. The robust inference for the Cox proportional hazards model. Journal American Statistical Association. 1989;84:1074–1078.
  24. Lin DY, Ying Z. Cox regression with incomplete covariate measurements. Journal American Statistical Association. 1993;88:1341–1349.
  25. McElrath MJ, De Rosa SC, Moodie Z, Dubey S, Kierstead L, Janes H, Defawe OD, et al. HIV-1 vaccine-induced immunity in the test-of-concept Step Study: a case-cohort analysis. Lancet. 2008;372(9653):1894–1905. doi: 10.1016/S0140-6736(08)61592-5.
  26. Murphy KM, Topel RH. Estimation and inference in two-step econometric models. J Bus Econ Stat. 1985;3:370–379.
  27. Nan B. Efficient estimation for case-cohort studies. Canadian Journal of Statistics. 2004;32:403–419.
  28. Newey WK, Powell J. Efficient estimation of linear and type I censored regression models under conditional quantile restrictions. Econometric Theory. 1990;6:295–317.
  29. Pandey JP, Namboodiri AM, Bu S, Tapsoba JD, Sato A, Dai JY. Immunoglobulin genes and the acquisition of HIV infection in a randomized trial of recombinant adenovirus HIV vaccine. Virology. 2013;441(1):70–74. doi: 10.1016/j.virol.2013.03.007.
  30. Pitteri SJ, Hanash SH, Aragaki A, Amon LM, Chen L, Buson TB, Paczesny S, et al. Postmenopausal estrogen and progestin effects on the serum proteome. Genome Medicine. 2009;1(12):121. doi: 10.1186/gm121.
  31. Prentice RL. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika. 1986;73:1–11.
  32. Prentice RL, Breslow NE. Retrospective studies and failure time models. Biometrika. 1978;65:153–158.
  33. Prentice RL, Huang Y, Hinds DA, Peters U, Cox DR, Beilharz E, Chlebowski RT, et al. Variation in the FGFR2 gene and the effect of a low-fat dietary pattern on invasive breast cancer. Cancer Epidemiol Biomarkers Prev. 2010;19(1):74–79. doi: 10.1158/1055-9965.EPI-09-0663.
  34. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. Journal of American Statistical Association. 1994;89:846–866.
  35. Samuelsen SO. A pseudolikelihood approach to analysis of nested case-control studies. Biometrika. 1997;84:379–394.
  36. Self SG, Prentice RL. Asymptotic distribution theory and efficiency results for case-cohort studies. Annals of Statistics. 1988;16:64–81.
  37. Tchetgen EJ, Robins J. The semiparametric case-only estimator. Biometrics. 2010;66:1138–1144. doi: 10.1111/j.1541-0420.2010.01401.x.
  38. Therneau TM, Li H. Computing the Cox model for case-cohort designs. Lifetime Data Anal. 1999;5(2):99–112. doi: 10.1023/a:1009691327335.
  39. Thomas DC. Addendum to “methods of cohort analysis: appraisal by application to asbestos mining”. Journal of Royal Statistical Society, Serial A. 1977;140:483–485.
  40. Vittinghoff E, Bauer DC. Case-only analysis of treatment-covariate interactions in clinical trials. Biometrics. 2006;62:769–776. doi: 10.1111/j.1541-0420.2006.00511.x.
  41. Weinshilboum R, Wang L. Pharmacogenomics: bench to bedside. Nat Rev Drug Discov. 2004;3(9):739–748. doi: 10.1038/nrd1497.
