Author manuscript; available in PMC: 2018 Jan 1.
Published in final edited form as: Lifetime Data Anal. 2016 Jan 13;23(1):57–82. doi: 10.1007/s10985-015-9355-7

Recent progresses in outcome-dependent sampling with failure time data

Jieli Ding 1, Tsui-Shan Lu 2, Jianwen Cai 3, Haibo Zhou 3
PMCID: PMC4942414  NIHMSID: NIHMS751654  PMID: 26759313

Abstract

An outcome-dependent sampling (ODS) design is a retrospective sampling scheme where one observes the primary exposure variables with a probability that depends on the observed value of the outcome variable. When the outcome of interest is failure time, the observed data are often censored. By allowing the selection of the supplemental samples to depend on whether the event of interest happens or not, and by oversampling subjects from the most informative regions, an ODS design for time-to-event data can reduce the cost of the study and improve its efficiency. We review recent progress and advances in research on ODS designs with failure time data. This includes research on ODS-related designs such as the case–cohort design, generalized case–cohort design, stratified case–cohort design, general failure-time ODS design, length-biased sampling design and interval sampling design.

Keywords: Case–cohort design, ODS design, Failure time data

1 Introduction of original ODS design

Epidemiologic and biomedical observational studies that relate outcome of interest to individual exposure and other characteristics play a key role in understanding the determinants of diseases in humans. In many such studies, the major budget and cost typically arise from the assembling of primary exposure variables. Large cohort studies with simple random sampling are often too expensive to conduct for investigators with a limited budget. To reduce the cost and to achieve a prespecified power level, alternative cost-effective designs and procedures are thus desirable for studies with a limited budget.

An outcome-dependent sampling (ODS) design is a general term that describes a retrospective sampling scheme where one observes the primary exposure variables with a probability that depends on the observed value of the outcome variable. It is a useful, and more importantly, a cost-effective alternative to the more standard random sampling design. Under an ODS design, information on the primary exposure variables is assembled only for a sample that is selected from the underlying cohort population in a manner other than simple random sampling. The principal idea of an ODS design is to concentrate resources where there is the greatest amount of information. By allowing the selection probability of each individual in the ODS sample to depend on the outcome, the investigators attempt to enhance the efficiency and reduce the cost of a study. The well-known example is the case–control study in epidemiology, which is an ODS scheme for a binary outcome variable (Cornfield 1951).

Although in some applications the outcome event is intrinsically binary or categorical, there are many situations in which the outcome variable is actually measured continuously (e.g. failure time). One commonly used approach in epidemiologic studies is to dichotomize the continuous outcome and use the available methods for binary outcomes (White 1982; Breslow and Cain 1988; Scott and Wild 1991; Weinberg and Wacholder 1993; Schill et al. 1993; Breslow and Holubkov 1997; Breslow et al. 2003). Another commonly used approach in these situations is to stratify the range of the continuous outcome variable and then sample observations according to stratum-specific selection probabilities (Imbens and Lancaster 1996; Lawless et al. 1999).

Recent work has focused on a more general ODS design for continuous outcomes. Such an ODS design usually includes a simple random sample from the underlying population and some additional supplemental samples that are determined by the value of the outcome. The advantage of such an ODS design is that, while providing overall information about the population, it allows the investigators to target regions of the population that are believed to be more informative. Research on such sampling schemes is very active. Weaver (2001) and Zhou et al. (2002) developed semiparametric empirical likelihood inference procedures. Weaver and Zhou (2005) proposed a maximum estimated likelihood estimation approach. Chatterjee et al. (2003), Song et al. (2009), and Zhou et al. (2011b) developed inferential methodologies for the two-stage ODS design, which make efficient use of any additional outcome data that may be available for the entire study population. Qin and Zhou (2011) and Zhou et al. (2011a, d) studied inference procedures for the ODS design under partial linear models. Schildcrout and Heagerty (2008), Schildcrout and Rathouz (2010), and Schildcrout et al. (2012) discussed the ODS design and proposed analysis approaches for longitudinal data. Ding et al. (2012) developed a regression analysis under an ODS design for a missing data problem. A useful extension of the ODS design allows the selection probability to depend on not only the outcome but also an auxiliary variable; this is referred to as outcome and auxiliary-dependent subsampling (OADS). Wang and Zhou (2006, 2010) and Zhou et al. (2011c) proposed inference procedures for data from the OADS sampling scheme.
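As an illustration only, the general ODS scheme for a continuous outcome (a simple random sample plus supplemental samples from the tails of the outcome distribution) can be sketched in Python. The function name `ods_sample`, the two cut-points and the sample sizes are hypothetical choices made for this sketch; they are not taken from any of the cited papers.

```python
import random

def ods_sample(y, n_srs, n_low, n_high, cut_low, cut_high, seed=0):
    """Return index lists (srs, low, high) for a simple ODS design.

    y        : outcome values for the whole cohort
    n_srs    : size of the simple random sample (SRS)
    n_low    : supplemental sample size from the region y < cut_low
    n_high   : supplemental sample size from the region y > cut_high
    """
    rng = random.Random(seed)
    idx = list(range(len(y)))
    srs = rng.sample(idx, n_srs)                 # overall information about the population
    chosen = set(srs)
    # Supplemental pools: subjects in the tails that are not already in the SRS
    low_pool = [i for i in idx if i not in chosen and y[i] < cut_low]
    high_pool = [i for i in idx if i not in chosen and y[i] > cut_high]
    low = rng.sample(low_pool, min(n_low, len(low_pool)))
    high = rng.sample(high_pool, min(n_high, len(high_pool)))
    return srs, low, high
```

Exposure would then be measured only for the subjects in the three returned index lists, while the outcome is known for the whole cohort.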

In this paper, we review recent progress on ODS designs with failure time data. In the rest of the paper, Sect. 2 reviews the research work in the univariate failure time case. Section 3 reviews the work in the multivariate failure time case.

2 ODS designs with a univariate failure time

The literature reviewed above assumes that the outcome variables are completely observed. When the outcome of interest is failure time, the observed data are often censored. In this section, we introduce some biased-sampling schemes for failure time data and review the development of inferential methodologies for these biased-sampling studies.

2.1 Case–cohort design and its variations

For time-to-event data, the case–cohort design (Prentice 1986) is one of the most widely used biased-sampling schemes for censored failure-time data. The key idea of this study design is to obtain the measurements of primary exposure variables only on a subset of the entire cohort (the subcohort) and on all the subjects who experience the event of interest (the cases) in the cohort. Thus, case–cohort study designs are particularly useful for large-scale cohort studies with a low disease rate or for cohort studies in which the exposure is expensive to measure.

The requirement of sampling all the cases in the original case–cohort design limits the application of case–cohort study designs, since some diseases might not be rare. For such situations, a generalized case–cohort design has been proposed where, in addition to a subcohort, the information on exposure is assembled only for a subset of the failures instead of all the failures, to reduce the cost. In many applications, certain exposure variables, which are relatively easy and cheap to measure, are observed on all of the subjects in the cohort. Such data are referred to as the first-phase covariate data. Complete measurements of the primary exposure, which are expensive to assemble, are only collected for the subcohort and all cases. These data are referred to as the second-phase covariate data. To improve the efficiency of the original case–cohort design, a stratified case–cohort design selects the subcohort by a stratified sampling scheme that depends on the available first-phase covariate data. Besides the generalized case–cohort design and the stratified case–cohort design, many variations of the sampling schemes based on the original case–cohort design have been developed, and we refer to such biased-sampling schemes as modified case–cohort designs.
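The difference between the original and the generalized case–cohort design can be made concrete with a small sampling sketch. The function `case_cohort_sample` and its argument `case_fraction` are hypothetical names introduced here for illustration: `case_fraction = 1.0` mimics the original design (all cases outside the subcohort are sampled), and a value below 1 mimics a generalized design in which only a subset of the remaining cases is sampled.

```python
import random

def case_cohort_sample(delta, subcohort_size, case_fraction=1.0, seed=0):
    """Return (subcohort, extra_cases): index sets whose exposure is measured.

    delta          : failure indicators (1 = case, 0 = censored) for the cohort
    subcohort_size : number of subjects drawn by simple random sampling
    case_fraction  : share of the cases outside the subcohort that is sampled
    """
    rng = random.Random(seed)
    n = len(delta)
    subcohort = set(rng.sample(range(n), subcohort_size))
    # Cases that were not picked up by the random subcohort
    outside_cases = [i for i in range(n) if delta[i] == 1 and i not in subcohort]
    k = round(case_fraction * len(outside_cases))
    extra_cases = set(rng.sample(outside_cases, k))
    return subcohort, extra_cases
```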

The motivation, importance, and broad potential applications of case–cohort designs are widely discussed in the literature. Parametric models for case–cohort designs have been studied (Kalbfleisch and Lawless 1988; Nan et al. 2006). Statistical methods for fitting case–cohort data with semiparametric survival models have also been developed for the proportional hazards model (Prentice 1986; Self and Prentice 1988; Lin and Ying 1993; Barlow 1994; Chen and Lo 1999; Borgan et al. 2000; Chen 2001c; Cai and Zeng 2004, 2007; Kulich and Lin 2004; Qi et al. 2005; Breslow and Wellner 2007), the additive hazards model (Kulich and Lin 2000; Sun et al. 2004), the proportional odds model (Chen 2001a), the accelerated failure time model (Kong and Cai 2009), the semiparametric transformation models (Chen 2001b; Kong et al. 2004; Lu and Tsiatis 2006), among others. Various estimating procedures have been proposed for data from case–cohort studies. These have proceeded mainly along two lines, likelihood-based approaches and estimating-equation-based approaches.

Throughout this section, we suppose that there exists a study population of N independent individuals. Let $\tilde{T}_i$ denote the potential failure time and $C_i$ denote the censoring time for subject i (i = 1, …, N). The observed time is $T_i = \min(\tilde{T}_i, C_i)$. Let $\Delta_i = I(\tilde{T}_i \le C_i)$ denote the failure indicator for subject i, $Y_i(t) = I(T_i \ge t)$ denote the at-risk process and $N_i(t) = \Delta_i I(T_i \le t)$ denote the counting process, where I(·) is an indicator function. Let $Z_i(t)$ be a p-dimensional exposure variable for subject i at time t. Let β be a p-dimensional regression parameter of interest. Let τ denote the study end time.
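The notation above translates directly into code. The following minimal sketch, with hypothetical function names, computes the observed time, the failure indicator, the at-risk process and the counting process for a single subject.

```python
def observe(t_fail, c):
    """Observed time T = min(T~, C) and failure indicator Delta = I(T~ <= C)."""
    return min(t_fail, c), int(t_fail <= c)

def at_risk(T, t):
    """At-risk process Y(t) = I(T >= t)."""
    return int(T >= t)

def counting(T, delta, t):
    """Counting process N(t) = Delta * I(T <= t)."""
    return delta * int(T <= t)
```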

2.1.1 Original case–cohort designs

The case–cohort design was first formally proposed in the landmark article of Prentice (1986). In this design a subcohort is selected randomly from the full cohort, and complete exposure information is observed only for the subcohort subjects and for the additional cases outside the subcohort. Under the proposed case–cohort design, Prentice (1986) considered a relative risk regression model:

$$\lambda(t \mid Z_i(t)) = \lambda_0(t)\, r(\beta' Z_i(t)), \quad i = 1, \ldots, N,$$

where $\lambda_0(t)$ is the baseline hazard function and r(·) is a known function with r(0) = 1, for example, $r(x) = e^x$ for the proportional hazards model. Prentice (1986) proposed a pseudo-likelihood approach for estimation of the parameter β by maximizing the following objective function:

$$L_1(\beta) = \prod_{i=1}^{N} \left[ \frac{r(\beta' Z_i(T_i))}{\sum_{l \in \tilde{R}(T_i)} r(\beta' Z_l(T_i))} \right]^{\Delta_i} \tag{1}$$

where $\tilde{R}(t) = \{i : N_i(t) \ne N_i(t-)\} \cup S_0$, and $S_0$ denotes the index set of the subcohort. Note that the objective function (1) is a modification of the partial-likelihood function (Cox 1975) that weights the contributions of the cases and the subcohort differently. Since the expression in (1) does not generally possess a partial-likelihood interpretation, it was termed a pseudo-likelihood. Since the publication of Prentice (1986), the case–cohort design and related statistical methodologies have been extensively studied.
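To make the structure of the pseudo-likelihood concrete, the following sketch evaluates the logarithm of (1) for the proportional hazards choice of r with a scalar, time-fixed exposure. Several simplifications are assumptions of this sketch rather than part of the cited work: the function name is hypothetical, failure times are taken to be untied, and the sum in the denominator is restricted to members of the risk set who are still under observation at the failure time, as in the usual partial likelihood.

```python
import math

def log_pseudo_likelihood(beta, T, delta, Z, subcohort):
    """Log of the pseudo-likelihood (1) with r(x) = exp(x) and scalar, time-fixed Z.

    T, delta  : observed times and failure indicators for the full cohort
    Z         : exposures; entries are only read for cases and subcohort members
    subcohort : set of subcohort indices S_0
    """
    total = 0.0
    for i in range(len(T)):
        if delta[i] != 1:
            continue                       # only failures contribute a factor
        # Risk set at T_i: the failing case plus subcohort members,
        # restricted here to those still at risk (assumption of this sketch)
        risk = {l for l in subcohort if T[l] >= T[i]} | {i}
        denom = sum(math.exp(beta * Z[l]) for l in risk)
        total += beta * Z[i] - math.log(denom)
    return total
```

A numerical optimizer applied to this function over `beta` would give a pseudo-likelihood estimate under these simplifying assumptions.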

Self and Prentice (1988) further elaborated such pseudo-likelihood estimators by slightly modifying the risk set, replacing $\tilde{R}(t)$ in (1) with $S_0$. Estimators obtained from the modified pseudo-likelihood function were proved to be asymptotically equivalent to the pseudo-likelihood estimators defined in Prentice (1986). Asymptotic distribution theory for such pseudo-likelihood estimators and the corresponding cumulative failure rate estimators was presented. Lin and Ying (1993) and Barlow (1994) further discussed the pseudo-likelihood methods and provided different ways to obtain easily computed variances for the estimators.

Chen and Lo (1999) improved the pseudo-likelihood estimators (Prentice 1986) by utilizing the information of all cases in constructing the risk set to derive estimating equations. Under the proportional hazards model,

$$\lambda(t \mid Z_i(t)) = \lambda_0(t) \exp\{\beta' Z_i(t)\}, \quad i = 1, \ldots, N, \tag{2}$$

the pseudo-likelihood (1) of Prentice (1986) yields the score function:

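$$U_1(\beta) = \sum_{i=1}^{N} \int_0^{\tau} \left[ Z_i(t) - \frac{\sum_{l \in \tilde{R}(t)} Z_l(t)\, e^{\beta' Z_l(t)}}{\sum_{l \in \tilde{R}(t)} e^{\beta' Z_l(t)}} \right] dN_i(t) = 0. \tag{3}$$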

Chen and Lo (1999) modified the risk set $\tilde{R}(t)$ used in the above score function by including more information from the cases. Let $N_1$ ($n_1$) and $N_0$ ($n_0$) be the numbers of cases and controls in the cohort (subcohort), respectively. Let $R_1$ ($\tilde{R}_1$) and $R_0$ ($\tilde{R}_0$) be the index sets of all cases and all controls in the cohort (subcohort), respectively. Denote $R_1(t) = \{i : T_i \ge t,\ i \in R_1\}$ and $\tilde{R}_0(t) = \{i : T_i \ge t,\ i \in \tilde{R}_0\}$, i.e., the risk sets defined on $R_1$ and $\tilde{R}_0$, respectively. The following estimating equation was proposed by Chen and Lo (1999),

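$$U_2(\beta) = \sum_{i=1}^{N} \int_0^{\tau} \left[ Z_i - \frac{\frac{\hat{p}}{N_1} \sum_{l \in R_1(t)} Z_l e^{\beta' Z_l} + \frac{1-\hat{p}}{n_0} \sum_{l \in \tilde{R}_0(t)} Z_l e^{\beta' Z_l}}{\frac{\hat{p}}{N_1} \sum_{l \in R_1(t)} e^{\beta' Z_l} + \frac{1-\hat{p}}{n_0} \sum_{l \in \tilde{R}_0(t)} e^{\beta' Z_l}} \right] dN_i(t) = 0, \tag{4}$$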

where $\hat{p}$ is an estimator of the population case probability p = P(Δ = 1). They derived a class of estimating equations by using different estimators of p. Let $\hat{p} = n_1/n$, then (4) becomes

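$$U_3(\beta) = \sum_{i=1}^{N} \int_0^{\tau} \left( Z_i - \frac{\frac{n_1}{n N_1} \sum_{l \in R_1(t)} Z_l e^{\beta' Z_l} + \frac{1}{n} \sum_{l \in \tilde{R}_0(t)} Z_l e^{\beta' Z_l}}{\frac{n_1}{n N_1} \sum_{l \in R_1(t)} e^{\beta' Z_l} + \frac{1}{n} \sum_{l \in \tilde{R}_0(t)} e^{\beta' Z_l}} \right) dN_i(t) = 0.$$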

Substituting $\hat{p} = N_1/N$ into (4) gives

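$$U_4(\beta) = \sum_{i=1}^{N} \int_0^{\tau} \left[ Z_i - \frac{\frac{1}{N} \sum_{l \in R_1(t)} Z_l e^{\beta' Z_l} + \frac{N_0}{n_0 N} \sum_{l \in \tilde{R}_0(t)} Z_l e^{\beta' Z_l}}{\frac{1}{N} \sum_{l \in R_1(t)} e^{\beta' Z_l} + \frac{N_0}{n_0 N} \sum_{l \in \tilde{R}_0(t)} e^{\beta' Z_l}} \right] dN_i(t) = 0.$$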

By including more information from the cases in constructing the estimating equations, the estimators proposed by Chen and Lo (1999) improve on the pseudo-likelihood estimators of Prentice (1986) by achieving better efficiency.
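As a rough illustration of how the estimating function with the case probability estimated by the cohort case proportion could be evaluated, the sketch below computes it for a scalar, time-fixed exposure. The function name is hypothetical, and the code assumes that at least one control belongs to the subcohort. Cases at risk enter with weight 1/N and subcohort controls at risk enter with weight N0/(n0 N), as in the last displayed equation.

```python
import math

def chen_lo_score(beta, T, delta, Z, subcohort):
    """Estimating function with p-hat = N1/N for scalar, time-fixed Z.

    Z is only read for cases and for subcohort controls.
    """
    N = len(T)
    cases = [i for i in range(N) if delta[i] == 1]             # all cases in the cohort
    sub_controls = [i for i in subcohort if delta[i] == 0]     # controls in the subcohort
    N0, n0 = N - len(cases), len(sub_controls)
    score = 0.0
    for i in cases:                        # dN_i(t) jumps only at the failure time of a case
        t = T[i]
        num = den = 0.0
        for l in cases:
            if T[l] >= t:                  # cases still at risk, weight 1/N
                e = math.exp(beta * Z[l]) / N
                num += Z[l] * e
                den += e
        for l in sub_controls:
            if T[l] >= t:                  # subcohort controls at risk, weight N0/(n0 N)
                e = math.exp(beta * Z[l]) * N0 / (n0 * N)
                num += Z[l] * e
                den += e
        score += Z[i] - num / den
    return score
```

When the subcohort is the whole cohort, n0 equals N0 and the function reduces to the usual full-cohort partial-likelihood score, which gives a simple sanity check.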

Kulich and Lin (2000) proposed an inverse probability weighted estimating approach for the regression parameters of the additive hazards model, which has the form

$$\lambda(t \mid Z_i(t)) = \lambda_0(t) + \beta' Z_i(t), \quad i = 1, \ldots, N. \tag{5}$$

Under the case–cohort design, let $\xi_i$ be the subcohort indicator, taking the value 1 if the ith subject is selected into the subcohort and 0 otherwise. Denote by $\pi_i = P(\xi_i = 1)$ the selection probability of the ith subject. Applying the inverse probability weighted approach, Kulich and Lin (2000) defined the weights as

$$w_i = \Delta_i + (1 - \Delta_i) \frac{\xi_i}{\pi_i}, \quad i = 1, \ldots, N,$$

and derived the weighted estimating equation as

$$U_5(\beta) = \sum_{i=1}^{N} \int_0^{\tau} w_i \{Z_i(t) - \bar{Z}(t)\} \{dN_i(t) - Y_i(t) \beta' Z_i(t)\, dt\} = 0, \tag{6}$$

where $\bar{Z}(t) = \sum_{i=1}^{N} w_i Y_i(t) Z_i(t) \big/ \sum_{i=1}^{N} w_i Y_i(t)$. The resulting estimator has a closed form:

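$$\hat{\beta}_5 = \left[ \sum_{i=1}^{N} \int_0^{\tau} w_i Y_i(t) \{Z_i(t) - \bar{Z}(t)\}^{\otimes 2}\, dt \right]^{-1} \left[ \sum_{i=1}^{N} \int_0^{\tau} \{Z_i(t) - \bar{Z}(t)\}\, dN_i(t) \right],$$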

where $a^{\otimes 2} = aa'$. Kulich and Lin (2000) studied how to fit the additive hazards model to case–cohort data; this model is an important alternative to the proportional hazards model when researchers are interested in risk differences rather than risk ratios. By including in $\bar{Z}(t)$ the primary exposure information from all the cases, regardless of whether or not they belong to the subcohort, the proposed estimating procedure makes fuller use of the exposure information from both the cases and the controls. Furthermore, the proposed method can also be applied when the subcohort is selected by Bernoulli sampling with arbitrary selection probabilities, or possibly by stratified simple random sampling.
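Because the estimator has a closed form, it is straightforward to compute in simple settings. The sketch below is a minimal illustration for a scalar, time-fixed exposure, with hypothetical function names and with τ taken to be the largest observed time among subjects with positive weight; these choices are assumptions of the sketch. It uses the fact that the weighted average exposure is constant between consecutive observed times, so the time integral reduces to a finite sum.

```python
def ipw_weights(delta, xi, pi):
    """w_i = Delta_i + (1 - Delta_i) * xi_i / pi_i: cases get 1, sampled controls 1/pi_i."""
    return [d + (1 - d) * x / p for d, x, p in zip(delta, xi, pi)]

def additive_hazards_estimate(T, delta, Z, w):
    """Closed-form weighted estimator for the additive hazards model, scalar time-fixed Z."""
    used = [i for i in range(len(T)) if w[i] > 0]      # only these need Z observed

    def zbar(t):
        # Weighted average exposure among subjects at risk at time t
        risk = [i for i in used if T[i] >= t]
        return sum(w[i] * Z[i] for i in risk) / sum(w[i] for i in risk)

    info, prev = 0.0, 0.0
    for b in sorted({T[i] for i in used}):
        zb = zbar(b)                                   # the average is constant on (prev, b]
        info += (b - prev) * sum(w[i] * (Z[i] - zb) ** 2 for i in used if T[i] >= b)
        prev = b
    # The counting process of a case jumps once, at its observed failure time
    num = sum(Z[i] - zbar(T[i]) for i in used if delta[i] == 1)
    return num / info
```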

Chen (2001a) proposed a weighted semiparametric likelihood method for case–cohort studies under the proportional odds model, in which,

$$P(\tilde{T}_i > t \mid Z_i) = \frac{1}{1 + \exp\{\beta' Z_i\} \Lambda_0(t)}, \quad i = 1, \ldots, N,$$

where $\Lambda_0(t)$ is the baseline cumulative hazard function. Let $S_0$ be the index set of the subcohort and $S_1$ be the index set of the cases outside the subcohort. Denote $\tilde{S}_0 = \{i \in S_0,\ \Delta_i = 0\}$ and $\tilde{S}_1 = \{i \in S_0 \cup S_1,\ \Delta_i = 1\}$. Let $n_{\tilde{S}_0}$ and $n_{\tilde{S}_1}$ denote the sample sizes of $\tilde{S}_0$ and $\tilde{S}_1$, respectively. Chen (2001a) derived the estimator of the regression parameter β by maximizing the objective function:

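$$L_2(\beta) = \prod_{i \in \tilde{S}_0} \left[ \frac{1}{1 + \exp\{\beta' Z_i\} \Lambda_0(T_i)} \right]^{\frac{N - n_{\tilde{S}_1}}{n_{\tilde{S}_0}}} \prod_{i \in \tilde{S}_1} \frac{\exp\{\beta' Z_i\}\, d\Lambda_0(T_i)}{\left[ 1 + \exp\{\beta' Z_i\} \Lambda_0(T_i) \right]^2}, \tag{7}$$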

where $\Lambda_0(t)$ is restricted to the class of monotonically increasing functions of the form $\Lambda_0(T_i) = \sum_{j \in \tilde{S}_1} Y_i(T_j) \lambda_j$, that is, with jumps at the observed failure times only. The proposed objective function (7) is a geometrically weighted version of the so-called complete-case likelihood function, hence it was called simply the weighted semiparametric likelihood. Chen (2001a) studied the case–cohort design under the proportional odds model, which is a potentially useful alternative in applications where the proportional hazards model does not fit the data well. The proposed estimating procedure is applicable to the semiparametric transformation model. In particular, under the proportional hazards model, the estimators of Chen and Lo (1999) can be generated by the approach of Chen (2001a) with an appropriate weighting scheme.

Kong et al. (2004) considered the following semiparametric transformation models:

$$H(\tilde{T}_i) = -Z_i' \beta + \varepsilon_i, \quad i = 1, \ldots, N, \tag{8}$$

where H is an unspecified strictly increasing function and ε is a random error with a known distribution function F. The main idea of Kong et al. (2004) is to regard the case–cohort design as a special case of general missing data problems. The exposure variables are missing by design in case–cohort studies, so the missing mechanism is clearly known. Following the inference of model (8) for complete data, Kong et al. (2004) introduced an extra parameter $\gamma = H(t_0)$, where $t_0$ is a prespecified constant such that $P(\min(\tilde{T}, C) > t_0) > 0$, and then obtained the parameter vector θ = (β′, γ)′. Suppose a subcohort of size n is selected randomly from the cohort. Let $\xi_i$ denote the indicator for the ith subject being selected into the subcohort. Assume $P(\xi_i = 1) = \pi = n/N$, which means each subject has the same probability of being selected into the subcohort.

Motivated by the idea of weighting the incomplete data by the inverse selection probabilities, Kong et al. (2004) defined a weight $w_{ij}$ to reflect the contribution of a pair of subjects i and j to the estimating function as

$$w_{ij} = w_i w_j,$$

where

$$w_i = \Delta_i + (1 - \Delta_i) \frac{\xi_i}{\pi}.$$

For estimation of the parameter vector θ, the weighted estimating equation was proposed

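$$U_6(\theta) = \sum_{i=1}^{N} \sum_{j=1, j \ne i}^{N} w_{ij}\, \rho_{ij}(\theta)\, \dot{\eta}_{ij}(\theta) \left[ \frac{\Delta_j I\{\min(T_i, t_0) \ge T_j\}}{\hat{G}_n^2(T_j)} - \eta_{ij}(\theta) \right] = 0, \tag{9}$$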

where $\rho_{ij}(\theta)$ is a positive weight function, $\eta_{ij}(\theta) = \int_{-\infty}^{\gamma} \{1 - F(v + Z_i' \beta)\}\, dF(v + Z_j' \beta)$, $\dot{\eta}_{ij}(\theta) = \partial \eta_{ij}(\theta) / \partial \theta$, and $\hat{G}_n(\cdot)$ is the Kaplan–Meier estimator of the survival function for the censoring time based on the subcohort data. Two types of weight were used in the estimating function (9): the weight $w_{ij}$ takes into account the sampling design effect, and $\rho_{ij}(\theta)$ was introduced to improve the efficiency of the estimating equations. In some applications, the proportional hazards model may not fit the data well, or researchers may be interested in modelling the association from different aspects. The semiparametric transformation models, which incorporate a variety of nonproportional hazards models, can be a more flexible choice in such situations. Kong et al. (2004) established statistical methods for case–cohort data under the semiparametric transformation models.

Lu and Tsiatis (2006) developed weighted estimating equations for the parameters of the semiparametric transformation models in (8) under the case–cohort design. Inspired by the methods for semiparametric transformation models with complete data, Lu and Tsiatis (2006) considered a martingale process defined as

$$M_i(t) = N_i(t) - \int_0^{t} Y_i(s)\, d\Lambda\{H(s) + Z_i' \beta\},$$

where Λ(t) denotes the cumulative hazard function for ε in (8). Suppose a subcohort of size n is selected randomly from the cohort. Let ξi denote the indicator for the ith subject being selected into the subcohort. Assume each subject has the same probability P(ξi = 1) = π = n/N of being selected into the subcohort. Lu and Tsiatis (2006) adopted weights as

$$w_i = \Delta_i + (1 - \Delta_i) \frac{\xi_i}{\pi}, \quad i = 1, \ldots, N,$$

and proposed to use the following estimating equations

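$$U_9(\beta) = \sum_{i=1}^{N} w_i \left[ dN_i(t) - Y_i(t)\, d\Lambda\{H(t) + Z_i' \beta\} \right] = 0, \quad (t \ge 0), \quad H(0) = -\infty,$$
$$U_{10}(\beta) = \sum_{i=1}^{N} \int_0^{\tau} w_i Z_i \left[ dN_i(t) - Y_i(t)\, d\Lambda\{H(t) + Z_i' \beta\} \right] = 0,$$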

to estimate H and β. As mentioned above, Kong et al. (2004) studied the case–cohort design under semiparametric transformation models by regarding case–cohort data as a special missing data problem. In the model of Kong et al. (2004), the censoring time was assumed to be independent of the exposure variables. Lu and Tsiatis (2006) developed a different way of estimating the parameters. Their procedure makes use of a martingale integral representation and an inverse probability weighted approach in constructing the estimating equations, and it allows the censoring time to depend on the exposure variables. Breslow and Wellner (2007) further studied the theory of inverse probability weighted methods under semiparametric models with two-phase stratified samples.

Qi et al. (2005) considered weighted estimators for the proportional hazards model (2) with missing exposure variables. Suppose that some elements of the exposure variable Z are missing. Define $Z_i = (Z_i^m, Z_i^c)$, where $Z_i^c$ denotes the exposure variables for the ith subject that are always observed and $Z_i^m$ denotes the exposure variables that are sometimes missing. Let $\xi_i$ denote the selection indicator, which equals 1 if $Z_i^m$ is available and 0 if $Z_i^m$ is missing for the ith subject. The missing-data mechanism is determined by the distribution of $\xi_i$ given $(T_i, \Delta_i, Z_i^c)$, which is Bernoulli with probability $\pi_i = P(\xi_i = 1 \mid T_i, \Delta_i, Z_i^c)$. When the selection probability $\pi = (\pi_1, \ldots, \pi_N)'$ is known, under the proportional hazards model, Qi et al. (2005) first proposed a weighted estimating function as

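$$U_7(\beta, \pi) = \sum_{i=1}^{N} w_i \int_0^{\tau} \left[ Z_i - \frac{\sum_{l=1}^{N} w_l Y_l(t) Z_l e^{\beta' Z_l}}{\sum_{l=1}^{N} w_l Y_l(t) e^{\beta' Z_l}} \right] dN_i(t) = 0, \tag{10}$$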

where

$$w_i = \frac{\xi_i}{\pi_i}, \quad i = 1, \ldots, N.$$

The estimator obtained from the above estimating equation may not be efficient. In an attempt to improve efficiency, an estimator of π was used in the estimating function (10). Qi et al. (2005) applied nonparametric kernel smoothing techniques to estimate π based on the observed data, including complete and incomplete observations. Let W denote the variables on which an estimator of π is allowed to depend. The rth-order kernel function K is a piecewise smooth function that satisfies $\int K(u)\, du = 1$, $\int u^m K(u)\, du = 0$ for m = 1, …, (r − 1), $\int u^r K(u)\, du \ne 0$, and $\int K(u)^2\, du < \infty$. Define $K_h(\cdot) = K(\cdot/h)$, where h is the bandwidth. The following estimator

$$\hat{\pi}(w) = \frac{\sum_{i=1}^{N} \xi_i K_h(w - W_i)}{\sum_{j=1}^{N} K_h(w - W_j)},$$

was proposed to estimate π. Replacing π in the weighted estimating function (10) with $\hat{\pi}$, Qi et al. (2005) derived a new estimating equation:

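$$U_8(\beta, \pi) = \sum_{i=1}^{N} \int_0^{\tau} \hat{w}_i \left[ Z_i - \frac{\sum_{l=1}^{N} \hat{w}_l Y_l(t) Z_l e^{\beta' Z_l}}{\sum_{l=1}^{N} \hat{w}_l Y_l(t) e^{\beta' Z_l}} \right] dN_i(t) = 0,$$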

where

$$\hat{w}_i = \frac{\xi_i}{\hat{\pi}_i}, \quad i = 1, \ldots, N.$$

Under the proportional hazards model with missing exposure variables, Qi et al. (2005) presented both simple weighted and kernel-assisted fully augmented weighted estimators, and the latter one is more efficient than the former one. The proposed methods require neither a model for the missing-data mechanism nor specification of the conditional distribution of the missing exposure variable. The proposed methods allow the missing-data mechanism to depend on outcome variables and observed exposure variables, which makes the proposed estimating procedure applicable to various cohort sampling designs, including the case–cohort design.
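The kernel estimate of the selection probability is a ratio of kernel-weighted sums and can be sketched in a few lines. The code below is only an illustration under simplifying assumptions that are not taken from Qi et al. (2005): W is a single continuous variable, the kernel is the standard Gaussian density (a second-order kernel), and the function names are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as a second-order kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def pi_hat(w, W, xi, h):
    """Kernel estimate of P(xi = 1 | W = w), with K_h(u) = K(u / h)."""
    k = [gaussian_kernel((w - Wi) / h) for Wi in W]
    return sum(x * ki for x, ki in zip(xi, k)) / sum(k)

def kernel_weights(W, xi, h):
    """Estimated weights xi_i / pi-hat(W_i); zero when the exposure is missing."""
    return [x / pi_hat(Wi, W, xi, h) if x else 0.0 for Wi, x in zip(W, xi)]
```

For a subject with an observed exposure, the estimated probability at its own value of W is strictly positive because the subject itself contributes to the numerator, so the weight is always well defined.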

Nan et al. (2006) studied how to fit a linear regression model to case–cohort data. The model relates the failure time directly to the primary exposure variable in the form:

$$\tilde{T}_i = \beta' Z_i + \varepsilon_i, \quad i = 1, \ldots, N, \tag{11}$$

where, given (Zi, Ci), the εi's are independent and identically distributed with an unknown distribution. For the case–cohort study, the estimating equation was proposed

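$$U_{11}(\beta) = \sum_{i=1}^{N} \int_0^{\tau} \left[ Z_i - \frac{\sum_{l \in S_0} Z_l Y_l(u + \beta' Z_l)}{\sum_{l \in S_0} Y_l(u + \beta' Z_l)} \right] dN_i(u + \beta' Z_i) = 0, \tag{12}$$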

where $S_0$ denotes the index set of the subcohort. Nan et al. (2006) developed statistical methods for the case–cohort design under a linear regression model, which is an important alternative way of analyzing failure time data. The proposed weighted estimating equations are derived by modifying the linear rank tests and estimating equations that arise from full-cohort data, using methods similar to those applied by Self and Prentice (1988) for the proportional hazards model.

Kong and Cai (2009) considered the accelerated failure time model:

$$\log(\tilde{T}_i) = \beta' Z_i + \varepsilon_i, \quad i = 1, \ldots, N, \tag{13}$$

where the $\varepsilon_i$'s are independent and identically distributed random errors with an unspecified distribution. Define $e_i(\beta) = \log(T_i) - Z_i' \beta$, $N_i(\beta; t) = \Delta_i I(e_i(\beta) \le t)$, and $Y_i(\beta; t) = I(e_i(\beta) \ge t)$, for i = 1, …, N. For the case–cohort design, suppose a subcohort of size n is selected randomly from the cohort. Let $\xi_i$ denote the indicator for the ith subject being selected into the subcohort. Assume each subject has the same probability π = n/N of being selected into the subcohort. Kong and Cai (2009) adopted the weights

$$w_i = \Delta_i + (1 - \Delta_i) \frac{\xi_i}{\pi}, \quad i = 1, \ldots, N,$$

and developed a rank-based estimating equation approach by solving the score function

$$U_{12}(\beta) = \sum_{i=1}^{N} \Delta_i\, \phi(\beta; e_i(\beta)) \{Z_i - \tilde{Z}(\beta; e_i(\beta))\} = 0, \tag{14}$$

where $\tilde{Z}(\beta; t) = \sum_{l=1}^{N} w_l Z_l Y_l(\beta; t) \big/ \sum_{l=1}^{N} w_l Y_l(\beta; t)$, and ϕ is a possibly data-dependent weight function. The choices $\phi(\beta; t) = 1$ and $\phi(\beta; t) = N^{-1} \sum_{i=1}^{N} Y_i(\beta; t)$ correspond to the log-rank and Gehan statistics, respectively. Because it relates the natural logarithm of the failure time linearly to the exposure, the accelerated failure time model may be attractive for modelling failure time data in some applications. Kong and Cai (2009) developed a rank-based estimating approach for analyzing case–cohort data under the accelerated failure time model. Furthermore, the proposed method is also valid for the usual linear model. Compared with the estimating function (12) used by Nan et al. (2006), the proposed estimating approach includes failures outside the subcohort in constructing $\tilde{Z}(\beta; t)$ in Eq. (14). Therefore, the estimators of Kong and Cai (2009) may be more efficient.
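A minimal sketch of how the rank-based estimating function in (14) could be evaluated is given below for a scalar exposure and the log-rank choice of the weight function (equal to 1). The function name is hypothetical, and restricting attention to the log-rank weight is a simplification made here because it avoids computing residuals for subjects whose exposure is not observed.

```python
import math

def aft_rank_score(beta, T, delta, Z, w):
    """Weighted log-rank estimating function (14) for scalar Z.

    w : case-cohort weights (1 for cases, 1/pi for subcohort controls, 0 otherwise)
    Z : exposures; only read for subjects with positive weight
    """
    used = [i for i in range(len(T)) if w[i] > 0]
    # Residuals e_i(beta) = log(T_i) - Z_i * beta
    e = {i: math.log(T[i]) - beta * Z[i] for i in used}
    score = 0.0
    for i in used:
        if delta[i] != 1:
            continue
        # Subjects whose residual is at least e_i are "at risk" on the residual scale
        risk = [l for l in used if e[l] >= e[i]]
        z_tilde = sum(w[l] * Z[l] for l in risk) / sum(w[l] for l in risk)
        score += Z[i] - z_tilde
    return score
```

Because this function is a step function of `beta`, a root is usually sought by minimizing its absolute value or by a bracketing search rather than by Newton-type iterations.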

2.1.2 Generalized case–cohort designs

The case–cohort design is used primarily to reduce the cost involved in the assembly of the exposure information. The censoring times of the subjects who are not included in the subcohort may be much less costly to obtain. Chen (2001b) studied a modified case–cohort design in which the censoring times of subjects not included in the subcohort are used in parameter estimation. He considered a more general specification of semiparametric transformation regression models, which assumes

$$P(\tilde{T}_i > t \mid Z_i) = \phi(\beta, Z_i, H(t)), \quad i = 1, \ldots, N, \tag{15}$$

where $\phi$ is assumed known. Model (15) reduces to the usual semiparametric models for certain specified forms of $\phi$. Let $\varphi$ be the derivative of $-\phi$ with respect to the third argument. Denote $G(t) = \int \phi(\beta, z, H(t))\, dQ(z)$, where Q is the marginal distribution of the exposure variable Z, and $\bar{G}(t) = 1 - G(t)$. Let $\nu(\beta, Q, G(t))$ denote the inverse transformation.

Chen (2001b) proposed a maximum conditional-profile-likelihood method to fit the above model (15) to data from the modified case–cohort design. Let $S_0$ be the index set of the subcohort, $S_1$ be the index set of the cases that are not selected in the subcohort, and $S_2$ be the index set of the remaining subjects. With the observations $(T_i, \Delta_i, Z_i)$ for $i \in S_0 \cup S_1$, and $(T_i, \Delta_i)$ for $i \in S_2$, he proposed the conditional likelihood:

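$$L_3(\beta) = \prod_{i \in S_0 \cup S_1} \left[ \frac{\varphi(\beta, Z_i, \nu(\beta, Q, G(T_i)))}{\int \varphi(\beta, z, \nu(\beta, Q, G(T_i)))\, dQ(z)} \right]^{\Delta_i} \times \left[ \frac{\varphi(\beta, Z_i, \nu(\beta, Q, G(T_i)))}{G(T_i)} \right]^{1 - \Delta_i} \times \prod_{i \in S_0 \cup S_1 \cup S_2} G(T_i)^{\Delta_i} \{dG(T_i)\}^{1 - \Delta_i}.$$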

Since neither G nor Q is known, G was replaced by the Kaplan–Meier estimator and Q was replaced by the empirical estimator based on the random subcohort $S_0$ to obtain a conditional profile likelihood function. An estimator of β can then be derived by maximizing the resulting profile likelihood. Chen (2001b) thus considered the problem of fitting more flexible semiparametric transformation regression models to data from a modified case–cohort design, in which an efficiency gain may arise because the censoring times of all the censored subjects in the cohort are included.

Chen (2001c) defined a generalized case–cohort design, which consists of a number of sampling steps. Each step takes a random sample from a certain subset of the cohort, and the design of the sample size and subset at each step and of the total number of steps is independent of the observed exposure. Such a generalized case–cohort design covers the case–control design, the nested case–control design and the original case–cohort design. Under the proportional hazards model (2) for data from the proposed generalized case–cohort design, Chen (2001c) developed a weighted estimating equation approach, in which the weights were obtained from the idea of estimating each missing exposure variable by a local average. Let $\xi_i$ denote the indicator, equaling 1 if the ith subject is sampled and 0 otherwise. Let $0 = t_0 \le t_1 \le \cdots \le t_{a_N} = \tau$ and $0 = s_0 \le s_1 \le \cdots \le s_{b_N} = \tau$ be two partitions of [0, τ). The partitions may be data dependent but should only depend on $(T_i, \Delta_i, \xi_i)$, i = 1, …, N. Let $r_N(t, d)$ be a step function defined on [0, τ) × {0, 1} such that

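$$r_N(t, d) = \begin{cases} \dfrac{\sum_{l=1}^{N} \Delta_l \xi_l I(T_l \in [t_{i-1}, t_i))}{\sum_{l=1}^{N} \Delta_l I(T_l \in [t_{i-1}, t_i))}, & \text{if } d = 1 \text{ and } t \in [t_{i-1}, t_i), \\[2ex] \dfrac{\sum_{l=1}^{N} (1 - \Delta_l) \xi_l I(T_l \in [s_{j-1}, s_j))}{\sum_{l=1}^{N} (1 - \Delta_l) I(T_l \in [s_{j-1}, s_j))}, & \text{if } d = 0 \text{ and } t \in [s_{j-1}, s_j), \end{cases}$$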

for $1 \le i \le a_N$ and $1 \le j \le b_N$. Defining the weights as

$$w_i = \frac{\xi_i}{r_N(T_i, \Delta_i)}, \quad i = 1, \ldots, N,$$

Chen (2001c) proposed the weighted estimating equation as

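$$U_{13}(\beta) = \sum_{i=1}^{N} \int_0^{\tau} w_i \left[ h(Z_i(t)) - \frac{\sum_{l=1}^{N} w_l Y_l(t)\, h(Z_l(t))\, e^{\beta' Z_l(t)}}{\sum_{l=1}^{N} w_l Y_l(t)\, e^{\beta' Z_l(t)}} \right] dN_i(t) = 0,$$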

where h is an exposure-related process.
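The local-average weights can be computed by counting, within each cell of the time partition, how many cases (or controls) were sampled out of how many are present. The sketch below illustrates this; the function name is hypothetical, and the partitions are passed as lists of interior cut-points, which is a representational choice made for this example.

```python
import bisect

def local_average_weights(T, delta, xi, case_cuts, control_cuts):
    """Weights xi_i / r_N(T_i, Delta_i), with r_N the sampled fraction in the cell of subject i.

    case_cuts, control_cuts : sorted interior cut-points of the two partitions of [0, tau)
    """
    def cell(t, cuts):
        # Index of the left-closed interval [cut_{k-1}, cut_k) that contains t
        return bisect.bisect_right(cuts, t)

    keys = [(d, cell(t, case_cuts if d == 1 else control_cuts)) for t, d in zip(T, delta)]
    total, sampled = {}, {}
    for key, x in zip(keys, xi):
        total[key] = total.get(key, 0) + 1
        sampled[key] = sampled.get(key, 0) + x
    # A sampled subject always has at least one sampled member in its own cell
    return [x * total[key] / sampled[key] if x else 0.0 for key, x in zip(keys, xi)]
```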

The sample reuse approach via local averaging proposed by Chen (2001c) is more efficient than the typical approach via inclusion probabilities. By choosing h(x) = x, Chen (2001c) improved the pseudo-likelihood estimators of Prentice (1986). Despite more complexity and difficulty, a semiparametric efficient estimator can be obtained by choosing h to be the exposure variable transformed by the inverse of an estimated linear integral operator. Samuelsen et al. (2007) further discussed Chen's approach and pointed out how it is related to stratified case–cohort analysis. They studied an extension of Chen's generalized case–cohort design to allow for surrogate-dependent sampling and showed how such data may be analyzed with the post-stratification method.

Cai and Zeng (2007) extended the case–cohort design to a generalized case–cohort design, where only a fraction of the cases instead of all cases is sampled. The main difference between such a generalized case–cohort design and the original case–cohort design is that not all the remaining cases are selected for assembling the exposure measurements. Cai and Zeng (2007) proposed a general log-rank test statistic, which was constructed by approximating the risk set and the event process of the complete data using the sampled data. Kang and Cai (2009) and Kang et al. (2013) further developed statistical inference for generalized case–cohort studies with multiple disease outcomes. The methods can easily be reduced to the situation with a univariate outcome. We discuss them in detail later.

2.1.3 Stratified case–cohort designs

In order to improve the efficiency of the case–cohort study by making better use of the first-phase covariate data, several papers have discussed stratified case–cohort designs.

Kulich and Lin (2004) developed a general class of weighted estimators under a stratified case–cohort design. Consider a cohort of N subjects who can be divided into K mutually exclusive strata based on a discrete random variable V, which represents the first-phase covariate information. Let ξ denote the selection indicator of a subject into the subcohort. For each k = 1, …, K, let $P(\xi = 1 \mid V = k) = \pi_k$. Let $N_k$ denote the number of subjects in the kth stratum. Under the stratified case–cohort design, complete observations $(T_{ki}, \Delta_{ki}, Z_{ki}(t), 0 \le t \le \tau, V_{ki}, \xi_{ki} = 1)$ are available for all subcohort subjects, and at least $(T_{ki}, \Delta_{ki} = 1, Z_{ki}(T_{ki}))$ are observed for the cases, where the subscript {ki} denotes the ith subject in the kth stratum. Under the proportional hazards model in (2), Kulich and Lin (2004) proposed a weighted estimating approach for estimating the regression parameter β by solving the score function

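$$U_{14}(\beta) = \sum_{k=1}^{K} \sum_{i=1}^{N_k} \int_0^{\tau} \left[ Z_{ki}(t) - \frac{\sum_{k=1}^{K} \sum_{i=1}^{N_k} w_{ki}(t) Y_{ki}(t) Z_{ki}(t)\, e^{\beta' Z_{ki}(t)}}{\sum_{k=1}^{K} \sum_{i=1}^{N_k} w_{ki}(t) Y_{ki}(t)\, e^{\beta' Z_{ki}(t)}} \right] dN_{ki}(t) = 0. \tag{16}$$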

Various proposals for the potentially time-varying weight wki(t) yield different case–cohort estimators.
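One simple member of this class uses time-constant weights built from known stratum-specific sampling probabilities. The sketch below, with hypothetical function names, a scalar time-fixed exposure and the cohort stored as flat lists with a stratum label per subject, shows how the score in (16) could be evaluated in that special case; it is an illustration, not the general time-varying procedure.

```python
import math

def stratified_weights(delta, xi, V, pi):
    """Weights Delta + (1 - Delta) * xi / pi_k, where pi maps a stratum label to pi_k."""
    return [d + (1 - d) * x / pi[v] for d, x, v in zip(delta, xi, V)]

def weighted_cox_score(beta, T, delta, Z, w):
    """Score (16) for scalar, time-fixed Z and time-constant weights w."""
    used = [i for i in range(len(T)) if w[i] > 0]      # cases and sampled controls
    score = 0.0
    for i in used:
        if delta[i] != 1:
            continue
        risk = [l for l in used if T[l] >= T[i]]
        s0 = sum(w[l] * math.exp(beta * Z[l]) for l in risk)
        s1 = sum(w[l] * Z[l] * math.exp(beta * Z[l]) for l in risk)
        score += Z[i] - s1 / s0
    return score
```

Solving `weighted_cox_score(beta, ...) = 0` for `beta`, for instance by bisection, would give the corresponding weighted estimate in this simplified setting.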

Kulich and Lin (2004) further extended the above method to a doubly weighted estimation method by incorporating arbitrary stochastic processes as time-varying weights into the empirical sampling probabilities. Let Aki(t) be a diagonal matrix with m potentially different random processes on the diagonal. Consider the following estimators of the subcohort sampling probabilities:

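$$\hat{\pi}_k(t) = \left[ \sum_{i=1}^{N_k} (1 - \Delta_{ki}) A_{ki}(t) \right]^{-1} \left[ \sum_{i=1}^{N_k} (1 - \Delta_{ki}) \xi_{ki} A_{ki}(t) \right],$$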

which yields m estimators of πk on the diagonal of π^k(t). Each estimator can be interpreted as an empirical sampling proportion based on the control, with the contribution of each control weighted by a component of Aki(t). The weight matrix was defined as

wki(t)=ΔkiIm+(1Δki)ξkiπ^k1(t),

where Im is an m × m identity matrix. Kulich and Lin (2004) considered the estimating equation in (16) as

U15(β)=k=1Ki=1Nk0τ{Zki(t)ZDW(t,β)}dNki(t)=0,

where

ZDW(t,β)={k=1Ki=1Nkwki(t)Yki(t)eβZki(t)}1×{k=1Ki=1Nkwki(t)Yki(t)Zki(t)eβZki(t)}.

The estimators proposed by Borgan et al. (2000) can be regarded as special cases of the above doubly weighted estimators. To reduce the efficiency loss caused by model misspecification, Kulich and Lin (2004) combined the doubly weighted estimator with the estimator of Borgan et al. (2000) to obtain a combined doubly weighted estimator. With appropriately chosen weight functions, the estimators proposed by Kulich and Lin (2004) may be more efficient than those of Chen and Lo (1999), Borgan et al. (2000) and Chen (2001c).

Nan et al. (2006) also developed weighted estimating equation methods for stratified case–cohort designs under the linear regression model (11). If a variable $Z^*$ that is highly correlated with the primary exposure variable $Z$ is available for all the subjects in the cohort, selecting the subcohort using a stratified sampling scheme based on $Z^*$ can improve efficiency. An independent Bernoulli sampling method, in which

$$\pi(Z_i^*)=P(i\in S_0\mid Z_i^*),\quad i=1,\ldots,N,$$

was considered to select the subcohort. Nan et al. (2006) proposed the following estimating equation for such a stratified case–cohort design:

$$U_{16}(\beta)=\sum_{i=1}^{N}\int_0^{\tau}\left[Z_i-\frac{\sum_{l=1}^{N}w_lZ_lY_l(u+\beta' Z_l)}{\sum_{l=1}^{N}w_lY_l(u+\beta' Z_l)}\right]dN_i(u+\beta' Z_i)=0, \tag{17}$$

where

$$w_i=\frac{I(i\in S_0)}{\pi(Z_i^*)},\quad i=1,\ldots,N,$$

and $S_0$ denotes the index set of the subcohort. In this way, Nan et al. (2006) extended their work on the original case–cohort design (see (12)) to the stratified case–cohort design. Because the stratified sampling scheme makes better use of the first-phase covariate information, the estimator obtained from (17) can be more efficient.

Kong and Cai (2009) also extended their estimating procedure under the accelerated failure time model (13) for the original case–cohort design (see (14)) to the stratified case–cohort design. For the stratified case–cohort design, the full cohort is assumed to consist of $K$ strata of sizes $N_1, \ldots, N_K$. Let $n_k$ denote the number of subjects selected from the $k$th stratum into the subcohort, and let $\pi_k = n_k/N_k$ be the sampling proportion of the subcohort in the $k$th stratum. Kong and Cai (2009) proposed the following rank-based estimating equation

$$U_{17}(\beta)=\sum_{k=1}^{K}\sum_{i=1}^{N_k}\Delta_{ki}\,\phi(\beta;e_{ki}(\beta))\left\{Z_{ki}-\tilde Z_k(\beta;e_{ki}(\beta))\right\}=0, \tag{18}$$

where

$$\tilde Z_k(\beta;t)=\frac{\sum_{l=1}^{N_k}w_{kl}Z_{kl}Y_{kl}(\beta;t)}{\sum_{l=1}^{N_k}w_{kl}Y_{kl}(\beta;t)},$$

and

$$w_{ki}=\Delta_{ki}+(1-\Delta_{ki})\frac{\xi_{ki}}{\pi_k},\quad i=1,\ldots,N_k.$$

Since the stratified sampling design further improves the efficiency when the stratification variable is a good surrogate of the primary exposure, the proposed method of Kong and Cai (2009) can further enhance the efficiency. The proposed methods are also valid for the usual semiparametric linear model.

2.2 General failure-time ODS design

While the case–cohort design is efficient, especially when failure is rare (i.e., under heavy censoring), it may not be feasible to implement when failure is not rare. In the case–cohort and generalized case–cohort designs, the selection of the supplemental samples depends only on whether the event of interest has occurred. As in ODS designs with a continuous outcome, if an exposure variable is related to the outcome, then subjects whose observed failure times are very long or very short should carry more information about the exposure-response relationship.

To take advantage of the ODS scheme for right-censored data and obtain more powerful and efficient inference, Ding et al. (2014) proposed a general failure-time ODS design. In this design, a simple random sample (SRS) is first selected from the full cohort. In addition, the range of the observed failure times of all the cases is partitioned into mutually exclusive and exhaustive strata, and a supplemental sample is selected from each stratum. The primary exposure variables are measured only for subjects in these two components. Like case–cohort designs, the general failure-time ODS design enriches the observed sample by selectively including certain failure subjects. Such a design is a useful step toward cost-effective sampling in survival analysis, and several authors have studied statistical inference for data collected under it.

To describe the general ODS design with failure time data more precisely, suppose that there is a study population of $N$ independent individuals. Assume that the range of the observed failure times of all the cases is partitioned into $K$ mutually exclusive and exhaustive strata $A_k = (a_{k-1}, a_k]$, $k = 1, \ldots, K$, by known constants $\{a_k, k = 1, \ldots, K\}$ satisfying $0 = a_0 < a_1 < \cdots < a_{K-1} < a_K = \tau$. The general failure-time ODS design proceeds as follows. First, a random sample of size $n_0$, called the SRS sample, is selected from the full cohort. In addition, a supplemental sample of size $n_k$ is selected from the cases in the $k$th stratum, $k = 1, \ldots, K$. The samples from these two components constitute the ODS sample. Suppose that $n_k$ is fixed by design for $k = 1, \ldots, K$, and let $n=\sum_{k=0}^{K}n_k$ be the total size of the ODS sample. Let $V$, $S_0$ and $S_k$ be the index sets of the total ODS sample, the SRS sample and the supplemental sample from the $k$th stratum, respectively. The observed data for the failure-time ODS design can then be summarized as $(T_i, \Delta_i, Z_i)$ when $i \in S_0$, and $(T_i, \Delta_i, Z_i \mid \Delta_i = 1, T_i \in A_k)$ when $i \in S_k$, $k = 1, \ldots, K$.
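The sampling scheme just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `draw_ods_sample`, the toy cohort, and the choice to draw supplemental samples from cases not already in the SRS sample are all hypothetical and not taken from Ding et al. (2014).

```python
import random

def draw_ods_sample(time, delta, cutpoints, n0, n_supp, seed=0):
    """Draw an SRS of size n0 plus supplemental samples of cases by stratum.

    cutpoints = [a_0, a_1, ..., a_K]; stratum k is the interval (a_{k-1}, a_k].
    n_supp[k-1] is the supplemental sample size n_k for stratum k.
    Assumption: supplemental cases are drawn from outside the SRS sample.
    """
    rng = random.Random(seed)
    n_total = len(time)
    srs = set(rng.sample(range(n_total), n0))
    supplemental = {}
    for k in range(1, len(cutpoints)):
        lower, upper = cutpoints[k - 1], cutpoints[k]
        pool = [i for i in range(n_total)
                if delta[i] == 1 and lower < time[i] <= upper and i not in srs]
        size = min(n_supp[k - 1], len(pool))  # cannot sample more than available
        supplemental[k] = set(rng.sample(pool, size))
    return srs, supplemental

# Toy cohort: uniform observed times on (0, 10), about 60% cases.
toy_rng = random.Random(1)
N = 200
time = [toy_rng.uniform(0.0, 10.0) for _ in range(N)]
delta = [1 if toy_rng.random() < 0.6 else 0 for _ in range(N)]
cutpoints = [0.0, 2.0, 8.0, 10.0]

# Oversample cases with short (stratum 1) and long (stratum 3) failure times.
srs, supp = draw_ods_sample(time, delta, cutpoints, n0=30, n_supp=[10, 0, 10])
```

Setting the middle stratum size to zero reflects the idea that the tails of the failure time distribution are the most informative regions; other allocations are equally possible.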

Ding et al. (2014) developed such a general failure-time ODS scheme and established estimation procedures under the proportional hazards model in (2). The likelihood function based on the observed data from the general failure-time ODS design is proportional to

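$$L_4(\beta,Q_Z,\Lambda_0,S_C)=\left[\prod_{i\in S_0}\{f_{\beta,\Lambda_0}(T_i\mid Z_i)\}^{\Delta_i}\{\bar F_{\beta,\Lambda_0}(T_i\mid Z_i)\}^{1-\Delta_i}\right]\times\left[\prod_{k=1}^{K}\prod_{i\in S_k}f_{\beta,\Lambda_0}(T_i\mid Z_i)\right]\times\left[\prod_{k=0}^{K}\prod_{i\in S_k}q_Z(Z_i)\right]\times\left[\prod_{k=1}^{K}\left\{\int_{Z}\int_{A_k}f_{\beta,\Lambda_0}(t\mid Z)S_C(t)\,dt\,dQ_Z(Z)\right\}^{-n_k}\right], \tag{19}$$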

where $f_{\beta,\Lambda_0}(t\mid Z)$ and $\bar F_{\beta,\Lambda_0}(t\mid Z)$ are the conditional density function and survival function of the failure time given $Z$ with baseline cumulative hazard function $\Lambda_0(t)$, respectively, $Q_Z(\cdot)$ and $q_Z(\cdot)$ denote the cumulative distribution function and density function of $Z$, respectively, and $S_C(t)$ is the survival function of the censoring time $C$. Because the nonparametric portion $(Q_Z, \Lambda_0, S_C)$ cannot be separated from the likelihood function (19), which combines the conditional parametric likelihood and the marginal semiparametric likelihood, Ding et al. (2014) developed an estimated maximum semiparametric empirical likelihood approach for estimating the regression parameter.

By replacing $\Lambda_0$ with the Breslow-Aalen estimator $\hat\Lambda_0$ and $S_C$ with the Nelson-Aalen estimator $\hat S_C$ based on the SRS data in the above joint likelihood, an estimated log-likelihood function was obtained as

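$$l_5(\beta,Q_Z)=\sum_{i\in S_0}\log h_\beta(T_i,\Delta_i,Z_i)+\sum_{k=1}^{K}\sum_{i\in S_k}\log f_{\beta,\hat\Lambda_0}(T_i\mid Z_i)+\sum_{k=0}^{K}\sum_{i\in S_k}\log q_Z(Z_i)-\sum_{k=1}^{K}n_k\log\pi_k, \tag{20}$$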

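where

$$h_\beta(T_i,\Delta_i,Z_i)=\left(\frac{e^{\beta' Z_i}}{\sum_{l\in S_0}Y_l(T_i)e^{\beta' Z_l}}\right)^{\Delta_i},$$

and for $k = 1, \ldots, K$,

$$\pi_k=\int_{Z}P_k(Z;\beta)\,dQ_Z(Z)=\int_{Z}\left(\int_{A_k}f_{\beta,\hat\Lambda_0}(t\mid Z)\hat S_C(t)\,dt\right)dQ_Z(Z),$$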

which are the stratum-specific probabilities of observing a failure time in $A_k$, evaluated at the estimated $\hat\Lambda_0$ and $\hat S_C$. To maximize the estimated likelihood (20) with respect to $(\beta, Q_Z)$ without specifying $Q_Z$, a semiparametric empirical likelihood approach was used, which profiles out $Q_Z$ and yields the profile likelihood function

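$$l_6(\beta,\pi)=\sum_{i\in S_0}\log h_\beta(T_i,\Delta_i,Z_i)+\sum_{k=1}^{K}\sum_{i\in S_k}\log f_{\beta,\hat\Lambda_0}(T_i\mid Z_i)-\sum_{i\in V}\log\left[n_0\left\{1+\sum_{k=1}^{K}\frac{n_k}{n_0\pi_k}P_k(Z_i;\beta)\right\}\right]-\sum_{k=1}^{K}n_k\log\pi_k, \tag{21}$$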

where π′ = (π1, . . . , πK). The proposed estimator is the β that maximizes (21).

Yu et al. (2015) developed a weighted pseudo-score estimator for the regression parameters of the additive hazards model (5) for data from the general failure-time ODS design. Let $\xi_i$ indicate, by the values 1 or 0, whether or not the $i$th subject is selected into the SRS portion. Let $\eta_{ik}$ indicate whether or not the $i$th subject from the stratum $A_k$ is selected into the supplemental sample. For estimating the regression parameter $\beta$, the following weighted pseudo-score equation was proposed by applying the inverse probability weighting approach:

$$U_{18}(\beta)=\sum_{i=1}^{N}\int_0^{\tau}w_i\left\{Z_i(t)-\bar Z(t)\right\}\left\{dN_i(t)-Y_i(t)\beta' Z_i(t)\,dt\right\}=0,$$

where

$$\bar Z(t)=\frac{\sum_{i=1}^{N}w_iY_i(t)Z_i(t)}{\sum_{i=1}^{N}w_iY_i(t)},$$

and the weights $w_i$ were defined as

$$w_i=(1-\Delta_i)\xi_i(\hat\rho_0\hat\rho_V)^{-1}+\Delta_i\xi_i(1-\zeta_i)(\hat\rho_0\hat\rho_V)^{-1}+\Delta_i\xi_i\zeta_i+\Delta_i(1-\xi_i)\sum_{k=1}^{K}\frac{\pi_k(1-\hat\rho_0\hat\rho_V)\,\zeta_{ik}\eta_{ik}}{\hat\rho_k\hat\rho_V},$$

where $\hat\rho_0=n_0/n$, $\hat\rho_k=n_k/n$, $\hat\rho_V=n/N$, $\zeta_i=\sum_{k=1}^{K}\zeta_{ik}$ and $\zeta_{ik} = I(T_i \in A_k)$.

The general failure-time ODS design proposed by Ding et al. (2014) improves on the case–cohort design and the generalized case–cohort design because it allows the selection of cases to depend on the timing of disease endpoints, i.e., it oversamples subjects from the most informative regions. To reap the benefits of this design, Ding et al. (2014) developed a new inferential method and provided an estimated maximum semiparametric empirical likelihood estimator for the parameters of primary interest under the proportional hazards model. For the additive hazards model, which focuses on risk differences rather than risk ratios, Yu et al. (2015) studied a weighted pseudo-score procedure for estimating the regression parameter. The resulting estimators have a closed form and are easy to compute. Practical guidance was also derived by evaluating the relative efficiency of the proposed method against the simple random sampling design and by studying the optimal allocation of the subsamples. This body of research suggests that the general failure-time ODS design is a biased-sampling design that can enhance study efficiency and reduce study cost, and that it can be an important alternative to the case–cohort and generalized case–cohort designs in studies involving survival data. Further developments on the general failure-time ODS design are desirable.

2.3 Other biased-sampling designs with failure time data

Other important biased-sampling designs with failure time data include length-biased sampling and interval sampling. When survival data arise from prevalent cases ascertained through a cross-sectional study, it is well known that the survivor function corresponding to these data is length biased. Length-biased sampling is frequently a convenient and economical sampling scheme for analyzing failure time data. The phenomenon of length bias was first noticed in the context of anatomy by Wicksell (1925) and was later studied systematically by Vardi (1982, 1989), Wang (1991), Correa and Wolfson (1999), Asgharian et al. (2002), and Asgharian and Wolfson (2005), among others. When analyzing prevalent cohort survival data with exposure variables, failure times are not a random sample from the study population. Thus, the corresponding exposure variables are also biased because they are associated with the long-term survivors. Related sampling issues have been discussed by, e.g., Patil and Rao (1978), Patil et al. (1988), and Bergeron et al. (2008).

Wang (1996) first adopted the proportional hazards model to fit length-biased failure time data. The proposed estimation procedure used bias-adjusted risk set sampling to construct the pseudo-likelihood. Ghosh (2008) proposed an estimating equation approach that allows the length-biased data to be subject to right censoring. Tsai (2009) obtained a pseudo-partial likelihood for proportional hazards models with biased-sampling data by embedding the biased-sampling data into left-truncated data. Shen et al. (2009) studied how to model exposure effects for length-biased data under transformation and accelerated failure time models. Qin and Shen (2010) proposed inverse weighted estimating equation methods for estimating the regression parameter of the proportional hazards model. Qin et al. (2011) proposed new EM algorithms for the maximum likelihood estimators of the nonparametric and semiparametric proportional hazards models for right-censored length-biased data.

In practice, the event time is often interval censored rather than right censored, that is, the event time for a subject is only known to fall into some random time interval. Under the proportional hazards model, Li et al. (2008) considered case–cohort data with interval censoring, where the inspection time intervals were assumed to be fixed. Current status data are a special type of interval-censored data in which the inspection time intervals are random. Li and Nan (2011) considered a family of semiparametric regression models for current status data in two-phase sampling designs, which include case–cohort designs as special cases. A weighted likelihood method was proposed by regarding two-phase sampling designs as a special missing data problem.

In many applications, interest lies in the occurrence of two or more consecutive failure events and in the relationship between the event times. In such situations, data are often collected conditional on the first failure event occurring within a specific time interval, which induces bias. This type of sampling is referred to as interval sampling, where the first event is retrospectively identified and the subsequent failure events are observed during follow-up. Interval sampling occurs because only subjects who develop the disease within a specific time interval can be included, so the data represent a nonrandomly sampled subset of the population.

Among recent work, Zhu and Wang (2012) studied the statistical features and bias of observed data under interval sampling and proposed semiparametric methods under semi-stationarity and stationarity. Zhu and Wang (2014) proposed nonparametric estimation of the association between bivariate failure times based on Kendall's tau for interval sampling data. A nonparametric estimator was derived in which the contribution of each comparable and orderable pair was weighted by the inverse of the associated selection probability. Zhu and Wang (2015) obtained bias-corrected estimators of the marginal survival functions and estimated the association parameter of a copula model by a two-stage procedure. Inference for the association measure in the copula model was developed, with exposure variables incorporated into the survival distribution via the proportional hazards model.

3 ODS designs for multivariate failure time

An advantage of the case–cohort study design is that the subcohort can be used for multiple disease outcomes. Taking advantage of this, many studies conduct multiple case–cohort analyses for different diseases using the same subcohort. A commonly used method for dealing with multiple disease outcomes is to analyze each disease separately. However, this approach does not allow comparison of the effects of risk factors across diseases, because it does not account for the repeated use of the subcohort or for the correlation between outcomes. Recently, several methodologies have been developed to analyze case–cohort and generalized case–cohort data with multiple disease outcomes.

Suppose that there are $N$ independent subjects in a cohort study and $K$ disease outcomes of interest. Consider independent failure time response vectors $\tilde T_i = (\tilde T_{i1}, \ldots, \tilde T_{iK})'$ for $i = 1, \ldots, N$. Let $C_{ik}$ denote the potential censoring time for outcome $k$ of subject $i$. The observed time is $T_{ik} = \min(\tilde T_{ik}, C_{ik})$. Let $\Delta_{ik} = I(\tilde T_{ik} \le C_{ik})$ denote the right censoring indicator for outcome $k$ of subject $i$, $Y_{ik}(t) = I(T_{ik} \ge t)$ denote the at-risk process and $N_{ik}(t) = \Delta_{ik} I(T_{ik} \le t)$ denote the counting process. Let $Z_{ik}(t)$ be a $p$-dimensional exposure variable corresponding to the $k$th disease outcome for subject $i$ at time $t$. Let $\beta$ be a $p$-dimensional parameter of interest. Let $\tau$ denote the study end time.

Kang and Cai (2009) proposed to fit data from the case–cohort design with multiple disease outcomes with a marginal intensity process model:

$$\lambda_{ik}(t\mid Z_{ik}(t))=Y_{ik}(t)\lambda_{0k}(t)\exp\{\beta' Z_{ik}(t)\},\quad i=1,\ldots,N;\ k=1,\ldots,K, \tag{22}$$

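where $\lambda_{0k}(t)$ is an unspecified baseline hazard function for disease outcome $k$. A subject may experience all, only some, or even none of the $K$ diseases. Model (22) can incorporate failure-type-specific effects and includes the model

$$\lambda_{ik}(t\mid Z_{ik}(t))=Y_{ik}(t)\lambda_{0k}(t)\exp\{\beta_k' Z_{ik}(t)\},$$

as a special case. By defining $\beta=(\beta_1',\ldots,\beta_k',\ldots,\beta_K')'$ and replacing $Z_{ik}(t)$ with the zero-padded vector $(0',\ldots,0',Z_{ik}(t)',0',\ldots,0')'$, whose only nonzero block is the $k$th, disease-specific effects can be obtained from model (22).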

Under the case–cohort design, suppose a subcohort of size $n$ is selected from the cohort by simple random sampling. Let $\xi_i$ denote the indicator that the $i$th subject is selected into the subcohort and let $P(\xi_i = 1) = \pi = n/N$ denote the selection probability of the $i$th subject. The observed data for the $k$th disease outcome of the $i$th subject are $(T_{ik}, \Delta_{ik}, \xi_i, Z_{ik}(t), 0 \le t \le T_{ik})$ when $\xi_i = 1$ or $\Delta_{ik} = 1$, and $(T_{ik}, \Delta_{ik}, \xi_i)$ when $\xi_i = 0$ and $\Delta_{ik} = 0$. Kang and Cai (2009) developed the following pseudo-partial-likelihood score equation for the estimation of $\beta$:

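$$U_{19}(\beta)=\sum_{i=1}^{N}\sum_{k=1}^{K}\int_0^{\tau}\left[Z_{ik}(t)-\frac{\sum_{j=1}^{N}w_{jk}(t)Y_{jk}(t)Z_{jk}(t)e^{\beta' Z_{jk}(t)}}{\sum_{j=1}^{N}w_{jk}(t)Y_{jk}(t)e^{\beta' Z_{jk}(t)}}\right]dN_{ik}(t)=0, \tag{23}$$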

where $w_{ik}(t)$ is a time-varying weight function of the form

$$w_{ik}(t)=\Delta_{ik}+(1-\Delta_{ik})\xi_i\hat\pi_k^{-1}(t),$$

where

$$\hat\pi_k(t)=\frac{\sum_{i=1}^{N}(1-\Delta_{ik})\xi_iY_{ik}(t)}{\sum_{i=1}^{N}(1-\Delta_{ik})Y_{ik}(t)}.$$

This weight function equals 1 for the cases, regardless of whether they belong to the subcohort, and $\hat\pi_k^{-1}(t)$ for the sampled censored subjects, where $\hat\pi_k(t)$ is an estimator of the true sampling probability $\pi$. Specifically, $\hat\pi_k(t)$ is the number of sampled censored subjects divided by the number of censored subjects, both counted among those remaining in the risk set at time $t$, so that $\hat\pi_k(t)$ is constructed using only censored subjects. As discussed in the univariate failure time context, this type of time-varying weight function may enhance efficiency.

Kim et al. (2013) further improved efficiency for case–cohort studies with multiple disease outcomes under the marginal proportional hazards regression model (22). The new weights

$$\tilde w_{ik}(t)=\left\{1-\prod_{j=1}^{K}(1-\Delta_{ij})\right\}+\prod_{j=1}^{K}(1-\Delta_{ij})\,\xi_i\tilde\pi_k^{-1}(t),$$

where

$$\tilde\pi_k(t)=\frac{\sum_{i=1}^{N}\xi_i\left\{\prod_{j=1}^{K}(1-\Delta_{ij})\right\}Y_{ik}(t)}{\sum_{i=1}^{N}\left\{\prod_{j=1}^{K}(1-\Delta_{ij})\right\}Y_{ik}(t)},$$

were used to replace $w_{ik}(t)$ in the score function (23) to obtain the proposed pseudo-likelihood estimator. The weight function $\tilde w_{ik}(t)$ takes the failure status of the other diseases into consideration, so the proposed estimator uses the exposure information that is available because of the other diseases. This makes the estimators proposed by Kim et al. (2013) more efficient than the estimators of Kang and Cai (2009).
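The difference between the two weighting schemes can be seen on a small example. The sketch below evaluates both weight functions at a single time point for one disease; the function names, the data layout (`time[i][k]`, `delta[i][k]`) and the toy cohort are hypothetical choices made for illustration only.

```python
def sampled_fraction(t, time_k, member, xi):
    """Sampled fraction among `member` subjects still at risk at time t."""
    at_risk = [m and T >= t for T, m in zip(time_k, member)]
    den = sum(1 for r in at_risk if r)
    num = sum(x for r, x in zip(at_risk, xi) if r)
    return num / den if den else 0.0

def kang_cai_weights(t, k, time, delta, xi):
    """Weight 1 for cases of disease k, 1/pi_hat_k(t) for sampled non-cases of k."""
    censored_k = [d[k] == 0 for d in delta]
    p = sampled_fraction(t, [T[k] for T in time], censored_k, xi)
    return [1.0 if d[k] == 1 else (1.0 / p if x == 1 and p > 0 else 0.0)
            for d, x in zip(delta, xi)]

def kim_weights(t, k, time, delta, xi):
    """Weight 1 for subjects with any disease, 1/pi_tilde_k(t) for sampled
    subjects free of all K diseases."""
    event_free = [not any(d) for d in delta]
    p = sampled_fraction(t, [T[k] for T in time], event_free, xi)
    return [1.0 if not f else (1.0 / p if x == 1 and p > 0 else 0.0)
            for f, x in zip(event_free, xi)]

# Toy cohort with K = 2 diseases; subject 2 fails from disease 2 only
# and is outside the subcohort.
time = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0], [5.0, 5.0]]
delta = [[0, 0], [1, 0], [0, 1], [0, 0], [0, 0]]
xi = [1, 0, 0, 1, 0]

w_kc = kang_cai_weights(0.5, 0, time, delta, xi)
w_kim = kim_weights(0.5, 0, time, delta, xi)
```

For disease 1, the first scheme gives subject 2 weight 0 even though its exposure was measured (it is a case for disease 2), whereas the second scheme gives it weight 1 and correspondingly smaller weights to the sampled disease-free subjects.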

Under the generalized case–cohort design, suppose a subcohort of size $n$ is sampled from the cohort by simple random sampling. After the subcohort is sampled, cases outside the subcohort are sampled for each disease. For the $k$th disease, let $n_{c,k}$ denote the number of sampled cases outside the subcohort. Let $\eta_{ik}$ denote the indicator that the $i$th subject, who is outside the subcohort and has the $k$th disease, is selected into the sample. Denote by $q_k = P(\eta_{ik} = 1 \mid \Delta_{ik} = 1, \xi_i = 0) = n_{c,k}/(N_k - n_k)$ the selection probability of subjects who have the $k$th disease but are outside the subcohort, where $N_k$ and $n_k$ denote the number of cases of the $k$th disease in the cohort and in the subcohort, respectively. The observed data for the $k$th disease outcome of the $i$th subject are $(T_{ik}, \Delta_{ik}, \xi_i, \eta_{ik}, Z_{ik}(t), 0 \le t \le T_{ik})$ when $\xi_i = 1$ or $\eta_{ik} = 1$, and $(T_{ik}, \Delta_{ik}, \xi_i, \eta_{ik})$ when $\xi_i = 0$ and $\eta_{ik} = 0$. When $q_k = 1$ for all $k$, the design reduces to the original case–cohort design, which samples all the cases outside the subcohort.

For the generalized case–cohort study with multiple disease outcomes, Kang and Cai (2009) also fitted data to the marginal proportional hazards model (22), and constructed the following weighted estimating functions for the estimation of the hazards regression parameter β:

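$$U_{20}(\beta)=\sum_{i=1}^{N}\sum_{k=1}^{K}\int_0^{\tau}w_{ik}(t)\left[Z_{ik}(t)-\frac{\sum_{j=1}^{N}w_{jk}(t)Y_{jk}(t)Z_{jk}(t)e^{\beta' Z_{jk}(t)}}{\sum_{j=1}^{N}w_{jk}(t)Y_{jk}(t)e^{\beta' Z_{jk}(t)}}\right]dN_{ik}(t),$$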

where

$$w_{ik}(t)=\Delta_{ik}\xi_i+(1-\Delta_{ik})\xi_i\hat\pi_k^{-1}(t)+\Delta_{ik}(1-\xi_i)\eta_{ik}\hat q_k^{-1}(t), \tag{24}$$

and

$$\hat\pi_k(t)=\frac{\sum_{i=1}^{N}(1-\Delta_{ik})\xi_iY_{ik}(t)}{\sum_{i=1}^{N}(1-\Delta_{ik})Y_{ik}(t)},\qquad \hat q_k(t)=\frac{\sum_{i=1}^{N}\Delta_{ik}(1-\xi_i)\eta_{ik}Y_{ik}(t)}{\sum_{i=1}^{N}\Delta_{ik}(1-\xi_i)Y_{ik}(t)}. \tag{25}$$

The idea is similar to their method for the original case–cohort design (see (23)). Subcohort cases are weighted by 1, and subjects in the subcohort who are censored for disease $k$ are weighted by $\hat\pi_k^{-1}(t)$. The sampled non-subcohort cases are weighted by the inverse of their estimated sampling probability, $\hat q_k^{-1}(t)$, where $\hat q_k(t)$ is the number of sampled non-subcohort cases with the $k$th disease outcome divided by the number of non-subcohort cases with the $k$th disease outcome, both counted among those remaining in the risk set at time $t$.
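As a small numerical illustration of the weights in (24) and (25), the following sketch evaluates them for a single disease at one time point. The function name and the toy data are hypothetical, and returning a weight of 0 when an estimated sampling fraction is 0 is a convention assumed here to avoid division by zero.

```python
def generalized_case_cohort_weights(t, time, delta, xi, eta):
    """Weights at time t for one disease under the generalized case-cohort design.

    time, delta: observed times and event indicators for this disease
    xi: subcohort indicators; eta: indicators of sampled non-subcohort cases
    """
    at_risk = [T >= t for T in time]
    n_ctrl = sum(1 for d, r in zip(delta, at_risk) if d == 0 and r)
    n_ctrl_sub = sum(x for d, r, x in zip(delta, at_risk, xi) if d == 0 and r)
    n_case_out = sum(1 for d, r, x in zip(delta, at_risk, xi)
                     if d == 1 and x == 0 and r)
    n_case_out_sampled = sum(e for d, r, x, e in zip(delta, at_risk, xi, eta)
                             if d == 1 and x == 0 and r)
    pi_hat = n_ctrl_sub / n_ctrl if n_ctrl else 0.0
    q_hat = n_case_out_sampled / n_case_out if n_case_out else 0.0

    weights = []
    for d, x, e in zip(delta, xi, eta):
        if d == 1 and x == 1:
            weights.append(1.0)                                   # subcohort case
        elif d == 0 and x == 1:
            weights.append(1.0 / pi_hat if pi_hat > 0 else 0.0)   # subcohort non-case
        elif d == 1 and x == 0 and e == 1:
            weights.append(1.0 / q_hat if q_hat > 0 else 0.0)     # sampled outside case
        else:
            weights.append(0.0)                                   # exposure not measured
    return weights

# Toy cohort: three cases (one in the subcohort, one sampled outside it)
# and three non-cases (one in the subcohort).
time = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
delta = [1, 1, 1, 0, 0, 0]
xi = [1, 0, 0, 1, 0, 0]
eta = [0, 1, 0, 0, 0, 0]

w_early = generalized_case_cohort_weights(0.5, time, delta, xi, eta)
```

At the early time point the weights sum to the cohort size, which is the sense in which the sampled subjects stand in for the full cohort.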

Kang et al. (2013) considered fitting marginal additive hazards regression model for the generalized case–cohort designs with multiple disease outcomes. The model is

$$\lambda_{ik}(t\mid Z_{ik}(t))=\lambda_{0k}(t)+\beta' Z_{ik}(t),\quad i=1,\ldots,N;\ k=1,\ldots,K, \tag{26}$$

where λ0k(t) is an unspecified baseline hazard function for disease outcome k. Model (26) also incorporates disease-specific effects like model (22). Kang et al. (2013) proposed the weighted estimating equation:

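$$U_{21}(\beta)=\sum_{i=1}^{N}\sum_{k=1}^{K}\int_0^{\tau}w_{ik}(t)\left\{Z_{ik}(t)-\bar Z_k(t)\right\}\left\{dN_{ik}(t)-Y_{ik}(t)\beta' Z_{ik}(t)\,dt\right\}=0,$$

where

$$\bar Z_k(t)=\frac{\sum_{i=1}^{N}w_{ik}(t)Y_{ik}(t)Z_{ik}(t)}{\sum_{i=1}^{N}w_{ik}(t)Y_{ik}(t)},$$

and $w_{ik}(t)$, $\hat\pi_k(t)$ and $\hat q_k(t)$ are defined as in (24) and (25). The resulting estimator has a closed form:

$$\hat\beta_{21}=\left[\sum_{i=1}^{N}\sum_{k=1}^{K}\int_0^{\tau}w_{ik}(t)Y_{ik}(t)\left\{Z_{ik}(t)-\bar Z_k(t)\right\}^{\otimes 2}dt\right]^{-1}\times\left[\sum_{i=1}^{N}\sum_{k=1}^{K}\int_0^{\tau}w_{ik}(t)\left\{Z_{ik}(t)-\bar Z_k(t)\right\}dN_{ik}(t)\right],$$

where $a^{\otimes 2}=aa'$ for a vector $a$.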
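Because the estimator is available in closed form, it can be computed directly without iteration. The sketch below does this in a deliberately simplified setting that is assumed here and is not the general case above: a single outcome, a scalar time-fixed covariate, time-fixed weights supplied by the caller, and no tied observation times. The function name and toy data are hypothetical.

```python
def weighted_additive_hazards(time, delta, z, w):
    """Closed-form weighted estimator for a scalar additive hazards coefficient.

    Assumes time-fixed weights and covariates and no tied times, so the
    weighted covariate mean is constant between consecutive observed times.
    Subjects with zero weight (unobserved covariates) are ignored.
    """
    order = sorted((i for i in range(len(time)) if w[i] > 0),
                   key=lambda i: time[i])
    numerator = 0.0    # sum of w_i {Z_i - Zbar(T_i)} over observed events
    denominator = 0.0  # integral of sum_i w_i Y_i(t) {Z_i - Zbar(t)}^2 dt
    previous_time = 0.0
    for position, i in enumerate(order):
        risk_set = order[position:]  # subjects still at risk on (previous_time, T_i]
        total_weight = sum(w[j] for j in risk_set)
        z_bar = sum(w[j] * z[j] for j in risk_set) / total_weight
        spread = sum(w[j] * (z[j] - z_bar) ** 2 for j in risk_set)
        denominator += (time[i] - previous_time) * spread
        if delta[i] == 1:
            numerator += w[i] * (z[i] - z_bar)
        previous_time = time[i]
    return numerator / denominator

# Toy data with unit weights (full-cohort analysis).
time = [1.0, 2.0, 3.0]
delta = [1, 1, 0]
z = [1.0, 0.0, 1.0]
w = [1.0, 1.0, 1.0]

beta_hat = weighted_additive_hazards(time, delta, z, w)
```

With design-based weights in place of the unit weights, the same function would give a weighted estimate from the sampled subjects only; time-varying weights would require recomputing the weights within each interval.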

Besides the multiple disease outcome data, other kinds of multivariate failure time data have become increasingly common in practice as a result of growing interest in studying disease incidence and clustering due to environmental factors and genetics. Lu and Shih (2006) proposed case–cohort designs adapted to clustered failure time data. The main principle of their proposed case–cohort designs is to determine the random subcohort from which the exposure data are assembled in addition to those from all cases. Lu and Shih (2006) considered several sampling schemes and developed the estimation procedures by fitting the proposed case–cohort design with clustered data under the proportional hazards model. The proposed approaches were derived by the principle similar to that of the pseudo-likelihood function of Self and Prentice (1988), but extended to accommodate the proposed subcohort selection procedures and to account for intracluster association.

The above literature on ODS designs with a univariate and multivariate failure time also studied the asymptotic properties, e.g. consistency and asymptotic normality, of the proposed estimators. Feasible estimators of variance, usually the small-sample expression of the large-sample formula, are naturally derived from the asymptotic variance of the proposed estimators. Some technical challenges arise from the biased-sampling designs because the observations are not independent. For example, asymptotic properties have been established by using techniques such as empirical likelihood method, empirical process theory, martingale convergence theory, and U-statistics theory, etc. The key for the studies of theoretical results is to address the challenges introduced by the counterpart data of the subcohort or supplemental samples in the ODS designs with survival data.

4 Discussion

Epidemiologic studies often require a long follow-up of subjects in order to observe meaningful outcomes. A large cohort study with a long follow-up period can be very expensive. Efficient sampling designs and statistical methods, which can reduce the study cost and improve the study power under a limited budget, are always desirable. Several cost-effective biased-sampling designs for failure time data have been developed and various estimating procedures have been proposed. This paper reviewed recent progress in ODS designs with failure time data.

One advantage of the general failure-time ODS design is that, while still providing information on the overall cohort, it allows the selection of cases to depend on the timing of disease endpoints. The general failure-time ODS design improves on the simple random sampling design, the case–cohort design and the generalized case–cohort design, especially when the disease rate is not low or when investigators do not have enough budget to sample all cases. Despite the progress in methods for analyzing failure time data from biased-sampling designs, the methodologies available for data from such a general failure-time ODS design remain limited.

Extensions of the weighted estimating equations and likelihood functions are worth considering. One extension is to adopt time-varying weights instead of weights based on simple sampling probabilities. Another is to include information available from the first-phase data in the estimating equations or likelihood functions. For example, if the observed times are available for all the subjects in the cohort, incorporating the failure or censoring times of those who do not belong to the ODS sample could enhance efficiency. Because a stratified sampling scheme for selecting the subcohort can improve the efficiency of case–cohort designs, the development of a stratified failure-time ODS design, in which the SRS portion is selected by a stratified sampling scheme, is warranted.

In a growing number of applications, investigators are interested in multivariate failure-time outcomes. Future research includes incorporating information from auxiliary variables that are always observed into the weight functions of the estimating equations to improve efficiency further. For example, an idea similar to that of Kulich and Lin (2004) could be adapted to case–cohort data with multiple disease outcomes. Recent work on case–cohort designs with multivariate failure times has focused on estimating equation approaches. If the joint distribution of the correlated failure times from the same subject is specified, nonparametric maximum likelihood estimation based on the joint likelihood function for case–cohort data could yield more efficient estimators. To exploit the advantage of an ODS design, which can oversample from the most informative regions, the development of a multivariate failure-time ODS design will be an interesting topic.

Acknowledgments

This research is supported in part by U.S. National Institutes of Health (R01ES021900, P01CA142538 to H.Z. and J.C.), National Science Foundation of China (11101314 to J.D.).

References

  1. Asgharian M, M'Lan CE, Wolfson DB. Length-biased sampling with right censoring: an unconditional approach. J Am Stat Assoc. 2002;97:201–209.
  2. Asgharian M, Wolfson DB. Asymptotic behaviour of the npmle of the survivor function when the data are length-biased and subject to right censoring. Ann Stat. 2005;33:2109–2131.
  3. Barlow W. Robust variance estimation for the case-cohort design. Biometrics. 1994;50:1064–1072.
  4. Bergeron PJ, Asgharian M, Wolfson DB. Covariate bias induced by length-biased sampling of failure times. J Am Stat Assoc. 2008;103:737–742.
  5. Borgan O, Langholz B, Samuelsen SO, Goldstein L, Pogoda J. Exposure stratified case-cohort designs. Lifetime Data Anal. 2000;6:39–58. doi: 10.1023/a:1009661900674.
  6. Breslow NE, Cain KC. Logistic regression for two-stage case-control data. Biometrika. 1988;75:11–20.
  7. Breslow NE, Holubkov R. Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. J R Stat Soc B. 1997;59:447–461.
  8. Breslow NE, McNeney B, Wellner JA. Large sample theory for semiparametric regression models with two-phase, outcome dependent sampling. Ann Stat. 2003;31:1110–1139.
  9. Breslow NE, Wellner JA. Weighted likelihood for semiparametric models and two-phase stratified samples, with application to cox regression. Scand J Stat. 2007;34:86–102. doi: 10.1111/j.1467-9469.2007.00574.x.
  10. Cai J, Zeng D. Sample size/power calculation for case-cohort studies. Biometrics. 2004;60:1015–1024. doi: 10.1111/j.0006-341X.2004.00257.x.
  11. Cai J, Zeng D. Power calculation for case-cohort studies with nonrare events. Biometrics. 2007;63:1288–1295. doi: 10.1111/j.1541-0420.2007.00838.x.
  12. Chatterjee N, Chen YH, Breslow NE. A pseudo-score estimator for regression problems with two-phase sampling. J Am Stat Assoc. 2003;98:158–168.
  13. Chen HY. Weighted semiparametric likelihood method for fitting a proportional odds regression model to data from the case-cohort design. J Am Stat Assoc. 2001a;96:1446–1458.
  14. Chen HY. Fitting semiparametric transformation regression models to data from a modified case-cohort design. Biometrika. 2001b;88:255–268.
  15. Chen K. Generalized case-cohort sampling. J R Stat Soc B. 2001c;63:791–809.
  16. Chen K, Lo S. Case-cohort and case-control analysis with Cox's model. Biometrika. 1999;86:755–764.
  17. Cornfield J. A method of estimating comparative rates from clinical data: applications to cancer of lung, breast, and cervix. J Natl Cancer I. 1951;11:1269–1275.
  18. Correa JA, Wolfson DB. Length-bias: some characterizations and applications. J Stat Comput Sim. 1999;64:209–219.
  19. Cox DR. Partial likelihood. Biometrika. 1975;62:269–276.
  20. Ding J, Liu L, Peden DB, Kleeberger SR, Zhou H. Regression analysis for a summed missing data problem under an outcome-dependent sampling scheme. Can J Stat. 2012;40:282–303.
  21. Ding J, Zhou H, Liu L, Cai J, Longnecker MP. Estimating effect of environmental contaminants on women's subfecundity for the MoBa study data with an outcome-dependent sampling scheme. Biostatistics. 2014;15:636–650. doi: 10.1093/biostatistics/kxu016.
  22. Ghosh D. Proportional hazards regression for cancer studies. Biometrics. 2008;64:141–148. doi: 10.1111/j.1541-0420.2007.00830.x.
  23. Imbens GW, Lancaster T. Efficient estimation and stratified sampling. J Econ. 1996;74:289–318.
  24. Kalbfleisch JD, Lawless JF. Likelihood analysis of multi-state models for disease incidence and mortality. Stat Med. 1988;7:147–160. doi: 10.1002/sim.4780070116.
  25. Kang S, Cai J. Marginal hazards model for case-cohort studies with multiple disease outcomes. Biometrika. 2009;96:887–901. doi: 10.1093/biomet/asp059.
  26. Kang S, Cai J, Chambless L. Marginal additive hazards model for case-cohort studies with multiple disease outcomes: an application to the Atherosclerosis Risk in Communities (ARIC) study. Biostatistics. 2013;14:28–41. doi: 10.1093/biostatistics/kxs025.
  27. Kim S, Cai J, Lu W. More efficient estimators for case-cohort studies. Biometrika. 2013;100:695–708. doi: 10.1093/biomet/ast018.
  28. Kong L, Cai J. Case-cohort analysis with accelerated failure time model. Biometrics. 2009;65:135–142. doi: 10.1111/j.1541-0420.2008.01055.x.
  29. Kong L, Cai J, Sen PK. Weighted estimating equations for semiparametric transformation models with censored data from a case-cohort design. Biometrika. 2004;91:305–319.
  30. Kulich M, Lin DY. Additive hazards regression for case-cohort studies. Biometrika. 2000;87:73–87.
  31. Kulich M, Lin DY. Improving the efficiency of relative-risk estimation in case-cohort studies. J Am Stat Assoc. 2004;99:832–844.
  32. Lawless JF, Wild CJ, Kalbfleisch JD. Semiparametric methods for response-selective and missing data problems in regression. J R Stat Soc B. 1999;61:413–438.
  33. Li Z, Gilbert P, Nan B. Weighted likelihood method for grouped survival data in case-cohort studies with application to HIV vaccine trials. Biometrics. 2008;64:1247–1255. doi: 10.1111/j.1541-0420.2008.00998.x.
  34. Li Z, Nan B. Relative risk regression for current status data in case-cohort studies. Can J Stat. 2011;39:557–577.
  35. Lin DY, Ying Z. Cox regression with incomplete covariate measurements. J Am Stat Assoc. 1993;88:1341–1349.
  36. Lu S, Shih JH. Case-cohort designs and analysis for clustered failure time data. Biometrics. 2006;62:1138–1148. doi: 10.1111/j.1541-0420.2006.00584.x.
  37. Lu W, Tsiatis AA. Semiparametric transformation models for the case-cohort study. Biometrika. 2006;93:207–214.
  38. Nan B, Yu M, Kalbfleisch JD. Censored linear regression for case-cohort studies. Biometrika. 2006;93:747–762.
  39. Patil GP, Rao CR. Weighted distributions and size-biased sampling with applications to wildlife population and human families. Biometrics. 1978;34:179–189.
  40. Patil GP, Rao CR, Zelen M. Weighted distributions. In: Kotz S, Johnson NL, editors. Encyclopedia of statistical sciences. Wiley; New York: 1988. pp. 565–571.
  41. Prentice RL. A case-cohort design for epidemiologic studies and disease prevention trials. Biometrika. 1986;73:1–11.
  42. Qi L, Wang CY, Prentice RL. Weighted estimators for proportional hazards regression with missing covariates. J Am Stat Assoc. 2005;100:1250–1263.
  43. Qin J, Ning J, Liu H, Shen Y. Maximum likelihood estimations and EM algorithms with length-biased data. J Am Stat Assoc. 2011;106:1434–1449. doi: 10.1198/jasa.2011.tm10156.
  44. Qin J, Shen Y. Statistical methods for analyzing right-censored length-biased data under Cox model. Biometrics. 2010;66:382–392. doi: 10.1111/j.1541-0420.2009.01287.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Qin G, Zhou H. Partial linear inference for a 2-stage outcome-dependent sampling design with a continuous outcome. Biostatistics. 2011;12:506–520. doi: 10.1093/biostatistics/kxq070. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Samuelsen SO, Anestad H, Skrondal A. Stratified case-cohort analysis of general cohort sampling designs. Scand J Stat. 2007;34:103–119. [Google Scholar]
  47. Schildcrout JS, Heagerty PJ. On outcome dependent sampling designs for longitudinal binary response data with time-varying covariates. Biostatistics. 2008;9:735–749. doi: 10.1093/biostatistics/kxn006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Schildcrout JS, Mumford SL, Chen Z, Heagerty PJ, Rathouz PJ. Outcome dependent sampling for longitudinal binary response data based on a time-varying auxiliary variable. Stat Med. 2012;31:2441–2456. doi: 10.1002/sim.4359. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Schildcrout JS, Rathouz PJ. Longitudinal studies of binary response data following case-control and stratified case-control sampling: design and analysis. Biometrics. 2010;66:365–373. doi: 10.1111/j.1541-0420.2009.01306.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Schill W, Jockel KH, Drescher K, Timm J. Logistic analysis in case-control studies under validation sampling. Biometrika. 1993;80:339–352. [Google Scholar]
  51. Scott AJ, Wild CJ. Fitting logistic regression models in stratified case-control studies. Biometrics. 1991;47:497–510. [Google Scholar]
  52. Self SG, Prentice RL. Asymptotic distribution theory and efficiency results for case-cohort studies. Ann Stat. 1988;16:64–81. [Google Scholar]
  53. Shen Y, Ning J, Qin J. Analyzing length-biased data with semiparametric transformation and accelerated failure time models. J Am Stat Assoc. 2009;104:1192–1202. doi: 10.1198/jasa.2009.tm08614. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Song R, Zhou H, Kosorok MR. On semiparametric efficient inference for two-stage outcome dependent sampling with a continuous outcome. Biometrika. 2009;96:221–228. doi: 10.1093/biomet/asn073. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Sun J, Sun L, Flournoy N. Additive hazards model for competing risks analysis of the case-cohort design. Commun Stat Theor M. 2004;33:351–366. [Google Scholar]
  56. Tsai WY. Pseudo-partial likelihood for proportional hazards models with biased-sampling data. Biometrika. 2009;96:601–615. doi: 10.1093/biomet/asp026. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Vardi Y. Nonparametric estimation in the presence of length bias. Ann Stat. 1982;10:616–620. [Google Scholar]
  58. Vardi Y. Multiplicative censoring, renewal processes, deconvolution and decreasing density. Biometrika. 1989;76:751–761. [Google Scholar]
  59. Wang MC. Nonparametric estimation from cross-sectional survival data. J Am Stat Assoc. 1991;86:130–143. [Google Scholar]
  60. Wang MC. Hazards regression analysis for length-biased data. Biometrika. 1996;83:343–354. [Google Scholar]
  61. Wang X, Zhou H. A semiparametric empirical likelihood method for biased sampling schemes with auxiliary covariates. Biometrics. 2006;62:1149–1160. doi: 10.1111/j.1541-0420.2006.00612.x. [DOI] [PubMed] [Google Scholar]
  62. Wang X, Zhou H. Design and inference for cancer biomarker study with an outcome and auxiliary-dependent subsampling. Biometrics. 2010;66:502–511. doi: 10.1111/j.1541-0420.2009.01280.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Weaver MA. PhD Thesis. University of North Carolina; Chapel Hill: 2001. Semiparametric methods for continuous outcome regression models with covariate data from an outcome dependent subsample. [Google Scholar]
  64. Weaver MA, Zhou H. An estimated likelihood method for continuous outcome regression models with outcome-dependent sampling. J Am Stat Assoc. 2005;100:459–469. [Google Scholar]
  65. Weinberg CR, Wacholder S. Prospective analysis of case-control data under general multiplicative-intercept risk models. Biometrika. 1993;80:461–465. [Google Scholar]
  66. White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease. Am J Epidemiol. 1982;115:119–128. doi: 10.1093/oxfordjournals.aje.a113266. [DOI] [PubMed] [Google Scholar]
  67. Wicksell SD. The corpuscle problem: a mathematical study of a biometric problem. Biometrika. 1925;17:84–99. [Google Scholar]
  68. Yu J, Liu Y, Sandler DP, Zhou H. Statistical inference for the additive hazards model under outcome-dependent sampling. Can J Stat. 2015;43(3):436–453. doi: 10.1002/cjs.11257. [DOI] [PMC free article] [PubMed] [Google Scholar]
  69. Zhou H, Weaver MA, Qin J, Longnecker M, Wang MC. A semiparametric empirical likelihood method for data from an outcome dependent sampling scheme with a continuous outcome. Biometrics. 2002;58:413–421. doi: 10.1111/j.0006-341x.2002.00413.x. [DOI] [PubMed] [Google Scholar]
  70. Zhou H, Qin G, Longnecker MP. A partial linear model in the outcome-dependent sampling setting to evaluate the effect of prenatal PCB exposure on cognitive function in children. Biometrics. 2011a;67:876–885. doi: 10.1111/j.1541-0420.2010.01500.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  71. Zhou H, Song R, Qin J. Statistical inference for a two-stage outcome dependent sampling design with a continuous outcome. Biometrics. 2011b;67:194–202. doi: 10.1111/j.1541-0420.2010.01446.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  72. Zhou H, Wu Y, Liu Y, Cai J. Semiparametric inference for a 2-stage outcome-auxiliary-dependent sampling design with continuous outcome. Biostatistics. 2011c;12:521–534. doi: 10.1093/biostatistics/kxq080. [DOI] [PMC free article] [PubMed] [Google Scholar]
  73. Zhou H, You J, Qin G, Longnecker MP. A partially linear regression model for data from an outcome-dependent sampling design. J R Stat Soc C. 2011d;60:559–574. doi: 10.1111/j.1467-9876.2010.00756.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  74. Zhu H, Wang MC. Analysing bivariate survival data with interval sampling and application to cancer epidemiology. Biometrika. 2012;99:345–361. doi: 10.1093/biomet/ass009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  75. Zhu H, Wang MC. Nonparametric inference on bivariate survival data with interval sampling: association estimation and testing. Biometrika. 2014;101:519–533. doi: 10.1093/biomet/asu005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  76. Zhu H, Wang MC. A semi-stationary Copula model approach for bivariate survival data with interval sampling. Int J Biostat. 2015;11:151–173. doi: 10.1515/ijb-2013-0060. [DOI] [PMC free article] [PubMed] [Google Scholar]
