Journal of Applied Statistics. 2020 Apr 14;48(5):846–865. doi: 10.1080/02664763.2020.1752633

Regression analysis of case-cohort studies in the presence of dependent interval censoring

Mingyue Du a, Qingning Zhou b, Shishun Zhao a, Jianguo Sun c
PMCID: PMC7986575  NIHMSID: NIHMS1641804  PMID: 33767519

Abstract

The case-cohort design is widely used as a means of reducing the cost in large cohort studies, especially when the disease rate is low and covariate measurements may be expensive, and has been discussed by many authors. In this paper, we discuss regression analysis of case-cohort studies that produce interval-censored failure time with dependent censoring, a situation for which there does not seem to exist an established approach. For inference, a sieve inverse probability weighting estimation procedure is developed with the use of Bernstein polynomials to approximate the unknown baseline cumulative hazard functions. The proposed estimators are shown to be consistent and the asymptotic normality of the resulting regression parameter estimators is established. A simulation study is conducted to assess the finite sample properties of the proposed approach and indicates that it works well in practical situations. The proposed method is applied to an HIV/AIDS case-cohort study that motivated this investigation.

Keywords: Case-cohort design, dependent interval censoring, inverse probability weighting, proportional hazards model

1. Introduction

The case-cohort design is widely used as a means of reducing the cost in large cohort studies, especially when the disease rate is low and covariate measurements may be expensive (Prentice [27]; Scheike and Martinussen [30]; Self and Prentice [32]). Under this design, instead of collecting the covariate information on all study subjects, one collects it only on the subjects whose failures are observed and on a subsample of the remaining subjects. Among others, one area where the design is often used is epidemiological cohort studies in which the outcomes of interest are times to failure events such as AIDS, cancer, heart disease and HIV infection. For such studies, in addition to the incomplete covariate information, another feature is that the observations are usually interval-censored rather than right-censored due to the periodic follow-up nature of the study (Sun [34]).

By interval-censored data, we usually mean that the failure time of interest is known or observed only to belong to an interval instead of being observed exactly. It is easy to see that interval-censored data include right-censored data as a special case. Furthermore, sometimes one may also face informative censoring, meaning that the failure time of interest and the censoring mechanism are correlated (Huang and Wolfe [13]; Wang et al. [37]). An example of informatively interval-censored data may arise in a periodic follow-up study of a certain disease where study subjects may not follow the pre-specified visit schedules and instead pay clinical visits according to their disease status or how they feel with respect to their treatments. Among others, Huang and Wolfe [13] and Sun [33] discussed the issue and pointed out that in the presence of informative censoring, the analysis that ignores it may result in biased or misleading results or conclusions. More discussion on informatively interval-censored data can be found in Sun [34].

One real study that motivated this investigation is the HVTN 505 Trial to assess the efficacy of a DNA prime-recombinant adenovirus type 5 boost (DNA/rAd5) vaccine to prevent human immunodeficiency virus type 1 (HIV-1) infection (Fong et al. [8]; Hammer et al. [10]; Janes et al. [14]). It is well-known that HIV-1 infection is deadly as it causes AIDS for which there is no cure and thus it is important and essential to develop a safe and effective vaccine for the prevention of the infection. The original study consists of 2504 men or transgender women who had sex with men and who were examined periodically, thus yielding only interval-censored data on the time to HIV-1 infection. For each subject, the information on four demographic covariates, age, race, BMI and behavioural risk, was collected, and in addition, for a subgroup of HIV infection cases and non-cases, a number of T cell response biomarkers and antibody response biomarkers were also measured. One goal of the study is to determine or identify the important or relevant covariates or biomarkers for HIV-1 infection.

Many authors have discussed the analysis of case-cohort studies but most of the existing methods are for right-censored failure time data. For example, some of the early work on this was given by Prentice [27] and Self and Prentice [32], who proposed some pseudolikelihood approaches based on the modification of the commonly used partial likelihood method under the proportional hazards model. By following them, Chen and Lo [3] proposed an estimating equation approach that yields more efficient estimators than the pseudolikelihood estimator proposed in Prentice [27], and Chen [2] developed an estimating equation approach that applies to a class of cohort sampling designs, including the case-cohort design with the key estimating function constructed by a sample reuse method via local averaging. Also Marti and Chavance [25] and Keogh and White [18] proposed some multiple imputation methods and in particular, the latter method extended the former by considering more complex imputation models that include time and interaction or nonlinear terms. In addition, Kang and Cai [17] and Kim et al. [19] developed weighted estimating equation approaches for case-cohort studies with multiple disease outcomes, where the latter method improved the efficiency upon the former by utilizing more information in constructing the weights.

Interval-censored failure time data naturally occur in many areas, especially in the studies with periodic follow-ups, and a great deal of literature has been developed for their analysis (Chen et al. [5]; Finkelstein [7]; Sun [34]; Zhou et al. [40]). In particular, Sun [34] and Bogaerts et al. [1] provided comprehensive reviews of the existing literature on interval-censored data. Although there also exist some methods for either informatively interval-censored data or the interval-censored data arising from case-cohort studies, there does not seem to exist an established procedure for informatively interval-censored data arising from case-cohort studies. In particular, for the analysis of informatively interval-censored data, two types of approaches are commonly used and they are the frailty model approach and the copula model approach. For example, Zhang et al. (2005, 2007) and Wang et al. [36,38] gave some frailty model estimation procedures, while Ma et al. [23,24] and Zhao et al. (2015) proposed some copula model methods. For the analysis of the interval-censored data arising from case-cohort studies, Gilbert et al. [9] presented a midpoint imputation procedure and Li and Nan [20] considered a special case of interval-censored data, current status data, where the failure time of interest is either left- or right-censored (Jewell and van der Laan [15]). Also Zhou et al. [41] proposed a likelihood-based approach. However, all three of the methods above assume that the interval censoring mechanism is non-informative or independent of the failure time of interest. As discussed by many authors and above, informative censoring is a serious and difficult issue and the use of the methods that do not take it into account can yield biased or misleading results and conclusions (Huang and Wolfe [13]; Ma et al. [23]). In the following, we will develop a frailty model approach, a generalization of the method proposed in Zhou et al. [41], for the analysis of the case-cohort studies yielding interval-censored data with informative censoring.

The remainder of the paper is organized as follows. We will begin in Section 2 with introducing some notation and models to be used throughout the paper and in particular, we will present joint frailty models for the failure time of interest and the underlying censoring mechanism. To estimate regression parameters, a sieve inverse probability weighting estimation procedure is then presented in Section 3 and in the method, Bernstein polynomials are employed to approximate unknown functions. Furthermore, we establish the consistency and asymptotic normality of the resulting estimators of regression parameters and provide a weighted bootstrap procedure for variance estimation. Section 4 presents some results obtained from an extensive simulation study conducted to assess the finite sample properties of the proposed methodology and they suggest that the method works well in practical situations. In Section 5, we apply the proposed method to the HIV/AIDS study described above and Section 6 gives some discussion and concluding remarks.

2. Notation and models

Consider a failure time study that consists of $n$ independent subjects. For subject $i$, let $T_i$ denote the failure time of interest and suppose that there exists a $p$-dimensional vector of covariates denoted by $Z_i$ that may affect $T_i$, $i = 1, \ldots, n$. Also for subject $i$, suppose that there exist two examination times denoted by $U_i$ and $V_i$ with $U_i \le V_i$ and one only observes $\Delta_{1i} = I(T_i \le U_i)$ and $\Delta_{2i} = I(U_i < T_i \le V_i)$, indicating if the failure time $T_i$ is left-censored and interval-censored, respectively. Note that here $U_i$ and $V_i$ are random variables and assumed to be observed and they together with $\Delta_{1i}$ and $\Delta_{2i}$ give the observed interval-censored data on the $T_i$'s (Sun [34]; Zhou et al. [41]).

For the case-cohort studies, as mentioned above, the information on covariates is available only for the subjects who either have experienced the failure event of interest, that is, with $\Delta_{1i} = 1$ or $\Delta_{2i} = 1$, or are from the subcohort that is a random sample of the entire cohort. Define $\xi_i = 1$ if the covariate $Z_i$ is available or observed and 0 otherwise, $i = 1, \ldots, n$. For the selection of the subcohort, by following Zhou et al. [40] and others, we will consider the independent Bernoulli sampling with the selection probability $q \in (0, 1)$. Then under the assumption above, the probability that the covariate $Z_i$ is observed is given by

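$$\Pr(\xi_i = 1) = \pi_q(\Delta_{1i}, \Delta_{2i}) = \Delta_{1i} + \Delta_{2i} + (1 - \Delta_{1i} - \Delta_{2i})\, q,$$

$i = 1, \ldots, n$, and the observed data have the form

$$O^{\xi} = \{O_i^{\xi} = (U_i, V_i, \Delta_{1i}, \Delta_{2i}, \xi_i, \xi_i Z_i);\ i = 1, \ldots, n\}.$$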

In contrast, if all covariates were observed, the full cohort data would be

$$O = \{O_i = (U_i, V_i, \Delta_{1i}, \Delta_{2i}, Z_i);\ i = 1, \ldots, n\}.$$
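The sampling scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the seed and the toy indicator vectors are hypothetical and do not come from the study, and the last line simply evaluates $\xi_i / \pi_q(\Delta_{1i}, \Delta_{2i})$, the inverse probability weight used in the weighted log-likelihood.

```python
import random


def selection_probability(delta1, delta2, q):
    """pi_q(delta1, delta2) = delta1 + delta2 + (1 - delta1 - delta2) * q."""
    return delta1 + delta2 + (1 - delta1 - delta2) * q


def case_cohort_indicators(delta1, delta2, q, rng):
    """Return xi_i = 1 when subject i is a case or falls in the Bernoulli(q) subcohort."""
    xi = []
    for d1, d2 in zip(delta1, delta2):
        in_subcohort = rng.random() < q   # independent Bernoulli sampling of the subcohort
        is_case = d1 == 1 or d2 == 1      # failure observed (left- or interval-censored)
        xi.append(1 if (is_case or in_subcohort) else 0)
    return xi


rng = random.Random(2020)                 # arbitrary seed, for reproducibility only
delta1 = [1, 0, 0, 0, 0, 0, 0, 0]         # toy indicators, not study data
delta2 = [0, 1, 0, 0, 0, 0, 0, 0]
q = 0.1
xi = case_cohort_indicators(delta1, delta2, q, rng)

# Inverse probability weights xi_i / pi_q: 1 for cases, 1/q for sampled non-cases, 0 otherwise.
weights = [x / selection_probability(d1, d2, q) for x, d1, d2 in zip(xi, delta1, delta2)]
```

Cases always receive weight one, while each sampled non-case stands in for roughly $1/q$ non-cases of the full cohort.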

To describe the covariate effects and dependent interval censoring, define $W_i = V_i - U_i$, $i = 1, \ldots, n$. By following Ma et al. [23], we will focus on the situation where the dependent censoring can be characterized by the correlation between the $T_i$'s and $W_i$'s. As mentioned in Ma et al. [23], one example where this may be the case is follow-up studies where some study subjects may tend to pay more or fewer clinical visits than the scheduled ones. More comments on this will be given below. For the covariate effects, we assume that there exists a latent variable $b_i$ with mean one and known distribution but unknown variance $\eta$ and given $Z_i$ and $b_i$, the hazard functions of $T_i$ and $W_i$ have the forms

$$\lambda_i^{(T)}(t \mid Z_i, b_i) = \lambda_t(t) \exp(\beta_t^{\top} Z_i)\, b_i, \qquad (1)$$

and

$$\lambda_i^{(W)}(t \mid Z_i, b_i) = \lambda_w(t) \exp(\beta_w^{\top} Z_i)\, b_i, \qquad (2)$$

respectively. In the above, $\lambda_t(t)$ and $\lambda_w(t)$ are unknown baseline hazard functions and $\beta_t$ and $\beta_w$ are $p \times 1$ vectors of unknown regression parameters. Also it will be assumed that given $Z_i$ and $b_i$, $W_i$ is independent of $U_i$, and $T_i$ and $W_i$ are independent. In other words, the correlation between $T_i$ and $W_i$ is measured by the parameter $\eta$. More comments on this are given below.

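Define $\Delta_i = (\Delta_{1i}, \Delta_{2i})$ and $\theta = (\beta_t, \beta_w, \Lambda_t, \Lambda_w, \eta)$, where $\Lambda_t(t) = \int_0^t \lambda_t(u)\,du$ and $\Lambda_w(t) = \int_0^t \lambda_w(u)\,du$. Assume that $b_i$ is independent of $(U_i, Z_i)$ and the joint distribution of $(U_i, Z_i)$ does not involve the parameters of interest. To motivate the proposed estimation procedure, note that conditional on $(W_i, U_i, Z_i, b_i)$, the likelihood of the observation from subject $i$ has the form

$$L_{\Delta_i \mid W_i, U_i, b_i}(\theta) = \left[1 - \exp\{-\Lambda_t(U_i) \exp(\beta_t^{\top} Z_i) b_i\}\right]^{\Delta_{1i}} \left[\exp\{-\Lambda_t(U_i) \exp(\beta_t^{\top} Z_i) b_i\} - \exp\{-\Lambda_t(V_i) \exp(\beta_t^{\top} Z_i) b_i\}\right]^{\Delta_{2i}} \left[\exp\{-\Lambda_t(V_i) \exp(\beta_t^{\top} Z_i) b_i\}\right]^{1 - \Delta_{1i} - \Delta_{2i}}.$$

Also note that conditional on $(Z_i, b_i)$, the likelihood of the observation on $W_i$ is given by

$$L_{W_i \mid b_i}(\theta) = \left\{\lambda_w(W_i) \exp(\beta_w^{\top} Z_i)\, b_i \exp\{-\Lambda_w(W_i) \exp(\beta_w^{\top} Z_i) b_i\}\right\}^{\Psi_i},$$

where $\Psi_i = I(W_i < \infty)$. This motivates the following inverse probability weighted log-likelihood function

$$l_{O^{\xi}}(\theta) = \sum_{i=1}^{n} l_i(\theta; O_i^{\xi}) = \sum_{i=1}^{n} p_i\, l_i(\theta; O_i) = \sum_{i=1}^{n} p_i \log\left\{\int L_{\Delta_i \mid W_i, U_i, b_i}(\theta)\, L_{W_i \mid b_i}(\theta)\, f(b_i; \eta)\, db_i\right\} \qquad (3)$$

for estimation of $\theta$, where $f(b_i; \eta)$ denotes the density function of the $b_i$'s and

$$p_i = \frac{\xi_i}{\pi_q(\Delta_{1i}, \Delta_{2i})} = \frac{\xi_i}{\Delta_{1i} + \Delta_{2i} + (1 - \Delta_{1i} - \Delta_{2i})\, q}.$$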

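If $f$ is the gamma density, the function $l_{O^{\xi}}(\theta)$ has a closed form as

$$\begin{aligned} l_{O^{\xi}}(\theta) = \sum_{i=1}^{n} p_i \log\Big\{ & \big(\lambda_w(W_i) \exp(\beta_w^{\top} Z_i)\big)^{\Psi_i} \\ & \times \Big[\big(1 + \eta \Lambda_w(W_i) \exp(\beta_w^{\top} Z_i) \Psi_i\big)^{-\eta^{-1} - \Psi_i} - \big(1 + \eta \Lambda_t(U_i) \exp(\beta_t^{\top} Z_i) + \eta \Lambda_w(W_i) \exp(\beta_w^{\top} Z_i) \Psi_i\big)^{-\eta^{-1} - \Psi_i}\Big]^{\Delta_{1i}} \\ & \times \Big[\big(1 + \eta \Lambda_t(U_i) \exp(\beta_t^{\top} Z_i) + \eta \Lambda_w(W_i) \exp(\beta_w^{\top} Z_i) \Psi_i\big)^{-\eta^{-1} - \Psi_i} - \big(1 + \eta \Lambda_t(U_i + W_i) \exp(\beta_t^{\top} Z_i) + \eta \Lambda_w(W_i) \exp(\beta_w^{\top} Z_i) \Psi_i\big)^{-\eta^{-1} - \Psi_i}\Big]^{\Delta_{2i}} \\ & \times \Big[\big(1 + \eta \Lambda_t(U_i + W_i) \exp(\beta_t^{\top} Z_i) + \eta \Lambda_w(W_i) \exp(\beta_w^{\top} Z_i) \Psi_i\big)^{-\eta^{-1} - \Psi_i}\Big]^{1 - \Delta_{1i} - \Delta_{2i}} \Big\}. \qquad (4) \end{aligned}$$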
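A brief sketch of where the closed form comes from may be helpful. It assumes that the gamma frailty is parameterized with shape $1/\eta$ and rate $1/\eta$, which is the parameterization implied by mean one and variance $\eta$; the symbols $k$ and $s$ are introduced here only for illustration.

```latex
% Gamma frailty with mean one and variance eta (shape 1/eta, rate 1/eta):
% for k in {0, 1} and any s >= 0,
\[
  E\{ b^{k} e^{-s b} \}
  = \int_0^\infty b^{k} e^{-s b}\,
    \frac{\eta^{-1/\eta}}{\Gamma(1/\eta)}\, b^{1/\eta - 1} e^{-b/\eta}\, db
  = (1 + \eta s)^{-\eta^{-1} - k}.
\]
% Example (right-censored subject with finite W_i):
% take k = \Psi_i = 1 and
% s = \Lambda_t(V_i) \exp(\beta_t^\top Z_i) + \Lambda_w(W_i) \exp(\beta_w^\top Z_i).
```

Each bracketed factor in (4) is a difference of two such expectations, with $k = \Psi_i$ and $s$ equal to the accumulated hazard of $T_i$ up to the relevant examination time plus, when $\Psi_i = 1$, that of $W_i$.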

In the next section, for estimation of $\theta$, we will discuss the maximization of the inverse probability weighted log-likelihood function $l_{O^{\xi}}(\theta)$.
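As an illustration of how one subject's contribution to the closed-form log-likelihood under the gamma frailty might be evaluated, the following Python sketch takes the already-evaluated quantities $\Lambda_t(U_i)$, $\Lambda_t(V_i)$, $\Lambda_w(W_i)$, $\lambda_w(W_i)$, $\exp(\beta_t^{\top} Z_i)$ and $\exp(\beta_w^{\top} Z_i)$ as inputs. The function name and argument names are hypothetical, and this is not the authors' implementation.

```python
from math import log


def subject_loglik(delta1, delta2, psi, Lt_u, Lt_v, Lw_w, lw_w, et, ew, eta):
    """Log of the frailty-integrated likelihood of one subject (gamma frailty, mean 1, variance eta).

    Lt_u, Lt_v : Lambda_t(U_i), Lambda_t(V_i)
    Lw_w, lw_w : Lambda_w(W_i), lambda_w(W_i) (ignored when psi == 0)
    et, ew     : exp(beta_t' Z_i), exp(beta_w' Z_i)
    """
    power = -1.0 / eta - psi
    base = 1.0 + (eta * Lw_w * ew if psi else 0.0)   # contribution of W_i, present only if W_i is finite
    s0 = base ** power                               # E{b^psi exp(-psi * Lambda_w * ew * b)}
    su = (base + eta * Lt_u * et) ** power           # additionally "survive" past U_i
    sv = (base + eta * Lt_v * et) ** power           # additionally "survive" past V_i
    if delta1 == 1:
        prob = s0 - su                               # left-censored: T_i <= U_i
    elif delta2 == 1:
        prob = su - sv                               # interval-censored: U_i < T_i <= V_i
    else:
        prob = sv                                    # right-censored: T_i > V_i
    density_part = log(lw_w * ew) if psi else 0.0
    return density_part + log(prob)


def weighted_loglik(contributions, weights):
    """Inverse probability weighted log-likelihood: sum of p_i * l_i."""
    return sum(p * l for p, l in zip(weights, contributions))
```

Subjects with $\xi_i = 0$ have weight zero, so in practice their contributions need not be computed at all.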

3. Sieve inverse probability weighting estimation

Define the parameter space of θ

$$\Theta = \{\theta = (\beta_t, \beta_w, \eta, \psi) : \psi = (\Lambda_t(t), \Lambda_w(t))\} = \mathcal{B} \otimes \mathcal{M}_1 \otimes \mathcal{M}_2,$$

where $\mathcal{B} = \{(\beta_t, \beta_w, \eta) \in R^{2p} \times R^{+},\ \|\beta_t\| + \|\beta_w\| + \|\eta\| \le M\}$ with $M$ being a positive constant and $\mathcal{M}_j$ denotes the collection of all bounded and continuous nondecreasing, nonnegative functions over the interval $[\sigma_j, \tau_j]$, $j = 1, 2$. In practice, $[\sigma_1, \tau_1]$ is usually taken to be the range of the $U_i$'s and $V_i$'s and $[\sigma_2, \tau_2]$ the range of the $W_i$'s. More comments on this are given below. For the maximization of the inverse probability weighted log-likelihood function $l_{O^{\xi}}(\theta)$, it is easy to see that this would not be straightforward since $l_{O^{\xi}}(\theta)$ involves unknown functions $\Lambda_t(t)$ and $\Lambda_w(t)$. To deal with this and by following Ma et al. [24], Zhou et al. [40] and others, we propose first to approximate the two functions by Bernstein polynomials.

More specifically, define the sieve space

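$$\Theta_n = \{\theta_n = (\beta_t, \beta_w, \eta, \psi_n) : \psi_n = (\Lambda_{tn}(t), \Lambda_{wn}(t))\} = \mathcal{B} \otimes \mathcal{M}_{n1} \otimes \mathcal{M}_{n2},$$

with

$$\mathcal{M}_{n1} = \Big\{\Lambda_{tn} : \Lambda_{tn}(t) = \sum_{k=0}^{m} \phi_{k1} B_k(t, m, \sigma_1, \tau_1),\ \phi_{m1} \ge \cdots \ge \phi_{11} \ge \phi_{01} \ge 0,\ \sum_{k=0}^{m} |\phi_{k1}| \le M_n\Big\},$$

and

$$\mathcal{M}_{n2} = \Big\{\Lambda_{wn} : \Lambda_{wn}(w) = \sum_{k=0}^{m} \phi_{k2} B_k(w, m, \sigma_2, \tau_2),\ \phi_{m2} \ge \cdots \ge \phi_{12} \ge \phi_{02} \ge 0,\ \sum_{k=0}^{m} |\phi_{k2}| \le M_n\Big\}.$$

In the above,

$$B_k(t, m, \sigma_1, \tau_1) = \binom{m}{k} \Big(\frac{t - \sigma_1}{\tau_1 - \sigma_1}\Big)^{k} \Big(1 - \frac{t - \sigma_1}{\tau_1 - \sigma_1}\Big)^{m - k},$$

and

$$B_k(w, m, \sigma_2, \tau_2) = \binom{m}{k} \Big(\frac{w - \sigma_2}{\tau_2 - \sigma_2}\Big)^{k} \Big(1 - \frac{w - \sigma_2}{\tau_2 - \sigma_2}\Big)^{m - k},$$

$k = 0, \ldots, m$, which are Bernstein polynomials of degree $m = o(n^{\nu})$ for some $\nu \in (0, 1)$. Note that some restrictions are needed above on the parameters since $\Lambda_t(t)$ and $\Lambda_w(w)$ are nonnegative and nondecreasing functions. However, these restrictions can be easily removed by some reparameterization. For example, one can reparameterize the parameters $\{\phi_{0j}, \ldots, \phi_{mj}\}$ by the cumulative sums of the parameters $\{\exp(\phi^{*}_{0j}), \ldots, \exp(\phi^{*}_{mj})\}$, $j = 1, 2$, where the $\phi^{*}_{kj}$'s are unconstrained.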
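The Bernstein approximation and the cumulative-sum reparameterization can be illustrated with a short Python sketch. The function names and the example coefficient values are hypothetical; the only facts used are the basis formula above and the property that a Bernstein polynomial with nondecreasing coefficients is itself nondecreasing.

```python
from math import comb, exp


def bernstein_basis(t, k, m, sigma, tau):
    """B_k(t, m, sigma, tau): k-th Bernstein basis polynomial of degree m on [sigma, tau]."""
    x = (t - sigma) / (tau - sigma)
    return comb(m, k) * x ** k * (1.0 - x) ** (m - k)


def monotone_coefficients(phi_star):
    """Cumulative sums of exp(phi_star_k): nonnegative and nondecreasing by construction."""
    coefficients, running_total = [], 0.0
    for value in phi_star:
        running_total += exp(value)
        coefficients.append(running_total)
    return coefficients


def cumulative_hazard(t, phi_star, sigma, tau):
    """Bernstein approximation of a cumulative hazard with unconstrained parameters phi_star."""
    m = len(phi_star) - 1
    coefficients = monotone_coefficients(phi_star)
    return sum(c * bernstein_basis(t, k, m, sigma, tau) for k, c in enumerate(coefficients))


# Degree m = 3 on [0, 1] with arbitrary unconstrained values (illustrative only).
phi_star = [0.0, -1.0, 0.5, -2.0]
grid_values = [cumulative_hazard(i / 10, phi_star, 0.0, 1.0) for i in range(11)]
```

Because the optimizer then works with the unconstrained values, no inequality constraints on the coefficients are needed during maximization.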

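Let $\hat{\theta}_n = (\hat{\beta}_{tn}, \hat{\beta}_{wn}, \hat{\eta}_n, \hat{\Lambda}_{tn}, \hat{\Lambda}_{wn})$ denote the estimator of $\theta$ given by the value of $\theta$ that maximizes the inverse probability weighted log-likelihood function $l_{O^{\xi}}(\theta)$ over the sieve space $\Theta_n$. Also let $\theta_0 = (\beta_{t0}, \beta_{w0}, \eta_0, \Lambda_{t0}, \Lambda_{w0})$ denote the true value of $\theta$, $\hat{\vartheta}_n = (\hat{\beta}_{tn}, \hat{\beta}_{wn}, \hat{\eta}_n)$, $\vartheta_0 = (\beta_{t0}, \beta_{w0}, \eta_0)$, and for any $\theta_1 = (\beta_{t1}, \beta_{w1}, \eta_1, \Lambda_{t1}, \Lambda_{w1})$ and $\theta_2 = (\beta_{t2}, \beta_{w2}, \eta_2, \Lambda_{t2}, \Lambda_{w2})$ in the parameter space $\Theta$, define the distance

$$d(\theta_1, \theta_2) = \left\{\|\beta_{t1} - \beta_{t2}\|^2 + \|\beta_{w1} - \beta_{w2}\|^2 + \|\eta_1 - \eta_2\|^2 + \|\Lambda_{t1} - \Lambda_{t2}\|_2^2 + \|\Lambda_{w1} - \Lambda_{w2}\|_2^2\right\}^{1/2}.$$

Here $\|v\|$ denotes the Euclidean norm for a vector $v$, $\|\Lambda_{t1} - \Lambda_{t2}\|_2^2 = \int \left[(\Lambda_{t1}(u) - \Lambda_{t2}(u))^2 + \psi\, (\Lambda_{t1}(u + w) - \Lambda_{t2}(u + w))^2\right] dG(u, w)$, and $\|\Lambda_{w1} - \Lambda_{w2}\|_2^2 = \int \psi\, [\Lambda_{w1}(w) - \Lambda_{w2}(w)]^2\, dG(u, w)$ with $G(u, w)$ denoting the joint distribution function of $U$ and $W$. The following two theorems establish the asymptotic properties of $\hat{\theta}_n$.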

Theorem 3.1

Suppose that the regularity conditions (C1)–(C4) given in the Appendix hold. Then as $n \to \infty$, we have that $d(\hat{\theta}_n, \theta_0) \to 0$ almost surely and $d(\hat{\theta}_n, \theta_0) = O_p\big(n^{-\min\{(1 - \nu)/2,\ \nu r/2\}}\big)$, where $\nu \in (0, 1)$ is defined in $m = o(n^{\nu})$ and $r$ in the regularity condition (C3).

Theorem 3.2

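Suppose that the regularity conditions (C1)–(C5) given in the Appendix hold. Then as $n \to \infty$ and if $\nu > 1/(2r)$, we have that

$$n^{1/2}(\hat{\vartheta}_n - \vartheta_0) = I^{-1}(\vartheta_0)\, n^{-1/2} \sum_{i=1}^{n} p_i\, l(\vartheta_0, O_i) + o_p(1) \to N(0, \Sigma)$$

in distribution, where

$$\Sigma = I^{-1}(\vartheta_0) + I^{-1}(\vartheta_0)\, E\left\{\frac{1 - \pi_q(\Delta_1, \Delta_2)}{\pi_q(\Delta_1, \Delta_2)}\, \{l(\vartheta_0, O)\}^{\otimes 2}\right\} I^{-1}(\vartheta_0)$$

with $v^{\otimes 2} = v v^{\top}$ for a vector $v$ and $I(\vartheta)$ and $l(\vartheta, O)$, given in the Appendix, denoting the information matrix and efficient score for $\vartheta = (\beta_t, \beta_w, \eta)$ based on the complete data.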

The proof of the results given above is sketched in the Appendix. For the determination of the proposed estimator $\hat{\theta}_n$, different methods can be used and in the numerical studies below, the Matlab function fmincon is used. Also for the determination of $\hat{\theta}_n$, one needs to choose or specify the degree $m$ of Bernstein polynomials, which controls the smoothness of the approximation. For this, one common approach is to perform the grid search by considering different values of $m$ and choosing the one that minimizes

$$\mathrm{AIC} = -2\, l_{O^{\xi}}(\hat{\theta}_n) + 2(2p + 2m + 3)$$

based on the AIC criterion. Note that instead of this, one may employ other criteria such as the BIC criterion and the numerical results indicate that they give similar performance. Also note that in the approximation of $\Lambda_t$ and $\Lambda_w$, we used the same degree $m$ and in practice, different $m$ could be used too.
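The grid search over $m$ can be sketched as follows. The penalty $2(2p + 2m + 3)$ counts the $2p$ regression coefficients, the frailty variance and the $2(m + 1)$ Bernstein coefficients. The maximized log-likelihood values in the example are made up purely to show the mechanics and are not results from the paper; the function names are hypothetical.

```python
def aic(max_loglik, p, m):
    """AIC = -2 * maximized weighted log-likelihood + 2 * (2p + 2m + 3)."""
    return -2.0 * max_loglik + 2.0 * (2 * p + 2 * m + 3)


def select_degree(max_loglik_by_degree, p):
    """Return the Bernstein degree m with the smallest AIC among the candidates."""
    return min(max_loglik_by_degree, key=lambda m: aic(max_loglik_by_degree[m], p, m))


# Hypothetical maximized log-likelihoods for m = 2, 3, 4 with p = 6 covariates.
candidates = {2: -510.0, 3: -505.0, 4: -504.5}
best_m = select_degree(candidates, p=6)
```

In an actual analysis each candidate value would come from refitting the model over the sieve space with the corresponding degree.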

For inference about $\vartheta_0 = (\beta_{t0}, \beta_{w0}, \eta_0)$, of course, one needs to estimate the covariance matrix of $\hat{\vartheta}_n = (\hat{\beta}_{tn}, \hat{\beta}_{wn}, \hat{\eta}_n)$. For this, a natural way would be to derive a consistent estimator of $\Sigma$. On the other hand, one could see from the Appendix that $\Sigma$ involves the information matrix $I(\vartheta_0)$ and the efficient score $l(\vartheta_0, O)$ and neither of them has a closed form. Thus, it would be difficult to derive a consistent estimator and instead we propose to employ the weighted bootstrap procedure discussed in Ma and Kosorok [22], which is easy to implement and seems to work well in the numerical studies described below. Specifically, let $\{u_1, \ldots, u_n\}$ denote $n$ independent realizations of a bounded positive random variable $u$ satisfying $E(u) = 1$ and $\mathrm{var}(u) = \epsilon_0 < \infty$ and define the new weights $p_i^{*} = u_i p_i$, $i = 1, \ldots, n$. Also let $\hat{\vartheta}_n^{*}$ denote the estimator of $\vartheta$ proposed above with the $p_i$'s replaced by the $p_i^{*}$'s. Then if we repeat this $B$ times, one can estimate the covariance matrix of $\hat{\vartheta}_n$ by the sample covariance matrix of the $\hat{\vartheta}_n^{*}$'s. By following Ma and Kosorok [22], it can be shown that this weighted bootstrap variance estimator is consistent.
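The weighted bootstrap loop can be sketched in Python for a scalar parameter. Several things here are assumptions made only for illustration: the `fit` argument stands for any routine that returns an estimate from a vector of weights, a weighted mean is used as a toy stand-in for the sieve maximization, and the multipliers are drawn from the unit exponential distribution (the choice reported in the simulation study). For a vector parameter one would collect the replicates and form their sample covariance matrix instead.

```python
import random
from statistics import variance


def weighted_bootstrap_variance(fit, weights, n_boot, rng):
    """Variance estimate from n_boot refits with perturbed weights p_i* = u_i * p_i."""
    replicates = []
    for _ in range(n_boot):
        multipliers = [rng.expovariate(1.0) for _ in weights]   # E(u) = 1, var(u) = 1
        perturbed = [u * p for u, p in zip(multipliers, weights)]
        replicates.append(fit(perturbed))
    return variance(replicates)                                  # sample variance of the refits


def make_weighted_mean(y):
    """Toy estimator (not the proposed one): weighted mean of y."""
    def fit(w):
        return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return fit


y = [1.0, 2.0, 4.0, 3.0, 5.0]            # toy outcomes
p = [1.0, 10.0, 0.0, 10.0, 1.0]          # toy inverse probability weights (0 = covariate unobserved)
var_hat = weighted_bootstrap_variance(make_weighted_mean(y), p, n_boot=200, rng=random.Random(1))
```

Subjects with zero weight stay at zero after perturbation, so each refit uses the same case-cohort sample.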

4. A simulation study

In this section, we report some results obtained from a simulation study conducted to evaluate the finite sample performance of the inverse probability weighted estimation procedure proposed in the previous sections. In the study, it was assumed that the covariate $Z$ followed the Bernoulli distribution with the success probability of 0.5 and to generate the subcohort, as mentioned above, we considered the independent Bernoulli sampling with the selection probability being 0.1. For the proportion of the observed failure events or the event rate, we studied several cases including $p_e = 0.05$, 0.1 and 0.2. To generate interval-censored data, we first generated the $U_i$'s from the uniform distribution over $(0, a)$ with $a$ being a positive constant and the latent variable $b_i$'s. Then the $T_i$'s and $W_i$'s were generated based on models (1) and (2) with $\lambda_t = 0.2t$, $0.1t$ or $4t/9$, $\lambda_w = 12t$ and the $V_i$'s were defined as $V_i = U_i + W_i$ for all $i$. The results given below are based on the full cohort size $n = 1000$ or 2000 with 1000 replications.
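A minimal sketch of how one subject's data could be generated under this set-up is given below. It makes several assumptions that are not stated in the text: $\lambda_t$ and $\lambda_w$ are read as baseline hazards, so that $\Lambda_t(t) = 0.1 t^2$ and $\Lambda_w(t) = 6 t^2$; the frailty is gamma with shape $1/\eta$ and scale $\eta$; the failure and gap times are drawn by inverting the conditional survival function; and the value of $a$ used in the example is arbitrary. The function name is hypothetical.

```python
import math
import random


def simulate_subject(beta_t, beta_w, eta, a, rng):
    """Draw (U, V, Delta1, Delta2, Z) for one subject under gamma-frailty models (1) and (2)."""
    z = 1 if rng.random() < 0.5 else 0             # Z ~ Bernoulli(0.5)
    b = rng.gammavariate(1.0 / eta, eta)           # frailty with mean 1 and variance eta
    u = rng.uniform(0.0, a)                        # first examination time
    # Inverse transform: Lambda(T) * exp(beta * Z) * b ~ Exp(1).
    t = math.sqrt(rng.expovariate(1.0) / (0.1 * math.exp(beta_t * z) * b))   # Lambda_t(t) = 0.1 t^2
    w = math.sqrt(rng.expovariate(1.0) / (6.0 * math.exp(beta_w * z) * b))   # Lambda_w(t) = 6 t^2
    v = u + w                                      # second examination time
    delta1 = 1 if t <= u else 0                    # left-censored
    delta2 = 1 if u < t <= v else 0                # interval-censored
    return {"U": u, "V": v, "Delta1": delta1, "Delta2": delta2, "Z": z}


rng = random.Random(0)                             # arbitrary seed
cohort = [simulate_subject(0.2, 0.2, 0.8, a=1.0, rng=rng) for _ in range(1000)]
event_rate = sum(s["Delta1"] + s["Delta2"] for s in cohort) / len(cohort)
```

In such a scheme the constant $a$ would be tuned until `event_rate` is close to the target $p_e$.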

Table 1 presents the results obtained on the proposed estimators $\hat{\beta}_{tn}$, $\hat{\beta}_{wn}$ and $\hat{\eta}_n$ with $n = 1000$, the true values of the parameters being $\beta_{t0} = \beta_{w0} = 0$, 0.2 or 0.5 and $\eta_0 = 0.8$, and the $b_i$'s following the gamma distribution. The results include the estimated bias (Bias) given by the average of the proposed estimates minus the true value, the sample standard error (SSE), the average of the estimated standard errors (ESE) and the 95% empirical coverage probability (CP). Here we took the degree of Bernstein polynomials being $m = 3$ and the weighted bootstrap sample size $B = 100$ for variance estimation. Also for the variance estimation, we generated the random sample $\{u_1, \ldots, u_n\}$ repeatedly from the exponential distribution. Table 2 gives the estimation results obtained under the same set-up as above except $n = 2000$. One can see from the two tables that the proposed estimator seems to be unbiased and the weighted bootstrap variance estimation procedure seems to work well. Also they indicate that the normal approximation to the distribution of the proposed estimator appears to be reasonable. In addition, as expected, the estimation results became better when the percentage of the observed failure events or the full cohort size increased. We also considered other set-ups including different values for $m$ and $B$ and obtained similar results.
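For readers who wish to reproduce summaries of this kind, the four reported quantities can be computed from the replicated estimates and their estimated standard errors roughly as follows. The sketch assumes that SSE is the sample standard deviation of the estimates and that the coverage probability is based on Wald-type intervals (estimate plus or minus 1.96 estimated standard errors); the function name and the toy inputs are hypothetical.

```python
from statistics import mean, stdev


def summarize(estimates, std_errors, true_value, z=1.96):
    """Bias, SSE, ESE and empirical coverage probability over simulation replications."""
    covered = [abs(est - true_value) <= z * se for est, se in zip(estimates, std_errors)]
    return {
        "bias": mean(estimates) - true_value,   # average estimate minus the true value
        "sse": stdev(estimates),                # sample standard deviation of the estimates
        "ese": mean(std_errors),                # average of the estimated standard errors
        "cp": sum(covered) / len(covered),      # share of Wald intervals covering the truth
    }


# Toy inputs (four replications), only to show the call.
summary = summarize([0.1, 0.3, 0.2, 0.6], [0.2, 0.2, 0.2, 0.2], true_value=0.2)
```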

Table 1. Estimation of regression parameters with n = 1000.

pe Parameter Bias SSE ESE CP
5% βt=0 −0.0107 0.3920 0.3903 0.9510
  βw=0 −0.0026 0.3180 0.3247 0.9460
  η=0.8 −0.0034 0.2778 0.2863 0.9330
  βt=0.2 0.0074 0.4045 0.3958 0.9450
  βw=0.2 0.0163 0.3259 0.3273 0.9480
  η=0.8 0.0180 0.2722 0.2852 0.9520
  βt=0.5 0.0272 0.4070 0.4032 0.9490
  βw=0.5 0.0203 0.3469 0.3330 0.9380
  η=0.8 0.0200 0.2789 0.2813 0.9470
10% βt=0 −0.0163 0.3428 0.3377 0.9330
  βw=0 −0.0047 0.2976 0.3106 0.9540
  η=0.8 0.0347 0.2475 0.2544 0.9420
  βt=0.2 −0.0003 0.3413 0.3394 0.9530
  βw=0.2 0.0158 0.3127 0.3096 0.9400
  η=0.8 0.0347 0.2509 0.2493 0.9410
  βt=0.5 0.0117 0.3447 0.3438 0.9470
  βw=0.5 0.0291 0.3146 0.3130 0.9410
  η=0.8 0.0386 0.2500 0.2440 0.9370
20% βt=0 −0.0067 0.3058 0.3022 0.9480
  βw=0 0.0074 0.2766 0.2740 0.9400
  η=0.8 0.0071 0.2202 0.2159 0.9240
  βt=0.2 −0.0022 0.3066 0.3027 0.9410
  βw=0.2 0.0134 0.2757 0.2756 0.9480
  η=0.8 0.0132 0.2178 0.2150 0.9340
  βt=0.5 0.0021 0.3089 0.3058 0.9410
  βw=0.5 0.0223 0.2786 0.2791 0.9440
  η=0.8 0.0165 0.2220 0.2155 0.9230

Table 2. Estimation of regression parameters with n = 2000.

pe Parameter Bias SSE ESE CP
5% βt=0 −0.0106 0.2718 0.2707 0.9480
  βw=0 0.0007 0.2255 0.2270 0.9470
  η=0.8 0.0117 0.1945 0.2094 0.9620
  βt=0.2 0.0080 0.2772 0.2751 0.9480
  βw=0.2 0.0112 0.2299 0.2289 0.9510
  η=0.8 0.0089 0.1872 0.2085 0.9730
  βt=0.5 0.0170 0.2813 0.2781 0.9440
  βw=0.5 0.0052 0.2299 0.2313 0.9510
  η=0.8 0.0205 0.1877 0.2015 0.9660
10% βt=0 −0.0046 0.2345 0.2344 0.9440
  βw=0 0.0081 0.2097 0.2148 0.9500
  η=0.8 0.0346 0.1722 0.1847 0.9660
  βt=0.2 0.0058 0.2404 0.2371 0.9400
  βw=0.2 0.0060 0.2130 0.2172 0.9580
  η=0.8 0.0371 0.1827 0.1830 0.9410
  βt=0.5 0.0083 0.2371 0.2407 0.9540
  βw=0.5 0.0151 0.2208 0.2193 0.9450
  η=0.8 0.0372 0.1705 0.1830 0.9570
20% βt=0 −0.0023 0.2083 0.2106 0.9500
  βw=0 −0.0096 0.1897 0.1928 0.9550
  η=0.8 0.0060 0.1609 0.1671 0.9520
  βt=0.2 0.0011 0.2114 0.2122 0.9600
  βw=0.2 0.0099 0.1918 0.1938 0.9540
  η=0.8 0.0075 0.1575 0.1645 0.9560
  βt=0.5 −0.0034 0.2197 0.2137 0.9370
  βw=0.5 0.0040 0.1953 0.1958 0.9430
  η=0.8 0.0036 0.1597 0.1628 0.9430

In the proposed estimation procedure, it has been assumed that the distribution of the latent variables $b_i$'s is known up to a variance parameter. Hence in practice, one question of interest may be the robustness of the estimation procedure with respect to the distribution. To investigate this, we repeated the simulation study above giving the results in Table 1 with $p_e = 0.1$ except that we generated the $b_i$'s from the log-normal distribution instead of the gamma distribution but assumed that they followed the gamma distribution. Table 3 presents the results obtained on the proposed estimators $\hat{\beta}_{tn}$ and $\hat{\beta}_{wn}$, including the Bias, the SSE, the ESE and the 95% empirical CP. As before, they suggest that the proposed methodology seems to work well or the estimators $\hat{\beta}_{tn}$ and $\hat{\beta}_{wn}$ appear to be robust with respect to the distribution of the latent variables.

Table 3. Estimation of regression parameters with n = 1000, pe=0.1 and misspecified frailty distribution.

Parameter Bias SSE ESE CP
βt=0 0.0040 0.3192 0.3221 0.9500
βw=0 −0.0017 0.2682 0.2600 0.9470
βt=0.2 0.0025 0.3218 0.3238 0.9500
βw=0.2 0.0014 0.2711 0.2617 0.9410
βt=0.5 −0.0035 0.3356 0.3289 0.9410
βw=0.5 0.0090 0.2728 0.2678 0.9510

For the problem discussed here, instead of the inverse probability weighting method proposed above, there exist two commonly used naive approaches that estimate regression parameters by using the regular likelihood approaches. One is to base the estimation only on the selected subcohort and the other is to base the estimation on a simple random sample that has the same size as the case-cohort sample. Let $\hat{\beta}_t^{sub}$ and $\hat{\beta}_t^{srs}$ denote the estimators of $\beta_t$ given by the two naive methods above, respectively, and here we only focus on the estimation of $\beta_t$. Table 4 gives the estimation results given by the proposed method and the two naive approaches under the set-up similar to that for Table 2 with $p_e = 0.1$. Note that here for comparison, we also considered the approach given by Zhou et al. [40], which treated the observation process to be independent of the failure time of interest or ignored the correlation between the failure time and the observation process. The resulting estimator of $\beta_t$ is denoted by $\hat{\beta}_t^{in}$ in the table. One can see from Table 4 that the proposed estimate clearly gave better performance than the two naive estimates and one would get biased results if ignoring the correlation between the failure time of interest and the observation process.

Table 4. Comparison of the proposed and naive estimators for βt with n=2000 and pe=0.1.

Parameter Estimator Bias SSE ESE CP
βt=0.8 $\hat{\beta}_t$ −0.0129 0.2544 0.2546 0.9550
  $\hat{\beta}_t^{sub}$ −0.0379 0.5391 0.5308 0.9470
  $\hat{\beta}_t^{srs}$ −0.0031 0.3677 0.3697 0.9600
  $\hat{\beta}_t^{in}$ −0.1840 0.2142 0.2161 0.8610
βt=1 $\hat{\beta}_t$ −0.0027 0.2534 0.2595 0.9540
  $\hat{\beta}_t^{sub}$ 0.0151 0.5655 0.5635 0.9560
  $\hat{\beta}_t^{srs}$ −0.0249 0.3959 0.3853 0.9390
  $\hat{\beta}_t^{in}$ −0.2282 0.2117 0.2206 0.8040

As pointed out by a reviewer and motivated by the real data discussed below, we also repeated the study that gave the results in Table 1 with $p_e = 0.05$ in which we generated the subcohort in the same way as before but only from non-case subjects instead of all subjects as above. In other words, the goal here is to assess the performance of the proposed approach for case–control studies. The obtained estimation results are presented in Table 5 and one can see that they are similar to those given in Table 1. In other words, the proposed estimation approach seems to give good performance for and can be applied to case–control studies too.

Table 5. Estimation of regression parameters for case–control studies with n=1000 and pe=0.05.

Parameter Bias SSE ESE CP
βt=0 −0.0048 0.4015 0.4086 0.9560
βw=0 0.0060 0.3196 0.3283 0.9570
η=0.8 0.0022 0.3048 0.3189 0.9390
βt=0.2 −0.0177 0.3930 0.4108 0.9570
βw=0.2 −0.0158 0.3176 0.3290 0.9590
η=0.8 0.0143 0.3076 0.3226 0.9550
βt=0.5 −0.0015 0.3960 0.4003 0.9570
βw=0.5 0.0055 0.3287 0.3354 0.9610
η=0.8 0.0250 0.3266 0.3179 0.9450

5. An application

In this section, we will apply the methodology proposed in the previous sections to the HVTN 505 Trial discussed above. It is a randomized, multi-site clinical trial of men or transgender women who had sex with men for assessing the efficacy of the DNA/rAd5 vaccine for HIV-1 infection (Fong et al. [8]; Hammer et al. [10]; Janes et al. [14]). As mentioned above, the original study consists of the subjects randomly assigned to receive either the DNA/rAd5 vaccine or placebo, and in the following, we will focus only on the 1253 subjects in the vaccine group. It is well-known that HIV-1 infection is deadly as it causes AIDS for which there is no cure and thus it is important and essential to develop a safe and effective vaccine for the prevention of the infection. For each subject, four demographic covariates were observed and they are age, race, BMI and behavioural risk. In addition, to assess their relationship with the HIV infection, a number of T cell response biomarkers and antibody response biomarkers were measured for a cohort of 150 subjects consisting of all HIV infection cases (25) and 125 other randomly selected subjects among the vaccine recipients. The failure time of interest here is the time to true HIV-1 infection, for which only interval-censored data are available.

In all previous analyses, the authors simplified the observed data into right-censored data and also did not consider the possibility of informative censoring (Fong et al. [8]; Hammer et al. [10]; Janes et al. [14]). They identified the T cell response biomarker Env CD8+ polyfunctionality score and the antibody response biomarker IgG.Cconenv03140CF.avi as possibly having significant effects on the HIV infection time. For simplicity, below we will refer to these two biomarkers as Env CD8 Score and IgG, respectively. For the analysis below, by following Fong et al. [8] and Janes et al. [14], we will focus on the cohort of 150 vaccine recipients, which can be treated as a case–control design with the full cohort being all subjects in the vaccine group, and investigate the relationship between the HIV infection time and the four demographic covariates plus the two biomarkers.

Table 6 presents the estimation results given by the application of the methodology proposed in the previous sections to the HVTN 505 Trial, including the estimated covariate effects $\hat{\beta}_{tn}$ and $\hat{\beta}_{wn}$, the estimated standard errors (ESE) and the p-values for testing the covariate effect being zero. Here for the degree of Bernstein polynomials, we tried several values, including $m = 2$, 3, 4, 5, 6 and 7, and the results above were obtained based on $m = 3$, which gave the smallest AIC defined above, and $B = 500$. One can see from Table 6 that the proposed estimation procedure suggests that among the six covariates considered here, two demographic covariates, race and behavioural risk, seem to be correlated with the HIV infection time and the two biomarkers also appear to have significant prognostic effects on the development of HIV infection. On the other hand, the age and BMI did not seem to have any effects on the HIV infection. In addition, the race and behavioural risk appear to have significant effects on the observation process too.

Table 6. Estimated covariate effects for the HVTN 505 Trial.

Proposed method
Covariate $\hat{\beta}_{tn}$ ESE p-value $\hat{\beta}_{wn}$ ESE p-value
age −0.2116 0.2523 0.4018 0.0287 0.3174 0.9279
race −0.7962 0.4676 0.0886 1.7492 0.6204 0.0048
BMI −0.1560 0.3020 0.6055 0.1813 0.3621 0.6166
behavioural risk 1.1079 0.5677 0.0510 2.2763 0.6781 0.0008
Env CD8 Score −0.9575 0.2286 0.0000 0.2661 0.4628 0.5652
IgG −0.5085 0.1610 0.0016 0.2744 0.1611 0.0886
η 0.0030 2.0820 0.9989
Method given in Zhou et al. [40]
Covariate $\hat{\beta}_t^{in}$ ESE p-value
age −0.2114 0.2580 0.4125
race −0.7985 0.4996 0.1100
BMI −0.1561 0.2832 0.5814
behavioural risk 1.1086 0.7482 0.1385
Env CD8 Score −0.9574 0.2846 0.0008
IgG −0.5089 0.1620 0.0017

For comparison, we also applied the method given in Zhou et al. [40], which assumed that the HIV infection time and the observation process were independent, to the data and included the estimated covariate effects, which are denoted by $\hat{\beta}_t^{in}$, in the table along with the estimated standard errors and the p-values. One can see from the table that one difference between the results given by the two methods is in the estimation of the effect of the behavioural risk factor, which did not seem to have any effect on the development of HIV infection based on the method given in Zhou et al. [40]. One explanation for this may be that the method given in Zhou et al. [40] ignored the existence of informative censoring.

6. Discussion and concluding remarks

This paper discussed the analysis of case-cohort studies that yield informatively interval-censored failure time data arising from the proportional hazards model. As discussed above, a great deal of literature has been developed for the analysis of case-cohort studies that give right-censored data. In practice, however, the observed information on the failure time is more likely and naturally given in the form of interval-censored data, which is especially the case for longitudinal or periodic follow-up studies. One major difference between right-censored data and interval-censored data is that the latter has a much more complex structure than the former, which makes the analysis of the latter much more difficult. Although a large amount of literature has also been established for the analysis of either interval-censored data or case-cohort studies, there is no method available for the informative censoring situation discussed above. As pointed out before and seen in Section 5, informative censoring often occurs naturally and for the situation, the analysis that ignores it could result in biased or misleading results and conclusions.

As discussed in Sections 4 and 5, a type of study similar to the case-cohort study is the case–control study, and the key difference between the two is how the subcohort is generated. With the case-cohort design, the subcohort is sampled from all study subjects, whereas the case–control design samples it only from the subjects who do not experience the failure event of interest during the follow-up. The data structures under the two designs are clearly different; nevertheless, the simulation study suggested that the proposed estimation approach also seems to be valid for the case–control design. A possible explanation is that, when the event rate is low, the resulting data carry similar information about the model and the regression parameters of interest.
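To make the difference between the two sampling schemes concrete, the following minimal Python sketch draws the subcohort under each design. It assumes independent Bernoulli sampling with a hypothetical probability `q`; the function name `sample_design` and the toy cohort are illustrative only and are not taken from the paper.

```python
import random

def sample_design(is_case, q, design, seed=0):
    """Return (subcohort, xi) indicator lists for a cohort.

    is_case : list of bools, True if the failure event was observed.
    q       : assumed Bernoulli sampling probability for the subcohort.
    design  : "case-cohort" or "case-control".
    xi[i] = 1 means covariates are measured on subject i.
    """
    rng = random.Random(seed)
    subcohort, xi = [], []
    for case in is_case:
        if design == "case-cohort":
            # subcohort is drawn from all study subjects
            in_sub = rng.random() < q
        elif design == "case-control":
            # subcohort is drawn only from subjects without the event
            in_sub = (not case) and rng.random() < q
        else:
            raise ValueError("unknown design")
        subcohort.append(int(in_sub))
        # under both designs, every case has covariates measured
        xi.append(int(in_sub or case))
    return subcohort, xi

# toy cohort with a low event rate (5 cases among 100 subjects)
is_case = [True] * 5 + [False] * 95
sub_cc, xi_cc = sample_design(is_case, 0.2, "case-cohort")
sub_ctl, xi_ctl = sample_design(is_case, 0.2, "case-control")
```

In this sketch the two designs differ only in whether a case can belong to the subcohort; the set of subjects with measured covariates always contains every case.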

In practice, interval-censored data may be given in different forms (Sun [34]). For example, instead of the form discussed here, one may have case K or mixed interval-censored data (Wang et al. [37]). For the analysis, one can still apply the proposed estimation procedure to these situations by expressing the data in the format described here. However, the derivation of the asymptotic properties may be different, and one may need other assumptions similar to those described in Huang [11] and Wang et al. [37]. In the previous sections, the focus has been on informative censoring that can be characterized by models (2.1) and (2.2) or through latent variables. More specifically, it has been assumed that the magnitude of the informative censoring can be measured by the parameter η. As with most frailty model approaches, a natural question is whether one can test η=0. Unfortunately, there does not seem to exist an established procedure for this in the literature. A related question is the possibility of performing goodness-of-fit tests on models (2.1) and (2.2). If η=0, one may apply the test procedures given in Ren and He [28] and McKeague and Utikal [26], respectively, to test them separately. However, it would not be straightforward to generalize either of them to the situation discussed here.

As mentioned above, another commonly used method for dealing with informative censoring is the copula model approach, which directly models the joint distribution of the failure time of interest and the censoring variables (Sun [34]). For example, Cui et al. [6] and Ma et al. [24] developed two such methods for regression analysis of current status data with informative censoring, a special case of interval-censored data where each subject is observed only once. Among others, Ma et al. [23] proposed a copula model approach for regression analysis of general interval-censored data. An advantage of the copula model approach is that it allows one to model the marginal distributions and the association parameter separately, but it has the limitation that the underlying copula function must be assumed to be known.

It is well-known that although the proportional hazards model is one of the most commonly used models for regression analysis of failure time data, sometimes one may prefer a different model or a different model may fit the data or describe the problem of interest better (Kalbfleisch and Prentice [16]). For example, the additive hazards model is usually preferred if the excess risk is of interest and one may want to consider the linear transformation model if the model flexibility is more important. Some literature has been developed for these and other models for regression analysis of general interval-censored data or the analysis of case-cohort studies that yield right-censored data. However, there does not seem to exist an established estimation procedure for the problem discussed here under other models. In other words, it would be useful to generalize the proposed method to the situation under the additive hazards or linear transformation model.

Acknowledgments

The authors wish to thank the Editor-in-Chief, the Associate Editor and three reviewers for their many critical and constructive comments and suggestions that greatly improved the paper. Also the authors want to thank Dr Peter Gibert for providing the HIV example data.

Appendix. Proofs of the asymptotic properties of $\hat{\theta}_n$

In this appendix, we will sketch the proof of the asymptotic properties of the proposed estimator $\hat{\theta}_n$. Let $\tau$ denote the length of study. Then a single observation can be written as

$$O^{\xi}=\{U,\ \Psi W,\ \Psi=I(W<\tau-U),\ \Delta_1=I(T\le U),\ \Delta_2=I(U<T\le U+\Psi W),\ \xi Z,\ \xi\}.$$
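As an illustration of how a single observation could be encoded, the following Python sketch builds the tuple above from underlying quantities. It is a sketch under stated assumptions: the indicators are taken to be Ψ = I(W < τ − U), Δ1 = I(T ≤ U) and Δ2 = I(U < T ≤ U + ΨW), and the function name `encode_observation` and the dictionary keys are hypothetical. The latent failure time `T` is used only to form the indicators and is not stored.

```python
def encode_observation(T, U, W, tau, xi, Z):
    """Encode one subject's observed data.

    T   : latent failure time (never observed directly)
    U   : first observation time
    W   : gap between the first and second observation times
    tau : length of study
    xi  : 1 if covariates are measured on this subject, else 0
    Z   : list of covariate values
    """
    psi = 1 if W < tau - U else 0              # second visit falls within the study (assumed form)
    delta1 = 1 if T <= U else 0                # failure before the first visit
    delta2 = 1 if U < T <= U + psi * W else 0  # failure between the two visits
    return {
        "U": U,
        "PsiW": psi * W,
        "Psi": psi,
        "Delta1": delta1,
        "Delta2": delta2,
        "xiZ": [xi * z for z in Z],            # covariates are zeroed out when not measured
        "xi": xi,
    }

left = encode_observation(T=1.0, U=2.0, W=1.0, tau=5.0, xi=1, Z=[0.5])
interval = encode_observation(T=2.5, U=2.0, W=1.0, tau=5.0, xi=1, Z=[0.5])
right = encode_observation(T=4.0, U=2.0, W=1.0, tau=5.0, xi=0, Z=[0.5])
```

With Δ1 = Δ2 = 0, as in the third call, the failure time is only known to exceed the last observation time, so the subject is right-censored.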

To establish the asymptotic properties, we need the following regularity conditions, which are commonly used in the studies of interval-censored data and usually satisfied in practice (Huang and Rossini [12]; Zhang et al. [39]; Ma et al. [23]; Zhou et al. [40]).

  • (C1)

    The distribution of the covariate $Z$ has a bounded support in $\mathbb{R}^p$ and is not concentrated on any proper subspace of $\mathbb{R}^p$.

  • (C2)

    The true parameters $(\beta_{t0},\beta_{w0},\eta_0)$ lie in the interior of a compact set $\mathcal{B}$ in $\mathbb{R}^{2p}\times\mathbb{R}^{+}$.

  • (C3)

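    The first derivatives of $\Lambda_{t0}(\cdot)$ and $\Lambda_{w0}(\cdot)$, denoted by $\Lambda_{t0}^{(1)}(\cdot)$ and $\Lambda_{w0}^{(1)}(\cdot)$, are Hölder continuous with exponent $\gamma\in(0,1]$. That is, there exists a constant $K>0$ such that $|\Lambda_{t0}^{(1)}(t_1)-\Lambda_{t0}^{(1)}(t_2)|\le K|t_1-t_2|^{\gamma}$ for all $t_1,t_2\in[\sigma,\tau]$, where $0<\sigma<\tau<\infty$, and likewise for $\Lambda_{w0}^{(1)}(\cdot)$. Let $r=1+\gamma$.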

  • (C4)

    There exists a constant $K>0$ such that $Pl(\theta,O^{\xi})-Pl(\theta_0,O^{\xi})\le -K\,d(\theta,\theta_0)^2$ for every $\theta$ in a neighbourhood of $\theta_0$, where $l(\theta,O^{\xi})$ is the weighted log-likelihood function based on a single observation $O^{\xi}$.

  • (C5)

    The matrix $E(\{l^{*}(\vartheta_0,O)\}^{\otimes 2})$ is finite and positive definite, where $v^{\otimes 2}=vv^{\top}$ for a vector $v$, and $l^{*}(\vartheta,O)$ is the efficient score for $\vartheta=(\beta_t,\beta_w,\eta)$ based on the complete observation $O=\{U,\Psi W,\Psi,\Delta_1,\Delta_2,Z\}$ and will be given in the proof of Theorem 3.2.

For the proof, we will mainly employ empirical process theory and some nonparametric techniques. Let $Pf=\int f(y)\,dP$ denote the expectation of $f(Y)$ under the probability measure $P$, and $P_nf=n^{-1}\sum_{i=1}^{n}f(Y_i)$ the expectation of $f(Y)$ under the empirical measure $P_n$. Consider the class $\mathcal{L}_n=\{l(\theta,O^{\xi}):\theta\in\Theta_n\}$, where $l(\theta,O^{\xi})$ is the weighted log-likelihood function based on a single observation $O^{\xi}$. For any $\epsilon>0$, define the covering number $N(\epsilon,\mathcal{L}_n,L_1(P_n))$ as the smallest positive integer $\kappa$ for which there exists $\{\theta^{(1)},\ldots,\theta^{(\kappa)}\}$ such that

$$\min_{j\in\{1,\ldots,\kappa\}}\frac{1}{n}\sum_{i=1}^{n}\left|l(\theta,O_i^{\xi})-l(\theta^{(j)},O_i^{\xi})\right|<\epsilon$$

for all $\theta\in\Theta_n$, where $\{O_1^{\xi},\ldots,O_n^{\xi}\}$ represent the observed data and, for $j=1,\ldots,\kappa$, $\theta^{(j)}=(\beta_t^{(j)},\beta_w^{(j)},\eta^{(j)},\Lambda_t^{(j)},\Lambda_w^{(j)})\in\Theta_n$. If no such $\kappa$ exists, define $N(\epsilon,\mathcal{L}_n,L_1(P_n))=\infty$. For the proof, we also need the following two lemmas, whose proofs are similar to those for Lemmas 1 and 2 in Zhou et al. [40] and are thus omitted.

Lemma A.1

Assume that the regularity conditions (C1)–(C3) given above hold. Then the covering number of the class $\mathcal{L}_n=\{l(\theta,O^{\xi}):\theta\in\Theta_n\}$ satisfies

$$N(\epsilon,\mathcal{L}_n,L_1(P_n))\le K M_n^{2(m+1)}\epsilon^{-(2p+2m+3)}$$

for a constant $K$, where $m=o(n^{\nu})$ with $\nu\in(0,1)$ is the degree of Bernstein polynomials, and $M_n=O(n^{a})$ with $a>0$ controls the size of the sieve space $\Theta_n$.
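To illustrate the role of the degree m, the following Python sketch approximates a cumulative hazard by a Bernstein polynomial of degree m on [σ, τ] and reports the sup-norm error on a grid. It is only a sketch: it uses the classical Bernstein operator, whose coefficients are the function values at equally spaced points, rather than the sieve estimator of the paper, and the cumulative hazard Λ(t) = t², the interval [0.5, 2.5], and all function names are hypothetical choices.

```python
from math import comb

def bernstein(coefs, t, lo, hi):
    """Evaluate sum_k coefs[k] * C(m,k) * x^k * (1-x)^(m-k), with x = (t-lo)/(hi-lo)."""
    m = len(coefs) - 1
    x = (t - lo) / (hi - lo)
    return sum(c * comb(m, k) * x**k * (1 - x)**(m - k)
               for k, c in enumerate(coefs))

def bernstein_coefs(f, m, lo, hi):
    """Classical Bernstein operator: coefficients are f at m+1 equally spaced points."""
    return [f(lo + k * (hi - lo) / m) for k in range(m + 1)]

def sup_error(f, m, lo, hi, grid=200):
    """Maximum absolute approximation error over an equally spaced grid."""
    coefs = bernstein_coefs(f, m, lo, hi)
    ts = [lo + i * (hi - lo) / grid for i in range(grid + 1)]
    return max(abs(bernstein(coefs, t, lo, hi) - f(t)) for t in ts)

def cum_hazard(t):
    # hypothetical baseline cumulative hazard, chosen only for illustration
    return t ** 2

sigma, tau = 0.5, 2.5
errors = {m: sup_error(cum_hazard, m, sigma, tau) for m in (4, 16, 64)}
```

Because the coefficients of an increasing function are nondecreasing, the resulting polynomial is itself nondecreasing, which is one reason Bernstein polynomials are convenient for cumulative hazards. For this particular quadratic example the error shrinks in proportion to 1/m.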

Lemma A.2

Assume that the regularity conditions (C1)–(C3) given above hold. Then we have that

$$\sup_{\theta\in\Theta_n}\left|P_nl(\theta,O^{\xi})-Pl(\theta,O^{\xi})\right|\to 0$$

almost surely.

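Proof of Theorem 3.1

We first prove the strong consistency of $\hat{\theta}_n$. Let $l(\theta,O^{\xi})$ denote the weighted log-likelihood function based on a given single observation $O^{\xi}$ and consider the class of functions $\mathcal{L}_n=\{l(\theta,O^{\xi}):\theta\in\Theta_n\}$. By Lemma A.1, the covering number of $\mathcal{L}_n$ satisfies

$$N(\epsilon,\mathcal{L}_n,L_1(P_n))\le K M_n^{2(m+1)}\epsilon^{-(2p+2m+3)}.$$

Furthermore, by Lemma A.2, we have

$$\sup_{\theta\in\Theta_n}\left|P_nl(\theta,O^{\xi})-Pl(\theta,O^{\xi})\right|\to 0\quad\text{almost surely}.\tag{A1}$$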

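Since $E(p\mid O)=1$ for the weight $p$, we have $Pl(\theta,O^{\xi})=P\{p\,l(\theta,O)\}=Pl(\theta,O)$, and $\theta_0$ maximizes $Pl(\theta,O^{\xi})$. Let $M(\theta,O^{\xi})=-l(\theta,O^{\xi})$, and define $K_{\epsilon}=\{\theta:d(\theta,\theta_0)\ge\epsilon,\ \theta\in\Theta_n\}$ for $\epsilon>0$ and

$$\zeta_{1n}=\sup_{\theta\in\Theta_n}\left|P_nM(\theta,O^{\xi})-PM(\theta,O^{\xi})\right|,\qquad \zeta_{2n}=P_nM(\theta_0,O^{\xi})-PM(\theta_0,O^{\xi}).$$

Then

$$\inf_{K_{\epsilon}}PM(\theta,O^{\xi})=\inf_{K_{\epsilon}}\{PM(\theta,O^{\xi})-P_nM(\theta,O^{\xi})+P_nM(\theta,O^{\xi})\}\le\zeta_{1n}+\inf_{K_{\epsilon}}P_nM(\theta,O^{\xi}).\tag{A2}$$

If $\hat{\theta}_n\in K_{\epsilon}$, then we have

$$\inf_{K_{\epsilon}}P_nM(\theta,O^{\xi})=P_nM(\hat{\theta}_n,O^{\xi})\le P_nM(\theta_0,O^{\xi})=\zeta_{2n}+PM(\theta_0,O^{\xi}).\tag{A3}$$

Define $\delta_{\epsilon}=\inf_{K_{\epsilon}}PM(\theta,O^{\xi})-PM(\theta_0,O^{\xi})$. Under Condition (C4), we have $\delta_{\epsilon}>0$. It follows from (A2) and (A3) that

$$\inf_{K_{\epsilon}}PM(\theta,O^{\xi})\le\zeta_{1n}+\zeta_{2n}+PM(\theta_0,O^{\xi})=\zeta_n+PM(\theta_0,O^{\xi})$$

with $\zeta_n=\zeta_{1n}+\zeta_{2n}$, and hence $\zeta_n\ge\delta_{\epsilon}$. This gives $\{\hat{\theta}_n\in K_{\epsilon}\}\subseteq\{\zeta_n\ge\delta_{\epsilon}\}$, and by (A1) and the strong law of large numbers, we have both $\zeta_{1n}\to 0$ and $\zeta_{2n}\to 0$ almost surely. Therefore, $\bigcup_{k=1}^{\infty}\bigcap_{n=k}^{\infty}\{\hat{\theta}_n\in K_{\epsilon}\}\subseteq\bigcup_{k=1}^{\infty}\bigcap_{n=k}^{\infty}\{\zeta_n\ge\delta_{\epsilon}\}$, which proves that $d(\hat{\theta}_n,\theta_0)\to 0$ almost surely.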

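Now we will show the convergence rate of $\hat{\theta}_n$ by using Theorem 3.4.1 of van der Vaart and Wellner [35]. Below we use $\tilde{K}$ to denote a universal positive constant which may differ from place to place. First note from Theorem 1.6.2 of Lorentz [21] that there exist Bernstein polynomials $\Lambda_{tn0}$ and $\Lambda_{wn0}$ such that $\|\Lambda_{tn0}-\Lambda_{t0}\|_{\infty}=O(m^{-r/2})$ and $\|\Lambda_{wn0}-\Lambda_{w0}\|_{\infty}=O(m^{-r/2})$. Define $\theta_{n0}=(\beta_{t0},\beta_{w0},\eta_0,\Lambda_{tn0},\Lambda_{wn0})$. Then we have $d(\theta_{n0},\theta_0)=O(n^{-r\nu/2})$. For any $\rho>0$, define the class of functions $\mathcal{F}_{\rho}=\{l(\theta,O^{\xi})-l(\theta_{n0},O^{\xi}):\theta\in\Theta_n,\ \rho/2<d(\theta,\theta_{n0})\le\rho\}$ for a given single observation $O^{\xi}$. One can easily show that $P(l(\theta_0,O^{\xi})-l(\theta_{n0},O^{\xi}))\le\tilde{K}d(\theta_0,\theta_{n0})\le\tilde{K}n^{-r\nu/2}$. From Condition (C4), for large $n$, we have

$$P(l(\theta,O^{\xi})-l(\theta_{n0},O^{\xi}))=P(l(\theta,O^{\xi})-l(\theta_0,O^{\xi}))+P(l(\theta_0,O^{\xi})-l(\theta_{n0},O^{\xi}))\le-\tilde{K}\rho^2+\tilde{K}n^{-r\nu/2}=-\tilde{K}\rho^2,$$

for any $l(\theta,O^{\xi})-l(\theta_{n0},O^{\xi})\in\mathcal{F}_{\rho}$.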

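Following the calculations in Shen and Wong [32] (p. 597), we can establish that for $0<\varepsilon<\rho$, $\log N_{[\,]}(\varepsilon,\mathcal{F}_{\rho},L_2(P))\le\tilde{K}N\log(\rho/\varepsilon)$ with $N=2(m+1)$. Moreover, some algebraic manipulations yield that $P(l(\theta,O^{\xi})-l(\theta_{n0},O^{\xi}))^2\le\tilde{K}\rho^2$ for any $l(\theta,O^{\xi})-l(\theta_{n0},O^{\xi})\in\mathcal{F}_{\rho}$. Under Conditions (C1)–(C3), it is easy to see that $\mathcal{F}_{\rho}$ is uniformly bounded. Therefore, by Lemma 3.4.2 of van der Vaart and Wellner [35], we obtain

$$E_P\left\|n^{1/2}(P_n-P)\right\|_{\mathcal{F}_{\rho}}\le\tilde{K}J_{[\,]}\{\rho,\mathcal{F}_{\rho},L_2(P)\}\left[1+\frac{J_{[\,]}\{\rho,\mathcal{F}_{\rho},L_2(P)\}}{\rho^2n^{1/2}}\right]$$

where $J_{[\,]}\{\rho,\mathcal{F}_{\rho},L_2(P)\}=\int_0^{\rho}[1+\log N_{[\,]}\{\varepsilon,\mathcal{F}_{\rho},L_2(P)\}]^{1/2}\,d\varepsilon\le\tilde{K}N^{1/2}\rho$. This yields $\phi_n(\rho)=N^{1/2}\rho+N/n^{1/2}$. It is easy to see that $\phi_n(\rho)/\rho$ is decreasing in $\rho$, and $r_n^2\phi_n(1/r_n)=r_nN^{1/2}+r_n^2N/n^{1/2}\le\tilde{K}n^{1/2}$, where $r_n=N^{-1/2}n^{1/2}=n^{(1-\nu)/2}$.

Finally note that $P_n\{l(\hat{\theta}_n,O^{\xi})-l(\theta_{n0},O^{\xi})\}\ge 0$ and $d(\hat{\theta}_n,\theta_{n0})\le d(\hat{\theta}_n,\theta_0)+d(\theta_0,\theta_{n0})\to 0$ in probability. Thus by applying Theorem 3.4.1 of van der Vaart and Wellner [35], we have $n^{(1-\nu)/2}d(\hat{\theta}_n,\theta_{n0})=O_p(1)$. This together with $d(\theta_{n0},\theta_0)=O(n^{-r\nu/2})$ yields that $d(\hat{\theta}_n,\theta_0)=O_p(n^{-(1-\nu)/2}+n^{-r\nu/2})$ and the proof is completed.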
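The rate just obtained balances two terms, one decreasing and one increasing in ν. As a small worked illustration (simple arithmetic rather than part of the proof), the following Python sketch computes the exponent min{(1 − ν)/2, rν/2} and the value of ν at which the two terms coincide, namely ν = 1/(1 + r). The function names are hypothetical.

```python
def rate_exponent(nu, r):
    """Exponent a such that d(theta_hat, theta_0) = O_p(n^(-a))."""
    return min((1 - nu) / 2, r * nu / 2)

def balancing_nu(r):
    """Solve (1 - nu)/2 = r*nu/2 for nu."""
    return 1 / (1 + r)

# with gamma = 1 in Condition (C3), r = 1 + gamma = 2
r = 2.0
nu_star = balancing_nu(r)
best = rate_exponent(nu_star, r)

# compare against a coarse grid of alternative choices of nu in (0, 1)
grid = [rate_exponent(k / 100, r) for k in range(1, 100)]
```

For r = 2 the balancing choice is ν = 1/3, which gives the exponent 1/3, and no grid value of ν yields a larger exponent.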

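Proof of Theorem 3.2

Now we will prove the asymptotic normality of $\hat{\vartheta}_n=(\hat{\beta}_{tn},\hat{\beta}_{wn},\hat{\eta}_n)$. First we will establish the asymptotic normality for the estimator based on the complete observation $O=\{U,\Psi W,\Psi,\Delta_1,\Delta_2,Z\}$. With a slight abuse of notation, we still denote the complete-data estimator by $\hat{\vartheta}_n$.

Let $V$ denote the linear span of $\Theta-\theta_0$ and define the Fisher inner product for $v,\tilde{v}\in V$ as $\langle v,\tilde{v}\rangle=P\{\dot{l}(\theta_0,O)[v]\,\dot{l}(\theta_0,O)[\tilde{v}]\}$ and the Fisher norm for $v\in V$ as $\|v\|^2=\langle v,v\rangle$, where

$$\dot{l}(\theta_0,O)[v]=\frac{d\,l(\theta_0+sv,O)}{ds}\bigg|_{s=0}$$

denotes the first-order directional derivative of $l(\theta,O)$ in the direction $v\in V$ (evaluated at $\theta_0$). Also let $\bar{V}$ be the closed linear span of $V$ under the Fisher norm. Then $(\bar{V},\|\cdot\|)$ is a Hilbert space. Furthermore, for a $(2p+1)$-dimensional vector $b=(b_1,b_2,b_3)$ with $\|b\|\le 1$ and any $v\in V$, define a smooth functional of $\theta$ as $h(\theta)=b_1^{\top}\beta_1+b_2^{\top}\beta_2+b_3\eta$ and

$$\dot{h}(\theta_0)[v]=\frac{d\,h(\theta_0+sv)}{ds}\bigg|_{s=0}$$

whenever the right-hand side limit is well defined. Then by the Riesz representation theorem, there exists $v^{*}\in\bar{V}$ such that $\dot{h}(\theta_0)[v]=\langle v,v^{*}\rangle$ for all $v\in\bar{V}$ and $\|v^{*}\|=\|\dot{h}(\theta_0)\|$. Also note that $h(\theta)-h(\theta_0)=\dot{h}(\theta_0)[\theta-\theta_0]$. It thus follows from the Cramér-Wold device that to prove the asymptotic normality for $\hat{\vartheta}_n$, i.e. $n^{1/2}(\hat{\vartheta}_n-\vartheta_0)\to N(0,I^{-1}(\vartheta_0))$ in distribution, it suffices to show that

$$n^{1/2}\langle\hat{\theta}_n-\theta_0,v^{*}\rangle\to_d N(0,b^{\top}I^{-1}(\vartheta_0)b)\tag{A4}$$

since $b^{\top}(\hat{\vartheta}_n-\vartheta_0)=h(\hat{\theta}_n)-h(\theta_0)=\dot{h}(\theta_0)[\hat{\theta}_n-\theta_0]=\langle\hat{\theta}_n-\theta_0,v^{*}\rangle$. In fact, (A4) holds since one can show that $n^{1/2}\langle\hat{\theta}_n-\theta_0,v^{*}\rangle\to_d N(0,\|v^{*}\|^2)$ and $\|v^{*}\|^2=b^{\top}I^{-1}(\vartheta_0)b$.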

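We first prove that $n^{1/2}\langle\hat{\theta}_n-\theta_0,v^{*}\rangle\to_d N(0,\|v^{*}\|^2)$. Let $\delta_n=n^{-\min\{(1-\nu)/2,\,r\nu/2\}}$ denote the rate of convergence obtained in Theorem 3.1, and for any $\theta\in\Theta$ such that $d(\theta,\theta_0)\le\delta_n$, define the first-order directional derivative of $l(\theta,O)$ in the direction $v\in V$ as

$$\dot{l}(\theta,O)[v]=\frac{d\,l(\theta+sv,O)}{ds}\bigg|_{s=0}$$

and the second-order directional derivative in the directions $v,\tilde{v}\in V$ as

$$\ddot{l}(\theta,O)[v,\tilde{v}]=\frac{d^2l(\theta+sv+\tilde{s}\tilde{v},O)}{d\tilde{s}\,ds}\bigg|_{s=0}\bigg|_{\tilde{s}=0}=\frac{d\,\dot{l}(\theta+\tilde{s}\tilde{v},O)[v]}{d\tilde{s}}\bigg|_{\tilde{s}=0}.$$

Note that by Condition (C3) and Theorem 1.6.2 of Lorentz [21], there exists $\Pi_nv^{*}\in\Theta_n-\theta_0$ such that $\|\Pi_nv^{*}-v^{*}\|=O(n^{-\nu r/2})$. Furthermore, under the assumption $\nu>1/(2r)$, we have $\delta_n\|\Pi_nv^{*}-v^{*}\|=o(n^{-1/2})$. Define $r[\theta-\theta_0,O]=l(\theta,O)-l(\theta_0,O)-\dot{l}(\theta_0,O)[\theta-\theta_0]$ and let $\varepsilon_n$ be any positive sequence satisfying $\varepsilon_n=o(n^{-1/2})$. Then by the definition of $\hat{\theta}_n$, we have

$$\begin{aligned}
0&\le P_n[l(\hat{\theta}_n,O)-l(\hat{\theta}_n\pm\varepsilon_n\Pi_nv^{*},O)]\\
&=\mp\varepsilon_nP_n\dot{l}(\theta_0,O)[\Pi_nv^{*}]+(P_n-P)\{r[\hat{\theta}_n-\theta_0,O]-r[\hat{\theta}_n\pm\varepsilon_n\Pi_nv^{*}-\theta_0,O]\}\\
&\quad+P\{r[\hat{\theta}_n-\theta_0,O]-r[\hat{\theta}_n\pm\varepsilon_n\Pi_nv^{*}-\theta_0,O]\}\\
&=\mp\varepsilon_nP_n\dot{l}(\theta_0,O)[v^{*}]\mp\varepsilon_nP_n\dot{l}(\theta_0,O)[\Pi_nv^{*}-v^{*}]+(P_n-P)\{r[\hat{\theta}_n-\theta_0,O]-r[\hat{\theta}_n\pm\varepsilon_n\Pi_nv^{*}-\theta_0,O]\}\\
&\quad+P\{r[\hat{\theta}_n-\theta_0,O]-r[\hat{\theta}_n\pm\varepsilon_n\Pi_nv^{*}-\theta_0,O]\}\\
&=\mp\varepsilon_nP_n\dot{l}(\theta_0,O)[v^{*}]\mp I_1+I_2+I_3.
\end{aligned}$$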

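We will investigate the asymptotic behaviour of $I_1$, $I_2$ and $I_3$. For $I_1$, it follows from Conditions (C1)–(C3), the Chebyshev inequality and $\|\Pi_nv^{*}-v^{*}\|=o(1)$ that $I_1=\varepsilon_n\times o_p(n^{-1/2})$. For $I_2$, by the mean value theorem, we obtain that

$$I_2=(P_n-P)\{l(\hat{\theta}_n,O)-l(\hat{\theta}_n\pm\varepsilon_n\Pi_nv^{*},O)\pm\varepsilon_n\dot{l}(\theta_0,O)[\Pi_nv^{*}]\}=\mp\varepsilon_n(P_n-P)\{(\dot{l}(\tilde{\theta},O)-\dot{l}(\theta_0,O))[\Pi_nv^{*}]\},$$

where $\tilde{\theta}$ lies between $\hat{\theta}_n$ and $\hat{\theta}_n\pm\varepsilon_n\Pi_nv^{*}$. By Theorem 2.8.3 of van der Vaart and Wellner [35], we know that $\{\dot{l}(\theta,O)[\Pi_nv^{*}]:\|\theta-\theta_0\|\le\delta_n\}$ is a Donsker class. Therefore, by Theorem 2.11.23 of van der Vaart and Wellner [35], we have $I_2=\varepsilon_n\times o_p(n^{-1/2})$. For $I_3$, note that

$$\begin{aligned}
P(r[\theta-\theta_0,O])&=P\{l(\theta,O)-l(\theta_0,O)-\dot{l}(\theta_0,O)[\theta-\theta_0]\}\\
&=2^{-1}P\{\ddot{l}(\tilde{\theta},O)[\theta-\theta_0,\theta-\theta_0]-\ddot{l}(\theta_0,O)[\theta-\theta_0,\theta-\theta_0]\}+2^{-1}P\{\ddot{l}(\theta_0,O)[\theta-\theta_0,\theta-\theta_0]\}\\
&=2^{-1}P\{\ddot{l}(\theta_0,O)[\theta-\theta_0,\theta-\theta_0]\}+\varepsilon_n\times o_p(n^{-1/2}),
\end{aligned}$$

where $\tilde{\theta}$ lies between $\theta_0$ and $\theta$ and the last equation follows from Taylor expansion and Conditions (C1)–(C3). Therefore,

$$\begin{aligned}
I_3&=-2^{-1}\{\|\hat{\theta}_n-\theta_0\|^2-\|\hat{\theta}_n\pm\varepsilon_n\Pi_nv^{*}-\theta_0\|^2\}+\varepsilon_n\times o_p(n^{-1/2})\\
&=\pm\varepsilon_n\langle\hat{\theta}_n-\theta_0,\Pi_nv^{*}\rangle+2^{-1}\|\varepsilon_n\Pi_nv^{*}\|^2+\varepsilon_n\times o_p(n^{-1/2})\\
&=\pm\varepsilon_n\langle\hat{\theta}_n-\theta_0,v^{*}\rangle+2^{-1}\|\varepsilon_n\Pi_nv^{*}\|^2+\varepsilon_n\times o_p(n^{-1/2})\\
&=\pm\varepsilon_n\langle\hat{\theta}_n-\theta_0,v^{*}\rangle+\varepsilon_n\times o_p(n^{-1/2}),
\end{aligned}$$

where the last equality holds due to $\delta_n\|\Pi_nv^{*}-v^{*}\|=o(n^{-1/2})$, the Cauchy-Schwarz inequality, and $\|\Pi_nv^{*}\|^2\to\|v^{*}\|^2$. Combining the above facts, together with $P\dot{l}(\theta_0,O)[v^{*}]=0$, we can establish that

$$\begin{aligned}
0&\le P_n\{l(\hat{\theta}_n,O)-l(\hat{\theta}_n\pm\varepsilon_n\Pi_nv^{*},O)\}\\
&=\mp\varepsilon_nP_n\dot{l}(\theta_0,O)[v^{*}]\pm\varepsilon_n\langle\hat{\theta}_n-\theta_0,v^{*}\rangle+\varepsilon_n\times o_p(n^{-1/2})\\
&=\mp\varepsilon_n(P_n-P)\{\dot{l}(\theta_0,O)[v^{*}]\}\pm\varepsilon_n\langle\hat{\theta}_n-\theta_0,v^{*}\rangle+\varepsilon_n\times o_p(n^{-1/2}).
\end{aligned}$$

Therefore, we obtain $\mp n^{1/2}(P_n-P)\{\dot{l}(\theta_0,O)[v^{*}]\}\pm n^{1/2}\langle\hat{\theta}_n-\theta_0,v^{*}\rangle+o_p(1)\ge 0$ and then $n^{1/2}\langle\hat{\theta}_n-\theta_0,v^{*}\rangle=n^{1/2}(P_n-P)\{\dot{l}(\theta_0,O)[v^{*}]\}+o_p(1)\to_d N(0,\|v^{*}\|^2)$ by the central limit theorem and $\|v^{*}\|^2=\|\dot{l}(\theta_0,O)[v^{*}]\|^2$.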

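Next we will prove that $\|v^{*}\|^2=b^{\top}I^{-1}(\vartheta_0)b$. For each component $\vartheta_q$, $q=1,2,\ldots,(2p+1)$, we denote by $\psi_q^{*}=(b_{1q}^{*},b_{2q}^{*})$ the value of $\psi_q=(b_{1q},b_{2q})$ minimizing

$$E\{l_{\vartheta}\cdot e_q-l_{b_1}[b_{1q}]-l_{b_2}[b_{2q}]\}^2,$$

where $l_{\vartheta}$ is the score function for $\vartheta$, $l_{b_j}$ is the score operator for $\Lambda_j$, $j=1,2$, and $e_q$ is a $(2p+1)$-dimensional vector of zeros except for the $q$-th element, which equals 1.

Define the $q$-th element of $l^{*}(\vartheta,O)$ as $l_{\vartheta}\cdot e_q-l_{b_1}[b_{1q}^{*}]-l_{b_2}[b_{2q}^{*}]$, $q=1,\ldots,(2p+1)$, and $I(\vartheta)$ as $E(\{l^{*}(\vartheta,O)\}^{\otimes 2})$. By Condition (C5), the matrix $I(\vartheta_0)$ is positive definite. Furthermore, by following similar calculations in Chen et al. [4] (sec. 3.2), we obtain

$$\|v^{*}\|^2=\|\dot{h}(\theta_0)\|^2=\sup_{v\in\bar{V}:\|v\|>0}\frac{|\dot{h}(\theta_0)[v]|^2}{\|v\|^2}=b^{\top}[E(\{l^{*}(\vartheta_0,O)\}^{\otimes 2})]^{-1}b=b^{\top}I^{-1}(\vartheta_0)b.$$

Thus, we have shown that $n^{1/2}(\hat{\vartheta}_n-\vartheta_0)\to N(0,I^{-1}(\vartheta_0))$ in distribution for the estimator $\hat{\vartheta}_n$ based on the complete data.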

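Now consider the estimator $\hat{\vartheta}_n$ based only on the case-cohort data. Note that the weight $p=\xi/\pi_q(\Delta_1,\Delta_2)$ is bounded and does not depend on $\theta$, and $E\{p\mid O\}=1$. By Theorem 3.2 of Saegusa and Wellner [29], we have

$$n^{1/2}(\hat{\vartheta}_n-\vartheta_0)=I^{-1}(\vartheta_0)\,n^{-1/2}\sum_{i=1}^{n}p_i\,l^{*}(\vartheta_0,O_i)+o_p(1),$$

where $I(\vartheta)$ and $l^{*}(\vartheta,O)$, defined above, are the information and efficient score for $\vartheta$ based on the complete data. Note that

$$\begin{aligned}
\mathrm{var}\{p\,l^{*}(\vartheta_0,O)\}&=\mathrm{var}\{E\{p\,l^{*}(\vartheta_0,O)\mid O\}\}+E\{\mathrm{var}\{p\,l^{*}(\vartheta_0,O)\mid O\}\}\\
&=\mathrm{var}\{l^{*}(\vartheta_0,O)\}+E\left\{\frac{\mathrm{var}(\xi\mid O)\{l^{*}(\vartheta_0,O)\}^{\otimes 2}}{\pi_q^2(\Delta_1,\Delta_2)}\right\}\\
&=I(\vartheta_0)+E\left\{\frac{1-\pi_q(\Delta_1,\Delta_2)}{\pi_q(\Delta_1,\Delta_2)}\{l^{*}(\vartheta_0,O)\}^{\otimes 2}\right\}.
\end{aligned}$$

Thus, we have

$$n^{1/2}(\hat{\vartheta}_n-\vartheta_0)\to N(0,\Sigma)$$

in distribution, where

$$\Sigma=I^{-1}(\vartheta_0)+I^{-1}(\vartheta_0)\,E\left\{\frac{1-\pi_q(\Delta_1,\Delta_2)}{\pi_q(\Delta_1,\Delta_2)}\{l^{*}(\vartheta_0,O)\}^{\otimes 2}\right\}I^{-1}(\vartheta_0).$$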
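The variance decomposition above can be checked numerically in a toy setting. The following Python sketch is an illustration under stated assumptions, not the paper's computation: the efficient score is taken to be scalar with a two-point, mean-zero distribution, the selection probability is assumed to equal 1 for cases and a hypothetical value `q` for non-cases, and the function name `ipw_variance` is made up for this example.

```python
def ipw_variance(outcomes, q):
    """Exact variance quantities for a scalar, mean-zero efficient score.

    outcomes : list of (probability, score, is_case) triples.
    q        : assumed selection probability for non-cases (cases have 1).
    Returns (info, extra, direct, sigma).
    """
    def pi(is_case):
        return 1.0 if is_case else q

    # complete-data information I = E[l*^2]
    info = sum(p * s * s for p, s, _ in outcomes)
    # sampling penalty E[(1 - pi)/pi * l*^2]
    extra = sum(p * (1 - pi(c)) / pi(c) * s * s for p, s, c in outcomes)
    # var{(xi/pi) l*} computed directly as E[l*^2 / pi], since E[(xi/pi) l*] = 0
    direct = sum(p * s * s / pi(c) for p, s, c in outcomes)
    # scalar version of Sigma = I^{-1} + I^{-1} E{...} I^{-1}
    sigma = 1 / info + extra / info ** 2
    return info, extra, direct, sigma

# 10% cases with score 3, 90% non-cases with score -1/3 (mean zero)
outcomes = [(0.1, 3.0, True), (0.9, -1.0 / 3.0, False)]
info, extra, direct, sigma = ipw_variance(outcomes, q=0.25)
_, _, _, sigma_full = ipw_variance(outcomes, q=1.0)
```

In this toy example the information is 1 and the sampling penalty is 0.3, so the asymptotic variance rises from 1 under full covariate measurement (q = 1) to 1.3 when only a quarter of the non-cases are sampled.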

Funding Statement

The work was partially supported by the National Science Foundation of USA grant DMS-1916170, the National Natural Science Foundation of China grant 11671168, the Science and Technology Developing Plan of Jilin Province of China grant 20170101061JC, and the National Institute of Allergy and Infectious Disease of USA grant 1 R56 AI140953-01.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1. Bogaerts K., Komarek A., and Lesaffre E., Survival Analysis with Interval-Censored Data: A Practical Approach with Examples in R, SAS, and BUGS, CRC Press, 2017.
  • 2. Chen K., Generalized case cohort sampling, J. R. Stat. Soc. B 63 (2001), pp. 791–809. doi: 10.1111/1467-9868.00313
  • 3. Chen K. and Lo S.H., Case-cohort and case-control analysis with Cox's model, Biometrika 86 (1999), pp. 755–764. doi: 10.1093/biomet/86.4.755
  • 4. Chen X., Fan Y., and Tsyrennikov V., Efficient estimation of semiparametric multivariate copula models, J. Am. Stat. Assoc. 101 (2006), pp. 1228–1240. doi: 10.1198/016214506000000311
  • 5. Chen D.G., Sun J., and Peace K., Interval-Censored Time-to-Event Data: Methods and Applications, CRC Press, 2012.
  • 6. Cui Q., Zhao H., and Sun J., A new copula model-based method for regression analysis of dependent current status data, Stat. Interface 11 (2018), pp. 463–471. doi: 10.4310/SII.2018.v11.n3.a9
  • 7. Finkelstein D.M., A proportional hazards model for interval-censored failure time data, Biometrics 42 (1986), pp. 845–854. doi: 10.2307/2530698
  • 8. Fong Y., Shen X., Ashley V.C., Deal A., Seaton K.E., Yu C., Grant S.P., Ferrari G., Bailer R.T., Koup R.A., Montefiori D., Haynes B.F., Sarzotti-Kelsoe M., Graham B.S., Carpp L.N., Hammer S.M., Sobieszczyk M., Karuna S., Swann E., DeJesus E., Mulligan M., Frank I., Buchbinder S., Novak R.M., McElrath M.J., Kalams S., Keefer M., Frahm N.A., Janes H.E., Gilbert P.B., and Tomaras G.D., Modification of the association between T-Cell immune responses and human immunodeficiency virus type 1 infection risk by vaccine-induced antibody responses in the HVTN 505 trial, J. Infect. Dis. 217 (2018), pp. 1280–1288. doi: 10.1093/infdis/jiy008
  • 9. Gilbert P.B., Peterson M.L., Follmann D., Hudgens M.G., Francis D.P., Gurwith M., Heyward W.L., Jobes D.V., Popovic V., Self S.G., Sinangil F., Burke D., and Berman P.W., Correlation between immunologic responses to a recombinant glycoprotein 120 vaccine and incidence of HIV-1 infection in a phase 3 HIV-1 preventive vaccine trial, J. Infect. Dis. 191 (2005), pp. 666–677. doi: 10.1086/428405
  • 10. Hammer S.M., Sobieszczyk M.E., Janes H., Mulligan M.J., Karuna S.T., Grove D., Koblin B.A., Buchbinder S.P., Keefer M.C., Tomaras G.D., Frahm N., Hural J., Anude C., Graham B.S., Enama M.E., Adams E., DeJesus E., Novak R.M., Frank I., Bentley C., Ramirez S., Fu R., Koup R.A., Mascola J.R., Nabel G.J., Montefiori D.C., Kublin J., McElrath M.J., Corey L., and Gilbert P.B., HVTN 505 Study Team, Efficacy trial of a DNA/rAd5 HIV-1 preventive vaccine, N. Engl. J. Med. 369 (2013), pp. 2083–2092. doi: 10.1056/NEJMoa1310566
  • 11. Huang J., Asymptotic properties of nonparametric estimation based on partly interval-censored data, Stat. Sin. 9 (1999), pp. 501–519.
  • 12. Huang J. and Rossini A.J., Sieve estimation for the proportional-odds failure-time regression model with interval censoring, J. Am. Stat. Assoc. 92 (1997), pp. 960–967. doi: 10.1080/01621459.1997.10474050
  • 13. Huang X. and Wolfe R., A frailty model for informative censoring, Biometrics 58 (2002), pp. 510–520. doi: 10.1111/j.0006-341X.2002.00510.x
  • 14. Janes H.E., Cohen K.W., Frahm N., De Rosa S.C., Sanchez B., Hural J., Magaret C.A., Karuna S., Bentley C., Gottardo R., Finak G., Grove D., Shen M., Graham B.S., Koup R.A., Mulligan M.J., Koblin B., Buchbinder S.P., Keefer M.C., Adams E., Anude C., Corey L., Sobieszczyk M., Hammer S.M., Gilbert P.B., and McElrath M.J., Higher T-cell responses induced by DNA/rAd5 HIV-1 preventive vaccine are associated with lower HIV-1 infection risk in an efficacy trial, J. Infect. Dis. 215 (2017), pp. 1376–1385. doi: 10.1093/infdis/jix086
  • 15. Jewell N.P. and van der Laan M., Current status data: review, recent development and open problems, Adv. Survival Anal. 35 (2004), pp. 625–643.
  • 16. Kalbfleisch J.D. and Prentice R.L., The Statistical Analysis of Failure Time Data, 2nd ed., Wiley, New York, 2002.
  • 17. Kang S. and Cai J., Marginal hazards model for case-cohort studies with multiple disease outcomes, Biometrika 96 (2009), pp. 887–901. doi: 10.1093/biomet/asp059
  • 18. Keogh R.H. and White I.R., Using full-cohort data in nested case-control and case-cohort studies by multiple imputation, Stat. Med. 32 (2013), pp. 4021–4043. doi: 10.1002/sim.5818
  • 19. Kim S., Cai J., and Lu W., More efficient estimators for case-cohort studies, Biometrika 100 (2013), pp. 695–708. doi: 10.1093/biomet/ast018
  • 20. Li Z. and Nan B., Relative risk regression for current status data in case-cohort studies, Can. J. Stat. 39 (2011), pp. 557–577. doi: 10.1002/cjs.10111
  • 21. Lorentz G.G., Bernstein Polynomials, Chelsea Publishing Co, New York, 1986.
  • 22. Ma S. and Kosorok M.R., Robust semiparametric M-estimation and the weighted bootstrap, J. Multivar. Anal. 96 (2005), pp. 190–217. doi: 10.1016/j.jmva.2004.09.008
  • 23. Ma L., Hu T., and Sun J., Cox regression analysis of dependent interval-censored failure time data, Comput. Stat. Data Anal. 103 (2016), pp. 79–90. doi: 10.1016/j.csda.2016.04.011
  • 24. Ma L., Hu T., and Sun J., Sieve maximum likelihood regression analysis of dependent current status data, Biometrika 102 (2015), pp. 731–738. doi: 10.1093/biomet/asv020
  • 25. Marti H. and Chavance M., Multiple imputation analysis of case-cohort studies, Stat. Med. 30 (2011), pp. 1595–1607. doi: 10.1002/sim.4130
  • 26. McKeague I.W. and Utikal K.J., Goodness-of-fit tests for additive hazards and proportional hazards models, Scand. J. Stat. 18 (1991), pp. 177–195.
  • 27. Prentice R.L., A case-cohort design for epidemiologic cohort studies and disease prevention trials, Biometrika 73 (1986), pp. 1–11. doi: 10.1093/biomet/73.1.1
  • 28. Ren J.J. and He B., Estimation and goodness-of-fit for the Cox model with various types of censored data, J. Stat. Plan. Inference 141 (2011), pp. 961–971. doi: 10.1016/j.jspi.2010.09.006
  • 29. Saegusa T. and Wellner J.A., Weighted likelihood estimation under two-phase sampling, Ann. Stat. 41 (2013), pp. 269–295. doi: 10.1214/12-AOS1073
  • 30. Scheike T.H. and Martinussen T., Maximum likelihood estimation for Cox's regression model under case-cohort sampling, Scand. J. Stat. 31 (2004), pp. 283–293. doi: 10.1111/j.1467-9469.2004.02-064.x
  • 31. Shen X., On methods of sieves and penalization, Ann. Stat. 25 (1997), pp. 2555–2591. doi: 10.1214/aos/1030741085
  • 32. Shen X. and Wong W.H., Convergence rate of sieve estimates, Ann. Stat. 22 (1994), pp. 580–615. doi: 10.1214/aos/1176325486
  • 33. Sun J., A nonparametric test for current status data with unequal censoring, J. R. Stat. Soc. B 61 (1999), pp. 243–250. doi: 10.1111/1467-9868.00174
  • 34. Sun J., The Statistical Analysis of Interval-censored Failure Time Data, Springer, New York, 2006.
  • 35. van der Vaart A.W. and Wellner J.A., Weak Convergence and Empirical Processes: With Applications to Statistics, Springer, New York, 1996.
  • 36. Wang P., Zhao H., and Sun J., Regression analysis of case K interval-censored failure time data in the presence of informative censoring, Biometrics 72 (2016), pp. 1103–1112. doi: 10.1111/biom.12527
  • 37. Wang L.M., McMahan C.S., Hudgens M.G., and Qureshi Z.P., A flexible, computationally efficient method for fitting the proportional hazards model to interval-censored data, Biometrics 72 (2016), pp. 222–231. doi: 10.1111/biom.12389
  • 38. Wang S., Wang C., Wang P., and Sun J., Semiparametric analysis of the additive hazards model with informatively interval-censored failure time data, Comput. Stat. Data Anal. 125 (2018), pp. 1–9. doi: 10.1016/j.csda.2018.03.011
  • 39. Zhang Y., Hua L., and Huang J., A spline-based semiparametric maximum likelihood estimation method for the Cox model with interval-censored data, Scand. J. Stat. 37 (2010), pp. 338–354. doi: 10.1111/j.1467-9469.2009.00680.x
  • 40. Zhou Q., Hu T., and Sun J., A sieve semiparametric maximum likelihood approach for regression analysis of bivariate interval-censored failure time data, J. Am. Stat. Assoc. 112 (2017a), pp. 664–672. doi: 10.1080/01621459.2016.1158113
  • 41. Zhou Q., Zhou H., and Cai J., Case-cohort studies with interval-censored failure time data, Biometrika 104 (2017b), pp. 17–29. doi: 10.1093/biomet/asw067

Articles from Journal of Applied Statistics are provided here courtesy of Taylor & Francis
