Abstract
The case-cohort design is widely used as a means of reducing cost in large cohort studies, especially when the disease rate is low and covariate measurements may be expensive, and has been discussed by many authors. In this paper, we discuss regression analysis of case-cohort studies that produce interval-censored failure time data with dependent censoring, a situation for which there does not seem to exist an established approach. For inference, a sieve inverse probability weighting estimation procedure is developed with the use of Bernstein polynomials to approximate the unknown baseline cumulative hazard functions. The proposed estimators are shown to be consistent and the asymptotic normality of the resulting regression parameter estimators is established. A simulation study is conducted to assess the finite sample properties of the proposed approach and indicates that it works well in practical situations. The proposed method is applied to an HIV/AIDS case-cohort study that motivated this investigation.
Keywords: Case-cohort design, dependent interval censoring, inverse probability weighting, proportional hazards model
1. Introduction
The case-cohort design is widely used as a means of reducing the cost in large cohort studies, especially when the disease rate is low and covariate measurements may be expensive (Prentice [27]; Scheike and Martinussen [30]; Self and Prentice [32]). Under this design, instead of collecting covariate information on all study subjects, one collects it only on the subjects whose failures are observed and on a subsample of the remaining subjects. Among others, one area where the design is often used is epidemiological cohort studies in which the outcomes of interest are times to failure events such as AIDS, cancer, heart disease and HIV infection. For such studies, in addition to the incomplete nature of the covariate information, another feature is that the observations are usually interval-censored rather than right-censored due to the periodic follow-up nature of the study (Sun [34]).
By interval-censored data, we usually mean that the failure time of interest is known or observed only to belong to an interval instead of being observed exactly. It is easy to see that interval-censored data include right-censored data as a special case. Furthermore, sometimes one may also face informative censoring, meaning that the failure time of interest and the censoring mechanism are correlated (Huang and Wolfe [13]; Wang et al. [37]). An example of informatively interval-censored data may arise in a periodic follow-up study of a certain disease where study subjects may not follow the pre-specified visit schedules and instead make clinical visits according to their disease status or how they feel with respect to their treatments. Among others, Huang and Wolfe [13] and Sun [33] discussed the issue and pointed out that in the presence of informative censoring, the analysis that ignores it may result in biased or misleading results or conclusions. More discussion on informatively interval-censored data can be found in Sun [34].
One real study that motivated this investigation is the HVTN 505 Trial to assess the efficacy of a DNA prime-recombinant adenovirus type 5 boost (DNA/rAd5) vaccine to prevent human immunodeficiency virus type 1 (HIV-1) infection (Fong et al. [8]; Hammer et al. [10]; Janes et al. [14]). It is well-known that HIV-1 infection is deadly as it causes AIDS for which there is no cure and thus it is important and essential to develop a safe and effective vaccine for the prevention of the infection. The original study consists of 2504 men or transgender women who had sex with men and who were examined periodically, thus yielding only interval-censored data on the time to HIV-1 infection. For each subject, the information on four demographic covariates, age, race, BMI and behavioural risk, was collected, and in addition, for a subgroup of HIV infection cases and non-cases, a number of T cell response biomarkers and antibody response biomarkers were also measured. One goal of the study is to determine or identify the important or relevant covariates or biomarkers for HIV-1 infection.
Many authors have discussed the analysis of case-cohort studies but most of the existing methods are for right-censored failure time data. For example, some of the early work on this was given by Prentice [27] and Self and Prentice [32], who proposed some pseudolikelihood approaches based on the modification of the commonly used partial likelihood method under the proportional hazards model. By following them, Chen and Lo [3] proposed an estimating equation approach that yields more efficient estimators than the pseudolikelihood estimator proposed in Prentice [27], and Chen [2] developed an estimating equation approach that applies to a class of cohort sampling designs, including the case-cohort design with the key estimating function constructed by a sample reuse method via local averaging. Also Marti and Chavance [25] and Keogh and White [18] proposed some multiple imputation methods and in particular, the latter method extended the former by considering more complex imputation models that include time and interaction or nonlinear terms. In addition, Kang and Cai [17] and Kim et al. [19] developed weighted estimating equation approaches for case-cohort studies with multiple disease outcomes, where the latter method improved the efficiency upon the former by utilizing more information in constructing the weights.
Interval-censored failure time data naturally occur in many areas, especially in the studies with periodic follow-ups, and a great deal of literature has been developed for their analysis (Chen et al. [5]; Finkelstein [7]; Sun [34]; Zhou et al. [40]). In particular, Sun [34] and Bogaerts et al. [1] provided comprehensive reviews of the existing literature on interval-censored data. Although there also exist some methods for either informatively interval-censored data or the interval-censored data arising from case-cohort studies, there does not seem to exist an established procedure for informatively interval-censored data arising from case-cohort studies. In particular, for the analysis of informatively interval-censored data, two types of approaches are commonly used: the frailty model approach and the copula model approach. For example, Zhang et al. (2005, 2007) and Wang et al. [36,38] gave some frailty model estimation procedures, while Ma et al. [23,24] and Zhao et al. (2015) proposed some copula model methods. For the analysis of the interval-censored data arising from case-cohort studies, Gilbert et al. [9] presented a midpoint imputation procedure and Li and Nan [20] considered a special case of interval-censored data, current status data, where the failure time of interest is either left- or right-censored (Jewell and van der Laan [15]). Also Zhou et al. [41] proposed a likelihood-based approach. However, all three of the methods above assume that the interval censoring mechanism is non-informative or independent of the failure time of interest. As discussed above and by many authors, informative censoring is a serious and difficult issue and the use of methods that do not take it into account can yield biased or misleading results and conclusions (Huang and Wolfe [13]; Ma et al. [23]). In the following, we will develop a frailty model approach, a generalization of the method proposed in Zhou et al. [41], for the analysis of the case-cohort studies yielding interval-censored data with informative censoring.
The remainder of the paper is organized as follows. We will begin in Section 2 with introducing some notation and models to be used throughout the paper and in particular, we will present joint frailty models for the failure time of interest and the underlying censoring mechanism. To estimate regression parameters, a sieve inverse probability weighting estimation procedure is then presented in Section 3 and in the method, Bernstein polynomials are employed to approximate unknown functions. Furthermore, we establish the consistency and asymptotic normality of the resulting estimators of regression parameters and provide a weighted bootstrap procedure for variance estimation. Section 4 presents some results obtained from an extensive simulation study conducted to assess the finite sample properties of the proposed methodology and they suggest that the method works well in practical situations. In Section 5, we apply the proposed method to the HIV/AIDS study described above and Section 6 gives some discussion and concluding remarks.
2. Notation and models
Consider a failure time study that consists of n independent subjects. For subject i, let $T_i$ denote the failure time of interest and suppose that there exists a p-dimensional vector of covariates denoted by $Z_i$ that may affect $T_i$, $i = 1, \ldots, n$. Also for subject i, suppose that there exist two examination times denoted by $U_i$ and $V_i$ with $U_i < V_i$ and one only observes $\delta_{1i} = I(T_i \le U_i)$ and $\delta_{2i} = I(U_i < T_i \le V_i)$, indicating if the failure time $T_i$ is left-censored and interval-censored, respectively. Note that here $U_i$ and $V_i$ are random variables and assumed to be observed and they together with $\delta_{1i}$ and $\delta_{2i}$ give the observed interval-censored data on the $T_i$'s (Sun [34]; Zhou et al. [41]).
For the case-cohort studies, as mentioned above, the information on covariates is available only for the subjects who either have experienced the failure event of interest, that is, with $\delta_{1i} = 1$ or $\delta_{2i} = 1$, or are from the sub-cohort that is a random sample of the entire cohort. Define $\xi_i = 1$ if the covariate $Z_i$ is available or observed and 0 otherwise, $i = 1, \ldots, n$. For the selection of the subcohort, by following Zhou et al. [40] and others, we will consider the independent Bernoulli sampling with the selection probability $q \in (0, 1)$. Then under the assumption above, the probability that the covariate $Z_i$ is observed is given by
$$\pi_q(\delta_{1i}, \delta_{2i}) = \delta_{1i} + \delta_{2i} + (1 - \delta_{1i} - \delta_{2i})\, q,$$
and the observed data have the form
$$O^{\xi} = \{ O_i^{\xi} = (U_i, V_i, \delta_{1i}, \delta_{2i}, \xi_i Z_i, \xi_i);\ i = 1, \ldots, n \}.$$
In contrast, if all covariates were observed, the full cohort data would be
$$O = \{ O_i = (U_i, V_i, \delta_{1i}, \delta_{2i}, Z_i);\ i = 1, \ldots, n \}.$$
To describe the covariate effects and dependent interval censoring, define $W_i = V_i - U_i$, $i = 1, \ldots, n$. By following Ma et al. [23], we will focus on the situation where the dependent censoring can be characterized by the correlation between the $T_i$'s and $W_i$'s. As mentioned in Ma et al. [23], one example where this may be the case is follow-up studies where some study subjects may tend to make more or fewer clinical visits than the scheduled ones. More comments on this will be given below. For the covariate effects, we assume that there exists a latent variable $b_i$ with mean one and known distribution but unknown variance η and given $Z_i$ and $b_i$, the hazard functions of $T_i$ and $W_i$ have the forms
$$\lambda^{(T)}(t \mid Z_i, b_i) = \lambda_1(t)\, b_i \exp(Z_i^{\top} \beta) \tag{1}$$
and
$$\lambda^{(W)}(t \mid Z_i, b_i) = \lambda_2(t)\, b_i \exp(Z_i^{\top} \gamma) \tag{2}$$
respectively. In the above, $\lambda_1(t)$ and $\lambda_2(t)$ are unknown baseline hazard functions and $\beta$ and $\gamma$ are vectors of unknown regression parameters. Also it will be assumed that given $Z_i$ and $b_i$, $T_i$ is independent of $U_i$ and $T_i$ and $W_i$ are independent. In other words, the correlation between $T_i$ and $W_i$ is measured by the parameter η. More comments on this are given below.
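The subcohort selection described in this section can be illustrated with a short sketch. It assumes that cases (subjects whose failure is seen at or before the second examination) always have their covariates measured and that every subject enters the subcohort independently with a common probability. The function names `inclusion_probability` and `sample_case_cohort`, the argument names, and the toy indicator vectors are hypothetical and are not taken from the paper.

```python
import random

def inclusion_probability(delta1, delta2, q):
    """Probability that covariates are observed: 1 for cases, q for non-cases."""
    is_case = delta1 + delta2  # 1 if the failure was seen by the second exam
    return is_case + (1 - is_case) * q

def sample_case_cohort(delta1, delta2, q, rng):
    """Return (xi, weights) under independent Bernoulli subcohort sampling.

    xi[i] is 1 if subject i's covariates are observed; weights[i] is the
    inverse probability weight xi[i] / inclusion probability.
    """
    xi, weights = [], []
    for d1, d2 in zip(delta1, delta2):
        in_subcohort = rng.random() < q          # Bernoulli(q) draw for everyone
        observed = 1 if (d1 + d2 == 1 or in_subcohort) else 0
        xi.append(observed)
        weights.append(observed / inclusion_probability(d1, d2, q))
    return xi, weights

# Toy cohort: one left-censored case, one interval-censored case, three non-cases.
rng = random.Random(0)
delta1 = [1, 0, 0, 0, 0]
delta2 = [0, 1, 0, 0, 0]
xi, w = sample_case_cohort(delta1, delta2, q=0.1, rng=rng)
```

Cases receive weight one, while a sampled non-case receives weight 1/q, so that each sampled non-case stands in for the non-cases whose covariates were not measured.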
Define and , where and . Assume that is independent of and the joint distribution of does not involve the parameters of interest. To motivate the proposed estimation procedure, note that conditional on , the likelihood of the observation from subject i has the form
Also note that conditional on , the likelihood of the observation on is given by
where . This motivates the following inverse probability weighted log-likelihood function
(3) |
for estimation of θ, where $f$ denotes the density function of the latent variables and
If f is the gamma distribution, the function has a closed form as
(4) |
In the next section, for estimation of θ, we will discuss the maximization of the inverse probability weighted log-likelihood function (3).
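To make the role of the gamma assumption concrete, the following sketch works out one possible closed form and checks it numerically. It is an illustration under stated assumptions, not the paper's expression (4): the latent variable is taken to be gamma distributed with mean one and variance `eta`, it is assumed to act multiplicatively on both hazards, and the failure time and the gap between the two examinations are assumed conditionally independent. Under these assumptions the identity E[b exp(-b x)] = (1 + eta x)^(-(1/eta + 1)) removes the integral. All function and argument names are hypothetical.

```python
import math

def gamma_pdf(b, eta):
    """Gamma density with shape 1/eta and scale eta (mean 1, variance eta)."""
    k = 1.0 / eta
    return math.exp((k - 1) * math.log(b) - b / eta - math.lgamma(k) - k * math.log(eta))

def conditional_lik(b, d1, d2, a_u, a_v, c_w, haz_w):
    """Likelihood of one subject given the latent variable b.

    a_u, a_v: baseline cumulative hazard of the failure time at the two exam
              times, multiplied by the covariate factor.
    c_w, haz_w: cumulative hazard and hazard of the gap time at its observed
              value, multiplied by the covariate factor.
    """
    s_u, s_v = math.exp(-b * a_u), math.exp(-b * a_v)
    lik_t = (1 - s_u) if d1 else (s_u - s_v) if d2 else s_v
    lik_w = haz_w * b * math.exp(-b * c_w)
    return lik_t * lik_w

def marginal_lik_closed(eta, d1, d2, a_u, a_v, c_w, haz_w):
    """Closed form of the integral of conditional_lik against the gamma density."""
    def g(x):
        return (1 + eta * x) ** (-(1 / eta + 1))  # E[b * exp(-b * x)]
    if d1:
        core = g(c_w) - g(a_u + c_w)
    elif d2:
        core = g(a_u + c_w) - g(a_v + c_w)
    else:
        core = g(a_v + c_w)
    return haz_w * core

def marginal_lik_numeric(eta, d1, d2, a_u, a_v, c_w, haz_w, upper=40.0, steps=40000):
    """Midpoint-rule check of the same integral (slow, for verification only)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        b = (i + 0.5) * h
        total += conditional_lik(b, d1, d2, a_u, a_v, c_w, haz_w) * gamma_pdf(b, eta)
    return total * h

def ipw_loglik(weights, marginal_liks):
    """Weighted log-likelihood; subjects with weight zero contribute nothing."""
    return sum(w * math.log(l) for w, l in zip(weights, marginal_liks) if w > 0)
```

The closed form is what makes each evaluation of the weighted log-likelihood cheap; with a frailty distribution that lacks such an identity, the integral would have to be approximated numerically for every subject at every iteration.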
3. Sieve inverse probability weighting estimation
Define the parameter space of θ
where with M being a positive constant and denotes the collection of all bounded and continuous nondecreasing, nonnegative functions over the interval , j = 1, 2. In practice, is usually taken to be the range of the 's and 's and the range of the 's. More comments on this are given below. For the maximization of the inverse probability weighted log-likelihood function , it is easy to see that this would not be straightforward since involves unknown functions and . To deal with this and by following Ma et al. [24], Zhou et al. [40] and others, we propose first to approximate the two functions by Bernstein polynomials.
More specifically, define the sieve space
with
and
In the above,
and
, which Bernstein polynomials of degree for some . Note that some restrictions are needed above on the parameters since and are nonnegative and nondecreasing functions. However, this can be easily removed by some reparameterization. For example, one can reparameterize the parameters by the cumulative sums of the parameters , j = 1, 2.
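The reparameterization idea mentioned above can be sketched as follows. The example assumes one common way of removing the monotonicity constraint: unconstrained real numbers are exponentiated and accumulated, which yields positive, nondecreasing Bernstein coefficients and hence a nonnegative, nondecreasing approximating function. The names `bernstein_basis`, `monotone_coefficients` and `cumulative_hazard`, and the particular interval endpoints, are hypothetical.

```python
import math

def bernstein_basis(t, k, m, lo, hi):
    """k-th Bernstein basis polynomial of degree m on the interval [lo, hi]."""
    x = (t - lo) / (hi - lo)
    return math.comb(m, k) * x ** k * (1 - x) ** (m - k)

def monotone_coefficients(raw):
    """Map unconstrained reals to positive, nondecreasing coefficients
    by taking cumulative sums of their exponentials."""
    coefs, total = [], 0.0
    for r in raw:
        total += math.exp(r)
        coefs.append(total)
    return coefs

def cumulative_hazard(t, raw, lo, hi):
    """Bernstein polynomial approximation of a cumulative hazard at time t."""
    coefs = monotone_coefficients(raw)
    m = len(coefs) - 1  # polynomial degree
    return sum(c * bernstein_basis(t, k, m, lo, hi) for k, c in enumerate(coefs))

# Degree-3 example on [0, 1] with arbitrary unconstrained parameters.
raw_params = [-1.0, 0.2, -0.5, 0.3]
values = [cumulative_hazard(i / 50, raw_params, 0.0, 1.0) for i in range(51)]
```

Because a Bernstein polynomial with nondecreasing coefficients is itself nondecreasing, an unconstrained optimizer can search over `raw_params` directly. One side effect of this particular mapping is that the fitted function is strictly positive at the left endpoint.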
Let denote the estimator of θ given by the value of θ that maximizes the inverse probability weighted log-likelihood function over the sieve space . Also let denote the true value of θ, , , and for any and in the parameter space Θ, define the distance
Here denotes the Euclidean norm for a vector v, , and with denoting the joint distribution function of U and W. The following two theorems establish the asymptotic properties of .
Theorem 3.1
Suppose that the regularity conditions (C1)–(C4) given in the Appendix hold. Then as we have that almost surely and where is defined in and r in the regularity condition (C3).
Theorem 3.2
Suppose that the regularity conditions (C1)–(C5) given in the Appendix hold. Then as and if we have that
in distribution, where
with for a vector v and and given in the Appendix, denoting the information matrix and efficient score for based on the complete data.
The proof of the results given above is sketched in the Appendix. For the determination of the proposed estimator , different methods can be used and in the numerical studies below, the Matlab function fmincon is used. Also for the determination of , one needs to choose or specify the degree m of Bernstein polynomials, which controls the smoothness of the approximation. For this, one common approach is to perform the grid search by considering different values of m and choosing the one that minimizes
based on the AIC criterion. Note that instead of this, one may employ other criteria such as the BIC criterion and the numerical results indicate that they give similar performance. Also note that in the approximation of and , we used the same degree m and in practice, different m could be used too.
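The grid search over the degree m can be sketched as below. This is only an outline under assumptions: the AIC is taken in its usual form (minus twice the maximized log-likelihood plus twice the number of free parameters), and each of the two baseline functions is assumed to contribute m + 1 Bernstein coefficients. The callable `fit`, which would return the maximized weighted log-likelihood for a given m, is a hypothetical placeholder, and the log-likelihood values in the example are made up for illustration.

```python
def aic(max_loglik, n_params):
    """Usual AIC: smaller is better."""
    return -2.0 * max_loglik + 2.0 * n_params

def select_degree(fit, degrees, n_regression_params):
    """Return (best_m, best_aic) over a grid of Bernstein degrees.

    fit(m) must return the maximized log-likelihood for degree m.
    """
    best = None
    for m in degrees:
        n_params = n_regression_params + 2 * (m + 1)  # two baseline functions
        score = aic(fit(m), n_params)
        if best is None or score < best[1]:
            best = (m, score)
    return best

# Made-up maximized log-likelihoods, standing in for real model fits.
toy_loglik = {2: -510.0, 3: -500.0, 4: -499.5, 5: -499.4}
best_m, best_score = select_degree(toy_loglik.get, [2, 3, 4, 5], n_regression_params=3)
```

In the toy numbers, moving from m = 2 to m = 3 improves the fit enough to pay for the two extra coefficients, while larger degrees do not, so m = 3 is selected.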
For inference about , of course, one needs to estimate the covariance matrix of . For this, a natural way would be to derive a consistent estimator of Σ. On the other hand, one could see from the Appendix that Σ involves the information matrix and the efficient score and both of them do not have closed forms. Thus, it would be difficult to derive a consistent estimator and instead we propose to employ the weighted bootstraps procedure discussed in Ma and Kosorok [22], which is easy to implement and seems to work well in the numerical studies described below. Specifically, let denote n independent realizations of a bounded positive random variable u satisfying and and define the new weights , . Also let denote the estimator of ϑ proposed above with replacing the 's by the 's. Then if we repeat this B times, one can estimate the covariance matrix of by the sample covariance matrix of the 's. By following Ma and Kosorok [22], it can be shown that this weighted bootstrap variance estimator is consistent.
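The weighted bootstrap just described can be outlined in a few lines. The sketch assumes that the random multipliers are drawn from the exponential distribution with rate one, and it uses a weighted mean as a stand-in for the actual sieve estimator so that the example is self-contained. The names `weighted_bootstrap_cov` and `weighted_mean` are hypothetical.

```python
import random

def weighted_bootstrap_cov(fit, weights, n_boot, rng):
    """Sample covariance matrix of estimates refitted with perturbed weights.

    fit(weights) must return a list of parameter estimates.
    """
    draws = []
    for _ in range(n_boot):
        # Multiply each original weight by an independent positive multiplier.
        perturbed = [rng.expovariate(1.0) * w for w in weights]
        draws.append(fit(perturbed))
    p = len(draws[0])
    means = [sum(d[j] for d in draws) / n_boot for j in range(p)]
    return [
        [sum((d[j] - means[j]) * (d[k] - means[k]) for d in draws) / (n_boot - 1)
         for k in range(p)]
        for j in range(p)
    ]

# Stand-in estimator: a weighted mean of toy observations.
rng = random.Random(1)
y = [rng.gauss(0.0, 1.0) for _ in range(200)]

def weighted_mean(weights):
    return [sum(w * v for w, v in zip(weights, y)) / sum(weights)]

cov = weighted_bootstrap_cov(weighted_mean, [1.0] * len(y), n_boot=500, rng=rng)

# Textbook variance of the sample mean, for comparison with cov[0][0].
y_bar = sum(y) / len(y)
naive_var = sum((v - y_bar) ** 2 for v in y) / (len(y) - 1) / len(y)
```

Only the weights change between replications, so the same fitting routine can be reused without resampling subjects, which keeps the case-cohort structure of the data intact.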
4. A simulation study
In this section, we report some results obtained from a simulation study conducted to evaluate the finite sample performance of the inverse probability weighted estimation procedure proposed in the previous sections. In the study, it was assumed that the covariate Z followed the Bernoulli distribution with the success probability of 0.5 and to generate the subcohort, as mentioned above, we considered the independent Bernoulli sampling with the selection probability being 0.1. For the proportion of the observed failure events or the event rate, we studied several cases including , 0.1 and 0.2. To generate interval-censored data, we first generated the 's from the uniform distribution over with a being a positive constant and the latent variable 's. Then the 's and 's were generated based on models (2.1) and (2.2) with , 0.1t or , and the 's were defined as for all i. The results given below are based on the full cohort size n = 1000 or 2000 with 1000 replications.
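A data-generating step of this kind might look as follows. The sketch is written under several assumptions that go beyond what is stated above: the latent variable is gamma distributed with mean one and acts multiplicatively on both hazards, the baseline cumulative hazards are linear in time, and the second examination time is the first one plus the generated gap. The function name `simulate_cohort`, the argument names, and the parameter values in the example call are hypothetical.

```python
import math
import random

def simulate_cohort(n, beta, gamma, eta, a, rate1, rate2, rng):
    """Simulate one full cohort with a single binary covariate."""
    records = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0                     # Bernoulli(0.5) covariate
        u = rng.uniform(0.0, a)                                # first examination time
        b = rng.gammavariate(1.0 / eta, eta)                   # mean 1, variance eta
        t = rng.expovariate(rate1 * b * math.exp(beta * z))    # failure time
        w = rng.expovariate(rate2 * b * math.exp(gamma * z))   # gap between exams
        v = u + w                                              # second examination time
        delta1 = 1 if t <= u else 0                            # left-censored
        delta2 = 1 if u < t <= v else 0                        # interval-censored
        records.append({"z": z, "u": u, "v": v, "delta1": delta1, "delta2": delta2})
    return records

rng = random.Random(2024)
cohort = simulate_cohort(n=1000, beta=0.5, gamma=0.5, eta=0.5,
                         a=1.0, rate1=0.1, rate2=1.0, rng=rng)
event_rate = sum(r["delta1"] + r["delta2"] for r in cohort) / len(cohort)
```

In such a set-up the baseline rate of the failure time and the constant `a` would be tuned until `event_rate` is close to the target proportion of observed failures.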
Table 1 presents the results obtained on the proposed estimators , and with n = 1000, the true values of the parameters being , 0.2 or 0.5 and , and the 's following the gamma distribution. The results include the estimated bias (Bias) given by the average of the proposed estimates minus the true value, the sample standard error (SSE), the average of the estimated standard errors (ESE) and the empirical coverage probability (CP). Here we took the degree of Bernstein polynomials being m = 3 and the weighted bootstrap sample size B = 100 for variance estimation. Also for the variance estimation, we generated the random sample repeatedly from the exponential distribution. Table 2 gives the estimation results obtained under the same set-up as above except n = 2000. One can see from the two tables that the results indicate that the proposed estimator seems to be unbiased and the weighted bootstrap variance estimation procedure seems to work well. Also they indicate that the normal approximation to the distribution of the proposed estimator appears to be reasonable. In addition, as expected, the estimation results became better when the percentage of the observed failure events or the full cohort size increased. We also considered other set-ups including different values for m and B and obtained similar results.
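The four summary measures reported in the tables can be computed from the replication-level output as sketched below. The example assumes that the coverage probability refers to nominal 95% Wald intervals of the form estimate ± 1.96 × estimated standard error; the function name `summarize` and the four toy replications are hypothetical.

```python
import statistics

def summarize(estimates, std_errors, truth, z=1.96):
    """Return (Bias, SSE, ESE, CP) over simulation replications."""
    bias = statistics.mean(estimates) - truth      # average estimate minus truth
    sse = statistics.stdev(estimates)              # sample standard error
    ese = statistics.mean(std_errors)              # average estimated standard error
    covered = [abs(e - truth) <= z * s for e, s in zip(estimates, std_errors)]
    cp = sum(covered) / len(covered)               # empirical coverage probability
    return bias, sse, ese, cp

# Four toy replications with true value 1.0.
bias, sse, ese, cp = summarize([0.9, 1.1, 1.0, 1.3], [0.1, 0.1, 0.1, 0.1], truth=1.0)
```

Agreement between SSE and ESE indicates that the variance estimator is tracking the actual sampling variability, and a CP near 0.95 supports the normal approximation.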
Table 1. Estimation of regression parameters with n = 1000.
Event rate | Parameter | Bias | SSE | ESE | CP |
---|---|---|---|---|---|
5% | | −0.0107 | 0.3920 | 0.3903 | 0.9510 |
 | | −0.0026 | 0.3180 | 0.3247 | 0.9460 |
 | | −0.0034 | 0.2778 | 0.2863 | 0.9330 |
 | | 0.0074 | 0.4045 | 0.3958 | 0.9450 |
 | | 0.0163 | 0.3259 | 0.3273 | 0.9480 |
 | | 0.0180 | 0.2722 | 0.2852 | 0.9520 |
 | | 0.0272 | 0.4070 | 0.4032 | 0.9490 |
 | | 0.0203 | 0.3469 | 0.3330 | 0.9380 |
 | | 0.0200 | 0.2789 | 0.2813 | 0.9470 |
10% | | −0.0163 | 0.3428 | 0.3377 | 0.9330 |
 | | −0.0047 | 0.2976 | 0.3106 | 0.9540 |
 | | 0.0347 | 0.2475 | 0.2544 | 0.9420 |
 | | −0.0003 | 0.3413 | 0.3394 | 0.9530 |
 | | 0.0158 | 0.3127 | 0.3096 | 0.9400 |
 | | 0.0347 | 0.2509 | 0.2493 | 0.9410 |
 | | 0.0117 | 0.3447 | 0.3438 | 0.9470 |
 | | 0.0291 | 0.3146 | 0.3130 | 0.9410 |
 | | 0.0386 | 0.2500 | 0.2440 | 0.9370 |
20% | | −0.0067 | 0.3058 | 0.3022 | 0.9480 |
 | | 0.0074 | 0.2766 | 0.2740 | 0.9400 |
 | | 0.0071 | 0.2202 | 0.2159 | 0.9240 |
 | | −0.0022 | 0.3066 | 0.3027 | 0.9410 |
 | | 0.0134 | 0.2757 | 0.2756 | 0.9480 |
 | | 0.0132 | 0.2178 | 0.2150 | 0.9340 |
 | | 0.0021 | 0.3089 | 0.3058 | 0.9410 |
 | | 0.0223 | 0.2786 | 0.2791 | 0.9440 |
 | | 0.0165 | 0.2220 | 0.2155 | 0.9230 |
Table 2. Estimation of regression parameters with n = 2000.
Event rate | Parameter | Bias | SSE | ESE | CP |
---|---|---|---|---|---|
5% | | −0.0106 | 0.2718 | 0.2707 | 0.9480 |
 | | 0.0007 | 0.2255 | 0.2270 | 0.9470 |
 | | 0.0117 | 0.1945 | 0.2094 | 0.9620 |
 | | 0.0080 | 0.2772 | 0.2751 | 0.9480 |
 | | 0.0112 | 0.2299 | 0.2289 | 0.9510 |
 | | 0.0089 | 0.1872 | 0.2085 | 0.9730 |
 | | 0.0170 | 0.2813 | 0.2781 | 0.9440 |
 | | 0.0052 | 0.2299 | 0.2313 | 0.9510 |
 | | 0.0205 | 0.1877 | 0.2015 | 0.9660 |
10% | | −0.0046 | 0.2345 | 0.2344 | 0.9440 |
 | | 0.0081 | 0.2097 | 0.2148 | 0.9500 |
 | | 0.0346 | 0.1722 | 0.1847 | 0.9660 |
 | | 0.0058 | 0.2404 | 0.2371 | 0.9400 |
 | | 0.0060 | 0.2130 | 0.2172 | 0.9580 |
 | | 0.0371 | 0.1827 | 0.1830 | 0.9410 |
 | | 0.0083 | 0.2371 | 0.2407 | 0.9540 |
 | | 0.0151 | 0.2208 | 0.2193 | 0.9450 |
 | | 0.0372 | 0.1705 | 0.1830 | 0.9570 |
20% | | −0.0023 | 0.2083 | 0.2106 | 0.9500 |
 | | −0.0096 | 0.1897 | 0.1928 | 0.9550 |
 | | 0.0060 | 0.1609 | 0.1671 | 0.9520 |
 | | 0.0011 | 0.2114 | 0.2122 | 0.9600 |
 | | 0.0099 | 0.1918 | 0.1938 | 0.9540 |
 | | 0.0075 | 0.1575 | 0.1645 | 0.9560 |
 | | −0.0034 | 0.2197 | 0.2137 | 0.9370 |
 | | 0.0040 | 0.1953 | 0.1958 | 0.9430 |
 | | 0.0036 | 0.1597 | 0.1628 | 0.9430 |
In the proposed estimation procedure, it has been assumed that the distribution of the latent variables 's is known up to a variance parameter. Hence in practice, one question of interest may be the robustness of the estimation procedure with respect to the distribution. To investigate this, we repeated the simulation study above giving the results in Table 1 with except that we generated the 's from the log-normal distribution instead of the gamma distribution but assumed that they followed the gamma distribution. Table 3 presents the results obtained on the proposed estimators and , including the Bias, the SSE, the ESE and the empirical CP. As before, they suggest that the proposed methodology seems to work well or the estimators and appear to be robust with respect to the distribution of the latent variables.
Table 3. Estimation of regression parameters with n = 1000, and misspecified frailty distribution.
Parameter | Bias | SSE | ESE | CP |
---|---|---|---|---|
 | 0.0040 | 0.3192 | 0.3221 | 0.9500 |
 | −0.0017 | 0.2682 | 0.2600 | 0.9470 |
 | 0.0025 | 0.3218 | 0.3238 | 0.9500 |
 | 0.0014 | 0.2711 | 0.2617 | 0.9410 |
 | −0.0035 | 0.3356 | 0.3289 | 0.9410 |
 | 0.0090 | 0.2728 | 0.2678 | 0.9510 |
For the problem discussed here, instead of the inverse probability weighting method proposed above, there exist two commonly used naive approaches that estimate regression parameters by using the regular likelihood approaches. One is to base the estimation only on the selected sub-cohort and the other is to base the estimation on a simple random sample that has the same size as the case-cohort sample. Let and denote the estimators of given by the two naive methods above, respectively, and here we only focus on the estimation of . Table 4 gives the estimation results given by the proposed method and the two naive approaches under the set-up similar to that for Table 2 with . Note that here for comparison, we also considered the approach given by Zhou et al. [40], which treated the observation process to be independent of the failure time of interest or ignored the correlation between the failure time and the observation process. The resulting estimator of is denoted by in the table. One can see from Table 4 that the proposed estimate clearly gave better performance than the two naive estimates and one would get biased results if ignoring the correlation between the failure time of interest and the observation process.
Table 4. Comparison of the proposed and naive estimators for with and .
 | Estimator | Bias | SSE | ESE | CP |
---|---|---|---|---|---|
 | Proposed | −0.0129 | 0.2544 | 0.2546 | 0.9550 |
 | Naive (sub-cohort only) | −0.0379 | 0.5391 | 0.5308 | 0.9470 |
 | Naive (simple random sample) | −0.0031 | 0.3677 | 0.3697 | 0.9600 |
 | Zhou et al. [40] | −0.1840 | 0.2142 | 0.2161 | 0.8610 |
 | Proposed | −0.0027 | 0.2534 | 0.2595 | 0.9540 |
 | Naive (sub-cohort only) | 0.0151 | 0.5655 | 0.5635 | 0.9560 |
 | Naive (simple random sample) | −0.0249 | 0.3959 | 0.3853 | 0.9390 |
 | Zhou et al. [40] | −0.2282 | 0.2117 | 0.2206 | 0.8040 |
As pointed out by a reviewer and motivated by the real data discussed below, we also repeated the study that gave the results in Table 1, in which we generated the subcohort in the same way as before but only from non-case subjects instead of all subjects as above. In other words, the goal here is to assess the performance of the proposed approach for case–control studies. The obtained estimation results are presented in Table 5 and one can see that they are similar to those given in Table 1. That is, the proposed estimation approach seems to give good performance for, and can be applied to, case–control studies too.
Table 5. Estimation of regression parameters for case–control studies with and .
Parameter | Bias | SSE | ESE | CP |
---|---|---|---|---|
 | −0.0048 | 0.4015 | 0.4086 | 0.9560 |
 | 0.0060 | 0.3196 | 0.3283 | 0.9570 |
 | 0.0022 | 0.3048 | 0.3189 | 0.9390 |
 | −0.0177 | 0.3930 | 0.4108 | 0.9570 |
 | −0.0158 | 0.3176 | 0.3290 | 0.9590 |
 | 0.0143 | 0.3076 | 0.3226 | 0.9550 |
 | −0.0015 | 0.3960 | 0.4003 | 0.9570 |
 | 0.0055 | 0.3287 | 0.3354 | 0.9610 |
 | 0.0250 | 0.3266 | 0.3179 | 0.9450 |
5. An application
In this section, we will apply the methodology proposed in the previous sections to the HVTN 505 Trial discussed above. It is a randomized, multi-site clinical trial of men or transgender women who had sex with men for assessing the efficacy of the DNA/rAd5 vaccine for HIV-1 infection (Fong et al. [8]; Hammer et al. [10]; Janes et al. [14]). As mentioned above, the original study consists of the subjects randomly assigned to receive either the DNA/rAd5 vaccine or placebo, and in the following, we will focus only on the 1253 subjects in the vaccine group. It is well-known that HIV-1 infection is deadly as it causes AIDS for which there is no cure and thus it is important and essential to develop a safe and effective vaccine for the prevention of the infection. For each subject, four demographic covariates were observed and they are age, race, BMI and behavioural risk. In addition, to assess their relationship with the HIV infection, a number of T cell response biomarkers and antibody response biomarkers were measured for a cohort of 150 subjects consisting of all HIV infection cases (25) and 125 other randomly selected subjects among the vaccine recipients. The failure time of interest here is the time to true HIV-1 infection, for which only interval-censored data are available.
In all previous analyses, the authors simplified the observed data into right-censored data and also did not consider the possibility of informative censoring (Fong et al. [8]; Hammer et al. [10]; Janes et al. [14]). They identified the T cell response biomarker Env CD8+ polyfunctionality score and the antibody response biomarker IgG.Cconenv03140CF.avi as possibly having significant effects on the HIV infection time. For simplicity, below we will refer to these two biomarkers as Env CD8 Score and IgG, respectively. For the analysis below, by following Fong et al. [8] and Janes et al. [14], we will focus on the cohort of 150 vaccine recipients, which can be treated as a case–control design with the full cohort being all subjects in the vaccine group, and investigate the relationship between the HIV infection time and the four demographic covariates plus the two biomarkers.
Table 6 presents the estimation results given by the application of the methodology proposed in the previous sections to the HVTN 505 Trial, including the estimated covariate effects and , the estimated standard errors (ESE) and the p-values for testing the covariate effect being zero. Here for the degree of Bernstein polynomials, we tried several values, including m = 2, 3, 4, 5, 6 and 7, and the results above were obtained based on m = 3, which gave the smallest AIC defined above, and B = 500. One can see from Table 6 that the proposed estimation procedure suggests that among the six covariates considered here, two demographic covariates, race and behavioural risk, seem to be correlated with the HIV infection time and the two biomarkers also appear to have significant prognostic effects on the development of HIV infection. On the other hand, the age and BMI did not seem to have any effects on the HIV infection. In addition, the race and behavioural risk appear to have significant effects on the observation process too.
Table 6. Estimated covariate effects for the HVTN 505 Trial.
Covariate | Estimate (failure time) | ESE | p-value | Estimate (observation process) | ESE | p-value |
---|---|---|---|---|---|---|
Proposed method | | | | | | |
age | −0.2116 | 0.2523 | 0.4018 | 0.0287 | 0.3174 | 0.9279 |
race | −0.7962 | 0.4676 | 0.0886 | 1.7492 | 0.6204 | 0.0048 |
BMI | −0.1560 | 0.3020 | 0.6055 | 0.1813 | 0.3621 | 0.6166 |
behavioural risk | 1.1079 | 0.5677 | 0.0510 | 2.2763 | 0.6781 | 0.0008 |
Env CD8 Score | −0.9575 | 0.2286 | 0.0000 | 0.2661 | 0.4628 | 0.5652 |
IgG | −0.5085 | 0.1610 | 0.0016 | 0.2744 | 0.1611 | 0.0886 |
η | 0.0030 | 2.0820 | 0.9989 | | | |
Method given in Zhou et al. [40] | | | | | | |
age | −0.2114 | 0.2580 | 0.4125 | | | |
race | −0.7985 | 0.4996 | 0.1100 | | | |
BMI | −0.1561 | 0.2832 | 0.5814 | | | |
behavioural risk | 1.1086 | 0.7482 | 0.1385 | | | |
Env CD8 Score | −0.9574 | 0.2846 | 0.0008 | | | |
IgG | −0.5089 | 0.1620 | 0.0017 | | | |
For comparison, we also applied the method given in Zhou et al. [40], which assumed that the HIV infection time and the observation process were independent, to the data and included the resulting estimated covariate effects in Table 6 along with the estimated standard errors and the p-values. One can see from the table that one difference between the results given by the two methods is in the estimation of the effect of the behavioural risk factor, which did not seem to have any effect on the development of HIV infection based on the method given in Zhou et al. [40]. One explanation for this may be that the method given in Zhou et al. [40] ignored the existence of informative censoring.
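The p-values in Table 6 appear to be consistent with two-sided Wald tests based on the normal approximation, although the text does not state the test explicitly, so this should be read as an assumption. Under that assumption, a p-value can be recomputed from an estimate and its standard error as sketched below; the function name `wald_p_value` is hypothetical.

```python
import math

def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: effect = 0, using z = estimate / std_error."""
    z = estimate / std_error
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))

# Rows taken from Table 6 (proposed method, effects on the failure time).
p_race = wald_p_value(-0.7962, 0.4676)
p_age = wald_p_value(-0.2116, 0.2523)
p_igg = wald_p_value(-0.5085, 0.1610)
```

The recomputed values for race, age and IgG agree with the tabulated 0.0886, 0.4018 and 0.0016 up to the rounding of the reported estimates and standard errors.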
6. Discussion and concluding remarks
This paper discussed the analysis of case-cohort studies that yield informatively interval-censored failure time data arising from the proportional hazards model. As discussed above, a great deal of literature has been developed for the analysis of case-cohort studies that give right-censored data. In practice, however, the observed information on the failure time is more likely and naturally given in the form of interval-censored data, which is especially the case for longitudinal or periodic follow-up studies. One major difference between right-censored data and interval-censored data is that the latter has a much more complex structure than the former, which makes the analysis of the latter much more difficult. Although a large amount of literature has also been established for the analysis of either interval-censored data or case-cohort studies, there is no method available for the informative censoring situation discussed above. As pointed out before and seen in Section 5, informative censoring often occurs naturally and for the situation, the analysis that ignores it could result in biased or misleading results and conclusions.
As discussed in Sections 4 and 5, a type of study that is similar to the case-cohort study is the case–control study and the key difference between the two is the generation of the subcohort. With the case-cohort design, the subcohort is sampled from all study subjects, while the case–control design samples the subcohort only from the subjects who do not experience the failure event of interest during the follow-up. It is apparent that the data structures under the two designs are different but on the other hand, the simulation study suggested that the proposed estimation approach seems to be valid for the case–control design too. A possible explanation for this is that the resulting data may carry similar information about the model and the regression parameters of interest given the low event rate.
In practice, interval-censored data may be given in different forms (Sun [34]). For example, instead of the form discussed here, one may have case K or mixed interval-censored data (Wang et al. [37]). Note that for the analysis, one can still apply the proposed estimation procedure to these situations by expressing the data using the format described here. However, the derivation or establishment of the asymptotic properties may be different and one may need some other assumptions similar to those described in Huang [11] and Wang et al. [37]. In the previous sections, the focus has been on the informative censoring that can be characterized by models (1) and (2) or through latent variables. More specifically, it has been assumed that the magnitude of the informative censoring can be measured by the parameter η. It is apparent that as with most frailty model approaches, a natural question would be whether one can test η = 0. Unfortunately, there does not seem to exist an established procedure for this in the literature. Another related question is the possibility of performing goodness-of-fit tests on models (1) and (2). For this, if η = 0, one may apply the test procedures given in Ren and He [28] and McKeague and Utikal [26], respectively, to test them separately. However, it would be difficult or not straightforward to generalize either of them to the situation discussed here.
As mentioned above, another commonly used method for dealing with informative censoring is the copula model approach, which directly models the joint distribution of the failure time of interest and the censoring variables (Sun [34]). For example, Cui et al. [6] and Ma et al. [24] developed two such methods for regression analysis of current status data with informative censoring, a special case of interval-censored data where each subject is observed only once. Among others, Ma et al. [23] proposed a copula model approach for regression analysis of general interval-censored data. An advantage of the copula model approach is that it allows one to model the marginal distributions and the association parameter separately, but it has the limitation that one needs to assume that the underlying copula function is known.
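To make the separation between the marginal distributions and the association parameter concrete, the following Python sketch builds a joint survival function for a failure time and a censoring time from two marginal survival functions and a copula. The Clayton copula, the exponential margins, the rates and the function names are assumptions chosen purely for illustration; they are not the models used in the cited papers.

```python
import math

def clayton_copula(u, v, alpha):
    """Clayton copula C(u, v) for alpha > 0; alpha near 0 gives independence."""
    return (u ** (-alpha) + v ** (-alpha) - 1.0) ** (-1.0 / alpha)

def joint_survival(t, c, rate_t, rate_c, alpha):
    """P(T > t, C > c) with exponential margins linked by a Clayton copula."""
    s_t = math.exp(-rate_t * t)   # marginal survival of the failure time
    s_c = math.exp(-rate_c * c)   # marginal survival of the censoring time
    return clayton_copula(s_t, s_c, alpha)

# The margins (rates 0.5 and 0.3) stay fixed; only the association changes.
indep = math.exp(-0.5 * 1.0) * math.exp(-0.3 * 1.0)
near_indep = joint_survival(1.0, 1.0, 0.5, 0.3, alpha=1e-6)
dependent = joint_survival(1.0, 1.0, 0.5, 0.3, alpha=2.0)
margin_t = joint_survival(1.0, 0.0, 0.5, 0.3, alpha=2.0)
print(indep, near_indep, dependent, margin_t)
```

Changing alpha alters the joint survival probability while leaving each margin unchanged, which is the flexibility referred to above; the price is that the copula family itself has to be specified in advance.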
It is well known that although the proportional hazards model is one of the most commonly used models for regression analysis of failure time data, sometimes a different model may be preferred because it fits the data or describes the problem of interest better (Kalbfleisch and Prentice [16]). For example, the additive hazards model is usually preferred if the excess risk is of interest, and one may want to consider the linear transformation model if model flexibility is more important. Some literature has been developed for these and other models for regression analysis of general interval-censored data or the analysis of case-cohort studies that yield right-censored data. However, there does not seem to exist an established estimation procedure for the problem discussed here under other models. It would therefore be useful to generalize the proposed method to the additive hazards or linear transformation model.
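The distinction between relative and excess risk can be seen in a short Python sketch that evaluates the hazard under the two model forms. The baseline hazard, the coefficient values and the function names below are hypothetical and serve only to contrast the multiplicative and additive covariate effects.

```python
import numpy as np

def ph_hazard(t, z, beta, base_hazard):
    """Proportional hazards: the covariate multiplies the baseline hazard."""
    return base_hazard(t) * np.exp(beta * z)

def additive_hazard(t, z, beta, base_hazard):
    """Additive hazards: the covariate adds a constant excess risk."""
    return base_hazard(t) + beta * z

def base(t):
    """Hypothetical increasing baseline hazard."""
    return 0.1 + 0.05 * t

t = np.linspace(0.0, 5.0, 6)

# Hazard ratio between z = 1 and z = 0 under the proportional hazards form.
ph_ratio = ph_hazard(t, 1.0, 0.7, base) / ph_hazard(t, 0.0, 0.7, base)
# Hazard difference between z = 1 and z = 0 under the additive hazards form.
ah_diff = additive_hazard(t, 1.0, 0.2, base) - additive_hazard(t, 0.0, 0.2, base)
print(ph_ratio, ah_diff)
```

The ratio is constant over time under the first form and the difference is constant under the second, which is why the additive form is the natural choice when the excess risk is the quantity of interest.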
Acknowledgments
The authors wish to thank the Editor-in-Chief, the Associate Editor and three reviewers for their many critical and constructive comments and suggestions that greatly improved the paper. The authors also thank Dr Peter Gilbert for providing the HIV example data.
Appendix. Proofs of the asymptotic properties of the proposed estimator
In this appendix, we will sketch the proof of the asymptotic properties of the proposed estimator . Let τ denote the length of study. Then a single observation can be written as
To establish the asymptotic properties, we need the following regularity conditions, which are commonly used in the studies of interval-censored data and usually satisfied in practice (Huang and Rossini [12]; Zhang et al. [39]; Ma et al. [23]; Zhou et al. [40]).
(C1) The distribution of the covariate Z has a bounded support in and is not concentrated on any proper subspace of .
(C2) The true parameters lie in the interior of a compact set in .
(C3) The first derivative of and , denoted by and , is Hölder continuous with exponent . That is, there exists a constant K>0 such that for all , where . Let .
(C4) There exists a constant K>0 such that for every θ in a neighbourhood of , where is the weighted log-likelihood function based on a single observation .
(C5) The matrix is finite and positive definite, where for a vector v, and is the efficient score for based on the complete observation and will be given in the proof of Theorem 2.
For the proof, we will mainly employ the empirical process theory and some nonparametric techniques. Let denote the expectation of under the probability measure P, and , the expectation of under the empirical measure . Define the covering number of the class , where is the weighted log-likelihood function based on a single observation . Also for any , define the covering number as the smallest positive integer κ for which there exists such that
for all , where represent the observed data and for . If no such κ exists, define . Also for the proof, we need the following two lemmas, whose proofs are similar to those of Lemmas 1 and 2 in Zhou et al. [40] and are thus omitted.
Lemma A.1
Assume that the regularity conditions (C1)–(C3) given above hold. Then we have that the covering number of the class satisfies
for a constant K, where with is the degree of Bernstein polynomials, and with a>0 controls the size of the sieve space .
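To make the Bernstein polynomial sieve referred to in Lemma A.1 more concrete, the following Python sketch evaluates a Bernstein polynomial of degree m on [0, τ] and checks how the approximation error shrinks as m grows. It is only an illustration under assumptions: the target function t², the value of τ, the function names, and the reparametrization of the coefficients through cumulative sums of exponentials (used here to keep the fitted function nondecreasing) are hypothetical and need not coincide with the construction of the sieve space in this paper.

```python
import numpy as np
from math import comb

def bernstein_basis(t, m, tau):
    """Matrix of Bernstein basis polynomials of degree m on [0, tau]."""
    x = np.asarray(t, dtype=float) / tau
    k = np.arange(m + 1)
    coef = np.array([comb(m, j) for j in k], dtype=float)
    return coef * x[:, None] ** k * (1.0 - x[:, None]) ** (m - k)

def monotone_coefficients(raw):
    """Map unconstrained values to positive, increasing coefficients."""
    return np.cumsum(np.exp(raw))

def cumulative_hazard(t, raw, tau):
    """Bernstein polynomial with monotone coefficients, hence nondecreasing."""
    phi = monotone_coefficients(raw)
    return bernstein_basis(t, phi.size - 1, tau) @ phi

tau = 2.0
grid = np.linspace(0.0, tau, 201)
target = grid ** 2                   # hypothetical cumulative hazard

def bernstein_fit_error(m):
    """Sup-norm error of the classical Bernstein approximation of degree m."""
    knots = tau * np.arange(m + 1) / m
    approx = bernstein_basis(grid, m, tau) @ knots ** 2
    return np.max(np.abs(approx - target))

err_20, err_200 = bernstein_fit_error(20), bernstein_fit_error(200)
values = cumulative_hazard(grid, np.array([-1.0, 0.3, -0.5, 0.2]), tau)
print(err_20, err_200, values[:3])
```

For this quadratic target the error of the classical Bernstein approximation is τ²x(1 − x)/m with x = t/τ, so increasing the degree by a factor of ten reduces the maximal error by the same factor.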
Lemma A.2
Assume that the regularity conditions (C1)–(C3) given above hold. Then we have that
almost surely.
Proof of Theorem 3.1.
We first prove the strong consistency of . Let denote the weighted log-likelihood function based on a given single observation and consider the class of functions . By Lemma A.1, the covering number of satisfies
Furthermore, by Lemma A.2, we have
(A1) Note that , then and maximizes . Let , and define for and
Then
(A2) If then we have
(A3) Define . Under Condition (C4), we have . It follows from (A2) and (A3) that
with and hence This gives , and by (A1) and the strong law of large numbers, we have both and almost surely. Therefore, , which proves that almost surely.
Now we will show the convergence rate of by using Theorem 3.4.1 of van der Vaart and Wellner [35]. Below we use to denote a universal positive constant which may differ from place to place. First note from Theorem 1.6.2 of Lorentz [21] that there exists a Bernstein polynomial and such that and Define Then we have . For any define the class of functions for a given single observation . One can easily show that From Condition (C4), for large n, we have
for any
Following the calculations in Shen and Wong [32] (p. 597), we can establish that for , with . Moreover, some algebraic manipulations yield that for any Under Conditions (C1)–(C3), it is easy to see that is uniformly bounded. Therefore, by Lemma 3.4.2 of van der Vaart and Wellner [35], we obtain
where This yields It is easy to see that is decreasing in ρ, and where .
Finally note that and in probability. Thus by applying Theorem 3.4.1 of van der Vaart and Wellner [35], we have . This together with yields that and the proof is completed.
Proof of Theorem 3.2.
Now we will prove the asymptotic normality of . First we will establish the asymptotic normality for the estimator based on the complete observation . With a little abuse of notation, we still denote the complete-data estimator as .
Let V denote the linear span of and define the Fisher inner product for as and the Fisher norm for as where
denotes the first order directional derivative of at the direction (evaluated at ). Also let be the closed linear span of V under the Fisher norm. Then is a Hilbert space. Furthermore, for a vector of dimension with and any , define a smooth functional of θ as and
whenever the right-hand-side limit is well defined. Then by the Riesz representation theorem, there exists such that for all and Also note that It thus follows from the Cramér-Wold device that to prove the asymptotic normality for , i.e. in distribution, it suffices to show that
(A4) since In fact, (A4) holds since one can show that and .
We first prove that . Let denote the rate of convergence obtained in Theorem 3.1, and for any such that , define the first order directional derivative of at the direction as
and the second-order directional derivative at the directions as
Note that by Condition (C3) and Theorem 1.6.2 of Lorentz [21], there exists such that . Furthermore, under the assumption , we have . Define and let be any positive sequence satisfying Then by the definition of , we have
We will investigate the asymptotic behaviour of , and . For , it follows from Conditions (C1)–(C3), Chebyshev's inequality and that For by the mean value theorem, we obtain that
where lies between and By Theorem 2.8.3 of van der Vaart and Wellner [35], we know that is a Donsker class. Therefore, by Theorem 2.11.23 of van der Vaart and Wellner [35], we have For , note that
where lies between and θ and the last equation follows from Taylor expansion and Conditions (C1)–(C3). Therefore,
where the last equality holds due to the facts the Cauchy-Schwarz inequality, and Combining the above facts, together with we can establish that
Therefore, we obtain and then by the central limit theorem and
Next we will prove that . For each component we denote by the value of minimizing
where is the score function for ϑ, is the score operator for , j = 1, 2, and is a -dimensional vector of zeros except the q-th element equal to 1.
Define the q-th element of as , , and as . By Condition (C5), the matrix is positive definite. Furthermore, by following similar calculations in Chen et al. [4] (Section 3.2), we obtain
Thus, we have shown that in distribution for the estimator based on the complete data.
Now consider the estimator based only on the case-cohort data. Note that the weight is bounded and does not depend on θ, and . By Theorem 3.2 of Saegusa and Wellner [29], we have
where and , defined above, are the information and efficient score for ϑ based on the complete data. Note that
Thus, we have
in distribution, where
Funding Statement
The work was partially supported by the National Science Foundation of USA grant DMS-1916170, the National Natural Science Foundation of China grant 11671168, the Science and Technology Developing Plan of Jilin Province of China grant 20170101061JC, and the National Institute of Allergy and Infectious Diseases of USA grant 1 R56 AI140953-01.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1. Bogaerts K., Komarek A., and Lesaffre E., Survival Analysis with Interval-Censored Data: A Practical Approach with Examples in R, SAS, and BUGS, CRC Press, 2017.
- 2. Chen K., Generalized case cohort sampling, J. R. Statist. Soc. B 63 (2001), pp. 791–809. doi: 10.1111/1467-9868.00313
- 3. Chen K. and Lo S.H., Case-cohort and case-control analysis with Cox's model, Biometrika 86 (1999), pp. 755–764. doi: 10.1093/biomet/86.4.755
- 4. Chen X., Fan Y., and Tsyrennikov V., Efficient estimation of semiparametric multivariate copula models, J. Am. Stat. Assoc. 101 (2006), pp. 1228–1240. doi: 10.1198/016214506000000311
- 5. Chen D.G., Sun J., and Peace K., Interval-Censored Time-to-Event Data: Methods and Applications, CRC Press, 2012.
- 6. Cui Q., Zhao H., and Sun J., A new copula model-based method for regression analysis of dependent current status data, Stat. Interface. 11 (2018), pp. 463–471. doi: 10.4310/SII.2018.v11.n3.a9
- 7. Finkelstein D.M., A proportional hazards model for interval-censored failure time data, Biometrics 42 (1986), pp. 845–854. doi: 10.2307/2530698
- 8. Fong Y., Shen X., Ashley V.C., Deal A., Seaton K.E., Yu C., Grant S.P., Ferrari G., Bailer R.T., Koup R.A., Montefiori D., Haynes B.F., Sarzotti-Kelsoe M., Graham B.S., Carpp L.N., Hammer S.M., Sobieszczyk M., Karuna S., Swann E., DeJesus E., Mulligan M., Frank I., Buchbinder S., Novak R.M., McElrath M.J., Kalams S., Keefer M., Frahm N.A., Janes H.E., Gilbert P.B., and Tomaras G.D., Modification of the association between T-Cell immune responses and human immunodeficiency virus type 1 infection risk by vaccine-induced antibody responses in the HVTN 505 trial, J. Infect. Dis. 217 (2018), pp. 1280–1288. doi: 10.1093/infdis/jiy008
- 9. Gilbert P.B., Peterson M.L., Follmann D., Hudgens M.G., Francis D.P., Gurwith M., Heyward W.L., Jobes D.V., Popovic V., Self S.G., Sinangil F., Burke D., and Berman P.W., Correlation between immunologic responses to a recombinant glycoprotein 120 vaccine and incidence of HIV-1 infection in a phase 3 HIV-1 preventive vaccine trial, J. Infect. Dis. 191 (2005), pp. 666–677. doi: 10.1086/428405
- 10. Hammer S.M., Sobieszczyk M.E., Janes H., Mulligan M.J., Karuna S.T., Grove D., Koblin B.A., Buchbinder S.P., Keefer M.C., Tomaras G.D., Frahm N., Hural J., Anude C., Graham B.S., Enama M.E., Adams E., DeJesus E., Novak R.M., Frank I., Bentley C., Ramirez S., Fu R., Koup R.A., Mascola J.R., Nabel G.J., Montefiori D.C., Kublin J., McElrath M.J., Corey L., and Gilbert P.B., HVTN 505 Study Team, Efficacy trial of a DNA/rAd5 HIV-1 preventive vaccine, N. Engl. J. Med. 369 (2013), pp. 2083–2092. doi: 10.1056/NEJMoa1310566
- 11. Huang J., Asymptotic properties of nonparametric estimation based on partly interval-censored data, Stat. Sin. 9 (1999), pp. 501–519.
- 12. Huang J. and Rossini A.J., Sieve estimation for the proportional-odds failure-time regression model with interval censoring, J. Am. Stat. Assoc. 92 (1997), pp. 960–967. doi: 10.1080/01621459.1997.10474050
- 13. Huang X. and Wolfe R., A frailty model for informative censoring, Biometrics 58 (2002), pp. 510–520. doi: 10.1111/j.0006-341X.2002.00510.x
- 14. Janes H.E., Cohen K.W., Frahm N., De Rosa S.C., Sanchez B., Hural J., Magaret C.A., Karuna S., Bentley C., Gottardo R., Finak G., Grove D., Shen M., Graham B.S., Koup R.A., Mulligan M.J., Koblin B., Buchbinder S.P., Keefer M.C., Adams E., Anude C., Corey L., Sobieszczyk M., Hammer S.M., Gilbert P.B., and McElrath M.J., Higher T-cell responses induced by DNA/rAd5 HIV-1 preventive vaccine are associated with lower HIV-1 infection risk in an efficacy trial, J. Infect. Dis. 215 (2017), pp. 1376–1385. doi: 10.1093/infdis/jix086
- 15. Jewell N.P. and van der Laan M., Current status data: review, recent development and open problems, Adv. Survival Anal. 35 (2004), pp. 625–643.
- 16. Kalbfleisch J.D. and Prentice R.L., The Statistical Analysis of Failure Time Data, 2nd ed., Wiley, New York, 2002.
- 17. Kang S. and Cai J., Marginal hazards model for case-cohort studies with multiple disease outcomes, Biometrika 96 (2009), pp. 887–901. doi: 10.1093/biomet/asp059
- 18. Keogh R.H. and White I.R., Using full-cohort data in nested case-control and case-cohort studies by multiple imputation, Stat. Med. 32 (2013), pp. 4021–4043. doi: 10.1002/sim.5818
- 19. Kim S., Cai J., and Lu W., More efficient estimators for case-cohort studies, Biometrika 100 (2013), pp. 695–708. doi: 10.1093/biomet/ast018
- 20. Li Z. and Nan B., Relative risk regression for current status data in case-cohort studies, Can. J. Stat. 39 (2011), pp. 557–577. doi: 10.1002/cjs.10111
- 21. Lorentz G.G., Bernstein Polynomials, Chelsea Publishing Co, New York, 1986.
- 22. Ma S. and Kosorok M.R., Robust semiparametric M-estimation and the weighted bootstrap, J. Multivar. Anal. 96 (2005), pp. 190–217. doi: 10.1016/j.jmva.2004.09.008
- 23. Ma L., Hu T., and Sun J., Cox regression analysis of dependent interval-censored failure time data, Comput. Stat. Data Anal. 103 (2016), pp. 79–90. doi: 10.1016/j.csda.2016.04.011
- 24. Ma L., Hu T., and Sun J., Sieve maximum likelihood regression analysis of dependent current status data, Biometrika 102 (2015), pp. 731–738. doi: 10.1093/biomet/asv020
- 25. Marti H. and Chavance M., Multiple imputation analysis of case-cohort studies, Stat. Med. 30 (2011), pp. 1595–1607. doi: 10.1002/sim.4130
- 26. McKeague I.W. and Utikal K.J., Goodness-of-fit tests for additive hazards and proportional hazards models, Scand. J. Stat. 18 (1991), pp. 177–195.
- 27. Prentice R.L., A case-cohort design for epidemiologic cohort studies and disease prevention trials, Biometrika 73 (1986), pp. 1–11. doi: 10.1093/biomet/73.1.1
- 28. Ren J.J. and He B., Estimation and goodness-of-fit for the Cox model with various types of censored data, J. Stat. Plan. Inference. 141 (2011), pp. 961–971. doi: 10.1016/j.jspi.2010.09.006
- 29. Saegusa T. and Wellner J.A., Weighted likelihood estimation under two-phase sampling, Ann. Stat. 41 (2013), pp. 269–295. doi: 10.1214/12-AOS1073
- 30. Scheike T.H. and Martinussen T., Maximum likelihood estimation for Cox's regression model under case-cohort sampling, Scand. J. Stat. 31 (2004), pp. 283–293. doi: 10.1111/j.1467-9469.2004.02-064.x
- 31. Shen X., On methods of sieves and penalization, Ann. Stat. 25 (1997), pp. 2555–2591. doi: 10.1214/aos/1030741085
- 32. Shen X. and Wong W.H., Convergence rate of sieve estimates, Ann. Stat. 22 (1994), pp. 580–615. doi: 10.1214/aos/1176325486
- 33. Sun J., A nonparametric test for current status data with unequal censoring, J. R. Stat. Soc. B 61 (1999), pp. 243–250. doi: 10.1111/1467-9868.00174
- 34. Sun J., The Statistical Analysis of Interval-censored Failure Time Data, Springer, New York, 2006.
- 35. van der Vaart A.W. and Wellner J.A., Weak Convergence and Empirical Processes: With Applications to Statistics, Springer, New York, 1996.
- 36. Wang P., Zhao H., and Sun J., Regression analysis of case K interval-censored failure time data in the presence of informative censoring, Biometrics 72 (2016), pp. 1103–1112. doi: 10.1111/biom.12527
- 37. Wang L.M., McMahan C.S., Hudgens M.G., and Qureshi Z.P., A flexible, computationally efficient method for fitting the proportional hazards model to interval-censored data, Biometrics 72 (2016), pp. 222–231. doi: 10.1111/biom.12389
- 38. Wang S., Wang C., Wang P., and Sun J., Semiparametric analysis of the additive hazards model with informatively interval-censored failure time data, Comput. Stat. Data Anal. 125 (2018), pp. 1–9. doi: 10.1016/j.csda.2018.03.011
- 39. Zhang Y., Hua L., and Huang J., A spline-based semiparametric maximum likelihood estimation method for the Cox model with interval-censored data, Scand. J. Stat. 37 (2010), pp. 338–354. doi: 10.1111/j.1467-9469.2009.00680.x
- 40. Zhou Q., Hu T., and Sun J., A sieve semiparametric maximum likelihood approach for regression analysis of bivariate interval-censored failure time data, J. Am. Stat. Assoc. 112 (2017a), pp. 664–672. doi: 10.1080/01621459.2016.1158113
- 41. Zhou Q., Zhou H., and Cai J., Case-cohort studies with interval-censored failure time data, Biometrika 104 (2017b), pp. 17–29. doi: 10.1093/biomet/asw067