SUMMARY.
Interval-censored data arise when the event time of interest can only be ascertained through periodic examinations. In medical studies, subjects may not complete the examination schedule for reasons related to the event of interest. In this article, we develop a semiparametric approach to adjust for such informative dropout in regression analysis of interval-censored data. Specifically, we propose a broad class of joint models, under which the event time of interest follows a transformation model with a random effect and the dropout time follows a different transformation model but with the same random effect. We consider nonparametric maximum likelihood estimation and develop an EM algorithm that involves simple and stable calculations. We prove that the resulting estimators of the regression parameters are consistent, asymptotically normal, and asymptotically efficient with a covariance matrix that can be consistently estimated through profile likelihood. In addition, we show how to consistently estimate the survival function when dropout represents voluntary withdrawal and the cumulative incidence function when dropout is an unavoidable terminal event. Furthermore, we assess the performance of the proposed numerical and inferential procedures through extensive simulation studies. Finally, we provide an application to data on the incidence of diabetes from a major epidemiological cohort study.
Keywords: Joint models, Nonparametric likelihood, Random effects, Semiparametric efficiency, Terminal event, Transformation models
1. Introduction
Interval-censored data arise when the timing of an event is not known precisely but rather is known to lie within a time interval. Such data are frequently encountered in medical research, where the ascertainment of the disease of interest is made over a series of examination times. An example is the Atherosclerosis Risk in Communities (ARIC) study (The ARIC Investigators, 1989), where subjects were examined for asymptomatic diseases, such as diabetes and hypertension, over five visits, with the first four each approximately three years apart and then a gap of about 15 years before the fifth visit, such that the disease was only known to occur within a broad time interval.
A number of methods have been developed for regression analysis of interval-censored data. In particular, nonparametric maximum likelihood estimation for the proportional odds, proportional hazards, and transformation models have been studied by Huang (1995), Huang (1996), and Zeng et al. (2016), respectively. Sieve estimation for the proportional odds and proportional hazards models has been suggested by Rossini and Tsiatis (1996), Huang and Rossini (1997), Shen (1998), and Cai and Betensky (2003). Rank-based estimation methods for linear transformation models have been proposed by Gu et al. (2005), Sun and Sun (2005), Zhang et al. (2005), and Zhang and Zhao (2013).
All aforementioned work assumes that the examination process is independent of the event of interest, possibly conditional on covariates. This assumption is often violated in chronic disease research because subjects may drop out of the study prematurely for health-related reasons. For example, in the ARIC study, a large number of subjects died before their last scheduled visit. When dropout is correlated with the event of interest, the existing methods may yield invalid inference. In the situation where dropout is caused by a terminal event, such as death, the existing methods, which fail to account for the fact that the event of interest cannot occur after the terminal event, will provide incorrect estimation of disease incidence even if dropout is independent of the event of interest.
In this article, we adjust for informative dropout through the use of a random effect. Specifically, we consider a broad class of joint models, under which the event time of interest follows a semiparametric transformation model with a random effect and the dropout time follows a different semiparametric transformation model but with the same random effect. The transformation models encompass the proportional hazards and proportional odds models. We study nonparametric maximum likelihood estimation for the joint models and develop a stable EM algorithm for its implementation. We establish the asymptotic properties of the resulting estimators, with different rates of convergence for the cumulative hazard functions of the event time of interest and the dropout time. In addition, we show how to predict the incidence for the event of interest when its occurrence is precluded by the development of a terminal event. Furthermore, we demonstrate the advantages of the proposed methods over the existing ones through realistic simulation studies. Finally, we provide a detailed illustration with data derived from the ARIC study.
2. Methods
2.1. Models and Likelihood
We consider a random sample of n subjects. For i = 1,...,n, let Ti denote the event time or failure time of interest, Di the dropout time, and Xi(·) a p-vector of possibly time-dependent external covariates for the ith subject. We characterize the dependence between Ti and Di through a random effect bi, which is assumed to be normal with mean zero and variance σ2. Let Xi denote the entire history of the covariates. Conditional on bi and Xi, the cumulative hazard functions for Ti and Di follow the transformation models
$$\Lambda_T(t \mid X_i, b_i) = G\left\{\int_0^t \exp\{\beta^{\mathrm{T}} X_i(s) + b_i\}\, d\Lambda(s)\right\} \qquad (1)$$
and
$$\Lambda_D(t \mid X_i, b_i) = H\left\{\int_0^t \exp\{\gamma^{\mathrm{T}} X_i(s) + b_i\}\, dA(s)\right\} \qquad (2)$$
respectively, where G(·) and H(·) are specific transformation functions, β and γ are unknown regression parameters, and Λ(·) and A(·) are arbitrary cumulative baseline hazard functions. For notational simplicity, we use the same Xi in models (1) and (2), although it is straightforward to use different sets of covariates.
REMARK 1.
Since there are only two possible events per subject, one shared random effect bi is sufficient to capture the dependence and additional parameters would not be identifiable. The models with one shared random effect have been commonly adopted for two (right-censored) events, clustered sampling, and recurrent events (Oakes, 1989; Parner, 1998; Andersen et al., 2012, Chapter 9). For joint modeling of complex multivariate outcomes, such as longitudinal and survival outcomes or recurrent and terminal events, more than one random effect can be used (Zeng and Lin, 2007, 2009).
REMARK 2.
To shed light on our modeling approach, we consider time-independent covariates, in which case models (1) and (2) can be expressed in the form of linear transformation models with a shared random effect:

$$\log \Lambda(T_i) = -\beta^{\mathrm{T}} X_i - b_i + \epsilon_{T,i}, \qquad \log A(D_i) = -\gamma^{\mathrm{T}} X_i - b_i + \epsilon_{D,i},$$

where ϵT,i and ϵD,i are random errors with known distributions. This is a semiparametric version of the familiar linear mixed model for two correlated outcomes. Clearly, bi characterizes the dependence between Ti and Di in that the correlation between the two transformed event times is $\sigma^2 / \{(\sigma^2 + \sigma_T^2)(\sigma^2 + \sigma_D^2)\}^{1/2}$, where $\sigma_T^2$ and $\sigma_D^2$ are the variances of ϵT,i and ϵD,i, respectively.
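As a quick numerical illustration (not part of the original article), the following minimal sketch checks the stated correlation in the case where both models are proportional hazards models, so that the errors follow the standard extreme-value distribution with variance π²/6; the Monte Carlo correlation of the transformed times matches σ²/(σ² + π²/6).

```python
import numpy as np

# Minimal Monte Carlo check of the induced correlation in Remark 2 for the
# proportional hazards case (r_G = r_H = 0): eps_T and eps_D have the
# extreme-value distribution with P(eps > u) = exp(-e^u) and variance pi^2/6.
rng = np.random.default_rng(1)
n, sigma2 = 200_000, 1.0

b = rng.normal(0.0, np.sqrt(sigma2), n)      # shared random effect
eps_T = -rng.gumbel(size=n)                  # P(eps > u) = exp(-e^u)
eps_D = -rng.gumbel(size=n)

z_T = -b + eps_T                             # log Lambda(T) up to the -beta'X shift
z_D = -b + eps_D                             # log A(D) up to the -gamma'X shift

empirical = np.corrcoef(z_T, z_D)[0, 1]
theoretical = sigma2 / (sigma2 + np.pi**2 / 6)
print(round(empirical, 2), round(theoretical, 2))   # both approximately 0.38
```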
The transformation functions G(·) and H(·) include completely monotonic functions
$$G(x) = -\log \int_0^\infty \exp(-xt)\, f_G(t)\, dt \qquad (3)$$
and
$$H(x) = -\log \int_0^\infty \exp(-xt)\, f_H(t)\, dt \qquad (4)$$
where fG(·) and fH(·) are density functions with support on [0, ∞). In particular, the class of logarithmic transformations $r^{-1}\log(1 + rx)$ (r ≥ 0) is generated by the gamma density function with mean 1 and variance r. The choice of r = 0 or 1 yields the proportional hazards or proportional odds model, respectively.
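The correspondence between the gamma density and the logarithmic transformation can be verified numerically; the brief sketch below (not from the original article) evaluates the integral in (3) directly.

```python
import numpy as np
from scipy import integrate, stats

# Numerical check that the gamma density with mean 1 and variance r generates
# the logarithmic transformation G(x) = r^{-1} log(1 + r x) through
# G(x) = -log \int_0^infty exp(-x t) f_G(t) dt.
r = 0.5
f_G = stats.gamma(a=1.0 / r, scale=r).pdf        # gamma density with mean 1, variance r

for x in (0.3, 1.0, 2.5):
    laplace, _ = integrate.quad(lambda t: np.exp(-x * t) * f_G(t), 0, np.inf)
    print(round(-np.log(laplace), 6), round(np.log(1 + r * x) / r, 6))   # the two agree
```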
REMARK 3.
We allow different transformation functions for the event time of interest and dropout time and let the data determine the best choices. It is natural to choose the proportional hazards models, which correspond to the identity transformation functions, especially when the sample size is not large enough to empirically determine the best transformation functions. The problem of interval-censored data with informative dropout has not been previously studied even under the proportional hazards models.
Suppose that the event of interest, such as diabetes, is asymptomatic, such that its occurrence can only be detected through periodic examinations. By contrast, dropout (e.g., death) can be observed exactly. There is a sequence of potential examination times for each subject. Obviously, no examination can occur after dropout. There exists noninformative censoring (e.g., end of the study), after which examination cannot occur either.
Specifically, let 0 < Ui1 < ... < Ui,Mi < ∞ denote the ith subject’s potential examination times, which have finite support with least upper bound τ. Let Ci denote the noninformative censoring time on Di, such that we observe Yi ≡ min(Di, Ci) and Δi ≡ I(Di ≤ Ci), where I(·) is the indicator function. No examination for the ith subject occurs after Yi. Examination typically does not occur at the time of dropout or the end of the study, such that Yi is not equal to Uij for any j. Thus, the failure time Ti is known to lie in the interval (Li, Ri), where Li = max{Uim : Uim < Ti, Uim ≤ Yi, m = 0,...,Mi}, Ri = min{Uim : Uim ≥ Ti, Uim ≤ Yi, m = 1,...,Mi}, and Ui0 = 0. We let Ri = ∞ if the latter set is empty. If Yi < Ui1, then no examination is performed and (Li, Ri) = (0, ∞). The observed data consist of Oi = {Li, Ri, Yi, Δi, Xi(·)} for i = 1,...,n.
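The mapping from a subject's examination history to the observed interval can be made concrete with the following sketch; the function name and inputs are hypothetical, and the rules simply transcribe the definitions of Li and Ri above.

```python
import numpy as np

def censoring_interval(exam_times, T, Y):
    """Return (L, R) for a subject as in Section 2.1, assuming exam_times is the
    increasing sequence of potential examination times U_1 < ... < U_M, T the
    true event time, and Y = min(D, C).  Only examinations at or before Y are
    performed; (L, R) = (0, inf) if no performed examination follows T."""
    performed = np.asarray([u for u in exam_times if u <= Y])
    before = performed[performed < T]        # candidates for L (plus U_0 = 0)
    after = performed[performed >= T]        # candidates for R
    L = before.max() if before.size else 0.0
    R = after.min() if after.size else np.inf
    return L, R

# Exams at 1, 2, 3; event at T = 2.4; follow-up ends at Y = 3.5.
print(censoring_interval([1.0, 2.0, 3.0], T=2.4, Y=3.5))   # (2.0, 3.0)
# Dropout before the event can be detected: Y = 2.5 gives a right-censored interval.
print(censoring_interval([1.0, 2.0, 3.0], T=2.4, Y=2.5))   # (2.0, inf)
```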
Assume that Mi, {Uim : m = 1,...,Mi}, and Ci are independent of (Ti, Di, bi) conditional on Xi. The observed-data likelihood under models (1) and (2) is

$$\prod_{i=1}^{n} \int_{-\infty}^{\infty} \left[ \exp\left\{-G\left(\int_0^{L_i} e^{\beta^{\mathrm{T}} X_i(t) + b}\, d\Lambda(t)\right)\right\} - \exp\left\{-G\left(\int_0^{R_i} e^{\beta^{\mathrm{T}} X_i(t) + b}\, d\Lambda(t)\right)\right\} \right] \times \left\{ H'\!\left(\int_0^{Y_i} e^{\gamma^{\mathrm{T}} X_i(t) + b}\, dA(t)\right) e^{\gamma^{\mathrm{T}} X_i(Y_i) + b} A'(Y_i) \right\}^{\Delta_i} \exp\left\{-H\left(\int_0^{Y_i} e^{\gamma^{\mathrm{T}} X_i(t) + b}\, dA(t)\right)\right\} \phi(b; \sigma^2)\, db,$$

where g′(·) denotes the derivative of the function g(·), φ(·; σ²) denotes the normal density function with mean zero and variance σ², and we define $\exp\{-G(\int_0^{R_i} e^{\beta^{\mathrm{T}} X_i(t) + b}\, d\Lambda(t))\} = 0$ when $R_i = \infty$.
2.2. Estimation Procedures
We adopt the nonparametric maximum likelihood approach, under which the estimators for the cumulative baseline hazard functions Λ and A are step functions with jumps at the unique endpoints of the intervals, 0 < t1 < ... < tm1 < ∞, and at the uncensored dropout times, 0 < s1 < ... < sm2 < ∞, respectively, where m1 and m2 are the total numbers of potential jump points. We denote the step sizes for Λ by λ1,...,λm1 and the step sizes for A by α1,...,αm2. Write θ = (β, γ, σ²) and 𝒜 = (Λ, A). We maximize the objective function

$$L_n(\theta, \mathcal{A}) = \prod_{i=1}^{n} \int_{-\infty}^{\infty} \left[ \exp\left\{-G\left(\sum_{t_l \le L_i} \lambda_l e^{\beta^{\mathrm{T}} X_{il} + b}\right)\right\} - \exp\left\{-G\left(\sum_{t_l \le R_i} \lambda_l e^{\beta^{\mathrm{T}} X_{il} + b}\right)\right\} \right] \times \left\{ H'\!\left(\sum_{s_l \le Y_i} \alpha_l e^{\gamma^{\mathrm{T}} X_{il}^{*} + b}\right) e^{\gamma^{\mathrm{T}} X_i(Y_i) + b} A\{Y_i\} \right\}^{\Delta_i} \exp\left\{-H\left(\sum_{s_l \le Y_i} \alpha_l e^{\gamma^{\mathrm{T}} X_{il}^{*} + b}\right)\right\} \phi(b; \sigma^2)\, db$$

over β, γ, σ², (λ1,...,λm1), and (α1,...,αm2), where Xil = Xi(tl) for l = 1,...,m1, Xil∗ = Xi(sl) for l = 1,...,m2, and A{Yi} is the jump size of A at Yi.
Direct maximization of Ln(θ, 𝒜) is difficult due to the lack of analytical expressions for the parameters λ1,...,λm1 and α1,...,αm2. We introduce some latent random variables to form a likelihood function equivalent to Ln(θ, 𝒜) such that the maximization can be carried out by a simple EM algorithm. First, we introduce two latent random variables ξi and ψi with density functions fG(·) and fH(·) given in equations (3) and (4), respectively. We then introduce independent Poisson random variables Wil with means μil ≡ ξi λl exp(β^T Xil + bi) for tl ≤ Ri∗, where Ri∗ = Ri if Ri < ∞ and Ri∗ = Li otherwise. Conditional on (ξi, bi), the likelihood function of {Wil : tl ≤ Ri∗} is

$$\prod_{t_l \le R_i^{*}} \frac{\mu_{il}^{W_{il}} e^{-\mu_{il}}}{W_{il}!}, \qquad \mu_{il} = \xi_i \lambda_l e^{\beta^{\mathrm{T}} X_{il} + b_i}.$$

Let $N_{1i} = \sum_{t_l \le L_i} W_{il}$ and $N_{2i} = \sum_{L_i < t_l \le R_i} W_{il}$. Suppose that we observe N1i = 0 and N2i > 0, the latter only when Ri < ∞. The observed-data likelihood for N1i and N2i given ξi and bi is equal to

$$\exp\left(-\xi_i \sum_{t_l \le L_i} \lambda_l e^{\beta^{\mathrm{T}} X_{il} + b_i}\right) \left\{1 - \exp\left(-\xi_i \sum_{L_i < t_l \le R_i} \lambda_l e^{\beta^{\mathrm{T}} X_{il} + b_i}\right)\right\}^{I(R_i < \infty)}.$$

In addition, the observed-data likelihood for (Yi, Δi) given ψi and bi is

$$\left\{\psi_i e^{\gamma^{\mathrm{T}} X_i(Y_i) + b_i} A\{Y_i\}\right\}^{\Delta_i} \exp\left(-\psi_i \sum_{s_l \le Y_i} \alpha_l e^{\gamma^{\mathrm{T}} X_{il}^{*} + b_i}\right).$$

Therefore, taking the expectation of the product of these two expressions with respect to (ξi, ψi, bi) recovers the ith factor of Ln(θ, 𝒜). In other words, Ln(θ, 𝒜) can be viewed as the observed-data likelihood for {Li, Ri, Yi, Δi, Xi} with (Wil, ξi, ψi, bi) as latent variables. Based on the foregoing results, we propose an EM algorithm treating (Wil, ξi, ψi, bi) as complete data.
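The augmentation can also be verified numerically. The sketch below, with illustrative values of r, b, and the cumulative intensities (not taken from the article), checks that averaging P(N1i = 0, N2i > 0 | ξi, bi) over a gamma frailty recovers the difference of the two transformed survival probabilities for the logarithmic transformation.

```python
import numpy as np

# Monte Carlo check of the Poisson data augmentation for the logarithmic
# transformation G(x) = r^{-1} log(1 + r x), for which exp{-G(x)} = (1+rx)^(-1/r).
rng = np.random.default_rng(7)
r, b = 1.0, 0.3
xL, xR = 0.8 * np.exp(b), 1.7 * np.exp(b)        # Lambda_i(L_i) and Lambda_i(R_i)

xi = rng.gamma(shape=1.0 / r, scale=r, size=500_000)   # frailty with mean 1, variance r
lhs = np.mean(np.exp(-xi * xL) * (1.0 - np.exp(-xi * (xR - xL))))
rhs = (1 + r * xL) ** (-1 / r) - (1 + r * xR) ** (-1 / r)
print(round(lhs, 3), round(rhs, 3))              # agree up to Monte Carlo error
```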
In the M-step, we maximize the conditional expectation of the complete-data log-likelihood given the observed data so as to update the parameters. Specifically, we update β by solving the equation

$$\sum_{i=1}^{n} \sum_{t_l \le R_i^{*}} \widehat{E}(W_{il}) \left[ X_{il} - \frac{\sum_{j=1}^{n} I(t_l \le R_j^{*})\, \widehat{E}(\xi_j e^{b_j})\, e^{\beta^{\mathrm{T}} X_{jl}}\, X_{jl}}{\sum_{j=1}^{n} I(t_l \le R_j^{*})\, \widehat{E}(\xi_j e^{b_j})\, e^{\beta^{\mathrm{T}} X_{jl}}} \right] = 0,$$

and we update Λ by

$$\Lambda(t) = \sum_{t_l \le t} \frac{\sum_{i=1}^{n} I(t_l \le R_i^{*})\, \widehat{E}(W_{il})}{\sum_{i=1}^{n} I(t_l \le R_i^{*})\, \widehat{E}(\xi_i e^{b_i})\, e^{\beta^{\mathrm{T}} X_{il}}},$$

where Ê(·) denotes the conditional expectation given the observed data Oi. In addition, we update γ by solving the equation

$$\sum_{i=1}^{n} \Delta_i \left[ X_i(Y_i) - \frac{\sum_{j=1}^{n} I(Y_j \ge Y_i)\, \widehat{E}(\psi_j e^{b_j})\, e^{\gamma^{\mathrm{T}} X_j(Y_i)}\, X_j(Y_i)}{\sum_{j=1}^{n} I(Y_j \ge Y_i)\, \widehat{E}(\psi_j e^{b_j})\, e^{\gamma^{\mathrm{T}} X_j(Y_i)}} \right] = 0,$$

and we update A by

$$A(t) = \sum_{s_l \le t} \frac{\sum_{i=1}^{n} \Delta_i I(Y_i = s_l)}{\sum_{i=1}^{n} I(Y_i \ge s_l)\, \widehat{E}(\psi_i e^{b_i})\, e^{\gamma^{\mathrm{T}} X_{il}^{*}}}.$$

Finally, we update σ² by setting it to $n^{-1} \sum_{i=1}^{n} \widehat{E}(b_i^2)$.
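The jump-size updates are simple weighted ratios and can be coded in a few lines, as in the sketch below; all array names are hypothetical placeholders for the E-step output, and the one-step Newton–Raphson updates of β and γ for the two score equations are omitted.

```python
import numpy as np

def update_jump_sizes(EW, E_xi_eb, E_psi_eb, risk_T, risk_D, expbX, expgX, n_dropouts):
    """Closed-form M-step updates of the jump sizes (lambda_l, alpha_l).
    Hypothetical inputs produced by the E-step:
      EW         (n, m1): posterior means E^(W_il), zero where t_l is irrelevant
      E_xi_eb    (n,)   : posterior means E^(xi_i * exp(b_i))
      E_psi_eb   (n,)   : posterior means E^(psi_i * exp(b_i))
      risk_T     (n, m1): indicator of t_l <= R_i^*
      risk_D     (n, m2): indicator of Y_i >= s_l
      expbX      (n, m1): exp(beta' X_i(t_l)) at the current beta
      expgX      (n, m2): exp(gamma' X_i(s_l)) at the current gamma
      n_dropouts (m2,)  : number of observed dropouts at each s_l
    """
    lam = (risk_T * EW).sum(axis=0) / (risk_T * E_xi_eb[:, None] * expbX).sum(axis=0)
    alp = n_dropouts / (risk_D * E_psi_eb[:, None] * expgX).sum(axis=0)
    return lam, alp
```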
In the E-step, we evaluate the conditional expectations of Wil and of the other terms involving ξi, ψi, and bi given the observed data Oi for i = 1,...,n. Specifically, the conditional expectation of Wil given Oi, ξi, and bi is

$$\frac{\xi_i \lambda_l e^{\beta^{\mathrm{T}} X_{il} + b_i}}{1 - \exp\left(-\xi_i \sum_{L_i < t_k \le R_i} \lambda_k e^{\beta^{\mathrm{T}} X_{ik} + b_i}\right)}$$

for Li < tl ≤ Ri with Ri < ∞, and zero otherwise. Note that the joint density of (ξi, ψi, bi) given Oi is proportional to the product of the two observed-data likelihoods displayed above and fG(ξi) fH(ψi) φ(bi; σ²). We evaluate the conditional expectations of Wil and the other terms through numerical integration over ξi, ψi, and bi with Gauss–Laguerre and Gauss–Hermite quadratures.
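The quadrature machinery can be illustrated on two expectations with known closed forms; the sketch below is only a sanity check of the Gauss–Laguerre and Gauss–Hermite rules (20 points each, as in Section 3) under a gamma frailty and a normal random effect, not the authors' implementation.

```python
import numpy as np
from scipy import stats

r, sigma2, x = 1.0, 0.8, 1.3    # illustrative values

# Gauss-Laguerre over [0, inf): E[exp(-x*xi)] for xi ~ gamma(1/r, r) equals (1+rx)^{-1/r}.
nodes, weights = np.polynomial.laguerre.laggauss(20)
f_G = stats.gamma(a=1.0 / r, scale=r).pdf
val = np.sum(weights * np.exp(nodes) * f_G(nodes) * np.exp(-x * nodes))
print(round(val, 4), round((1 + r * x) ** (-1.0 / r), 4))

# Gauss-Hermite over the real line: E[exp(b)] for b ~ N(0, sigma2) equals exp(sigma2/2).
u, w = np.polynomial.hermite.hermgauss(20)
val = np.sum(w * np.exp(np.sqrt(2 * sigma2) * u)) / np.sqrt(np.pi)
print(round(val, 4), round(np.exp(sigma2 / 2), 4))
```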
We iterate between the E-step and the M-step until the sum of the absolute differences of the parameter values at two successive iterations is less than a certain threshold, say, 10−3. We denote the final estimators of θ and 𝒜 by θ̂ and 𝒜̂ ≡ (Λ̂, Â). The survival function for the failure time of interest given covariate values x, S(t; x) ≡ P(T > t | X = x), can be estimated by

$$\widehat{S}(t; x) = \int_{-\infty}^{\infty} \exp\left[-G\left\{\sum_{t_l \le t} \widehat{\lambda}_l\, e^{\widehat{\beta}^{\mathrm{T}} x(t_l) + b}\right\}\right] \phi(b; \widehat{\sigma}^2)\, db.$$
REMARK 4.
The proposed EM algorithm has several desirable features. First, large-scale optimization is avoided as jump sizes are updated explicitly in the M-step. Second, the regression parameters are updated by solving estimating equations similar to the Poisson regression score equations and the partial likelihood score equations via one-step Newton–Raphson. Finally, the E-step involves only two-dimensional numerical integration.
If dropout is a terminal event, which cannot be avoided, then we have a semi-competing risks set-up (Fine et al., 2001), in that the occurrence of the terminal event precludes the development of the event of interest but not vice versa. It is more meaningful to consider the cumulative incidence function for the failure time of interest,

$$F(t; x) \equiv P(T \le t, T \le D \mid X = x) = \int_{-\infty}^{\infty} \int_0^t \exp\left[-H\left\{\int_0^u e^{\gamma^{\mathrm{T}} x(s) + b}\, dA(s)\right\}\right] d\left(1 - \exp\left[-G\left\{\int_0^u e^{\beta^{\mathrm{T}} x(s) + b}\, d\Lambda(s)\right\}\right]\right) \phi(b; \sigma^2)\, db.$$

We estimate this quantity by

$$\widehat{F}(t; x) = \int_{-\infty}^{\infty} \sum_{t_k \le t} \exp\left[-H\left\{\sum_{s_l < t_k} \widehat{\alpha}_l\, e^{\widehat{\gamma}^{\mathrm{T}} x(s_l) + b}\right\}\right] \left( \exp\left[-G\left\{\sum_{t_l < t_k} \widehat{\lambda}_l\, e^{\widehat{\beta}^{\mathrm{T}} x(t_l) + b}\right\}\right] - \exp\left[-G\left\{\sum_{t_l \le t_k} \widehat{\lambda}_l\, e^{\widehat{\beta}^{\mathrm{T}} x(t_l) + b}\right\}\right] \right) \phi(b; \widehat{\sigma}^2)\, db,$$

where the integral is evaluated by numerical integration with Gauss–Hermite quadratures, and λ̂l and α̂l are the estimators of λl and αl, respectively.
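A plug-in computation of F̂(t; x) might look as follows for time-independent covariates and logarithmic transformations; the fitted quantities passed in are hypothetical outputs of the EM algorithm, and the code is a sketch of the displayed estimator rather than the authors' implementation.

```python
import numpy as np

def log_transform(v, r):
    """Logarithmic transformation r^{-1} log(1 + r v), with r = 0 giving the identity."""
    return v if r == 0 else np.log1p(r * v) / r

def cum_incidence(t, x, beta, gamma, sigma2, t_jumps, lam, s_jumps, alp,
                  rG=1.0, rH=0.0, n_quad=20):
    """Sketch of a plug-in estimate of P(T <= t, T <= D | X = x) under the fitted
    joint model, assuming time-independent covariates; x, beta, gamma are arrays,
    (t_jumps, lam) and (s_jumps, alp) are the fitted jump locations and sizes."""
    u, w = np.polynomial.hermite.hermgauss(n_quad)
    b_nodes = np.sqrt(2 * sigma2) * u                 # Gauss-Hermite nodes for N(0, sigma2)
    keep = t_jumps <= t
    out = 0.0
    for bj, wj in zip(b_nodes, w / np.sqrt(np.pi)):
        eT = np.exp(beta @ x + bj)
        eD = np.exp(gamma @ x + bj)
        LamT = np.cumsum(lam) * eT                    # event-model cumulative hazard at each t_k
        S_T = np.exp(-log_transform(LamT, rG))        # P(T > t_k | b)
        S_T_prev = np.concatenate(([1.0], S_T[:-1]))  # P(T > t_{k-1} | b)
        # dropout survival evaluated just before each event jump time t_k
        A_D = np.array([np.sum(alp[s_jumps < tk]) for tk in t_jumps]) * eD
        S_D = np.exp(-log_transform(A_D, rH))
        out += wj * np.sum((S_D * (S_T_prev - S_T))[keep])
    return out
```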
We have implicitly assumed that the transformation functions are known. In practice, we consider a variety of transformation models and choose the one that best fits the data according to, say, the Akaike information criterion.
2.3. Inference Procedures
In Web Appendix A, we show that the estimators (θ̂, Λ̂, Â) are consistent, with different rates of convergence (n^{1/3} and n^{1/2}, respectively) for Λ̂ and Â. In addition, the estimator θ̂ is asymptotically normal, with a limiting covariance matrix that attains the semiparametric efficiency bound.
To estimate the covariance matrix of θ̂, we define the profile likelihood function

$$pl_n(\theta) = \max_{\Lambda \in \mathcal{C}_1,\, A \in \mathcal{C}_2} \log L_n(\theta, \Lambda, A),$$

where C1 is the set of step functions with non-negative jumps at tl (l = 1,...,m1), and C2 is the set of step functions with non-negative jumps at sl (l = 1,...,m2). Let pli(θ) denote the ith subject’s contribution to pln(θ), that is, pli(θ) = li(θ, Λ̂θ, Âθ), where li is the log-likelihood function for the ith subject and (Λ̂θ, Âθ) maximizes $\sum_{i=1}^{n} l_i(\theta, \Lambda, A)$ over C1 × C2. We then estimate the covariance matrix of θ̂ by the inverse of the matrix whose (j, k) element is

$$\sum_{i=1}^{n} \frac{\left\{pl_i(\widehat{\theta} + h_n e_j) - pl_i(\widehat{\theta})\right\}\left\{pl_i(\widehat{\theta} + h_n e_k) - pl_i(\widehat{\theta})\right\}}{h_n^2},$$

where ej is the jth canonical vector of the same dimension as θ, and hn is a constant of order n^{−1/2}. To evaluate the profile likelihood, we use the EM algorithm of Section 2.2 but fix the value of θ and only update Λ and A in the M-step.
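The covariance estimator can be assembled from the per-subject profile log-likelihood contributions as sketched below; pl_i is a hypothetical callable that holds θ fixed, re-runs the EM algorithm over (Λ, A) only, and returns the n-vector of contributions.

```python
import numpy as np

def profile_covariance(theta_hat, pl_i, h=None):
    """Covariance estimator for theta_hat from first-order differences of the
    per-subject profile log-likelihoods; pl_i(theta) is a hypothetical callable
    returning the n-vector of contributions pl_i(theta)."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    d = theta_hat.size
    base = np.asarray(pl_i(theta_hat))
    n = base.size
    h = n ** -0.5 if h is None else h
    # each column approximates the per-subject efficient score for one component of theta
    scores = np.column_stack([
        (np.asarray(pl_i(theta_hat + h * np.eye(d)[j])) - base) / h for j in range(d)
    ])
    info = scores.T @ scores           # sum over subjects of outer products
    return np.linalg.inv(info)         # estimated covariance matrix of theta_hat
```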
REMARK 5.
In the conventional profile-likelihood approach (Murphy and van der Vaart, 2000), the covariance matrix of θ̂ is estimated by the negative inverse of the Hessian matrix of pln(θ) at θ̂, which is determined by second-order numerical differences. The Hessian matrix may fail to be negative definite, especially in small samples. By contrast, we use the empirical covariance matrix of the gradient of pli(θ) based on first-order numerical differences, which approximates the efficient score function for θ. The calculation is quicker, and the resulting covariance matrix estimator is guaranteed to be positive semidefinite.
3. Simulation Studies
We conducted two sets of simulation studies to assess the performance of the proposed methods. In the first set, we considered two covariates, one time-independent and one time-dependent, generated from independent Unif(0, 1) and Bernoulli(0.5) variables together with a change point V ∼ Unif(0, τ), where τ = 4. We considered logarithmic transformation functions with rG = rH = 0 or 1. We generated the potential examination times Um = Um−1 + 0.1 + Unif(0, τ/5) with U0 = 0 and the censoring time C from Unif(2τ/3, τ). The number of actual examinations is approximately 2.4 per subject. The event time of interest is left-censored for 19% of the subjects, interval-censored for 28% of the subjects, and right-censored for 53% of the subjects. We set n = 100, 200, or 400 and used 10,000 replicates. For each dataset, we applied the proposed EM algorithm by setting the initial values of β and γ to 0, the initial value of σ2 to 1, and the initial values of λl and αl to 1/m1 and 1/m2, respectively. We used 20 quadrature points for integration with respect to each random effect and set the convergence threshold to 10−3. The variance estimators were obtained with hn = n−1/2.
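For readers who wish to experiment, the following sketch generates data from models (1) and (2) with Λ(t) = A(t) = t, the examination and censoring schemes described above, and illustrative covariates and parameter values (not those of the reported simulation study).

```python
import numpy as np

rng = np.random.default_rng(2024)

def ginv(y, r):
    """Inverse of the logarithmic transformation r^{-1} log(1 + r x)."""
    return y if r == 0 else np.expm1(r * y) / r

def simulate_one(beta, gamma, sigma2, rG, rH, tau=4.0):
    """One subject under models (1) and (2) with Lambda(t) = A(t) = t and
    time-independent covariates; the covariate design and parameter values
    here are illustrative only."""
    x = np.array([rng.uniform(), rng.binomial(1, 0.5)])
    b = rng.normal(0.0, np.sqrt(sigma2))
    T = ginv(-np.log(rng.uniform()), rG) * np.exp(-(beta @ x + b))   # event time
    D = ginv(-np.log(rng.uniform()), rH) * np.exp(-(gamma @ x + b))  # dropout time
    C = rng.uniform(2 * tau / 3, tau)                                # end-of-study censoring
    Y, delta = min(D, C), float(D <= C)
    exams, u = [], 0.0
    while True:          # potential examinations; only those before Y (and tau) are performed
        u += 0.1 + rng.uniform(0, tau / 5)
        if u > Y or u > tau:
            break
        exams.append(u)
    before = [0.0] + [v for v in exams if v < T]
    after = [v for v in exams if v >= T]
    L, R = max(before), (min(after) if after else np.inf)
    return dict(L=L, R=R, Y=Y, delta=delta, x=x)

data = [simulate_one(beta=np.array([-0.5, 0.5]), gamma=np.array([0.5, -0.5]),
                     sigma2=1.0, rG=0.0, rH=0.0) for _ in range(2000)]
print(f"proportion right-censored: {np.mean([np.isinf(d['R']) for d in data]):.2f}")
```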
Table 1 summarizes the results on the estimation of β, γ, and σ2 for different values of n, rG, and rH. The biases of all parameter estimators are small and decrease as n increases. The variance estimators for the components of β̂ and γ̂ are accurate, especially for large n. The variance estimator for σ̂2 is accurate under rG = rH = 0 but slightly overestimates the true variability under rG = rH = 1. The confidence intervals for β, γ, and σ2 have reasonable coverage probabilities.
Table 1.
Summary statistics for the simulation studies on the proposed estimators
| | | rG = rH = 0 | | | | rG = rH = 1 | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | Bias | SE | SEE | CP | Bias | SE | SEE | CP |
| n = 100 | β1 | 0.027 | 0.728 | 0.679 | 0.95 | 0.030 | 0.921 | 0.872 | 0.94 |
| | β2 | 0.028 | 0.422 | 0.384 | 0.94 | 0.037 | 0.533 | 0.482 | 0.94 |
| | γ1 | 0.002 | 0.544 | 0.530 | 0.96 | −0.012 | 0.723 | 0.728 | 0.97 |
| | γ2 | 0.009 | 0.288 | 0.290 | 0.96 | 0.000 | 0.381 | 0.384 | 0.96 |
| | σ2 | −0.019 | 0.555 | 0.523 | 0.97 | −0.112 | 0.781 | 0.909 | 0.96 |
| n = 200 | β1 | 0.006 | 0.482 | 0.460 | 0.94 | 0.009 | 0.623 | 0.596 | 0.94 |
| | β2 | 0.016 | 0.277 | 0.262 | 0.94 | 0.021 | 0.356 | 0.329 | 0.93 |
| | γ1 | 0.002 | 0.368 | 0.365 | 0.96 | 0.010 | 0.498 | 0.501 | 0.96 |
| | γ2 | 0.000 | 0.198 | 0.200 | 0.96 | 0.004 | 0.260 | 0.262 | 0.96 |
| | σ2 | −0.002 | 0.367 | 0.357 | 0.96 | −0.044 | 0.556 | 0.635 | 0.95 |
| n = 400 | β1 | 0.000 | 0.329 | 0.318 | 0.94 | 0.011 | 0.429 | 0.412 | 0.94 |
| | β2 | 0.007 | 0.187 | 0.182 | 0.95 | 0.012 | 0.244 | 0.228 | 0.94 |
| | γ1 | 0.005 | 0.255 | 0.254 | 0.95 | 0.002 | 0.354 | 0.349 | 0.95 |
| | γ2 | 0.003 | 0.140 | 0.139 | 0.95 | 0.003 | 0.182 | 0.182 | 0.95 |
| | σ2 | 0.002 | 0.250 | 0.245 | 0.96 | −0.015 | 0.401 | 0.443 | 0.95 |
Note: Bias, SE, SEE, and CP stand, respectively, for the median bias, empirical standard error, median standard error estimator, and empirical coverage percentage of the 95% confidence interval. For σ2, the confidence interval is based on the log transformation.
We also evaluated the method of Zeng et al. (2016), which does not account for informative dropout. The results for this naive method are shown in Table 2. The estimator for β is biased, and the coverage probability of the corresponding confidence interval is poor.
Table 2.
Summary statistics for the simulation studies on the naive method
| | | rG = rH = 0 | | | | rG = rH = 1 | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | Bias | SE | SEE | CP | Bias | SE | SEE | CP |
| n = 100 | β1 | −0.125 | 0.600 | 0.560 | 0.93 | −0.054 | 0.829 | 0.773 | 0.93 |
| | β2 | −0.058 | 0.355 | 0.324 | 0.93 | −0.015 | 0.483 | 0.433 | 0.92 |
| n = 200 | β1 | −0.143 | 0.400 | 0.383 | 0.93 | −0.074 | 0.559 | 0.536 | 0.93 |
| | β2 | −0.067 | 0.236 | 0.224 | 0.93 | −0.025 | 0.324 | 0.299 | 0.93 |
| n = 400 | β1 | −0.148 | 0.274 | 0.267 | 0.91 | −0.069 | 0.387 | 0.375 | 0.93 |
| | β2 | −0.075 | 0.160 | 0.156 | 0.92 | −0.033 | 0.222 | 0.209 | 0.93 |
Note: Bias, SE, SEE, and CP stand, respectively, for the median bias, empirical standard error, median standard error estimator, and empirical coverage percentage of the 95% confidence interval.
Figure 1a shows the estimation of the baseline survival function for the event time of interest when dropout is regarded as voluntary patient withdrawal. The proposed estimator is virtually unbiased, whereas the naive method (Zeng et al., 2016) overestimates the survival function. Figure 1b shows the estimation of the baseline cumulative incidence function for the event time of interest when dropout is treated as a terminal event. The proposed estimator is again virtually unbiased; the naive estimator has severe positive bias since it does not acknowledge the fact that the event of interest cannot occur after the terminal event.
Figure 1.

Estimation of (a) the baseline survival function and (b) the baseline cumulative incidence function. The solid, dashed, and dotted curves pertain, respectively, to the true value, mean estimate from the proposed method, and mean estimate from the naive method. This figure appears in color in the electronic version of this article.
In the second set of simulation studies, we set β = (−0.5,0)T, γ = (0,0.4)T, and σ2 = 1 and considered rG = rH = 0, 0.5, or 1. The results are presented in Web Appendix B. The basic conclusions are the same as those of the first set.
4. ARIC Study
ARIC is a prospective epidemiological study conducted in four U.S. communities: Forsyth County, NC; Jackson, MS; suburbs of Minneapolis, MN; and Washington County, MD (The ARIC investigators, 1989). One important objective is to investigate risk factors for diabetes. A total of 14,751 Caucasian and African–American participants underwent a baseline examination between 1987 and 1989 and were scheduled for four subsequent examinations to take place in 1990–1992, 1993–1995, 1996–1998, and 2011–2013. Diabetes status (defined as fasting glucose ≥126 mg/dL, non-fasting glucose ≥ 200 mg/dL, self-reported physician diagnosis of diabetes, or use of diabetic medication) was determined at each examination.
We related the incidence of diabetes and death to race, gender, community, and five baseline risk factors: age, body mass index, glucose level, systolic blood pressure, and diastolic blood pressure. We excluded 1933 subjects with prevalent diabetes or unknown diabetes status at baseline and 13 subjects with missing baseline covariate values to obtain a total of 12,805 subjects. Among those subjects, 11,686 (91.3%), 10,557 (82.4%), 9533 (74.4%), and 5035 (39.3%) completed the second, third, fourth, and fifth visits, respectively. As shown in Web Figure 3, there are sufficient overlaps of the visit times for us to study diabetes onset from year 2 to 12 and from year 22 to 27. A total of 2492 (19.5%) subjects developed diabetes during the study, and 4363 (34.1%) subjects died before the end of the study.
We fit models (1) and (2) with logarithmic transformation functions indexed by parameters rG and rH for diabetes and death, respectively. The likelihood is maximized at rG = 2.3 and rH = 0, which is the combination that would be selected by the Akaike information criterion. For easy interpretation, we set rG = 1 and rH = 0.
Table 3 shows the estimation results for the proportional hazards models for both events (rG = rH = 0), the proportional odds models for both events (rG = rH = 1), and the combination of the proportional odds model for diabetes and the proportional hazards model for death (rG = 1,rH = 0). The log-likelihood values are approximately −48724.5, −48707.3, and −48681.1 for the three combinations of transformation parameters. The variance component σ2 was estimated to be 0.561, 0.681, and 0.530 with standard errors 0.063, 0.096, and 0.070, respectively, for the three combinations of transformation parameters, indicating strong dependence between diabetes and death. Under all considered models, an African–American individual has a higher risk of diabetes than a Caucasian individual. In addition, higher baseline body mass index, glucose level, and systolic blood pressure are associated with increased risk for diabetes. These findings are consistent with the current literature on diabetes (Harris et al., 1987; DeFronzo et al., 2004).
Table 3.
Regression analysis for diabetes in the ARIC study with adjustments for death
| | | | Diabetes | | | Death | | |
|---|---|---|---|---|---|---|---|---|
| rG | rH | Covariate | Est | Std. Err | p-value | Est | Std. Err | p-value |
| 0 | 0 | Jackson | −0.185 | 0.097 | 0.057 | 0.039 | 0.096 | 0.684 |
| | | Minneapolis Suburbs | −0.424 | 0.069 | < 10−4 | −0.032 | 0.052 | 0.534 |
| | | Washington County | 0.101 | 0.065 | 0.123 | 0.088 | 0.050 | 0.080 |
| | | Age | −0.004 | 0.004 | 0.358 | 0.104 | 0.004 | < 10−4 |
| | | Male | −0.010 | 0.046 | 0.831 | 0.565 | 0.036 | < 10−4 |
| | | White | −0.485 | 0.110 | < 10−4 | −0.503 | 0.100 | < 10−4 |
| | | Body mass index | 0.075 | 0.004 | < 10−4 | −0.017 | 0.018 | 0.363 |
| | | Glucose | 0.102 | 0.003 | < 10−4 | 0.055 | 0.018 | 0.002 |
| | | Systolic blood pressure | 0.006 | 0.002 | 0.001 | 0.313 | 0.024 | < 10−4 |
| | | Diastolic blood pressure | −0.001 | 0.003 | 0.793 | −0.162 | 0.025 | < 10−4 |
| 1 | 1 | Jackson | −0.242 | 0.088 | 0.006 | 0.096 | 0.087 | 0.269 |
| | | Minneapolis Suburbs | −0.510 | 0.082 | < 10−4 | −0.034 | 0.063 | 0.587 |
| | | Washington County | 0.118 | 0.078 | 0.132 | 0.121 | 0.061 | 0.050 |
| | | Age | −0.008 | 0.005 | 0.156 | 0.124 | 0.004 | < 10−4 |
| | | Male | −0.033 | 0.055 | 0.547 | 0.680 | 0.045 | < 10−4 |
| | | White | −0.648 | 0.110 | < 10−4 | −0.604 | 0.101 | < 10−4 |
| | | Body mass index | 0.096 | 0.006 | < 10−4 | −0.005 | 0.004 | 0.265 |
| | | Glucose | 0.124 | 0.003 | < 10−4 | 0.007 | 0.002 | 0.003 |
| | | Systolic blood pressure | 0.007 | 0.002 | 0.001 | 0.020 | 0.002 | < 10−4 |
| | | Diastolic blood pressure | −0.001 | 0.004 | 0.891 | −0.017 | 0.003 | < 10−4 |
| 1 | 0 | Jackson | −0.232 | 0.089 | 0.009 | 0.038 | 0.098 | 0.696 |
| | | Minneapolis Suburbs | −0.502 | 0.080 | < 10−4 | −0.033 | 0.052 | 0.527 |
| | | Washington County | 0.114 | 0.077 | 0.136 | 0.086 | 0.050 | 0.085 |
| | | Age | −0.007 | 0.005 | 0.191 | 0.103 | 0.004 | < 10−4 |
| | | Male | −0.030 | 0.054 | 0.578 | 0.562 | 0.036 | < 10−4 |
| | | White | −0.629 | 0.112 | < 10−4 | −0.500 | 0.102 | < 10−4 |
| | | Body mass index | 0.094 | 0.005 | < 10−4 | −0.003 | 0.004 | 0.411 |
| | | Glucose | 0.122 | 0.003 | < 10−4 | 0.006 | 0.002 | 0.001 |
| | | Systolic blood pressure | 0.007 | 0.002 | 0.001 | 0.017 | 0.001 | < 10−4 |
| | | Diastolic blood pressure | −0.0005 | 0.004 | 0.893 | −0.014 | 0.002 | < 10−4 |
Note: Forsyth County, NC, is the reference group for the field center variables.
The results from the naive method, which are shown in Table 4, are considerably different from ours. In particular, the naive method identifies a negative association between age and risk of diabetes, which contradicts the established positive association in the literature (Harris et al., 1987). The proposed method adjusting for death finds no significant negative association. The relationship between age and risk of diabetes identified by the naive method is likely a spurious finding that reflects the strong correlation between age and death.
Table 4.
Regression analysis for diabetes in the ARIC study without adjustments for death
| rG | Covariate | Est | Std. Err | p-value |
|---|---|---|---|---|
| 0 | Jackson | −0.141 | 0.102 | 0.169 |
| | Minneapolis Suburbs | −0.373 | 0.062 | < 10−4 |
| | Washington County | 0.095 | 0.058 | 0.099 |
| | Age | −0.010 | 0.004 | 0.009 |
| | Male | −0.042 | 0.041 | 0.303 |
| | White | −0.336 | 0.106 | 0.001 |
| | Body mass index | 0.067 | 0.003 | < 10−4 |
| | Glucose | 0.090 | 0.002 | < 10−4 |
| | Systolic blood pressure | 0.005 | 0.002 | 0.004 |
| | Diastolic blood pressure | 0.000 | 0.003 | 0.992 |
| 1 | Jackson | −0.210 | 0.134 | 0.116 |
| | Minneapolis Suburbs | −0.464 | 0.076 | < 10−4 |
| | Washington County | 0.112 | 0.072 | 0.121 |
| | Age | −0.013 | 0.005 | 0.009 |
| | Male | −0.062 | 0.052 | 0.230 |
| | White | −0.062 | 0.052 | 0.230 |
| | Body mass index | 0.089 | 0.005 | < 10−4 |
| | Glucose | 0.114 | 0.003 | < 10−4 |
| | Systolic blood pressure | 0.006 | 0.002 | 0.007 |
| | Diastolic blood pressure | 0.001 | 0.003 | 0.813 |
Note: Forsyth County, NC, is the reference group for the field center variables.
Figure 2 compares the estimated cumulative incidence functions for an African–American male versus a Caucasian male with the same values of other risk factors. The risk of diabetes is considerably higher for the African–American individual than the Caucasian individual under all considered models, with appreciably different estimates between the proportional hazards and proportional odds models. The estimated probabilities from the proposed method are lower in the tail than their naive counterparts, especially under the proportional odds model, highlighting the importance of adjusting for death.
Figure 2.

Estimation of cumulative incidence functions for (a) an African–American male versus (b) a Caucasian male residing in Forsyth County, NC, aged 54 years, body mass index 27kg/m2, glucose value 98mg/dl, systolic blood pressure 118mmHg, and diastolic blood pressure 73mmHg. The solid and dashed curves pertain to the proportional hazards and proportional odds models, respectively, from the proposed method, where the dropout time is modeled by the proportional hazards model. The dotted and dash-dotted curves pertain to the proportional hazards and proportional odds models, respectively, from the naive method. This figure appears in color in the electronic version of this article.
5. Discussion
In this article, we study efficient nonparametric maximum likelihood estimation of joint models for interval-censored data with informative dropout. We establish the asymptotic properties of the estimators through innovative use of modern empirical process theory. In the proofs, separate treatments are given to the estimators of the cumulative baseline hazard functions for the event of interest and dropout. We avoid the assumption of Zeng et al. (2016) that a subset of study subjects are examined at the end of the study by carefully evaluating the bracket covering number for a class of functions that involves unbounded Λ.
We applied our methods to data derived from the ARIC study, where diabetes is the event of interest and death is the terminal event. In the ARIC study, there are other outcomes of interest that are either interval-censored (e.g., hypertension, peripheral artery disease) or right-censored (e.g., myocardial infarction, stroke). The proposed framework can be extended to incorporate multiple interval-censored events and multiple right-censored events and thereby analyze an enriched version of data from the ARIC study.
The class of transformation models is very broad and thus allows accurate prediction in a variety of situations. In practice, one would need to determine which model best fits the data. One strategy is to use the Akaike information criterion to select the best transformations, as we did for the ARIC study. It would be worthwhile to develop additional methods for model selection and model checking.
We have assumed that the transformation functions are correctly specified. If the transformation functions or other aspects of the models are incorrectly specified, then the proposed parameter estimators will converge to constants that minimize the Kullback–Leibler distance. Robust variance estimators can be constructed on the basis of the efficient score functions. A thorough investigation of model misspecification is beyond the scope of this article.
In some applications, the examination times are directly related to the event of interest, instead of through dropout. This may be the case if patients tend to visit their doctors more frequently when they are not feeling well. Zhang et al. (2005), Chen et al. (2012, 2014), and Ma et al. (2015) studied this problem for current status data (Huang, 1996) by assuming a frailty model or copula structure for the event time of interest and the examination time. Zhang et al. (2007) considered the case of two examination times and modeled the first examination time, the gap time, and the event time of interest through a proportional hazards frailty model. Zhao et al. (2015) considered the same type of data and assumed a copula model for the event time of interest and the gap time. Wang et al. (2016) considered an arbitrary number of examination times and assumed a shared frailty model. All of these methods require parametric assumptions or approximations for the cumulative baseline hazard functions.
We can extend our approach to the aforementioned settings. For current status data, our algorithm can be directly applied by replacing the dropout time with the examination time in the likelihood. For the case with two examination times, we can jointly model the failure time, the first examination time, and the gap time using transformation models with random effects. For an arbitrary number of examination times, we can model the intensity of the examination process using a transformation model with frailty. The proposed EM algorithm can be modified accordingly.
Supplementary Material
ACKNOWLEDGEMENTS
This work was supported by the National Institutes of Health awards R01GM047845, R01AI029168, R01CA082659, and P01CA142538. The Atherosclerosis Risk in Communities Study is carried out as a collaborative study supported by National Heart, Lung, and Blood Institute contracts (HHSN268201100005C, HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C, HHSN268201100010C, HHSN268201100011C, and HHSN268201100012C). The authors thank the staff and participants of the ARIC study for their important contributions.
References
- Andersen PK, Borgan O, Gill RD, and Keiding N (2012). Statistical Models Based on Counting Processes. New York: Springer.
- Cai T and Betensky RA (2003). Hazard regression for interval-censored data with penalized spline. Biometrics 59, 570–579.
- Chen CM, Lu TFC, Chen MH, and Hsu CM (2012). Semiparametric transformation models for current status data with informative censoring. Biometrical Journal 54, 641–656.
- Chen CM, Wei JCC, Hsu CM, and Lee MY (2014). Regression analysis of multivariate current status data with dependent censoring: application to ankylosing spondylitis data. Statistics in Medicine 33, 772–785.
- DeFronzo RA, Ferrannini E, Zimmet P, and Alberti G (2004). International Textbook of Diabetes Mellitus. Chichester, U.K.: John Wiley and Sons.
- Fine JP, Jiang H, and Chappell R (2001). On semi-competing risks data. Biometrika 88, 907–919.
- Gu MG, Sun L, and Zuo G (2005). A baseline-free procedure for transformation models under interval censorship. Lifetime Data Analysis 11, 473–488.
- Harris MI, Hadden WC, Knowler WC, and Bennett PH (1987). Prevalence of diabetes and impaired glucose tolerance and plasma glucose levels in US population aged 20–74 yr. Diabetes 36, 523–534.
- Huang J (1995). Maximum likelihood estimation for proportional odds regression model with current status data. Analysis of Censored Data (IMS Lecture Notes Monograph Series) 27, 129–145.
- Huang J (1996). Efficient estimation for the proportional hazards model with interval censoring. The Annals of Statistics 24, 540–568.
- Huang J and Rossini A (1997). Sieve estimation for the proportional-odds failure-time regression model with interval censoring. Journal of the American Statistical Association 92, 960–967.
- Ma L, Hu T, and Sun J (2015). Sieve maximum likelihood regression analysis of dependent current status data. Biometrika 102, 731–738.
- Murphy SA and van der Vaart AW (2000). On profile likelihood. Journal of the American Statistical Association 95, 449–465.
- Oakes D (1989). Bivariate survival models induced by frailties. Journal of the American Statistical Association 84, 487–493.
- Parner E (1998). Asymptotic theory for the correlated gamma-frailty model. The Annals of Statistics 26, 183–214.
- Rossini A and Tsiatis A (1996). A semiparametric proportional odds regression model for the analysis of current status data. Journal of the American Statistical Association 91, 713–721.
- Shen X (1998). Proportional odds regression and sieve maximum likelihood estimation. Biometrika 85, 165–177.
- Sun J and Sun L (2005). Semiparametric linear transformation models for current status data. Canadian Journal of Statistics 33, 85–96.
- The ARIC Investigators (1989). The Atherosclerosis Risk in Communities (ARIC) study: design and objectives. American Journal of Epidemiology 129, 687–702.
- Wang P, Zhao H, and Sun J (2016). Regression analysis of case K interval-censored failure time data in the presence of informative censoring. Biometrics 72, 1103–1112.
- Zeng D and Lin D (2007). Maximum likelihood estimation in semiparametric regression models with censored data (with discussion). Journal of the Royal Statistical Society, Series B 69, 507–564.
- Zeng D and Lin D (2009). Semiparametric transformation models with random effects for joint analysis of recurrent and terminal events. Biometrics 65, 746–752.
- Zeng D, Mao L, and Lin D (2016). Maximum likelihood estimation for semiparametric transformation models with interval-censored data. Biometrika 103, 253–271.
- Zhang Z, Sun J, and Sun L (2005). Statistical analysis of current status data with informative observation times. Statistics in Medicine 24, 1399–1407.
- Zhang Z, Sun L, Sun J, and Finkelstein DM (2007). Regression analysis of failure time data with informative interval censoring. Statistics in Medicine 26, 2533–2546.
- Zhang Z, Sun L, Zhao X, and Sun J (2005). Regression analysis of interval-censored failure time data with linear transformation models. Canadian Journal of Statistics 33, 61–70.
- Zhang Z and Zhao Y (2013). Empirical likelihood for linear transformation models with interval-censored failure time data. Journal of Multivariate Analysis 116, 398–409.
- Zhao S, Hu T, Ma L, Wang P, and Sun J (2015). Regression analysis of informative current status data with the additive hazards model. Lifetime Data Analysis 21, 241–258.