Proportional hazards regression with interval censored data using an inverse probability weight

Glenn Heller

doi:10.1007/s10985-010-9191-8

. Author manuscript; available in PMC: 2017 Jul 6.

Published in final edited form as: Lifetime Data Anal. 2010 Dec 30;17(3):373–385. doi: 10.1007/s10985-010-9191-8

Proportional hazards regression with interval censored data using an inverse probability weight

Glenn Heller ¹

PMCID: PMC5499516 NIHMSID: NIHMS867453 PMID: 21191653

Abstract

The prevalence of interval censored data is increasing in medical studies due to the growing use of biomarkers to define a disease progression endpoint. Interval censoring results from periodic monitoring of the progression status. For example, disease progression is established in the interval between the clinic visit where progression is recorded and the prior clinic visit where there was no evidence of disease progression. A methodology is proposed for estimation and inference on the regression coefficients in the Cox proportional hazards model with interval censored data. The methodology is based on estimating equations and uses an inverse probability weight to select event time pairs where the ordering is unambiguous. Simulations are performed to examine the finite sample properties of the estimate and a colon cancer data set is used to demonstrate its performance relative to the conventional partial likelihood estimate that ignores the interval censoring.

Keywords: Estimating equation, Interval censoring, Proportional hazards, Selection probability

1 Introduction

In medical studies, radiographic scans and blood based biomarkers are increasingly being accepted as clinical endpoints in measuring the time to progression. In contrast to survival time, where patient death is the single determining point, disease progression is established in an interval between the clinic visit where progression is recorded and the prior clinic visit where there was no evidence of disease progression. To estimate the relationship between progression free survival time and covariates, it is common to employ the proportional hazards model, using a partial likelihood analysis based on the recorded progression time. However, if the interval width between clinic visits is significant relative to the variability of the true (but unobserved) progression times, then the ranks of the recorded progression times may be substantially different from the ranks of the true progression times, resulting in an inaccurate partial likelihood estimate of the proportional hazards regression coefficients.

Incorporation of interval censoring into the proportional hazards model does not enable canceling of the baseline hazard function, and as a result, estimation of the regression coefficients and the derivation of its asymptotic properties have proven challenging. Full likelihood approaches that require estimation of the baseline hazard to estimate the regression coefficient were developed by partitioning the time axis based on the endpoints of the event time intervals (Finkelstein 1986); employment of an EM algorithm based on piecewise constant event times (Goetghebeur and Ryan 2000); and use of a local likelihood to jointly estimate the regression coefficient and the baseline hazard function (Betensky et al. 2002). Although these methods do not require a global parametric distribution for the survival time, the need to estimate the baseline hazard offsets a key benefit of the semiparametric proportional hazards framework.

Monte Carlo methods have been explored to avoid baseline hazard estimation in the Cox model with interval censored data. Pan (2000) uses a multiple imputation procedure to fill-in failure times for the interval censored events and then applies the standard partial likelihood analysis. Satten (1996) considers a marginal likelihood approach, using Gibbs sampling to generate possible event time ranks. Simulations are used to demonstrate the small sample properties of the regression estimates, but the distribution theory for the estimated regression coefficients has not been developed. Satten, Datta, and Williamson (1998) derive an asymptotic distribution for the regression estimates, but require a parametric specification of the baseline hazard to impute failure times.

Estimating equations have been used to avoid estimation of the baseline hazard function. Zhang et al. (2005) proposed a solution under the constraint that the covariate in the proportional hazards model are discrete. In addition, due to the complexity of the regression estimate, they rely on a heuristic derivation of its asymptotic distribution and suggest a bootstrap estimate for the asymptotic variance (Zhang 2009). An excellent survey of statistical approaches applied to interval censored data is found in Sun (2006).

Software for the interval censored proportional hazards model is sparse (Gomez et al. 2009). In R, there exists one package (Henschel, Heiss, and Mansmann 2009), derived from an algorithm based on the joint maximization of the baseline survival function and the regression coefficient (Pan 1999). However, under the standard assumption that the baseline survival function S₀ is piecewise constant, the number of parameters grows with the sample size and the asymptotic distribution of the coefficient estimate β̂ is unclear. Huang and Wellner (1997) consider a profile likelihood approach L(Ŝ₀_β, β) to estimate the var(β̂), but note that the high dimensional inverse Hessian required for the variance estimate may lead to computational issues. In addition, they note that it remains uncertain whether Ŝ₀_βis a smooth function of β, which would argue against a bootstrap estimate of var(β̂).

Thus to date, although endpoints such as progression free survival that produce interval censored data are increasing, the application of these approaches is not routinely incorporated into a Cox model analysis. As noted, the reasons include: either concurrent estimation of the infinite dimensional baseline hazard, computational dificulty, or the lack of an asymptotic distribution for the regression coefficient estimate.

In this paper, an alternative methodology is proposed to estimate the proportional hazards regression coefficient and to derive the asymptotic distribution of this estimate when interval censoring is present. The method does not require estimation of the baseline hazard and uses standard estimating equation techniques to produce the estimate and its asymptotic distribution. Sections 2 and 3 develop the methodology for estimation and inference. In Section 4, simulations are performed to examine the finite sample adequacy of the parameter estimate and coverage based on asymptotic confidence intervals. An analysis of colon cancer data is undertaken in Section 5 and Section 6 contains concluding remarks.

2 An estimating equation under proportional hazards

The proportional hazards relationship between the event time T and covariate vector X is specified by

h (t ∣ x) = h_{0} (t) exp (x^{T} β_{1}),

(1)

where h(t|x) represents the subject-specific conditional hazard function, h₀(t) is an unspecified baseline hazard function, common for all subjects, and exp(x^T β₁) is the relative risk for an individual with k-dimensional covariate vector x. As a result, the proportional hazards specification contains a finite dimensional parameter β₁ and an infinite dimensional parameter h₀(t).

With right censored data, each individual is associated with an event time, censoring time, and covariate vector {T_i, C_i, X_i}. It is assumed that the event time and censoring time are independent conditional on the covariate vector. The minimum time and censoring indicator are observed for each subject and are denoted by

Y_{i} = min (T_{i}, C_{i}) d_{i}^{*} = I (T_{i} \leq C_{i}) i = 1, \dots, n .

Estimation and inference for $β_{1}^{0}$ , the true value of β₁, is accomplished through the partial likelihood function (Cox, 1975). The score equation, derived from the partial likelihood, is based on the ranks of the observed times and does not contain the infinite dimensional parameter h₀(t).

With interval censored data, however, the precise event times are not observed, and thus the ranks of these event times are unknown. Instead, the event time T_i for the i^th subject is known to lie in an interval [L_i, R_i]. For example, when scheduled clinic visits are used to assess disease progression, R_i represents the time from the start of treatment to the visit when disease progression is determined and L_i represents the time to the previous clinic visit. If the event has not occurred at the time of analysis, the event time is subject to right censoring. The notation used to indicate whether right censoring occurs is d_i = I[R_i < C_i]. The event time interval for a right censored subject is [L_i, +∞], where L_i represents the follow up time to the last negative recording. If the failure time is known precisely, when for example a death has occurred, the interval may be denoted as [R_i, R_i]. Finally, it is assumed that conditional on the covariate vector, the scheduling times are independent of the event time. This conditional independence assumption is common with interval censored data.

An alternative specification of the proportional hazards model is

m (t_{i}) = x_{i}^{T} β_{1}^{0} + ε_{i}

where ε_i are independent, identically distributed, standard extreme value random variables, and m is an unknown, but monotone function of the survival times. This specification shows that the proportional hazards model is a member of the linear transformation family (Dabrowska and Doksum 1988; Cheng, Wei, and Ying 1995). Under the proportional hazards specification, for any pair of observations

Pr (T_{i} > T_{j} ∣ X_{i} = x_{i}, X_{j} = x_{j}) = \frac{exp [x_{j i}^{T} β_{1}^{0}]}{1 + exp [x_{j i}^{T} β_{1}^{0}]},

(2)

where x_ji = x_j − x_i. This result suggests an alternative estimating equation approach for inference on $β_{1}^{0}$ . Using the notation S_ij = I[T_i > T_j], with no censoring, an unbiased estimating equation is

\sum_{i} \sum_{j \neq i} {S_{i j} - E [S_{i j} ∣ x_{i}, x_{j}]} = 0.

An additional component is needed to account for the finite follow up period typical with survival time data. For an event time pair ordering to be observed, the minimum of the pair must be less than the maximum follow up time of the study, τ. Denoting the bounded ordering of the event time pair as $S_{i j}^{τ} = I [T_{i} > T_{j}, T_{j} < τ]$ , then under proportional hazards (Fine, Ying, and Wei 1998),

E [S_{i j}^{τ} ∣ x_{i}, x_{j}] = \frac{exp [x_{j i}^{T} β_{1}^{0}]}{1 + exp [x_{j i}^{T} β_{1}^{0}]} {1 - exp (- β_{2}^{0} exp [x_{i}^{T} β_{1}^{0}] - β_{2}^{0} exp [x_{j}^{T} β_{1}^{0}])}

where $β_{2}^{0} = \int_{t = 0}^{τ} h_{0} (t)$ . Thus, accounting for the finite follow up period requires one additional parameter β₂ and the (k +1) parameters are represented as $β = {(β_{1}^{T}, β_{2})}^{T}$ .

3 Inverse probability weighted estimating equation

With interval censoring, the ordering of the event time pairs ( $S_{i j}^{τ}$ ) may not be observed. A sufficient condition for observing {T_i > T_j, T_j < τ } is {L_i > R_j, R_j < C_j}. This condition or its complementary event is indicated by

Δ_{i j} = I [R_{j} < C_{j}] I [L_{i} > R_{j}] + I [R_{i} < C_{i}] I [L_{j} > R_{i}] .

Li and Pu (2003) considered this condition of nonoverlapping event time interval pairs to estimate β. Their proposal, however, was limited to the accelerated failure time model with a single covariate and required a strong independence assumption between the covariate and the assessment schedule.

A weighted unbiased estimating equation within the proportional hazards frame-work, based on the selected event time pairs is

\sum_{i} \sum_{j \neq i} \frac{Δ_{i j}}{π^{*} (s_{i j}^{τ}, x_{i}, x_{j})} W_{i j} (β) {S_{i j}^{τ} - E [S_{i j}^{τ} ∣ x_{i}, x_{j}]} = 0,

(3)

where $π^{*} (s_{i j}^{τ}, x_{i}, x_{j}) = Pr [Δ_{i j} = 1 ∣ S_{i j}^{τ} = s_{i j}^{τ}, X_{i} = x_{i}, X_{j} = x_{j}]$ is the selection probability and $W_{i j} (β) = \partial E [S_{i j}^{τ} ∣ x_{i}, x_{j}] / \partial β$ .

If π^* is known, then inference for β can be accomplished through the estimating equation (3). If π^* is unknown, but the unobserved $S_{i j}^{τ}$ are missing at random, i.e. $π^{*} (s_{i j}^{τ}, x_{i}, x_{j}) = π^{*} (x_{i}, x_{j})$ , then either setting π^* = 1 or using a working model for π^* would produce an unbiased estimating equation. The unobserved ordered pairs, however, are not missing at random. Specifically, note that I(L_i > R_j) = I(L_i > R_j)I(T_i > T_j) and

\begin{array}{l} Pr [Δ_{i j} = 1 ∣ S_{i j}^{τ} = s_{i j}^{τ}, X_{i} = x_{i}, X_{j} = x_{j}] = \\ s_{i j}^{τ} Pr [Δ_{i j} = 1 ∣ S_{i j}^{τ} = 1, X_{i} = x_{i}, X_{j} = x_{j}] + s_{j i}^{τ} Pr [Δ_{i j} = 1 ∣ S_{j i}^{τ} = 1, X_{i} = x_{i}, X_{j} = x_{j}] . \end{array}

An instructive approach for the missing not at random case is to incorporate an observable auxillary variable that essentially captures the information in $S_{i j}^{τ}$ for predicting Δ_ij (Ibrahim, Lipsitz, and Horton 2001). Since $S_{i j}^{τ}$ informs the order between (T_i, T_j), an observable auxiliary variable that orders the event time interval pair [L_i, R_i] and [L_j, R_j] is

A_{i j} = I [R_{j} < C_{j}] I [L_{i} \geq L_{j}] I [R_{i} > R_{j}] .

The proposal is to use the observable A_ij in place of $S_{i j}^{τ}$ to model the selection probability; the inverse probability weight component in the estimating equation (3). This replacement is based on an estimating equation modified to include the auxillary variables and still retain its mean zero property

\sum_{i} \sum_{j \neq i} \frac{Δ_{i j} W_{i j} (β) {S_{i j}^{τ} - E [S_{i j}^{τ} ∣ x_{i}, x_{j}]}}{Pr [Δ_{i j} = 1 ∣ S_{i j}^{τ} = s_{i j}^{τ}, A_{i j} = a_{i j}, A_{j i} = a_{j i}, X_{i} = x_{i}, X_{j} = x_{j}]} = 0,

(4)

where 0/0 is defined as zero for any term in the summand and throughout the paper. To create an observable estimating equation, the selection probability in (4) is replaced by Pr[Δ_ij = 1|A_ij = a_ij, A_ji = a_ji, X_i = x_i, X_j = x_j]. If a_ij + a_ji = 0, this replacement is trivial since Δ_ij = 0 with probability 1. If a_ij + a_ji = 1, a critical assumption for the viability of this substitution is that conditional on (A_ij, A_ji, X_i, X_j), $S_{i j}^{τ}$ is ignorable for the prediction of Δ_ij. This ignorability assumption essentially transforms the problem into a missing at random framework. Although it is likely that this condition is only an approximation, the simulations in this paper suggest that the approximation is reasonable.

To implement this alternative specification of the selection probability, data pairs are eliminated when right censoring masks the potential ordering of the intervals [L_i, R_i], [L_j, R_j]. The indication that right censoring obscures the ordering is denoted by

1 - ω_{i j} = (1 - d_{i}) I [L_{i} \leq L_{j}] + (1 - d_{j}) I [L_{j} < L_{i}]

and the selection probability conditional on the complementary space where (A_ij, A_ji, X_i, X_j) is informative for Δ_ij is

π (a_{i j}, a_{j i}, x_{i}, x_{j}) \equiv Pr [Δ_{i j} = 1 ∣ A_{i j} = a_{i j}, A_{j i} = a_{j i}, X_{i} = x_{i}, X_{j} = x_{j}, ω_{i j} = 1] .

Incorporation of the selection probability enables the use of the observed event time pairs to create an asymptotic mean zero estimating equation in the presence of interval censored data

\sum_{i} \sum_{j \neq i} \frac{Δ_{i j}}{π (a_{i j}, a_{j i}, x_{i}, x_{j})} W_{i j} (β) {S_{i j}^{τ} - E [S_{i j}^{τ} ∣ x_{i}, x_{j}]} = 0.

(5)

Use of this estimating equation requires an estimate for the selection probability component π(a_ij, a_ji, x_i, x_j). A common approach is to apply a logistic regression model, which is specified as

π (a_{i j}, a_{j i}, x_{i}, x_{j}; γ) = a_{i j} \frac{exp [γ_{0} + x_{j i}^{T} γ_{1}]}{1 + exp [γ_{0} + x_{j i}^{τ} γ_{1}]} + a_{j i} \frac{exp [γ_{0} + x_{i j}^{T} γ_{1}]}{1 + exp [γ_{0} + x_{i j}^{T} γ_{1}]} .

(6)

Estimation of the Cox regression coefficient is accomplished through two estimating equations. First, $\hat{γ} = {({\hat{γ}}_{0}, {\hat{γ}}_{1}^{T})}^{T}$ is the solution to the estimating equation

Q_{γ} (γ) = n^{- 3 / 2} \sum_{i} \sum_{j \neq i} D_{i j} (γ) {\frac{Δ_{i j}}{π (a_{i j}, a_{j i}, x_{i}, x_{j}; γ)} - 1} = 0,

(7)

where D_ij(γ) = ∂π(a_ij, a_ji, x_i, x_j; γ)/∂γ. The estimated selection probabilities are used to estimate β from the profile estimating equation

Q_{β} (β, \hat{γ}) = n^{- 3 / 2} \sum_{i} \sum_{j \neq i} \frac{Δ_{i j}}{π (a_{i j}, a_{j i}, x_{i}, x_{j}; \hat{γ})} W_{i j} (β) {S_{i j}^{τ} - E [S_{i j}^{τ} ∣ x_{i}, x_{j}]} = 0.

(8)

The estimating equations (7) and (8), as a function of the true parameter values, are U-statistics of degree 2. As a result, the following theorem summarizes the asymptotic distribution of the parameter estimates determined by the estimating equations.

Theorem

Let θ = (β^T, γ^T )^T and Q_θ(θ) be the 2k +2 vector of estimating functions defined in (7) and (8). Under the following conditions:

(C1)
The proportional hazards assumption (1) between the unobserved event time T and the covariate vector X is valid.
(C2)
The selection probability in (4) is approximated by π(a_ij, a_ji, x_i, x_j), creating a missing at random structure.
(C3)
The logistic regression model (6) is the proper specification for the selection probability π(a_ij, a_ji, x_i, x_j).

Then n^1/2(θ̂ − θ₀) converges in distribution to N(0, U⁻¹V U⁻¹), where U = lim_n→_∞ E{n^−1/2∂Q_θ(θ₀)/∂θ} and V = lim_n→_∞ n⁻¹var{Q_θ(θ₀)}. In addition, the asymptotic variance of β̂₁ is the upper left k×k submatrix of U⁻¹V U⁻¹. The proof is standard and follows from a one-term Taylor expansion of Q_θ(θ). The estimate of the matrix U is computed by replacing the expectation with the sample average and the parameters with the parameter estimates. The variance-covariance matrix V is obtained through U-statistic theory. Letting q_ijl denote the (i, j) element in the summand of the l^th component of Q_θ(θ), the (l, m) element of V is

n^{- 3} \sum_{i} \sum_{j} \sum_{k \neq j} (q_{ijl} + q_{jil}) (q_{ikm} + q_{kim}) .

4 Simulations

A set of simulations were generated to assess the performance of the interval censored estimating equation estimates β̂_ee. A comparable set of simulations were produced using the conventional partial likelihood estimate β̂_pl, choosing the time of the first recorded positive event as the failure time, and an estimate of β using Pan’s joint maximization algorithm (Pan 1999). Pan’s method, however, does not produce an asymptotic variance for the coefficient estimate. The underlying event time data were generated to satisfy the proportional hazards specification

T_{i} = exp [4 + x_{i} β] \times ε_{i}

where β = 1, x_i was generated as a standard normal random variable, and ε_i was distributed as a Weibull random variable with shape parameter λ and scale parameter equal to 1. The shape parameter λ was varied as {0.693, 1.1, 1.386, 1.609} and the Cox coefficient is β₁ = −λβ. This choice of shape parameters results in odds parameters equal to {2, 3, 4, 5}, based on the odds concordance interpretation of the Cox coefficient

\frac{Pr [T_{j} > T_{i} ∣ x_{i}, x_{j}]}{Pr [T_{i} > T_{j} ∣ x_{i}, x_{j}]} = exp [- x_{j i} β_{1}] .

This sequence of shape parameters translates into event times with decreasing variability.

The monitoring schedule was developed under the clinical scenario that scans to detect disease progression were scheduled either every 12 months or every 24 months. Thus, for subject i, the simulated 12 month scan schedule was

σ_{k, i} = 12 k + U_{i k} (- 2, 2) k = 1, \dots, K

where the U_ik(−2, 2) represent independent uniform random variables, with support between −2 and 2, and K indicates the maximum number of scans. The uniform random variable is included to signify the scenario that a subject would arrive for the scan within a 2 month neighborhood of the scheduled visit. To examine the effect of the length of follow up on the regression estimates, the number of scans (K) was varied as {5, 10, 25} for scans scheduled every 12 months and {3, 6, 15} for scans scheduled every 24 months. The left and right endpoints for each subject’s event time interval were computed as

(L_{i}, R_{i}) = (0, σ_{1 i}) I [T_{i} < σ_{1, i}] + \sum_{k = 2}^{K} {(σ_{k - 1, i}, σ_{k, i}) I [σ_{k - 1, i} < T_{i} < σ_{k, i}]} + (σ_{K, i}, M) I [T_{i} > σ_{K, i}]

In the simulations, the right endpoint of a subject whose underlying event time was greater than the maximum follow up time was assigned a very large value (M). In addition, a uniform censoring random variable was generated as C_i ~ U(0, τ) to simulate the clinical trial scenario of subjects entering the study at different time points and lost to follow up. If C was less than the scan time where the progression was observed, then the follow-up time was censored at the scan time just prior to C. The maximum follow up time of the study τ is equal to the scan interval width times the maximum number of scans scheduled. The sample size for each simulation was 100 and the results of each simulation were based on 5,000 replications.

The estimating equation removes data pairs that do not meet the selection criteria. To assess the potential loss of information with this procedure, the simulation estimated root mean squared error of β̂_ee was compared to the simulation estimated root mean squared error of the partial likelihood estimate and Pan’s regression coeffcient estimate. Pan’s approach, based on a full likelihood, is used to gauge the efficiency of the proposed estimate.

The results for β̂_ee are accurate over the range of simulations examined (Table 1). The bias was small. The average asymptotic standard error of the β̂_ee provided a good approximation to its simulation standard error. The 95% empirical coverage was uniformly good based on the confidence interval β̂±1.96×se(β̂). In contrast, the conventional partial likelihood approach produced a bias in β̂_pl and poorer coverage rates, which were magnified when the scan interval width was larger and the variability of the event times was smaller (Table 2).

Table 1.

Simulation results based on the inverse probability weighted estimating equation. The columns in the table represent: Scan, scan schedule; Shape, variability of the underlying event times; PRC, percent right censored; Bias, bias of the parameter estimator; SSE, standard deviation of the simulation estimates; ASE, average estimated standard error; CP, empirical coverage probability; RMSE, root mean square error of the estimator.

Scan	Shape	PRC	Bias	SSE	ASE	CP	RMSE
12	0.693	0.25	−0.004	0.161	0.160	0.949	0.161
		0.45	−0.007	0.193	0.183	0.943	0.193
		0.65	−0.009	0.230	0.221	0.947	0.230
	1.110	0.25	−0.001	0.191	0.200	0.952	0.191
		0.45	−0.013	0.232	0.230	0.949	0.232
		0.65	−0.054	0.318	0.296	0.939	0.322
	1.386	0.25	0.000	0.218	0.301	0.950	0.218
		0.45	−0.021	0.271	0.279	0.943	0.272
		0.65	−0.079	0.398	0.364	0.943	0.406
	1.609	0.25	0.005	0.237	0.303	0.949	0.237
		0.45	−0.023	0.301	0.339	0.946	0.302
		0.65	−0.126	0.475	0.425	0.938	0.491

24	0.693	0.25	0.030	0.165	0.163	0.939	0.168
		0.45	0.028	0.203	0.195	0.929	0.205
		0.65	0.015	0.273	0.258	0.932	0.273
	1.110	0.25	0.051	0.197	0.202	0.921	0.203
		0.45	0.021	0.263	0.254	0.930	0.264
		0.65	−0.049	0.398	0.364	0.934	0.401
	1.386	0.25	0.084	0.226	0.349	0.908	0.241
		0.45	0.021	0.318	0.318	0.920	0.319
		0.65	−0.086	0.513	0.589	0.927	0.520
	1.609	0.25	0.111	0.258	0.425	0.885	0.281
		0.45	0.015	0.372	0.447	0.920	0.372
		0.65	−0.156	0.686	0.727	0.929	0.704

Open in a new tab

Table 2.

Simulation results from the partial likelihood estimate and the Pan estimate. The columns in the table represent: Scan, scan schedule; Shape, variability of the underlying event times; PRC, percent right censored; Bias, bias of the parameter estimator; SSE, standard deviation of the simulation estimates; ASE, average estimated standard error; CP, empirical coverage probability; RMSE, root mean square error of the estimator.

			PL estimate					Pan estimate
Scan	Shape	PRC	Bias	SSE	ASE	CP	RMSE	Bias	SSE	RMSE
12	0.693	0.25	0.006	0.135	0.137	0.953	0.135	−0.013	0.136	0.137
		0.45	0.012	0.158	0.155	0.948	0.158	0.005	0.150	0.150
		0.65	0.046	0.182	0.183	0.947	0.188	0.034	0.178	0.181
	1.110	0.25	0.034	0.155	0.154	0.935	0.159	0.063	0.139	0.153
		0.45	0.044	0.179	0.177	0.931	0.184	0.084	0.151	0.173
		0.65	0.056	0.224	0.216	0.917	0.231	0.118	0.192	0.225
	1.386	0.25	0.060	0.178	0.171	0.908	0.188	0.157	0.145	0.214
		0.45	0.076	0.200	0.196	0.906	0.214	0.176	0.156	0.235
		0.65	0.102	0.257	0.242	0.885	0.277	0.229	0.194	0.300
	1.609	0.25	0.093	0.196	0.183	0.868	0.217	0.259	0.147	0.298
		0.45	0.113	0.224	0.211	0.861	0.251	0.273	0.157	0.315
		0.65	0.147	0.283	0.263	0.848	0.319	0.334	0.191	0.385

24	0.693	0.25	0.048	0.131	0.134	0.932	0.140	−0.011	0.141	0.141
		0.45	0.062	0.152	0.154	0.924	0.164	0.004	0.164	0.164
		0.65	0.106	0.190	0.191	0.890	0.218	0.006	0.209	0.209
	1.110	0.25	0.125	0.153	0.147	0.815	0.198	0.039	0.152	0.157
		0.45	0.155	0.173	0.168	0.797	0.232	0.045	0.179	0.185
		0.65	0.224	0.224	0.212	0.676	0.317	0.046	0.231	0.236
	1.386	0.25	0.215	0.177	0.157	0.645	0.278	0.114	0.158	0.195
		0.45	0.259	0.196	0.179	0.623	0.325	0.110	0.188	0.218
		0.65	0.346	0.252	0.228	0.564	0.428	0.131	0.234	0.268
	1.609	0.25	0.305	0.197	0.164	0.496	0.363	0.191	0.161	0.250
		0.45	0.349	0.221	0.189	0.499	0.413	0.177	0.193	0.262
		0.65	0.457	0.274	0.240	0.456	0.533	0.217	0.234	0.319

Open in a new tab

The quality of the estimate β̂_ee decreased as the scan interval width increased and the variability of the underlying event time decreased. These parameter settings, along with higher censoring rates, tended to reduce the number of unambiguously ordered event time pairs. A comparison of the root mean squared error for the estimating equation estimator and the partial likelihood estimator shows that when the scan interval width is 12, the root mean squared error of the partial likelihood estimator is always smaller. The results are mixed when the scan interval is 24, where β̂_ee has a smaller root mean square error when the censoring is either 25% or 45% and the variability of the underlying event times is low. These results suggest an additional weight incorporated into the estimating equations, targeted to reduce the variability of β̂_ee, would prove useful. Surprisingly, the root mean squared error for Pan’s full likelihood does not dominate. The Pan estimate incurs significant bias when the event time variability is small.

5 Colon cancer data analysis

Ninety patients with locally advanced colorectal cancer were treated surgically at Memorial Sloan-Kettering Cancer Center. Patients were monitored for their time to recurrence or death. Tumors of the colon produce the protein Carcinoembryonic Antigen (CEA). The prognostic value of CEA for predicting the risk of recurrence in this population is unclear, since CEA may also be elevated in response to other diseases.

An analysis was undertaken to explore the prognostic significance of the baseline CEA measure, taken just after surgery, in assessing the risk of recurrence. The purpose of the analysis was to determine whether the clinician should use CEA in the assessment of peri-surgical treatment options. To determine recurrence, patients underwent a CT scan at six months and one year after surgery, and were subsequently scheduled for yearly scans. Although follow up scans were intended for the life of the patient, patients with no cancer related health problems typically ceased returning to the clinic at some point in their follow up. As a result of the scan schedule, the event tumor recurrence is interval censored. An analysis based on the partial likelihood estimate, using the time of the first recorded positive recurrence on the CT scan as the event, was compared to the estimating equation estimate to explore the differences in the two approaches. Approximately 50% of the patients recurred and no patient died prior to their recurrence. The estimated probability of remaining alive and recurrence free at 10 years was 0.36. This calculation was based on a Kaplan-Meier estimate that accounts for interval censoring (Wellner and Zhan 1997). The relationship between baseline CEA and the time to recurrence was first summarized through the partial likelihood estimate. CEA values ranged from (0.0, 52.6), with the median value equal to 3.7. The distribution of CEA was right skewed and a square root transformation was applied. The estimated coefficient was β̂_pl = 0.171 and the estimated se(β̂_pl) = 0.078, suggesting a positive relationship between baseline CEA and the risk of recurrence.

Accounting for the interval censoring and applying the estimating equation approach produced β̂_ee = 0.235 and the estimated se(β̂_ee) = 0.114. The results of the partial likelihood and the estimating equation methods are summarized in Table 3. Interestingly, both the estimate and its standard error are greater than the partial likelihood estimates. One explanation is that the estimating equation approach uses only data pairs where there is a definitive event time ordering. In contrast, the partial likelihood approach ignores the uncertainty in the event time ordering and uses all the events in the estimation process. This can produce an event time ordering that is partially dictated by the visitation schedule to the clinic, resulting in an attenuation of the partial likelihood estimate β̂_pl. In addition, by implicitly assuming more information in the event time ordering than exists in the presence of interval censored data, the standard error of β̂_pl is expected to be smaller, giving a false sense of confidence as to the location of the true regression coefficients.

Table 3.

Estimated coefficients and standard errors for the square root of CEA from the partial likelihood and estimating equations.

Method	sqrt(CEA)
Method	Coef	SE
Partial likelihood	0.171	0.078
Estimating Equation	0.235	0.114

Open in a new tab

6 Discussion

Interval censored data methods applied to the proportional hazards model have not gained widespread acceptance. Methodology that has incorporated interval censoring requires either estimation of the baseline hazard or a computationally intensive approach. Due to this inadequacy, interval censoring is often ignored, and the right endpoint of the event time interval is used to compute the partial likelihood estimate. The result is a biased estimate, with the bias increasing as a function of the interval censoring width combined with the precision of the underlying event times. In this paper, the proportional hazards specification is used to create an estimating equation that incorporates an inverse selection probability weight. If the selection probability model and the proportional hazards model are properly specified, then a set of unbiased estimating equations can be formed under a surrogacy condition. The asymptotic distribution of the estimated regression coefficients follows directly from the estimating equations. This proposal, which produces a Cox model estimate that is computationally straightforward and has an estimable asymptotic distribution, is timely due to the increasing use of clinical endpoints that are subject to interval censoring.

References

Betensky RA, Lindsey JC, Ryan LM, Wand MP. A local likelihood proportional hazards model for interval censored data. Statistics in Medicine. 2002;21:263–275. doi: 10.1002/sim.993. [DOI] [PubMed] [Google Scholar]
Cheng SC, Wei LJ, Ying Z. Analysis of transformation models with censored data. Biometrika. 1995;82:835–845. [Google Scholar]
Cox DR. Partial likelihood. Biometrika. 1975;62:269–276. [Google Scholar]
Dabrowska DM, Doksum KA. Partial likelihood in transformation models with censored data. Scandanavian Journal of Statistics. 1988;15:1–23. [Google Scholar]
Efron B, Tibshirani RJ. An introduction to the bootstrap. Chapman and Hall; New York: 1993. [Google Scholar]
Fine JP, Ying Z, Wei LJ. On the linear transformation model for censored data. Biometrika. 1998;85:980–986. [Google Scholar]
Finkelstein DM. A proportional hazards model for interval-censored failure time data. Biometrics. 1986;42:845–854. [PubMed] [Google Scholar]
Goetghebeur E, Ryan L. Semiparametric regression analysis of interval-censored data. Biometrics. 2000;56:1139–1144. doi: 10.1111/j.0006-341x.2000.01139.x. [DOI] [PubMed] [Google Scholar]
Gomez G, Calle ML, Oller R, Langohr K. Tutorial on methods for interval-censored data and their implementation in R. Statistical Modelling. 2009;9:259–297. [Google Scholar]
Henschel V, Heiss C, Mansmann U. intcox: Compendium to apply the iterative convex minorant algorithm to interval censored event data. 2009 http://127.0.0.1:21646/library/intcox/vignettes/intcox.pdf.
Huang J, Wellner JA. Interval censored survival data: a review of recent progress. In: Lin DY, Fleming T, editors. Proceedings of the First Seattle Symposium in Biostatistics: Survival Analysis. Springer-Verlag; New York: 1997. [Google Scholar]
Ibrahim JG, Lipsitz SR, Horton N. Using auxiliary data for parameter estimation with non-ignorably missing outcomes. Applied Statistics. 2001;50:361–373. [Google Scholar]
Li L, Pu Z. Rank estimation of log-linear regression with interval-censored data. Lifetime Data Analysis. 2003;9:57–70. doi: 10.1023/a:1021882122257. [DOI] [PubMed] [Google Scholar]
Pan W. Extending the iterative convex minorant to the Cox model for interval censored data. Journal of Computational and Graphical Statistics. 1999;8:109–120. [Google Scholar]
Pan W. A multiple imputation approach to Cox regression with interval-censored data. Biometrics. 2000;56:199–203. doi: 10.1111/j.0006-341x.2000.00199.x. [DOI] [PubMed] [Google Scholar]
Satten GA. Rank-based inference in the proportional hazards model with interval-censored data. Biometrika. 1996;83:355–370. [Google Scholar]
Satten GA, Datta S, Williamson JM. Inference based on imputed failure times for the proportional hazards model with interval-censored data. Journal of the American Statistical Association. 1998;93:318–327. [Google Scholar]
Sun J. The statistical analysis of interval-censored failure time data. Springer; New York: 2006. [Google Scholar]
Wellner JA, Zhan Y. A hybrid algorithm for computation of the nonparametric maximum likelihood estimator from censored data. Journal of the American Statistical Association. 1997;92:945–959. [Google Scholar]
Zhang Z, Sun L, Zhao X, Sun J. Regression analysis of interval censored failure time data with linear transformation models. The Canadian Journal of Statistics. 2005;33:61–70. [Google Scholar]
Zhang Z. Linear transformation models for interval-censored data: prediction of survival probability and model checking. Statistical Modelling. 2009;9:321–343. [Google Scholar]

[R1] Betensky RA, Lindsey JC, Ryan LM, Wand MP. A local likelihood proportional hazards model for interval censored data. Statistics in Medicine. 2002;21:263–275. doi: 10.1002/sim.993. [DOI] [PubMed] [Google Scholar]

[R2] Cheng SC, Wei LJ, Ying Z. Analysis of transformation models with censored data. Biometrika. 1995;82:835–845. [Google Scholar]

[R3] Cox DR. Partial likelihood. Biometrika. 1975;62:269–276. [Google Scholar]

[R4] Dabrowska DM, Doksum KA. Partial likelihood in transformation models with censored data. Scandanavian Journal of Statistics. 1988;15:1–23. [Google Scholar]

[R5] Efron B, Tibshirani RJ. An introduction to the bootstrap. Chapman and Hall; New York: 1993. [Google Scholar]

[R6] Fine JP, Ying Z, Wei LJ. On the linear transformation model for censored data. Biometrika. 1998;85:980–986. [Google Scholar]

[R7] Finkelstein DM. A proportional hazards model for interval-censored failure time data. Biometrics. 1986;42:845–854. [PubMed] [Google Scholar]

[R8] Goetghebeur E, Ryan L. Semiparametric regression analysis of interval-censored data. Biometrics. 2000;56:1139–1144. doi: 10.1111/j.0006-341x.2000.01139.x. [DOI] [PubMed] [Google Scholar]

[R9] Gomez G, Calle ML, Oller R, Langohr K. Tutorial on methods for interval-censored data and their implementation in R. Statistical Modelling. 2009;9:259–297. [Google Scholar]

[R10] Henschel V, Heiss C, Mansmann U. intcox: Compendium to apply the iterative convex minorant algorithm to interval censored event data. 2009 http://127.0.0.1:21646/library/intcox/vignettes/intcox.pdf.

[R11] Huang J, Wellner JA. Interval censored survival data: a review of recent progress. In: Lin DY, Fleming T, editors. Proceedings of the First Seattle Symposium in Biostatistics: Survival Analysis. Springer-Verlag; New York: 1997. [Google Scholar]

[R12] Ibrahim JG, Lipsitz SR, Horton N. Using auxiliary data for parameter estimation with non-ignorably missing outcomes. Applied Statistics. 2001;50:361–373. [Google Scholar]

[R13] Li L, Pu Z. Rank estimation of log-linear regression with interval-censored data. Lifetime Data Analysis. 2003;9:57–70. doi: 10.1023/a:1021882122257. [DOI] [PubMed] [Google Scholar]

[R14] Pan W. Extending the iterative convex minorant to the Cox model for interval censored data. Journal of Computational and Graphical Statistics. 1999;8:109–120. [Google Scholar]

[R15] Pan W. A multiple imputation approach to Cox regression with interval-censored data. Biometrics. 2000;56:199–203. doi: 10.1111/j.0006-341x.2000.00199.x. [DOI] [PubMed] [Google Scholar]

[R16] Satten GA. Rank-based inference in the proportional hazards model with interval-censored data. Biometrika. 1996;83:355–370. [Google Scholar]

[R17] Satten GA, Datta S, Williamson JM. Inference based on imputed failure times for the proportional hazards model with interval-censored data. Journal of the American Statistical Association. 1998;93:318–327. [Google Scholar]

[R18] Sun J. The statistical analysis of interval-censored failure time data. Springer; New York: 2006. [Google Scholar]

[R19] Wellner JA, Zhan Y. A hybrid algorithm for computation of the nonparametric maximum likelihood estimator from censored data. Journal of the American Statistical Association. 1997;92:945–959. [Google Scholar]

[R20] Zhang Z, Sun L, Zhao X, Sun J. Regression analysis of interval censored failure time data with linear transformation models. The Canadian Journal of Statistics. 2005;33:61–70. [Google Scholar]

[R21] Zhang Z. Linear transformation models for interval-censored data: prediction of survival probability and model checking. Statistical Modelling. 2009;9:321–343. [Google Scholar]

PERMALINK

Proportional hazards regression with interval censored data using an inverse probability weight

Glenn Heller

Abstract

1 Introduction

2 An estimating equation under proportional hazards

3 Inverse probability weighted estimating equation

Theorem

4 Simulations

Table 1.

Table 2.

5 Colon cancer data analysis

Table 3.

6 Discussion

References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Proportional hazards regression with interval censored data using an inverse probability weight

Glenn Heller

Abstract

1 Introduction

2 An estimating equation under proportional hazards

3 Inverse probability weighted estimating equation

Theorem

4 Simulations

Table 1.

Table 2.

5 Colon cancer data analysis

Table 3.

6 Discussion

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases