Semiparametric regression analysis of length-biased interval-censored data

Fei Gao; Kwun Chuen Gary Chan

doi:10.1111/biom.12970

Author manuscript; available in PMC: 2021 Nov 25.

Published in final edited form as: Biometrics. 2019 Mar 8;75(1):121–132. doi: 10.1111/biom.12970

PMCID: PMC8614128 NIHMSID: NIHMS1045281 PMID: 30267539

Abstract

In a prevalent cohort design, subjects who have experienced an initial event but not the failure event are preferentially enrolled, and the observed failure times are often length-biased. Moreover, the prospective follow-up may not be continuously monitored, so failure times are subject to interval censoring. We study nonparametric maximum likelihood estimation for the proportional hazards model with length-biased interval-censored data. Direct maximization of the likelihood function is intractable, so we develop a computationally simple and stable expectation-maximization algorithm by introducing two layers of data augmentation. We establish the strong consistency, asymptotic normality, and efficiency of the proposed estimator and provide an inferential procedure through profile likelihood. We assess the performance of the proposed methods through extensive simulations and apply them to the Massachusetts Health Care Panel Study.

Keywords: left truncation, nonparametric maximum likelihood estimation, proportional hazards model, semiparametric efficiency

1 |. INTRODUCTION

Interval-censored data arise when a failure time is not recorded precisely but is rather known to lie within a time interval. Such data are encountered in prospective follow-up studies, where the ascertainment of the event of interest is made over a series of examination times. Regression analysis of unbiased interval-censored survival data has been extensively studied. In particular, nonparametric maximum likelihood estimation for the proportional hazards and transformation models has been studied by Huang (1996) and Zeng et al. (2016), respectively. Because the likelihood is intractable, sieve estimation has also been proposed for the proportional hazards model by Huang and Rossini (1997) and Cai and Betensky (2003), among others. A comprehensive review is given in Sun (2007).

Although sampling incident cases in a follow-up study is common, it may require a long follow-up period to observe enough failure events for meaningful analysis. Alternatively, a prevalent cohort design samples individuals who have experienced an initial event but not the failure event at enrollment, and is often considered as a more focused and economical design (Brookmeyer and Gail, 1987). However, subjects with a longer survival time are preferentially sampled in a prevalent cohort. When the incidence of the initial event is stationary over time, a prevalent cohort collects length-biased data (Wang, 1991; Shen et al., 2009), where the probability of observing the failure time is proportional to its value.

Prospective follow-up of a prevalent cohort can be subject to interval censoring. An example is the Massachusetts Health Care Panel Study (Chappell, 1991), where the time to loss of active life for elderly individuals was assessed approximately 1.25, 6, and 10 years after study recruitment. Since only functionally independent individuals were enrolled, subjects with a longer time to loss of active life were more likely to be sampled. Although a prevalent cohort is a biased sample that requires special methods for analysis, it may provide information that is otherwise unavailable in an incident cohort. For example, in an incident sampling design, the right tail of the survival distribution may not be identified because of limited study duration. Using a prevalent sampling design, the identifiable region for the survival distribution and marked variables that are observed only at the event occurrence could be enlarged (Chan and Wang, 2010). In the Massachusetts Health Care Panel Study data, even though the last monitoring time is 10 years after study recruitment, we can identify a survival distribution that ranges over 30 years (see Section 3.2), because individuals are event-free for a period before enrollment. An added advantage for interval-censored data is that, even when the monitoring time has a discrete distribution, we can identify the continuous survival distribution of the failure event because the event-free period before enrollment is typically continuous. For incident sampling with discrete monitoring time, in contrast, we can only identify a discrete survival distribution.

Statistical methods for regression modeling of length-biased data have mostly been proposed for uncensored and right-censored data. Wang (1996) and Chen (2010) considered uncensored data. For right-censored data, Qin and Shen (2010) proposed an inverse weighted estimating equation and Huang and Qin (2012) proposed a composite likelihood approach for the proportional hazards model. Qin et al. (2011) considered the nonparametric maximum likelihood estimator, derived an expectation-maximization (EM) algorithm for computation, and showed that the estimator is efficient.

Even though the literature on length-biased interval-censored data is limited, several methods have been proposed for left-truncated interval-censored data without the length-biased assumption for the truncation time. In particular, Pan and Chappell (1998) considered the proportional hazards model and applied a gradient projection-based method for nonparametric maximum likelihood estimation, where the baseline survival function may be underestimated. Pan and Chappell (2002) considered the same model and suggested a marginal likelihood approach that avoids estimating the baseline hazard function and a monotone maximum likelihood approach assuming that the baseline distribution has a nondecreasing hazard function. Kim (2003) studied the special case of left-truncated current status data, where there is only one examination time, and established the asymptotic properties of the nonparametric maximum likelihood estimators. Recently, Wang et al. (2015) studied the additive hazard model with left-truncated interval-censored data and proposed a sieve estimation method.

In this paper, we study nonparametric maximum likelihood estimation for the proportional hazards model with length-biased interval-censored data. By introducing pseudo-truncated data and latent Poisson random variables, we develop a simple and computationally stable EM algorithm. We establish the strong consistency and asymptotic normality of the proposed estimators and provide inference through a profile likelihood approach. We assess the performance of the proposed estimator and inferential procedures through extensive simulations and apply the proposed methods to the Massachusetts Health Care Panel Study data.

2 |. THE PROPOSED METHODOLOGY

2.1 |. Model and data

For individuals in the target population, let $\tilde{T}$ be the time to a failure event and $\tilde{Z}$ be a p-vector of covariates. We assume that $\tilde{T}$ follows a proportional hazards model with a cumulative hazard function

Λ (t ∣ \tilde{Z}) = Λ (t) \exp (β^{T} \tilde{Z}),

where β is a p-vector of unknown regression parameters, and Λ(·) is an arbitrary increasing function with Λ(0) = 0.

For length-biased sampling, it is common to assume that the incidence rate of the initial event is constant over calendar time and $\tilde{A}$, the truncation time, is uniformly distributed in [0, τ], where τ is the maximum support of $\tilde{T}$ (Wang, 1991; Qin et al., 2011). In a prevalent cohort study, a subject is included only if the failure time does not occur before the truncation time, that is, $\tilde{T} \geq \tilde{A}$. We let T, A, and Z be the failure time, truncation time, and covariates, respectively, in the prevalent cohort. Then, (T, A, Z) has the same joint distribution as $(\tilde{T}, \tilde{A}, \tilde{Z})$ conditional on $\tilde{T} \geq \tilde{A}$. Suppose that the occurrence of the failure is not exactly observed but only determined at a sequence of examination times, denoted as $A < U_{1} < \dots < U_{M} \leq τ$. The failure time is then known to lie in the interval (L, R), where $L = \max \{U_{m} : U_{m} < T, m = 0, \dots, M\}$, $R = \min \{U_{m} : U_{m} \geq T, m = 1, \dots, M + 1\}$, $U_{0} = A$, and $U_{M + 1} = \infty$. In particular, if the failure occurs before the first examination time, then $(L, R) = (A, U_{1})$; if the failure has not occurred at the last examination time, then $(L, R) = (U_{M}, \infty)$. Let $V_{m} = U_{m} - A$ for m = 0, …, M, so that $V_{0} = 0$.
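The bracketing rule for (L, R) can be illustrated with a short sketch. The function name `bracket_failure_time` and its argument names are hypothetical and chosen only for this illustration; in practice T is not observed, so the function is useful only for simulated data.

```python
import math
from bisect import bisect_left


def bracket_failure_time(t, a, exam_times):
    """Return (L, R) for failure time t, truncation time a, and increasing
    examination times U_1 < ... < U_M (all larger than a)."""
    if t < a:
        raise ValueError("T < A: this subject would not enter the prevalent cohort")
    # U_0 = A, U_1, ..., U_M, U_{M+1} = infinity
    grid = [a] + list(exam_times) + [math.inf]
    # R is the first U_m (m >= 1) with U_m >= T; L is the grid point just before it.
    r_idx = max(bisect_left(grid, t), 1)
    return grid[r_idx - 1], grid[r_idx]
```

With A = 1 and examinations at 2, 4, and 6, a failure at T = 3 is bracketed by (2, 4), a failure at T = 1.5 by (A, U₁) = (1, 2), and a failure at T = 7 by (6, ∞).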

We assume the following non-informative sampling time condition: M and $\{V_{m} : m = 1, \dots, M\}$ are independent of (T, A) conditional on Z. For a length-biased sample of n subjects, the observed data are $\{O_{i} : i = 1, \dots, n\}$, where $O_{i} = \{L_{i}, R_{i}, A_{i}, Z_{i}\}$. The observed-data likelihood is then given by

L_{n} (β, Λ) = \prod_{i = 1}^{n} \frac{\exp \{- Λ (L_{i}) \exp (β^{T} Z_{i})\} - I (R_{i} < \infty) \exp \{- Λ (R_{i}) \exp (β^{T} Z_{i})\}}{\int_{0}^{τ} \exp \{- Λ (a) \exp (β^{T} Z_{i})\} d a / τ} .

(1)

The likelihood function $L_{n} (β, Λ)$ involves τ, which is a constant related to study design but not a parameter of interest. In the E-step of the proposed EM algorithm, however, it is required to redistribute mass on [0, τ], and an approximation of τ is required. Let $0 = t_{0} < t_{1} < \dots < t_{k} < \infty$ be the ordered sequence of all $L_{i}$ and $R_{i} I (R_{i} < \infty)$. Following Qin et al. (2011), we approximate τ by $t_{k}$, which converges to τ at a rate faster than $n^{1/2}$, and therefore does not alter subsequent results.

2.2 |. Nonparametric maximum likelihood estimation

We adopt the nonparametric maximum likelihood estimation approach, where the estimator for Λ is a step function with nonnegative finite jumps at the ends of the intervals that bracket the failure times. Specifically, we let $λ_{0}, λ_{1}, \dots, λ_{k}$ be the respective jump sizes at $t_{0}, t_{1}, \dots, t_{k}$, where $λ_{0} = 0$. Write $λ = (λ_{1}, \dots, λ_{k})$. We maximize the objective function

l_{n} (β, λ) \equiv \sum_{i = 1}^{n} \left( \log \left[ \exp \left\{- \sum_{t_{j} \leq L_{i}} λ_{j} \exp (β^{T} Z_{i}) \right\} - I (R_{i} < \infty) \exp \left\{- \sum_{t_{j} \leq R_{i}} λ_{j} \exp (β^{T} Z_{i}) \right\} \right] - \log \int_{0}^{τ} \frac{1}{τ} \exp \left\{- \sum_{t_{j} \leq a} λ_{j} \exp (β^{T} Z_{i}) \right\} d a \right) .
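As an illustration, the objective function can be evaluated numerically for a given (β, λ). The sketch below assumes NumPy, approximates τ by $t_{k}$ as described in Section 2.1, and uses the fact that the integral of a step-function survival curve is a finite sum. The function name `log_likelihood` and its arguments are hypothetical.

```python
import numpy as np


def log_likelihood(beta, lam, t, L, R, Z):
    """Evaluate l_n(beta, lambda) for jumps lam[j] at t[j] (t_1 < ... < t_k).

    L, R are the interval endpoints (R = np.inf if right-censored) and
    Z is the (n, p) covariate matrix. tau is approximated by t_k.
    """
    beta, lam, t = (np.asarray(x, dtype=float) for x in (beta, lam, t))
    L, R = np.asarray(L, dtype=float), np.asarray(R, dtype=float)
    Z = np.asarray(Z, dtype=float).reshape(len(L), -1)
    grid = np.concatenate(([0.0], t))                  # t_0 = 0, t_1, ..., t_k
    cum = np.concatenate(([0.0], np.cumsum(lam)))      # Lambda(t_0), ..., Lambda(t_k)

    def Lambda(x):
        # right-continuous step function: sum of lam[j] over t_j <= x
        return cum[np.searchsorted(grid, x, side="right") - 1]

    risk = np.exp(Z @ beta)                            # exp(beta^T Z_i)
    surv_L = np.exp(-Lambda(L) * risk)
    surv_R = np.where(np.isfinite(R), np.exp(-Lambda(R) * risk), 0.0)
    # integral of exp{-Lambda(a) exp(beta^T Z_i)} over [0, tau], divided by tau;
    # the integrand is constant on each [t_j, t_{j+1})
    widths = np.diff(grid)
    surv_grid = np.exp(-np.outer(risk, cum[:-1]))
    denom = surv_grid @ widths / t[-1]
    return float(np.sum(np.log(surv_L - surv_R) - np.log(denom)))
```

Evaluating this function before and after each iteration of an optimization routine is one way to monitor progress.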

Direct maximization of $l_{n} (β, λ)$ is difficult due to a lack of analytical expressions. We introduce two layers of data augmentation and propose an EM algorithm to facilitate computation. First, to handle left truncation, we introduce pseudo-truncated data, which are also referred to as “ghost data” (Turnbull, 1976). In particular, let $O_{i}^{*} \equiv \{(T_{i m}^{*}, A_{i m}^{*}, Z_{i}) : T_{i m}^{*} < A_{i m}^{*}, m = 1, \dots, n_{i}\}$ denote the pseudo-truncated samples corresponding to subject i, where $(T_{i 1}^{*}, A_{i 1}^{*}), \dots, (T_{i, n_{i}}^{*}, A_{i, n_{i}}^{*})$ are independent and identically distributed given $Z_{i}$. Since the estimator for Λ only jumps at $t_{j}$ $(j = 1, \dots, k)$, the failure time $T_{i m}^{*}$ can only take values in $\{t_{1}, \dots, t_{k}\}$. The number of truncated samples $n_{i}$ follows a negative binomial distribution with parameter

π_{i} = P (T_{i m}^{*} < A_{i m}^{*} ∣ Z_{i}) = \sum_{j = 0}^{k} P (T_{i m}^{*} = t_{j}, A_{i m}^{*} > t_{j} ∣ Z_{i}) = \sum_{j = 1}^{k} (1 - t_{j} / τ) λ_{j} \exp (β^{T} Z_{i}) \exp \left\{- \sum_{l = 1}^{j} λ_{l} \exp (β^{T} Z_{i}) \right\},

such that $E (n_{i} ∣ O_{i}) = π_{i} / (1 - π_{i})$. Let $n_{i j} = \sum_{m = 1}^{n_{i}} I (T_{i m}^{*} = t_{j})$. Given the total number $n_{i}$, the vector $(n_{i 1}, \dots, n_{i k})$ follows a multinomial distribution with probabilities $(p_{i 1}, \dots, p_{i k})$, where

p_{i j} = P (T_{i m}^{*} = t_{j} ∣ T_{i m}^{*} < A_{i m}^{*}, Z_{i}) = \frac{(1 - t_{j} / τ) λ_{j} \exp (β^{T} Z_{i}) \exp \left\{- \sum_{l = 1}^{j} λ_{l} \exp (β^{T} Z_{i}) \right\}}{π_{i}} .

(2)
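For a single subject, the truncation probability $π_{i}$ and the multinomial probabilities in (2) can be computed as in the following sketch. It assumes NumPy and τ approximated by $t_{k}$. The function name `truncation_probabilities` is hypothetical.

```python
import numpy as np


def truncation_probabilities(beta, lam, t, z):
    """Return (pi_i, p_i) for one subject with covariate vector z.

    pi_i follows the displayed expression for P(T* < A* | Z_i), and
    p_i[j] is the probability in equation (2). tau is approximated by t_k.
    """
    lam, t = np.asarray(lam, dtype=float), np.asarray(t, dtype=float)
    risk = float(np.exp(np.dot(beta, z)))          # exp(beta^T Z_i)
    tau = t[-1]
    # (1 - t_j / tau) * lambda_j e^{beta^T Z} * exp{-sum_{l <= j} lambda_l e^{beta^T Z}}
    mass = (1.0 - t / tau) * lam * risk * np.exp(-np.cumsum(lam) * risk)
    pi = mass.sum()
    return pi, mass / pi
```

The expected number of ghost observations for the subject is then `pi / (1 - pi)`. Because the factor $1 - t_{k} / τ$ vanishes when τ is replaced by $t_{k}$, the last probability is zero under this approximation, and the function is undefined when k = 1.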

By the missing information principle (Lai and Ying, 1994), the maximization of $l_{n} (β, λ)$ is equivalent to maximizing the conditional expectation of the log-likelihood function of the “complete-data” $\{(O_{i}, O_{i}^{*}) : i = 1, \dots, n\}$ given the observed data. The “complete-data” log-likelihood function is given by

{\tilde{l}}_{n}^{C} (β, λ) \equiv \sum_{i = 1}^{n} \left( \log \left[ \exp \left\{- \sum_{t_{j} \leq L_{i}} λ_{j} \exp (β^{T} Z_{i}) \right\} - I (R_{i} < \infty) \exp \left\{- \sum_{t_{j} \leq R_{i}} λ_{j} \exp (β^{T} Z_{i}) \right\} \right] + \sum_{j = 1}^{k} n_{i j} \log \left[ λ_{j} \exp (β^{T} Z_{i}) \times \exp \left\{- \sum_{l = 1}^{j} λ_{l} \exp (β^{T} Z_{i}) \right\} \right] \right) .

While we may propose an EM algorithm based on ${\tilde{l}}_{n}^{C} (β, λ)$ , its maximization step is still difficult to obtain, since β and λ cannot be separated in the complete-data log-likelihood function due to the interval censoring structure.

Therefore, we further introduce data augmentation based on independent Poisson random variables $W_{i j} (i = 1, \dots, n; j = 1, \dots, k, t_{j} \leq R_{i}^{*})$ with means $λ_{j} \exp (β^{T} Z_{i})$ , where $R_{i}^{*} = L_{i} I (R_{i} = \infty) + R_{i} I (R_{i} < \infty)$ . The joint density function for $W_{i j} (j = 1, \dots, k, t_{j} \leq R_{i}^{*})$ is given by

\prod_{j = 1, t_{j} \leq R_{i}^{*}}^{k} \frac{\{λ_{j} \exp (β^{T} Z_{i})\}^{W_{i j}}}{W_{i j}!} \exp \{- λ_{j} \exp (β^{T} Z_{i})\} .

Let $N_{i 1} = \sum_{t_{j} \leq L_{i}} W_{i j}$ and $N_{i 2} = I (R_{i} < \infty) \sum_{L_{i} < t_{j} \leq R_{i}} W_{i j}$. Suppose that we observe $N_{i 1} = 0$ and $N_{i 2} > 0$. The observed-data likelihood for ${\tilde{O}}_{i} \equiv \{N_{i 1} = 0, N_{i 2} > 0\}$ is equal to

\Pr (N_{i 1} = 0, N_{i 2} > 0) = \Pr \left( \sum_{t_{j} \leq L_{i}} W_{i j} = 0 \right) \left\{ 1 - I (R_{i} < \infty) \Pr \left( \sum_{L_{i} < t_{j} \leq R_{i}} W_{i j} = 0 \right) \right\} = \exp \left\{- \sum_{t_{j} \leq L_{i}} λ_{j} \exp (β^{T} Z_{i}) \right\} - I (R_{i} < \infty) \exp \left\{- \sum_{t_{j} \leq R_{i}} λ_{j} \exp (β^{T} Z_{i}) \right\} .
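The last equality can be spelled out in one intermediate step. A sum of independent Poisson variables is again Poisson with mean equal to the sum of the means, so the probability that such a sum is zero is the exponential of minus that mean. The sums over $t_{j} \leq L_{i}$ and over $L_{i} < t_{j} \leq R_{i}$ involve disjoint sets of $W_{i j}$, which is why the two probabilities multiply:

```latex
\Pr\Big(\sum_{t_j \le L_i} W_{ij} = 0\Big)
  = \exp\Big\{-\sum_{t_j \le L_i} \lambda_j \exp(\beta^{T} Z_i)\Big\},
\qquad
\Pr\Big(\sum_{L_i < t_j \le R_i} W_{ij} = 0\Big)
  = \exp\Big\{-\sum_{L_i < t_j \le R_i} \lambda_j \exp(\beta^{T} Z_i)\Big\},

\exp\Big\{-\sum_{t_j \le L_i} \lambda_j \exp(\beta^{T} Z_i)\Big\}
\exp\Big\{-\sum_{L_i < t_j \le R_i} \lambda_j \exp(\beta^{T} Z_i)\Big\}
  = \exp\Big\{-\sum_{t_j \le R_i} \lambda_j \exp(\beta^{T} Z_i)\Big\}.
```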

Therefore, ${\tilde{l}}_{n}^{C} (β, λ)$ can be viewed as the observed log-likelihood function for $\tilde{O} \equiv \{({\tilde{O}}_{i}, O_{i}^{*}); i = 1, \dots, n\}$ with $W_{i j} (i = 1, \dots, n; j = 1, \dots, k, t_{j} \leq R_{i}^{*})$ and $n_{i j} (i = 1, \dots, n; j = 1, \dots, k)$ as latent variables. In particular, the complete-data log-likelihood function based on $(W_{i j}, n_{i j})$ is given by

\sum_{i = 1}^{n} \sum_{j = 1}^{k} I (t_{j} \leq R_{i}^{*}) \left\{- \log (W_{i j}!) + W_{i j} (\log λ_{j} + β^{T} Z_{i}) - λ_{j} \exp (β^{T} Z_{i}) \right\} + \sum_{i = 1}^{n} \sum_{j = 1}^{k} n_{i j} \left\{ \log (1 - t_{j} / τ) + \log λ_{j} + β^{T} Z_{i} - \sum_{l = 1}^{j} λ_{l} \exp (β^{T} Z_{i}) \right\} .

Based on this formulation, we propose the following EM algorithm. In the E-step, we evaluate the conditional expectations of W_ij and n_ij given the observed data. In particular, we have

\hat{E} (W_{i j}) = I (L_{i} < t_{j} \leq R_{i}, R_{i} < \infty) \times \frac{λ_{j} \exp (β^{T} Z_{i})}{1 - \exp \left\{- \sum_{L_{i} < t_{l} \leq R_{i}} λ_{l} \exp (β^{T} Z_{i}) \right\}},

and

\hat{E} (n_{i j}) = \frac{(1 - t_{j} / τ) λ_{j} \exp (β^{T} Z_{i}) \exp \left\{- \sum_{l = 1}^{j} λ_{l} \exp (β^{T} Z_{i}) \right\}}{1 - π_{i}},

where $\hat{E} (\cdot)$ denotes the conditional expectation given the observed data $\tilde{O}$. In the M-step, we maximize the expected complete-data log-likelihood function. We update $λ_{j}$ by

λ_{j} = \frac{\sum_{i = 1}^{n} \left\{ I (t_{j} \leq R_{i}^{*}) \hat{E} (W_{i j}) + \hat{E} (n_{i j}) \right\}}{\sum_{i = 1}^{n} \left\{ I (t_{j} \leq R_{i}^{*}) + \sum_{l = j}^{k} \hat{E} (n_{i l}) \right\} \exp (β^{T} Z_{i})}

and update β by solving

\sum_{i = 1}^{n} \sum_{j = 1}^{k} \left\{ I (t_{j} \leq R_{i}^{*}) \hat{E} (W_{i j}) + \hat{E} (n_{i j}) \right\} \times \left[ Z_{i} - \frac{\sum_{i^{'} = 1}^{n} \left\{ I (t_{j} \leq R_{i^{'}}^{*}) + \sum_{l = j}^{k} \hat{E} (n_{i^{'}, l}) \right\} \exp (β^{T} Z_{i^{'}}) Z_{i^{'}}}{\sum_{i^{'} = 1}^{n} \left\{ I (t_{j} \leq R_{i^{'}}^{*}) + \sum_{l = j}^{k} \hat{E} (n_{i^{'}, l}) \right\} \exp (β^{T} Z_{i^{'}})} \right] = 0 .

We iterate between the E-step and M-step until convergence. We denote the final estimators for β and λ as $\hat{β}$ and $\hat{λ}$ .
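The E-step and M-step expressions above can be sketched in code as follows. This is an illustrative sketch rather than the authors' implementation. It assumes NumPy, approximates τ by $t_{k}$, and starts from β = 0 and $λ_{j} = 1/k$. Two choices are assumptions not taken from the text: the estimating equation for β is advanced by a single Newton step per iteration instead of being solved exactly, and the stopping rule is an absolute-change threshold. The function name `em_fit` is hypothetical.

```python
import numpy as np


def em_fit(t, L, R, Z, n_iter=200, tol=1e-8):
    """EM iterations for (beta, lambda); t holds the jump points t_1 < ... < t_k."""
    t, L, R = (np.asarray(x, dtype=float) for x in (t, L, R))
    Z = np.asarray(Z, dtype=float).reshape(len(L), -1)
    k, p, tau = len(t), Z.shape[1], t[-1]
    finite = np.isfinite(R)
    R_star = np.where(finite, R, L)
    at_risk = (t[None, :] <= R_star[:, None]).astype(float)      # I(t_j <= R_i*)
    in_int = (t[None, :] > L[:, None]) & (t[None, :] <= R[:, None]) & finite[:, None]
    beta, lam = np.zeros(p), np.full(k, 1.0 / k)
    for _ in range(n_iter):
        risk = np.exp(Z @ beta)
        mean_w = lam[None, :] * risk[:, None]                    # Poisson means of W_ij
        # E-step: E(W_ij) is nonzero only for L_i < t_j <= R_i < infinity
        denom = 1.0 - np.exp(-(mean_w * in_int).sum(axis=1))
        safe = np.where(denom > 0, denom, 1.0)
        EW = np.where(in_int, mean_w / safe[:, None], 0.0)
        # E-step: E(n_ij) = (1 - t_j/tau) lam_j e^{bZ} exp{-Lambda(t_j) e^{bZ}} / (1 - pi_i)
        mass = (1.0 - t / tau)[None, :] * mean_w * np.exp(-np.cumsum(lam)[None, :] * risk[:, None])
        En = mass / (1.0 - mass.sum(axis=1, keepdims=True))
        events = EW + En                                         # numerator terms
        tail = np.cumsum(En[:, ::-1], axis=1)[:, ::-1]           # sum_{l >= j} E(n_il)
        weight = at_risk + tail                                  # denominator terms
        # M-step for beta: one Newton step on the displayed estimating equation
        d = events.sum(axis=0)
        wr = weight * risk[:, None]
        S0 = wr.sum(axis=0)
        S1 = wr.T @ Z
        S2 = np.einsum("ij,ia,ib->jab", wr, Z, Z)
        zbar = S1 / S0[:, None]
        score = events.sum(axis=1) @ Z - d @ zbar
        info = np.einsum("j,jab->ab", d, S2 / S0[:, None, None]) - (zbar * d[:, None]).T @ zbar
        beta_new = beta + np.linalg.solve(info, score)
        # M-step for lambda: closed-form update evaluated at the new beta
        lam_new = d / (weight * np.exp(Z @ beta_new)[:, None]).sum(axis=0)
        change = np.abs(beta_new - beta).sum() + np.abs(lam_new - lam).sum()
        beta, lam = beta_new, lam_new
        if change < tol:
            break
    return beta, lam
```

The jump points can be obtained as the sorted distinct values of all $L_{i}$ and finite $R_{i}$, for example with `np.unique(np.concatenate([L, R[np.isfinite(R)]]))`. Fixing β and skipping the Newton step gives the restricted update used for the profile likelihood in Section 2.4.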

In summary, through introducing two layers of latent random variables, we proposed a stable computing algorithm to obtain the estimators that maximize the nonparametric likelihood function. The latent truncated “ghost data” were introduced to deal with the complications that arise from left truncation and the latent Poisson random variables were introduced to deal with the incomplete data caused by interval-censoring.

2.3 |. Asymptotic properties

In this section, we establish the strong consistency and asymptotic normality of the proposed estimators. We assume the following regularity conditions.

Condition 1. The true value of β, denoted by β₀, belongs to the interior of a known compact set $B \subset ℝ^{p}$ .

Condition 2. The true value Λ₀(·) of Λ(·) is strictly increasing and continuously differentiable in [0, τ] with Λ₀(0) = 0.

Condition 3. The covariate Z has bounded support and is not concentrated on any proper subspace of $ℝ^{p}$ .

Condition 4. The examination times have finite support $V$ with the least upper bound τ. The number of potential examination times M is positive with E(M) < ∞. There exists a positive constant η such that $\Pr (U_{m + 1} - U_{m} \geq η ∣ M, Z) = 1$. In addition, there exists a probability measure μ in $V$ such that the bivariate distribution function of $(U_{m}, U_{m + 1})$ conditional on (M, Z) is dominated by μ × μ and its Radon-Nikodym derivative, denoted by ${\tilde{f}}_{m} (u, v; M, Z)$, can be expanded to a positive and twice-continuously differentiable function in the set $\{(u, v) : 0 \leq u \leq τ, 0 \leq v \leq τ, v - u \geq η\}$.

Conditions 1, 2, and 3 are standard conditions for failure time regression. Condition 4 pertains to the joint distribution of examination times. It requires that two adjacent examination times are separated by at least η; otherwise, the data may contain exact observations, which require a different theoretical treatment. The dominating measure μ is chosen as the Lebesgue measure if the examination times are continuous random variables and as the counting measure if the examinations occur only at a finite number of time points. The number of potential examination times M can be fixed or random, is possibly different among study subjects, and is allowed to depend on covariates.

We state the strong consistency of $(\hat{β}, \hat{λ})$ and the weak convergence of $\hat{β}$ in two theorems.

Theorem 1.

Under Conditions 1–4, $\| \hat{β} - β_{0} \| \to_{a.s.} 0$ and $\| \hat{Λ} - Λ_{0} \|_{l^{\infty} (V)} \to_{a.s.} 0$, where $\| \cdot \|_{l^{\infty} (V)}$ denotes the supremum norm on $V$, and $\hat{Λ} (t) = \sum_{t_{j} \leq t} {\hat{λ}}_{j}$.

Theorem 2.

Under Conditions 1–4, $n^{1 / 2} (\hat{β} - β_{0})$ converges weakly to a p-dimensional zero-mean normal random vector with a covariance matrix that attains the semiparametric efficiency bound.

The proofs of both theorems are provided in Appendix A.

2.4 |. Variance estimation

We estimate the covariance matrix of $\hat{β}$ by a profile likelihood approach. Let ${\hat{Λ}}_{β} = \arg \max_{Λ \in C} \log L_{n} (β, Λ)$, where $C$ is the set of bounded step functions with non-negative jumps at $t_{l}$ (l = 1, …, k). The maximizer ${\hat{Λ}}_{β}$ can be obtained using the EM algorithm of Section 2.2 if we fix β and only update λ in the M-step. The profile log-likelihood function is defined as

p l_{n} (β) = \max_{Λ \in C} \log L_{n} (β, Λ) = \log L_{n} (β, {\hat{Λ}}_{β}) .

Let $\tilde{p} l_{i} (β)$ be the ith subject’s contribution to $p l_{n} (β)$ . We estimate the covariance matrix of $\hat{β}$ by the inverse of

\sum_{i = 1}^{n} \left( \begin{matrix} \frac{\tilde{p} l_{i} (\hat{β} + h_{n} e_{1}) - \tilde{p} l_{i} (\hat{β})}{h_{n}} \\ \vdots \\ \frac{\tilde{p} l_{i} (\hat{β} + h_{n} e_{p}) - \tilde{p} l_{i} (\hat{β})}{h_{n}} \end{matrix} \right)^{\otimes 2},

where $e_{j}$ is the jth canonical vector in $ℝ^{p}$, $a^{\otimes 2} = a a^{T}$, and $h_{n}$ is a constant of order $n^{-1/2}$. In the numerical studies, we used $h_{n} = 5 n^{-1/2}$ as suggested by Zeng et al. (2016).

The above profile likelihood approach differs from that of Murphy and van der Vaart (2000), who estimate the covariance matrix of $\hat{β}$ by the negative inverse of the Hessian matrix of $p l_{n} (β)$ at $\hat{β}$, obtained by second-order numerical differences. That estimated matrix may not be positive semidefinite, especially in small samples. Here, we estimate the covariance matrix by the inverse of the empirical covariance matrix of the gradient of $\tilde{p} l_{i} (β)$ using first-order numerical differences, similar to Zeng et al. (2017). The calculation is quicker than the approach requiring second-order numerical differences, and the estimated covariance matrix is guaranteed to be positive semidefinite.

3 |. NUMERICAL STUDIES

3.1 |. Simulation

We conducted simulation studies to evaluate the performance of the proposed methods. We considered two covariates Z₁ ∼ Bernoulli(0.5) and Z₂ ∼ Uniform(−0.5, 0.5). We set β = (0.5, 1)^T, Λ(t) = 0.3t, and τ = 15. We generated the truncation time $\tilde{A}$ from Uniform(0, τ) and generated the sequence of potential examination times as $U_{m} = U_{m - 1} + 0.1 + \text{Uniform} (0, 2)$ with $U_{0} = \tilde{A}$. We set n = 100, 200, or 400 and examined 10,000 replicates for each sample size. We compared the proposed nonparametric maximum likelihood method with the maximum conditional likelihood method of Pan and Chappell (1998), which is applicable to left-truncated interval-censored data. For a coherent comparison, we computed both estimators by EM algorithms, where the algorithm for the conditional likelihood estimator is an adaptation of the proposed algorithm and is given in Appendix B. We set the initial value of β to 0 and the initial value of $λ_{l}$ to 1/k.
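A data-generating step consistent with this description might look like the sketch below, which assumes NumPy. Two details are assumptions because the text does not specify them: subjects are drawn by rejection until $\tilde{T} \geq \tilde{A}$, and examinations are scheduled until the next one would exceed τ. The function name `simulate_subject` is hypothetical.

```python
import numpy as np


def simulate_subject(rng, beta=(0.5, 1.0), rate=0.3, tau=15.0):
    """Draw one length-biased, interval-censored subject; returns (L, R, A, Z)."""
    beta = np.asarray(beta, dtype=float)
    while True:
        z = np.array([rng.binomial(1, 0.5), rng.uniform(-0.5, 0.5)])
        # Lambda(t) = rate * t, so the failure time is exponential
        # with hazard rate * exp(beta^T z)
        t = rng.exponential(1.0 / (rate * np.exp(beta @ z)))
        a = rng.uniform(0.0, tau)
        if t >= a:                        # prevalent-cohort inclusion
            break
    exams, u = [], a
    while True:
        u = u + 0.1 + rng.uniform(0.0, 2.0)
        if u > tau:                       # assumption: no examinations after tau
            break
        exams.append(u)
    grid = [a] + exams + [np.inf]         # U_0 = A, ..., U_{M+1} = infinity
    r_idx = next(m for m in range(1, len(grid)) if grid[m] >= t)
    return grid[r_idx - 1], grid[r_idx], a, z
```

A sample of size n can be drawn with `rng = np.random.default_rng(seed)` followed by `[simulate_subject(rng) for _ in range(n)]`.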

Table 1 summarizes the simulation results on the estimation of β using the proposed and conditional likelihood approaches. The biases for the proposed estimators are small and decrease as sample size increases. The biases for the conditional likelihood estimators are larger than those for the proposed estimators for all sample sizes, but decrease as sample size increases. The variance estimators for $\hat{β}$ using both approaches are accurate and the confidence intervals have proper coverage probabilities. As expected, the proposed estimator shows substantial efficiency gain compared to the conditional likelihood estimator. Web Figure S1 in the Supplementary Materials gives the estimated baseline survival functions. The nonparametric maximum likelihood estimation gives unbiased estimates, while the conditional likelihood estimators tend to underestimate the true values, as indicated in Pan and Chappell (1998).

TABLE 1.

Summary statistics for the simulation studies with length-biased assumption

		NPMLE					CLE
		Bias	SE	SEE	RMSE	CP	Bias	SE	SEE	RMSE	CP
n=100	β ₁	0.008	0.169	0.171	0.169	0.956	0.049	0.253	0.235	0.258	0.928
	β ₂	0.015	0.295	0.317	0.295	0.965	0.089	0.444	0.436	0.453	0.942
n=200	β ₁	0.004	0.117	0.116	0.117	0.950	0.025	0.170	0.159	0.172	0.933
	β ₂	0.005	0.206	0.212	0.206	0.957	0.044	0.296	0.290	0.299	0.943
n=400	β ₁	0.002	0.082	0.081	0.082	0.944	0.014	0.115	0.111	0.116	0.937
	β ₂	0.002	0.144	0.146	0.144	0.951	0.024	0.200	0.199	0.202	0.949

Note: NPMLE and CLE are the nonparametric maximum likelihood and conditional likelihood estimators. SE, SEE, RMSE, and CP are the empirical standard error, mean standard error estimator, root mean squared error, and empirical coverage probability of the 95% confidence interval, respectively.
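The RMSE column can be related to the Bias and SE columns through the usual decomposition, in which the squared RMSE is approximately the squared bias plus the squared empirical standard error. The short check below uses the n = 100 rows of Table 1. The helper name `rmse_from_bias_se` is hypothetical, and exact agreement is not expected because the table entries are rounded.

```python
import math


def rmse_from_bias_se(bias, se):
    """Root mean squared error implied by a bias and an empirical standard error."""
    return math.hypot(bias, se)


# (bias, SE, reported RMSE) for n = 100 in Table 1: NPMLE and CLE for beta_1, then beta_2
rows = [
    (0.008, 0.169, 0.169),
    (0.049, 0.253, 0.258),
    (0.015, 0.295, 0.295),
    (0.089, 0.444, 0.453),
]
for bias, se, reported in rows:
    print(round(rmse_from_bias_se(bias, se), 3), reported)
```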

We further assess the robustness of the nonparametric maximum likelihood estimator when the uniform assumption for the truncation time does not hold. In particular, we considered the same simulation setting but generated the truncation time $\tilde{A}$ from Exp(0.1) such that the stationary incidence assumption is violated. Table 2 shows the simulation results. The nonparametric maximum likelihood estimators are slightly biased, while the coverages of the 95% confidence intervals are acceptable. Even though the bias of the proposed estimator is larger than the conditional likelihood estimator when sample size is large (n = 400) and length-biased sampling is violated, the mean squared error of the proposed estimator is still smaller.

TABLE 2.

Summary statistics for the simulation studies without length-biased assumption

		NPMLE					CLE
		Bias	SE	SEE	RMSE	CP	Bias	SE	SEE	RMSE	CP
n=100	β ₁	−0.039	0.160	0.167	0.165	0.953	0.045	0.244	0.231	0.248	0.933
	β ₂	−0.077	0.285	0.307	0.295	0.951	0.083	0.442	0.428	0.450	0.941
n=200	β ₁	−0.044	0.111	0.113	0.119	0.933	0.023	0.164	0.157	0.165	0.935
	β ₂	−0.088	0.198	0.205	0.217	0.930	0.045	0.293	0.285	0.296	0.943
n=400	β ₁	−0.047	0.078	0.078	0.091	0.899	0.012	0.113	0.109	0.113	0.938
	β ₂	−0.095	0.137	0.140	0.167	0.894	0.021	0.198	0.196	0.199	0.947

3.2 |. Massachusetts health care panel study

We apply the proposed methods to the Massachusetts Health Care Panel Study, which has been described and analyzed previously (Chappell, 1991; Pan and Chappell, 1998; Hudgens, 2005). The study aimed at assessing the rate at which elderly individuals lose active life, which is defined as a continued ability to perform various activities of daily living such as dressing and bathing. The study began in 1975 with a baseline survey of Massachusetts residents over the age of 65. Since only subjects who were active at baseline were included, the time to loss of active life was subject to left truncation. Three follow-up waves were then taken at 1.25, 6, and 10 years after baseline to determine whether subjects were still living actively, so the time to loss of active life was also interval-censored.

The data set includes 1286 subjects with enrollment ages ranging from 65 to 97.3. Since the study population was defined to be over age 65, we consider the failure time as age at loss of active life minus 65. Since the subjects are active at enrollment, the truncation time is the age at enrollment minus 65. We applied the proposed methods to study the association between loss of active life and gender.

Table 3 shows the estimation results for the regression parameter in the Massachusetts Health Care Panel Study. The point estimates from the proposed approach and the conditional likelihood approach are both positive, indicating that male subjects are associated with a higher risk of losing active life than females. The standard error estimate of the proposed nonparametric maximum likelihood estimator is smaller than that of the conditional likelihood estimator, so that at a 5% significance level, the null hypothesis of no association between gender and loss of active life is only rejected by the nonparametric maximum likelihood approach.

TABLE 3.

Estimation results for the regression parameter in the Massachusetts Health Care Panel Study

	NPMLE			CLE
Covariate	Estimate	Std Err	p-value	Estimate	Std Err	p-value
Male	0.144	0.059	0.014	0.133	0.076	0.081

Note: NPMLE and CLE are the nonparametric maximum likelihood and conditional likelihood estimators.
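The p-values in Table 3 are consistent with two-sided Wald tests based on the normal approximation of Theorem 2. This reading is an assumption, since the text does not name the test. A minimal sketch using only the Python standard library follows. The function name `wald_p_value` is hypothetical.

```python
import math


def wald_p_value(estimate, std_err):
    """Two-sided p-value for H0: coefficient = 0 under a normal approximation."""
    z = estimate / std_err
    # 2 * (1 - Phi(|z|)) written with the complementary error function
    return math.erfc(abs(z) / math.sqrt(2.0))


p_npmle = wald_p_value(0.144, 0.059)   # rounded NPMLE entries of Table 3
p_cle = wald_p_value(0.133, 0.076)     # rounded CLE entries of Table 3
```

Applied to the rounded estimates and standard errors, the function gives values close to the reported 0.014 and 0.081. Small discrepancies are expected because the published inputs are rounded to three decimals.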

Figure 1 shows the estimated survival probabilities for male and female subjects using the two approaches. Even though they give similar estimates for the survival probabilities, the nonparametric maximum likelihood approach gives an estimate with finer jumps, resulting from the additional assumption on the truncation time. Using both approaches, the female subjects have a higher survival probability than the male subjects.

FIGURE 1.

Estimated survival probabilities for subgroups of subjects in the Massachusetts Health Care Panel Study. The solid and dashed curves pertain to the nonparametric maximum likelihood and conditional likelihood estimation approaches, respectively.

If the length-biased sampling assumption does not hold, the regression coefficient estimators from the two methods would converge to different values. Therefore, to see whether the stationarity assumption for the truncation time holds, we estimated the difference between the two estimators and constructed a 95% confidence interval by bootstrapping with 1000 replications. The difference between the two estimators was 0.011 with a 95% confidence interval of (−0.092, 0.107), indicating that the length-biased assumption is plausible.
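The bootstrap comparison can be sketched generically as below. The text does not say which kind of bootstrap interval was used, so the percentile interval here is an assumption. The callables `est_a` and `est_b` are hypothetical stand-ins for the two fitting procedures, and the function name `bootstrap_difference_ci` is also an assumption.

```python
import numpy as np


def bootstrap_difference_ci(data, est_a, est_b, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for est_a(data) - est_b(data).

    data: array whose first axis indexes subjects; est_a and est_b map a
    resampled data array to a scalar estimate (hypothetical callables).
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        sample = data[rng.integers(0, n, size=n)]    # resample subjects with replacement
        diffs[b] = est_a(sample) - est_b(sample)
    alpha = 1.0 - level
    lo, hi = np.quantile(diffs, [alpha / 2.0, 1.0 - alpha / 2.0])
    return est_a(data) - est_b(data), (lo, hi)
```

An interval that covers zero, as reported above, gives no evidence against the stationarity assumption, although it does not prove that the assumption holds.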

4 |. DISCUSSION

In this paper, we adopt nonparametric maximum likelihood estimation in which the estimator for Λ is a right-continuous step function, as is usually considered in the literature. As mentioned by a reviewer, if Λ is only restricted to be nondecreasing, then the true maximizer of the likelihood should involve a left-continuous Λ, that is, $Λ (t) = Λ (t_{j + 1})$ on $(t_{j}, t_{j + 1}]$ for j = 0, …, k − 1. The two versions are asymptotically equivalent since any two adjacent step points get closer as the sample size increases. In Web Appendix A of the Supplementary Materials, we implemented the version with left-continuous Λ and demonstrated that the numerical difference between the two versions is negligible.

The iterative convex minorant (ICM) algorithm (Pan, 1999) is an alternative to the EM algorithm adopted in this paper for obtaining the nonparametric maximum likelihood estimators for interval-censored data. Even though it is generally faster than the EM algorithm considered here, it may become unstable for large datasets because it attempts to update a large number of parameters simultaneously using a quasi-Newton method (Zeng et al., 2016). Wang et al. (2016) also advocated the use of an EM algorithm by comparing it with the R package intcox (Henschel and Mansmann, 2013), which adopts the algorithm of Pan (1999). They found that the ICM algorithm often exhibits larger biases, indicating that it may not converge to the true maximizer of the likelihood function.

In this paper, we studied nonparametric maximum likelihood estimation of the proportional hazards model for length-biased interval-censored data. Although length-biasedness requires a stationary incidence distribution for the initial event, the proposed methods can be extended to situations where the incidence distribution follows a parametric model (Huang et al., 2015). In that case, the denominator of the individual components in $L_{n} (β, Λ)$ needs to be modified according to the distribution of the truncation time, and the proposed EM algorithm can be adjusted accordingly.

The efficiency gain of the proposed nonparametric maximum likelihood estimators over the conditional likelihood estimators mainly comes from the information in the (uniform) distributional assumption on the truncation time. The conditional likelihood estimators, in turn, are more robust to violations of this assumption. In practice, one needs to assess the assumption carefully before applying the proposed approach. For right-censored left-truncated data, graphical methods (Wang, 1991; Asgharian et al., 2006) and a goodness-of-fit test (Mandel and Betensky, 2007) have been proposed to test the length-biasedness assumption. In the numerical examples, we applied a bootstrap to the difference between the nonparametric maximum likelihood and conditional likelihood estimators as an indirect test of length-biasedness. The diagnostic methods for right-censored data cannot be directly extended to the interval-censoring case, since the estimator for the survival function converges at a different rate, $n^{1/3}$. Formal tests for the length-biased assumption with interval-censored data will be developed in the future.

The individuals in the MHCPS data may also be subject to the risk of a competing cause, for example, death, such that the subjects who died before loss of active life were right-censored. In addition, there may be a selection effect such that only living subjects were included in the study. Therefore, the assumptions of conditionally independent censoring and truncation times may be questionable. However, the information on the cause of right censoring is not available in the MHCPS data, so we are not able to assess the validity of the assumptions. The regression analysis of competing risks interval-censored data has been studied (Mao et al., 2017); however, no existing method considers the scenario in which the competing risks interval-censored data are also subject to left truncation. Methods incorporating such complications are an important topic for future research.

Supplementary Material

NIHMS1045281-supplement-S1.pdf (211.7 KB, pdf)

ACKNOWLEDGMENTS

The authors thank Prof. Rick Chappell for sharing the data from the Massachusetts Health Care Panel Study. We also thank the associate editor and two referees for constructive feedback that improved the paper. The authors were partially supported by the National Heart, Lung, and Blood Institute of the U.S. National Institutes of Health.

Appendix A

Proof of asymptotic results

We use $ℙ_{n}$ to denote the empirical measure from n independent subjects and $ℙ$ to denote the true probability measure. Write $G_{n} = n^{1 / 2} (ℙ_{n} - ℙ)$ . Let L(β, Λ) be the observed-data likelihood for a single subject

L (β, Λ) = \frac{\sum_{m = 0}^{M} Δ_{m} [\exp {- Λ (U_{m}) \exp (β^{T} Z)} - \exp {- Λ (U_{m + 1}) \exp (β^{T} Z)}]}{\int_{0}^{τ} \exp {- Λ (a) \exp (β^{T} Z)} d a / τ},

where $Δ_{m} = I (U_{m} < T \leq U_{m + 1})$ .
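As an illustration of how this likelihood can be evaluated numerically, the following sketch computes L(β, Λ) for one subject whose failure time is known to lie in (left, right]. It is not the authors' code: the function name, the representation of Λ as a Python callable, and the use of a composite trapezoid rule for the integral in the denominator are all assumptions made for this example.

```python
import math


def single_subject_likelihood(beta, z, Lambda, left, right, tau, n_grid=2000):
    """Evaluate L(beta, Lambda) for a failure time known to lie in (left, right].

    beta, z : sequences of equal length (regression coefficients, covariates)
    Lambda  : callable giving the cumulative baseline hazard Lambda(t)
    right   : may be math.inf, in which case S(right) is taken as 0
    tau     : upper limit of the integral in the denominator
    """
    risk = math.exp(sum(b * x for b, x in zip(beta, z)))

    def surv(t):
        # S(t) = exp{-Lambda(t) exp(beta^T Z)}
        return math.exp(-Lambda(t) * risk)

    # Numerator: S(U_m) - S(U_{m+1}) for the interval with Delta_m = 1.
    upper = 0.0 if math.isinf(right) else surv(right)
    numerator = surv(left) - upper

    # Denominator: (1 / tau) * integral of S(a) over [0, tau], trapezoid rule.
    h = tau / n_grid
    integral = 0.5 * (surv(0.0) + surv(tau))
    integral += sum(surv(i * h) for i in range(1, n_grid))
    integral *= h
    return numerator / (integral / tau)
```

For instance, with Λ(t) = t, β = 0, the interval (1, 2] and τ = 5, the result should be close to (e^{-1} − e^{-2}) / {(1 − e^{-5}) / 5}, which gives a quick sanity check of the implementation.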

Write $l (β, Λ) = \log L (β, Λ)$ . Let $\tilde{Λ}$ be a step function that takes jumps only at t₁, …, t_k with $\tilde{Λ} (t_{j}) = Λ_{0} (t_{j})$ for j = 1,…, k. Let

m (β, Λ) = \log {\frac{L (β, Λ) + L (β_{0}, \tilde{Λ})}{2}}

and

M = {m (β, Λ) : β \in B, Λ \in D_{M}},

where D_M = {Λ : Λ is increasing with Λ(0) = 0, Λ(τ) ≤ M}, and M < ∞. The proofs make use of two lemmas, whose proofs are given in Web Appendix B.

Lemma 1.

Under Conditions 1–4, the class of functions $M$ is ℙ-Donsker.

Lemma 2.

Under Conditions 1–4,

E [\sum_{m = 0}^{M} {\hat{Λ} (U_{m}) - Λ_{0} (U_{m})}^{2}] = O_{P} (n^{- 2 / 3}) + O ({‖ \hat{β} - β_{0} ‖}^{2}) .

Proof of Theorem 1

The jump points {t₁,..., t_k} depend on the sample size n and, for any ϵ > 0, ⋃_jB_ϵ(t_j) covers the support $V$ as n → ∞, where B_r(t) is the open ball around t with radius r. By the continuity of Λ₀, $\tilde{Λ} (t)$ converges uniformly to Λ₀(t). It follows from Lemma 1 that the class $M$ is Donsker. By the concavity of the log function,

ℙ_{n} m (\hat{β}, \hat{Λ}) \geq \frac{1}{2} {ℙ_{n} l (\hat{β}, \hat{Λ}) + ℙ_{n} l (β_{0}, \tilde{Λ})} \geq ℙ_{n} l (β_{0}, \tilde{Λ}) = ℙ_{n} m (β_{0}, \tilde{Λ}) .

Since $B$ is bounded, for any subsequence of $\hat{β}$ , we can find a further subsequence converging to some $β_{*} \in B$ . In addition, by Helly’s selection lemma, for any subsequence of $\hat{Λ}$ , there exists a further subsequence that converges to some increasing function Λ_*. Passing to these subsequences, we may assume without loss of generality that $\hat{β} \to β_{*}$ and $\hat{Λ} \to Λ_{*}$ pointwise on any interior set of $V$ . Therefore,

0 \leq ℙ_{n} m (\hat{β}, \hat{Λ}) - ℙ_{n} m (β_{0}, \tilde{Λ}) = ℙ \log \frac{L (\hat{β}, \hat{Λ}) + L (β_{0}, \tilde{Λ})}{2 L (β_{0}, \tilde{Λ})} + o_{P} (1) = ℙ \log {\frac{1}{2} + \frac{L (β_{*}, Λ_{*})}{2 L (β_{0}, \tilde{Λ})}} + o_{P} (1),

such that the negative Kullback-Leibler information is nonnegative. Because it is also nonpositive by Jensen's inequality, it must equal zero. Therefore,

\frac{\sum_{m = 0}^{M} Δ_{m} [\exp {- Λ_{*} (U_{m}) \exp (β_{*}^{T} Z)} - I (R < \infty) \exp {- Λ_{*} (U_{m + 1}) \exp (β_{*}^{T} Z)}]}{\int_{0}^{τ} \exp {- Λ_{*} (a) \exp (β_{*}^{T} Z)} d a} = \frac{\sum_{m = 0}^{M} Δ_{m} [\exp {- Λ_{0} (U_{m}) \exp (β_{0}^{T} Z)} - I (R < \infty) \exp {- Λ_{0} (U_{m + 1}) \exp (β_{0}^{T} Z)}]}{\int_{0}^{τ} \exp {- Λ_{0} (a) \exp (β_{0}^{T} Z)} d a}

with probability 1. For any m ∈ {0,..., M}, we set Δ_m′ = 1 in the above equation for m′ = m,..., M and take the sum of the resulting equations to obtain

\frac{\exp {- Λ_{*} (U_{m}) \exp (β_{*}^{T} Z)}}{\int_{0}^{τ} \exp {- Λ_{*} (a) \exp (β_{*}^{T} Z)} d a} = \frac{\exp {- Λ_{0} (U_{m}) \exp (β_{0}^{T} Z)}}{\int_{0}^{τ} \exp {- Λ_{0} (a) \exp (β_{0}^{T} Z)} d a} .

Because m is arbitrary, we can replace U_m in the above equation by any t ∈ $V$ . We take the logarithm and differentiate both sides with respect to t to find

Λ_{*}^{'} (t) \exp (β_{*}^{T} Z) = Λ_{0}^{'} (t) \exp (β_{0}^{T} Z),

such that β_* = β₀ and $Λ_{*}^{'} (t) = Λ_{0}^{'} (t)$ for t ∈ $V$ . Hence, Λ_*(t) = Λ₀(t) for t ∈ $V$ . We conclude that $‖ \hat{β} - β_{0} ‖ \to 0$ and $| \hat{Λ} (t) - Λ_{0} (t) | \to 0$ for any t ∈ $V$ . Because Λ₀ is continuous, $\hat{Λ}$ converges uniformly to Λ₀ on $V$ .

Proof of Theorem 2.

Let

Q_{1} (t, u, v; β, Λ) = \exp (β^{T} Z) \frac{I (t \leq v) \exp {- Λ (v) \exp (β^{T} Z)} - I (t \leq u) \exp {- Λ (u) \exp (β^{T} Z)}}{\exp {- Λ (u) \exp (β^{T} Z)} - \exp {- Λ (v) \exp (β^{T} Z)}},

Q_{2} (t; β, Λ) = \exp (β^{T} Z) \times \frac{\int_{0}^{τ} I (t \leq a) \exp {- Λ (a) \exp (β^{T} Z)} d a}{\int_{0}^{τ} \exp {- Λ (a) \exp (β^{T} Z)} d a},

and $Q (t; β, Λ) = \sum_{m = 0}^{M} Δ_{m} Q_{1} (t, U_{m}, U_{m + 1}; β, Λ) + Q_{2} (t; β, Λ)$ . The score function for β is given by

l_{β} (β, Λ) = Z \int Q (t; β, Λ) d Λ (t) .

The score operator for Λ along the submodel dΛ_ϵ,h = (1 + ϵh)dΛ for h ∈ L₂(μ) is given by

l_{Λ} (β, Λ) (h) = \int Q (t; β, Λ) h (t) d Λ (t) .
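As a worked illustration (not part of the original derivation), the integral in the score for β can be evaluated in closed form for a subject with Δ_m = 1. Writing S(t) for the survival function exp{−Λ(t) exp(β^T Z)}, a shorthand introduced only for this example, and using ∫ I(t ≤ x) dΛ(t) = Λ(x), one obtains

```latex
\int Q(t; \beta, \Lambda)\, d\Lambda(t)
  = e^{\beta^{T} Z} \left[
      \frac{\Lambda(U_{m+1})\, S(U_{m+1}) - \Lambda(U_{m})\, S(U_{m})}
           {S(U_{m}) - S(U_{m+1})}
      + \frac{\int_{0}^{\tau} \Lambda(a)\, S(a)\, da}
             {\int_{0}^{\tau} S(a)\, da}
    \right],
\qquad
S(t) = \exp\{-\Lambda(t)\, e^{\beta^{T} Z}\}.
```

Multiplying by Z gives the derivative of log L(β, Λ) with respect to β, which can be checked by differentiating the numerator and denominator of L(β, Λ) directly. When U_{m+1} = ∞, the term Λ(U_{m+1}) S(U_{m+1}) is interpreted as zero.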

Clearly,

G_{n} {l_{β} (\hat{β}, \hat{Λ})} = - \sqrt{n} ℙ {l_{β} (\hat{β}, \hat{Λ}) - l_{β} (β_{0}, Λ_{0})},

and

G_{n} {l_{Λ} (\hat{β}, \hat{Λ}) (h)} = - \sqrt{n} ℙ {l_{Λ} (\hat{β}, \hat{Λ}) (h) - l_{Λ} (β_{0}, Λ_{0}) (h)} .

We apply the Taylor series expansions at (β₀, Λ₀) to the right sides of the above two equations. In light of Lemma 2, the second-order terms are bounded by $O_{P} (n^{- 1 / 6} + \sqrt{n} ‖ \hat{β} - β_{0} ‖^{2})$ . Therefore,

G_{n} {l_{β} (\hat{β}, \hat{Λ})} = - \sqrt{n} ℙ {l_{β β} (\hat{β} - β_{0}) + l_{β Λ} (\hat{Λ} - Λ_{0})} + O_{P} (n^{- 1 / 6} + \sqrt{n} ‖ \hat{β} - β_{0} ‖^{2}),

and

G_{n} {l_{Λ} (\hat{β}, \hat{Λ}) (h)} = - \sqrt{n} ℙ {l_{Λ β} (h) (\hat{β} - β_{0}) + l_{ΛΛ} (h, \hat{Λ} - Λ_{0})} + O_{P} (n^{- 1 / 6} + \sqrt{n} ‖ \hat{β} - β_{0} ‖^{2}),

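where l_ββ is the derivative of l_β with respect to β, l_βΛ is the derivative of l_β along the submodel dΛ_ϵ,h, l_Λβ(h) is the derivative of l_Λ(h) with respect to β, and l_ΛΛ(h, $\hat{Λ}$ − Λ₀) is the derivative of l_Λ(h) along the submodel dΛ₀ + ϵd( $\hat{Λ}$ − Λ₀). All derivatives are evaluated at (β₀, Λ₀).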

If the least favorable direction exists, we denote it as h* ∈ L₂(μ)^p. We first show the existence of h*, which is the solution of $l_{Λ}^{*} l_{Λ} (h^{*}) = l_{Λ}^{*} l_{β},$ where $l_{Λ}^{*}$ is the adjoint operator of l_Λ. We equip L₂(μ) with an inner product defined as

< h^{(1)}, h^{(2)} > = \int h^{(1)} (t) h^{(2)} (t) d μ (t) .

On the same space, we define

∥ h ∥ = [ℙ {l_{Λ} (β_{0}, Λ_{0}) (h)^{2}}]^{1 / 2} = (ℙ [{\int Q (t; β_{0}, Λ_{0}) h (t) d Λ_{0} (t)}^{2}])^{1 / 2} .

It is easy to show that ∥·∥ is a seminorm on L₂(μ). Furthermore, if ∥h∥ = 0, then $ℙ {l_{Λ} (β_{0}, Λ_{0}) {(h)}^{2}} = 0$ . Thus, with probability 1, $l_{Λ} (β_{0}, Λ_{0}) (h) = 0$ . By the arguments in Lemma 2, h(t) = 0 for t ∈ $V$ , so ∥·∥ is in fact a norm. Clearly, ∥h∥ ≤ c ${〈 h, h 〉}^{1 / 2}$ for some constant c by the Cauchy-Schwarz inequality. According to the bounded inverse theorem in Banach spaces, we have ${〈 h, h 〉}^{1 / 2} \leq \tilde{c} ∥ h ∥$ for another constant $\tilde{c}$ . By the Lax-Milgram theorem (Zeidler, 1995), there exists $h^{*} \in L_{2} {(μ)}^{p}$ that satisfies

\int_{0}^{τ} ℙ {Q (t; β_{0}, Λ_{0}) Q (s; β_{0}, Λ_{0})} h^{*} (s) d Λ_{0} (s) = \int_{0}^{τ} ℙ {Z Q (t; β_{0}, Λ_{0}) Q (s; β_{0}, Λ_{0})} d Λ_{0} (s)

for t ∈ $V$ . Differentiation of the integral equations with respect to t yields

q_{1} (t) h^{*} (t) + \int_{t}^{τ} q_{2} (s, t) h^{*} (s) d s + \int_{0}^{t} q_{3} (s, t) h^{*} (s) d s = q_{4} (t),

where q₁(t) > 0, and q_j (j = 1, 2, 3) and q₄ are continuously differentiable functions. Thus, h* can be extended to a continuously differentiable function on $V$ with bounded total variation. It follows that

G_{n} {l_{β} (\hat{β}, \hat{Λ})} - G_{n} {l_{Λ} (\hat{β}, \hat{Λ}) (h^{*})} = - \sqrt{n} ℙ {l_{β β} (\hat{β} - β_{0}) + l_{β Λ} (\hat{Λ} - Λ_{0})} + \sqrt{n} ℙ {l_{Λ β} (h^{*}) (\hat{β} - β_{0}) + l_{ΛΛ} (h^{*}, \hat{Λ} - Λ_{0})} + O_{P} (n^{- 1 / 6} + \sqrt{n} ‖ \hat{β} - β_{0} ‖^{2}) = \sqrt{n} ℙ [{l_{β} (β_{0}, Λ_{0}) - l_{Λ} (β_{0}, Λ_{0}) (h^{*})}^{\otimes 2}] (\hat{β} - β_{0}) + O_{P} (n^{- 1 / 6} + \sqrt{n} ‖ \hat{β} - β_{0} ‖^{2}) .

Using the arguments in the proof of Lemma 1, we can show that $l_{β} (β_{0}, Λ_{0}) - l_{Λ} (β_{0}, Λ_{0}) (h^{*})$ belongs to a Donsker class. Next, we show that $ℙ [{l_{β} - l_{Λ} (h^{*})}^{\otimes 2}]$ is invertible. If the matrix is singular, then there exists a nonzero vector $v \in ℝ^{p}$ such that $v^{T} ℙ [{l_{β} - l_{Λ} (h^{*})}^{\otimes 2}] v = 0$ . It follows that, with probability 1, the score function along the submodel ${β_{0} + ϵ v, Λ_{ϵ, - v^{T} h^{*}}}$ is zero. That is,

\sum_{m = 0}^{M} Δ_{m} \int \frac{I (t \leq U_{m + 1}) \exp {- Λ_{0} (U_{m + 1}) \exp (β_{0}^{T} Z)} - I (t \leq U_{m}) \exp {- Λ_{0} (U_{m}) \exp (β_{0}^{T} Z)}}{\exp {- Λ_{0} (U_{m}) \exp (β_{0}^{T} Z)} - \exp {- Λ_{0} (U_{m + 1}) \exp (β_{0}^{T} Z)}} \times v^{T} {Z - h^{*} (t)} d Λ_{0} (t) + \int \frac{\int_{0}^{τ} I (t \leq a) \exp {- Λ_{0} (a) \exp (β_{0}^{T} Z)} d a}{\int_{0}^{τ} \exp {- Λ_{0} (a) \exp (β_{0}^{T} Z)} d a} v^{T} {Z - h^{*} (t)} d Λ_{0} (t) = 0.

For any m ∈ {0,…, M}, we sum over all possible Δ_m′ with m′ = m,…, M to obtain

- \int_{0}^{U_{m}} v^{T} {Z - h^{*} (t)} d Λ_{0} (t) + \int \frac{\int_{0}^{τ} I (t \leq a) \exp {- Λ_{0} (a) \exp (β_{0}^{T} Z)} d a}{\int_{0}^{τ} \exp {- Λ_{0} (a) \exp (β_{0}^{T} Z)} d a} v^{T} {Z - h^{*} (t)} d Λ_{0} (t) = 0.

Because m is arbitrary, we can replace U_m in the above equation by t ∈ $V$ . We differentiate both sides with respect to t to obtain

v^{T} {Z - h^{*} (t)} Λ_{0}^{'} (t) = 0

for any t ∈ $V$ . It then follows that v = 0, a contradiction. Hence, the matrix $ℙ [{l_{β} - l_{Λ} (h^{*})}^{\otimes 2}]$ is invertible.

Then, $\hat{β} - β_{0} = O_{P} (n^{- 1 / 2})$ , and

\sqrt{n} (\hat{β} - β_{0}) = {(ℙ [{l_{β} (β_{0}, Λ_{0}) - l_{Λ} (β_{0}, Λ_{0}) (h^{*})}^{\otimes 2}])}^{- 1} \times G_{n} {l_{β} (\hat{β}, \hat{Λ}) - l_{Λ} (\hat{β}, \hat{Λ}) (h^{*})} + o_{P} (1) .

The influence function for $\hat{β}$ is the efficient influence function, such that $\sqrt{n} (\hat{β} - β_{0})$ converges weakly to a zero-mean normal random vector whose covariance matrix attains the semiparametric efficiency bound.

Appendix B

An EM algorithm for maximum conditional likelihood estimation

A nonparametric maximum conditional likelihood estimator is considered in Pan and Chappell (1998) for the proportional hazards model with left-truncated interval-censored data. A slight variation of the proposed EM algorithm can be used to compute their estimator, and is used in the numerical comparisons.

The observed-data conditional likelihood given the truncation time is given by

\prod_{i = 1}^{n} \frac{\exp {- Λ (L_{i}) \exp (β^{T} Z_{i})} - I (R_{i} < \infty) \exp {- Λ (R_{i}) \exp (β^{T} Z_{i})}}{\exp {- Λ (A_{i}) \exp (β^{T} Z_{i})}} .

We estimate Λ nonparametrically such that the estimator for Λ is a step function that jumps only at t₁,…, t_k, which are the ordered sequence of all L_i, R_i, and A_i. We maximize the objective function

\sum_{i = 1}^{n} \log [\exp {- \sum_{A_{i} \leq t_{j} \leq L_{i}} λ_{j} \exp (β^{T} Z_{i})} - I (R_{i} < \infty) \exp {- \sum_{A_{i} \leq t_{j} \leq R_{i}} λ_{j} \exp (β^{T} Z_{i})}] .

We introduce a sequence of independent Poisson random variables $W_{i j} (i = 1, \dots, n; j = 1, \dots, k, A_{i} \leq t_{j} \leq R_{i}^{*})$ with means $λ_{j} \exp (β^{T} Z_{i})$ . Let $N_{i 1} = \sum_{A_{i} \leq t_{j} \leq L_{i}} W_{i j}$ , and $N_{i 2} = I (R_{i} < \infty) \sum_{L_{i} < t_{j} \leq R_{i}} W_{i j}$ . The objective function can be viewed as the observed-data log-likelihood for {N_i1 = 0, N_i2 > 0 : i = 1,…, n} with $W_{i j} (j = 1, \dots, k, A_{i} \leq t_{j} \leq R_{i}^{*})$ as latent variables. We propose an EM algorithm. In the E-step, we evaluate

\hat{E} (W_{i j}) = I (L_{i} < t_{j} \leq R_{i}, R_{i} < \infty) \times \frac{λ_{j} \exp (β^{T} Z_{i})}{1 - \exp {- \sum_{L_{i} < t_{l} \leq R_{i}} λ_{l} \exp (β^{T} Z_{i})}} .
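The form of this conditional expectation can be read off from a short calculation, sketched here under the Poisson augmentation stated above. Write μ_ij = λ_j exp(β^T Z_i) and G_i = {j : L_i < t_j ≤ R_i} for a subject with R_i < ∞ (both symbols are shorthand introduced only for this illustration). The event N_i1 = 0 forces W_ij = 0 for A_i ≤ t_j ≤ L_i and, by independence, carries no information about W_ij with j ∈ G_i. For j ∈ G_i, W_ij > 0 implies N_i2 > 0, so that

```latex
E(W_{ij} \mid N_{i2} > 0)
  = \frac{E\{W_{ij}\, I(N_{i2} > 0)\}}{P(N_{i2} > 0)}
  = \frac{E(W_{ij})}{1 - P(N_{i2} = 0)}
  = \frac{\mu_{ij}}{1 - \exp\bigl(-\sum_{l \in G_{i}} \mu_{il}\bigr)} .
```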

In the M-step, we update λ_j by

λ_{j} = \frac{\sum_{i = 1}^{n} I (A_{i} \leq t_{j} \leq R_{i}^{*}) \hat{E} (W_{i j})}{\sum_{i = 1}^{n} I (A_{i} \leq t_{j} \leq R_{i}^{*}) \exp (β^{T} Z_{i})}

and update β by solving

\sum_{i = 1}^{n} \sum_{j = 1}^{k} I (A_{i} \leq t_{j} \leq R_{i}^{*}) \hat{E} (W_{i j}) \times {Z_{i} - \frac{\sum_{i^{'} = 1}^{n} I (A_{i^{'}} \leq t_{j} \leq R_{i^{'}}^{*}) \exp (β^{T} Z_{i^{'}}) Z_{i^{'}}}{\sum_{i^{'} = 1}^{n} I (A_{i^{'}} \leq t_{j} \leq R_{i^{'}}^{*}) \exp (β^{T} Z_{i^{'}})}} = 0 .

We iterate between the E-step and M-step until convergence.
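The iteration above can be sketched in a few lines of code. This is a minimal illustration rather than the authors' implementation, and it rests on several assumptions that are not taken from this appendix: the covariate Z_i is scalar, R_i^* is taken to be R_i when finite and L_i otherwise, β is updated by a few clipped Newton steps, and the function names are hypothetical. The indicator sets follow the displayed formulas.

```python
import math


def em_conditional_ph(A, L, R, Z, n_iter=200, tol=1e-8):
    """EM sketch for the conditional-likelihood fit with a scalar covariate Z."""
    n = len(A)
    # Assumption: R_i^* equals R_i when finite and L_i otherwise.
    R_star = [r if math.isfinite(r) else l for l, r in zip(L, R)]
    # Jump points: ordered distinct values of all A_i, L_i and finite R_i.
    t = sorted(set(A) | set(L) | {r for r in R if math.isfinite(r)})
    k = len(t)
    at_risk = [[A[i] <= tj <= R_star[i] for tj in t] for i in range(n)]
    in_gap = [[math.isfinite(R[i]) and L[i] < tj <= R[i] for tj in t]
              for i in range(n)]
    beta, lam = 0.0, [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: conditional means of the latent Poisson counts W_ij.
        EW = [[0.0] * k for _ in range(n)]
        for i in range(n):
            risk = math.exp(beta * Z[i])
            total = risk * sum(lam[j] for j in range(k) if in_gap[i][j])
            if total > 0.0:
                scale = risk / (-math.expm1(-total))  # risk / (1 - exp(-total))
                for j in range(k):
                    if in_gap[i][j]:
                        EW[i][j] = lam[j] * scale
        # EW is nonzero only inside (L_i, R_i], where subject i is at risk.
        w = [sum(EW[i][j] for i in range(n)) for j in range(k)]
        wz = sum(EW[i][j] * Z[i] for i in range(n) for j in range(k))

        # M-step for beta: a few clipped Newton steps on the score equation.
        new_beta = beta
        for _ in range(10):
            score, info = wz, 0.0
            for j in range(k):
                if w[j] == 0.0:
                    continue
                e = [math.exp(new_beta * Z[i]) if at_risk[i][j] else 0.0
                     for i in range(n)]
                s0 = sum(e)
                s1 = sum(ei * zi for ei, zi in zip(e, Z))
                s2 = sum(ei * zi * zi for ei, zi in zip(e, Z))
                score -= w[j] * s1 / s0
                info += w[j] * (s2 / s0 - (s1 / s0) ** 2)
            if info <= 0.0:
                break
            step = max(-0.5, min(0.5, score / info))
            new_beta += step
            if abs(step) < tol:
                break

        # M-step for lambda_j, given the updated beta.
        new_lam = []
        for j in range(k):
            s0 = sum(math.exp(new_beta * Z[i]) for i in range(n) if at_risk[i][j])
            new_lam.append(w[j] / s0)
        change = abs(new_beta - beta) + max(abs(a - b) for a, b in zip(new_lam, lam))
        beta, lam = new_beta, new_lam
        if change < tol:
            break
    return beta, t, lam


def conditional_loglik(beta, t, lam, A, L, R, Z):
    """Objective function: log conditional likelihood at (beta, lambda)."""
    total = 0.0
    for a, l, r, z in zip(A, L, R, Z):
        risk = math.exp(beta * z)
        value = math.exp(-risk * sum(x for tj, x in zip(t, lam) if a <= tj <= l))
        if math.isfinite(r):
            value -= math.exp(-risk * sum(x for tj, x in zip(t, lam) if a <= tj <= r))
        total += math.log(value)
    return total
```

The inputs are lists of truncation times `A`, interval endpoints `L` and `R` (with `math.inf` for right-censored subjects), and covariate values `Z`. Evaluating `conditional_loglik` before and after the iterations is a simple way to monitor that the objective function does not decrease.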

Footnotes

SUPPORTING INFORMATION

The Web Appendices and Figure, referenced in Sections 3 and 4 and Appendix A, and the computation codes for the simulation studies are available with this paper at the Biometrics website on Wiley Online Library.

REFERENCES

Asgharian M, Wolfson DB, and Zhang X. (2006). Checking stationarity of the incidence rate using prevalent cohort survival data. Stat Med 25, 1751–1767.
Brookmeyer R. and Gail MH (1987). Biases in prevalent cohorts. Biometrics 43, 739–749.
Cai T. and Betensky RA (2003). Hazard regression for interval-censored data with penalized spline. Biometrics 59, 570–579.
Chan KCG and Wang M-C (2010). Backward estimation of stochastic processes with failure events as time origins. Ann Appl Stat 4, 1602–1620.
Chappell R. (1991). Sampling design of multiwave studies with an application to the Massachusetts health care panel study. Stat Med 10, 1945–1958.
Chen YQ (2010). Semiparametric regression in size-biased sampling. Biometrics 66, 149–158.
Henschel V. and Mansmann U. (2013). intcox: Iterated convex minorant algorithm for interval-censored event data. R package version 0.9.3.
Huang J. (1996). Efficient estimation for the proportional hazards model with interval censoring. Ann Stat 24, 540–568.
Huang C-Y and Qin J. (2012). Composite partial likelihood estimation under length-biased sampling, with application to a prevalent cohort study of dementia. J Am Stat Assoc 107, 946–957.
Huang J. and Rossini AJ (1997). Sieve estimation for the proportional-odds failure-time regression model with interval censoring. J Am Stat Assoc 92, 960–967.
Huang C-Y, Ning J, and Qin J. (2015). Semiparametric likelihood inference for left-truncated and right-censored data. Biostatistics 16, 785–798.
Hudgens MG (2005). On nonparametric maximum likelihood estimation with interval censoring and left truncation. J R Stat Soc B 67, 573–587.
Kim JS (2003). Efficient estimation for the proportional hazards model with left-truncated and “Case 1” interval-censored data. Stat Sin 13, 519–537.
Lai TL and Ying Z. (1994). A missing information principle and M-estimators in regression analysis with censored and truncated data. Ann Stat 22, 1222–1255.
Mandel M. and Betensky RA (2007). Testing goodness of fit of a uniform truncation model. Biometrics 63, 405–412.
Mao L, Lin D-Y, and Zeng D. (2017). Semiparametric regression analysis of interval-censored competing risks data. Biometrics 73, 857–865.
Murphy SA and van der Vaart AW (2000). On profile likelihood. J Am Stat Assoc 95, 449–465.
Pan W. (1999). Extending the iterative convex minorant algorithm to the Cox model for interval-censored data. J Comput Graph Stat 8, 109–120.
Pan W. and Chappell R. (1998). Computation of the NPMLE of distribution functions for interval censored and truncated data with applications to the Cox model. Comput Stat Data Anal 28, 33–50.
Pan W. and Chappell R. (2002). Estimation in the Cox proportional hazards model with left-truncated and interval-censored data. Biometrics 58, 64–70.
Qin J. and Shen Y. (2010). Statistical methods for analyzing right-censored length-biased data under Cox model. Biometrics 66, 382–392.
Qin J, Ning J, Liu H, and Shen Y. (2011). Maximum likelihood estimations and EM algorithms with length-biased data. J Am Stat Assoc 106, 1434–1449.
Shen Y, Ning J, and Qin J. (2009). Analyzing length-biased data with semiparametric transformation and accelerated failure time models. J Am Stat Assoc 104, 1192–1202.
Sun J. (2007). The Statistical Analysis of Interval-Censored Failure Time Data. New York: Springer.
Turnbull BW (1976). The empirical distribution function with arbitrarily grouped, censored and truncated data. J R Stat Soc B 38, 290–295.
Wang M-C (1991). Nonparametric estimation from cross-sectional survival data. J Am Stat Assoc 86, 130–143.
Wang M-C (1996). Hazards regression analysis for length-biased data. Biometrika 83, 343–354.
Wang P, Tong X, Zhao S, and Sun J. (2015). Regression analysis of left-truncated and case I interval-censored data with the additive hazards model. Commun Stat Theory Methods 44, 1537–1551.
Wang L, McMahan CS, Hudgens MG, and Qureshi Z. (2016). A flexible, computationally efficient method for fitting the proportional hazards model to interval-censored data. Biometrics 72, 222–231.
Zeidler E. (1995). Applied Functional Analysis: Applications to Mathematical Physics. New York: Springer.
Zeng D, Mao L, and Lin DY (2016). Maximum likelihood estimation for semiparametric transformation models with interval-censored data. Biometrika 103, 253–271.
Zeng D, Gao F, and Lin DY (2017). Maximum likelihood estimation for semiparametric regression models with multivariate interval-censored data. Biometrika 104, 505–525.
