SUMMARY
Length-biased sampling is well recognized in economics, industrial reliability, etiology, epidemiology, genetics, and cancer screening studies. Length-biased right-censored data have a unique structure that differs from that of traditional survival data, so nonparametric and semiparametric estimation and inference methods for traditional survival data are not directly applicable. We propose new expectation-maximization algorithms for estimation based on full likelihoods involving infinite-dimensional parameters under three settings for length-biased data: estimating the nonparametric distribution function, estimating the nonparametric hazard function under an increasing failure rate constraint, and jointly estimating the baseline hazard function and the covariate coefficients under the Cox proportional hazards model. Extensive simulation studies show that the maximum likelihood estimators perform well with moderate sample sizes and are more efficient than the estimating equation approaches; the proposed estimators are also more robust to various right-censoring mechanisms. We prove the strong consistency of the estimators, and establish the asymptotic normality of the semiparametric maximum likelihood estimators under the Cox model using modern empirical process theory. We apply the proposed methods to a prevalent cohort medical study. Supplemental materials are available online.
Keywords: Cox regression model, EM algorithm, Increasing failure rate, Non-parametric likelihood, Profile likelihood, Right-censored data
1. INTRODUCTION
When the observed failure times are not randomly selected from the target population of interest but are sampled with probability proportional to their underlying length, we have length-biased time-to-event data. Length-biased data are naturally encountered in applications of renewal processes (Cox and Miller, 1977; Vardi, 1982; Dewanji and Kalbfleisch, 1987; Vardi, 1989), industrial applications (Kvam, 2008), etiologic studies (Simon, 1980), genome-wide linkage studies (Terwilliger et al., 1997), epidemiologic cohort studies (Keiding, 1991; Gail and Benichou, 2000; Gordis, 2000; Sansgiry and Akman, 2000; Scheike and Keiding, 2006), cancer prevention trials (Zelen and Feinleib, 1969; Zelen, 2004), and studies in labor economics (Lancaster, 1990; McClean and Devine, 1995; De Uña Álvarez et al., 2003). In observational studies, a prevalent cohort design that draws samples from individuals with a condition or disease at the time of enrollment is generally more efficient and practical. The recruited patients, who have already experienced an initiating event, are followed prospectively for the failure event (e.g., disease progression or death) or are right censored. Under this sampling design, individuals with longer survival times measured from the onset of the disease are more likely to be included in the cohort, whereas those with shorter survival times are selectively excluded. Length-biased sampling thereby manifests in the observations, because the “observed” time intervals from initiation to failure within the prevalent cohort tend to be longer than those arising from the underlying distribution of the general population. How to properly adjust for potential selection bias in analyzing length-biased data has been a longstanding statistical problem. Although we use a prevalent cohort study in medical applications here to illustrate length-biased data, it is apparent that the issues caused by biased sampling are common to many potential applications and sampling designs.
In a seminal paper, Vardi (1989) described the multiplicative censorship model, which connected four well-investigated statistical problems: A. Estimating a nonparametric distribution function under multiplicative censoring, B. Estimating the underlying distribution in renewal processes, C. Solving a nonparametric deconvolution problem, and D. Estimating a monotone decreasing density function. Vardi (1989) presented problems A and C, which have a natural connection with the measurement error problem and inverse problem discussed by van der Vaart and Wellner (1992) and Bickel and Ritov (1994). Most importantly, Vardi (1989) and Wang (1991) showed that the nonparametric maximum likelihood estimation (NPMLE) of the survival distribution under multiplicative censoring (problem B) is equivalent to the nonparametric estimation of the survival distribution for observed length-biased data. The large sample properties of the corresponding NPMLE are established in Asgharian et al. (2002), and the asymptotic efficiency of the NPMLE follows from Asgharian and Wolfson (2005) and van der Vaart (1998, Theorem 25.47). In this paper we explore the potential to extend the approach of Vardi (1989) to nonparametric estimation in more general settings and to semiparametric regression models.
The Cox proportional hazards model, the most popular semiparametric model for regression analysis of traditional survival data, assumes a nonparametric baseline hazard function and a regression function of the covariates (Cox, 1972, 1975). Only limited literature exists on modeling risk factors for the distribution of the underlying population when the observed failure times are subject to length bias. Recently, Tsai (2009) generalized the pseudo-partial likelihood approach of Wang (1996) to model right-censored length-biased data. Qin and Shen (2010) proposed inverse-weighted estimating equation approaches for right-censored length-biased data under the proportional hazards model. These approaches do not provide a straightforward way to analyze length-biased data if the censoring time depends on the covariates, and may not yield efficient estimators. For traditional survival data, Zeng and Lin (2007) demonstrated that estimating equation approaches under either the semiparametric Cox model or the transformation models are less efficient than the profile maximum likelihood estimation approach. (For related work on the profile likelihood for traditional survival data, see Nielsen et al. (1992); Klein (1992); Murphy (1994, 1995); Murphy and van der Vaart (2000); Zeng et al. (2005); Zeng and Lin (2007).) For right-censored length-biased data, we expect a similar efficiency advantage for the maximum likelihood estimation (MLE) method, as well as robustness to various assumptions about the censoring distribution.
Implementing the profile likelihood method is much more challenging for right-censored length-biased data than for traditional survival data. One significant difference is that the full likelihood has positive mass at both censored and failure time points for length-biased data, in contrast to the MLEs for traditional survival data and the conditional likelihood estimates for length-biased data. We propose new expectation-maximization (EM) algorithms for the maximum likelihood estimation of the nonparametric and semiparametric Cox regression models for right-censored length-biased data. One new aspect of our method is that we derive the likelihood for the unobserved (i.e., left-truncated) subpopulation given the observed length-biased data in the full likelihood, which serves as the missing data mechanism in the EM algorithm. In contrast to the EM algorithm of Vardi (1989), which estimates the underlying distribution function via estimation of the biased distribution, our EM algorithm directly estimates the target unbiased distribution function. As a result, any model and parameter constraints for the target distribution function can be imposed directly.
The rest of the paper is organized as follows. In Section 2, we introduce a new EM algorithm for the nonparametric estimation of the target distribution given length-biased data. In Section 3, we apply the new EM algorithm to estimate a distribution function with an increasing failure rate constraint. In Section 4, we propose the maximum semiparametric likelihood estimation under the Cox proportional hazards model and derive the large sample properties for length-biased data. We provide a convenient profile estimation approach based on the EM algorithm, with which the standard software for Cox regression can be adapted for right-censored length-biased data. We describe our simulation studies in Section 5 and the application of our method to a data example in Section 6. Section 7 contains some concluding remarks.
2. A NEW EM ALGORITHM FOR ESTIMATING NONPARAMETRIC SURVIVAL FUNCTION
Consider a prevalent cohort study in which the subjects are diagnosed with a disease and are at risk for a failure event. Let T̃ be the duration from the disease onset to failure, with unbiased density function f(t) = dF(t)/dt and survival function S(t). The observed data include the backward recurrence time A (from disease onset to the study entry), the forward recurrence time V (from the study entry to failure), and the length-biased time T = A + V. Based on renewal theory (Vardi, 1982, 1989; Lancaster, 1990, Chapter 3), the joint density of (A, V) is

f(a + v)/μ,  a ≥ 0, v ≥ 0,

where μ = ∫_0^∞ u dF(u) < ∞.
When the prevalent cohort is followed prospectively, V is subject to right censoring. The censoring time, denoted by C, is measured from the study entry. Let δ = I(V < C) be the censoring indicator and assume that (A, V) is independent of C. Let X = min(A + V, A + C). Denote the observed data as (Xi, Ai, δi), i = 1, 2, …, n. The density function of the observed biased T is defined as g(y) = dG(y)/dy, where dG(y) = y dF(y)/μ, and the survival function of T̃ is S(t) = 1 − F(t).
Therefore, the likelihood for the observed data (Xi, Ai, δi), i = 1, ···, n, is proportional to

L_n(F) = ∏_{i=1}^n {f(X_i)}^{δ_i} {S(X_i)}^{1−δ_i}/μ.  (1)
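To see why a censored subject contributes S(X_i)/μ in (1), one can integrate the joint density of (A, V) over the censored region; a one-line LaTeX sketch:

```latex
% A censored subject with entry time a and residual censoring time c
% (so X = a + c) contributes P(V > c, A = a):
\int_{c}^{\infty} \frac{f(a+v)}{\mu}\,dv
  \;=\; \frac{S(a+c)}{\mu} \;=\; \frac{S(X)}{\mu},
% while an uncensored subject contributes f(X)/\mu.
```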
Vardi (1989) proposed an EM algorithm for the NPMLE of G. Using the relationship between G and F, dF(t) = t^{−1} dG(t)/∫ u^{−1} dG(u), the NPMLE for F can be derived. However, it is often difficult to impose constraints on F when F is estimated from the NPMLE of G, because constraints on F may not translate easily into constraints on G.
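To illustrate this relationship, a small R fragment (our sketch; the function name g_to_f is hypothetical) converts jump sizes of an estimate of G into the corresponding estimate of F:

```r
## Convert jumps dG at support points t into jumps of the unbiased F:
## dF(t_j) is proportional to dG(t_j)/t_j, renormalized to sum to one.
g_to_f <- function(t, dG) {
  dF <- (dG / t) / sum(dG / t)
  list(t = t, dF = dF, F = cumsum(dF))
}
```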
As demonstrated in Vardi (1989), to maximize (1) it is sufficient to consider the discrete version of the distribution F, i.e., p(T̃ = t_i) = p_i, nonparametrically, with point masses at t_1 < t_2 < ··· < t_k, where t_1, …, t_k are the ordered unique failure and censoring times of {X_1, …, X_n}, k ≤ n. In principle, the length-biased observations (A, T) can be equivalently generated from a truncation model with
A ∼ Uniform(0, τ̂) and T̃ ∼ F,  (2)

where τ̂ = t_k, A and T̃ are independent, dF(t_i) = p_i with Σ_{i=1}^k p_i = 1, and (A, T̃) is observed if and only if T̃ ≥ A. The probability of observing a length-biased observation under this setting is π = P(T̃ ≥ A) = E(T̃)/τ̂.
We propose an EM algorithm with a different missing-data mechanism to directly estimate the target distribution, F. For a cohort subject to left truncation, the biased samples on n subjects, denoted by O = {(X_1, δ_1, A_1), ···, (X_n, δ_n, A_n): A_i ≤ X_i, i = 1, ···, n}, are observed, whereas the data on m subjects are left truncated. Here the latent left-truncated data are denoted by O* = {(A*_1, T̃*_1), ···, (A*_m, T̃*_m): T̃*_l < A*_l}. The random integer m then follows a negative binomial distribution with parameter π; the probability mass function of m is

P(m) = \binom{n + m − 1}{m} π^n (1 − π)^m,  m = 0, 1, 2, ···.
Following the principle of the EM algorithm, we think of {O, O*} as the ‘complete data’, take the pseudo missing data, also referred to as “ghosts” in Turnbull (1976), to be O*, and take the observed ‘incomplete data’ to be O. We derive the full likelihood including the component for the truncated observations. The log-likelihood based on the complete data {O, O*} is
ℓ_C(p) = Σ_{i=1}^n log dF(T_i) + Σ_{l=1}^m log dF(T̃*_l),  (3)
where T_i ≥ A_i, i = 1, 2, ···, n and T̃*_l < A*_l, l = 1, ···, m. Then, conditional on the observed data,

E{I(T_i = t_j) | O} = δ_i I(X_i = t_j) + (1 − δ_i) I(X_i ≤ t_j) p_j/S(X_i),

because (1 − δ_i)P(T_i = t_j | T_i ≥ A_i, T_i ≥ X_i) = (1 − δ_i)P(T̃ = t_j | T̃ ≥ X_i) = (1 − δ_i)I(X_i ≤ t_j)p_j/S(X_i). Conditional on the observed data O, the expectation for the missing left-truncated data can be expressed as

E{Σ_{l=1}^m I(T̃*_l = t_j) | O} = E(m | O) P(T̃* = t_j | T̃* < A*).

Under the truncation model specified in (2),

P(T̃* = t_j | T̃* < A*) = p_j(1 − t_j/τ̂)/(1 − π).

This together with E(m | O) = n(1 − π)/π gives

E{Σ_{l=1}^m I(T̃*_l = t_j) | O} = n p_j(1 − t_j/τ̂)/π.
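The value E(m | O) = n(1 − π)/π used above is simply the mean of the negative binomial count introduced earlier; a short LaTeX check:

```latex
% m | O ~ NegBin(n, pi) counts the truncated subjects accompanying
% the n observed ones, so its mean is
\mathrm{E}(m \mid O)
  \;=\; \sum_{m=0}^{\infty} m \binom{n+m-1}{m}\,\pi^{n}(1-\pi)^{m}
  \;=\; \frac{n(1-\pi)}{\pi}.
```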
Subject to Σ_{j=1}^k p_j = 1 and p_j ≥ 0, we maximize the expected complete-data log-likelihood conditional on the observed data via the EM algorithm,

ℓ_E(p) = Σ_{j=1}^k w_j log p_j,  (4)

where p = (p_1, ···, p_k), and

w_j = Σ_{i=1}^n {δ_i I(X_i = t_j) + (1 − δ_i) I(X_i ≤ t_j) p_j/S(X_i)} + n p_j(1 − t_j/τ̂)/π.
By simple algebra, p̂_j = w_j/Σ_{l=1}^k w_l. The following iterative EM algorithm can be used to solve for p̂_j, j = 1, ···, k.
Step 1. Select an arbitrary initial value p^{(0)} = (p_1^{(0)}, ···, p_k^{(0)}) satisfying Σ_{j=1}^k p_j^{(0)} = 1 and p_j^{(0)} > 0.

Step 2. Solve p^{(s+1)} by maximizing (4), so that we replace p^{(s)} with

p_j^{(s+1)} = w_j^{(s)}/Σ_{l=1}^k w_l^{(s)},  (5)

where w_j^{(s)} denotes w_j in (4) evaluated at p^{(s)}.
With a given convergence criterion, we can solve the p_j iteratively. Let p̂_j denote the MLE of p_j, j = 1, ···, k; the NPMLE of F is F̂(t) = Σ_{j: t_j ≤ t} p̂_j, with π̂ = ∫ t dF̂(t)/τ̂ and μ̂ = π̂τ̂, where Ŝ(t) = Σ_{j: t_j ≥ t} p̂_j. Thus, the limiting form of (5) is

p̂_j = (μ̂/t_j) n^{−1} Σ_{i=1}^n {δ_i I(X_i = t_j) + (1 − δ_i) I(X_i ≤ t_j) p̂_j/Ŝ(X_i)},  j = 1, ···, k.  (6)
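The iteration (5) is straightforward to implement. Below is a minimal R sketch under the setup above, assuming observed vectors x (= X_i) and delta; the function name em_npmle_lb, the starting value, and the tolerances are our illustrative choices, not part of any package.

```r
em_npmle_lb <- function(x, delta, tol = 1e-8, maxit = 5000) {
  t <- sort(unique(x))                  # jump points t_1 < ... < t_k
  k <- length(t); n <- length(x)
  tau <- max(t)                         # tau-hat = t_k
  p <- rep(1 / k, k)                    # Step 1: arbitrary starting value
  d <- vapply(t, function(s) sum(delta * (x == s)), 0)   # failures at t_j
  for (it in seq_len(maxit)) {
    S  <- vapply(x, function(xi) sum(p[t >= xi]), 0)     # S(X_i)
    pi <- sum(p * t) / tau                               # P(T-tilde >= A)
    w  <- d +
      vapply(seq_len(k), function(j)
        sum((1 - delta) * (x <= t[j]) * p[j] / S), 0) +  # imputed censored
      n * p * (1 - t / tau) / pi                         # expected "ghosts"
    p_new <- w / sum(w)                                  # update (5)
    if (max(abs(p_new - p)) < tol) { p <- p_new; break }
    p <- p_new
  }
  list(t = t, p = p, F = cumsum(p))
}
```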
Remark 1
In contrast to the NPMLE for traditional survival analysis, which has jumps only at the observed failure time points, the proposed NPMLE for length-biased data has jumps at all distinct observed time points, including censored times, similar to that of Vardi (1989).
Remark 2
Equation (6) for the constructed EM algorithm with the unbiased distribution function F is equivalent to that for Vardi’s EM algorithm based on a ‘multiplicative-censorship’ model with the biased distribution function G. Denoting dĜ(t) = t dF̂(t)/μ̂, where μ̂ = π̂τ̂, we re-express equation (6) as an equation of Ĝ,

dĜ(t_j) = n^{−1} Σ_{i=1}^n [δ_i I(X_i = t_j) + (1 − δ_i) I(X_i ≤ t_j) {dĜ(t_j)/t_j}/{Σ_{t_l ≥ X_i} dĜ(t_l)/t_l}],

which is the same equation derived by Vardi (1989) and Vardi and Zhang (1992). The advantage of the new EM algorithm is that it directly estimates the target distribution function of the unbiased data, which allows one to impose constraints on F directly. This advantage will be further elucidated in the next two sections.
Remark 3
The ‘missing’ data (i.e., the left-truncated failure times {T̃*_l, l = 1, ···, m}) are assumed not to be subject to right censoring. It is clear that whether T̃* is subject to right censoring or not is irrelevant in the derivation of the above EM algorithm.
Remark 4
The development of the methods and large sample properties is focused on [0, τ] throughout the paper, where τ is a finite upper bound of the support of the population survival times and Λ(τ) < ∞. In practice, τ can be estimated by τ̂ = t_k = max{X_1, ···, X_n}. We prove (in Appendix A.2) the following lemma, which states that τ̂ ≡ t_k → τ in probability with a convergence rate faster than n^{1/2}.
Lemma 1
Suppose that E(C) > 0 and τ < ∞. Then for 1 > η > 1/2, n^η(τ̂ − τ) = o_p(1).
3. NONPARAMETRIC MAXIMUM LIKELIHOOD ESTIMATION WITH INCREASING FAILURE RATE
In some applications, it is known or assumed that the survival function for the target population has an increasing failure rate (Barlow and Proschan, 1975; Padgett and Wei, 1980; Tsai, 1988). The maximum likelihood estimation of a distribution function with an increasing failure rate was derived for traditional right-censored data by Padgett and Wei (1980), and for left-truncated and right-censored data by Tsai (1988). Using the same notation as in Section 2, the observed right-censored length-biased data are denoted by (X, A, δ). Let λ(t) denote the hazard function for the target cumulative distribution function F. Let z_1 < ··· < z_{k*} denote the distinct ordered failure times among {X_1, ···, X_n}. Let the size of the risk set at time x be denoted by R(x) = Σ_{i=1}^n I(A_i ≤ x ≤ X_i) and the number of failures at time x by D(x) = Σ_{i=1}^n δ_i I(X_i = x). Under the increasing failure rate constraint, Tsai (1988) proposed a maximum conditional likelihood estimator of λ, conditional on the truncation time A,

λ̂_c(z_i) = max_{u ≤ i} min_{v ≥ i} {Σ_{j=u}^v D(z_j)}/{Σ_{j=u}^v R(z_j)},  i = 1, ···, k*,

where the max-min form is the isotonic regression of the raw hazard estimates D(z_j)/R(z_j).
By applying the new EM algorithm, we consider a full likelihood estimation of the hazard function for the target population. Define λ(t_j) = λ_j; p_j can be expressed as p_j = λ_j ∏_{l=1}^{j−1}(1 − λ_l); thus the expected complete-data log-likelihood function in (4) is

ℓ_E(λ) = Σ_{j=1}^k {w_j log λ_j + (Σ_{l=j+1}^k w_l) log(1 − λ_j)},  (7)
where λ = (λ_1, ···, λ_k), and t_1 < ··· < t_k is defined in §2. Because the hazard function λ(·) increases with time, the maximization of (7) is subject to the constraint

λ_0 ≤ λ_1 ≤ ··· ≤ λ_k,

where λ_0 = 0. Taking the partial derivative with respect to λ_j on the right side of (7) and setting it to zero, we have

w_j/λ_j − (Σ_{l=j+1}^k w_l)/(1 − λ_j) = 0,  j = 1, ···, k.  (8)
Using arguments similar to those of Marshall and Proschan (1965) and Padgett and Wei (1980), the solution to equation (8) also maximizes the expected log-likelihood ℓ_E(λ) defined in (7), with unconstrained solution λ̃_j = w_j/Σ_{l=j}^k w_l. Applying the pool-adjacent-violators algorithm, we can then achieve monotonicity for the NPMLE of λ(·),

λ̂_j = max_{u ≤ j} min_{v ≥ j} (Σ_{l=u}^v w_l)/(Σ_{l=u}^v W_l),

where W_l = Σ_{m=l}^k w_m. Although the formula for the proposed NPMLE of the monotone hazard function bears some similarity to that of Tsai (1988), the full likelihood approach is essentially different from the conditional likelihood approach of Tsai (1988), in which the estimated function has jumps only at the distinct failure time points. By using the information from the left-truncated data in the full likelihood function, the NPMLE is expected to be more efficient and smoother than the maximum conditional likelihood estimate. We compare the two approaches further in the empirical studies.
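For concreteness, the pooling step can be coded as a weighted pool-adjacent-violators pass over the unconstrained ratios w_j/W_j with weights W_j; the following R sketch (ifr_hazard is our hypothetical name, assuming E-step weights w from Section 2) illustrates one standard implementation.

```r
ifr_hazard <- function(w) {
  k   <- length(w)
  W   <- rev(cumsum(rev(w)))      # W_l = w_l + ... + w_k
  val <- w / W                    # unconstrained maximizer of (7)
  wt  <- W                        # PAVA weights
  len <- rep(1L, k)               # current block sizes
  i <- 1
  while (i < length(val)) {
    if (val[i + 1] < val[i]) {    # adjacent violation: pool the two blocks
      val[i] <- (wt[i] * val[i] + wt[i + 1] * val[i + 1]) / (wt[i] + wt[i + 1])
      wt[i]  <- wt[i] + wt[i + 1]
      len[i] <- len[i] + len[i + 1]
      val <- val[-(i + 1)]; wt <- wt[-(i + 1)]; len <- len[-(i + 1)]
      if (i > 1) i <- i - 1
    } else i <- i + 1
  }
  rep(val, len)                   # lambda-hat_1 <= ... <= lambda-hat_k
}
```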
As the hazard function λ(t) is increasing on t ∈ [0, τ], the corresponding cumulative hazard function Λ(t) is convex. Let Λ̂n(·) denote the estimator obtained by the EM algorithm together with the pool-adjacent-violators algorithm. Then Λ̂n(·) is the greatest convex minorant of Λn(·), where Λn(·) is the NPMLE of Λ(·) when there is no constraint on its shape. The strong consistency of Λn(·), uniformly on [0, τ], can be derived easily from the uniform consistency of the corresponding survival function estimator, established by Asgharian and Wolfson (2005). The consistency of Λ̂n then follows, because the pool-adjacent-violators algorithm defines a continuous map from Λn(·) to Λ̂n(·). The technical details are provided in Appendix A.3.
4. MLE UNDER COX REGRESSION MODEL
4.1. Full Likelihood and Score Functions
Since the cornerstone work of Cox (1972, 1975), the proportional hazards model has become the standard regression model for analyzing traditional right-censored survival data. Specifically, the covariate-specific hazard function is specified as

λ(t | Z) = λ(t) exp(β′Z),
where Z is a covariate vector and the baseline hazard function λ(t) is not specified parametrically. Breslow (1972) showed that by inserting an estimator (Breslow’s estimator) of the hazard function with a fixed β into the full likelihood, the profile likelihood for β reduces to Cox’s partial likelihood for β. Later, Kalbfleisch and Prentice (1973) and Andersen et al. (1992) proved that the rank-based likelihood method is also equivalent to the partial likelihood method. When survival data are subject to biased sampling, neither Cox’s partial likelihood approach nor Kalbfleisch and Prentice’s rank-based likelihood method can be directly applied: the observed biased data do not follow the proportional hazards model that is assumed for unbiased data from the target population, and the rank-based likelihood method is not applicable because length-biased data depend on the magnitude of the length.
The density function of an unbiased T̃ given Z is denoted by f(t | Z) and the corresponding survival function by S(t | Z). For random but length-biased samples of n subjects, the observed data consist of {𝒪_i ≡ (A_i, X_i, δ_i, Z_i), i = 1, ···, n}, which are n i.i.d. copies of 𝒪 ≡ (A, X, δ, Z). The full likelihood function of the observed data is proportional to

L_n = ∏_{i=1}^n {f(X_i | Z_i)}^{δ_i} {S(X_i | Z_i)}^{1−δ_i}/μ(Z_i),  (9)

where μ(Z) = ∫_0^∞ t f(t | Z) dt. The identifiability of the model can be established similarly to the Cox model for traditional survival data, where identifiability has been established (Elbers and Ridder, 1982). By decomposing the full likelihood into the product of the conditional likelihood of X given A and the marginal likelihood of A, we have

L_n = ∏_{i=1}^n [{f(X_i | Z_i)}^{δ_i} {S(X_i | Z_i)}^{1−δ_i}/S(A_i | Z_i)] × ∏_{i=1}^n [S(A_i | Z_i)/μ(Z_i)].
Although the estimating equation derived from the likelihood conditional on A (the first component in L_n) shares the advantage of Cox’s partial likelihood of canceling the baseline hazard function (Wang et al., 1993; Kalbfleisch and Lawless, 1991), the conditional likelihood approach is generally less efficient than the full likelihood approach.
Using counting process notation, we denote N_i(t) = I(X_i ≤ t)δ_i and Y_i(t) = I(X_i ≥ t) for i = 1, ···, n. The log-likelihood function of (9) can be expressed as

ℓ_n(β, Λ) = Σ_{i=1}^n [∫_0^τ {log λ(u) + β′Z_i} dN_i(u) − ∫_0^τ Y_i(u) e^{β′Z_i} dΛ(u) − log μ(Z_i)],  (10)

where Λ(t) = ∫_0^t λ(u) du and μ(Z) = ∫_0^τ exp{−Λ(u) e^{β′Z}} du. Directly maximizing (10) or solving its score equations for the MLE of β and the infinite-dimensional parameter Λ can be computationally intractable. We therefore propose an alternative computational approach, which generalizes the EM algorithm for the NPMLE discussed in Section 2 to the Cox proportional hazards model.
The semiparametric MLE for the baseline hazard function Λ is obtained by maximizing the likelihood over the set of piecewise-constant functions. Of note, the estimator can have jumps at both censored and uncensored times, because the likelihood function achieves its maximum for a hazard function with jumps on {t_1, ···, t_k}, where t_1 < ··· < t_k denotes the distinct failure and censored time points. Similar to the argument in Vardi (1989, page 754), for any estimator of Λ(t) that jumps outside of {t_1, ···, t_k} one can find an estimator with jumps only on {t_1, ···, t_k} that attains a greater likelihood. A detailed explanation is given in the Supplemental Materials available online.
4.2. MLE and EM Algorithm
For i = 1, ···, n, let T*_{ij}, j = 1, 2, …, m_i, be the truncated latent data corresponding to covariate Z_i. We develop the EM algorithm based on the discretized version Λ(u) = Σ_{t_j ≤ u} λ_j, where λ_j is the positive jump at time t_j for j = 1, ···, k, and λ = (λ_1, ···, λ_k). For notational convenience, denote f_i(t) = dF(t | Z_i). The log-likelihood based on the complete data is then

ℓ_C(β, λ) = Σ_{i=1}^n [log f_i(T_i) + Σ_{j=1}^{m_i} log f_i(T*_{ij})].
Conditional on the observed data for the ith subject, 𝒪_i = {X_i, A_i, δ_i, Z_i}, we obtain the expectation

w_ij = δ_i I(X_i = t_j) + (1 − δ_i) I(X_i ≤ t_j) f_i(t_j)/S(X_i | Z_i) + f_i(t_j)(τ̂ − t_j)/μ(Z_i),  (11)

where the three terms count, respectively, the observed failures, the imputed failure times for censored subjects, and the expected left-truncated (‘ghost’) failure times.
Thus, the expected complete-data log-likelihood function conditional on the observed data is

ℓ_E(β, λ) = Σ_{i=1}^n Σ_{j=1}^k w_ij {log λ_j + β′Z_i − Λ(t_j) e^{β′Z_i}},

where Λ(t_j) = Σ_{l=1}^j λ_l and the w_ij are given by (11). In the M-step, we maximize ℓ_E with respect to the baseline hazard function at t_j, for j = 1, ···, k, by solving ∂ℓ_E/∂λ_j = 0, which leads to a closed form for λ_j as a function of β, denoted by

λ_j(β) = (Σ_{i=1}^n w_ij)/(Σ_{i=1}^n e^{β′Z_i} Σ_{l=j}^k w_il).  (12)
Here, λ_j(β) is the maximizer in the M-step. Next, we maximize the expected complete-data log-likelihood function with respect to β,

∂ℓ_E/∂β = Σ_{i=1}^n Σ_{j=1}^k w_ij Z_i {1 − Λ(t_j) e^{β′Z_i}} = 0.  (13)

By inserting λ_j(β) of (12) into equation (13), β can be solved from the following equation,

Σ_{j=1}^k Σ_{i=1}^n w_ij [Z_i − {Σ_{i′=1}^n Z_{i′} e^{β′Z_{i′}} R_{i′j}}/{Σ_{i′=1}^n e^{β′Z_{i′}} R_{i′j}}] = 0,  where R_{i′j} = Σ_{l=j}^k w_{i′l},  (14)
which is equivalent to maximizing the complete-data likelihood profiled over λ. With the estimated λ_j (j = 1, ···, k) and β, one updates the expectation of the likelihood via w_ij in (11) and repeats the M-step until the estimators of β and λ_j (j = 1, ···, k) converge.
At the M-step, the estimating equation (14) reveals that we may use existing software for conventional right-censored data to estimate the covariate coefficient β under the Cox proportional hazards model. To simplify the description, consider a model with one covariate Z. First we create a vector of length nk for the weight function, W_{nk} = (w_11, ···, w_1k, w_21, ···, w_2k, ···, w_n1, ···, w_nk), which is estimated at the E-step. The corresponding failure time and covariate vectors are constructed with the same length as W_{nk}: T_{nk} = (t_1, ···, t_k, ···, t_1, ···, t_k) and Z_{nk} = (Z_1, ···, Z_1, ···, Z_n, ···, Z_n), respectively. By using the function “coxph” in S-PLUS (or R) with the “weights” option, we obtain the estimator of β at the M-step from

coxph(Surv(T_{nk}, Δ) ~ Z_{nk}, weights = W_{nk}),

where the censoring indicator Δ = (1, ···, 1) is a vector of ones of length nk.
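Putting the E-step and this weighted coxph M-step together, one EM cycle can be sketched in R as follows. This is an illustration only: em_cox_lb is our name, the initialization is arbitrary, and the E-step weights implement the form displayed in (11) with the discrete approximation f_i(t_j) = λ_j e^{β′Z_i} exp{−Λ(t_j)e^{β′Z_i}}.

```r
library(survival)

em_cox_lb <- function(x, delta, z, maxit = 100, tol = 1e-6) {
  t   <- sort(unique(x)); k <- length(t)
  n   <- length(x);       tau <- max(t)
  jx  <- match(x, t)                        # column of X_i among t_1,...,t_k
  Tnk <- rep(t, times = n)                  # stacked times, subject by subject
  Znk <- rep(z, each = k)                   # Z_i repeated k times
  Dnk <- rep(1, n * k)                      # all pseudo-observations are events
  beta <- 0; lam <- rep(1 / k, k)
  for (it in seq_len(maxit)) {
    ## E-step: weights w_ij of (11) at the current (beta, lam)
    ebz <- exp(beta * z)
    Lam <- cumsum(lam)
    f   <- outer(ebz, lam) * exp(-outer(ebz, Lam))           # f_i(t_j)
    Srt <- t(apply(f, 1, function(r) rev(cumsum(rev(r)))))   # sum_{l>=j} f_il
    mu  <- as.vector(f %*% t)                                # mu(Z_i)
    W   <- sweep(f, 2, tau - t, "*") / mu                    # "ghost" term
    W   <- W + (1 - delta) * outer(x, t, "<=") * f / Srt[cbind(1:n, jx)]
    W[cbind(1:n, jx)] <- W[cbind(1:n, jx)] + delta           # delta_i I(X_i = t_j)
    ## M-step: weighted Cox fit for beta, then the closed form (12) for lam
    Wv  <- as.vector(t(W)); pos <- Wv > 0
    fit <- coxph(Surv(Tnk[pos], Dnk[pos]) ~ Znk[pos], weights = Wv[pos])
    beta_new <- unname(coef(fit))
    Rw  <- t(apply(W, 1, function(r) rev(cumsum(rev(r)))))   # sum_{l>=j} w_il
    lam <- colSums(W) / colSums(exp(beta_new * z) * Rw)
    if (abs(beta_new - beta) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  list(beta = beta, t = t, lambda = lam)
}
```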
Note that the algorithm computes β and λ iteratively through the EM steps, and the value of the expected complete-data log-likelihood ℓ_E(β, λ) increases with each EM step. More specifically, our EM algorithm falls within the general scheme of the ECM algorithm, a variation of the EM method proposed by Meng and Rubin (1993). The convergence of our EM algorithm to a local maximizer is guaranteed by the same conditions that ensure the convergence of the ECM algorithm, as proved in Meng and Rubin (1993). The uniqueness of the NPMLE is guaranteed by Assumption 5 in Appendix A.1.
4.3. Asymptotic Properties
In this section, we establish the strong consistency and asymptotic normality of the MLE under the regularity conditions listed in Appendix A.1. For the asymptotic results, we denote the MLE by (β̂n, Λ̂n) and let (β0, Λ0) be the true value. In Appendix A.4, we prove the strong consistency by the classical Kullback-Leibler information approach, which has been applied successfully to the NPMLE in traditional survival analysis (Murphy, 1994; Parner, 1998).
Theorem 1
Under the regularity conditions listed in Appendix A.1, the MLE (β̂n, Λ̂n) is consistent: β̂n converges to β0, and Λ̂n(t) converges to Λ0(t) almost surely and uniformly in t ∈ [0, τ] as n → ∞.
The computation of the MLE of Λ is based on the discretized version Λ̂n(t) = Σ_{t_j ≤ t} λ̂_j. The existence and uniqueness of the NPMLE can be proved based on the log-likelihood function ℓ_n(β, λ) in terms of {β, λ}, where λ ≡ {λ_1, …, λ_k} and

ℓ_n(β, λ) = Σ_{i=1}^n [δ_i {log λ(X_i) + β′Z_i} − Λ(X_i) e^{β′Z_i} − log μ(Z_i)],  (15)

with Λ(t) = Σ_{t_j ≤ t} λ_j and μ(Z) = ∫_0^{τ̂} exp{−Λ(u) e^{β′Z}} du. Let λ̂(·, β) be the maximizer of ℓ_n(β, λ) for given β. The existence and uniqueness of the NPMLE are guaranteed by Assumption 5: the information matrix of the profile likelihood evaluated at the true value β0 is positive definite.
Next, we apply the Z-theorem for infinite-dimensional estimating equations to prove the weak convergence of the estimators (van der Vaart and Wellner, 1996, Theorem 3.3.1, p. 310). The score equation for β is

U_1n(β, Λ) = n^{−1} Σ_{i=1}^n Z_i [δ_i − Λ(X_i) e^{β′Z_i} + ∫_0^τ Λ(u) e^{β′Z_i} exp{−Λ(u) e^{β′Z_i}} du/μ(Z_i)] = 0.  (16)

To obtain the MLE of Λ(·), consider a submodel defined by dΛ_η = (1 + ηh)dΛ, where h is a bounded and integrable function. Taking the derivative of ℓ_n(β, Λ_η) with respect to η, evaluating it at η = 0, and setting h(·) = 1(· ≤ t), we have the score equation for Λ,

U_2n(t, β, Λ) = n^{−1} Σ_{i=1}^n [N_i(t) − ∫_0^t Y_i(u) e^{β′Z_i} dΛ(u) + e^{β′Z_i} ∫_0^τ Λ(u ∧ t) exp{−Λ(u) e^{β′Z_i}} du/μ(Z_i)] = 0.  (17)
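The third term in (17) comes from differentiating −log μ(Z) along the submodel; a brief LaTeX sketch of that step, using μ(Z) as defined after (10):

```latex
% With dLambda_eta = (1 + eta h) dLambda and h(.) = 1(. <= t),
% Lambda_eta(u) = Lambda(u) + eta * Lambda(u ^ t), so
\frac{\partial}{\partial\eta}\Big|_{\eta=0}
  \left\{-\log \mu_{\beta,\Lambda_\eta}(Z)\right\}
= \frac{e^{\beta' Z}}{\mu_{\beta,\Lambda}(Z)}
  \int_0^{\tau} \Lambda(u \wedge t)\, e^{-\Lambda(u)e^{\beta' Z}}\, du .
```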
Denote the vector of score functions by U_n(·, β, Λ) ≡ {U_1n(β, Λ), U_2n(t, β, Λ)}, and its expectation under the true values (β0, Λ0) by U_0(·, β, Λ) ≡ {U_10(β, Λ), U_20(·, β, Λ)}, where U_10(β, Λ) = E_0{U_1n(β, Λ)} and U_20(t, β, Λ) = E_0{U_2n(t, β, Λ)}. Both the score function U_n and its expectation U_0 are defined on the parameter set 𝓑 × 𝓐, where the set 𝓑 is assumed to be compact in ℝ^p and the set 𝓐 consists of nondecreasing functions in the space of functions with bounded variation. Let ψ̂_n = (β̂_n, Λ̂_n), ψ = (β, Λ), and ψ_0 = (β_0, Λ_0).
By the definition of the MLE, U_n(·, ψ̂_n) = 0, and the true parameter ψ_0 satisfies U_0(·, ψ_0) = 0. In Appendix A.5, we prove a stochastic approximation for √n(U_n − U_0) in a neighborhood of ψ_0. Denote by U̇_{ψ0} the Fréchet derivative of the map U_0(·, ψ) evaluated at ψ_0; by the definition of the Fréchet derivative, U_0(·, ψ) − U_0(·, ψ_0) − U̇_{ψ0}(ψ − ψ_0) = o(||ψ − ψ_0||) as ψ → ψ_0.
The estimating function evaluated at ψ_0, √n U_n(·, ψ_0), is a sum of i.i.d. random quantities. We will prove by empirical process theory that √n U_n(·, ψ_0) converges weakly to 𝒵 = (𝒵_1, 𝒵_2), where 𝒵_1 is a Gaussian random vector and 𝒵_2 is a tight Gaussian process. The covariance matrix for 𝒵_1 is Σ_11 = E_0{U_1n(β_0, Λ_0)^{⊗2}}, and the covariance between 𝒵_2(s) and 𝒵_2(t) is Σ_22(s, t) = E_0{U_2n(s, β_0, Λ_0)U_2n(t, β_0, Λ_0)}. By the Z-theorem for infinite-dimensional estimating equations (van der Vaart and Wellner, 1996), we have
Theorem 2
Under the regularity conditions listed in Appendix A.1, √n(ψ̂_n − ψ_0) converges weakly to a tight, mean zero Gaussian process −U̇_{ψ0}^{−1}(𝒵).
Note that the asymptotic distribution of the sequence √n(ψ̂_n − ψ_0) is completely determined by the tightness of 𝒵 and its marginal covariance function. We characterize the Fréchet derivative U̇_{ψ0}, viewed as an operator on the parameter space 𝓑 × 𝓐. Define, for l = 0, 1, 2, the weight functions used in (19)–(23) below, where S_C(u | Z) = P(C ≥ u | Z) and μ_0(Z) = μ_{β0, Λ0}(Z).
By Assumption 5, the Fisher information of β for known Λ_0 is positive definite; this matrix is denoted by J_0.  (18)
Then the Fréchet derivative U̇_{ψ0} can be written in the following form:

U̇_{ψ0}(ψ − ψ_0) = (σ_11(β − β_0) + σ_12(Λ − Λ_0), σ_21(β − β_0) + σ_22(Λ − Λ_0)),  (19)

where the component operators σ_11, σ_12, σ_21, and σ_22 are obtained from the Gâteaux derivatives calculated in Appendix A.5.
We show the invertibility of U̇_{ψ0} by translating the operator into a Fredholm integral equation of the second kind (Tricomi, 1985, Chapter 2). We prove in Appendix A.5 that the inverse of U̇_{ψ0} exists and is continuous, with the form given in (20), where the functional Φ is defined there. We show in the appendix that Φ has an inverse Λ → Φ^{−1}(Λ), expressed in the following form as a function of t
(21)

where H(u, v) is the solution of the following integral equations

(22)

(23)
and with the notation for l = 0, 1,
By Theorem 2, √n(β̂_n − β_0) converges in distribution to a mean zero normal random vector characterized by

(24)

where the Gaussian process Φ^{−1}(𝒵_2) is obtained by applying (21) to 𝒵_2. Note that the stochastic integral is well defined via integration by parts, because the functions involved are of bounded variation on [0, τ].
Additionally, the process √n(Λ̂_n − Λ_0) converges weakly to a tight Gaussian process

(25)

where the processes appearing in (25) are determined by 𝒵 and the inverse operator U̇_{ψ0}^{−1}, as in Theorem 2.
If the baseline function Λ_0 were known, then √n(β̂_n − β_0) would converge in distribution to a Gaussian random variate with mean zero and the sandwich covariance matrix J_0^{−1} Σ_11 J_0^{−1}. Because of the variation associated with the profile-likelihood estimator Λ̂_n, the asymptotic variance of √n(β̂_n − β_0) is more complicated, with the extra terms indicated in (24). The variance-covariance matrix may be estimated by its empirical plug-in version, but the computation can be extremely complicated, as it requires solving the integral equation (22). We describe alternative methods for this computation in the following section.
4.4. Variance Estimation
Unlike the estimates of the regression coefficients themselves, the variance of the MLE β̂_n cannot be obtained directly from existing software such as R or SAS, because such software cannot incorporate the variation arising from the profile likelihood estimator. Instead, we can use bootstrapping techniques or the information matrix to estimate the variance of the estimators. When working with the observed full likelihood with unknown parameters (β, Λ), the total number of parameters has the same order as the number of observed distinct times, which often yields an information matrix of high dimension. Murphy and van der Vaart (1999) showed that the inverse of the information matrix for the profile likelihood provides valid variance estimates for the finite-dimensional parameters of interest, i.e., β̂_n, under semiparametric models. There is no general analytical formula for calculating the profile information matrix; thus we describe a numerical EM-aided differentiation approach (Murphy and van der Vaart, 1999; Chen and Little, 1999) to approximate it. Chen and Little (1999) also proved that the score function of the profile likelihood for the observed data equals, at the convergent point, the expected complete-data score function conditional on the observed data. Therefore, the second derivative of the log profile likelihood evaluated at the MLE β̂_n can be approximated by ∂²ℓ_E(β, λ(β))/∂β²|_{β=β̂_n}, which is the first derivative of the expected complete-data score function conditional on the observed data and profiled over λ(β) = (λ_1(β), ···, λ_k(β)) given by (12). By perturbation around the profile MLE β̂_n, the information matrix for β can be calculated as follows:
Step 1. Perturb the lth component of β̂_n = (β̂_1, ···, β̂_p) by a small value ε = 1/n in the neighborhood of β̂_l (in both directions). The perturbed estimator is denoted by β̂_{ε,l} = (β̂_1, ···, β̂_l ± ε, ···, β̂_p), where l = 1, ···, p.

Step 2. Approximate the lth row of the information matrix of β by the central difference of the expected complete-data score,

−[∂ℓ_E/∂β(β̂_{ε,l,+}, λ(β̂_{ε,l,+})) − ∂ℓ_E/∂β(β̂_{ε,l,−}, λ(β̂_{ε,l,−}))]/(2ε),

where the hazard function λ(β) is obtained from (12) using the M-step described in Section 4.2.
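A compact R sketch of this perturbation scheme, assuming a user-supplied function score_E(beta) that returns ∂ℓ_E/∂β at (β, λ(β)) from the M-step (both score_E and profile_info are hypothetical names):

```r
profile_info <- function(beta_hat, score_E, n, eps = 1 / n) {
  p <- length(beta_hat)
  I <- matrix(0, p, p)
  for (l in seq_len(p)) {
    b_plus  <- beta_hat; b_plus[l]  <- b_plus[l]  + eps
    b_minus <- beta_hat; b_minus[l] <- b_minus[l] - eps
    ## central difference of the profile score gives the lth row
    I[l, ] <- -(score_E(b_plus) - score_E(b_minus)) / (2 * eps)
  }
  (I + t(I)) / 2            # symmetrize; var-cov of beta-hat ~ solve(I)
}
```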
When estimating the variance of λ̂_n, the bootstrap approach can be used. This resampling approach is valid given the asymptotic normality established above for the MLEs β̂_n and λ̂_n. In this case, we obtain the variances of both β̂_n and λ̂_n.
5. SIMULATIONS
We performed simulation studies to evaluate the proposed methods and the corresponding EM algorithms in two settings: the nonparametric MLE with an increasing failure rate, and the profile likelihood estimators under the Cox proportional hazards model for length-biased data. We aimed to assess the small-sample accuracy and precision of our estimators, and to compare their performance with that of the existing methods in each setting. Each study comprised 1000 repetitions, with sample sizes of 200 and 400.
5.1. Estimating a Distribution Function with an Increasing Failure Rate
We generated independent pairs of (A, T̃), with failure times from a Weibull distribution, F(t) = 1 − exp{−(t/α_2)^{α_1}}, with α_1 = 2 and α_2 = 1, and truncation times from a uniform distribution to ensure the stationarity assumption. The specified Weibull distribution has an increasing failure rate. The censoring variables, measured from the examination time, were independently generated from uniform distributions.
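One way to generate such a sample by acceptance sampling is sketched below in R; the constants tau_a and cens_upper are our illustrative choices, as the paper does not report its exact uniform ranges.

```r
sim_lb <- function(n, tau_a = 4, cens_upper = 2) {
  x <- a <- delta <- numeric(n); kept <- 0
  while (kept < n) {
    Tt <- rweibull(1, shape = 2, scale = 1)  # unbiased failure time
    A  <- runif(1, 0, tau_a)                 # onset-to-entry (stationarity)
    if (Tt >= A) {                           # length-biased acceptance
      kept <- kept + 1
      V <- Tt - A                            # forward recurrence time
      C <- runif(1, 0, cens_upper)           # residual censoring time
      x[kept] <- A + min(V, C); a[kept] <- A
      delta[kept] <- as.numeric(V <= C)
    }
  }
  data.frame(x = x, a = a, delta = delta)
}
```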
Table 1 compares the performance of our proposed estimator with that of Tsai’s estimator (Tsai, 1988), denoted by F̂c. When F(t) is greater than 0.5, both estimators achieved excellent accuracy. In contrast, when F(t) is small, both had downward biases, which decreased with increasing sample size. As expected, the empirical standard deviations of our estimator were as much as 25% lower than those of Tsai’s estimator.
Table 1.
Summary Statistics of Simulations for the Estimated Distribution Function with Increasing Failure Rate
| Sample Size | C% | F(t) | F̂p(t): Est. | ESD | ESMSE | F̂c(t): Est. | ESD | ESMSE |
|---|---|---|---|---|---|---|---|---|
| 200 | 10% | 0.10 | 0.060 | 0.024 | 0.047 | 0.066 | 0.032 | 0.047 |
| 0.25 | 0.211 | 0.044 | 0.059 | 0.211 | 0.045 | 0.060 | ||
| 0.50 | 0.486 | 0.045 | 0.047 | 0.469 | 0.046 | 0.056 | ||
| 0.75 | 0.745 | 0.030 | 0.030 | 0.732 | 0.034 | 0.038 | ||
| 0.90 | 0.898 | 0.018 | 0.018 | 0.892 | 0.020 | 0.021 | ||
| 200 | 30% | 0.10 | 0.068 | 0.028 | 0.043 | 0.065 | 0.033 | 0.048 |
| 0.25 | 0.222 | 0.045 | 0.053 | 0.210 | 0.046 | 0.061 | ||
| 0.50 | 0.490 | 0.046 | 0.047 | 0.468 | 0.049 | 0.058 | ||
| 0.75 | 0.748 | 0.031 | 0.032 | 0.732 | 0.036 | 0.041 | ||
| 0.90 | 0.901 | 0.019 | 0.019 | 0.892 | 0.022 | 0.024 | ||
| 400 | 10% | 0.10 | 0.065 | 0.019 | 0.040 | 0.075 | 0.025 | 0.035 |
| 0.25 | 0.217 | 0.031 | 0.045 | 0.221 | 0.033 | 0.043 | ||
| 0.50 | 0.495 | 0.033 | 0.033 | 0.478 | 0.033 | 0.039 | ||
| 0.75 | 0.748 | 0.021 | 0.021 | 0.738 | 0.023 | 0.026 | ||
| 0.90 | 0.899 | 0.012 | 0.012 | 0.895 | 0.014 | 0.015 | ||
| 400 | 30% | 0.10 | 0.080 | 0.021 | 0.029 | 0.075 | 0.025 | 0.036 |
| 0.25 | 0.236 | 0.033 | 0.036 | 0.221 | 0.033 | 0.044 | ||
| 0.50 | 0.501 | 0.033 | 0.034 | 0.477 | 0.034 | 0.041 | ||
| 0.75 | 0.751 | 0.022 | 0.022 | 0.737 | 0.025 | 0.028 | ||
| 0.90 | 0.901 | 0.013 | 0.013 | 0.895 | 0.016 | 0.016 | ||
Note: F̂p(t) is the proposed estimator; F̂c(t) is Tsai’s conditional estimator; C% = censoring percentage; Est. = average of estimates; ESD = empirical standard deviation; ESMSE = empirical square root of the mean squared error = (bias² + ESD²)^{1/2}.
5.2. Estimating Regression Coefficients Under the Cox Model
We generated unbiased failure times T̃ from the proportional hazards model with two covariates, where β = (β_1, β_2) = (0.5, 1), the binary covariate Z_1 ~ Bernoulli(0.5), the continuous covariate Z_2 ~ Uniform(−0.5, 0.5), and the baseline hazard function is λ_0(t) = t. The censoring times C were independently generated either from uniform distributions or from the specified covariate-dependent distributions (see Table 2).
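As an illustration of this design (a sketch; tau_a is an assumed upper bound for the uniform onset times), uncensored failure times with baseline hazard λ_0(t) = t can be generated by inverting Λ(t | Z) = e^{β′Z} t²/2:

```r
sim_cox_lb <- function(n, beta = c(0.5, 1), tau_a = 5) {
  a <- time <- z1 <- z2 <- numeric(n); kept <- 0
  while (kept < n) {
    Z1 <- rbinom(1, 1, 0.5)
    Z2 <- runif(1, -0.5, 0.5)
    u  <- exp(beta[1] * Z1 + beta[2] * Z2)
    Tt <- sqrt(2 * rexp(1) / u)        # solves Lambda(t | Z) = Exp(1) draw
    A  <- runif(1, 0, tau_a)           # onset-to-entry time
    if (Tt >= A) {                     # keep only length-biased samples
      kept <- kept + 1
      a[kept] <- A; time[kept] <- Tt; z1[kept] <- Z1; z2[kept] <- Z2
    }
  }
  data.frame(a = a, time = time, z1 = z1, z2 = z2)  # censoring added afterwards
}
```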
Table 2.
Summary Statistics of Simulations for Estimating Regression Coefficients under Cox Model with β = (β1, β2) = (0.5, 1). Mean SE is the mean of the estimated standard errors
| Cohort Size | C% | Proposed: Est. | ESD | Mean SE | 95% CP | EE-I: Est. | ESD | EE-II: Est. | ESD | Tsai’s Method: Est. | ESD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 200 | 15% | (.49,.98) | (.11,.20) | (.11,.19) | (.96,.95) | (.50,1.01) | (.14,.25) | (.51,1.04) | (.13,.24) | (.51,1.01) | (.12,.22) |
| 30% | (.48,.94) | (.11,.21) | (.11,.20) | (.94,.93) | (.47,.91) | (.19, .33) | (.51,1.01) | (.16,.28) | (.50,1.01) | (.14,.25) | |
| 50% | (.46,.93) | (.12,.21) | (.12,.20) | (.93,.94) | (.42,.85) | (.23,.40) | (.51,1.02) | (.19,.34) | (.49,1.01) | (.18,.31) | |
| 400 | 15% | (.49,.98) | (.08,.14) | (.08,.14) | (.95,.95) | (.50,.99) | (.10,.18) | (.51,1.02) | (.09,.17) | (.52,1.01) | (.09,.15) |
| 30% | (.48,.97) | (.08,.15) | (.08,.14) | (.93,.93) | (.46,.94) | (.14,.24) | (.50,1.01) | (.11,.21) | (.50,1.00) | (.10,.17) | |
| 50% | (.48,.94) | (.08,.15) | (.08,.15) | (.94,.92) | (.42,.84) | (.18,.31) | (.51,1.02) | (.14,.24) | (.49,1.01) | (.13,.22) | |
| λc = t exp (0.5Z1 + 0.5Z2) |
| 200 | 30% | (.48,.95) | (.11,.20) | (.11,.20) | (.94,.93) | (.39,.89) | (.18,.31) | (.46,.97) | (.15,.27) | (.58,1.07) | (.13,.23) |
| 400 | 30% | (.48,.97) | (.08,.15) | (.08,.14) | (.96,.96) | (.39,.91) | (.13,.23) | (.45,.97) | (.11,.20) | (.58,1.08) | (.10,.17) |
| λc = t exp (Z2) | |||||||||||
| 200 | 30% | (.48,.96) | (.11,.20) | (.11,.20) | (.95,.94) | (.52,.82) | (.18,.31) | (.53,.92) | (.15,.28) | (.49,1.15) | (.14,0.24) |
| 400 | 30% | (.48,.96) | (.08,.15) | (.08,.14) | (.93,.93) | (.50,.80) | (.12,.22) | (.51,.91) | (.11,.19) | (.50,1.15) | (.10,.17) |
| C ~ Z1U (0, 1) + (1 − Z1)U (0, 1.8) | |||||||||||
| 200 | 30% | (.49,.96) | (.12,.21) | (.12,.20) | (.95,.93) | (.82,.92) | (.20,.33) | (.71,1.00) | (.17,.28) | (.41,1.02) | (.15,.25) |
| 400 | 30% | (.48,.96) | (.09,.14) | (.08,.14) | (.92,.95) | (.82,.89) | (.11,.22) | (.69,.97) | (.12,.19) | (.70,.96) | (.09,.17) |
For a light (15%) or moderate (30%) censoring percentage, the mean estimates of the coefficients agreed well with the true parameters, whether or not the censoring distribution depended on the covariates. Even with heavy censoring (50%), the inferences associated with the proposed method were fairly accurate for all of the scenarios investigated: the means of the estimated standard errors were close to the empirical standard errors, and the coverage of the 95% confidence intervals was reasonable, ranging from 92% to 96%.
In Table 2, we also compare the performance of the proposed MLE with that of the existing estimation methods for length-biased data. The estimating equation approaches of Qin and Shen (2010) are as follows:
where S_C(·) is the survival function of the censoring time C. Tsai (2009) proposed the pseudo-partial likelihood method, with estimating equations based on the score statistics.
When the censoring was independent of the covariates, all of the estimators had small biases under light censoring (15%). With moderate (30%) or heavy (50%) censoring, the biases associated with the MLE were much smaller than those associated with EE-I. The MLE method always exhibited clearly superior efficiency, with smaller empirical standard errors. For instance, based on a sample size of 200, the standard errors associated with the estimating equations were 1.12 to 1.62 times greater, and those associated with the pseudo-partial likelihood method 1.09 to 1.50 times greater, than those associated with the MLE method. When the censoring distribution depended on the covariates, the estimators obtained by EE-I, EE-II and EE-PL were biased compared with those obtained by the MLE method. In summary, the MLE approach is the most efficient of the four methods, and is also the most robust to various censoring mechanisms.
6. A REAL DATA EXAMPLE
Dementia is a progressive degenerative medical condition and one of the leading causes of death in the United States and Canada. The Canadian Study of Health and Aging was a multicenter epidemiologic study of dementia, in which 14,026 subjects aged 65 years or older were randomly chosen throughout Canada to receive an invitation to a health survey. A total of 10,263 subjects agreed to participate in the study (Wolfson et al., 2001). The participants were then screened for dementia, and 1132 people were identified as having the disease. The individuals with dementia were followed until their deaths or last follow-up dates in 1996, and their dates of dementia onset were ascertained from their medical records.
After excluding subjects with missing dates of disease onset or missing dementia subtype classifications, a total of 818 patients remained (393 with probable Alzheimer’s disease, 252 with possible Alzheimer’s disease, and 173 with vascular dementia). Other study variables included the approximate date of dementia onset, the date of screening for dementia, the date of death or censoring, and a death indicator. Given that the prevalent cases were ascertained cross-sectionally, Asgharian et al. (2006) validated the stationarity assumption, i.e., that the incidence of dementia did not change over the period of the study.
At the end of the study, 638 of the 818 patients had died and the others were right censored. Within this elderly cohort, it seems reasonable to assume that the overall death rate increases with age. Applying the NPMLE approaches described in Sections 2 and 3, we estimated the hazard function for each subtype of dementia, and plotted the survival of patients with probable Alzheimer’s disease, possible Alzheimer’s disease, and vascular dementia with and without the constraint of an increasing risk of death (see Figure 1). It is not surprising that the estimated survival curves with the constraint, i.e., with additional information, have narrower confidence intervals than the corresponding curves without the constraint. As pointed out by one referee, a monotone hazard constraint may not hold for death due to dementia, particularly for patients with vascular dementia, because of the uncertainty associated with the cause of death.
Figure 1.
Estimated survival curves according to subtypes of dementia, with and without the constraint of increasing risk of death with age
Using the diagnosis subtype of possible Alzheimer’s disease as the baseline group, we defined two indicator variables for the other two subtypes of dementia in the Cox proportional hazards model. Applying the method proposed in Section 4, the estimated covariate effects of the two subtypes of dementia and their standard errors are listed in Table 3. The results showed that the long-term survival distributions were statistically significantly different between the group with vascular dementia and the group with possible Alzheimer’s disease, and marginally different between the group with probable Alzheimer’s disease and the group with possible Alzheimer’s disease. We also analyzed the same data set using the estimating equation methods (EE-I and EE-II) of Qin and Shen (2010) and the pseudo-partial likelihood method (EE-PL) of Tsai (2009). The estimated coefficients and associated standard errors obtained by both EE-II and EE-PL indicated no statistically significant survival differences among the three subtypes of dementia. The results from EE-I suggested a statistically significant survival difference between the group with vascular dementia and the group with possible Alzheimer’s disease, but no statistically significant survival difference between the group with vascular dementia and the group with probable Alzheimer’s disease. The discrepancy in the inferences is most likely caused by the loss in efficiency when using the estimating equation or pseudo-partial likelihood methods rather than the MLE method.
Table 3.
Estimates (Standard Errors) of Regression Coefficients Using Length-biased Adjusted Methods for Dementia Data.
| MLE | EE-I | EE-II | EE-PL | |
|---|---|---|---|---|
| Probable Alzheimer | 0.125 (0.062) | 0.109 (0.092) | 0.134 (0.091) | 0.064(0.081) |
| Vascular Dementia | 0.185 (0.077) | 0.245 (0.110) | 0.208 (0.110) | 0.164(0.111) |
7. CONCLUDING REMARKS
We have proposed new EM algorithms for length-biased data to obtain full maximum likelihood estimators under three settings, where the missing-data mechanism in the EM algorithm is the left truncation of the length-biased data. In contrast to Vardi’s (1989) EM algorithm for estimating nonparametric survival distributions, the advantage of the new EM algorithm is that one can directly estimate the nonparametric survival distribution or hazard function of the unbiased failure time T̃.
One major challenge for maximum likelihood estimation involving infinite-dimensional parameters is computational intractability. We have implemented the new EM algorithm together with the profile likelihood method for jointly estimating the baseline hazard function and the covariate coefficients under the Cox regression model for length-biased data. Commercially available statistical software for the Cox model can be adapted for easy computation. The EM algorithm is not computationally intensive even with continuous covariates, since the w_ij and λ_j can be obtained easily from closed-form expressions.
Similar to the NPMLE for traditional survival data, the proposed method requires the observation of at least one failure time for the large sample properties to hold in the length-biased settings. As shown in our empirical studies, the proposed computational algorithms for solving the MLE perform well in terms of accuracy, and are more efficient than the existing estimating equation approaches, which are in turn more efficient than the conditional approach (Wang et al., 1993). Without assuming a known parametric distribution for Z, as in Bergeron et al. (2008), maximizing the likelihood function (1) is as efficient as maximizing the full likelihood that includes the marginal distribution of Z.
Parallel to the observation of Zeng and Lin (2007) for traditional survival data, the estimators obtained from the MLE are much more robust and efficient than those from the estimating equation approaches, and are well suited to the proportional hazards regression model and other nonparametric estimation problems for length-biased right-censored data. The proposed EM algorithm may be generalized further to other semiparametric models, and tools for model checking should be developed for length-biased right-censored data.
Supplementary Material
Acknowledgments
This research was partially supported by National Institute of Health grant R01-CA079466.
We thank one Associate Editor and two Referees for their very constructive comments. We also thank Professor Masoud Asgharian and investigators of the Canadian Study of Health and Aging (CHSA) for providing us the dementia data from CHSA. The data reported in the example were collected as part of the CHSA. The core study was funded by the Seniors' Independence Research Program, through the National Health Research and Development Program of Health Canada (Project no.6606-3954-MC(S)). Additional funding was provided by Pfizer Canada Incorporated through the Medical Research Council/Pharmaceutical Manufacturers Association of Canada Health Activity Program, NHRDP Project 6603-1417-302(R), Bayer Incorporated, and the British Columbia Health Research Foundation Projects 38 (93-2) and 34 (96-1). The study was coordinated through the University of Ottawa and the Division of Aging and Seniors, Health Canada.
A APPENDIX
A.1 Assumptions
Denote the Euclidean norm by |·|; for a vector z = (z_1, …, z_p)′, |z| ≡ (|z_1|² + ··· + |z_p|²)^{1/2}. To avoid measurability issues, probability statements are understood in terms of outer probability (van der Vaart and Wellner, 1996). We adopt the convention that 0/0 = 0. We assume the following regularity conditions:
Assumption 1. The true hazard function λ_0(·) is continuously differentiable. In addition, the upper bound τ of the support of the cumulative hazard function Λ_0 is finite.

Assumption 2. The parameter β is in a compact set 𝓑 that contains β_0. The parameter set 𝓐 for the baseline function contains all nondecreasing functions Λ satisfying Λ(0) = 0 and Λ(τ) < ∞.

Assumption 3. The residual censoring time C satisfies P(C > V) > 0, and its survival function S_C(·) is continuous.

Assumption 4. For the covariate vector Z, the terms E_0|Z|² and E_0|e^{β′Z}| are bounded.

Assumption 5. The information matrix −∂²E[ℓ_n(β, λ̂(·, β))]/∂β² evaluated at the true value β_0 is positive definite.

Assumption 6. If P(b′Z = c_0) = 1 for some constant c_0, then b = 0.
Assumptions 1 and 3 imply that the bivariate function Q(t, u) defined in (23) is continuous on [0, τ] × [0, τ]. Assumption 3 implies that censoring may occur after V. Assumption 5 is a classical condition in the study of the Cox model for traditional survival data (Andersen et al., 1992, Condition VII.2.1(e), page 497). It implies that the matrix J_0 defined in (18), the information matrix for β when the baseline function Λ_0 is known, is positive definite. The positive definiteness of J_0 can also be implied by the identifiability of the model (Rothenberg, 1971). Assumption 6 means that there is no collinearity among the covariates, which ensures model identifiability.
A.2 Proof of the lemma on the convergence of τ̂ = tk to τ
Recall that X_i = min(A_i + V_i, A_i + C_i), and t_k = max{X_1, ···, X_n}. For any arbitrarily small ε > 0, P(t_k ≤ τ − ε) = {P(X_1 ≤ τ − ε)}^n = {1 − P(X_1 > τ − ε)}^n, where the tail probability P(X_1 > τ − ε) can be evaluated via the mean value theorem. Given the above equation, for any arbitrarily small η > 0, P{n^η(τ − t_k) ≥ ε} = {1 − P(X_1 > τ − εn^{−η})}^n, where τ − εn^{−η} ≤ ξ ≤ τ denotes the intermediate point from the mean value theorem. Thus, as long as E[C] > 0, which is implied by Assumption 3, we have w(ξ) → w(τ) > 0 and f(ξ) → f(τ) > 0 as n → ∞, since τ is the upper bound for the support of the population time and Λ(τ) < ∞. Therefore, when n → ∞, P{n^η(τ − t_k) ≥ ε} → 0, which implies that τ̂ → τ at a rate faster than n^{1/2} but slower than n (i.e., for 1/2 < η < 1, the above probability converges to zero).
Note that by the Borel-Cantelli lemma, τ̂_n → τ in probability implies that for every subsequence there is a further subsequence {n′} such that τ̂_{n′} → τ almost surely (Ferguson, 1996, page 8). This, combined with the fact that τ can be consistently estimated by τ̂_n, completes the proof of the strong consistency of Λ̂_n. As τ̂_n ≤ τ, these facts also imply that the mean μ of the population failure time T̃ can be consistently estimated by μ̂ = π̂τ̂_n = ∫ t dF̂(t).
A.3 Consistency of Λ̂n with increasing failure rate
Let ||·||τ denote the supremum norm over [0, τ]. We have the following strong consistency result for the NPMLE Λ̂n(·) proposed in Section 3:
Proposition 1
Suppose Λ is convex on its support [0, τ]. Under the regularity conditions, ||Λ̂_n − Λ||_τ → 0 almost surely as n → ∞.
We adapt the proof of Huang and Wellner (1995). Let ε_n = ||Λ_n − Λ||_τ. As argued in Asgharian and Wolfson (2005), ε_n → 0 almost surely as n → ∞. Since Λ is convex on [0, τ], it must be continuous on [0, τ]. The function Λ − ε_n is convex and is a minorant of Λ_n, i.e., Λ(s) − ε_n ≤ Λ_n(s) for all 0 ≤ s ≤ τ. By the definition of Λ̂_n as the greatest convex minorant, we have, for all 0 ≤ s ≤ τ,

Λ(s) − ε_n ≤ Λ̂_n(s) ≤ Λ_n(s).
It follows that −εn ≤ Λ̂n(s) − Λ(s) ≤ Λn(s) − Λ(s) ≤ εn for all 0 ≤ s ≤ τ. The conclusion of the proposition follows as ||Λ̂n − Λ||τ ≤ εn goes to 0 almost surely.
A.4 Consistency: Proof of Theorem 1
Note that the log-likelihood function ℓ_n(β, λ) is strictly concave in λ, as each term of ℓ_n(β, λ) is concave or strictly concave in λ and a sum of concave functions is concave. Hence, for each β in the compact set 𝓑, we can find a unique maximizer λ̂(·, β) of the likelihood function ℓ_n(β, λ). The existence of the NPMLE for {β, λ} then follows from the compactness of 𝓑 and the continuity in β of the profile likelihood ℓ_n(β, λ̂(·, β)). The uniqueness of the NPMLE for large samples is guaranteed by Assumption 5.
The technical details of the consistency proof are similar to those of Murphy (1995) or Parner (1998). We provide only a sketch of the proof. As the MLE (β̂n, Λ̂n) maximizes the log-likelihood function ℓn, the method is to use the empirical Kullback-Leibler distance ℓn(β̂n, Λ̂n) − ℓn(β0, Λ0), which must always be nonnegative. If (β̂n, Λ̂n) converges at all, say, to (β*, Λ*), then ℓn(β̂n, Λ̂n) − ℓn(β0, Λ0) must converge to the negative Kullback-Leibler distance between Pβ*; Λ* and Pβ0, Λ0 by the strong law of large numbers, where Pβ; Λ is the probability measure under the parameter (β, Λ). The Kullback-Leibler distance between Pβ*; Λ* and Pβ0, Λ0 therefore must be zero, and we conclude that Pβ*; Λ* = Pβ0, Λ0 almost surely. It then follows by model identifiability that β* = β0 and Λ* = Λ0.
We need to find, for any subsequence of (β̂_n, Λ̂_n), a further convergent subsequence. The first step is to show that (β̂_n, Λ̂_n) stays bounded. As β̂_n is in a compact set, it must stay bounded. Because (β̂_n, Λ̂_n) maximizes the likelihood function, ℓ_n(β̂_n, Λ̂_n) − ℓ_n(β̄, Λ̄) ≥ 0 for each (β̄, Λ̄) in the parameter set. Recall that τ̂ = t_k. We show that Λ̂_n(τ̂) stays bounded by the method of contradiction. Suppose that Λ̂_n(τ̂) diverges. Then we can construct some sequence {β̄_n, Λ̄_n} such that the empirical Kullback-Leibler distance ℓ_n(β̂_n, Λ̂_n) − ℓ_n(β̄_n, Λ̄_n) would become negative infinity; this is a contradiction, as the Kullback-Leibler distance is always nonnegative. The construction of the contradiction is along the same lines as in Murphy (1994). Briefly, we choose β̄_n = β_0 and define Λ̄_n as in

(26)
It can be shown easily that Λ̄n converges to Λ0 almost surely and uniformly in t. By a technical argument similar to that of Murphy (1995), we can show that ℓn(β̂n, Λ̂n) − ℓn(β̄n, Λ̄n) → −∞ as n → ∞. This is impossible so Λ̂n must stay bounded.
As Λ̂_n stays bounded, we can apply Helly’s selection principle to find a convergent subsequence of (β̂_{n_k}, Λ̂_{n_k}) for an arbitrary subsequence of the original sequence indexed by {1, ···, n}. By the strong law of large numbers and the classical Kullback-Leibler information approach, such a convergent subsequence must converge to (β_0, Λ_0). Since for any given subsequence {n_k} we can identify a further subsequence of (β̂_{n_k}, Λ̂_{n_k}) that converges to (β_0, Λ_0), Helly’s selection theorem implies that the entire sequence (β̂_n, Λ̂_n(t)) must converge to (β_0, Λ_0(t)) for each t ∈ [0, τ]. Because Λ_0(·) is monotone and continuous, the pointwise convergence of Λ̂_n(t) on [0, τ] is also uniform in t. The convergence holds almost surely, as the proof is carried out for a fixed ω in the underlying probability space Ω, with the law of large numbers invoked only countably many times.
A.5 Asymptotic Normality: Proof of Theorem 2
We prove the asymptotic normality by the Z-theorem for infinite-dimensional estimating equations (van der Vaart and Wellner, 1996, Theorem 3.3.1, page 310). This approach has been applied successfully by Murphy (1995, Theorem 1) and Parner (1998, Theorem 2), among many others. The proof requires the confirmation of the three main conditions of the Z-theorem: Fréchet differentiability of U_0, weak convergence of √n U_n(·, ψ_0), and a stochastic approximation for the estimating equations, which we outline below.
Fréchet Derivative and its Invertibility
We first show that the population estimating equation U_0 is Fréchet differentiable and that its Fréchet derivative is continuously invertible. Recall that U_0(·, β, Λ) = (U_10(β, Λ), U_20(·, β, Λ)), the expectation of U_n under (β_0, Λ_0).
The Fréchet derivative can be calculated from the Gâteaux variations of U_0(β, Λ) at (β_0, Λ_0); that is, we differentiate U_0(β_η, Λ_η) with respect to η and evaluate at η = 0, where the submodels are β_η = β_0 + ηβ and Λ_η = Λ_0 + ηΛ.
The Gâteaux derivatives of U_20(t, β, Λ) and of U_10(β, Λ), evaluated at (β_0, Λ_0), are computed directly from these submodels.
To obtain the weak convergence results, we need to strengthen the Gâteaux differentiability to Fréchet differentiability, essentially for the proof of tightness (van der Vaart and Wellner, 1996, page 310). The Fréchet differentiability of U_0(β, Λ) can be confirmed by definition; its derivative has the form in (19). Note that the operator U̇_{ψ0} is a continuous linear operator defined on the parameter space in the product of ℝ^p and the Banach space L_2[0, τ]. If the inverse operator exists, then it must be continuous by Banach’s continuous inverse theorem (Zeidler, 1995, page 179). Hence, to prove the continuous invertibility of U̇_{ψ0}, we only need to show the existence of the inverse operator U̇_{ψ0}^{−1}.
To show the existence of the inverse operator U̇_{ψ0}^{−1}, it suffices, by the formula in (20), to show that σ_11 and Φ are invertible. The operator σ_11 is linear, with the matrix J_0 defined in (18) being the Fisher information for β for known Λ_0. By Assumption 5, the matrix J_0 has an inverse, and hence σ_11 is invertible. The operator Φ has the following form:
The invertibility of Φ is equivalent to showing that there exists a unique solution to the equation Φ(Λ) = Λ̃ for any function Λ̃ of bounded variation. Taking the derivative with respect to t on both sides of the equation yields

(27)

where Q(t, u) is defined previously in (23). We observe that the integral equation (27) is a Fredholm equation of the second kind. By Assumptions 1 and 3, the bivariate function Q(t, u) defined in (23) is continuous on [0, τ] × [0, τ]. Also by Assumption 3, the function involved is continuous and bounded away from 0 for t > 0. By the classical theory of integral equations (Tricomi, 1985, Chapter 2), there is a unique solution dΛ(t) to the Fredholm integral equation (27), characterized by the kernel H(t, u) satisfying equation (22). Finally, the invertibility of the functional Φ follows, and its inverse operator Φ^{−1}(Λ) has the form expressed in (21).
Weak Convergence of √n U_n(·, β_0, Λ_0)

As the true value (β_0, Λ_0) satisfies U_0(β_0, Λ_0) = 0, the estimating function √n U_n(·, β_0, Λ_0) is centered. By the multivariate central limit theorem for sums of independently and identically distributed (i.i.d.) random vectors, √n U_1n(β_0, Λ_0) converges in law to 𝒵_1, provided that the second moment is finite. The process √n U_2n(·, β_0, Λ_0) is a sum of i.i.d. processes of bounded variation on [0, τ]. By a lemma for the central limit theorem for processes of bounded variation (van der Vaart and Wellner, 1996, Example 2.11.16), it converges to a tight Gaussian process, 𝒵_2, provided that the second moment is finite. The weak convergence of √n U_n(·, β_0, Λ_0) then follows by the continuous mapping theorem.
Stochastic Approximation
To apply the Z-theorem for infinite-dimensional estimating equations (van der Vaart and Wellner, 1996, Theorem 3.3.1), we need to confirm the stochastic approximation condition displayed later in this subsection.
The estimating functions are defined on ℬ × 𝒜, where, by Assumption 2, the set ℬ is compact and contains β0, and 𝒜 is a set of nondecreasing functions such that each Λ ∈ 𝒜 satisfies Λ(0) = 0 and Λ(τ) < ∞; hence 𝒜 contains Λ0. To apply the Z-theorem, we need Λ to range over the closed linear subspace lin 𝒜 generated by the set 𝒜. The subspace lin 𝒜 is viewed in the space of functions of bounded variation on [0, τ] endowed with the variation norm ‖·‖v, defined by the total variation of Λ on [0, τ], that is,

\[
\|\Lambda\|_v = \sup \sum_{j=1}^{m'} \bigl| \Lambda(s_j) - \Lambda(s_{j-1}) \bigr|,
\]

where the supremum is taken over all finite partitions {0 = s0 < s1 < ⋯ < sm′ = τ} of [0, τ].
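As a quick check of this norm on the monotone class 𝒜 itself: for a nondecreasing Λ with Λ(0) = 0, every partition sum telescopes, so

\[
\|\Lambda\|_v = \Lambda(\tau) - \Lambda(0) = \Lambda(\tau) < \infty.
\]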
We first derive the score functions for β and Λ(·) contributed by a single observation, denoted generically by 𝒪. We keep 𝒪 in the notation for the score functions to emphasize their dependence on the data. By straightforward calculation, the score function for β is
(28)
For the infinite-dimensional parameter Λ(·), consider a submodel defined by dΛη = (1 + ηh)dΛ, where h is a bounded and integrable function. By taking the derivative of ℓn(β, Λη) with respect to η and evaluating it at η = 0, we have the score operator for Λ
(29)
Taking h(·) = 1(· ≤ t) in (29), we have the score function for Λ
(30)
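To see how the submodel calculation in (29) and (30) works in a transparent special case, consider the classical Cox log-likelihood for a single right-censored observation (Y, δ, Z) without length bias, ℓ(β, Λ) = δ log dΛ(Y) + δβ′Z − eβ′ZΛ(Y); this is a simplified illustration only, since the full likelihood here carries additional length-bias terms. Under dΛη = (1 + ηh)dΛ,

\[
\frac{\partial}{\partial \eta}\, \ell(\beta, \Lambda_\eta) \Big|_{\eta = 0}
= \delta\, h(Y) - e^{\beta' Z} \int_0^{Y} h(u)\, d\Lambda(u),
\]

and the choice h(·) = 1(· ≤ t) yields the familiar score δ1(Y ≤ t) − eβ′Z Λ(Y ∧ t).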
We can write the estimating functions as empirical averages of these score functions. Let ℓ̇(t, 𝒪, ψ) = {ℓ̇1(𝒪, ψ), ℓ̇2(t, 𝒪, ψ)}, where ψ = (β, Λ). We next introduce some notation from empirical process theory. Let ℙn be the empirical probability measure; then Un(t, β, Λ) = ℙn ℓ̇(t, 𝒪, ψ). Denote the empirical process by 𝔾n = √n(ℙn − P0), where P0 is the expectation under ψ0. Note that √n(Un − U0)(ψ) is the empirical process 𝔾n indexed by the class of functions {ℓ̇(t, 𝒪, ψ): ψ ∈ ℬ × lin 𝒜, t ∈ [0, τ]}.
Let the norm ‖·‖ on ℬ × lin 𝒜 be defined by ‖(β, Λ)‖ = |β| + ‖Λ‖v. Then the stochastic condition that we want to confirm is

\[
\sqrt{n}\,(U_n - U_0)(\hat\beta, \hat\Lambda) - \sqrt{n}\,(U_n - U_0)(\beta_0, \Lambda_0)
= o_P\bigl( 1 + \sqrt{n}\, \| (\hat\beta, \hat\Lambda) - (\beta_0, \Lambda_0) \| \bigr).
\]
To apply the functional central limit theory, we show that the class of functions {ℓ̇(t, 𝒪, ψ) − ℓ̇(t, 𝒪, ψ0): ‖ψ − ψ0‖ < δ, t ∈ [0, τ]} is P0-Donsker. As the component ℓ̇1 can be treated by an analogous and simpler argument, we only need to show that {ℓ̇2(t, 𝒪, ψ) − ℓ̇2(t, 𝒪, ψ0): ‖ψ − ψ0‖ < δ, t ∈ [0, τ]} is P0-Donsker, where ℓ̇2(t, 𝒪, ψ) is the score function for Λ given in (30).
First, the class of functions {exp(β′Z): β ∈ ℬ} is P0-Donsker, as it is indexed by a compact set of finite dimension. The class of functions of bounded variation on [0, τ] is P0-Donsker (van der Vaart, 1998, page 273). As each function Λ(t) is of bounded variation on [0, τ], the class of functions combining Λ(t) with exp(β′Z), indexed by t ∈ [0, τ], β ∈ ℬ, and Λ ∈ 𝒜, is P0-Donsker. Similarly, we can show that the remaining classes of functions entering ℓ̇2, indexed by t ∈ [0, τ], β ∈ ℬ, and Λ ∈ 𝒜, and the class {μZ(β, Λ): β ∈ ℬ, Λ ∈ 𝒜} are P0-Donsker. Note that their envelope functions are (τ − u)eβ′ZΛ(τ) and τ < ∞, respectively. By Assumption 4, (τ − u)2Λ2(τ)E0e2β′Z < ∞ for all β ∈ ℬ and Λ ∈ 𝒜. The verification rests on the permanence properties of Donsker classes: sums, products, and Lipschitz transformations of suitably bounded P0-Donsker classes remain P0-Donsker.
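To illustrate the permanence argument in a self-contained way (a generic sketch; the classes named here are simplified stand-ins for those appearing in ℓ̇2), suppose ℱ and 𝒢 are uniformly bounded P0-Donsker classes. Then {f + g}, {fg}, and {φ ∘ f} for a fixed Lipschitz map φ are again P0-Donsker (van der Vaart and Wellner, 1996, Section 2.10). For instance, with ℱ = {eβ′Z: β ∈ ℬ} for bounded Z and 𝒢 = {Λ(t): Λ ∈ 𝒜, t ∈ [0, τ]} with Λ(τ) uniformly bounded,

\[
\bigl\{ \Lambda(t)\, e^{\beta' Z} \bigr\} \quad \text{and} \quad \bigl\{ \exp\bigl( -\Lambda(t)\, e^{\beta' Z} \bigr) \bigr\}
\]

are P0-Donsker, the latter because x ↦ e−x is Lipschitz on [0, ∞).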
Furthermore, as ‖ψ − ψ0‖ → 0, ℓ̇2(t, 𝒪, ψ) converges to ℓ̇2(t, 𝒪, ψ0) for each t. The convergence also holds in second moment by the dominated convergence theorem. It follows that

\[
\sup_{t \in [0, \tau]} P_0 \bigl\{ \dot{\ell}_2(t, \mathcal{O}, \psi) - \dot{\ell}_2(t, \mathcal{O}, \psi_0) \bigr\}^2 \to 0
\quad \text{as } \|\psi - \psi_0\| \to 0.
\]

The stochastic approximation condition now follows by a technical lemma of van der Vaart and Wellner (1996, Lemma 3.3.5, page 311).
Footnotes
Jumps for the NPMLE of the baseline cumulative hazard function for length-biased data: a detailed description of why the NPMLE has jumps at both censored and uncensored times for length-biased data.
Contributor Information
Jing Qin, Email: jingqin@niaid.nih.gov, Biostatistics Research Branch, National Institute of Allergy and Infectious Diseases, NIH Bethesda, Maryland 20892, USA, Phone: 301-451-2436.
Jing Ning, Division of Biostatistics, The University of Texas Health Science Center at Houston, School of Public Health, Houston, Texas 77030, USA.
Hao Liu, Division of Biostatistics, Dan L. Duncan Cancer Center, Baylor College of Medicine, Houston, Texas 77030, USA.
Yu Shen, Department of Biostatistics, The University of Texas M. D. Anderson Cancer Center, Houston, Texas 77030, USA.
References
- Andersen PK, Borgan O, Gill RD, Keiding N. Statistical Models Based on Counting Processes. New York: Springer; 1992.
- Asgharian M, M'Lan CE, Wolfson DB. Length-biased Sampling with Right Censoring: an Unconditional Approach. Journal of the American Statistical Association. 2002;97:201–209.
- Asgharian M, Wolfson DB. Asymptotic Behavior of the Unconditional NPMLE of the Length-biased Survivor Function From Right Censored Prevalent Cohort Data. The Annals of Statistics. 2005;33:2109–2131.
- Asgharian M, Wolfson DB, Zhang X. Checking Stationarity of the Incidence Rate Using Prevalent Cohort Survival Data. Statistics in Medicine. 2006;25:1751–1767. doi: 10.1002/sim.2326.
- Barlow RE, Proschan F. Statistical Theory of Reliability. New York: Holt, Rinehart & Winston; 1975.
- Bergeron P-J, Asgharian M, Wolfson DB. Covariate Bias Induced by Length-biased Sampling of Failure Times. Journal of the American Statistical Association. 2008;103:737–742.
- Bickel PJ, Ritov Y. Efficient Estimation Using both Direct and Indirect Observations. Theory of Probability and its Applications. 1994;38:194–213.
- Breslow N. Contribution to the Discussion of the Paper by D. R. Cox. Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
- Chen HY, Little RJA. Proportional Hazards Regression with Missing Covariates. Journal of the American Statistical Association. 1999;94:896–908.
- Cox DR. Regression Models and Life Tables (with Discussion). Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
- Cox DR. Partial Likelihood. Biometrika. 1975;62:269–276.
- Cox DR, Miller HD. The Theory of Stochastic Processes. London: Chapman and Hall; 1977.
- De Uña Álvarez J, Otero-Giraldez MS, Alvarez-Llorente G. Estimation Under Length-bias and Right-censoring: An Application to Unemployment Duration Analysis for Married Women. Journal of Applied Statistics. 2003;30:283–291.
- Dewanji A, Kalbfleisch JD. Estimation of Sojourn Time Distributions for Cyclic Semi-Markov Processes in Equilibrium. Biometrika. 1987;74:281–288.
- Elbers C, Ridder G. True and Spurious Duration Dependence: the Identifiability of the Proportional Hazard Model. The Review of Economic Studies. 1982;49:403–409.
- Ferguson TS. A Course in Large Sample Theory. London: Chapman & Hall; 1996.
- Gail MH, Benichou J. Encyclopedia of Epidemiologic Methods. Wiley; 2000.
- Gordis L. Epidemiology. Philadelphia, PA: W. B. Saunders Company; 2000.
- Huang J, Wellner JA. Estimation of a Monotone Density or Monotone Hazard under Random Censoring. Scandinavian Journal of Statistics. 1995;22:3–33.
- Kalbfleisch JD, Lawless JF. Regression Models for Right Truncated Data with Applications to AIDS Incubation Times and Reporting Lags. Statistica Sinica. 1991;1:19–32.
- Kalbfleisch JD, Prentice RL. Marginal Likelihoods Based on Cox's Regression and Life Model. Biometrika. 1973;60:267–278.
- Keiding N. Age-specific Incidence and Prevalence: a Statistical Perspective (with Discussion). Journal of the Royal Statistical Society, Series A. 1991;154:371–412.
- Klein J. Semiparametric Estimation of Random Effects Using the Cox Model Based on the EM Algorithm. Biometrics. 1992;48:795–806.
- Kvam P. Length Bias in the Measurements of Carbon Nanotubes. Technometrics. 2008;50:462–467.
- Lancaster T. The Econometric Analysis of Transition Data. Cambridge: Cambridge University Press; 1990.
- Marshall AW, Proschan F. Maximum Likelihood Estimation for Distributions with Monotone Failure Rate. The Annals of Mathematical Statistics. 1965;36:69–77.
- McClean S, Devine C. A Nonparametric Maximum Likelihood Estimator for Incomplete Renewal Data. Biometrika. 1995;82:791–803.
- Meng X-L, Rubin DB. Maximum Likelihood Estimation via the ECM Algorithm: a General Framework. Biometrika. 1993;80:267–278.
- Murphy SA. Consistency in a Proportional Hazards Model Incorporating a Random Effect. The Annals of Statistics. 1994;22:712–731.
- Murphy SA. Asymptotic Theory for the Frailty Model. The Annals of Statistics. 1995;23:182–198.
- Murphy SA, van der Vaart AW. Observed Information in Semi-parametric Models. Bernoulli. 1999;5:381–412.
- Murphy SA, van der Vaart AW. On Profile Likelihood (with Discussion). Journal of the American Statistical Association. 2000;95:449–485.
- Nielsen GG, Gill RD, Andersen PK, Sorensen TIA. A Counting Process Approach to Maximum Likelihood Estimation in Frailty Models. Scandinavian Journal of Statistics. 1992;19:25–43.
- Padgett WJ, Wei LJ. Maximum Likelihood Estimation of a Distribution Function with Increasing Failure Rate Based on Censored Observations. Biometrika. 1980;67:470–474.
- Parner E. Asymptotic Theory for the Correlated Gamma-frailty Model. The Annals of Statistics. 1998;26:183–214.
- Qin J, Shen Y. Statistical Methods for Analyzing Right-censored Length-biased Data under Cox Model. Biometrics. 2010;66:382–392. doi: 10.1111/j.1541-0420.2009.01287.x.
- Rothenberg TJ. Identification in Parametric Models. Econometrica. 1971;39:577–591.
- Sansgiry P, Akman O. Transformations of the Lognormal Distribution as a Selection Model. The American Statistician. 2000;54:307–309.
- Scheike TH, Keiding N. Design and Analysis of Time-to-pregnancy. Statistical Methods in Medical Research. 2006;15:127–140. doi: 10.1191/0962280206sm435oa.
- Simon R. Length-biased Sampling in Etiologic Studies. American Journal of Epidemiology. 1980;111:444–452. doi: 10.1093/oxfordjournals.aje.a112920.
- Terwilliger J, Shannon W, Lathrop G, Nolan J, Goldin L, Chase G, Weeks D. True and False Positive Peaks in Genomewide Scans: Applications of Length-biased Sampling to Linkage Mapping. American Journal of Human Genetics. 1997;61:430–438. doi: 10.1086/514855.
- Tricomi FG. Integral Equations. New York: Dover Publications; 1985.
- Tsai WY. Estimation of the Survival Function with Increasing Failure Rate Based on Left Truncated and Right Censored Data. Biometrika. 1988;75:319–324.
- Tsai WY. Pseudo-partial Likelihood for Proportional Hazards Models with Biased-sampling Data. Biometrika. 2009;96:601–615. doi: 10.1093/biomet/asp026.
- Turnbull BW. The Empirical Distribution Function with Arbitrarily Grouped, Censored and Truncated Data. Journal of the Royal Statistical Society, Series B. 1976;38:290–295.
- van der Vaart AW. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge, UK: Cambridge University Press; 1998.
- van der Vaart AW, Wellner JA. Existence and Consistency of Maximum Likelihood in Upgrade Mixture Models. Journal of Multivariate Analysis. 1992;43:133–146.
- van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes: With Applications to Statistics. New York: Springer-Verlag; 1996.
- Vardi Y. Nonparametric Estimation in the Presence of Length Bias. The Annals of Statistics. 1982;10:616–620.
- Vardi Y. Multiplicative Censoring, Renewal Processes, Deconvolution and Decreasing Density: Nonparametric Estimation. Biometrika. 1989;76:751–761.
- Vardi Y, Zhang CH. Large Sample Study of Empirical Distributions in a Random-Multiplicative Censoring Model. The Annals of Statistics. 1992;20:1022–1039.
- Wang MC. Nonparametric Estimation From Cross-sectional Survival Data. Journal of the American Statistical Association. 1991;86:130–143. doi: 10.1080/01621459.1999.10473831.
- Wang MC. Hazards Regression Analysis for Length-biased Data. Biometrika. 1996;83:343–354.
- Wang MC, Brookmeyer R, Jewell NP. Statistical Models for Prevalent Cohort Data. Biometrics. 1993;49:1–11.
- Wolfson C, Wolfson DB, Asgharian M, M'Lan CE, Ostbye T, Rockwood K, Hogan DB, the Clinical Progression of Dementia Study Group. A Reevaluation of the Duration of Survival after the Onset of Dementia. The New England Journal of Medicine. 2001;344:1111–1116. doi: 10.1056/NEJM200104123441501.
- Zeidler E. Applied Functional Analysis: Main Principles and Their Applications. Applied Mathematical Sciences, Vol. 109. New York: Springer-Verlag; 1995.
- Zelen M. Forward and Backward Recurrence Times and Length Biased Sampling: Age Specific Models. Lifetime Data Analysis. 2004;10:325–334. doi: 10.1007/s10985-004-4770-1.
- Zelen M, Feinleib M. On the Theory of Screening for Chronic Diseases. Biometrika. 1969;56:601–614.
- Zeng D, Lin DY. Maximum Likelihood Estimation in Semiparametric Regression Models with Censored Data. Journal of the Royal Statistical Society, Series B. 2007;69:507–564.
- Zeng D, Lin DY, Yin G. Maximum Likelihood Estimation for the Proportional Odds Model with Random Effects. Journal of the American Statistical Association. 2005;100:470–483.