Published in final edited form as: Ann Stat. 2011;39(6):2795–3443.

A SIEVE M-THEOREM FOR BUNDLED PARAMETERS IN SEMIPARAMETRIC MODELS, WITH APPLICATION TO THE EFFICIENT ESTIMATION IN A LINEAR MODEL FOR CENSORED DATA*

Ying Ding 1, Bin Nan 1

Abstract

In many semiparametric models that are parameterized by two types of parameters (a Euclidean parameter of interest and an infinite-dimensional nuisance parameter), the two parameters are bundled together, i.e., the nuisance parameter is an unknown function that contains the parameter of interest as part of its argument. For example, in a linear regression model for censored survival data, the unspecified error distribution function involves the regression coefficients. Motivated by the development of an efficient estimating method for the regression parameters, we propose a general sieve M-theorem for bundled parameters and apply the theorem to derive the asymptotic theory for the sieve maximum likelihood estimation in the linear regression model for censored survival data. The numerical implementation of the proposed estimating method can be achieved through conventional gradient-based search algorithms such as the Newton-Raphson algorithm. We show that the proposed estimator is consistent and asymptotically normal and achieves the semiparametric efficiency bound. Simulation studies demonstrate that the proposed method performs well in practical settings and yields more efficient estimates than existing estimating-equation-based methods. Illustration with a real data example is also provided.

Keywords and phrases: Accelerated failure time model, B-spline, bundled parameters, efficient score function, semiparametric efficiency, sieve maximum likelihood estimation

1. Introduction

In a semiparametric model that is parameterized by two types of parameters, a finite-dimensional Euclidean parameter and an infinite-dimensional parameter, the infinite-dimensional parameter is often regarded as a nuisance parameter, and the two parameters are separated. In many interesting statistical models, however, the parameter of interest and the nuisance parameter are bundled together, a term used by [12] in their review of linear models under interval censoring, meaning that the infinite-dimensional parameter is an unknown function of the parameter of interest. For example, in a linear regression model for censored survival data, the unspecified error distribution function, often treated as a nuisance parameter, is a function of the regression coefficients. Other examples include the single index model and the Cox regression model with an unspecified link function.

There is a rich literature of asymptotic distributional theories for M-estimation in a variety of semiparametric models with well-separated parameters; see, e.g., [9, 10, 11, 23, 29, 32], among many others. Though many methodologies of M-estimation for bundled parameters have been proposed in the literature, general asymptotic distributional theories for such problems are still lacking. The only estimation theories for bundled parameters we are aware of are the sieve generalized method of moments of [1] and the estimating equation approach of [5, 18].

In this article, we consider an extension of existing asymptotic distributional theories to accommodate situations where the estimation criteria are parameterized with bundled parameters. The proposed theory has a similar flavor to Theorem 2 in [5], but the two differ: the latter requires an existing uniformly consistent estimator of the infinite-dimensional nuisance parameter with a convergence rate faster than n^{−1/4}, which is then treated as a fixed function of the parameter of interest in their estimating procedure, whereas we estimate both parameters simultaneously through a sieve parameter space. Furthermore, their nuisance parameter estimator needs to satisfy their condition (2.6), which is usually hard to verify when its convergence rate is slower than n^{−1/2}. Our proposed theory is general enough to cover a wide range of bundled-parameter problems, including the aforementioned single index model, the Cox model with an unknown link function, and the linear model under different censoring mechanisms. Rigorous proofs for each of these models, however, require lengthy derivations. We only use the efficient estimation in the semiparametric linear regression model with right censored data as an illustrative example that motivates such a theoretical development, and will present results for other models elsewhere. Note that the considered example cannot be directly put into the framework of restricted moments due to right censoring, and thus cannot be handled by the method of [1].

Suppose that the failure time transformed by a known monotone transformation is linearly related to a set of covariates, where the failure time is subject to right censoring. Let T_i denote the transformed failure time and C_i denote the transformed censoring time by the same transformation for subject i, i = 1, ⋯, n. Let Y_i = min(T_i, C_i) and Δ_i = I(T_i ≤ C_i). Then the semiparametric linear model we consider here can be written as

$$T_i = X_i'\beta_0 + e_{0,i}, \qquad i = 1, \ldots, n, \tag{1.1}$$

where the errors e_{0,i} are independent and identically distributed (i.i.d.) with an unspecified distribution. When the failure time is log-transformed, this model corresponds to the well-known accelerated failure time model [15]. Here we assume that (X_i, C_i), i = 1, …, n, are i.i.d. and independent of e_{0,i}. This is a common assumption for linear models with censored survival data, and it is particularly needed in [21] to derive the efficient score function for β_0. Such an assumption, however, is stronger than necessary in the usual linear regression without censoring, for which the error is only required to be uncorrelated with the covariates (see, e.g., [3]). We also avoid transformations that produce values such as log(0), so that the Y_i's are always bounded from below.
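For concreteness, the following minimal sketch simulates right-censored data from model (1.1). The particular choices of β_0, the error distribution and the censoring distribution are illustrative assumptions only; Section 5.1 describes the settings actually used in our simulation studies.

```python
import numpy as np

# Minimal sketch: simulate right-censored data from model (1.1),
# T_i = X_i' beta_0 + e_{0,i}.  All distributional choices below are
# illustrative assumptions, not the paper's simulation settings.
rng = np.random.default_rng(0)
n, beta0 = 200, np.array([1.0, 1.0])

X = np.column_stack([rng.binomial(1, 0.5, n),    # X1 ~ Bernoulli(0.5)
                     rng.normal(0.0, 0.5, n)])   # X2 ~ N(0, 0.25)
e0 = rng.normal(size=n)                          # unspecified error, here N(0, 1)
T = X @ beta0 + e0                               # (transformed) failure time
C = rng.uniform(0.0, 6.0, n)                     # (transformed) censoring time
Y = np.minimum(T, C)                             # observed time Y_i = min(T_i, C_i)
Delta = (T <= C).astype(int)                     # indicator Delta_i = I(T_i <= C_i)
```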

The semiparametric linear regression model relates the failure time to the covariates directly. It provides a straightforward interpretation of the data and serves as an attractive alternative to the Cox model [6] in many applications. Several estimators of the regression parameters have been proposed in the literature since the late 1970s, including the rank-based estimators (see, e.g., [19], [28], [25], [30], [13], [14]) and the Buckley-James estimator (see, e.g., [2], [20], [16]). There are two major challenges in the estimation for such a linear model: (1) the estimating functions in the aforementioned methods are discrete, leading to potential multiple solutions as well as numerical difficulties; (2) none of the aforementioned methods is efficient. Recently, [31] developed a kernel-smoothed profile likelihood estimating procedure for the accelerated failure time model. In this article, we consider a sieve maximum likelihood approach for model (1.1) for censored data. The proposed approach is intuitive, easy to implement numerically, and asymptotically efficient.

It is easy to see that T and C are independent conditional on X under the assumption e0 ⊥ (C, X). Hence the joint density function of Z = (Y, Δ, X) can be written as

$$f_{Y,\Delta,X}(y,\delta,x) = \lambda_0(y - x'\beta_0)^{\delta}\,\exp\{-\Lambda_0(y - x'\beta_0)\}\,H(y,\delta,x), \tag{1.2}$$

where Λ_0(·) is the true cumulative hazard function for the error term e_0 and λ_0(·) is its derivative. H(y, δ, x) only depends on the conditional distribution of C given X and the marginal distribution of X, and is free of β_0 and λ_0. To simplify the notation, we will omit the factor H from the likelihood function. Then for i.i.d. observations (Y_i, Δ_i, X_i), i = 1, ⋯, n, from equation (1.2) we obtain the log likelihood function for β and λ as

$$l_n(\beta,\lambda) = n^{-1}\sum_{i=1}^n \Bigl\{\Delta_i \log \lambda(Y_i - X_i'\beta) - \int I(Y_i \ge t)\,\lambda(t - X_i'\beta)\,dt\Bigr\}. \tag{1.3}$$

The log likelihood in (1.3) clearly corresponds to a semiparametric model in which the argument of the nuisance parameter λ involves β; thus β and λ are bundled parameters. To preserve the positivity of λ, let g(·) = log λ(·). Then the log likelihood function for β and g, using the counting process notation, can be written as

$$l_n(\beta,g) = n^{-1}\sum_{i=1}^n \Bigl\{\int g(t - X_i'\beta)\,dN_i(t) - \int I(Y_i \ge t)\,e^{g(t - X_i'\beta)}\,dt\Bigr\}, \tag{1.4}$$

where N_i(t) = Δ_i I(Y_i ≤ t) is the counting process for subject i.
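To make the structure of (1.4) concrete, the following sketch evaluates the bundled log likelihood for a candidate (β, g), rewriting the integral term as the cumulative hazard of the observed residual time. The residual interval [a, b], the grid size and the simple rectangle quadrature are illustrative assumptions.

```python
import numpy as np

def loglik(beta, g, Y, Delta, X, a=-4.0, b=4.0, n_grid=400):
    """Sketch of the bundled log likelihood (1.4):
    l_n(beta, g) = n^{-1} sum_i { Delta_i g(Y_i - X_i'beta)
                                  - int_a^{Y_i - X_i'beta} exp{g(u)} du }.
    `g` is any callable log hazard; note that beta enters its argument,
    which is exactly the bundling discussed above."""
    resid = np.clip(Y - X @ beta, a, b)          # observed residual times
    jump = Delta * g(resid)                      # counting-process (jump) part
    u = np.linspace(a, b, n_grid)                # grid for the cumulative hazard
    cum = np.concatenate([[0.0], np.cumsum(np.exp(g(u[:-1])) * np.diff(u))])
    comp = np.interp(resid, u, cum)              # Lambda(Y_i - X_i'beta)
    return np.mean(jump - comp)
```

As a sanity check, g ≡ 0 (i.e., λ ≡ 1) gives a compensator term equal to the length of the at-risk residual interval.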

We propose a new approach that directly maximizes the log likelihood function over a sieve space in which the function g(·) is approximated by B-splines. Numerically, the estimator can easily be obtained by the Newton-Raphson algorithm or other gradient-based search algorithms. We show that the proposed estimator is consistent and asymptotically normal, and that the limiting covariance matrix attains the semiparametric efficiency bound. The bound can be estimated either by inverting the information matrix based on the efficient score function of the regression parameters derived by [21], or by inverting the observed information matrix of all parameters, which accounts for the fact that the nuisance parameter is also estimated in the sieve space for the log hazard function.

2. The sieve M-theorem on the asymptotic normality of semiparametric estimation for bundled parameters

In this section, we extend the general theorem introduced by [29], which deals with the asymptotic normality of semiparametric M-estimators of regression parameters when the convergence rate of the estimator of the nuisance parameter can be slower than n^{−1/2}. In their theorem, the parameters of interest and the nuisance parameters are assumed to be separated. We consider a more general setting where the nuisance parameter can be a function of the parameters of interest. The theorem is crucial in the proof of asymptotic normality given in Theorem 4.2 for our proposed estimators.

Some empirical process notation will be used from now on. We denote Pf = ∫ f(z) dP(z) and ℙ_n f = n^{−1} Σ_{i=1}^n f(Z_i), where P is a probability measure and ℙ_n is the empirical probability measure, and denote 𝔾_n f = n^{1/2}(ℙ_n − P)f. Given i.i.d. observations Z_1, Z_2, ⋯, Z_n ∈ 𝒵, we estimate the unknown parameters (β, ζ(·, β)) by maximizing the objective function n^{−1} Σ_{i=1}^n m(β, ζ(·, β); Z_i) = ℙ_n m(β, ζ(·, β); Z), where β is the parameter of interest and ζ(·, β) is the nuisance parameter that can be a function of β. Here “ · ” denotes the other arguments of ζ besides β, which can be some components of Z ∈ 𝒵. If the objective function m is the log likelihood function of a single observation, then the estimator becomes the semiparametric maximum likelihood estimator. We adopt notation similar to that of [29].

Let θ = (β, ζ(·, β)), β ∈ ℬ ⊂ ℝ^d and ζ ∈ ℋ, where ℬ is the parameter space of β and ℋ is a class of functions mapping from 𝒵 × ℬ to ℝ. Let Θ = ℬ × ℋ be the parameter space of θ. Define a distance between θ_1, θ_2 ∈ Θ by

$$d(\theta_1, \theta_2) = \{|\beta_2 - \beta_1|^2 + \|\zeta_2(\cdot,\beta_2) - \zeta_1(\cdot,\beta_1)\|^2\}^{1/2},$$

where | · | is the Euclidean distance and ‖ · ‖ is some norm. Let Θn be the sieve parameter space, a sequence of increasing subsets of the parameter space Θ growing dense in Θ as n → ∞. We aim to find θ̂n ∈ Θn such that d(θ̂n, θ0) = op(1) and β̂n is asymptotically normal.

For any fixed ζ(·, β) ∈ ℋ, let {ζη(·, β) : η in a neighborhood of 0 ∈ ℝ} be a smooth curve in ℋ running through ζ(·, β) at η = 0, i.e., ζη(·, β)|η=0 = ζ(·, β). Assume all ζ(·, β) ∈ ℋ are at least twice-differentiable with respect to β, and denote

$$\mathbb{H} = \Bigl\{h : h(\cdot,\beta) = \frac{\partial \zeta_\eta(\cdot,\beta)}{\partial\eta}\Big|_{\eta=0},\ \zeta_\eta \in \mathcal{H},\ \beta \in \mathcal{B}\Bigr\}.$$

Assume the objective function m is twice Fréchet differentiable. Since for small δ we have ζ(·, β + δ) − ζ(·, β) = ζ̇_β(·, β)δ + o(δ), where ζ̇_β(·, β) = ∂ζ(·, β)/∂β, it follows from the definition of functional derivatives that

$$\begin{aligned}
&\lim_{\delta\to 0}\frac{1}{\delta}\bigl\{m(\beta, \zeta(\cdot,\beta+\delta); z) - m(\beta, \zeta(\cdot,\beta); z)\bigr\} \\
&\quad = \lim_{\delta\to 0}\frac{1}{\delta}\bigl\{m(\beta, \zeta(\cdot,\beta)+\dot\zeta_\beta(\cdot,\beta)\delta + o(\delta); z) - m(\beta, \zeta(\cdot,\beta)+\dot\zeta_\beta(\cdot,\beta)\delta; z)\bigr\} \\
&\qquad + \lim_{\delta\to 0}\frac{1}{\delta}\bigl\{m(\beta, \zeta(\cdot,\beta)+\dot\zeta_\beta(\cdot,\beta)\delta; z) - m(\beta, \zeta(\cdot,\beta); z)\bigr\} \\
&\quad = \lim_{\delta\to 0}\,\dot l_2(\beta, \zeta(\cdot,\beta)+\dot\zeta_\beta(\cdot,\beta)\delta; z)[o(\delta)/\delta] + \dot l_2(\beta, \zeta(\cdot,\beta); z)[\dot\zeta_\beta(\cdot,\beta)] \\
&\quad = \dot l_2(\beta, \zeta(\cdot,\beta); z)[\dot\zeta_\beta(\cdot,\beta)],
\end{aligned}$$

where the subscript 2 indicates that the derivatives are taken with respect to the second argument of the function. The last equality holds because

$$\lim_{\delta\to 0}\,\dot l_2(\beta, \zeta(\cdot,\beta)+\dot\zeta_\beta(\cdot,\beta)\delta; z)[o(\delta)/\delta] = 0.$$

Similarly we have

$$\lim_{\delta\to 0}\frac{1}{\delta}\bigl\{\dot l_2(\beta, \zeta(\cdot,\beta+\delta); z)[h(\cdot,\beta)] - \dot l_2(\beta, \zeta(\cdot,\beta); z)[h(\cdot,\beta)]\bigr\} = \ddot l_{22}(\beta, \zeta(\cdot,\beta); z)[h(\cdot,\beta),\,\dot\zeta_\beta(\cdot,\beta)]$$

and

$$\lim_{\delta\to 0}\frac{1}{\delta}\bigl\{\dot l_2(\beta, \zeta(\cdot,\beta); z)[h(\cdot,\beta+\delta)] - \dot l_2(\beta, \zeta(\cdot,\beta); z)[h(\cdot,\beta)]\bigr\} = \dot l_2(\beta, \zeta(\cdot,\beta); z)[\dot h_\beta(\cdot,\beta)].$$

Thus, according to the chain rule for functional derivatives, we have

$$\begin{aligned}
\dot l_\beta(\beta,\zeta(\cdot,\beta);z) &= \frac{\partial m(\beta,\zeta(\cdot,\beta);z)}{\partial\beta} = \dot l_1(\beta,\zeta(\cdot,\beta);z) + \dot l_2(\beta,\zeta(\cdot,\beta);z)[\dot\zeta_\beta(\cdot,\beta)], \\
\dot l_\zeta(\beta,\zeta(\cdot,\beta);z)[h] &= \frac{\partial m(\beta,(\zeta+\eta h)(\cdot,\beta);z)}{\partial\eta}\Big|_{\eta=0} = \dot l_2(\beta,\zeta(\cdot,\beta);z)[h(\cdot,\beta)], \\
\ddot l_{\beta\beta}(\beta,\zeta(\cdot,\beta);z) &= \frac{\partial^2 m(\beta,\zeta(\cdot,\beta);z)}{\partial\beta\,\partial\beta'} = \frac{\partial\,\dot l_\beta(\beta,\zeta(\cdot,\beta);z)}{\partial\beta'} \\
&= \ddot l_{11}(\beta,\zeta(\cdot,\beta);z) + \ddot l_{12}(\beta,\zeta(\cdot,\beta);z)[\dot\zeta_\beta(\cdot,\beta)] + \ddot l_{21}(\beta,\zeta(\cdot,\beta);z)[\dot\zeta_\beta(\cdot,\beta)] \\
&\quad + \ddot l_{22}(\beta,\zeta(\cdot,\beta);z)[\dot\zeta_\beta(\cdot,\beta),\dot\zeta_\beta(\cdot,\beta)] + \dot l_2(\beta,\zeta(\cdot,\beta);z)[\ddot\zeta_{\beta\beta}(\cdot,\beta)], \\
\ddot l_{\beta\zeta}(\beta,\zeta(\cdot,\beta);z)[h] &= \frac{\partial\,\dot l_\beta(\beta,(\zeta+\eta h)(\cdot,\beta);z)}{\partial\eta}\Big|_{\eta=0} \\
&= \ddot l_{12}(\beta,\zeta(\cdot,\beta);z)[h(\cdot,\beta)] + \ddot l_{22}(\beta,\zeta(\cdot,\beta);z)[\dot\zeta_\beta(\cdot,\beta),\,h(\cdot,\beta)] + \dot l_2(\beta,\zeta(\cdot,\beta);z)[\dot h_\beta(\cdot,\beta)], \\
\ddot l_{\zeta\beta}(\beta,\zeta(\cdot,\beta);z)[h] &= \frac{\partial\,\dot l_2(\beta,\zeta(\cdot,\beta);z)[h(\cdot,\beta)]}{\partial\beta'} \\
&= \ddot l_{21}(\beta,\zeta(\cdot,\beta);z)[h(\cdot,\beta)] + \ddot l_{22}(\beta,\zeta(\cdot,\beta);z)[h(\cdot,\beta),\,\dot\zeta_\beta(\cdot,\beta)] + \dot l_2(\beta,\zeta(\cdot,\beta);z)[\dot h_\beta(\cdot,\beta)], \\
\ddot l_{\zeta\zeta}(\beta,\zeta(\cdot,\beta);z)[h_1,h_2] &= \ddot l_{22}(\beta,\zeta(\cdot,\beta);z)[h_1(\cdot,\beta),\,h_2(\cdot,\beta)].
\end{aligned}$$

As noted before, the subscript 1 or 2 in the derivatives indicates that the derivative is taken with respect to the first or the second argument of the function, and h inside the square brackets is a function denoting the direction of the functional derivative with respect to ζ. Note that for the second derivatives l̈_βζ and l̈_ζβ, we implicitly require the direction h to be differentiable with respect to β. It is easily seen that when ζ is free of β, all the above derivatives reduce to those in [29]. Following [29], we also define

$$\begin{aligned}
\dot l_\beta(\beta,\zeta(\cdot,\beta)) &= P\,\dot l_\beta(\beta,\zeta(\cdot,\beta);Z), & \dot l_\zeta(\beta,\zeta(\cdot,\beta))[h] &= P\,\dot l_\zeta(\beta,\zeta(\cdot,\beta);Z)[h], \\
\dot l_{\beta,n}(\beta,\zeta(\cdot,\beta)) &= \mathbb{P}_n\,\dot l_\beta(\beta,\zeta(\cdot,\beta);Z), & \dot l_{\zeta,n}(\beta,\zeta(\cdot,\beta))[h] &= \mathbb{P}_n\,\dot l_\zeta(\beta,\zeta(\cdot,\beta);Z)[h], \\
\ddot l_{\beta\beta}(\beta,\zeta(\cdot,\beta)) &= P\,\ddot l_{\beta\beta}(\beta,\zeta(\cdot,\beta);Z), & \ddot l_{\zeta\zeta}(\beta,\zeta(\cdot,\beta))[h,h] &= P\,\ddot l_{\zeta\zeta}(\beta,\zeta(\cdot,\beta);Z)[h,h],
\end{aligned}$$

and

$$\ddot l_{\beta\zeta}(\beta,\zeta(\cdot,\beta))[h] = P\,\ddot l_{\beta\zeta}(\beta,\zeta(\cdot,\beta);Z)[h], \qquad \ddot l_{\zeta\beta}(\beta,\zeta(\cdot,\beta))[h] = P\,\ddot l_{\zeta\beta}(\beta,\zeta(\cdot,\beta);Z)[h].$$

Furthermore, for h = (h_1, h_2, ⋯, h_d)′ ∈ ℍ^d, we denote

$$\begin{aligned}
\dot l_\zeta(\beta,\zeta(\cdot,\beta);z)[h] &= \bigl(\dot l_\zeta(\beta,\zeta(\cdot,\beta);z)[h_1], \ldots, \dot l_\zeta(\beta,\zeta(\cdot,\beta);z)[h_d]\bigr)', \\
\ddot l_{\beta\zeta}(\beta,\zeta(\cdot,\beta);z)[h] &= \bigl(\ddot l_{\beta\zeta}(\beta,\zeta(\cdot,\beta);z)[h_1], \ldots, \ddot l_{\beta\zeta}(\beta,\zeta(\cdot,\beta);z)[h_d]\bigr), \\
\ddot l_{\zeta\beta}(\beta,\zeta(\cdot,\beta);z)[h] &= \bigl(\ddot l_{\zeta\beta}(\beta,\zeta(\cdot,\beta);z)[h_1], \ldots, \ddot l_{\zeta\beta}(\beta,\zeta(\cdot,\beta);z)[h_d]\bigr), \\
\ddot l_{\zeta\zeta}(\beta,\zeta(\cdot,\beta);z)[h,h] &= \bigl(\ddot l_{\zeta\zeta}(\beta,\zeta(\cdot,\beta);z)[h_1,h], \ldots, \ddot l_{\zeta\zeta}(\beta,\zeta(\cdot,\beta);z)[h_d,h]\bigr)',
\end{aligned}$$

and define correspondingly

$$\begin{aligned}
\dot l_\zeta(\beta,\zeta(\cdot,\beta))[h] &= P\,\dot l_\zeta(\beta,\zeta(\cdot,\beta);Z)[h], & \dot l_{\zeta,n}(\beta,\zeta(\cdot,\beta))[h] &= \mathbb{P}_n\,\dot l_\zeta(\beta,\zeta(\cdot,\beta);Z)[h], \\
\ddot l_{\beta\zeta}(\beta,\zeta(\cdot,\beta))[h] &= P\,\ddot l_{\beta\zeta}(\beta,\zeta(\cdot,\beta);Z)[h], & \ddot l_{\zeta\beta}(\beta,\zeta(\cdot,\beta))[h] &= P\,\ddot l_{\zeta\beta}(\beta,\zeta(\cdot,\beta);Z)[h], \\
\ddot l_{\zeta\zeta}(\beta,\zeta(\cdot,\beta))[h,h] &= P\,\ddot l_{\zeta\zeta}(\beta,\zeta(\cdot,\beta);Z)[h,h].
\end{aligned}$$
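As a concrete instance of these chain-rule formulas, consider the composite nuisance parameter used later in Sections 3 and 4 (a special case shown only for illustration; the general display above does not rely on this form):

$$\zeta(t,x,\beta) = g\{t - x'(\beta-\beta_0)\}, \qquad \dot\zeta_\beta(t,x,\beta) = -x\,\dot g\{t - x'(\beta-\beta_0)\}, \qquad \ddot\zeta_{\beta\beta}(t,x,\beta) = xx'\,\ddot g\{t - x'(\beta-\beta_0)\}.$$

In that model the criterion depends on β only through ζ, so l̇_1 = 0 and l̇_β = l̇_2[−x ġ(ψ(·, β))], which is the form exploited repeatedly in Section 7.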

To obtain the asymptotic normality result for the sieve M-estimator β̂_n, the assumptions we make in the following look similar to those in [29], but all the derivatives with respect to β involve the chain rule and hence are more complicated, which is the key difference from [29]. Additionally, we focus on sieve estimators in the sieve parameter space. We list the following assumptions:

  • A1

    (Rate of convergence) For an estimator θ̂_n = (β̂_n, ζ̂_n(·, β̂_n)) ∈ Θ_n and the true parameter θ_0 = (β_0, ζ_0(·, β_0)) ∈ Θ, d(θ̂_n, θ_0) = O_p(n^{−ξ}) for some ξ > 0.

  • A2

    l̇_β(β_0, ζ_0(·, β_0)) = 0 and l̇_ζ(β_0, ζ_0(·, β_0))[h] = 0 for all h ∈ ℍ.

  • A3
    (Positive information) There exists h^* = (h_1^*, ⋯, h_d^*)′, where h_j^* ∈ ℍ for j = 1, ⋯, d, such that
    $$\ddot l_{\beta\zeta}(\beta_0,\zeta_0(\cdot,\beta_0))[h] - \ddot l_{\zeta\zeta}(\beta_0,\zeta_0(\cdot,\beta_0))[h^*,h] = 0$$
    for all h ∈ ℍ. Furthermore, the matrix
    $$A = -\ddot l_{\beta\beta}(\beta_0,\zeta_0(\cdot,\beta_0)) + \ddot l_{\zeta\beta}(\beta_0,\zeta_0(\cdot,\beta_0))[h^*] = -P\bigl\{\ddot l_{\beta\beta}(\beta_0,\zeta_0(\cdot,\beta_0);Z) - \ddot l_{\zeta\beta}(\beta_0,\zeta_0(\cdot,\beta_0);Z)[h^*]\bigr\}$$
    is nonsingular.
  • A4
    The estimator (β̂n, ζ̂n(·, β̂n)) satisfies
    $$\dot l_{\beta,n}(\hat\beta_n, \hat\zeta_n(\cdot,\hat\beta_n)) = o_p(n^{-1/2}) \quad\text{and}\quad \dot l_{\zeta,n}(\hat\beta_n, \hat\zeta_n(\cdot,\hat\beta_n))[h^*] = o_p(n^{-1/2}).$$
  • A5
    (Stochastic equicontinuity) For some C > 0,
    $$\sup_{d(\theta,\theta_0)\le Cn^{-\xi},\,\theta\in\Theta_n} \bigl|\sqrt{n}\,(\dot l_{\beta,n} - \dot l_\beta)(\beta,\zeta(\cdot,\beta)) - \sqrt{n}\,(\dot l_{\beta,n} - \dot l_\beta)(\beta_0,\zeta_0(\cdot,\beta_0))\bigr| = o_p(1)$$
    and
    $$\sup_{d(\theta,\theta_0)\le Cn^{-\xi},\,\theta\in\Theta_n} \bigl|\sqrt{n}\,(\dot l_{\zeta,n} - \dot l_\zeta)(\beta,\zeta(\cdot,\beta))[h^*(\cdot,\beta)] - \sqrt{n}\,(\dot l_{\zeta,n} - \dot l_\zeta)(\beta_0,\zeta_0(\cdot,\beta_0))[h^*(\cdot,\beta_0)]\bigr| = o_p(1).$$
  • A6
    (Smoothness of the model) For some α > 1 satisfying αξ > 1/2, and for θ in a neighborhood of θ_0: {θ : d(θ, θ_0) ≤ Cn^{−ξ}, θ ∈ Θ_n},
    $$\bigl|\dot l_\beta(\beta,\zeta(\cdot,\beta)) - \dot l_\beta(\beta_0,\zeta_0(\cdot,\beta_0)) - \ddot l_{\beta\beta}(\beta_0,\zeta_0(\cdot,\beta_0))(\beta-\beta_0) - \ddot l_{\beta\zeta}(\beta_0,\zeta_0(\cdot,\beta_0))[\zeta(\cdot,\beta)-\zeta_0(\cdot,\beta_0)]\bigr| = O(d^\alpha(\theta,\theta_0))$$
    and
    $$\begin{aligned}
    \bigl|\dot l_\zeta(\beta,\zeta(\cdot,\beta))[h^*(\cdot,\beta)] - \dot l_\zeta(\beta_0,\zeta_0(\cdot,\beta_0))[h^*(\cdot,\beta_0)] &- \ddot l_{\zeta\beta}(\beta_0,\zeta_0(\cdot,\beta_0))[h^*(\cdot,\beta_0)](\beta-\beta_0) \\
    &- \ddot l_{\zeta\zeta}(\beta_0,\zeta_0(\cdot,\beta_0))[h^*(\cdot,\beta_0),\,\zeta(\cdot,\beta)-\zeta_0(\cdot,\beta_0)]\bigr| = O(d^\alpha(\theta,\theta_0)).
    \end{aligned}$$

Note that ξ in A1 depends on the entropy of the sieve parameter space for ζ and cannot be arbitrarily small; it is controlled by the smoothness of the model in A6. The convergence rate in A1 needs to be established before asymptotic normality can be obtained. A2 is a common assumption for maximum likelihood estimation and usually holds. The direction h^* in A3 may be found through the equation in A3; it is the least favorable direction when m is the likelihood function. A4 and A5 are usually verified either by the Donsker property or by the maximal inequality of [27]. A6 can be obtained by Taylor expansion. The following theorem is an extension of Theorem 6.1 in [29] to the case where the infinite-dimensional parameter ζ is a function of the finite-dimensional parameter β.

Theorem 2.1. Suppose that assumptions A1–A6 hold. Then

$$\sqrt{n}(\hat\beta_n - \beta_0) = A^{-1}\sqrt{n}\,\mathbb{P}_n m^*(\beta_0,\zeta_0(\cdot,\beta_0);Z) + o_p(1) \to_d N\bigl(0,\, A^{-1}B(A^{-1})'\bigr),$$

where

$$m^*(\beta_0,\zeta_0(\cdot,\beta_0);z) = \dot l_\beta(\beta_0,\zeta_0(\cdot,\beta_0);z) - \dot l_\zeta(\beta_0,\zeta_0(\cdot,\beta_0);z)[h^*], \qquad B = P\bigl\{m^*(\beta_0,\zeta_0(\cdot,\beta_0);Z)^{\otimes 2}\bigr\},$$

and A is given in assumption A3. Here a⊗2 = aa′.

Proof. The proof proceeds along the same lines as the proof of Theorem 6.1 in [29]. Assumptions A1 and A5 yield

$$\sqrt{n}\,(\dot l_{\beta,n} - \dot l_\beta)(\hat\beta_n, \hat\zeta_n(\cdot,\hat\beta_n)) - \sqrt{n}\,(\dot l_{\beta,n} - \dot l_\beta)(\beta_0, \zeta_0(\cdot,\beta_0)) = o_p(1).$$

Since l̇_{β,n}(β̂_n, ζ̂_n(·, β̂_n)) = o_p(n^{−1/2}) by A4 and l̇_β(β_0, ζ_0(·, β_0)) = 0 by A2, we have

$$\sqrt{n}\,\dot l_\beta(\hat\beta_n, \hat\zeta_n(\cdot,\hat\beta_n)) + \sqrt{n}\,\dot l_{\beta,n}(\beta_0, \zeta_0(\cdot,\beta_0)) = o_p(1).$$

Similarly,

$$\sqrt{n}\,\dot l_\zeta(\hat\beta_n, \hat\zeta_n(\cdot,\hat\beta_n))[h^*(\cdot,\hat\beta_n)] + \sqrt{n}\,\dot l_{\zeta,n}(\beta_0, \zeta_0(\cdot,\beta_0))[h^*(\cdot,\beta_0)] = o_p(1).$$

Combining these equalities with Assumption A6 yields

$$\ddot l_{\beta\beta}(\beta_0,\zeta_0(\cdot,\beta_0))(\hat\beta_n-\beta_0) + \ddot l_{\beta\zeta}(\beta_0,\zeta_0(\cdot,\beta_0))[\hat\zeta_n(\cdot,\hat\beta_n)-\zeta_0(\cdot,\beta_0)] + \dot l_{\beta,n}(\beta_0,\zeta_0(\cdot,\beta_0)) + O(d^\alpha(\hat\theta_n,\theta_0)) = o_p(n^{-1/2}) \tag{2.1}$$

and

$$\ddot l_{\zeta\beta}(\beta_0,\zeta_0(\cdot,\beta_0))[h^*(\cdot,\beta_0)](\hat\beta_n-\beta_0) + \ddot l_{\zeta\zeta}(\beta_0,\zeta_0(\cdot,\beta_0))[h^*(\cdot,\beta_0),\,\hat\zeta_n(\cdot,\hat\beta_n)-\zeta_0(\cdot,\beta_0)] + \dot l_{\zeta,n}(\beta_0,\zeta_0(\cdot,\beta_0))[h^*(\cdot,\beta_0)] + O(d^\alpha(\hat\theta_n,\theta_0)) = o_p(n^{-1/2}). \tag{2.2}$$

Since α > 1 with αξ > 1/2, the rate of convergence assumption A1 implies √n O(d^α(θ̂_n, θ_0)) = o_p(1); then subtracting (2.2) from (2.1) and using A3 yields

$$\bigl(\ddot l_{\beta\beta}(\beta_0,\zeta_0(\cdot,\beta_0)) - \ddot l_{\zeta\beta}(\beta_0,\zeta_0(\cdot,\beta_0))[h^*(\cdot,\beta_0)]\bigr)(\hat\beta_n-\beta_0) = -\bigl(\dot l_{\beta,n}(\beta_0,\zeta_0(\cdot,\beta_0)) - \dot l_{\zeta,n}(\beta_0,\zeta_0(\cdot,\beta_0))[h^*(\cdot,\beta_0)]\bigr) + o_p(n^{-1/2}),$$

that is,

$$A(\hat\beta_n - \beta_0) = \mathbb{P}_n m^*(\beta_0,\zeta_0(\cdot,\beta_0);Z) + o_p(n^{-1/2}).$$

This yields

$$\sqrt{n}(\hat\beta_n - \beta_0) = A^{-1}\sqrt{n}\,\mathbb{P}_n m^*(\beta_0,\zeta_0(\cdot,\beta_0);Z) + o_p(1) \to_d N\bigl(0,\, A^{-1}B(A^{-1})'\bigr).$$

3. Back to the linear model: the sieve maximum likelihood estimation

By taking the logarithm of the positive function λ(·) in (1.3), the function g(·) in (1.4) is no longer restricted to be positive, which eases the estimation. We now describe the spline-based sieve maximum likelihood estimation for model (1.1). Under the regularity conditions C.1–C.3 stated in Section 4, the observed residual times {Y_i − X_i′β : β ∈ ℬ, i = 1, ⋯, n} are confined to some finite interval. Let [a, b] be such an interval of interest, where −∞ < a < b < ∞. Let T_{K_n} = {t_1, ⋯, t_{K_n}} be a set of partition points of [a, b] with K_n = O(n^ν) and max_{1≤j≤K_n+1} |t_j − t_{j−1}| = O(n^{−ν}) for some constant ν ∈ (0, 1/2). Let 𝒮_n(T_{K_n}, K_n, p) be the space of polynomial splines of order p ≥ 1 defined in [22, Definition 4.1]. According to [22, Corollary 4.10], there exists a set of B-spline basis functions {B_j, 1 ≤ j ≤ q_n} with q_n = K_n + p such that for any s ∈ 𝒮_n(T_{K_n}, K_n, p) we can write

$$s(t) = \sum_{j=1}^{q_n} \gamma_j B_j(t), \tag{3.1}$$

where we follow [24] by requiring max_{1≤j≤q_n} |γ_j| ≤ c_n, where c_n is allowed to grow with n slowly enough.
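The following sketch builds such a B-spline basis with q_n = K_n + p functions; the interval, the number of interior knots and the spline order below are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import BSpline

# Sketch of the sieve representation (3.1): s(t) = sum_j gamma_j B_j(t).
a, b, K_n, p = -4.0, 4.0, 2, 4                 # cubic B-splines: order p = 4 (degree 3)
interior = np.linspace(a, b, K_n + 2)[1:-1]    # K_n equally spaced interior knots
knots = np.r_[[a] * p, interior, [b] * p]      # extended knot sequence
q_n = K_n + p                                  # number of basis functions

def basis(t):
    """Evaluate all q_n B-spline basis functions at t; column j is B_j(t)."""
    return BSpline(knots, np.eye(q_n), p - 1)(np.atleast_1d(t))

gamma = np.zeros(q_n)                          # gamma = 0 gives g = 0, i.e. lambda = 1
s_vals = basis([0.0, 1.0]) @ gamma             # s(t) = sum_j gamma_j B_j(t) at t = 0, 1
```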

Let γ = (γ_1, …, γ_{q_n})′. Under suitable smoothness assumptions, g_0(·) = log λ_0(·) can be well approximated by some function in 𝒮_n(T_{K_n}, K_n, p). Therefore, we seek a member of 𝒮_n(T_{K_n}, K_n, p), together with a value of β ∈ ℬ, that maximizes the log likelihood function. Specifically, let θ̂_n = (β̂_n, γ̂_n) be the value that maximizes

$$l_n(\beta,\gamma) = n^{-1}\sum_{i=1}^n \Bigl[\int \sum_{j=1}^{q_n}\gamma_j B_j(t - X_i'\beta)\,dN_i(t) - \int I(Y_i \ge t)\exp\Bigl\{\sum_{j=1}^{q_n}\gamma_j B_j(t - X_i'\beta)\Bigr\}\,dt\Bigr]. \tag{3.2}$$

Taking the first order derivatives of l_n(β, γ) with respect to β and γ and setting them to zero, we obtain the score equations. Since the integrals here are univariate, their numerical implementation can easily be done by the one-dimensional Gauss quadrature method. The Newton-Raphson algorithm or any other gradient-based search algorithm can be applied to solve the score equations for all parameters θ = (β, γ), e.g.,

$$\theta^{(m+1)} = \theta^{(m)} - H(\theta^{(m)})^{-1} S(\theta^{(m)}),$$

where θ^{(m)} = (β^{(m)}, γ^{(m)}) is the parameter estimate from the mth iteration, and

$$S(\theta) = \begin{pmatrix} \partial l_n(\beta,\gamma)/\partial\beta \\ \partial l_n(\beta,\gamma)/\partial\gamma \end{pmatrix}, \qquad H(\theta) = \begin{pmatrix} \partial^2 l_n(\beta,\gamma)/\partial\beta\,\partial\beta' & \partial^2 l_n(\beta,\gamma)/\partial\beta\,\partial\gamma' \\ \partial^2 l_n(\beta,\gamma)/\partial\gamma\,\partial\beta' & \partial^2 l_n(\beta,\gamma)/\partial\gamma\,\partial\gamma' \end{pmatrix}$$

are the score function and the Hessian matrix of the parameter θ, respectively. For any fixed β and n, it is clear that l_n(β, γ) in (3.2) is concave in γ and tends to −∞ if any γ_j approaches either ∞ or −∞; hence γ̂_n must be bounded, which yields an estimator of s in 𝒮_n(T_{K_n}, K_n, p).
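A minimal numerical sketch of this maximization follows. It reuses `basis`, `knots`, `q_n`, `a`, `b` from the previous sketch and right-censored data `(Y, Delta, X)` as simulated in the sketch of Section 1; handing the objective to a generic quasi-Newton optimizer, rather than solving the score equations with the analytic score and Hessian as described above, and the 40-point Gauss-Legendre rule are simplifying assumptions.

```python
import numpy as np
from scipy.optimize import minimize

nodes, weights = np.polynomial.legendre.leggauss(40)     # Gauss-Legendre rule on [-1, 1]

def neg_loglik(theta, Y, Delta, X):
    """Negative sieve log likelihood (3.2) in theta = (beta, gamma)."""
    d = X.shape[1]
    beta, gamma = theta[:d], theta[d:]
    resid = np.clip(Y - X @ beta, a, b)                  # residual times Y_i - X_i'beta
    jump = Delta * (basis(resid) @ gamma)                # sum_j gamma_j B_j(Y_i - X_i'beta)
    # compensator: int_a^{resid_i} exp{sum_j gamma_j B_j(u)} du by quadrature
    half = (resid - a) / 2.0
    u = a + np.outer(half, nodes + 1.0)                  # mapped nodes, one row per subject
    integrand = np.exp(basis(u.ravel()) @ gamma).reshape(u.shape)
    return -np.mean(jump - half * (integrand @ weights))

theta0 = np.zeros(X.shape[1] + q_n)                      # start from beta = 0, g = 0
fit = minimize(neg_loglik, theta0, args=(Y, Delta, X), method="BFGS")
beta_hat, gamma_hat = fit.x[:X.shape[1]], fit.x[X.shape[1]:]
```

Inverting n times the Hessian of this objective at the maximizer corresponds to the observed-information variance estimate discussed in the next paragraph.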

As stated in the next section, the distribution of β̂_n can be approximated by a normal distribution. One way to estimate the variance matrix of β̂_n is to invert the estimated information matrix based on the efficient score function for β_0, plugging in the estimated parameters (β̂_n, λ̂_n(·)). The consistency of this variance estimator is given in Theorem 4.3. Another way is to invert the observed information matrix from the last Newton-Raphson iteration, which accounts for the fact that the nuisance parameter γ is also estimated. The consistency of the latter approach may be proved in a way similar to Example 4 in [23] or via Theorem 2.2 in [8]; we leave the detailed derivation to interested readers. Simulations indicate that both estimators work reasonably well.

4. Asymptotic results

Denote ε_β = Y − X′β and ε_0 = Y − X′β_0. We assume the following regularity conditions:

  • (C.1)

    The true parameter β0 belongs to the interior of a compact set ℬ ⊆ ℝd.

  • (C.2)

    (a) The covariate X takes values in a bounded subset 𝒳 ⊆ ℝd; (b) E(XX′) is nonsingular.

  • (C.3)

    There is a truncation time τ < ∞ such that, for some constant δ > 0, P(ε_0 > τ | X) ≥ δ almost surely with respect to the probability measure of X. This implies that Λ_0(τ) ≤ −log δ < ∞.

  • (C.4)
    The density f of the error e_0 and its derivative ḟ are bounded, and
    $$\int \bigl(\dot f(t)/f(t)\bigr)^2 f(t)\,dt < \infty.$$
  • (C.5)
    The conditional density g_{C|X} of C given X and its derivative ġ_{C|X} are uniformly bounded for all possible values of X. That is,
    $$\sup_{x\in\mathcal{X}} g_{C|X}(t\,|\,X=x) \le K_1, \qquad \sup_{x\in\mathcal{X}} |\dot g_{C|X}(t\,|\,X=x)| \le K_2$$
    for all t ≤ τ with some constants K_1, K_2 > 0, where τ is the truncation time defined in Condition C.3.
  • (C.6)
    Let 𝒢_p denote the collection of bounded functions g on [a, b] with bounded derivatives g^{(j)}, j = 1, …, k, whose kth derivative g^{(k)} satisfies the following Lipschitz continuity condition:
    $$|g^{(k)}(s) - g^{(k)}(t)| \le L|s-t|^m \quad \text{for } s, t \in [a,b],$$
    where k is a positive integer and m ∈ (0, 1] such that p = k + m ≥ 3, and L < ∞ is an unknown constant. The true log hazard function g_0(·) = log λ_0(·) belongs to 𝒢_p, where [a, b] is a bounded interval.
  • (C.7)

    For some η ∈ (0, 1), u′Var(X | ε_0)u ≥ η u′E(XX′ | ε_0)u almost surely for all u ∈ ℝ^d.

Condition C.1 is a common regularity assumption that has been imposed in the literature, see, e.g., [16]. Conditions C.2(a) and C.3–C.4 were also assumed in [25]. Condition C.5 implies Condition B in [25]. In Condition C.6, we require p ≥ 3 to provide desirable controls of the spline approximation error rates of the first and second derivatives of g_0 (see Corollary 6.21 of [22]), which are needed in verifying Assumptions A4–A6. Condition C.7 was also proposed for the panel count data model in [29]; as noted in their Remark 3.4, Condition C.7 can be justified in many applications when Condition C.2(b) is satisfied. The bounded interval [a, b] in C.6 may be chosen with a = inf_{y,x}(y − x′β_0) > −∞ and b = τ < ∞ under C.1–C.3, which is what we use in the following.

Now define the collection of functions ℋp as follows:

$$\mathcal{H}_p = \bigl\{\zeta(\cdot,\beta) : \zeta(t,x,\beta) = g(\psi(t,x,\beta)),\ g \in \mathcal{G}_p,\ t \in [a,b],\ x \in \mathcal{X},\ \beta \in \mathcal{B}\bigr\},$$

where

$$\psi(t,x,\beta) = t - x'(\beta - \beta_0)$$

and 𝒢_p is defined in C.6. Here ζ is the composition of g with ψ. Note that ζ(t, x, β_0) = g(t). Then for ζ(·, β) ∈ ℋ_p we define the following norm:

$$\|\zeta(\cdot,\beta)\|_2 = \Bigl\{\int_{\mathcal{X}}\int_a^b \{g(t - x'(\beta-\beta_0))\}^2\,d\Lambda_0(t)\,dF_X(x)\Bigr\}^{1/2}. \tag{4.1}$$

We also have the following collection of scores

$$\mathbb{H} = \Bigl\{h : h(\cdot,\beta) = \frac{\partial\zeta_\eta(\cdot,\beta)}{\partial\eta}\Big|_{\eta=0} = w(\psi(\cdot,\beta)),\ \zeta_\eta \in \mathcal{H}_p\Bigr\},$$

in which h(t, x, β) = w(ψ(t, x, β)) = w(t − x′(β − β_0)).

For any θ_1 = (β_1, ζ_1(·, β_1)) and θ_2 = (β_2, ζ_2(·, β_2)) in the space Θ_p = ℬ × ℋ_p, define the distance

$$d(\theta_1,\theta_2) = \{|\beta_1 - \beta_2|^2 + \|\zeta_1(\cdot,\beta_1) - \zeta_2(\cdot,\beta_2)\|_2^2\}^{1/2}. \tag{4.2}$$

Let 𝒢_n^p = 𝒮_n(T_{K_n}, K_n, p). Denote

$$\mathcal{H}_n^p = \bigl\{\zeta(\cdot,\beta) : \zeta(t,x,\beta) = g(\psi(t,x,\beta)),\ g \in \mathcal{G}_n^p,\ t \in [a,b],\ x \in \mathcal{X},\ \beta \in \mathcal{B}\bigr\}$$

and Θ_n^p = ℬ × ℋ_n^p. Clearly ℋ_n^p ⊆ ℋ_{n+1}^p ⊆ ℋ_p for all n ≥ 1. The sieve estimator θ̂_n = (β̂_n, ζ̂_n(·, β̂_n)), where ζ̂_n(t, x, β̂_n) = ĝ_n(t − x′(β̂_n − β_0)), is the maximizer of the empirical log likelihood over the sieve space Θ_n^p. The following theorem gives the convergence rate of the proposed estimator θ̂_n to the true parameter θ_0 = (β_0, ζ_0(·, β_0)) = (β_0, g_0).

Theorem 4.1. Let K_n = O(n^ν), where ν satisfies the restriction 1/(2(1+p)) < ν < 1/(2p) with p being the smoothness parameter defined in Condition C.6. Suppose Conditions C.1–C.7 hold and the failure time T follows model (1.1). Then

$$d(\hat\theta_n, \theta_0) = O_p\bigl\{n^{-\min(p\nu,\,(1-\nu)/2)}\bigr\},$$

where d(·, ·) is defined in (4.2).

Remark. It is worth pointing out that the sieve space 𝒢_n^p does not have to be restricted to the B-spline space; it can be any sieve space as long as the estimator θ̂_n ∈ ℬ × ℋ_n^p satisfies the conditions of Theorem 1 in [24]. We refer to [4] for a comprehensive discussion of sieve estimation for semiparametric models in general sieve spaces. Our choice of the B-spline space is primarily motivated by its simplicity of numerical implementation, which is a tremendous advantage of the proposed approach over existing numerical methods for the accelerated failure time model, in particular the linear programming approach.

We provide a proof of Theorem 4.1 in the online Supplementary Material by checking the conditions of Theorem 1 in [24]. Theorem 4.1 implies that if ν = 1/(1 + 2p), then d(θ̂_n, θ_0) = O_p(n^{−p/(1+2p)}), which is the optimal convergence rate in the nonparametric regression setting. Although the overall convergence rate is slower than n^{−1/2}, the next theorem states that the proposed estimator of the regression parameter is still asymptotically normal and semiparametrically efficient.
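As a worked instance of this rate, take the smallest smoothness p = 3 allowed by Condition C.6 and the optimal ν:

$$\nu = \frac{1}{1+2p} = \frac{1}{7}, \qquad K_n = O(n^{1/7}), \qquad p\nu = \frac{3}{7} = \frac{1-\nu}{2}, \qquad d(\hat\theta_n, \theta_0) = O_p(n^{-3/7}),$$

so the approximation and estimation exponents balance exactly at this choice of ν.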

Theorem 4.2. Given the following efficient score function for the censored linear model derived by [21]:

$$l^*_{\beta_0}(Y,\Delta,X) = \int \bigl\{X - P(X \mid Y - X'\beta_0 \ge t)\bigr\}\Bigl\{-\frac{\dot\lambda_0}{\lambda_0}(t)\Bigr\}\,dM(t),$$

where

$$M(t) = \Delta I(Y - X'\beta_0 \le t) - \int_{-\infty}^t I(Y - X'\beta_0 \ge s)\,\lambda_0(s)\,ds$$

is the failure counting process martingale and

$$P(X \mid Y - X'\beta_0 \ge t) = \frac{P\{X\,I(Y - X'\beta_0 \ge t)\}}{P\{I(Y - X'\beta_0 \ge t)\}}$$

a form that was given by [20]. Suppose that the conditions in Theorem 4.1 hold and that I(β_0) = P{l^*_{β_0}(Y, Δ, X)^{⊗2}} is nonsingular. Then

$$n^{1/2}(\hat\beta_n - \beta_0) = n^{-1/2} I^{-1}(\beta_0) \sum_{i=1}^n l^*_{\beta_0}(Y_i,\Delta_i,X_i) + o_p(1) \to N\bigl(0, I^{-1}(\beta_0)\bigr)$$

in distribution.

The proof of Theorem 4.2 is where we apply the general sieve M-theorem proposed in Section 2. We prove the theorem by checking Assumptions A1–A6; details are provided in Section 7. The following theorem gives the consistency of the variance estimator based on the above efficient score.

Theorem 4.3. Suppose the conditions in Theorem 4.2 hold. Denote

$$\hat l^*_{\hat\beta_n}(Y,\Delta,X) = \int \bigl\{X - \hat P(t;\hat\beta_n)\bigr\}\bigl\{-\dot{\hat g}_n(t)\bigr\}\,d\hat M(t),$$
where
$$\hat P(t;\hat\beta_n) = \frac{\mathbb{P}_n\{X\,I(Y - X'\hat\beta_n \ge t)\}}{\mathbb{P}_n\{I(Y - X'\hat\beta_n \ge t)\}} \quad\text{and}\quad \hat M(t) = \Delta I(Y - X'\hat\beta_n \le t) - \int_{-\infty}^t I(Y - X'\hat\beta_n \ge s)\exp\{\hat g_n(s)\}\,ds.$$

Then $$\mathbb{P}_n\bigl\{\hat l^*_{\hat\beta_n}(Y,\Delta,X)^{\otimes 2}\bigr\} \to P\bigl\{l^*_{\beta_0}(Y,\Delta,X)^{\otimes 2}\bigr\} = I(\beta_0)$$ in probability.

It is clearly seen that P̂(t; β̂_n) in Theorem 4.3 estimates P(X | Y − X′β_0 ≥ t) in Theorem 4.2. The proof of Theorem 4.3 is provided in the Supplementary Material.
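The plug-in estimator of Theorem 4.3 is straightforward to transcribe numerically. The sketch below reuses `basis`, `knots`, `q_n`, `p`, `a`, `b` and the fitted `beta_hat`, `gamma_hat` from the sketch in Section 3; the uniform grid used for the compensator integral is an illustrative assumption rather than the implementation used in our simulations.

```python
import numpy as np
from scipy.interpolate import BSpline

r = Y - X @ beta_hat                                   # residual times Y_i - X_i' beta_hat

def ghat(t):                                           # fitted log hazard g_n
    return basis(t) @ gamma_hat

def dghat(t):                                          # its derivative, via BSpline.derivative
    return BSpline(knots, np.eye(q_n), p - 1).derivative()(np.atleast_1d(t)) @ gamma_hat

def Pbar(t):
    """P-hat(t; beta_hat): at-risk average of X at residual time t."""
    at_risk = r[:, None] >= np.atleast_1d(t)[None, :]  # n x len(t) risk indicators
    return (X.T @ at_risk / np.maximum(at_risk.sum(axis=0), 1)).T

grid = np.linspace(a, b, 200); dt = grid[1] - grid[0]
w = -dghat(grid) * np.exp(ghat(grid))                  # {-dg_n(t)} e^{g_n(t)} weight
XC = X[:, None, :] - Pbar(grid)[None, :, :]            # centered covariates on the grid
at_risk = r[:, None] >= grid[None, :]
jump = Delta[:, None] * (X - Pbar(r)) * (-dghat(r))[:, None]   # dN part of dM-hat
comp = np.einsum('nt,t,ntd->nd', at_risk, w, XC) * dt          # compensator part of dM-hat
lstar = jump - comp                                    # estimated efficient score per subject
I_hat = lstar.T @ lstar / len(Y)                       # estimates I(beta_0)
se = np.sqrt(np.diag(np.linalg.inv(I_hat)) / len(Y))   # 1SEE-type standard errors
```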

5. Numerical examples

5.1. Simulations

Extensive simulations are carried out to evaluate the finite sample performance of the proposed method. In the simulation studies, failure times are generated from the model

$$\log T = 2 + X_1 + X_2 + e_0,$$

where X_1 is Bernoulli with success probability 0.5, and X_2 is an independent normal variable with mean 0 and standard deviation 0.5, truncated at ±2. This is the same model used by [14] and [31]. We consider six error distributions: standard normal; standard extreme-value; mixtures of N(0, 1) and N(0, 3²) with mixing probabilities (0.5, 0.5) and (0.95, 0.05), denoted by 0.5N(0, 1) + 0.5N(0, 3²) and 0.95N(0, 1) + 0.05N(0, 3²), respectively; Gumbel(−0.5μ, 0.5) with μ being the Euler constant; and 0.5N(0, 1) + 0.5N(−1, 0.5²). The first four distributions were also considered by [31]. Similarly to [31], the censoring times are generated from the Uniform[0, c] distribution, where c is chosen to produce a 25% censoring rate. We set the sample size n to 200, 400 and 600.
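The data generation in these studies can be sketched as follows; the Monte Carlo search used to calibrate c to a 25% censoring rate is an illustrative assumption (the paper does not specify how c was chosen), and only the N(0, 1) error case is shown.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, c, rng):
    """Simulate from log T = 2 + X1 + X2 + e0 with Uniform[0, c] censoring."""
    X1 = rng.binomial(1, 0.5, n)                     # Bernoulli(0.5)
    X2 = np.clip(rng.normal(0.0, 0.5, n), -2, 2)     # N(0, 0.25), clipped at +/-2
                                                     # (simple stand-in for truncation)
    T = np.exp(2.0 + X1 + X2 + rng.normal(size=n))   # failure time, N(0, 1) error
    C = rng.uniform(0.0, c, n)                       # censoring time
    Y = np.log(np.minimum(T, C))                     # log-transformed observed time
    return np.column_stack([X1, X2]), Y, (T <= C).astype(int)

# calibrate c so that roughly 25% of observations are censored
cs = np.linspace(50.0, 400.0, 36)
rates = [1.0 - simulate(100_000, c, rng)[2].mean() for c in cs]
c_star = cs[int(np.argmin(np.abs(np.array(rates) - 0.25)))]
X, Y, Delta = simulate(200, c_star, rng)             # one dataset with n = 200
```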

We choose cubic B-splines with one interior knot for n = 200 and 400, and two interior knots for n = 600. We perform the sieve maximum likelihood analysis and obtain the estimates of the slope parameters using the Newton-Raphson algorithm that updates (β, γ) iteratively. We stop the iteration when the change in the parameter estimates or the gradient value is less than a pre-specified tolerance, set to 10^{−5} in our simulations. Log-rank and Gehan-weighted estimators are included for efficiency comparisons. We calculate the theoretical semiparametric efficiency bound I^{−1}(β_0) and scale it by the sample size, i.e., σ* = {I^{−1}(β_0)/n}^{1/2}, which serves as the reference standard error under the fully efficient situation. Table 1 summarizes the results of these studies based on 1000 simulated datasets. The biases of the proposed estimators of β_1 and β_2 are negligible. Both variance estimation procedures, denoted 1SEE (standard error estimates obtained by inverting the information matrix based on the efficient score function) and 2SEE (standard error estimates obtained by inverting the observed information matrix of all parameters, including the nuisance parameters), yield standard error estimates close to the empirical standard error SE, and the 95% confidence intervals have proper coverage probabilities, especially when the sample size is large. For the N(0, 1) error and the two mixtures of normal errors that were also considered in [31], the proposed estimators are more efficient than the log-rank estimators and have variances similar to the Gehan-weighted estimators. For the standard extreme-value error, the proposed estimators are more efficient than the Gehan-weighted estimator and similar to the log-rank estimator, which is known to be the most efficient estimator under this particular error distribution. For the Gumbel(−0.5μ, 0.5) and 0.5N(0, 1) + 0.5N(−1, 0.5²) errors, the proposed estimators are more efficient than the other two estimators. Under all six error distributions, the standard errors of the proposed estimators are close to the efficient theoretical standard errors. The sample averages of the estimates of λ_0 under the different simulation settings are reasonably close to the corresponding true curves (results not shown here; see [7] for details).

Table 1.

Summary statistics for the simulation studies. The true slope parameters are β_1 = 1 and β_2 = 1. (a): N(0, 1); (b): standard extreme-value; (c): 0.5N(0, 1) + 0.5N(0, 3²); (d): 0.95N(0, 1) + 0.05N(0, 3²); (e): Gumbel(−0.5μ, 0.5); (f): 0.5N(0, 1) + 0.5N(−1, 0.5²).

                         B-spline MLE                            Log-rank        Gehan
Err. dist   n       Bias   SE    1SEE (CP)    2SEE (CP)      Bias   SE      Bias   SE     σ*
(a) 200 β1 .003 .168 .149 (.912) .155 (.924) .000 .170 .002 .159 .155
β2 .003 .167 .153 (.928) .156 (.928) .004 .171 .002 .160 .156
400 β1 .006 .110 .108 (.948) .110 (.950) .005 .115 .008 .108 .110
β2 .001 .110 .109 (.944) .110 (.945) .002 .116 .001 .109 .110
600 β1 .001 .092 .088 (.939) .090 (.943) .001 .096 .002 .093 .090
β2 .005 .091 .089 (.945) .090 (.944) .005 .097 .003 .092 .090
(b) 200 β1 −.009 .180 .154 (.894) .161 (.903) −.008 .168 −.007 .190 .165
β2 .004 .182 .162 (.903) .163 (.915) .005 .170 .005 .195 .169
400 β1 .000 .126 .113 (.914) .115 (.923) −.001 .124 .000 .143 .117
β2 .008 .118 .116 (.934) .116 (.938) .010 .116 .012 .135 .120
600 β1 .001 .102 .093 (.919) .094 (.923) .001 .100 .000 .114 .095
β2 .011 .098 .095 (.944) .095 (.945) .011 .097 .007 .114 .098
(c) 200 β1 .014 .300 .281 (.930) .279 (.924) −.020 .315 −.019 .292 .259
β2 .000 .306 .285 (.916) .282 (.918) .002 .317 .002 .288 .260
400 β1 .034 .199 .206 (.955) .200 (.949) .002 .218 .002 .197 .183
β2 −.003 .207 .208 (.949) .202 (.942) −.001 .222 −.002 .200 .184
600 β1 .035 .168 .171 (.957) .165 (.949) .003 .185 .001 .163 .150
β2 −.007 .169 .172 (.956) .166 (.956) −.004 .190 −.002 .168 .150
(d) 200 β1 −.013 .172 .157 (.926) .164 (.927) −.010 .181 −.007 .166 .167
β2 −.004 .180 .160 (.908) .164 (.913) −.005 .184 −.005 .173 .166
400 β1 .003 .119 .113 (.944) .116 (.948) .004 .126 .006 .117 .118
β2 .003 .117 .114 (.942) .116 (.953) .004 .126 .003 .115 .118
600 β1 −.003 .097 .093 (.948) .095 (.952) −.002 .105 .002 .097 .096
β2 .001 .096 .094 (.942) .095 (.944) .002 .105 .003 .094 .096
(e) 200 β1 .004 .080 .077 (.944) .078 (.946) −.001 .111 .004 .088 .079
β2 −.001 .083 .080 (.929) .078 (.934) .000 .114 .000 .091 .080
400 β1 −.005 .055 .055 (.946) .055 (.951) −.003 .079 −.004 .061 .056
β2 .003 .055 .056 (.954) .056 (.950) .003 .081 .003 .063 .056
600 β1 −.003 .047 .045 (.940) .045 (.938) .000 .067 −.001 .052 .045
β2 −.001 .047 .046 (.944) .045 (.943) −.002 .066 −.001 .051 .046
(f) 200 β1 −.002 .126 .117 (.918) .120 (.929) −.002 .159 −.001 .128 .119
β2 .000 .133 .120 (.917) .121 (.926) .002 .164 .001 .134 .116
400 β1 −.002 .087 .084 (.949) .085 (.950) .003 .114 .000 .091 .084
β2 .004 .086 .086 (.951) .086 (.953) .003 .111 .004 .090 .082
600 β1 .003 .074 .070 (.929) .070 (.931) .005 .101 .001 .074 .069
β2 .003 .074 .070 (.936) .070 (.936) .009 .104 .004 .075 .067

5.2. A real data example

We use the Stanford heart transplant data [17] as an illustrative example. This dataset was also analyzed by [14] using their proposed least squares estimators. Following their analysis, we consider the same two models: the first one regresses the base-10 logarithm of the survival time on age at transplant and T5 mismatch score for the 157 patients with complete records on the T5 measure, and the second one regresses the base-10 logarithm of the survival time on age and age². There were 55 censored patients. We fit these two models using the proposed method with five cubic B-spline basis functions.

We report the parameter estimates and the standard error estimates in Table 2 and compare them with the Gehan-weighted estimators reported by [13] and the Buckley-James estimators reported by [17]. For the first model, the parameter estimates for the age effect are fairly similar among all estimators, and the standard error estimate from the proposed method tends to be smaller, while the parameter estimates for the T5 mismatch score vary across the different estimators, with none of them being significant at the 0.05 level. The disparity of the T5 effect may be due to what was pointed out by [17]: the accelerated failure time model with age and T5 as covariates does not fit the data ideally. For the second model, with age and age² as covariates, the point estimates are very similar across all methods and the standard error estimates from the proposed method are the smallest.

Table 2.

Regression parameter estimates and standard error estimates for the Stanford heart transplant data. The proposed estimators are compared with Gehan-weighted estimators reported in [13] and Buckley-James estimators reported in [17].

                  B-spline MLE        Gehan-weighted       Buckley-James
Covariate         Est.      SE        Est.      SE         Est.      SE
M. 1   Age       −0.0237   0.0068    −0.0211   0.0106     −0.015    0.008
       T5        −0.2118   0.1271    −0.0265   0.1507     −0.003    0.134
M. 2   Age        0.1022   0.0245     0.1046   0.0474      0.107    0.037
       Age²      −0.0016   0.0004    −0.0017   0.0006     −0.0017   0.0005

6. Discussion

By applying the proposed general sieve M-estimation theory for semiparametric models with bundled parameters, we are able to derive the asymptotic distribution of the sieve maximum likelihood estimator in a linear regression model where the response variable is subject to right censoring. By providing an estimating procedure that is both statistically and computationally efficient, this work makes the linear model a more viable alternative to the Cox proportional hazards model. Compared to existing methods for estimating β in a linear model, the proposed method has three advantages. First, the estimating functions are smooth, in contrast to the discrete estimating functions of existing methods, so the root search is easier and can be done quickly by conventional iterative methods such as the Newton-Raphson algorithm. Second, the standard error estimates are obtained directly by inverting either the efficient information matrix for the regression parameters or the observed information matrix of all parameters, both of which are more computationally tractable than re-sampling techniques. Third, the proposed estimator achieves the semiparametric efficiency bound.

The proposed general sieve M-estimation theory can also be applied to other statistical models, for example, the single index model, the Cox model with an unknown link function, and the linear model under different censoring mechanisms. Such research is ongoing and will be presented elsewhere.

7. Proof of Theorem 4.2

Empirical process theory developed in [26, 27] will be heavily involved in the proof. We use the symbol ≲ to denote that the left hand side is bounded above by a constant times the right hand side and ≳ to denote that the left hand side is bounded below by a constant times the right hand side. For notational simplicity, we drop the superscript * in the outer probability measure P* whenever an outer probability applies.

7.1. Technical lemmas

We first introduce several lemmas that will be used for the proofs of Theorems 4.1, 4.2 and 4.3. Proofs of these lemmas are provided in the online Supplementary Material.

Lemma 7.1. Under Conditions C.1–C.3 and C.6, the log-likelihood

$$l(\beta,\zeta(\cdot,\beta);Z) = \Delta\,g(\varepsilon_0 - X'(\beta-\beta_0)) - \int_a^b 1(\varepsilon_0 \ge t)\exp\{g(t - X'(\beta-\beta_0))\}\,dt,$$

where ε_0 = Y − X′β_0, has bounded and continuous first and second derivatives with respect to β ∈ ℬ and ζ(·, β) ∈ ℋ_p.

Lemma 7.2. For g_0 ∈ 𝒢_p, there exists a function g_{0,n} ∈ 𝒢_n^p such that

$$\|g_{0,n} - g_0\|_\infty = O(n^{-p\nu}).$$

Lemma 7.3. Let θ_{0,n} = (β_0, ζ_{0,n}(·, β_0)) with ζ_{0,n}(·, β_0) ≡ g_{0,n} defined in Lemma 7.2. Denote ℱ_n = {l(θ; z) − l(θ_{0,n}; z) : θ ∈ Θ_n^p}. Assume that Conditions C.1–C.3 and C.6 hold. Then the ε-bracketing number associated with the ‖·‖_∞ norm for ℱ_n is bounded by (1/ε)^{c q_n + d}, i.e., N_{[ ]}(ε, ℱ_n, ‖·‖_∞) ≲ (1/ε)^{c q_n + d} for some constant c > 0.

Lemma 7.4. Let h_j^*(t,x,β) = w_j^*(ψ(t,x,β)), where h_j^*(t,x,β_0) = w_j^*(t) = −ġ_0(t)P(X_j | ε_0 ≥ t), j = 1, …, d. Assume Conditions C.1–C.6 hold. Then there exists h_{j,n}^*(t,x,β) = w_{j,n}^*(ψ(t,x,β)) ∈ ℍ_n^2 such that ‖h_{j,n}^* − h_j^*‖_∞ = O(n^{−2ν}), or equivalently, ‖w_{j,n}^* − w_j^*‖_∞ = O(n^{−2ν}), where w_{j,n}^* ∈ 𝒢_n^2.

Lemma 7.5. For h_j^* defined in Lemma 7.4, denote the class of functions

$$\mathcal{F}_n^j(\eta) = \bigl\{\dot l_\zeta(\theta; z)[h_j^* - h_j] : \theta \in \Theta_n^p,\ h_j \in \mathbb{H}_n^2,\ d(\theta,\theta_0) \le \eta,\ \|h_j - h_j^*\|_\infty \le \eta\bigr\}.$$

Assume Conditions C.1–C.6 hold. Then N_{[ ]}(ε, ℱ_n^j(η), ‖·‖_∞) ≲ (η/ε)^{c q_n + d} for some constant c > 0.

Lemma 7.6. For j = 1, ⋯, d, define the following two classes of functions

$$\mathcal{F}_{n,j}^\beta(\eta) = \bigl\{\dot l_{\beta_j}(\theta; z) - \dot l_{\beta_j}(\theta_0; z) : \theta \in \Theta_n^p,\ d(\theta,\theta_0) \le \eta,\ \|\dot g(\psi(\cdot,\beta)) - \dot g_0(\psi(\cdot,\beta_0))\|_2 \le \eta\bigr\},$$

and

$$\mathcal{F}_{n,j}^\zeta(\eta) = \bigl\{\dot l_\zeta(\theta; z)[h_j^*(\cdot,\beta)] - \dot l_\zeta(\theta_0; z)[h_j^*(\cdot,\beta_0)] : \theta \in \Theta_n^p,\ d(\theta,\theta_0) \le \eta\bigr\},$$

where l̇_{β_j}(θ; Z) is the jth element of l̇_β(θ; Z), ġ(·) denotes the derivative of g(·), and h_j^* is defined in Lemma 7.4. Assume Conditions C.1–C.6 hold. Then N_{[ ]}(ε, ℱ_{n,j}^β(η), ‖·‖_∞) ≲ (η/ε)^{c_1 q_n + d} and N_{[ ]}(ε, ℱ_{n,j}^ζ(η), ‖·‖_∞) ≲ (η/ε)^{c_2 q_n + d} for some constants c_1, c_2 > 0.

7.2. Proof of Theorem 4.2

We prove the theorem by checking Assumptions A1–A6 in Section 2. Here the criterion function of a single observation is the log likelihood function l(β, ζ(·, β); Z), so instead of m we use l to denote the criterion function. By Theorem 4.1, Assumption A1 holds with ξ = min(pν, (1 − ν)/2) and the norm ‖·‖_2 defined in (4.1). A2 automatically holds for the scores. For A3, we need to find an h^* = (h_1^*, ⋯, h_d^*)′ with h^*(t, x, β_0) = w^*(t) such that

$$\ddot l_{\beta\zeta}(\beta_0,\zeta_0(\cdot,\beta_0))[h] - \ddot l_{\zeta\zeta}(\beta_0,\zeta_0(\cdot,\beta_0))[h^*,h] = P\bigl\{\ddot l_{\beta\zeta}(\beta_0,\zeta_0(\cdot,\beta_0);Z)[h] - \ddot l_{\zeta\zeta}(\beta_0,\zeta_0(\cdot,\beta_0);Z)[h^*,h]\bigr\} = 0$$

for all h ∈ ℍ with h(t, x, β) = w(t − x′(β − β_0)). Note that

$$\begin{aligned}
&P\bigl\{\ddot l_{\beta\zeta}(\beta_0,\zeta_0(\cdot,\beta_0);Z)[h] - \ddot l_{\zeta\zeta}(\beta_0,\zeta_0(\cdot,\beta_0);Z)[h^*,h]\bigr\} \\
&\quad = P\Bigl\{-X\Bigl[\Delta\dot w(\varepsilon_0) - \int_a^b 1(\varepsilon_0 \ge t)\exp\{g_0(t)\}\dot w(t)\,dt\Bigr] + \int_a^b 1(\varepsilon_0 \ge t)\exp\{g_0(t)\}\,w(t)\bigl[X\dot g_0(t) + w^*(t)\bigr]dt\Bigr\}.
\end{aligned}$$

Since P{l̇_ζ(β_0, ζ_0(·, β_0); Z)[h] | X} = 0 for all h ∈ ℍ, replacing h(·, β_0) by ẇ we have

$$P\Bigl\{X\Bigl[\Delta\dot w(\varepsilon_0) - \int_a^b 1(\varepsilon_0\ge t)\exp\{g_0(t)\}\dot w(t)\,dt\Bigr]\Bigr\} = P\Bigl\{X \cdot P\Bigl[\Delta\dot w(\varepsilon_0) - \int_a^b 1(\varepsilon_0\ge t)\exp\{g_0(t)\}\dot w(t)\,dt \,\Big|\, X\Bigr]\Bigr\} = P\{X\cdot 0\} = 0.$$

Hence we only need to find a w* such that

$$P\Bigl\{\int_a^b 1(\varepsilon_0\ge t)\exp\{g_0(t)\}\,w(t)\bigl[X\dot g_0(t) + w^*(t)\bigr]dt\Bigr\} = \int_a^b \exp\{g_0(t)\}\,w(t)\bigl\{\dot g_0(t)P[1(\varepsilon_0\ge t)X] + w^*(t)P[1(\varepsilon_0\ge t)]\bigr\}dt = 0.$$

One obvious choice for w* (or h*) is

$$h^*(t,x,\beta_0) = w^*(t) = -\dot g_0(t)\,\frac{P[1(\varepsilon_0\ge t)X]}{P[1(\varepsilon_0\ge t)]} = -\dot g_0(t)\,P(X \mid \varepsilon_0 \ge t). \tag{7.1}$$

Then it follows

$$\begin{aligned}
&\dot l_\beta(\beta_0,\zeta_0(\cdot,\beta_0);Z) - \dot l_\zeta(\beta_0,\zeta_0(\cdot,\beta_0);Z)[h^*] \\
&\quad = \Delta\{-\dot g_0(Y-X'\beta_0)\}\{X - P(X \mid \varepsilon_0 \ge Y-X'\beta_0)\} - \int 1(Y-X'\beta_0 \ge t)\{X - P(X \mid \varepsilon_0 \ge t)\}\{-\dot g_0(t)\}\exp\{g_0(t)\}\,dt \\
&\quad = \int \{X - P(X \mid \varepsilon_0 \ge t)\}\{-\dot g_0(t)\}\,dM(t) = l^*_{\beta_0}(Y,\Delta,X),
\end{aligned}$$

which is the efficient score function for β0 originally derived by [21], where

$$M(t) = \Delta I(Y - X'\beta_0 \le t) - \int_{-\infty}^t I(Y - X'\beta_0 \ge s)\exp\{g_0(s)\}\,ds.$$

By the fact that a score function has zero mean, it is straightforward to verify the following equalities:

$$\begin{aligned}
P\,\ddot l_{\beta\zeta}(\beta,\zeta(\cdot,\beta);Z)[h] &= -P\bigl\{\dot l_\beta(\beta,\zeta(\cdot,\beta);Z)\,\dot l_\zeta(\beta,\zeta(\cdot,\beta);Z)[h]\bigr\}, \\
P\,\ddot l_{\zeta\beta}(\beta,\zeta(\cdot,\beta);Z)[h] &= -P\bigl\{\dot l_\zeta(\beta,\zeta(\cdot,\beta);Z)[h]\,\dot l_\beta(\beta,\zeta(\cdot,\beta);Z)'\bigr\}, \\
P\,\ddot l_{\beta\beta}(\beta,\zeta(\cdot,\beta);Z) &= -P\bigl\{\dot l_\beta(\beta,\zeta(\cdot,\beta);Z)\,\dot l_\beta(\beta,\zeta(\cdot,\beta);Z)'\bigr\}, \\
P\,\ddot l_{\zeta\zeta}(\beta,\zeta(\cdot,\beta);Z)[h_1,h_2] &= -P\bigl\{\dot l_\zeta(\beta,\zeta(\cdot,\beta);Z)[h_1]\,\dot l_\zeta(\beta,\zeta(\cdot,\beta);Z)[h_2]\bigr\}.
\end{aligned}$$

Then together with the fact that

$$P\bigl\{\ddot l_{\beta\zeta}(\beta_0,\zeta_0(\cdot,\beta_0);Z)[h^*] - \ddot l_{\zeta\zeta}(\beta_0,\zeta_0(\cdot,\beta_0);Z)[h^*,h^*]\bigr\} = 0,$$

the matrix A in Assumption A3 of Theorem 2.1 is given by

$$\begin{aligned}
A &= -P\bigl\{\ddot l_{\beta\beta}(\beta_0,\zeta_0(\cdot,\beta_0);Z) - \ddot l_{\zeta\beta}(\beta_0,\zeta_0(\cdot,\beta_0);Z)[h^*] - \ddot l_{\beta\zeta}(\beta_0,\zeta_0(\cdot,\beta_0);Z)[h^*] + \ddot l_{\zeta\zeta}(\beta_0,\zeta_0(\cdot,\beta_0);Z)[h^*,h^*]\bigr\} \\
&= P\bigl\{\dot l_\beta(\beta_0,\zeta_0(\cdot,\beta_0);Z)\,\dot l_\beta(\beta_0,\zeta_0(\cdot,\beta_0);Z)' - \dot l_\zeta(\beta_0,\zeta_0(\cdot,\beta_0);Z)[h^*]\,\dot l_\beta(\beta_0,\zeta_0(\cdot,\beta_0);Z)' \\
&\qquad\ - \dot l_\beta(\beta_0,\zeta_0(\cdot,\beta_0);Z)\,\dot l_\zeta(\beta_0,\zeta_0(\cdot,\beta_0);Z)[h^*]' + \dot l_\zeta(\beta_0,\zeta_0(\cdot,\beta_0);Z)[h^*]\,\dot l_\zeta(\beta_0,\zeta_0(\cdot,\beta_0);Z)[h^*]'\bigr\} \\
&= P\bigl\{\dot l_\beta(\beta_0,\zeta_0(\cdot,\beta_0);Z) - \dot l_\zeta(\beta_0,\zeta_0(\cdot,\beta_0);Z)[h^*]\bigr\}^{\otimes 2} = P\,l^*_{\beta_0}(Y,\Delta,X)^{\otimes 2},
\end{aligned}$$

which is the information matrix for β0.

To verify A4, we note that the first part holds automatically since β̂_n satisfies the score equation l̇_{β,n}(β̂_n, ζ̂_n(·, β̂_n)) = ℙ_n l̇_β(β̂_n, ζ̂_n(·, β̂_n); Z) = 0. Next we shall show that

$$\dot l_{\zeta,n}(\hat\beta_n,\hat\zeta_n(\cdot,\hat\beta_n))[h_j^*] = \mathbb{P}_n\Bigl\{\Delta\, w_j^*(Y-X'\hat\beta_n) - \int 1(Y \ge t)\exp\{\hat\zeta_n(t,X,\hat\beta_n)\}\,w_j^*(t - X'\hat\beta_n)\,dt\Bigr\} = o_p(n^{-1/2}),$$

where w_j^*(t) = −ġ_0(t)P(X_j | ε_0 ≥ t), j = 1, ⋯, d, is the jth component of w^*(t) given in (7.1). According to Lemma 7.4, there exists h_{j,n}^* ∈ ℍ_n^2 such that ‖h_j^* − h_{j,n}^*‖_∞ = O(n^{−2ν}). Then by the score equation for γ, l̇_{γ,n}(β̂_n, γ̂_n) = ℙ_n l̇_γ(β̂_n, γ̂_n; Z) = 0, and the fact that w_{j,n}^*(t) can be written as w_{j,n}^*(t) = Σ_{k=1}^{q_n} γ_{j,k}^* B_k(t) for some coefficients {γ_{j,1}^*, ⋯, γ_{j,q_n}^*} and basis functions B_k(t) of the spline space, it follows that

$$\mathbb{P}_n\Bigl\{\Delta\, w_{j,n}^*(Y-X'\hat\beta_n) - \int 1(Y \ge t)\exp\{\hat\zeta_n(t,X,\hat\beta_n)\}\,w_{j,n}^*(t-X'\hat\beta_n)\,dt\Bigr\} = 0.$$

So it suffices to show that for each 1 ≤ j ≤ d,

$$I_n = \mathbb{P}_n\,\dot l_\zeta(\hat\beta_n,\hat\zeta_n(\cdot,\hat\beta_n);Z)[h_j^* - h_{j,n}^*] = o_p(n^{-1/2}).$$

Since P{l̇_ζ(β_0, ζ_0(·, β_0); Z)[h_j^* − h_{j,n}^*]} = 0, we decompose I_n as I_n = I_{1n} + I_{2n}, where

$$I_{1n} = (\mathbb{P}_n - P)\,\dot l_\zeta(\hat\beta_n,\hat\zeta_n(\cdot,\hat\beta_n);Z)[h_j^* - h_{j,n}^*]$$

and

$$I_{2n} = P\bigl\{\dot l_\zeta(\hat\beta_n,\hat\zeta_n(\cdot,\hat\beta_n);Z)[h_j^* - h_{j,n}^*] - \dot l_\zeta(\beta_0,\zeta_0(\cdot,\beta_0);Z)[h_j^* - h_{j,n}^*]\bigr\}.$$

We will show that I_{1n} and I_{2n} are both o_p(n^{−1/2}).

First consider I_{1n}. According to Lemma 7.5, the ε-bracketing number associated with the ‖·‖_∞ norm for the class ℱ_n^j(η) defined in Lemma 7.5 is bounded by (η/ε)^{c q_n + d}. This implies that

$$\log N_{[\,]}(\varepsilon, \mathcal{F}_n^j(\eta), L_2(P)) \lesssim \log N_{[\,]}(\varepsilon, \mathcal{F}_n^j(\eta), \|\cdot\|_\infty) \lesssim q_n \log(\eta/\varepsilon),$$

which leads to the bracketing integral

$$J_{[\,]}(\eta, \mathcal{F}_n^j(\eta), L_2(P)) = \int_0^\eta \sqrt{1 + \log N_{[\,]}(\varepsilon, \mathcal{F}_n^j(\eta), L_2(P))}\;d\varepsilon \lesssim q_n^{1/2}\,\eta.$$

Now we pick η to be η_n = O{n^{−min(2ν, (1−ν)/2)}}; then

$$\|h_j^* - h_{j,n}^*\|_\infty = O(n^{-2\nu}) \le O\{n^{-\min(2\nu,\,(1-\nu)/2)}\} = \eta_n,$$

and since p ≥ 3,

$$d(\hat\theta_n,\theta_0) = O_p\{n^{-\min(p\nu,\,(1-\nu)/2)}\} \le O_p\{n^{-\min(2\nu,\,(1-\nu)/2)}\} = \eta_n.$$

Therefore, l̇_ζ(β̂_n, ζ̂_n(·, β̂_n); z)[h_j^* − h_{j,n}^*] ∈ ℱ_n^j(η_n). Denote t_β = t − X′(β − β_0) for notational simplicity. For any l̇_ζ(θ; Z)[h_j^* − h] ∈ ℱ_n^j(η_n), it follows that

$$\begin{aligned}
P\{\dot l_\zeta(\theta;Z)[h_j^* - h]\}^2
&= P\Bigl\{\Delta(w_j^* - w)(\varepsilon_\beta) - \int_a^b 1(\varepsilon_0\ge t)\exp\{g(t_\beta)\}(w_j^* - w)(t_\beta)\,dt\Bigr\}^2 \\
&\lesssim \|w_j^* - w\|_\infty^2 + P\Bigl\{\int_a^b \exp\{2g(t_\beta)\}(w_j^* - w)^2(t_\beta)\,dt\Bigr\} \\
&\lesssim \|w_j^* - w\|_\infty^2 + \|w_j^* - w\|_\infty^2 \int_a^b P[\exp\{2g(t_\beta)\}]\,dt,
\end{aligned}$$

where the first inequality holds because of the Cauchy-Schwarz inequality. Since ‖w_j^* − w‖_∞ ≤ η_n, by the same argument as in [24], page 591, for slowly growing c_n (their l_n), e.g. c_n = o(log η_n^{−1}), we know that l̇_ζ(θ; Z)[h_j^* − h] is bounded by some constant 0 < M < ∞ and that P{l̇_ζ(θ; Z)[h_j^* − h]}² ≲ η_n² for a slightly enlarged η_n obtained by a fine adjustment of ν. Then by the maximal inequality in Lemma 3.4.2 of [27], it follows that

$$\begin{aligned}
E_P\|\mathbb{G}_n\|_{\mathcal{F}_n^j(\eta_n)}
&\lesssim J_{[\,]}(\eta_n,\mathcal{F}_n^j(\eta_n),L_2(P))\Bigl(1 + \frac{J_{[\,]}(\eta_n,\mathcal{F}_n^j(\eta_n),L_2(P))}{\eta_n^2\sqrt{n}}\,M\Bigr) \\
&\lesssim q_n^{1/2}\eta_n + q_n n^{-1/2} = O\{n^{\nu/2-\min(2\nu,\,(1-\nu)/2)}\} + O(n^{\nu-1/2}) \\
&= O\{n^{-\min(3\nu/2,\;1/2-\nu)}\} + O(n^{\nu-1/2}) = o(1),
\end{aligned}$$

where the last equality holds because 0 < ν < 1/2. Thus by Markov's inequality, I_{1n} = n^{−1/2}𝔾_n l̇_ζ(θ̂_n; Z)[h_j^* − h_{j,n}^*] = o_p(n^{−1/2}).

Next, for I_{2n}, the Taylor expansion of l̇_ζ(θ̂_n; Z)[h_j^* − h_{j,n}^*] at θ_0 yields

$$\begin{aligned}
&\dot l_\zeta(\hat\beta_n,\hat\zeta_n(\cdot,\hat\beta_n);Z)[h_j^* - h_{j,n}^*] - \dot l_\zeta(\beta_0,\zeta_0(\cdot,\beta_0);Z)[h_j^* - h_{j,n}^*] \\
&\quad = (\hat\beta_n - \beta_0)'\,\ddot l_{\beta\zeta}(\tilde\beta_n,\tilde\zeta_n(\cdot,\tilde\beta_n);Z)[h_j^* - h_{j,n}^*] + \ddot l_{\zeta\zeta}(\tilde\beta_n,\tilde\zeta_n(\cdot,\tilde\beta_n);Z)[h_j^* - h_{j,n}^*,\;\hat\zeta_n - \zeta_0],
\end{aligned}$$

where (β̃n, ζ̃n(·, β̃n)) is between (β0, ζ0(·, β0)) and (β̂n, ζ̂n(·, β̂n)). Then it follows that

$$\begin{aligned}
\bigl|\ddot l_{\beta\zeta}(\tilde\beta_n,\tilde\zeta_n(\cdot,\tilde\beta_n);Z)[h_j^* - h_{j,n}^*]\bigr|
&= \Bigl|X\Bigl\{\Delta(\dot w_j^* - \dot w_{j,n}^*)(\varepsilon_{\tilde\beta_n}) - \int_a^b 1(\varepsilon_0\ge t)\exp\{\tilde g_n(t_{\tilde\beta_n})\}\bigl[(\dot w_j^* - \dot w_{j,n}^*)(t_{\tilde\beta_n}) + \dot{\tilde g}_n(t_{\tilde\beta_n})(w_j^* - w_{j,n}^*)(t_{\tilde\beta_n})\bigr]dt\Bigr\}\Bigr| \\
&\lesssim \|\dot w_j^* - \dot w_{j,n}^*\|_\infty + \|\dot w_j^* - \dot w_{j,n}^*\|_\infty\Bigl\{\int_a^b \exp\{\tilde g_n(t_{\tilde\beta_n})\}dt\Bigr\} + \|w_j^* - w_{j,n}^*\|_\infty\Bigl\{\int_a^b \exp\{\tilde g_n(t_{\tilde\beta_n})\}\dot{\tilde g}_n(t_{\tilde\beta_n})dt\Bigr\} \\
&\lesssim \|\dot w_j^* - \dot w_{j,n}^*\|_\infty + \|w_j^* - w_{j,n}^*\|_\infty = O(n^{-\nu}) + O(n^{-2\nu}) = O(n^{-\nu}),
\end{aligned}$$

where the second inequality holds because g̃_n and its first derivative are bounded (or grow with n slowly enough that they can effectively be treated as bounded, by the same argument as in [24], page 591), and the last equality holds by Corollary 6.21 of [22], which gives ‖ẇ_j^* − ẇ_{j,n}^*‖_∞ = O(n^{−(2−1)ν}) = O(n^{−ν}). Thus,

$$\bigl|(\hat\beta_n-\beta_0)'\,\ddot l_{\beta\zeta}(\tilde\beta_n,\tilde\zeta_n(\cdot,\tilde\beta_n);Z)[h_j^*-h_{j,n}^*]\bigr| = |\hat\beta_n-\beta_0|\cdot O(n^{-\nu}) = O_p\{n^{-\min(p\nu,\,(1-\nu)/2)}\}\cdot O(n^{-\nu}) = O_p\{n^{-\min((p+1)\nu,\;(1+\nu)/2)}\}.$$

Also,

$$\begin{aligned}
\bigl|\ddot l_{\zeta\zeta}(\tilde\beta_n,\tilde\zeta_n(\cdot,\tilde\beta_n);Z)[h_j^*-h_{j,n}^*,\;\hat\zeta_n-\zeta_0]\bigr|
&= \Bigl|\int_a^b 1(\varepsilon_0\ge t)\exp\{\tilde g_n(t_{\tilde\beta_n})\}(w_j^*-w_{j,n}^*)(t_{\tilde\beta_n})(\hat g_n-g_0)(t_{\tilde\beta_n})\,dt\Bigr| \\
&\le \|w_j^*-w_{j,n}^*\|_\infty\cdot\Bigl\{\int_a^b \exp\{\tilde g_n(t_{\tilde\beta_n})\}\,\bigl|\hat g_n-g_0\bigr|(t_{\tilde\beta_n})\,dt\Bigr\} = \|w_j^*-w_{j,n}^*\|_\infty\cdot I_{3n}.
\end{aligned}$$

By the Cauchy-Schwarz inequality and the boundedness of g̃_n, we have

$$\begin{aligned}
P\{I_{3n}\}^2 &= P\Bigl\{\int_a^b \exp\{\tilde g_n(t_{\tilde\beta_n})\}\,\bigl|\hat g_n-g_0\bigr|(t_{\tilde\beta_n})\,dt\Bigr\}^2 \lesssim \int_{\mathcal{X}}\int_a^b (\hat g_n-g_0)^2(t_{\tilde\beta})\,d\Lambda_0(t)\,dF_X(x) = \|\hat\zeta_n(\cdot,\tilde\beta_n)-\zeta_0(\cdot,\tilde\beta_n)\|_2^2 \\
&\lesssim |\tilde\beta_n-\hat\beta_n|^2 + \|\hat\zeta_n(\cdot,\hat\beta_n)-\zeta_0(\cdot,\beta_0)\|_2^2 + |\beta_0-\tilde\beta_n|^2 \lesssim |\hat\beta_n-\beta_0|^2 + \|\hat\zeta_n(\cdot,\hat\beta_n)-\zeta_0(\cdot,\beta_0)\|_2^2 = d(\hat\theta_n,\theta_0)^2.
\end{aligned}$$

Hence |I3n| ≲ d(θ̂n, θ0) and

$$\bigl|\ddot l_{\zeta\zeta}(\tilde\beta_n,\tilde\zeta_n;Z)[h_j^*-h_{j,n}^*,\;\hat\zeta_n-\zeta_0]\bigr| \lesssim \|w_j^*-w_{j,n}^*\|_\infty\cdot d(\hat\theta_n,\theta_0) = O(n^{-2\nu})\cdot O_p\{n^{-\min(p\nu,\,(1-\nu)/2)}\} = O_p\{n^{-\min((p+2)\nu,\;(1+3\nu)/2)}\}.$$

Since 1/(2(1+p)) < ν < 1/(2p), it follows that I_{2n} = O_p{n^{−min((p+1)ν, (1+ν)/2)}} = o_p(n^{−1/2}). Thus I_n = I_{1n} + I_{2n} = o_p(n^{−1/2}) and Assumption A4 holds.
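Spelling out the exponent arithmetic behind this conclusion: with 1/(2(1+p)) < ν < 1/(2p),

$$(p+1)\nu > \frac{p+1}{2(1+p)} = \frac{1}{2}, \qquad (p+2)\nu > \frac{1}{2}, \qquad \frac{1+\nu}{2} > \frac{1}{2}, \qquad \frac{1+3\nu}{2} > \frac{1}{2},$$

so both Taylor terms in the decomposition of I_{2n} are of order o(n^{−1/2}).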

Now we verify Assumption A5. First, by Lemma 7.6, the ε-bracketing numbers for the classes of functions ℱ_{n,j}^β(η) and ℱ_{n,j}^ζ(η) are both bounded by (η/ε)^{c q_n + d}, which implies that the corresponding ε-bracketing integrals are both bounded by q_n^{1/2}η, i.e.,

$$J_{[\,]}(\eta,\mathcal{F}_{n,j}^\beta(\eta),L_2(P)) \lesssim q_n^{1/2}\eta \quad\text{and}\quad J_{[\,]}(\eta,\mathcal{F}_{n,j}^\zeta(\eta),L_2(P)) \lesssim q_n^{1/2}\eta.$$

Then for l̇_{β_j}(θ; z) − l̇_{β_j}(θ_0; z), by applying the Cauchy-Schwarz inequality, together with subtracting and adding the terms ġ(ε_0), e^{g_0(t_β)}ġ(t_β), e^{g_0(t)}ġ(t_β) and e^{g_0(t)}ġ_0(t_β), we have

$$\begin{aligned}
\{\dot l_{\beta_j}(\theta;Z) - \dot l_{\beta_j}(\theta_0;Z)\}^2
&= \Bigl\{-\Delta X_j[\dot g(\varepsilon_\beta) - \dot g_0(\varepsilon_0)] + X_j\int_a^b 1(\varepsilon_0\ge t)\bigl[e^{g(t_\beta)}\dot g(t_\beta) - e^{g_0(t)}\dot g_0(t)\bigr]dt\Bigr\}^2 \\
&\lesssim \bigl\{\Delta[\dot g(\varepsilon_\beta) - \dot g_0(\varepsilon_0)]^2\bigr\} + \Bigl\{\int_a^b \bigl[e^{g(t_\beta)}\dot g(t_\beta) - e^{g_0(t)}\dot g_0(t)\bigr]^2 dt\Bigr\} \\
&\lesssim \bigl\{\Delta[\dot g(\varepsilon_\beta) - \dot g(\varepsilon_0)]^2\bigr\} + \bigl\{\Delta[\dot g(\varepsilon_0) - \dot g_0(\varepsilon_0)]^2\bigr\} + \int_a^b \bigl\{[e^{g(t_\beta)} - e^{g_0(t_\beta)}]^2 + [e^{g_0(t_\beta)} - e^{g_0(t)}]^2\bigr\}\dot g^2(t_\beta)\,dt \\
&\qquad + \int_a^b e^{2g_0(t)}\bigl\{[\dot g(t_\beta) - \dot g_0(t_\beta)]^2 + [\dot g_0(t_\beta) - \dot g_0(t)]^2\bigr\}dt \\
&= B_1 + B_2 + B_3 + B_4.
\end{aligned}$$

For B_1, since g̈ is bounded and the largest eigenvalue λ_d of P(XX′) satisfies 0 < λ_d < ∞ by Condition C.2(b), it follows that

$$P B_1 \lesssim P\bigl[\ddot g(Y - X'\tilde\beta)\,X'(\beta-\beta_0)\bigr]^2 \lesssim P[X'(\beta-\beta_0)]^2 \le \lambda_d\,|\beta-\beta_0|^2 \lesssim \eta^2.$$

For B2, we have

$$P B_2 \lesssim \int_{\mathcal{X}}\Bigl\{\int_a^b \bigl(\dot g(t) - \dot g_0(t)\bigr)^2\,d\Lambda_0(t)\Bigr\}dF_X(x) = \|\dot g(\psi(\cdot,\beta_0)) - \dot g_0(\psi(\cdot,\beta_0))\|_2^2 \lesssim |\beta-\beta_0|^2 + \|\dot g(\psi(\cdot,\beta)) - \dot g_0(\psi(\cdot,\beta_0))\|_2^2 \lesssim \eta^2.$$

For B3, by using the mean value theorem, it follows that

$$\begin{aligned}
P B_3 &= P\Bigl\{\int_a^b \bigl\{[e^{\tilde g(t_\beta)}(g-g_0)(t_\beta)]^2 + [e^{g_0(t_{\tilde\beta})}\dot g_0(t_{\tilde\beta})\,X'(\beta-\beta_0)]^2\bigr\}\dot g^2(t_\beta)\,dt\Bigr\} \\
&\lesssim \int_{\mathcal{X}}\int_a^b (g-g_0)^2(t_\beta)\,d\Lambda_0(t)\,dF_X(x) + P[X'(\beta-\beta_0)]^2 \lesssim \|\zeta(\cdot,\beta)-\zeta_0(\cdot,\beta_0)\|_2^2 + |\beta-\beta_0|^2 \lesssim \eta^2,
\end{aligned}$$

where g̃ = g_0 + ξ(g − g_0) for some 0 < ξ < 1 and thus g̃ is bounded. Finally, for B_4, by the mean value theorem, it follows that

$$\begin{aligned}
P B_4 &= P\Bigl\{\int_a^b e^{2g_0(t)}\bigl\{[\dot g(t_\beta)-\dot g_0(t_\beta)]^2 + [\dot g_0(t_\beta)-\dot g_0(t)]^2\bigr\}dt\Bigr\} \\
&\lesssim \int_{\mathcal{X}}\int_a^b (\dot g-\dot g_0)^2(t_\beta)\,d\Lambda_0(t)\,dF_X(x) + P\int_a^b \bigl[\ddot g_0(t_{\tilde\beta})\,X'(\beta-\beta_0)\bigr]^2 dt \\
&\lesssim \|\dot g(\psi(\cdot,\beta))-\dot g_0(\psi(\cdot,\beta))\|_2^2 + P[X'(\beta-\beta_0)]^2 \lesssim \|\dot g(\psi(\cdot,\beta))-\dot g_0(\psi(\cdot,\beta_0))\|_2^2 + |\beta-\beta_0|^2 \lesssim \eta^2.
\end{aligned}$$

Therefore we have P{l̇_{β_j}(θ; Z) − l̇_{β_j}(θ_0; Z)}² ≲ η². Using a similar argument, we can show that P{l̇_ζ(θ; Z)[h_j^*] − l̇_ζ(θ_0; Z)[h_j^*]}² ≲ η². By Lemma 7.1, ‖l̇_{β_j}(θ; Z) − l̇_{β_j}(θ_0; Z)‖_∞ and ‖l̇_ζ(θ; Z)[h_j^*] − l̇_ζ(θ_0; Z)[h_j^*]‖_∞ are also bounded. Now we pick η as η_n = O{n^{−min((p−1)ν, (1−ν)/2)}}; then by the maximal inequality in Lemma 3.4.2 of [27], it follows that

$$E_P\|\mathbb{G}_n\|_{\mathcal{F}_{n,j}^\beta(\eta_n)} \lesssim q_n^{1/2}\eta_n + q_n n^{-1/2} = O\bigl\{n^{\max((3/2-p)\nu,\;\nu-1/2)}\bigr\} + O(n^{\nu-1/2}) = o(1),$$

where the last equality holds since p ≥ 3 and ν < 1/2. Similarly, we have E_P‖𝔾_n‖_{ℱ_{n,j}^ζ(η_n)} = o(1). Thus for ξ = min(pν, (1 − ν)/2) and Cn^{−ξ} = O{n^{−min(pν, (1−ν)/2)}}, by Markov's inequality,

$$\begin{aligned}
\sup_{d(\theta,\theta_0)\le Cn^{-\xi}}\bigl|\mathbb{G}_n\{\dot l_{\beta_j}(\beta,\zeta(\cdot,\beta);Z) - \dot l_{\beta_j}(\beta_0,\zeta_0(\cdot,\beta_0);Z)\}\bigr| &= o_p(1), \\
\sup_{d(\theta,\theta_0)\le Cn^{-\xi}}\bigl|\mathbb{G}_n\{\dot l_\zeta(\beta,\zeta(\cdot,\beta);Z)[h_j^*] - \dot l_\zeta(\beta_0,\zeta_0(\cdot,\beta_0);Z)[h_j^*]\}\bigr| &= o_p(1).
\end{aligned}$$

This completes the verification of Assumption A5.

Finally, Assumption A6 can be verified by Taylor expansion. Since the proofs for the two equations in A6 are essentially identical, we only prove the first. In a neighborhood of θ_0, {θ : d(θ, θ_0) ≤ Cn^{−ξ}, θ ∈ Θ_n^p} with ξ = min(pν, (1 − ν)/2), the Taylor expansion of l̇_β(θ; Z) yields

$$\begin{aligned}
\dot l_\beta(\theta;Z) &= \dot l_\beta(\theta_0;Z) + \ddot l_{\beta\beta}(\tilde\theta;Z)(\beta-\beta_0) + \ddot l_{\beta\zeta}(\tilde\theta;Z)[\zeta(\cdot,\beta)-\zeta_0(\cdot,\beta_0)] \\
&= \dot l_\beta(\theta_0;Z) + \ddot l_{\beta\beta}(\theta_0;Z)(\beta-\beta_0) + \ddot l_{\beta\zeta}(\theta_0;Z)[\zeta(\cdot,\beta)-\zeta_0(\cdot,\beta_0)] \\
&\quad + \bigl\{\ddot l_{\beta\beta}(\tilde\theta;Z) - \ddot l_{\beta\beta}(\theta_0;Z)\bigr\}(\beta-\beta_0) + \bigl\{\ddot l_{\beta\zeta}(\tilde\theta;Z)[\zeta(\cdot,\beta)-\zeta_0(\cdot,\beta_0)] - \ddot l_{\beta\zeta}(\theta_0;Z)[\zeta(\cdot,\beta)-\zeta_0(\cdot,\beta_0)]\bigr\},
\end{aligned}$$

where θ̃ = (β̃, ζ̃(·, β̃)) is a midpoint between θ0 and θ. So

$$\begin{aligned}
&P\bigl\{\dot l_\beta(\theta;Z) - \dot l_\beta(\theta_0;Z) - \ddot l_{\beta\beta}(\theta_0;Z)(\beta-\beta_0) - \ddot l_{\beta\zeta}(\theta_0;Z)[\zeta(\cdot,\beta)-\zeta_0(\cdot,\beta_0)]\bigr\} \\
&\quad = P\bigl\{\ddot l_{\beta\beta}(\tilde\theta;Z) - \ddot l_{\beta\beta}(\theta_0;Z)\bigr\}(\beta-\beta_0) + P\bigl\{\ddot l_{\beta\zeta}(\tilde\theta;Z)[\zeta(\cdot,\beta)-\zeta_0(\cdot,\beta_0)] - \ddot l_{\beta\zeta}(\theta_0;Z)[\zeta(\cdot,\beta)-\zeta_0(\cdot,\beta_0)]\bigr\}.
\end{aligned}$$

Then by direct calculation we have

$$\begin{aligned}
P\bigl|\ddot l_{\beta\beta}(\tilde\theta;Z) - \ddot l_{\beta\beta}(\theta_0;Z)\bigr|
&\le P\bigl|XX'\Delta\{\ddot{\tilde g}(\varepsilon_{\tilde\beta}) - \ddot g_0(\varepsilon_0)\}\bigr| + P\Bigl\{XX'\Bigl|\int_a^b 1(\varepsilon_0\ge t)\bigl\{e^{\tilde g(t_{\tilde\beta})}\ddot{\tilde g}(t_{\tilde\beta}) - e^{g_0(t)}\ddot g_0(t)\bigr\}dt \\
&\qquad\qquad + \int_a^b 1(\varepsilon_0\ge t)\bigl\{e^{\tilde g(t_{\tilde\beta})}\dot{\tilde g}^2(t_{\tilde\beta}) - e^{g_0(t)}\dot g_0^2(t)\bigr\}dt\Bigr|\Bigr\} \\
&\lesssim P\bigl|\Delta\{\ddot{\tilde g}(\varepsilon_{\tilde\beta}) - \ddot g_0(\varepsilon_0)\}\bigr| + P\Bigl\{\int_a^b \bigl|e^{\tilde g(t_{\tilde\beta})}\ddot{\tilde g}(t_{\tilde\beta}) - e^{g_0(t)}\ddot g_0(t)\bigr|dt\Bigr\} + P\Bigl\{\int_a^b \bigl|e^{\tilde g(t_{\tilde\beta})}\dot{\tilde g}^2(t_{\tilde\beta}) - e^{g_0(t)}\dot g_0^2(t)\bigr|dt\Bigr\} \\
&= C_1 + C_2 + C_3.
\end{aligned}$$

By applying arguments similar to those used in verifying A5, together with Condition C.6, we can show

$$C_1 \lesssim |\beta-\beta_0| + \|\ddot g(\psi(\cdot,\beta)) - \ddot g_0(\psi(\cdot,\beta_0))\|_2 = O(n^{-\xi}) + O\{n^{-\min((p-2)\nu,\,(1-\nu)/2)}\}.$$

Similarly, we can show

$$C_2 \lesssim |\beta-\beta_0| + \|\ddot g(\psi(\cdot,\beta)) - \ddot g_0(\psi(\cdot,\beta_0))\|_2 = O(n^{-\xi}) + O\{n^{-\min((p-2)\nu,\,(1-\nu)/2)}\}$$

and

$$C_3 \lesssim |\beta-\beta_0| + \|\dot g(\psi(\cdot,\beta)) - \dot g_0(\psi(\cdot,\beta_0))\|_2 = O(n^{-\xi}) + O\{n^{-\min((p-1)\nu,\,(1-\nu)/2)}\},$$

where ξ = min(pν, (1 − ν)/2). Therefore,

$$P\bigl|\ddot l_{\beta\beta}(\tilde\theta;Z) - \ddot l_{\beta\beta}(\theta_0;Z)\bigr| = O\{n^{-\min((p-2)\nu,\,(1-\nu)/2)}\}$$

and thus

$$P\bigl|\ddot l_{\beta\beta}(\tilde\theta;Z) - \ddot l_{\beta\beta}(\theta_0;Z)\bigr|\,(\beta-\beta_0) = O\{n^{-\min((p-2)\nu,\,(1-\nu)/2)}\}\cdot O\{n^{-\min(p\nu,\,(1-\nu)/2)}\} = O\bigl\{n^{-\min(2(p-1)\nu,\;1/2+(p-5/2)\nu,\;1-\nu)}\bigr\} = o(n^{-1/2}),$$

where the last equality holds since p ≥ 3, so that 2(p−1)ν > (p−1)/(p+1) ≥ 1/2, 1/2 + (p − 5/2)ν > 1/2 and 1 − ν > 1/2. Similarly we can show

$$P\bigl|\ddot l_{\beta\zeta}(\tilde\theta;Z)[\zeta(\cdot,\beta)-\zeta_0(\cdot,\beta_0)] - \ddot l_{\beta\zeta}(\theta_0;Z)[\zeta(\cdot,\beta)-\zeta_0(\cdot,\beta_0)]\bigr| = O\bigl\{n^{-\min(2(p-1)\nu,\;1/2+(p-5/2)\nu,\;1-\nu)}\bigr\} = o(n^{-1/2}).$$

Therefore, we have

$$\bigl|P\bigl\{\dot l_\beta(\theta;Z) - \dot l_\beta(\theta_0;Z) - \ddot l_{\beta\beta}(\theta_0;Z)(\beta-\beta_0) - \ddot l_{\beta\zeta}(\theta_0;Z)[\zeta(\cdot,\beta)-\zeta_0(\cdot,\beta_0)]\bigr\}\bigr| = O\bigl\{n^{-\min(2(p-1)\nu,\;1/2+(p-5/2)\nu,\;1-\nu)}\bigr\} = O(n^{-\alpha\xi}),$$

where α = min(2(p−1)ν, 1/2 + (p − 5/2)ν, 1 − ν)/min(pν, (1 − ν)/2) > 1 and αξ > 1/2.
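To see why α > 1 here, note that ξ = min(pν, (1 − ν)/2) < 1/2, while for p ≥ 3 and 1/(2(1+p)) < ν < 1/(2p) each exponent in the numerator is at least 1/2:

$$2(p-1)\nu > \frac{p-1}{p+1} \ge \frac{1}{2}, \qquad \frac{1}{2} + \Bigl(p - \frac{5}{2}\Bigr)\nu \ge \frac{1}{2}, \qquad 1 - \nu > \frac{1}{2}.$$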

Therefore, we have verified all six assumptions and thus we have

$$\sqrt{n}(\hat\beta_n - \beta_0) = A^{-1}\sqrt{n}\,\mathbb{P}_n l^*_{\beta_0}(\beta_0,\zeta_0(\cdot,\beta_0);Z) + o_p(1) \to N\bigl(0,\, A^{-1}B(A^{-1})'\bigr),$$

where l^*_{β_0}(θ_0; Z) = l̇_β(θ_0; Z) − l̇_ζ(θ_0; Z)[h^*] is the efficient score function for β_0 and A = P{l^*_{β_0}(Y, Δ, X)}^{⊗2} = I(β_0), as shown in the verification of A3. Hence A = B and A^{−1}B(A^{−1})′ = A^{−1} = I^{−1}(β_0), and

$$\sqrt{n}\,\mathbb{P}_n l^*_{\beta_0}(\theta_0;Z) = n^{-1/2}\sum_{i=1}^n l^*_{\beta_0}(Y_i,\Delta_i,X_i).$$

Thus we complete the proof of Theorem 4.2.


Acknowledgements

The authors would like to thank two referees and an associate editor for their very helpful comments.

Footnotes

*

Supported in part by NSF Grant DMS-0706700. Nan’s research is also supported in part by NSF grant DMS-1007590 and NIH grant R01-AG036802.

SUPPLEMENTARY MATERIAL

Additional proofs. The supplementary document contains proofs of technical lemmas and Theorems 4.1 and 4.3.

Contributor Information

Ying Ding, Email: yingding@umich.edu.

Bin Nan, Email: bnan@umich.edu.

References

  • 1. Ai C, Chen X. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica. 2003;71:1795–1843.
  • 2. Buckley J, James I. Linear regression with censored data. Biometrika. 1979;66:429–436.
  • 3. Chamberlain G. Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics. 1987;34:305–334.
  • 4. Chen X. Large sample sieve estimation of semi-nonparametric models. In: Heckman JJ, Leamer EE, editors. Handbook of Econometrics, Volume 6B. Elsevier; 2007. pp. 5549–5632.
  • 5. Chen X, Linton O, Van Keilegom I. Estimation of semiparametric models when the criterion function is not smooth. Econometrica. 2003;71:1591–1608.
  • 6. Cox DR. Regression models and life-tables. Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
  • 7. Ding Y. Some New Insights about the Accelerated Failure Time Model. Ph.D. Thesis, Biostatistics, University of Michigan; 2010.
  • 8. He X, Shao Q-M. On parameters of increasing dimensions. Journal of Multivariate Analysis. 2000;73:120–135.
  • 9. He X, Xue H, Shi N-Z. Sieve maximum likelihood estimation for doubly semiparametric zero-inflated Poisson models. Journal of Multivariate Analysis. 2010;101:2026–2038. doi: 10.1016/j.jmva.2010.05.003.
  • 10. Huang J. Efficient estimation for the proportional hazards model with interval censoring. The Annals of Statistics. 1996;24:540–568.
  • 11. Huang J. Efficient estimation of the partly linear additive Cox model. The Annals of Statistics. 1999;27:1536–1563.
  • 12. Huang J, Wellner JA. Interval censored survival data: a review of recent progress. Proceedings of the First Seattle Symposium in Biostatistics: Survival Analysis, Lecture Notes in Statistics. 1997;123:123–169.
  • 13. Jin Z, Lin DY, Wei LJ, Ying Z. Rank-based inference for the accelerated failure time model. Biometrika. 2003;90:341–353.
  • 14. Jin Z, Lin DY, Ying Z. On least-squares regression with censored data. Biometrika. 2006;93:147–161.
  • 15. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. 2nd edition. Hoboken, NJ: Wiley; 2002.
  • 16. Lai TL, Ying Z. Large sample theory of a modified Buckley-James estimator for regression analysis with censored data. The Annals of Statistics. 1991;19:1370–1402.
  • 17. Miller RG, Halpern J. Regression with censored data. Biometrika. 1982;69:521–531.
  • 18. Nan B, Kalbfleisch JD, Yu M. Asymptotic theory for the semiparametric accelerated failure time model with missing data. The Annals of Statistics. 2009;37:2351–2376.
  • 19. Prentice RL. Linear rank tests with right censored data. Biometrika. 1978;65:167–179.
  • 20. Ritov Y. Estimation in a linear regression model with censored data. The Annals of Statistics. 1990;18:303–328.
  • 21. Ritov Y, Wellner JA. Censoring, martingales and the Cox model. In: Prabhu NU, editor. Statistical Inference from Stochastic Processes. Providence, RI: American Mathematical Society; 1988. pp. 191–219.
  • 22. Schumaker L. Spline Functions: Basic Theory. New York: Wiley; 1981.
  • 23. Shen X. On methods of sieves and penalization. The Annals of Statistics. 1997;25:2555–2591.
  • 24. Shen X, Wong WH. Convergence rate of sieve estimates. The Annals of Statistics. 1994;22:580–615.
  • 25. Tsiatis AA. Estimating regression parameters using linear rank tests for censored data. The Annals of Statistics. 1990;18:354–372.
  • 26. van der Vaart AW. Asymptotic Statistics. Cambridge University Press; 1998.
  • 27. van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. New York: Springer; 1996.
  • 28. Wei LJ, Ying Z, Lin DY. Linear regression analysis of censored survival data based on rank tests. Biometrika. 1990;77:845–851.
  • 29. Wellner JA, Zhang Y. Two likelihood-based semiparametric estimation methods for panel count data with covariates. The Annals of Statistics. 2007;35:2106–2142.
  • 30. Ying Z. A large sample study of rank estimation for censored regression data. The Annals of Statistics. 1993;21:76–99.
  • 31. Zeng D, Lin DY. Efficient estimation for the accelerated failure time model. Journal of the American Statistical Association. 2007;102:1387–1396.
  • 32. Zhang Y, Hua L, Huang J. A spline-based semiparametric maximum likelihood estimation method for the Cox model with interval-censored data. Scandinavian Journal of Statistics. 2010;37:338–354.
