Abstract
In many semiparametric models that are parameterized by two types of parameters – a Euclidean parameter of interest and an infinite-dimensional nuisance parameter – the two parameters are bundled together, i.e., the nuisance parameter is an unknown function that contains the parameter of interest as part of its argument. For example, in a linear regression model for censored survival data, the unspecified error distribution function involves the regression coefficients. Motivated by the need for an efficient estimating method for the regression parameters, we propose a general sieve M-theorem for bundled parameters and apply it to derive the asymptotic theory for the sieve maximum likelihood estimation in the linear regression model for censored survival data. The numerical implementation of the proposed estimating method can be achieved through conventional gradient-based search algorithms such as the Newton-Raphson algorithm. We show that the proposed estimator is consistent and asymptotically normal and achieves the semiparametric efficiency bound. Simulation studies demonstrate that the proposed method performs well in practical settings and yields estimates that are more efficient than those from existing estimating-equation-based methods. An illustration with a real data example is also provided.
Keywords and phrases: Accelerated failure time model, B-spline, bundled parameters, efficient score function, semiparametric efficiency, sieve maximum likelihood estimation
1. Introduction
In a semiparametric model that is parameterized by two types of parameters – a finite-dimensional Euclidean parameter and an infinite-dimensional parameter – the infinite-dimensional parameter is often considered a nuisance parameter, and the two parameters are separated. In many interesting statistical models, however, the parameter of interest and the nuisance parameter are bundled together, a term used by [12] in their review of linear models under interval censoring, meaning that the infinite-dimensional parameter is an unknown function whose argument involves the parameter of interest. For example, in a linear regression model for censored survival data, the unspecified error distribution function, often treated as a nuisance parameter, is a function of the regression coefficients. Other examples include the single index model and the Cox regression model with an unspecified link function.
There is a rich literature on asymptotic distributional theory for M-estimation in a variety of semiparametric models with well-separated parameters; see e.g. [9, 10, 11, 23, 29, 32], among many others. Though many M-estimation methodologies for bundled parameters have been proposed in the literature, general asymptotic distributional theories for such problems are still lacking. The only estimation theories for bundled parameters we are aware of are the sieve generalized method of moments of [1] and the estimating equation approach of [5, 18].
In this article, we consider an extension of existing asymptotic distributional theories to accommodate situations where the estimation criteria are parameterized with bundled parameters. The proposed theory has a similar flavor to Theorem 2 of [5], but the two are different: the latter requires an existing uniformly consistent estimator of the infinite-dimensional nuisance parameter with a convergence rate faster than n−1/4, which is then treated as a fixed function of the parameter of interest in their estimating procedure, whereas we estimate both parameters simultaneously through a sieve parameter space. Furthermore, their nuisance parameter estimator needs to satisfy their condition (2.6), which is usually hard to verify when its convergence rate is slower than n−1/2. Our proposed theory is general enough to cover a wide range of problems for bundled parameters, including the aforementioned single index model, the Cox model with an unknown link function, and the linear model under different censoring mechanisms. Rigorous proofs for each of these models, however, would take lengthy derivations. We only use the efficient estimation in the semiparametric linear regression model with right censored data as an illustrative example that motivates such a theoretical development, and will present results for other models elsewhere. Note that the example considered here cannot be directly put into the framework of restricted moments due to right censoring, and thus cannot be handled by the method of [1].
Suppose that the failure time transformed by a known monotone transformation is linearly related to a set of covariates, where the failure time is subject to right censoring. Let Ti denote the transformed failure time and Ci denote the transformed censoring time by the same transformation for subject i, i = 1, ⋯, n. Let Yi = min(Ti, Ci) and Δi = I(Ti ≤ Ci). Then the semiparametric linear model we consider here can be written as
(1.1) Ti = Xi′β0 + e0,i, i = 1, ⋯, n,
where the errors e0,i are independent and identically distributed (i.i.d.) with an unspecified distribution. When the failure time is log-transformed, this model corresponds to the well-known accelerated failure time model [15]. Here we assume that (Xi, Ci), i = 1, …, n, are i.i.d. and independent of e0,i. This is a common assumption for linear models with censored survival data, and it is particularly needed in [21] to derive the efficient score function for β0. Such an assumption, however, is stronger than necessary in the usual linear regression without censoring, for which the error is only required to be uncorrelated with the covariates (see e.g. [3]). We also avoid transformations that can produce values such as log(0) = −∞, so that the Yi’s are always bounded from below.
The semiparametric linear regression model relates the failure time to the covariates directly. It provides a straightforward interpretation of the data and serves as an attractive alternative to the Cox model [6] in many applications. Several estimators of the regression parameters have been proposed since the late 1970s, including the rank-based estimators (see e.g. [19], [28], [25], [30], [13], [14]) and the Buckley-James estimator (see e.g. [2], [20], [16]). There are two major challenges in the estimation for such a linear model: (1) the estimating functions in the aforementioned methods are discrete, leading to potential multiple solutions as well as numerical difficulties; (2) none of the aforementioned methods is efficient. Recently, [31] developed a kernel-smoothed profile likelihood estimating procedure for the accelerated failure time model. In this article, we consider a sieve maximum likelihood approach for model (1.1) with censored data. The proposed approach is intuitive, easy to implement numerically, and asymptotically efficient.
It is easy to see that T and C are independent conditional on X under the assumption e0 ⊥ (C, X). Hence the joint density function of Z = (Y, Δ, X) can be written as
(1.2) [λ0(y − x′β0)]^δ exp{−Λ0(y − x′β0)} H(y, δ, x),
where Λ0(·) is the true cumulative hazard function of the error term e0 and λ0(·) is its derivative. The factor H(y, δ, x) depends only on the conditional distribution of C given X and the marginal distribution of X, and is free of β0 and λ0. To simplify the notation, we will omit the factor H from the likelihood. Then for i.i.d. observations (Yi, Δi, Xi), i = 1, ⋯, n, from equation (1.2) we obtain the log likelihood function for β and λ as
(1.3) ln(β, λ) = ∑_{i=1}^n {Δi log λ(Yi − Xi′β) − Λ(Yi − Xi′β)}.
The log likelihood in (1.3) is that of a semiparametric model in which the argument of the nuisance parameter λ involves β; thus β and λ are bundled parameters. To preserve the positivity of λ, let g(·) = log λ(·). Then the log likelihood function for β and g, using the counting process notation, can be written as
(1.4) ln(β, g) = ∑_{i=1}^n { ∫ g(t − Xi′β) dNi(t) − ∫ I(Yi ≥ t) e^{g(t − Xi′β)} dt },
where Ni(t) = ΔiI(Yi ≤ t) is the counting process for subject i.
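To make (1.3) concrete, the following minimal sketch (our own illustration, not code from the paper) evaluates the log likelihood on simulated data; it uses the standard extreme-value error distribution, for which λ(t) = e^t, Λ(t) = e^t, and hence g(t) = t:

```python
import numpy as np

def loglik(beta, Y, Delta, X, log_hazard, cum_hazard):
    # (1.3): sum over subjects of Delta_i * log lambda(e_i) - Lambda(e_i),
    # where e_i = Y_i - X_i' beta is the observed residual.
    e = Y - X @ beta
    return np.sum(Delta * log_hazard(e) - cum_hazard(e))

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
beta0 = np.array([1.0, -0.5])
# Standard extreme-value errors: if U ~ Exp(1), log U has survival exp(-e^t).
T = X @ beta0 + np.log(rng.exponential(size=n))
C = X @ beta0 + rng.normal(1.0, 1.0, size=n)   # arbitrary censoring times
Y, Delta = np.minimum(T, C), (T <= C).astype(float)

print(loglik(beta0, Y, Delta, X, log_hazard=lambda t: t, cum_hazard=np.exp))
```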
We propose a new approach that directly maximizes the log likelihood over a sieve space in which the function g(·) is approximated by B-splines. Numerically, the estimator is easily obtained by the Newton-Raphson algorithm or any other gradient-based search algorithm. We show that the proposed estimator is consistent and asymptotically normal, and that the limiting covariance matrix attains the semiparametric efficiency bound. The covariance matrix can be estimated either by inverting the information matrix based on the efficient score function for the regression parameters derived by [21], or by inverting the observed information matrix of all parameters, which accounts for the fact that the nuisance parameter in the sieve space for the log hazard function is estimated as well.
2. The sieve M-theorem on the asymptotic normality of semiparametric estimation for bundled parameters
In this section, we extend the general theorem of [29], which deals with the asymptotic normality of semiparametric M-estimators of regression parameters when the convergence rate of the nuisance parameter estimator can be slower than n−1/2. In their theorem, the parameters of interest and the nuisance parameters are assumed to be separated. We consider a more general setting where the nuisance parameter can be a function of the parameters of interest. The theorem is crucial in the proof of the asymptotic normality given in Theorem 4.2 for our proposed estimators.
Some empirical process notation will be used from now on. We denote Pf = ∫ f(z) dP(z) and ℙnf = n−1 ∑_{i=1}^n f(Zi), where P is a probability measure and ℙn is the empirical probability measure, and denote 𝔾nf = n1/2(ℙn − P)f. Given i.i.d. observations Z1, Z2, ⋯, Zn ∈ 𝒵, we estimate the unknown parameters (β, ζ(·, β)) by maximizing an objective function ℙn m(β, ζ(·, β); Z), where β is the parameter of interest and ζ(·, β) is the nuisance parameter that can be a function of β. Here “ · ” denotes the other arguments of ζ besides β, which can be some components of Z ∈ 𝒵. If the objective function m is the log-likelihood function of a single observation, then the estimator becomes the semiparametric maximum likelihood estimator. We adopt notation similar to that in [29].
Let θ = (β, ζ(·, β)), β ∈ ℬ ⊂ ℝd and ζ ∈ ℋ, where ℬ is the parameter space of β and ℋ is a class of functions mapping from 𝒵 × ℬ to ℝ. Let Θ = ℬ × ℋ be the parameter space of θ. Define a distance between θ1, θ2 ∈ Θ by
d(θ1, θ2) = {|β1 − β2|^2 + ‖ζ1(·, β1) − ζ2(·, β2)‖^2}^{1/2},
where | · | is the Euclidean distance and ‖ · ‖ is some norm. Let Θn be the sieve parameter space, a sequence of increasing subsets of the parameter space Θ growing dense in Θ as n → ∞. We aim to find θ̂n ∈ Θn such that d(θ̂n, θ0) = op(1) and β̂n is asymptotically normal.
For any fixed ζ(·, β) ∈ ℋ, let {ζη(·, β) : η in a neighborhood of 0 ∈ ℝ} be a smooth curve in ℋ running through ζ(·, β) at η = 0, i.e., ζη(·, β)|η=0 = ζ(·, β). Assume all ζ(·, β) ∈ ℋ are at least twice differentiable with respect to β, and denote
Assume the objective function m is twice Fréchet differentiable. For small δ we have ζ(·, β + δ) − ζ(·, β) = ζ̇β(·, β)δ + o(δ), where ζ̇β(·, β) = ∂ζ(·, β)/∂β; then by the definition of functional derivatives it follows that
where the subscript 2 indicates that the derivatives are taken with respect to the second argument of the function. The last equality holds because
Similarly we have
and
Thus, according to the chain rule for functional derivatives, we have
ṁβ(β, ζ(·, β); z) = ṁ1(β, ζ(·, β); z) + ṁ2(β, ζ(·, β); z)[ζ̇β(·, β)].
As noted before, the subscript 1 or 2 in the derivatives indicates that the derivatives are taken with respect to the first or the second argument of the function, and h inside the square brackets is a function denoting the direction of the functional derivative with respect to ζ. Note that for the second derivatives m̈βζ and m̈ζβ, we implicitly require the direction h to be differentiable with respect to β. It is easily seen that when ζ is free of β, all the above derivatives reduce to those in [29]. Following [29], we also define
and
Furthermore, for h = (h1, h2, ⋯, hd)′ ∈ ℍd, we denote
and define correspondingly
To obtain the asymptotic normality of the sieve M-estimator β̂n, we make assumptions similar to those in [29]; the key difference from [29] is that all derivatives with respect to β involve the chain rule and hence are more complicated. Additionally, we focus on estimators in the sieve parameter space. We list the following assumptions:
- A1 (Rate of convergence) For an estimator θ̂n = (β̂n, ζ̂n(·, β̂n)) ∈ Θn and the true parameter θ0 = (β0, ζ0(·, β0)) ∈ Θ, d(θ̂n, θ0) = Op(n−ξ) for some ξ > 0.
- A2 Ṡβ(β0, ζ0(·, β0)) = 0 and Ṡζ(β0, ζ0(·, β0))[h] = 0 for all h ∈ ℍ.
- A3 (Positive information) There exists an h* = (h*1, ⋯, h*d)′ ∈ ℍd, where h*j ∈ ℍ for j = 1, ⋯, d, such that
for all h ∈ ℍ. Furthermore, the matrix
is nonsingular.
- A4 The estimator (β̂n, ζ̂n(·, β̂n)) satisfies
- A5 (Stochastic equicontinuity) For some C > 0,
and
- A6 (Smoothness of the model) For some α > 1 satisfying αξ > 1/2, and for θ in a neighborhood of θ0 : {θ : d(θ, θ0) ≤ Cn−ξ, θ ∈ Θn},
and
Note that ξ in A1 depends on the entropy of the sieve parameter space for ζ and cannot be arbitrarily small – it is controlled by the smoothness of the model in A6. The convergence rate in A1 needs to be established before asymptotic normality can be obtained. A2 is a common assumption for maximum likelihood estimation and usually holds. The direction h* in A3 may be found through the equation in A3; it is the least favorable direction when m is the likelihood function. A4 and A5 are usually verified either by the Donsker property or by the maximal inequality of [27]. A6 can be obtained by a Taylor expansion. The following theorem is an extension of Theorem 6.1 in [29] to the case where the infinite-dimensional parameter ζ is a function of the finite-dimensional parameter β.
Theorem 2.1. Suppose that assumptions A1–A6 hold. Then
n1/2(β̂n − β0) = A−1𝔾n{ṁβ(θ0; Z) − ṁζ(θ0; Z)[h*]} + op(1) → N(0, A−1B(A−1)′)
in distribution, where
B = P[{ṁβ(θ0; Z) − ṁζ(θ0; Z)[h*]}⊗2]
and A is given in assumption A3. Here a⊗2 = aa′.
Proof. The proof closely follows that of Theorem 6.1 in [29]. Assumptions A1 and A5 yield
Since Ṡβ,n(β̂n, ζ̂n(·, β̂n)) = op(n−1/2) by A4 and Ṡβ(β0, ζ0(·, β0)) = 0 by A2, we have
Similarly,
Combining these equalities and assumption A6 yields
(2.1)
and
(2.2)
Since α > 1 with αξ > 1/2, the rate of convergence in assumption A1 implies that the n−αξ remainder terms are op(n−1/2); then taking the difference of (2.1) and (2.2), together with A3, yields
that is,
This yields
3. Back to the linear model: the sieve maximum likelihood estimation
Taking the logarithm of the positive function λ(·) in (1.3) yields the function g(·) in (1.4), which is no longer restricted to be positive; this eases the estimation. We now describe the spline-based sieve maximum likelihood estimation for model (1.1). Under the regularity conditions C.1–C.3 stated in Section 4, the observed residual times are confined to a finite interval. Let [a, b] be the interval of interest, where −∞ < a < b < ∞. Let TKn = {t1, ⋯, tKn} be a set of partition points of [a, b] with Kn = O(nν) and max1≤j≤Kn+1 |tj − tj−1| = O(n−ν) for some constant ν ∈ (0, 1/2). Let 𝒮n(TKn, Kn, p) be the space of polynomial splines of order p ≥ 1 defined in Definition 4.1 of [22]. According to Corollary 4.10 of [22], there exists a set of B-spline basis functions {Bj, 1 ≤ j ≤ qn} with qn = Kn + p such that for any s ∈ 𝒮n(TKn, Kn, p), we can write
(3.1) s(t) = ∑_{j=1}^{qn} γj Bj(t),
where, following [24], we require maxj=1,…,qn |γj| ≤ cn, with cn allowed to grow with n slowly enough.
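Such a basis is straightforward to construct with standard software. The sketch below (our construction, assuming equally spaced interior knots) builds the qn = Kn + p B-spline basis functions on [a, b]; with cubic splines (order 4) and one interior knot, qn = 5:

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(t, a, b, n_interior, order=4):
    # Clamped knot sequence: boundary knots repeated 'order' times, plus
    # K_n = n_interior equally spaced interior knots; q_n = K_n + order.
    interior = np.linspace(a, b, n_interior + 2)[1:-1]
    knots = np.r_[[a] * order, interior, [b] * order]
    qn = len(knots) - order
    eye = np.eye(qn)
    # Column j evaluates B_j at the points t (t should lie in [a, b]).
    return np.column_stack(
        [BSpline(knots, eye[j], order - 1)(t) for j in range(qn)])

B = bspline_basis(np.linspace(-2.0, 2.0, 7), a=-2.0, b=2.0, n_interior=1)
print(B.shape)                 # (7, 5): q_n = K_n + p = 1 + 4
gamma = np.zeros(B.shape[1])
g = B @ gamma                  # g(t) = sum_j gamma_j B_j(t), as in (3.1)
```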
Let γ = (γ1, …, γqn)′. Under suitable smoothness assumptions, g0(·) = log λ0(·) can be well approximated by some function in 𝒮n(TKn, Kn, p). Therefore, we seek a member of 𝒮n(TKn, Kn, p) together with a value of β ∈ ℬ that maximizes the log likelihood function. Specifically, let θ̂n = (β̂n, γ̂n) be the value that maximizes
(3.2) ln(β, γ) = ∑_{i=1}^n { ∫ [∑_{j=1}^{qn} γj Bj(t − Xi′β)] dNi(t) − ∫ I(Yi ≥ t) exp[∑_{j=1}^{qn} γj Bj(t − Xi′β)] dt }.
Taking the first-order derivatives of ln(β, γ) with respect to β and γ and setting them to zero, we obtain the score equations. Since the integrals here are univariate, their numerical implementation can be easily done by one-dimensional Gaussian quadrature. The Newton-Raphson algorithm or any other gradient-based search algorithm can be applied to solve the score equations for all parameters θ = (β, γ), e.g.,
θ(m+1) = θ(m) − [l̈n(θ(m))]−1 l̇n(θ(m)),
where θ(m) = (β(m), γ(m)) is the parameter estimate from the mth iteration, and
l̇n(θ) = ∂ln(θ)/∂θ and l̈n(θ) = ∂2ln(θ)/∂θ∂θ′
are the score function and Hessian matrix of the parameter θ. For any fixed β and n, ln(β, γ) in (3.2) is clearly concave with respect to γ and goes to −∞ if any γj approaches either ∞ or −∞; hence γ̂n must be bounded, which yields an estimator of s in 𝒮n(TKn, Kn, p).
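As an end-to-end illustration, the following sketch maximizes (3.2) with a gradient-based search (BFGS with numerical derivatives standing in for Newton-Raphson); the simulated data, the interval [a, b], and the quadrature size are our own illustrative choices, not the paper's settings:

```python
import numpy as np
from numpy.polynomial.legendre import leggauss
from scipy.interpolate import BSpline
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, a, b = 200, -5.0, 3.0
X = rng.normal(size=(n, 2))
beta0 = np.array([1.0, -0.5])
T = X @ beta0 + np.log(rng.exponential(size=n))     # extreme-value errors
C = X @ beta0 + rng.normal(1.0, 1.0, size=n)
Y, Delta = np.minimum(T, C), (T <= C).astype(float)

order = 4                                           # cubic B-splines
knots = np.r_[[a] * order, np.linspace(a, b, 3)[1:-1], [b] * order]
qn = len(knots) - order                             # q_n = K_n + p = 5
def basis(t):
    t = np.clip(t, a, b)                            # keep evaluations in [a, b]
    return np.column_stack(
        [BSpline(knots, np.eye(qn)[j], order - 1)(t) for j in range(qn)])

nodes, weights = leggauss(30)                       # one-dimensional quadrature

def neg_loglik(theta):
    beta, gamma = theta[:2], theta[2:]
    e = Y - X @ beta                                # observed residuals
    half = (np.maximum(e, a) - a) / 2.0
    s = a + np.outer(half, nodes + 1.0)             # Gauss points in [a, e_i]
    Lam = half * (np.exp(basis(s.ravel()) @ gamma).reshape(s.shape) @ weights)
    return -np.sum(Delta * (basis(e) @ gamma) - Lam)   # minus (3.2)

fit = minimize(neg_loglik, np.zeros(2 + qn), method="BFGS")
beta_hat, gamma_hat = fit.x[:2], fit.x[2:]
```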
As stated in the next section, the distribution of β̂n can be approximated by a normal distribution. One way to estimate the variance matrix of β̂n is to invert the information matrix based on the efficient score function for β0, with the estimated parameters (β̂n, λ̂n(·)) plugged in; the consistency of this variance estimator is given in Theorem 4.3. Another way is to invert the observed information matrix of all parameters from the last Newton-Raphson iteration, which accounts for the fact that the nuisance parameter γ is estimated as well. The consistency of the latter approach may be proved similarly to Example 4 in [23] or via Theorem 2.2 in [8]; we leave the detailed derivation to interested readers. Simulations indicate that both estimators work reasonably well.
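Continuing the sketch above, the second variance estimator can be mimicked by numerically differentiating the negative log likelihood at the maximizer; inverting the resulting observed information of all parameters and taking the β block accounts for the estimated spline coefficients. This is only an illustration of the idea, not the paper's exact implementation:

```python
import numpy as np

def num_hessian(f, x, h=1e-5):
    # Central-difference Hessian of a scalar function f at the point x.
    d = len(x)
    H = np.empty((d, d))
    I = np.eye(d) * h
    for i in range(d):
        for j in range(d):
            H[i, j] = (f(x + I[i] + I[j]) - f(x + I[i] - I[j])
                       - f(x - I[i] + I[j]) + f(x - I[i] - I[j])) / (4.0 * h * h)
    return H

# Observed information of all parameters = Hessian of neg_loglik at fit.x;
# the top-left 2x2 block of its inverse estimates Var(beta_hat).
H = num_hessian(neg_loglik, fit.x)
se_beta = np.sqrt(np.diag(np.linalg.inv(H))[:2])
```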
4. Asymptotic results
Denote εβ = Y − X′β and ε0 = Y − X′β0. We assume the following regularity conditions:
- (C.1) The true parameter β0 belongs to the interior of a compact set ℬ ⊆ ℝd.
- (C.2) (a) The covariate X takes values in a bounded subset 𝒳 ⊆ ℝd; (b) E(XX′) is nonsingular.
- (C.3) There is a truncation time τ < ∞ such that, for some constant δ, P(ε0 > τ | X) ≥ δ > 0 almost surely with respect to the probability measure of X. This implies that Λ0(τ) ≤ −log δ < ∞.
- (C.4) The error e0’s density f and its derivative ḟ are bounded and
- (C.5) The conditional density gC|X of C given X and its derivative ġC|X are uniformly bounded for all possible values of X. That is,
for all t ≤ τ with some constants K1, K2 > 0, where τ is the truncation time defined in Condition C.3.
- (C.6) Let 𝒢p denote the collection of bounded functions g on [a, b] with bounded derivatives g(j), j = 1, …, k, whose kth derivative g(k) satisfies the following Lipschitz continuity condition:
where k is a positive integer and m ∈ (0, 1] such that p = k + m ≥ 3, and L < ∞ is an unknown constant. The true log hazard function g0(·) = log λ0(·) belongs to 𝒢p, where [a, b] is a bounded interval.
- (C.7) For some η ∈ (0, 1), u′Var(X | ε0)u ≥ ηu′E(XX′ | ε0)u almost surely for all u ∈ ℝd.
Condition C.1 is a common regularity assumption in the literature; see e.g. [16]. Conditions C.2(a) and C.3–C.4 were also assumed in [25]. Condition C.5 implies Condition B in [25]. In Condition C.6, we require p ≥ 3 to provide desirable control of the spline approximation error rates of the first and second derivatives of g0 (see Corollary 6.21 of [22]), which is needed in verifying Assumptions A4–A6. Condition C.7 was also proposed for the panel count data model in [29]. As noted in their Remark 3.4, Condition C.7 can be justified in many applications when Condition C.2(b) is satisfied. The bounded interval [a, b] in C.6 may be chosen as a = infy,x(y − x′β0) > −∞ and b = τ < ∞ under C.1–C.3, which is what we use in the following.
Now define the collection of functions ℋp as follows:
where
and 𝒢p is defined in C.6. Here ζ is a composite function of g composed with ψ. Note that ζ(t, x, β0) = g(t). Then for ζ(·, β) ∈ ℋp we define the following norm
(4.1)
We also have the following collection of scores
in which h(t, x, β) = w(ψ(t, x, β)) = w(t − x′(β − β0)).
For any θ1 = (β1, ζ1(·, β1)) and θ2 = (β2, ζ2(·, β2)) in the space of Θp = ℬ × ℋp, define the following distance
(4.2) d(θ1, θ2) = {|β1 − β2|^2 + ‖ζ1(·, β1) − ζ2(·, β2)‖_2^2}^{1/2}.
Let . Denote
and . Clearly for all n ≥ 1. The sieve estimator θ̂n = (β̂n, ζ̂n(·, β̂n)), where ζ̂n(t, x, β̂n) = ĝn(t − x′(β̂n − β0)), is the maximizer of the empirical log-likelihood n−1ln(θ; Z) over the sieve space . The following theorem gives the convergence rate of the proposed estimator θ̂n to the true parameter θ0 = (β0, ζ0(·, β0)) = (β0, g0).
Theorem 4.1. Let Kn = O(nν) with ν ∈ (0, 1/2), and let p be the smoothness parameter defined in Condition C.6. Suppose Conditions C.1–C.7 hold and the failure time T follows model (1.1). Then
d(θ̂n, θ0) = Op{n−min(pν, (1−ν)/2)},
where d(·, ·) is defined in (4.2).
Remark. It is worth pointing out that the sieve space does not have to be restricted to the B-spline space – it can be any sieve space as long as the estimator satisfies the conditions of Theorem 1 in [24]. We refer to [4] for a comprehensive discussion of sieve estimation for semiparametric models in general sieve spaces. Our choice of the B-spline space is primarily motivated by its simplicity of numerical implementation, which is a tremendous advantage of the proposed approach over existing numerical methods for the accelerated failure time model, in particular the linear programming approach.
We provide a proof of Theorem 4.1 in the online Supplementary Material by checking the conditions of Theorem 1 in [24]. Theorem 4.1 implies that if ν = 1/(1 + 2p), then d(θ̂n, θ0) = Op(n−p/(1+2p)), which is the optimal convergence rate in the nonparametric regression setting. Although the overall convergence rate is slower than n−1/2, the next theorem states that the proposed estimator of the regression parameter is still asymptotically normal and semiparametrically efficient.
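The optimal choice of ν comes from balancing the two exponents in the rate of Theorem 4.1; the following display records the arithmetic:

```latex
% Balancing the two exponents in n^{-\min(p\nu,\,(1-\nu)/2)}:
p\nu = \frac{1-\nu}{2}
\;\Longleftrightarrow\;
\nu = \frac{1}{1+2p},
\qquad
\min\!\left(p\nu,\ \frac{1-\nu}{2}\right)\Bigg|_{\nu = 1/(1+2p)} = \frac{p}{1+2p}.
```

For example, p = 3 gives the rate n−3/7.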
Theorem 4.2. Given the following efficient score function for the censored linear model derived by [21]:
l*(β0; Z) = ∫ ġ0(t){P(X | ε0 ≥ t) − X} dM(t),
where ġ0 = λ̇0/λ0 and
M(t) = ΔI(ε0 ≤ t) − Λ0(ε0 ∧ t)
is the failure counting process martingale (see also [20]). Suppose that the conditions in Theorem 4.1 hold and I(β0) = P[{l*(β0; Z)}⊗2] is nonsingular. Then
n1/2(β̂n − β0) → N(0, I−1(β0))
in distribution.
The proof of Theorem 4.2 is where we apply the general sieve M-theorem proposed in Section 2: we prove it by checking the assumptions A1–A6, with details provided in Section 7. The following theorem gives the consistency of the variance estimator based on the above efficient score.
Theorem 4.3. Suppose the conditions in Theorem 4.2 hold. Let Î(β̂n) denote the plug-in estimator of I(β0) obtained by replacing β0, g0, Λ0, and P(X | ε0 ≥ t) in l*(β0; Z) with β̂n, ĝn, Λ̂n, and X̄(t, β̂n), respectively, and averaging over the empirical measure. Then Î(β̂n) → I(β0) in probability.
It is clearly seen that X̄ (t, β̂n) in Theorem 4.3 estimates P(X|Y − X′β0 ≥ t) in Theorem 4.2. The proof of Theorem 4.3 is provided in the Supplementary Material.
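In practice, X̄(t, β̂n) is simply the at-risk covariate average on the residual time scale; a direct empirical version (our sketch, with hypothetical array arguments) is:

```python
import numpy as np

def xbar(t, beta, Y, X):
    # Empirical version of P(X | Y - X'beta >= t): the average of the
    # covariate vectors over subjects still at risk at residual time t.
    at_risk = (Y - X @ beta) >= t
    if not at_risk.any():
        return np.zeros(X.shape[1])
    return X[at_risk].mean(axis=0)
```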
5. Numerical examples
5.1. Simulations
Extensive simulations are carried out to evaluate the finite sample performance of the proposed method. In the simulation studies, failure times are generated from the model
log Ti = 2 + Xi1 + Xi2 + εi,
where X1 is Bernoulli with success probability 0.5 and X2 is an independent normal variable with mean 0 and standard deviation 0.5, truncated at ±2. This is the same model used by [14] and [31]. We consider six error distributions: standard normal; standard extreme-value; mixtures of N(0, 1) and N(0, 3²) with mixing probabilities (0.5, 0.5) and (0.95, 0.05), denoted by 0.5N(0, 1) + 0.5N(0, 3²) and 0.95N(0, 1) + 0.05N(0, 3²), respectively; Gumbel(−0.5μ, 0.5) with μ being the Euler constant; and 0.5N(0, 1) + 0.5N(−1, 0.5²). The first four distributions were also considered by [31]. Similarly to [31], the censoring times are generated from a Uniform[0, c] distribution, where c is chosen to produce a 25% censoring rate. We set the sample size n to 200, 400 and 600.
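For reference, one replicate of this design can be generated as follows (our reading of the setup; the intercept and unit slopes come from the model display above, and c is calibrated by Monte Carlo grid search):

```python
import numpy as np

rng = np.random.default_rng(2024)

def truncnorm(size, sd=0.5, bound=2.0):
    # N(0, sd^2) truncated at +/- bound, by rejection sampling.
    out = rng.normal(0.0, sd, size)
    bad = np.abs(out) > bound
    while bad.any():
        out[bad] = rng.normal(0.0, sd, bad.sum())
        bad = np.abs(out) > bound
    return out

def one_dataset(n, c, err=rng.standard_normal):
    # Swap `err` for the other five error distributions described in the text.
    X1 = rng.binomial(1, 0.5, n).astype(float)
    X2 = truncnorm(n)
    logT = 2.0 + X1 + X2 + err(n)             # transformed failure times
    logC = np.log(rng.uniform(0.0, c, n))     # censoring times C ~ Uniform[0, c]
    Y, Delta = np.minimum(logT, logC), (logT <= logC).astype(float)
    return np.column_stack([X1, X2]), Y, Delta

# Calibrate c to roughly a 25% censoring rate by a coarse Monte Carlo grid.
grid = np.linspace(10.0, 500.0, 50)
rates = [1.0 - one_dataset(100_000, c)[2].mean() for c in grid]
c_star = grid[int(np.argmin(np.abs(np.array(rates) - 0.25)))]
X, Y, Delta = one_dataset(400, c_star)
print(c_star, 1.0 - Delta.mean())
```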
We choose cubic B-splines with one interior knot for n = 200 and 400, and two interior knots for n = 600. We perform the sieve maximum likelihood analysis and obtain the estimates of the slope parameters using the Newton-Raphson algorithm, which updates (β, γ) iteratively. We stop iterating when the change in the parameter estimates or the gradient value is less than a pre-specified tolerance, set to 10−5 in our simulations. Log-rank and Gehan-weighted estimators are included for efficiency comparisons. We calculate the theoretical semiparametric efficiency bound I−1(β0) and scale it by the sample size to obtain σ*, which serves as the reference standard error under the fully efficient situation. Table 1 summarizes the results of these studies based on 1000 simulated datasets. The biases of the proposed estimators of β1 and β2 are negligible. Both variance estimation procedures, denoted 1SEE (standard error estimates from inverting the information matrix based on the efficient score function) and 2SEE (standard error estimates from inverting the observed information matrix of all parameters, including the nuisance parameters), yield accurate standard error estimates compared with the empirical standard error SE, and the 95% confidence intervals have proper coverage probabilities, especially when the sample size is large. For the N(0, 1) error and the two mixtures of normal errors that are also considered in [31], the proposed estimators are more efficient than the log-rank estimators and have variances similar to the Gehan-weighted estimators. For the standard extreme-value error, the proposed estimators are more efficient than the Gehan-weighted estimator and similar to the log-rank estimator, which is known to be the most efficient under this particular error distribution. For the Gumbel(−0.5μ, 0.5) and 0.5N(0, 1) + 0.5N(−1, 0.5²) errors, the proposed estimators are more efficient than the other two estimators. Under all six error distributions, the standard errors of the proposed estimators are close to the efficient theoretical standard errors. The sample averages of the estimates for λ0 under the different simulation settings are reasonably close to the corresponding true curves (results not shown here; see [7] for details).
Table 1. Simulation results based on 1000 replications. Panels (a)–(f) correspond to the six error distributions, in the order listed in the text. Columns 4–7 give the B-spline MLE (bias, empirical SE, and the two standard error estimates with coverage probabilities in parentheses); columns 8–9 the log-rank estimator; columns 10–11 the Gehan-weighted estimator; σ* is the theoretical standard error from the semiparametric efficiency bound.

| Err. dist | n | β | Bias | SE | 1SEE (CP) | 2SEE (CP) | Bias | SE | Bias | SE | σ* |
|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) | 200 | β1 | .003 | .168 | .149 (.912) | .155 (.924) | .000 | .170 | .002 | .159 | .155 |
| | | β2 | .003 | .167 | .153 (.928) | .156 (.928) | .004 | .171 | .002 | .160 | .156 |
| | 400 | β1 | .006 | .110 | .108 (.948) | .110 (.950) | .005 | .115 | .008 | .108 | .110 |
| | | β2 | .001 | .110 | .109 (.944) | .110 (.945) | .002 | .116 | .001 | .109 | .110 |
| | 600 | β1 | .001 | .092 | .088 (.939) | .090 (.943) | .001 | .096 | .002 | .093 | .090 |
| | | β2 | .005 | .091 | .089 (.945) | .090 (.944) | .005 | .097 | .003 | .092 | .090 |
| (b) | 200 | β1 | −.009 | .180 | .154 (.894) | .161 (.903) | −.008 | .168 | −.007 | .190 | .165 |
| | | β2 | .004 | .182 | .162 (.903) | .163 (.915) | .005 | .170 | .005 | .195 | .169 |
| | 400 | β1 | .000 | .126 | .113 (.914) | .115 (.923) | −.001 | .124 | .000 | .143 | .117 |
| | | β2 | .008 | .118 | .116 (.934) | .116 (.938) | .010 | .116 | .012 | .135 | .120 |
| | 600 | β1 | .001 | .102 | .093 (.919) | .094 (.923) | .001 | .100 | .000 | .114 | .095 |
| | | β2 | .011 | .098 | .095 (.944) | .095 (.945) | .011 | .097 | .007 | .114 | .098 |
| (c) | 200 | β1 | .014 | .300 | .281 (.930) | .279 (.924) | −.020 | .315 | −.019 | .292 | .259 |
| | | β2 | .000 | .306 | .285 (.916) | .282 (.918) | .002 | .317 | .002 | .288 | .260 |
| | 400 | β1 | .034 | .199 | .206 (.955) | .200 (.949) | .002 | .218 | .002 | .197 | .183 |
| | | β2 | −.003 | .207 | .208 (.949) | .202 (.942) | −.001 | .222 | −.002 | .200 | .184 |
| | 600 | β1 | .035 | .168 | .171 (.957) | .165 (.949) | .003 | .185 | .001 | .163 | .150 |
| | | β2 | −.007 | .169 | .172 (.956) | .166 (.956) | −.004 | .190 | −.002 | .168 | .150 |
| (d) | 200 | β1 | −.013 | .172 | .157 (.926) | .164 (.927) | −.010 | .181 | −.007 | .166 | .167 |
| | | β2 | −.004 | .180 | .160 (.908) | .164 (.913) | −.005 | .184 | −.005 | .173 | .166 |
| | 400 | β1 | .003 | .119 | .113 (.944) | .116 (.948) | .004 | .126 | .006 | .117 | .118 |
| | | β2 | .003 | .117 | .114 (.942) | .116 (.953) | .004 | .126 | .003 | .115 | .118 |
| | 600 | β1 | −.003 | .097 | .093 (.948) | .095 (.952) | −.002 | .105 | .002 | .097 | .096 |
| | | β2 | .001 | .096 | .094 (.942) | .095 (.944) | .002 | .105 | .003 | .094 | .096 |
| (e) | 200 | β1 | .004 | .080 | .077 (.944) | .078 (.946) | −.001 | .111 | .004 | .088 | .079 |
| | | β2 | −.001 | .083 | .080 (.929) | .078 (.934) | .000 | .114 | .000 | .091 | .080 |
| | 400 | β1 | −.005 | .055 | .055 (.946) | .055 (.951) | −.003 | .079 | −.004 | .061 | .056 |
| | | β2 | .003 | .055 | .056 (.954) | .056 (.950) | .003 | .081 | .003 | .063 | .056 |
| | 600 | β1 | −.003 | .047 | .045 (.940) | .045 (.938) | .000 | .067 | −.001 | .052 | .045 |
| | | β2 | −.001 | .047 | .046 (.944) | .045 (.943) | −.002 | .066 | −.001 | .051 | .046 |
| (f) | 200 | β1 | −.002 | .126 | .117 (.918) | .120 (.929) | −.002 | .159 | −.001 | .128 | .119 |
| | | β2 | .000 | .133 | .120 (.917) | .121 (.926) | .002 | .164 | .001 | .134 | .116 |
| | 400 | β1 | −.002 | .087 | .084 (.949) | .085 (.950) | .003 | .114 | .000 | .091 | .084 |
| | | β2 | .004 | .086 | .086 (.951) | .086 (.953) | .003 | .111 | .004 | .090 | .082 |
| | 600 | β1 | .003 | .074 | .070 (.929) | .070 (.931) | .005 | .101 | .001 | .074 | .069 |
| | | β2 | .003 | .074 | .070 (.936) | .070 (.936) | .009 | .104 | .004 | .075 | .067 |
5.2. A real data example
We use the Stanford heart transplant data [17] as an illustrative example. This dataset was also analyzed by [14] using their proposed least squares estimators. Following their analysis, we consider the same two models: the first regresses the base-10 logarithm of the survival time on age at transplant and T5 mismatch score for the 157 patients with complete records on the T5 measure, and the second regresses the base-10 logarithm of the survival time on age and age². There were 55 censored patients. We fit these two models using the proposed method with five cubic B-spline basis functions.
We report the parameter estimates and the standard error estimates in Table 2 and compare them with the Gehan-weighted estimators reported by [14] and the Buckley-James estimators reported by [17]. For the first model, the parameter estimates for the age effect are fairly similar among all estimators, and the standard error estimate from the proposed method tends to be smaller, while the parameter estimates for the T5 mismatch score vary across estimators, with none being significant at the 0.05 level. The disparity of the T5 effect may be due to what was pointed out by [17]: the accelerated failure time model with age and T5 as covariates does not fit the data ideally. For the second model, with age and age² as covariates, the point estimates are very similar across all methods and the standard error estimates from the proposed method are the smallest.
Table 2. Parameter estimates and standard errors for the Stanford heart transplant data: B-spline MLE (columns 3–4), Gehan-weighted estimator (columns 5–6), and Buckley-James estimator (columns 7–8).

| Model | Covariate | Est. | SE | Est. | SE | Est. | SE |
|---|---|---|---|---|---|---|---|
| M. 1 | Age | −0.0237 | 0.0068 | −0.0211 | 0.0106 | −0.015 | 0.008 |
| | T5 | −0.2118 | 0.1271 | −0.0265 | 0.1507 | −0.003 | 0.134 |
| M. 2 | Age | 0.1022 | 0.0245 | 0.1046 | 0.0474 | 0.107 | 0.037 |
| | Age² | −0.0016 | 0.0004 | −0.0017 | 0.0006 | −0.0017 | 0.0005 |
6. Discussion
By applying the proposed general sieve M-estimation theory for semiparametric models with bundled parameters, we are able to derive the asymptotic distribution of the sieve maximum likelihood estimator in a linear regression model where the response variable is subject to right censoring. By providing an estimating procedure that is both statistically and computationally efficient, this work makes the linear model a more viable alternative to the Cox proportional hazards model. Compared with existing methods for estimating β in a linear model, the proposed method has three advantages. First, the estimating functions are smooth, in contrast to the discrete estimating functions of existing methods, so the root search is easier and can be done quickly by conventional iterative methods such as the Newton-Raphson algorithm. Second, the standard error estimates are obtained directly by inverting either the efficient information matrix for the regression parameters or the observed information matrix of all parameters, both of which are more computationally tractable than re-sampling techniques. Third, the proposed estimator achieves the semiparametric efficiency bound.
The proposed general sieve M-estimation theory can also be applied to other statistical models, for example, the single index model, the Cox model with an unknown link function, and the linear model under different censoring mechanisms. Such research is ongoing and will be presented elsewhere.
7. Proof of Theorem 4.2
Empirical process theory developed in [26, 27] will be heavily involved in the proof. We use the symbol ≲ to denote that the left-hand side is bounded above by a constant times the right-hand side, and ≳ to denote that the left-hand side is bounded below by a constant times the right-hand side. For notational simplicity, we drop the superscript * in the outer probability measure P* whenever an outer probability applies.
7.1. Technical lemmas
We first introduce several lemmas that will be used for the proofs of Theorems 4.1, 4.2 and 4.3. Proofs of these lemmas are provided in the online Supplementary Material.
Lemma 7.1. Under Conditions C.1–C.3 and C.6, the log-likelihood
l(β, ζ(·, β); Z) = Δζ(ε0, X, β) − ∫ I(ε0 ≥ t) e^{ζ(t, X, β)} dt,
where ε0 = Y − X′β0, has bounded and continuous first and second derivatives with respect to β ∈ ℬ and ζ(·, β) ∈ ℋp.
Lemma 7.2. For g0 ∈ 𝒢p, there exists a function such that
Lemma 7.3. Let θ0,n = (β0, ζ0,n(·, β0)) with ζ0,n(·, β0) ≡ g0,n defined in Lemma 7.2. Denote . Assume that Conditions C.1–C.3 and C.6 hold, then the ε-bracketing number associated with ‖ · ‖∞ norm for ℱn is bounded by (1/ε)cqn+d, i.e., N[ ](ε, ℱn, ‖ · ‖∞) ≲ (1/ε)cqn+d for some constant c > 0.
Lemma 7.4. Let , where , j = 1, …, d. Assume Conditions C.1–C.6 hold, then there exists such that , or equivalently, where .
Lemma 7.5. For defined in Lemma 7.4, denote the class of functions
Assume Conditions C.1–C.6 hold, then for some constant c > 0.
Lemma 7.6. For j = 1, ⋯, d, define the following two classes of functions
and
where l̇βj (θ; Z) is the jth element of l̇β(θ; Z), ġ(·) denotes the derivative of g(·), and is defined in Lemma 7.5. Assume Conditions C.1–C.6 hold, then and for some constants c1, c2 > 0.
7.2. Proof of Theorem 4.2
We prove the theorem by checking Assumptions A1–A6 in Section 2. Here the criterion function of a single observation is the log-likelihood function l(β, ζ(·, β); Z), so instead of m we use l to denote the criterion function. By Theorem 4.1 we know that Assumption A1 holds with ξ = min(pν, (1 − ν)/2) and the norm ‖ · ‖2 defined in (4.1). A2 automatically holds for the scores. For A3, we need to find an h* ∈ ℍd with h*(t, x, β0) = w*(t) such that
for all h ∈ ℍ with h(t, x, β) = w(t − x′(β − β0)). Note that
Since P{l̇ζ (β0, ζ0(·, β0); Z)[h]|X} = 0 for all h ∈ ℍ, replacing h(·, β0) by ẇ we have
Hence we only need to find a w* such that
One obvious choice for w* (or h*) is
(7.1) w*(t) = −ġ0(t) P(X | ε0 ≥ t).
Then it follows that
l̇β(θ0; Z) − l̇ζ(θ0; Z)[h*] = ∫ ġ0(t){P(X | ε0 ≥ t) − X} dM(t) = l*(β0; Z),
which is the efficient score function for β0 originally derived by [21], where
M(t) = ΔI(ε0 ≤ t) − Λ0(ε0 ∧ t).
By the zero-mean property of score functions, it is straightforward to verify the following equalities:
Then together with the fact that
the matrix A in Assumption A3 of Theorem 2.1 is given by
which is the information matrix for β0.
To verify A4, we note that the first part automatically holds since β̂n satisfies the score equation Ṡβ,n(β̂n, ζ̂n(·, β̂n)) = ℙnl̇β (β̂n, ζ̂n(·, β̂n); Z) = 0. Next we shall show that
where , j = 1, ⋯, d, is the jth component of w* (t) given in (7.1). According to Lemma 7.4, there exists such that . Then by the score equation for γ: Ṡγ,n (β̂n, γ̂n) = ℙnl̇γ(β̂n, γ̂n; Z) = 0 and the fact that can be written as for some coefficients and the basis functions Bk(t) of the spline space, it follows that
So it suffices to show that for each 1 ≤ j ≤ d,
Since , we decompose In into In = I1n + I2n, where
and
We will show that I1n and I2n are both op(n−1/2).
First consider I1n. According to Lemma 7.5, the ε-bracketing number associated with ‖ · ‖∞ norm for the class defined in Lemma 7.5 is bounded by (η/ε)cqn+d. This implies that
which leads to the bracketing integral
Now we pick η to be ηn = O{n−min(2ν,(1−ν)/2)}, then
and since p ≥ 3,
Therefore, . Denote tβ = t − X′(β − β0) for notational simplicity. Then for any , it follows that
where the first inequality holds because of the Cauchy-Schwarz inequality. Since , by the same argument as in [24], page 591, for slowly growing cn (their ln), e.g. , we know that is bounded by some constant 0 < M < ∞ and for a slightly enlarged ηn obtained by a fine adjustment of ν. Then by the maximal inequality in Lemma 3.4.2 of [27], it follows that
where the last equality holds because 0 < ν < 1/2. Thus by Markov’s inequality, .
Next for I2n, the Taylor expansion for at θ0 yields
where (β̃n, ζ̃n(·, β̃n)) is between (β0, ζ0(·, β0)) and (β̂n, ζ̂n(·, β̂n)). Then it follows that
where the second inequality holds because g̃n and its first derivative are bounded (or grow with n slowly enough that they can be effectively treated as bounded, by the same argument as in [24], page 591), and the last equality holds due to Corollary 6.21 of [22] that . Thus,
Also,
By the Cauchy-Schwarz inequality and the boundedness of g̃n, we have
Hence |I3n| ≲ d(θ̂n, θ0) and
Since , it follows that I2n = O{n−min((p+1)ν,(1+3ν)/2)} = o(n−1/2). Thus In = I1n + I2n = op(n−1/2) and Assumption A4 holds.
Now we verify Assumption A5. First by Lemma 7.6, the ε-bracketing numbers for the classes of functions and are both bounded by (η/ε)cqn+d, which implies that the corresponding ε-bracketing integrals are both bounded by , i.e.,
Then for l̇βj(θ; z) − l̇βj(θ0; z), by applying the Cauchy-Schwarz inequality, together with subtracting and adding the terms ġ(ε0), e^{g0(tβ)}ġ(tβ), e^{g0(t)}ġ(tβ) and e^{g0(t)}ġ0(tβ), we have
For B1, since g̈ is bounded and the largest eigenvalue of P(XX′) satisfies 0 < λd < ∞ by Condition C.2(b), it follows that
For B2, we have
For B3, by using the mean value theorem, it follows that
where g̃ = g0 + ξ(g − g0) for some 0 < ξ < 1 and thus is bounded. Finally for B4, by the mean value theorem, it follows that
Therefore we have P{l̇βj(θ; Z) − l̇βj(θ0; Z)}2 ≲ η2. Using a similar argument, we can show that . By Lemma 7.1, we also have that ‖l̇βj(θ; Z) − l̇βj(θ0; Z)‖∞ and are both bounded. Now we pick η as ηn = O{n−min((p−1)ν, (1−ν)/2)}; then by the maximal inequality in Lemma 3.4.2 of [27], it follows that
where the last equality holds since p ≥ 3 and . Similarly, we have . Thus for ξ = min(pν, (1 − ν)/2) and Cn−ξ = O{n−min(pν,(1−ν)/2)}, by Markov’s inequality,
This completes the verification of Assumption A5.
Finally, Assumption A6 can be verified by using the Taylor expansion. Since the proofs for the two equations in A6 are essentially identical, we just prove the first equation. In a neighborhood of with ξ = min(pν, (1 − ν)/2), the Taylor expansion for l̇β(θ; Z) yields
where θ̃ = (β̃, ζ̃(·, β̃)) is a midpoint between θ0 and θ. So
Then by direct calculation we have
By applying the argument used above in verifying A5, together with Condition C.6, we can show
Similarly, we can show
and
where ξ = min(pν, (1 − ν)/2). Therefore,
and thus
where the last equality holds since p ≥ 3, so and . Similarly we can show
Therefore, we have
where and αξ > 1/2.
Therefore, we have verified all six assumptions and thus we have
where is the efficient score function for β0 and , which is shown when verifying A3. Hence A = B and A−1B(A−1)′ = A−1 = I−1(β0), and
Thus we complete the proof of Theorem 4.2.
Acknowledgements
The authors would like to thank two referees and an associate editor for their very helpful comments.
Footnotes
Supported in part by NSF Grant DMS-0706700. Nan’s research is also supported in part by NSF grant DMS-1007590 and NIH grant R01-AG036802.
SUPPLEMENTARY MATERIAL
Additional proofs. The supplementary document contains proofs of technical lemmas and Theorems 4.1 and 4.3.
Contributor Information
Ying Ding, Email: yingding@umich.edu.
Bin Nan, Email: bnan@umich.edu.
References
- 1. Ai C, Chen X. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica. 2003;71:1795–1843.
- 2. Buckley J, James I. Linear Regression with Censored Data. Biometrika. 1979;66:429–436.
- 3. Chamberlain G. Asymptotic Efficiency in Estimation with Conditional Moment Restrictions. Journal of Econometrics. 1987;34:305–334.
- 4. Chen X. Large Sample Sieve Estimation of Semi-nonparametric Models. In: Heckman JJ, Leamer EE, editors. Handbook of Econometrics, Volume 6B. Elsevier; 2007. pp. 5549–5632.
- 5. Chen X, Linton O, Van Keilegom I. Estimation of semiparametric models when the criterion function is not smooth. Econometrica. 2003;71:1591–1608.
- 6. Cox DR. Regression Models and Life-Tables. Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
- 7. Ding Y. Some New Insights about the Accelerated Failure Time Model. Ph.D. Thesis, Biostatistics. University of Michigan; 2010.
- 8. He X, Shao Q-M. On parameters of increasing dimensions. Journal of Multivariate Analysis. 2000;73:120–135.
- 9. He X, Xue H, Shi N-Z. Sieve maximum likelihood estimation for doubly semiparametric zero-inflated Poisson models. Journal of Multivariate Analysis. 2010;101:2026–2038.
- 10. Huang J. Efficient Estimation for the Proportional Hazards Model with Interval Censoring. The Annals of Statistics. 1996;24:540–568.
- 11. Huang J. Efficient Estimation of the Partly Linear Additive Cox Model. The Annals of Statistics. 1999;27:1536–1563.
- 12. Huang J, Wellner JA. Interval censored survival data: a review of recent progress. Proceedings of the First Seattle Symposium in Biostatistics: Survival Analysis, Lecture Notes in Statistics. 1997;123:123–169.
- 13. Jin Z, Lin DY, Wei LJ, Ying Z. Rank-based Inference for the Accelerated Failure Time Model. Biometrika. 2003;90:341–353.
- 14. Jin Z, Lin DY, Ying Z. On Least-Squares Regression with Censored Data. Biometrika. 2006;93:147–161.
- 15. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. 2nd edition. Hoboken, NJ: Wiley; 2002.
- 16. Lai TL, Ying Z. Large Sample Theory of a Modified Buckley-James Estimator for Regression Analysis with Censored Data. The Annals of Statistics. 1991;19:1370–1402.
- 17. Miller RG, Halpern J. Regression with censored data. Biometrika. 1982;69:521–531.
- 18. Nan B, Kalbfleisch JD, Yu M. Asymptotic theory for the semiparametric accelerated failure time model with missing data. The Annals of Statistics. 2009;37:2351–2376.
- 19. Prentice RL. Linear Rank Tests with Right Censored Data. Biometrika. 1978;65:167–179.
- 20. Ritov Y. Estimation in a Linear Regression Model with Censored Data. The Annals of Statistics. 1990;18:303–328.
- 21. Ritov Y, Wellner JA. Censoring, Martingales and the Cox Model. In: Prabhu NU, editor. Statistical Inference from Stochastic Processes. Providence, RI: American Mathematical Society; 1988. pp. 191–219.
- 22. Schumaker L. Spline Functions: Basic Theory. New York: Wiley; 1981.
- 23. Shen X. On methods of sieves and penalization. The Annals of Statistics. 1997;25:2555–2591.
- 24. Shen X, Wong WH. Convergence Rate of Sieve Estimates. The Annals of Statistics. 1994;22:580–615.
- 25. Tsiatis AA. Estimating Regression Parameters Using Linear Rank Tests for Censored Data. The Annals of Statistics. 1990;18:354–372.
- 26. van der Vaart AW. Asymptotic Statistics. Cambridge University Press; 1998.
- 27. van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. New York: Springer; 1996.
- 28. Wei LJ, Ying Z, Lin DY. Linear Regression Analysis of Censored Survival Data Based on Rank Tests. Biometrika. 1990;77:845–851.
- 29. Wellner JA, Zhang Y. Two Likelihood-based Semiparametric Estimation Methods for Panel Count Data with Covariates. The Annals of Statistics. 2007;35:2106–2142.
- 30. Ying Z. A Large Sample Study of Rank Estimation for Censored Regression Data. The Annals of Statistics. 1993;21:76–99.
- 31. Zeng D, Lin DY. Efficient Estimation for the Accelerated Failure Time Model. Journal of the American Statistical Association. 2007;102:1387–1396.
- 32. Zhang Y, Hua L, Huang J. A Spline-Based Semiparametric Maximum Likelihood Estimation Method for the Cox Model with Interval-Censored Data. Scandinavian Journal of Statistics. 2010;37:338–354.