Abstract
We study the absolute penalized maximum partial likelihood estimator in sparse, high-dimensional Cox proportional hazards regression models where the number of time-dependent covariates can be larger than the sample size. We establish oracle inequalities based on natural extensions of the compatibility and cone invertibility factors of the Hessian matrix at the true regression coefficients. Similar results based on an extension of the restricted eigenvalue can also be proved by our method. However, the presented oracle inequalities are sharper, since the compatibility and cone invertibility factors are always greater than the corresponding restricted eigenvalue. In the Cox regression model, the Hessian matrix is based on time-dependent covariates in censored risk sets, so the compatibility and cone invertibility factors, and the restricted eigenvalue as well, are random variables even when they are evaluated for the Hessian at the true regression coefficients. Under mild conditions, we prove that these quantities are bounded from below by positive constants for time-dependent covariates, including cases where the number of covariates is of greater order than the sample size. Consequently, the compatibility and cone invertibility factors can be treated as positive constants in our oracle inequalities.
Keywords and phrases: Proportional hazards, regression, absolute penalty, regularization, oracle inequality, survival analysis
1. Introduction
The Cox (1972) proportional hazards model is widely used in the regression analysis of censored survival data, notably in identifying risk factors in epidemiological studies and treatment effects in clinical trials when the outcome variable is time to event. In a traditional biomedical study, the number of covariates p is usually relatively small as compared with the sample size n. Theoretical properties of the maximum partial likelihood estimator in the fixed p and large n setting have been well established. For example, Tsiatis (1981) proved the asymptotic normality of the maximum partial likelihood estimator. Andersen and Gill (1982) formulated the Cox model in the context of the more general counting process framework and studied the asymptotic properties of the maximum partial likelihood estimator using martingale techniques. These results provide a solid foundation for the applications of the Cox model in a diverse range of problems where time to event is the outcome of interest.
In recent years, technological advancement has resulted in the proliferation of massive high-throughput and high-dimensional genomic data in studies that attempt to find genetic risk factors for disease and clinical outcomes, such as the age of disease onset or time to death. Finding genetic risk factors for survival is fundamental to modern biostatistics, since survival is an important clinical endpoint. However, in such problems, the standard approach to the Cox model is not applicable, since the number of potential genetic risk factors is typically much larger than the sample size. In addition, traditional variable selection methods such as subset selection are not computationally feasible when p >> n.
The ℓ1-penalized least squares estimator, or the Lasso, was introduced by Tibshirani (1996). In the wavelet setting, the ℓ1-penalized method was introduced by Chen, Donoho and Saunders (1998) as Basis Pursuit. Since then, the Lasso has emerged as a widely used approach to variable selection and estimation in sparse, high-dimensional statistical problems. It has also been extended to the Cox model (Tibshirani, 1997). Gui and Li (2005) implemented the LARS algorithm (Efron et al., 2004) to approximate the Lasso in the Cox regression model and applied their method to survival data with microarray gene expression covariates. Their work demonstrated the effectiveness of the Lasso for variable selection in the Cox model in a p >> n setting.
There exists a substantial literature on the Lasso and other penalized methods for survival models with a fixed number of covariates p. Zhang and Lu (2007) considered an adaptive Lasso for the Cox model and showed that, under certain regularity conditions and with a suitable choice of the penalty parameter, their method possesses the asymptotic oracle property when the maximum partial likelihood estimator is used as the initial estimator. Fan and Li (2002) proposed the use of the smoothly clipped absolute deviation (SCAD) penalty (Fan, 1997; Fan and Li, 2001) for variable selection and estimation in the Cox model, which may include a frailty term. With a suitable choice of the penalty parameter, they showed that a local maximizer of the penalized log-partial likelihood has an asymptotic oracle property under certain regularity conditions on the Hessian of the log-partial likelihood and the censoring mechanism.
Extensive efforts have been focused upon the analysis of regularization methods for variable selection in the p >> n setting. In particular, considerable progress has been made in the theoretical understanding of the Lasso. However, most results are developed for the linear regression model. Greenshtein and Ritov (2004) studied the prediction performance of the Lasso in high-dimensional linear regression. Meinshausen and Bühlmann (2006) showed that, for neighborhood selection in the Gaussian graphical model, under a neighborhood stability condition and certain additional regularity conditions, the Lasso is consistent even when p/n → ∞. Zhao and Yu (2006) formalized the neighborhood stability condition in the context of linear regression as a strong irrepresentable condition on the design matrix. Oracle inequalities for the prediction and estimation error of the Lasso were developed in Bunea, Tsybakov and Wegkamp (2007), Zhang and Huang (2008), Meinshausen and Yu (2009), Bickel, Ritov and Tsybakov (2009), Zhang (2009) and Ye and Zhang (2010), among many others.
A number of papers analyzed penalized methods beyond linear regression. Fan and Peng (2004) established oracle properties for a local solution of the concave penalized estimator in a general setting with n >> p → ∞. van de Geer (2008) studied the Lasso in high-dimensional generalized linear models (GLM) and obtained prediction and ℓ1 estimation error bounds. Negahban, Ravikumar, Wainwright and Yu (2010) studied penalized M-estimators with a general class of regularizers, including an ℓ2 error bound for the Lasso in GLM under restricted convexity and other regularity conditions. Bradic et al (2011) made significant progress by extending the results of Fan and Li (2001) to a more general class of penalties in the Cox regression model with large p under different sets of regularity conditions. Huang and Zhang (2012) studied the weighted absolute penalty and its adaptive, multistage application in GLM.
In view of the central role of the Cox model in survival analysis, its widespread applications and the proliferation of p >> n data, it is of great interest to understand the properties of the related Lasso approach. The main goal of the present paper is to establish theoretical properties for the Lasso in the Cox model when p >> n. Specifically, we extend certain basic inequalities from linear regression to the case of the Cox regression. We generalize the compatibility and cone invertibility factors from the linear regression model and establish oracle inequalities for the Lasso in the Cox regression model in terms of these factors at the true parameter value. Moreover, we prove that the compatibility and cone invertibility factors can be treated as constants under mild regularity conditions.
A main feature of our results is that they are derived under the more general counting process formulation of the Cox model with potentially a larger number of time-dependent covariates than the sample size. This formulation is useful because it “permits a regression analysis of the intensity of a recurrent event allowing for complicated censoring patterns and time-dependent covariates” (Andersen and Gill, 1982).
A second main feature of our results is that the regularity conditions on the counting processes and time-dependent covariates are directly imposed on the compatibility and cone invertibility factors of the Hessian of the negative log-partial likelihood evaluated at the true regression coefficients. Under such regularity conditions, the Lasso estimator is proven to live in a neighborhood where the ratio between the estimated and true hazards is uniformly bounded away from zero and infinity. This allows unbounded and near-zero ratios between the true and baseline hazards. Our analysis can also be used to prove oracle inequalities based on the restricted eigenvalue. However, since the compatibility and cone invertibility factors are greater than the corresponding restricted eigenvalue (van de Geer and Bühlmann, 2009; Ye and Zhang, 2010), the presented results are sharper.
A third main feature of our results is that the compatibility and cone invertibility factors used, and the smaller corresponding restricted eigenvalue, are proven to be greater than a fixed positive constant under mild conditions on the counting processes and time-dependent covariates, including cases where p >> n. In the Cox regression model, the Hessian matrix is based on weighted averages of the cross-products of time-dependent covariates in censored risk sets, so the compatibility and cone invertibility factors and the restricted eigenvalue are random variables even when they are evaluated for the Hessian at the true regression coefficients. Under mild conditions, we prove that these quantities are bounded from below by positive constants, namely certain truncated population versions of them. Thus, the compatibility and cone invertibility factors can be treated as constants in our oracle inequalities.
The main results of this paper and the analytical methods used for deriving them are identical to those in its predecessor submitted in November 2011, with Section 4 as an exception. The difference in Section 4 is that the lower bound for the compatibility and cone invertibility factors and the restricted eigenvalue is improved to allow time-dependent covariates.
During the revision process of our paper, we became aware of a number of papers on hazard regression with censored data. Kong and Nan (2012) took the approach of van de Geer (2008) to derive prediction and ℓ1 error bounds for the Lasso in the Cox proportional hazards regression under a quite different set of conditions from ours. For example, they required an ℓ1 bound on the regression coefficients to guarantee a uniformly bounded ratio between the hazard functions under consideration. Lemler (2012) considered the joint estimation of the baseline hazard function and regression coefficients in the Cox model. As a result, Lemler's (2012) error bounds for regression coefficients are of greater order than ours when the intrinsic dimension of the unknown baseline hazard function is of greater order than the number of nonzero regression coefficients. Gaïffas and Guilloux (2012) considered a quadratic loss function in place of a negative log-likelihood function in an additive hazards model. A nice feature of the additive hazards model is that the quadratic loss produces unbiased linear estimation equations, so that the analysis of the Lasso is similar to that of linear regression. The oracle inequalities in these three papers and ours can all be viewed as non-asymptotic. Unlike our paper, none of these three papers considers time-dependent covariates or constant lower bounds of the restricted eigenvalue or related key factors for the analysis of the Lasso.
The rest of this paper is organized as follows. In Section 2 we provide basic notation and model specifications. In Section 3 we develop oracle inequalities for the Lasso in the Cox model. In Section 4 we study the compatibility and cone invertibility factors and the corresponding restricted eigenvalue of the Hessian of the log-partial likelihood in the Cox model. In Section 5 we make some additional remarks. All proofs are provided either immediately after the statement of the corresponding result or are deferred to the Appendix.
2. Cox model with the ℓ1 penalty
Following Andersen and Gill (1982), consider an n-dimensional counting process N(n)(t) = (N1(t), …, Nn(t)), t ≥ 0, where Ni(t) counts the number of observed events for the ith individual in the time interval [0, t]. The sample paths of N1, …, Nn are step functions, zero at t = 0, with jumps of size +1 only. Furthermore, no two components jump at the same time. For t ≥ 0, let ℱt be the filtration representing all the information available up to time t. Assume that for {ℱt, t ≥ 0}, N(n) has predictable compensator Λ(n) = (Λ1, …, Λn) with
(2.1) Λi(t) = ∫_0^t Yi(s) exp{βo′Zi(s)} dΛ0(s), i = 1, …, n,
where βo is a p-vector of true regression coefficients, Λ0 is an unknown baseline cumulative hazard function and, for each i, Yi(t) ∈ {0, 1} is a predictable at-risk indicator process that can be constructed from the data, and Zi(t) = (Zi,1(t), …, Zi,p(t))′ is a p-dimensional vector-valued predictable covariate process. In this setting the filtration can be the natural ℱt = σ{Ni(s), Yi(s), Zi(s); s ≤ t, i = 1, …, n} or a richer one. We are interested in the problem of variable selection in sparse, high-dimensional settings where p, the number of possible covariates, is large, but the number of important covariates is relatively small.
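For readers who wish to experiment with this setting numerically, the following sketch simulates data from model (2.1) in the special case of time-independent covariates, a unit-exponential baseline hazard Λ0(t) = t and independent exponential censoring; all names and parameter values here are illustrative choices of ours, not part of the model specification.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d_o = 100, 500, 5                        # sample size, dimension, sparsity
beta_o = np.zeros(p)
beta_o[:d_o] = 0.5                             # true sparse coefficients
Z = rng.choice([0.0, 1.0, 2.0], size=(n, p))   # bounded covariates, e.g. SNP allele counts

# with Lambda_0(t) = t, the event time given Z_i is exponential
# with rate exp(beta_o' Z_i)
risk = np.exp(Z @ beta_o)
T = rng.exponential(1.0 / risk)                # latent event times
C = rng.exponential(2.0, size=n)               # independent censoring times
X = np.minimum(T, C)                           # observed times
delta = (T <= C).astype(float)                 # event indicators; N_i(infty) <= 1
```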
2.1. Maximum partial likelihood estimator with ℓ1 penalty
Define the logarithm of the Cox partial likelihood for the survival experience up to time t,

C(β, t) = Σ_{i=1}^n ∫_0^t β′Zi(s) dNi(s) − ∫_0^t log[Σ_{i=1}^n Yi(s) exp{β′Zi(s)}] dN̄(s),

where N̄(t) = Σ_{i=1}^n Ni(t). The log-partial likelihood function is C(β; ∞) = limt→∞ C(β, t). Let ℓ(β) = −C(β; ∞)/n. The maximum partial likelihood estimator is the value of β that minimizes ℓ(β).
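As a numerical companion to this definition, here is a sketch of ℓ(β) for the simulated data above, assuming time-independent covariates and no tied event times (so the at-risk set at an event time X_i is simply {j : X_j ≥ X_i}); the function name is ours.

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, delta, Z):
    # ell(beta) = -C(beta; infty)/n for right-censored data without ties,
    # using X (observed times), delta (event indicators) and Z from above
    eta = Z @ beta
    at_risk = X[None, :] >= X[:, None]              # at_risk[i, j] = 1{X_j >= X_i}
    risk_sums = np.where(at_risk, np.exp(eta)[None, :], 0.0).sum(axis=1)
    return -np.sum(delta * (eta - np.log(risk_sums))) / len(X)
```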
An approach to variable selection in sparse, high-dimensional settings for the Cox model is to minimize an ℓ1-penalized negative log-partial likelihood criterion,
(2.2) L(β; λ) = ℓ(β) + λ|β|1
(Tibshirani, 1997), where λ ≥ 0 is a penalty parameter. Henceforth, we use the notation |β|q = (Σ_{j=1}^p |βj|^q)^{1/q} for 1 ≤ q < ∞, |β|∞ = max_{1≤j≤p} |βj| and |β|0 = #{j : βj ≠ 0}. For a given λ, the ℓ1-penalized maximum partial likelihood estimator, or the Lasso estimator hereafter, is defined as
(2.3) β̂ = β̂(λ) = arg min_β {ℓ(β) + λ|β|1}.
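Since L(β; λ) is convex, (2.3) can be computed by standard first-order methods; the following proximal-gradient sketch (gradient step on ℓ, soft-thresholding for the ℓ1 penalty) is ours, with illustrative step size and iteration count, and the gradient formula anticipates (2.5) below for time-independent covariates without ties. In practice a packaged solver (e.g. glmnet with its Cox family) would be used instead.

```python
import numpy as np

def gradient(beta, X, delta, Z):
    # ell_dot(beta) as in (2.5): minus the average over observed events of
    # Z_i - Zbar_n(X_i, beta), the covariate minus its risk-set weighted mean
    eta = Z @ beta
    at_risk = X[None, :] >= X[:, None]
    W = np.where(at_risk, np.exp(eta)[None, :], 0.0)  # W[i, j] = Y_j(X_i) exp(eta_j)
    Zbar = (W @ Z) / W.sum(axis=1, keepdims=True)     # Zbar_n(X_i, beta)
    return -(delta[:, None] * (Z - Zbar)).sum(axis=0) / len(X)

def lasso_cox(X, delta, Z, lam, step=0.05, iters=2000):
    # proximal gradient descent (ISTA) for (2.3); the step size is an
    # illustrative choice and may need tuning for other data
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        g = beta - step * gradient(beta, X, delta, Z)
        beta = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # prox of lam*|.|_1
    return beta
```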
2.2. The Karush-Kuhn-Tucker conditions
The Lasso estimator can be characterized by the Karush-Kuhn-Tucker (KKT) conditions. Since the log-partial likelihood belongs to an exponential family, ℓ(β) must be convex in β and so is L(β; λ). It follows that a vector β̂ = (β̂1, …, β̂p)′ is a solution to (2.3) if and only if the following KKT conditions hold:
(2.4) ℓ̇j(β̂) = −λ sgn(β̂j) if β̂j ≠ 0; |ℓ̇j(β̂)| ≤ λ if β̂j = 0, j = 1, …, p,
where ℓ̇(β) = (ℓ̇1(β), …, ℓ̇p(β))′ = ∂ℓ(β)/∂β is the gradient of ℓ(β). The necessity and sufficiency of (2.4) can be proved by subdifferentiation of the convex penalized loss (2.2). This does not require strict convexity.
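Numerically, (2.4) gives a direct optimality check for any candidate solution; a sketch using the gradient() function above (the support tolerance is an arbitrary choice of ours):

```python
import numpy as np

def kkt_violation(beta_hat, X, delta, Z, lam, tol=1e-4):
    # On the support, ell_dot_j must equal -lam * sgn(beta_hat_j);
    # off the support, |ell_dot_j| <= lam. Returns the worst violation,
    # which should be near zero (or negative) at a Lasso solution.
    g = gradient(beta_hat, X, delta, Z)
    on = np.abs(beta_hat) > tol
    return max(np.max(np.abs(g[on] + lam * np.sign(beta_hat[on])), initial=0.0),
               np.max(np.abs(g[~on]), initial=0.0) - lam)
```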
The KKT conditions indicate that the Lasso in the Cox regression model may be analyzed in a similar way to the Lasso in linear regression. As can be seen in the subsequent developments, such analysis can be carried out by proving that |ℓ̇(βo)|∞ is sufficiently small and that the Hessian of ℓ(β) does not vanish for a sparse β at the true β = βo. The (local) martingales for the counting process play a crucial role in ensuring that these requirements are satisfied.
2.3. Additional notation
Since the Λi are compensators,

Mi(t) = Ni(t) − Λi(t), i = 1, …, n,

are (local) martingales with predictable variation/covariation processes

⟨Mi, Mi⟩(t) = Λi(t), ⟨Mi, Mj⟩(t) = 0 for i ≠ j,

since no two components of N(n) jump at the same time.
For a vector υ, let υ^{⊗0} = 1 ∈ ℝ, υ^{⊗1} = υ and υ^{⊗2} = υυ′. Define

Z̄n(t, β) = S^{(1)}(t, β)/S^{(0)}(t, β),

where S^{(k)}(t, β) = n^{−1} Σ_{i=1}^n Yi(t) exp{β′Zi(t)}Zi(t)^{⊗k} for k = 0, 1, 2. By differentiation and rearrangement of terms, it can be shown as in Andersen and Gill (1982) that the gradient of ℓ(β) is
(2.5) ℓ̇(β) = −n^{−1} Σ_{i=1}^n ∫_0^∞ {Zi(t) − Z̄n(t, β)} dNi(t)
and the Hessian matrix of ℓ(β) is
(2.6) ℓ̈(β) = n^{−1} ∫_0^∞ Vn(t, β) dN̄(t), where Vn(t, β) = S^{(2)}(t, β)/S^{(0)}(t, β) − Z̄n(t, β)^{⊗2}.
3. Oracle inequalities
In this section, we derive oracle inequalities for the estimation error of the Lasso in the Cox regression model. Let βo be the vector of true regression coefficients and define 𝒪 = {j : βoj ≠ 0} and do = |𝒪|, where |𝒰| for a set 𝒰 denotes its cardinality.
Making use of the KKT conditions (2.4), we first develop a basic inequality involving the symmetric Bregman divergence and the ℓ1 estimation error on the support 𝒪 of βo and its complement. The symmetric Bregman divergence, defined as

Ds(β̂, β) = (β̂ − β)′{ℓ̇(β̂) − ℓ̇(β)},

can be viewed as a symmetric partial Kullback–Leibler distance between the partial likelihoods at β̂ and β. Thus, Ds(β̂, β) can be viewed as a measure of prediction performance. The basic inequality, given in Lemma 3.1 below, serves as a vehicle for establishing the desired oracle inequalities.
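Since Ds(β̂, β) involves only the gradient of ℓ, it is directly computable; a minimal sketch reusing the gradient() function from Section 2:

```python
def sym_bregman(beta1, beta2, X, delta, Z):
    # symmetric Bregman divergence D^s(beta1, beta2) of the negative
    # log-partial likelihood; nonnegative by convexity of ell
    return (beta1 - beta2) @ (gradient(beta1, X, delta, Z)
                              - gradient(beta2, X, delta, Z))
```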
Lemma 3.1. Let β̂ be defined as in (2.3), θ̃ = β̂ − βo and z* = |ℓ̇(βo)|∞. Then, the following inequalities hold:

(3.1) 0 ≤ Ds(β̂, βo) ≤ (λ + z*)|θ̃𝒪|1 − (λ − z*)|θ̃𝒪c|1,

where θ̃𝒪 and θ̃𝒪c denote the subvectors of θ̃ with components in 𝒪 and 𝒪c, respectively. In particular, for any ξ > 1, |θ̃𝒪c|1 ≤ ξ|θ̃𝒪|1 in the event z* ≤ λ(ξ − 1)/(ξ + 1).
It follows from Lemma 3.1 that in the event z* ≤ λ(ξ − 1)/(ξ + 1), the estimation error θ̃ = β̂ − βo belongs to the cone

(3.2) 𝒞(ξ, 𝒪) = {b ∈ ℝ^p : |b𝒪c|1 ≤ ξ|b𝒪|1}.
In linear regression, the invertibility of the Gram matrix in the same cone, expressed in terms of restricted eigenvalues and related quantities, has been used to control the estimation error of the Lasso. In what follows, we prove that a direct extension of the compatibility and cone invertibility factors can be used to control the estimation error of the Lasso in the Cox regression.
For the cone in (3.2) and a given p × p nonnegative-definite matrix Σ̅, define

(3.3) κ(ξ, 𝒪; Σ̅) = inf{do^{1/2}(b′Σ̅b)^{1/2}/|b𝒪|1 : 0 ≠ b ∈ 𝒞(ξ, 𝒪)}
as the compatibility factor (van de Geer, 2007; van de Geer and Bühlmann, 2009), and
(3.4) Fq(ξ, 𝒪; Σ̅) = inf{do^{1/q} b′Σ̅b/(|b𝒪|1|b|q) : 0 ≠ b ∈ 𝒞(ξ, 𝒪)}
as the weak cone invertibility factor (Ye and Zhang, 2010). These quantities are closely related to the restricted eigenvalue (Bickel, Ritov and Tsybakov, 2009; Koltchinskii, 2009),
(3.5) RE(ξ, 𝒪; Σ̅) = inf{(b′Σ̅b)^{1/2}/|b|2 : 0 ≠ b ∈ 𝒞(ξ, 𝒪)}.
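The infima in (3.3)–(3.5) cannot in general be computed exactly, but random directions in the cone give numerical upper bounds on them (a sampled minimum can never certify a lower bound). The following sketch, with sampling choices of ours, can serve as a sanity check for any given nonnegative-definite Σ̅ and index set O:

```python
import numpy as np

def cone_factor_upper_bounds(Sigma, O, xi, draws=20000, seed=1):
    # Monte Carlo upper bounds on kappa(xi, O), F_2(xi, O) and RE(xi, O):
    # each random direction is rescaled onto the cone boundary
    # |b_Oc|_1 = xi * |b_O|_1 before evaluating the three ratios
    rng = np.random.default_rng(seed)
    p, d_o = Sigma.shape[0], len(O)
    Oc = np.setdiff1d(np.arange(p), O)
    kappa = F2 = RE = np.inf
    for _ in range(draws):
        b = rng.standard_normal(p)
        b[Oc] *= xi * np.abs(b[O]).sum() / max(np.abs(b[Oc]).sum(), 1e-12)
        quad = b @ Sigma @ b
        l1_O, l2 = np.abs(b[O]).sum(), np.linalg.norm(b)
        kappa = min(kappa, np.sqrt(d_o * quad) / l1_O)
        F2 = min(F2, np.sqrt(d_o) * quad / (l1_O * l2))
        RE = min(RE, np.sqrt(quad) / l2)
    return kappa, F2, RE
```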
In linear regression, the Hessian of the squared loss is taken as Σ̅, and the oracle inequalities established in the papers cited in the above paragraph can be summarized as follows: in the event z* = |X′(y − Xβo)/n|∞ ≤ λ(ξ − 1)/(ξ + 1),
(3.6) |
and
(3.7) |
In the Cox regression model, we still take the Hessian of the log-partial likelihood as Σ̅, in fact the Hessian ℓ̈(βo) at the true βo, so that (3.3) and (3.4) become

(3.8) κ(ξ, 𝒪) = κ(ξ, 𝒪; ℓ̈(βo)), Fq(ξ, 𝒪) = Fq(ξ, 𝒪; ℓ̈(βo)).
The reason for using these factors is that they yield somewhat sharper oracle inequalities than the restricted eigenvalue. It follows from the Cauchy–Schwarz inequality |b𝒪|1 ≤ do^{1/2}|b|2 that F2(ξ, 𝒪) ≥ κ(ξ, 𝒪)RE(ξ, 𝒪) and κ(ξ, 𝒪) ≥ RE(ξ, 𝒪). Therefore, the first inequality of (3.7) is subsumed by the second with q = 2. Moreover, it is possible to have κ(ξ, 𝒪) >> RE(ξ, 𝒪) (van de Geer and Bühlmann, 2009), and consequently, the ℓ2 error bound based on the cone invertibility factor may be of sharper order than the one based on the restricted eigenvalue. The following theorem extends (3.6) and (3.7) from the linear regression model to the proportional hazards regression model. Let K be a nonnegative constant such that
(3.9) sup_{t≥0} max_{i,i′} |Zi(t) − Zi′(t)|∞ ≤ K.
Let κ(ξ, 𝒪) and Fq(ξ, 𝒪) be as in (3.8).
Theorem 3.1. Let τ = K(ξ + 1)doλ/{2κ²(ξ, 𝒪)} with K as in (3.9). Suppose condition (3.9) holds and τ ≤ 1/e. Then, in the event |ℓ̇(βo)|∞ ≤ λ(ξ − 1)/(ξ + 1),
(3.10) |
and
(3.11) |
where η ≤ 1 is the smaller solution of ηe^{−η} = τ.
Compared with (3.6) and (3.7), the new inequalities (3.10) and (3.11) contain an extra factor e^η ≤ e. This is due to the nonlinearity of the Cox regression score equation. Aside from this factor, the error bounds for the Cox regression have the same form as those for linear regression, except for an improvement by a factor of 4ξ/(1 + ξ) ≥ 2 in the ℓ1 oracle inequality.
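The quantity η is easy to obtain numerically: h(η) = ηe^{−η} increases on [0, 1] from 0 to 1/e, so for τ ≤ 1/e the smaller root can be found by bisection, as in this small sketch of ours:

```python
from math import exp

def smaller_root(tau):
    # smaller solution of eta * exp(-eta) = tau on [0, 1]; requires tau <= 1/e
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if mid * exp(-mid) < tau:
            lo = mid
        else:
            hi = mid
    return hi

print(exp(smaller_root(0.1)))  # the factor e^eta in (3.10)-(3.11); here about 1.12
```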
The theorem assumes condition (3.9), which asserts |Zi(t) − Zi′(t)|∞ ≤ K uniformly in {t, i, i′}. This condition is a consequence of the uniform boundedness of the individual covariates, and it is reasonable in most practical situations; for single-nucleotide polymorphism data coded as allele counts in {0, 1, 2}, for example, it holds with K = 2. In the case where the covariates are normal variables with uniformly bounded variance, the condition holds with K = Kn,p of the order √(log(np)).
From an analytical perspective, an important feature of (3.10) and (3.11) is that the constant factors (3.3) and (3.4) are both defined with the true βo in (3.8). No condition is imposed on the gradient or Hessian of the log-partial likelihood at any β ≠ βo. In other words, the key condition τ ≤ 1/e, expressed in terms of {K, do, λ} and the compatibility factor κ²(ξ, 𝒪) at the true βo, is sufficient to guarantee the error bounds in Theorem 3.1. Thus, our results are much simpler to state, and their conditions easier to verify, than existing ones requiring regularity conditions in a neighborhood of βo in the Cox regression model. This feature of Theorem 3.1 plays a crucial role in our derivation of lower bounds for κ²(ξ, 𝒪) and Fq(ξ, 𝒪) for time-dependent covariates in Section 4. We note that the local martingale structure is valid only at the true βo.
To prove Theorem 3.1, we develop a sharpened version of an inequality of Hjort and Pollard (1993). This inequality, given in Lemma 3.2 below, explicitly controls the symmetric Bregman divergence and the Hessian of the log-partial likelihood in a neighborhood of β. Based on this relationship, Theorem 3.1 is proved using the definition of the quantities in (3.8) and the membership of the error β̂ − βo in the cone 𝒞(ξ, 𝒪) in (3.2). For two symmetric matrices A and B, A ≤ B means that B − A is nonnegative-definite.
Lemma 3.2. Let ℓ(β) and its Hessian ℓ̈(β) be as in (2.2) and (2.6). Then,
(3.12) |
where ηb = max0≤s≤1 maxi,j|b′Zi(s) − b′Zj(s)|. Moreover,
(3.13) |
Under the conditions of Theorem 3.1, the factors e^{±ηb} and e^{±2ηb} in the inequalities in Lemma 3.2 are bounded by positive constants. These factors lead to the factor e^η for η ≤ 1 (and thus e^η ≤ e) in the upper bounds in (3.10) and (3.11).
Since the oracle inequalities in Theorem 3.1 are guaranteed to hold only in the event |ℓ̇(βo)|∞ ≤ λ(ξ − 1)/(ξ + 1), a probabilistic upper bound is needed for |ℓ̇(βo)|∞. Lemma 3.3 below provides such a bound. Similar inequalities can be found in de la Peña (1999).
Lemma 3.3. (i) Let with [−1, 1]-valued predictable processes ai(s). Then,
(3.14) |
(ii) Suppose that max_{i≤n} sup_{t≥0} max_{j≤p} |Zi,j(t) − Z̄n,j(t, βo)| ≤ K, where Z̄n,j(t, βo) are the components of Z̄n(t, βo). Let ℓ̇(β) be the gradient in (2.5). Then,
(3.15) |
In particular, if max_{i≤n} Ni(∞) ≤ 1, then P{|ℓ̇(βo)|∞ > Kx} ≤ 2pe^{−nx²/2}.
The following theorem states an upper bound of the estimation error, which follows directly from Theorem 3.1 and Lemma 3.3.
Theorem 3.2. Suppose (3.9) holds and Ni(∞) ≤ 1 for all i ≤ n. Let ξ > 1 and λ = {(ξ + 1)/(ξ − 1)}K√{(2/n) log(2p/ε)} with a small ε > 0 (e.g. ε = 1%). Let Cκ > 0 satisfy τ = K(ξ + 1)doλ/(2Cκ²) ≤ 1/e. Let η ≤ 1 be the smaller solution of ηe^{−η} = τ. Then, for any CF,q > 0, the oracle inequalities (3.10) and (3.11), with κ(ξ, 𝒪) and Fq(ξ, 𝒪) replaced by Cκ and CF,q,
all hold with probability at least P{κ(ξ, 𝒪) ≥ Cκ, Fq(ξ, 𝒪) ≥ CF,q} − ε.
It is noteworthy that this theorem gives an upper bound on the estimation error for all ℓq norms with q ≥ 1. From this theorem, for the ℓq error |β̂ − βo|q with q ≥ 1 to be small with high probability, we need to ensure that doλ → 0 as n → ∞. This requires do√{(log p)/n} → 0. If do is bounded, then p can be as large as e^{o(n)}.
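For concreteness, the theoretical penalty level of Theorem 3.2 is trivial to compute; a sketch with illustrative inputs:

```python
from math import log, sqrt

def theoretical_lambda(n, p, K, xi=2.0, eps=0.01):
    # lambda = {(xi + 1)/(xi - 1)} K sqrt((2/n) log(2p/eps)): by Lemma 3.3 the
    # event |ell_dot(beta_o)|_inf > lambda (xi - 1)/(xi + 1) then has
    # probability at most eps
    return (xi + 1) / (xi - 1) * K * sqrt(2 * log(2 * p / eps) / n)

print(theoretical_lambda(n=100, p=500, K=2.0))  # about 2.88
```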
Bradic et al (2011) considered estimation as well as variable selection and oracle properties for general concave penalties, including the Lasso. Their broader scope seems to have led to more elaborate statements and some key conditions that are harder to verify than those of Theorems 3.1 and 3.2, e.g., their Condition 2(i) on a uniformly small spectral bound between S^{(2)}(t, β1) and its population version for sparse β1 in a neighborhood of βo.
Proof of Theorem 3.2. Let C0 = 1 and x = √{(2/n) log(2p/ε)} in Lemma 3.3, so that Kx = λ(ξ − 1)/(ξ + 1). The probability of the event |ℓ̇(βo)|∞ > λ(ξ − 1)/(ξ + 1) is then at most 2pe^{−nx²/2} = ε. The desired result follows directly from Theorem 3.1.
4. Compatibility and invertibility factors and restricted eigenvalues
In Section 3, the oracle inequalities in Theorems 3.1 and 3.2 are expressed in terms of the compatibility and weak cone invertibility factors. However, as mentioned in the Introduction, these quantities are random variables. This section provides sufficient conditions under which they can be treated as constants. Since these factors appear in the denominators of the error bounds, it suffices to bound them from below. We also derive a lower bound for the restricted eigenvalue to facilitate further analysis of the Cox model in high dimensions. We will prove that these quantities are bounded from below by the population versions of certain truncations of them.
Compared with linear regression, the Cox model poses two additional difficulties: (a) time dependence of the covariates, and (b) stochastic integration of the Hessian over random risk sets. Fortunately, the compatibility and weak cone invertibility factors in Theorems 3.1 and 3.2 involve only the Hessian of the log-partial likelihood at the true βo, so that a martingale argument can be used.
To simplify the statement of our results, we use ϕ(ξ, 𝒪; Σ̅) to denote any of the following quantities:

(4.1) ϕ(ξ, 𝒪; Σ̅) ∈ {κ²(ξ, 𝒪; Σ̅), Fq(ξ, 𝒪; Σ̅), RE²(ξ, 𝒪; Σ̅)}, 1 ≤ q ≤ 2,

where κ(ξ, 𝒪; Σ̅), Fq(ξ, 𝒪; Σ̅), and RE(ξ, 𝒪; Σ̅) are as in (3.3), (3.4) and (3.5), respectively. If we make a claim about ϕ(ξ, 𝒪; Σ̅), we mean that the claim holds for any quantity it represents. Let ϕmin(·) denote the smallest eigenvalue of a matrix. The following lemma provides some key properties of ϕ(ξ, 𝒪; Σ̅) used in the derivation of its lower bounds.
Lemma 4.1. Let κ(ξ, 𝒪; Σ̅), Fq(ξ, 𝒪; Σ̅), RE(ξ, 𝒪; Σ̅) and ϕ(ξ, 𝒪; Σ̅) be as in (4.1). Let Σ̅jk be the elements of Σ̅ and let Σ be another nonnegative-definite matrix with elements Σjk.
(i) For 1 ≤ q ≤ 2, Fq(ξ, 𝒪; Σ̅) ≥ (1 + ξ)^{1−2/q}κ(ξ, 𝒪; Σ̅)RE(ξ, 𝒪; Σ̅).
(ii) ϕ(ξ, 𝒪; Σ̅) ≥ ϕ(ξ, 𝒪; Σ) − do(ξ + 1)² max_{1≤j≤k≤p} |Σ̅jk − Σjk|.
(iii) If Σ̅ ≥ Σ, then ϕ(ξ, 𝒪; Σ̅) ≥ ϕ(ξ, 𝒪; Σ).
Proof of Lemma 4.1. By the Hölder inequality, |b|q ≤ |b|1^{2/q−1}|b|2^{2−2/q}. Since |b|1 ≤ (1 + ξ)|b𝒪|1 in the cone and |b𝒪|1 ≤ do^{1/2}|b|2, we have |b|2/|b|q ≥ {(1 + ξ)do^{1/2}}^{1−2/q}.
This and b′Σ̅b ≥ {do^{−1/2}κ(ξ, 𝒪; Σ̅)|b𝒪|1}{RE(ξ, 𝒪; Σ̅)|b|2} yield part (i) by the definition of the quantities involved. Part (ii) follows from |b′(Σ̅ − Σ)b| ≤ |b|1² max_{1≤j≤k≤p} |Σ̅jk − Σjk| and |b|1 ≤ (1 + ξ)|b𝒪|1 ≤ (1 + ξ)do^{1/2}|b|2. Part (iii) follows immediately from the definition of the quantities in (4.1).
It follows from Lemma 4.1 (ii) and (iii) that quantities of the type ϕ(ξ, 𝒪; Σ̅) in (4.1) can be bounded from below in two ways. The first is to bound the matrix Σ̅ from below, and the second is to approximate Σ̅ under the supremum norm for its elements. In the p >> n setting, our problem is essentially the rank deficiency of Σ̅ to begin with, so a lower bound for Σ̅ is still rank deficient. However, a lower bound for the random matrix Σ̅ = ℓ̈(βo), e.g. a certain truncated version of it, may have smaller variability and thus allow an approximation by its population version. This is our basic idea. In fact, our analysis takes advantage of this argument twice, to remove different sources of randomness.
According to our plan described in the previous paragraph, we first choose a suitable truncation of Σ̅ = ℓ̈(βo) as a lower bound of the matrix. This is done by truncating the maximum event time under consideration. It follows from (2.6) that for t* > 0,

(4.2) ℓ̈(βo) ≥ ℓ̈(βo; t*) = n^{−1} ∫_0^{t*} Vn(t, βo) dN̄(t).
This allows us to remove the randomness from the counting process by replacing the average counting measure n^{−1}dN̄(t) with its compensator Rn(s, βo)dΛ0(s), where Rn(s, βo) = S^{(0)}(s, βo) and Λ0 is the baseline cumulative hazard function. This approximation of ℓ̈(βo; t*) can be written as

(4.3) Σ̅(t*) = ∫_0^{t*} Vn(s, βo)Rn(s, βo) dΛ0(s).
To completely remove the randomness with Σ̅(t*), we apply the method again by truncating the weights with Rn(s, βo). For M > 0, define
(4.4) |
where with
We will prove that the matrix (4.4) is a lower bound of (4.3). Suppose {Yi(t), Zi(t), t ≥ 0} are iid stochastic processes from {Y(t), Z(t), t ≥ 0}. The population version of (4.4) is then
(4.5) |
where with
The analysis outlined above leads to the following main result of this section. For ξ ≥ 1 and 𝒪 ⊂ {1, …, p} with |𝒪| = do, let ϕ(ξ, 𝒪; Σ̅) represent all quantities of interest given in (4.1), κ(ξ, 𝒪) and Fq(ξ, 𝒪) be the compatibility and weak cone invertibility factors in (3.8) with the Hessian ℓ̈(βo) in (2.6) at the true βo, and RE(ξ, 𝒪; Σ̅) be as in (3.5). Let Ln(x) = √{(2/n) log x}.
Theorem 4.1. Suppose {Yi(t), Zi(t), t ≥ 0} are iid processes from {Y(t), Z(t), t ≥ 0} with supt P{|Zi(t) − Z(t)|∞ ≤ K} = P{maxi Ni(∞) ≤ 1} = 1. Let {t*, M} be positive constants and r* = EY (t*) min{M, exp(Z′(t*)βo)}. Then,
(4.6) |
with probability at least 1 − 3ε, where C1 = 1 + Λ0(t*), C2 = (2/r*)Λ0(t*), and tn,p,ε is the solution in t of nt²/{2(1 + t/3)} = log(2.221p(p + 1)/ε). Consequently, for 1 ≤ q ≤ 2,
(4.7) |
with probability at least 1 − 3ε, where ρ* = ϕmin(Σ(t*; M)) is the smallest eigenvalue of the matrix in (4.5).
Theorem 4.1 implies that the compatibility and cone invertibility factors and the restricted eigenvalue can all be treated as constants in the high-dimensional Cox model with time-dependent covariates. We note that is of smaller order than Ln(p(p + 1)/ε), so that the lower bounds in (4.6) and (4.7) depend on the choice of t* and M only marginally, through C1 and ρ*. If do is sufficiently small, as assumed in Theorem 3.2, the right-hand side of (4.7) can be treated as ρ*/2. It is reasonable to treat ρ* as a constant, since it is the smallest eigenvalue of a population integrated covariance matrix in (4.5).
In the proof of Theorem 4.1, the martingale exponential inequality in Lemma 3.3 is used to bound the difference between (4.2) and (4.3). The following Bernstein inequality for V-statistics is used to bound the difference between (4.4) and (4.5). This inequality can be viewed as an extension of the Hoeffding (1963) inequality for sums of bounded independent variables and non-degenerate U-statistics.
Lemma 4.2. Let Xi be a sequence of independent stochastic processes and fi,j be functions of Xi and Xj with |fi,j| ≤ 1. Suppose fi,j are degenerate in the sense of E[fi,j|Xi] = E[fi,j|Xj] = 0 for all i ≠ j. Let . Then,
where εn(t) = e^{−(nt²/2)/(1+t/3)}.
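To see the degeneracy condition and the concentration it buys in action, here is a small simulation sketch of ours with a product kernel f_{i,j}(X_i, X_j) = g(X_i)g(X_j), where |g| ≤ 1 and Eg(X_i) = 0, so that E[f_{i,j}|X_i] = E[f_{i,j}|X_j] = 0; the normalization by n² below is an illustrative choice, not the lemma's.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 5000
draws = rng.uniform(-2.0, 2.0, size=(reps, n))  # X_1, ..., X_n, repeated
G = np.clip(draws, -1.0, 1.0)                   # g(X_i): bounded by 1, mean zero by symmetry
S = G.sum(axis=1)
V = (S ** 2 - (G ** 2).sum(axis=1)) / n ** 2    # average of g(X_i) g(X_j) over the n^2 pairs i != j
print(np.quantile(np.abs(V), 0.99))             # concentrates near zero as n grows
```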
Our discussion focuses on the quantities in (4.1) for the Hessian matrix Σ̅ = ℓ̈(βo) evaluated at the true vector of coefficients. Still, through Lemma 3.2, Theorem 4.1 also provides lower bounds for these quantities at any β not far from the true βo in the ℓ1 distance. We formally state this result in the following corollary.
Corollary 4.1. Let ϕ(ξ, 𝒪; Σ̅) represent any quantities in (4.1). Then,
where ηb = sups maxi,j |b′Zi(s) − b′Zj(s)|. Consequently, when |Zi(s) − Zj(s)|∞ ≤ K,
under the conditions of Theorem 4.1.
It is worthwhile to point out that, unlike typical “small ball” analyses based on Taylor expansion, Corollary 4.1 provides non-asymptotic control of the quantities in an ℓ1 ball of constant size. Since b′Σ̅b appears in the numerator of the quantities represented by ϕ(ξ, 𝒪; Σ̅), Corollary 4.1 follows immediately from Theorem 4.1 and (3.13). It implies that the Hessian has sufficient invertibility properties for the analysis of the Lasso when the estimator is not far from the true βo in the ℓ1 distance. On the other hand, if the Hessian has sufficient invertibility properties in a ball of fixed size, non-asymptotic error bounds for the Lasso estimator can be established. This “chicken and egg” problem is directly solved in the proof of Theorem 3.1.
5. Concluding remarks
This paper deals with the Cox proportional hazards regression model when the number of time-dependent covariates p is potentially much larger than the sample size n. The ℓ1 penalty is used to regularize the log-partial likelihood function. Error bounds parallel to those of the Lasso in linear regression are established. In establishing these bounds, we extend the notion of the restricted eigenvalue and compatibility and cone invertibility factors to the Cox model. We show that these quantities indeed provide useful error bounds.
An important issue is the choice of the penalty level λ. Theorem 3.2 requires a λ slightly larger than K√{(2 log p)/n}, where K is a uniform upper bound for the range of the individual covariates. This indicates that the Lasso is tuning insensitive, since this theoretical choice does not depend on the unknowns. In practice, cross-validation can be used to fine-tune the penalty level λ. Theoretical investigation of the performance of the Lasso with a cross-validated λ, an interesting and challenging problem in and of itself even in the simpler linear regression model, is beyond the scope of this paper.
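A cross-validation sketch along these lines, reusing the lasso_cox() and neg_log_partial_likelihood() functions from Section 2 (the fold count and the candidate grid are illustrative choices of ours):

```python
import numpy as np

def cv_lambda(X, delta, Z, lambdas, folds=5, seed=2):
    # score each candidate lambda by the held-out negative log-partial likelihood
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), folds)
    scores = []
    for lam in lambdas:
        s = 0.0
        for k in range(folds):
            test = parts[k]
            train = np.concatenate(parts[:k] + parts[k + 1:])
            b = lasso_cox(X[train], delta[train], Z[train], lam)
            s += neg_log_partial_likelihood(b, X[test], delta[test], Z[test])
        scores.append(s / folds)
    return lambdas[int(np.argmin(scores))]
```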
General concave penalized estimators in the Cox regression model have been considered in Bradic et al (2011), where oracle inequalities and properties of certain local solutions are established. Zhang and Zhang (2012) provided a unified treatment of global and local solutions for concave penalized least squares estimators in linear regression. Since this unified treatment relies on an oracle inequality for the global solution based on the cone invertibility factor, the results in this paper point to a possible extension of such a unified treatment to global and local solutions of general concave regularized methods in the Cox regression model.
Acknowledgements
We are grateful to two anonymous reviewers, the associate editor and editor for their helpful comments which led to considerable improvements in the paper. We also wish to thank a reviewer for bringing to our attention the work of Gaïffas and Guilloux (2012), Lemler (2012) and Kong and Nan (2012) during the revision process of this paper.
Appendix
Here we prove Lemmas 3.1, 3.2, 3.3 and 4.2 and Theorems 3.1 and 4.1.
Proof of Lemma 3.1. Since ℓ(β) is a convex function, Ds(β̂, βo) = θ̃′{ℓ̇(βo + θ̃) − ℓ̇(βo)} ≥ 0, so that the first inequality holds. Since θ̃j = β̂j for j ∈ 𝒪c, (2.4) gives
Thus the second inequality in (3.1) holds. Note that the inequality in the third line above requires (ℓ̇(βo + θ̃))j = −λsgn(β̂j) only in the set 𝒪c ∩ {j : β̂j ≠ 0}, since θ̃j = 0 when j ∈ 𝒪c and β̂j = 0.
Proof of Lemma 3.2. We use similar notation as in Hjort and Pollard (1993). Let ai = ai(s) = b′ {Zi(s) − Z̄n (s, β)}, wi = wi(s) = Yi(s) exp[β′ Zi(s)], and c = c(s) = (maxi ai(s) + mini ai(s))/2. Clearly, maxi |ai − c| ≤ (1/2)ηb. By the definition of Z̄n (t, β),
where the first inequality comes from (e^y − e^x)/(y − x) ≥ e^{−(|y|∨|x|)} and, since ∑i wiai = 0, the second one from . Thus, since , (2.6) and (2.5) give
This implies the lower bound in (3.12). Similarly, the lower bound in (3.13) follows from
and
The proof of the upper bounds in (3.12) and (3.13), nearly identical to the proof of the lower bounds, is omitted.
Proof of Lemma 3.3. Applying the union bound and changing the scale of the covariates if necessary, we assume without loss of generality that p = K = 1. In this case
where ai(t) = Zi1 (t) − Z̄n,1 (t), i = 1, …, n, are predictable and satisfy |ai(t)| ≤ 1. Thus, (3.15) follows from (3.14).
Let tj be the time of the jth jump of the process N̄(t) = ∑i Ni(t), j = 1, …, m, and t0 = 0. Then, the tj are stopping times. For j = 0, …, m, define
(7.1) |
Since Mi(s) are martingales and ai(s) are predictable, {Xj, j = 0, 1, …} is a martingale with the difference |Xj − Xj−1| ≤ maxs, i |ai(s)| ≤ 1. Let m be the greatest integer lower bound of . By the martingale version of the Hoeffding (1963) inequality (Azuma, 1967),
By (7.1), Xm = nℓ̇(βo) if and only if . Thus, the left-hand side of (3.15) is no greater than P(|Xm| > nC0x) ≤ e^{−nx²/2}.
Proof of Lemma 4.2. For integers j, m, i1, …, im, let #(j; i1, …, im) be the number of appearances of j in the sequence {i1, …, im}. Since fi,j are degenerate,
This is due to the fact that all terms with exactly one appearance of an index j have zero expectation and all other terms are bounded by 1. Let E0 be the expectation under which i1, …, i2m are iid uniform variables in {1, …, n} and kj = #(j; i1, …, i2m). Since (k1, …, kn) is multinomial(2m, 1/n, …, 1/n), the above inequality can be written as
Let and λ = t/(1 + t/3). It follows from the above moment inequality that
Since , we find Ef0(±λ²Vn) ≤ e^{nλt/2}. Consequently, the monotonicity of f(x) = cosh(x^{1/2}) for x > 0 and the lower bound f(x) ≥ −1 allow us to apply the Markov inequality as follows:
The conclusion follows from 2 max_{0≤x≤1} (1 + x)/(1 + x²)² ≤ 2.221.
Proof of Theorem 3.1. Let θ̃ = β̂ − βo ≠ 0 and b = θ̃/|θ̃|1. It follows from the convexity of ℓ(βo + xb), as a function of x, and Lemma 3.1 that, in the event |ℓ̇(βo)|∞ ≤ (ξ − 1)/(ξ + 1)λ,
(7.2) |
for x ∈ [0, |θ̃|1] and b ∈ 𝒞 (ξ, 𝒪). Consider all nonnegative x satisfying (7.2). We need to establish a lower bound for
Since ηxb = max0≤s≤1 maxi, j |xb′Zi(s) − xb′Zj(s)| ≤ Kx|b|1 = Kx, Lemma 3.2 yields
(7.3) |
This, combined with (7.2) and the definition of κ(ξ, 𝒪), gives
In other words, any x satisfying (7.2) must satisfy
(7.4) |
Since b′{ℓ̇(βo + xb) − ℓ̇(βo)} is an increasing function of x due to the convexity of ℓ, the set of all nonnegative x satisfying (7.2) is a closed interval [0, x̃] for some x̃ > 0. Thus, (7.4) implies Kx̃ ≤ η, the smaller solution of ηe−η = τ. This yields
in (3.10). The first part of (3.10) follows from (3.3), (3.8), (3.12) and (3.1), due to
Finally, it follows from the definition of Fq(ξ, 𝒪), (7.3) and (7.2) that, for x = |θ̃|1,
This gives the second inequality in (3.11) due to |β̂ − βo|q = |θ̃|1|b|q.
Proof of Theorem 4.1. Let
It follows from Lemma 3.3 (i) with ai(s) = a(s) and C0 = 1 that
Thus, P{max_{j,k} |(ℓ̈(βo; t*) − Σ̅(t*))j,k| ≥ K²Ln(p(p + 1)/ε)} ≤ ε by the union bound and the respective definitions of ℓ̈(βo; t*) and Σ̅(t*) in (4.2) and (4.3). Consequently, by (4.2) and Lemma 4.1 (iii) and (ii),
(7.5) |
Let us take the sample mean of i-indexed quantities with weights , so that Z̄n(t; M) is the sample mean of Zi(t). Since Vn(t, βo) Rn (t, βo) = Ĝn (t; ∞),
Thus, by the definition of Σ̅(t*; M) in (4.4) and Lemma 4.1 (iii),
(7.6) |
In addition, the relationship between the sample second moment and variance gives
by the definition of Gn (t; M) and Ĝn (t; M), so that (4.4) can be written as
(7.7) |
We first bound the second term on the right-hand side of (7.7). Define
Since Yi(t) is non-increasing in t,
(7.8) |
Since Rn (t*, M) is the average of iid variables uniformly bounded by M and ERn(t*, M) = r*, the Hoeffding (1963) inequality gives
Since Δ(t; M) is an average of iid mean zero vectors, is a degenerate V-statistic for each (j, k). Moreover, since the summands of these V-statistics are all bounded by K²Λ0(t*), Lemma 4.2 yields
Thus, by (7.7), (7.8), the above two probability bounds and Lemma 4.1 (ii),
(7.9) |
with at least probability .
Finally, by (4.5), is an average of iid matrices with mean Σ(t*; M) and the summands of are uniformly bounded by K²Λ0(t*), so that the Hoeffding (1963) inequality gives
By (7.5), (7.6), (7.9), the above inequality with t = Ln (p(p + 1)/ε) and Lemma 4.1 (ii),
with at least probability . Since
by Lemma 4.1 (i) and the definition in (3.5), the conclusion follows.
Footnotes
Supported in part by NIH Grants R01CA120988, R01CA142774 and NSF Grants DMS-0805670 and DMS-1208225
Supported in part by NIH Grant R35GM047845 and NSF Grant SES1123698
Supported in part by NSF Grants DMS 0906420, DMS 1106753 and DMS 1209014, and the NSA Grant H98230-11-1-0205
References
- 1. Azuma K. Weighted sums of certain dependent random variables. Tohoku Math. J. 1967;19:357–367.
- 2. Andersen PK, Gill RD. Cox's regression model for counting processes: a large sample study. Ann. Statist. 1982;10:1100–1120.
- 3. Bickel PJ, Ritov Y, Tsybakov AB. Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist. 2009;37:1705–1732.
- 4. Bradic J, Fan J, Jiang J. Regularization for Cox's proportional hazards model with NP-dimensionality. Ann. Statist. 2011;39:3092–3120. doi: 10.1214/11-AOS911.
- 5. Bunea F, Tsybakov A, Wegkamp M. Sparsity oracle inequalities for the Lasso. Electron. J. Stat. 2007;1:169–194.
- 6. Cai T, Wang L, Xu G. Shifting inequality and recovery of sparse signals. IEEE Trans. Signal Process. 2010;58:1300–1308.
- 7. Candès E, Tao T. Decoding by linear programming. IEEE Trans. Inform. Theory. 2005;51:4203–4215.
- 8. Chen SS, Donoho DL, Saunders MA. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 1998;20:33–61.
- 9. Cox DR. Regression models and life-tables (with discussion). J. Roy. Statist. Soc. Ser. B. 1972;34:187–220.
- 10. de la Peña VH. A general class of exponential inequalities for martingales and ratios. Ann. Probab. 1999;27:537–564.
- 11. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann. Statist. 2004;32:407–451.
- 12. Fan J. Comments on "Wavelets in statistics: A review" by A. Antoniadis. J. Italian Statist. Soc. 1997;6:131–138.
- 13. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 2001;96:1348–1360.
- 14. Fan J, Li R. Variable selection for Cox's proportional hazards model and frailty model. Ann. Statist. 2002;30:74–99.
- 15. Fan J, Peng H. On non-concave penalized likelihood with diverging number of parameters. Ann. Statist. 2004;32:928–961.
- 16. Gaïffas S, Guilloux A. High dimensional additive hazards models and the Lasso. Electron. J. Stat. 2012;6:522–546.
- 17. Greenshtein E, Ritov Y. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli. 2004;10:971–988.
- 18. Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics. 2005;21:3001–3008. doi: 10.1093/bioinformatics/bti422.
- 19. Hjort NL, Pollard D. Asymptotics for minimisers of convex processes. Preprint, Yale University; 1993.
- 20. Hoeffding W. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 1963;58:13–30.
- 21. Huang J, Zhang C-H. Estimation and selection via absolute penalized convex minimization. J. Mach. Learn. Res. 2012;13:1839–1864.
- 22. Koltchinskii V. The Dantzig selector and sparsity oracle inequalities. Bernoulli. 2009;15:799–828.
- 23. Kong S, Nan B. Non-asymptotic oracle inequalities for the high-dimensional Cox regression via Lasso. 2012. arXiv:1204.1992. doi: 10.5705/ss.2012.240.
- 24. Lemler S. Oracle inequalities for the Lasso for the conditional hazard rate in a high-dimensional setting. 2012. arXiv:1206.5628.
- 25. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the Lasso. Ann. Statist. 2006;34:1436–1462.
- 26. Meinshausen N, Yu B. Lasso-type recovery of sparse representations for high-dimensional data. Ann. Statist. 2009;37:246–270.
- 27. Negahban S, Ravikumar P, Wainwright M, Yu B. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In: Proceedings of the NIPS Conference, Vancouver, Canada; December 2009. 2010.
- 28. Rothman AJ, Bickel PJ, Levina E, Zhu J. Sparse permutation invariant covariance estimation. Electron. J. Stat. 2008;2:494–515.
- 29. Tibshirani R. Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B. 1996;58:267–288.
- 30. Tibshirani R. The Lasso method for variable selection in the Cox model. Stat. Med. 1997;16:385–395. doi: 10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3.
- 31. Tsiatis AA. A large sample study of Cox's regression model. Ann. Statist. 1981;9:93–108.
- 32. van de Geer S. On non-asymptotic bounds for estimation in generalized linear models with highly correlated design. Lecture Notes–Monograph Series. 2007;55:121–134.
- 33. van de Geer S. High-dimensional generalized linear models and the Lasso. Ann. Statist. 2008;36:614–645.
- 34. van de Geer S, Bühlmann P. On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 2009;3:1360–1392.
- 35. Ye F, Zhang C-H. Rate minimaxity of the Lasso and Dantzig selector for the ℓq loss in ℓr balls. J. Mach. Learn. Res. 2010;11:3519–3540.
- 36. Zhang C-H, Huang J. The sparsity and bias of the Lasso selection in high-dimensional linear regression. Ann. Statist. 2008;36:1567–1594.
- 37. Zhang HH, Lu W. Adaptive Lasso for Cox's proportional hazards model. Biometrika. 2007;94:691–703.
- 38. Zhang T. On the consistency of feature selection using greedy least squares regression. J. Mach. Learn. Res. 2009;10:555–568.
- 39. Zhao P, Yu B. On model selection consistency of Lasso. J. Mach. Learn. Res. 2006;7:2541–2563.
- 40. Zhang C-H, Zhang T. A general theory of concave regularization for high-dimensional sparse estimation problems. Statist. Sci. 2012;27:576–593.