Abstract
We consider finite sample properties of the regularized high-dimensional Cox regression via lasso. The existing literature focuses on linear models or generalized linear models with Lipschitz loss functions, where the empirical risk functions are summations of independent and identically distributed (iid) losses. The summands in the negative log partial likelihood function for censored survival data, however, are neither iid nor Lipschitz. We first approximate the negative log partial likelihood function by a sum of iid non-Lipschitz terms, and then derive non-asymptotic oracle inequalities for the lasso penalized Cox regression, using pointwise arguments to tackle the difficulties caused by the lack of iid Lipschitz losses.
Keywords and phrases: Cox regression, finite sample, lasso, oracle inequality, variable selection
1. Introduction
Since it was introduced by Tibshirani (1996), the lasso regularization method for high-dimensional regression models with sparse coefficients has received a great deal of attention in the literature. Properties of interest for such regression models include finite sample oracle inequalities. Among the extensive literature on the lasso method, Bunea, Tsybakov, and Wegkamp (2007) and Bickel, Ritov, and Tsybakov (2009) derived oracle inequalities for prediction risk and estimation error in a general nonparametric regression model, including high-dimensional linear regression as a special case, and van de Geer (2008) provided oracle inequalities for generalized linear models with Lipschitz loss functions, e.g., logistic regression and classification with hinge loss. Bunea (2008) and Bach (2010) also considered the lasso regularized logistic regression. For censored survival data, the lasso penalty has been applied to regularized Cox regression, see e.g. Tibshirani (1997) and Gui and Li (2005), among others. Recently, Bradic, Fan, and Jiang (2011) studied the asymptotic properties of the lasso regularized Cox model. However, to the best of our knowledge, its finite sample non-asymptotic statistical properties have not yet been established in the literature, largely because the partial likelihood does not yield iid Lipschitz losses. Nonetheless, the lasso approach has been studied extensively for other survival models, see e.g. Martinussen and Scheike (2009) and Gaiffas and Guilloux (2012), among others, for the additive hazards model.
We consider the non-asymptotic statistical properties of the lasso regularized high-dimensional Cox regression. Let T be the survival time and C the censoring time. Suppose we observe a sequence of iid observations (Xi, Yi, Δi), i = 1, …, n, where Xi = (Xi1, ⋯, Xim) are the m-dimensional covariates in 𝒳, Yi = Ti ∧ Ci, and Δi = I{Ti≤Ci}. Due to a large amount of parallel material, we follow closely the notation in van de Geer (2008). Let

$$f_\theta(x) = \sum_{k=1}^{m} \theta_k x_k, \qquad \theta \in \Theta \subseteq \mathbb{R}^m,\ x \in \mathcal{X}.$$
Consider the Cox model (Cox (1972)):

$$\lambda(t \mid X) = \lambda_0(t)\, e^{f_\theta(X)},$$

where θ is the parameter of interest and λ0 is the unknown baseline hazard function. The negative log partial likelihood function for θ is
$$l_n(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\Delta_i\left\{ f_\theta(X_i) - \log\left[\frac{1}{n}\sum_{j=1}^{n} 1(Y_j \ge Y_i)\, e^{f_\theta(X_j)}\right]\right\}. \tag{1.1}$$
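For concreteness, a minimal numerical sketch of evaluating (1.1) with the linear predictor fθ(x) = θᵀx is given below; the NumPy implementation and array names are illustrative assumptions, not code from the paper.

```python
import numpy as np

def neg_log_partial_likelihood(theta, X, y, delta):
    """Negative log partial likelihood l_n(theta) as in (1.1).

    theta : (m,) coefficient vector
    X     : (n, m) covariate matrix
    y     : (n,) observed times Y_i = T_i ^ C_i
    delta : (n,) event indicators Delta_i = 1{T_i <= C_i}
    """
    n = X.shape[0]
    f = X @ theta                                        # linear predictor f_theta(X_i)
    at_risk = (y[None, :] >= y[:, None]).astype(float)   # at_risk[i, j] = 1{Y_j >= Y_i}
    log_avg_risk = np.log(at_risk @ np.exp(f) / n)       # log of the averaged risk-set sum
    return -np.mean(delta * (f - log_avg_risk))
```

Note that the risk-set average inside the logarithm couples all n observations, which is the source of the non-iid structure discussed next.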
The corresponding estimator with lasso penalty is denoted by

$$\hat\theta_n = \arg\min_{\theta\in\Theta}\left\{ l_n(\theta) + \lambda_n \sum_{k=1}^{m}\sigma_k|\theta_k|\right\},$$

where Σ_{k=1}^m σk|θk| is the weighted l1 norm of the vector θ ∈ Rm. van de Geer (2008) considered σk to be the square root of the second moment of the k-th covariate Xk, either at the population level (fixed) or at the sample level (random). For normalized Xk, σk = 1. We consider fixed weights σk, k = 1, ⋯, m. The results for random weights can be obtained from the case with fixed weights following van de Geer (2008), and we leave the detailed calculation to interested readers.
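A hedged computational sketch for the penalized estimator above, based on proximal gradient descent, is given below; this is not the algorithm used in the paper, and the fixed step size, iteration count, and function names are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, a):
    """Elementwise soft-thresholding: the proximal map of the weighted l1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - a, 0.0)

def lasso_cox(X, y, delta, lam, sigma, step=0.01, n_iter=5000):
    """Proximal gradient descent for l_n(theta) + lam * sum_k sigma_k |theta_k|."""
    n, m = X.shape
    at_risk = (y[None, :] >= y[:, None]).astype(float)   # at_risk[i, j] = 1{Y_j >= Y_i}
    theta = np.zeros(m)
    for _ in range(n_iter):
        w = np.exp(X @ theta)                            # exp(f_theta(X_j))
        denom = at_risk @ w                              # risk-set sums S_i
        risk_weight = at_risk.T @ (delta / denom)        # sum_i Delta_i 1{Y_j >= Y_i} / S_i
        grad = -(X.T @ (delta - risk_weight * w)) / n    # gradient of the neg. log partial likelihood
        theta = soft_threshold(theta - step * grad, step * lam * sigma)
    return theta
```

In practice one would add a convergence criterion and a line search (or use a dedicated coordinate-descent solver), and select λn either at the order suggested by the theory in Section 3 or by cross-validation.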
Clearly the negative log partial likelihood (1.1) is a sum of non-iid random variables. For ease of calculation, consider an intermediate function as a “replacement” of the negative log partial likelihood function
$$\tilde l_n(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\Delta_i\left\{ f_\theta(X_i) - \log \mu(Y_i; f_\theta)\right\} \tag{1.2}$$
that has the iid structure, but with an unknown population expectation

$$\mu(t; f_\theta) = E\left[1(Y \ge t)\, e^{f_\theta(X)}\right].$$
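To see the iid structure of (1.2) concretely: once μ(·; fθ) is fixed, the i-th summand depends only on (Xi, Yi, Δi). The purely illustrative sketch below approximates the unknown expectation by Monte Carlo, assuming (for illustration only) access to a large independent sample (X_mc, y_mc) from the same distribution.

```python
import numpy as np

def mu_hat(t, theta, X_mc, y_mc):
    """Monte Carlo approximation of mu(t; f_theta) = E[1(Y >= t) exp(f_theta(X))]."""
    return np.mean((y_mc >= t) * np.exp(X_mc @ theta))

def intermediate_loss(theta, X, y, delta, X_mc, y_mc):
    """Average of the iid losses in (1.2)."""
    f = X @ theta
    # one value of mu per observed time Y_i; each summand uses only (X_i, Y_i, Delta_i)
    mu = np.array([mu_hat(t, theta, X_mc, y_mc) for t in y])
    return -np.mean(delta * (f - np.log(mu)))
```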
The negative log partial likelihood function (1.1) can then be viewed as a “working” model for the empirical loss function (1.2). The corresponding loss function is
$$\gamma_{f_\theta}(X, Y, \Delta) = -\Delta\left\{ f_\theta(X) - \log \mu(Y; f_\theta)\right\}, \tag{1.3}$$
with expected loss
$$P\gamma_{f_\theta} = E\left[\gamma_{f_\theta}(X, Y, \Delta)\right], \tag{1.4}$$
where P denotes the distribution of (Y, Δ, X). Define the target function f̅ as

$$\bar f := f_{\bar\theta},$$

where θ̅ = arg minθ∈Θ Pγfθ. It is well known that Pγfθ is convex with respect to θ for the regular Cox model, see, for example, Andersen and Gill (1982); thus the above minimum is unique if the Fisher information matrix of θ at θ̅ is non-singular. Define the excess risk of f by

$$\mathcal{E}(f) := P\gamma_{f} - P\gamma_{\bar f}.$$
It is desirable to establish non-asymptotic oracle inequalities for the Cox regression model similar to those in, for example, van de Geer (2008) for generalized linear models. That is, with large probability,

$$\mathcal{E}(f_{\hat\theta_n}) \;\le\; \mathrm{const} \cdot \min_{\theta\in\Theta}\bigl\{\mathcal{E}(f_\theta) + \mathcal{V}_\theta\bigr\}.$$

Here 𝒱θ is called the "estimation error", which is typically proportional to λn² times the number of nonzero elements in θ.
Note that the summands in the negative log partial likelihood function (1.1) are not iid, and the intermediate loss function γ(·, Y, Δ) given in (1.3) is not Lipschitz. Hence the general result of van de Geer (2008), which requires iid Lipschitz loss functions, does not apply to the Cox regression. We tackle the problem using pointwise arguments to obtain oracle bounds for two types of errors: one between the empirical loss (1.2) and the expected loss (1.4), obtained without the Lipschitz requirement of van de Geer (2008), and the other between the negative log partial likelihood (1.1) and the empirical loss (1.2), which establishes the iid approximation of the non-iid losses. These steps distinguish our work from that of van de Geer (2008); we rely on the Mean Value Theorem, with van de Geer's Lipschitz condition replaced by the similar, but much less restrictive, boundedness assumption for regression parameters in Bühlmann (2006).
The article is organized as follows. In Section 2, we state the assumptions that are used throughout the paper. In Section 3, we define several useful quantities, followed by the main result. We then provide a detailed proof in Section 4 through a series of lemmas and corollaries useful for deriving the oracle inequalities for the Cox model. To avoid duplicating material as much as possible, we refer in places to the preliminaries and results of van de Geer (2008) in the proofs without repeating the details.
2. Assumptions
We impose five basic assumptions. Let ‖·‖ be the L2(P) norm and ‖·‖∞ the sup norm.
Assumption A. Km ≔ max1≤k≤m{‖Xk‖∞/σk} < ∞.
Assumption B. There exist an η > 0 and a strictly convex, increasing function G such that for all θ ∈ Θ with ‖fθ − f̅‖∞ ≤ η, one has ℰ(fθ) ≥ G(‖fθ − f̅‖).
In particular, G can be chosen as a quadratic function with some constant C0, i.e., G(u) = u²/C0; the convex conjugate of G, denoted by H and satisfying uv ≤ G(u) + H(v), is then also quadratic.
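For instance, with G(u) = u²/C0 the conjugate can be computed directly (a routine verification, not taken from the paper):

$$H(v) = \sup_{u\ge 0}\Bigl\{uv - \frac{u^2}{C_0}\Bigr\} = \frac{C_0 v^2}{4}, \qquad \text{attained at } u = \frac{C_0 v}{2},$$

so that uv ≤ u²/C0 + C0v²/4 for all u, v ≥ 0, i.e., uv ≤ G(u) + H(v) with both G and H quadratic.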
Assumption C. There exists a function D(·) on the subsets of the index set {1, ⋯, m} such that for all 𝒦 ⊂ {1, ⋯, m} and all θ, θ̃ ∈ Θ, we have

$$\sum_{k\in\mathcal{K}} \sigma_k\,\bigl|\theta_k - \tilde\theta_k\bigr| \;\le\; \sqrt{D(\mathcal{K})}\;\bigl\|f_\theta - f_{\tilde\theta}\bigr\|.$$

Here, D(𝒦) is chosen to be the cardinality of 𝒦.
Assumption D. Lm ≔ sup_{θ∈Θ} Σ_{k=1}^m |θk| < ∞.
Assumption E. The observation time stops at a finite time τ > 0, with ξ ≔ P(Y ≥ τ) > 0.
Assumptions A, B, and C are identical to those in van de Geer (2008) with her ψk the identity function. Assumptions B and C can be easily verified for the random design setting where X is random (van de Geer (2008)) together with the usual assumption of non-singular Fisher information matrix at θ̅ (and its neighborhood) for the Cox model. Assumption D has a similar flavor to the assumption (A2) in Bühlmann (2006) for the persistency property of boosting method in high-dimensional linear regression models, but is much less restrictive in the sense that Lm is allowed to depend on m in contrast with the fixed constant in Bühlmann (2006). Here it replaces the Lipschitz assumption in van de Geer (2008). Assumption E is commonly used for survival models with censored data, see for example, Andersen and Gill (1982). A straightforward extension of Assumption E is to allow τ (thus ξ) to depend on n.
From Assumptions A and D, we have, for any θ ∈ Θ,
$$e^{f_\theta(X_i)} \;\le\; e^{K_m L_m \sigma_{(m)}} \;=:\; U_m \tag{2.1}$$
for all i, where σ(m) = max1≤k≤m σk. Note that Um is allowed to depend on m.
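For completeness, (2.1) can be verified in one line from Assumptions A and D as stated above (under the ℓ1-bound reading of Assumption D):

$$\bigl|f_\theta(X_i)\bigr| = \Bigl|\sum_{k=1}^{m}\theta_k X_{ik}\Bigr| \le \Bigl(\max_{1\le k\le m}\|X_k\|_\infty\Bigr)\sum_{k=1}^{m}|\theta_k| \le K_m\,\sigma_{(m)}\,L_m,$$

since ‖Xk‖∞ ≤ Km σk ≤ Km σ(m) by Assumption A; exponentiating gives e^{fθ(Xi)} ≤ e^{Km Lm σ(m)} = Um.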
3. Main result
Let I(θ) ≔ Σ_{k=1}^m σk|θk| be the weighted l1 norm of θ. For any θ and θ̃ in Θ, denote

$$I_1(\theta \mid \tilde\theta) := \sum_{k:\,\tilde\theta_k \neq 0} \sigma_k|\theta_k|, \qquad I_2(\theta \mid \tilde\theta) := I(\theta) - I_1(\theta \mid \tilde\theta).$$
Consider the estimator

$$\hat\theta_n = \arg\min_{\theta\in\Theta}\bigl\{ l_n(\theta) + \lambda_n I(\theta)\bigr\}.$$
3.1. Useful quantities
We first define a set of useful quantities that are involved in the oracle inequalities.
a̅n = 4an, .
r1 > 0, b > 0, d > 1, and 1 > δ > 0 are arbitrary constants.
.
- , where
λn ≔ (1 + b)λ̅n,0.
δ1 = (1 + b)^{−N1} and δ2 = (1 + b)^{−N2} are arbitrary constants for some N1 and N2, where N1 ∈ N ≔ {1, 2, …} and N2 ∈ N ∪ {0}.
.
W is a fixed constant given in Lemma 4.3 for a class of empirical processes.
Dθ ≔ D({k : θk ≠ 0, k = 1, …, m}) is the number of nonzero θk’s, where D(·) is given in Assumption C.
, where H is the convex conjugate of function G defined in Assumption B.
.
.
.
In the above, the dependence on the sample size n is through 𝒱θ, which involves the tuning parameter λn. We also impose conditions as in van de Geer (2008):
Condition I(b, δ). .
Condition II(b, δ, d). .
In both conditions, η is given in Assumption B.
3.2. Oracle inequalities
We now provide our theorem on oracle inequalities for the lasso estimator in the Cox model, with a detailed proof given in the next section. The key idea of the proof is to bound two quantities: the empirical process error associated with the working model (1.2), and the error of approximating the negative log partial likelihood (1.1) by (1.2); these are denoted Zθ and Rθ, respectively, in the next section.
Theorem 3.1. Suppose Assumptions A-E and Conditions I(b, δ) and II(b, δ, d) hold. With
we have, with probability at least
that
4. Proofs
4.1. Preparations
Denote the empirical probability measure based on the sample {(Xi, Yi, Δi) : i = 1, …, n} by Pn. Let ε1, ⋯, εn be a Rademacher sequence, independent of the training data (X1, Y1, Δ1), ⋯, (Xn, Yn, Δn). For some fixed θ* ∈ Θ and some M > 0, denote ℱM ≔ {fθ : θ ∈ Θ, I(θ − θ*) ≤ M}. Later we take θ* to be the oracle θ*n, which is the case of interest. For any θ with I(θ − θ*) ≤ M, denote

$$Z_\theta(M) := \bigl|(P_n - P)\bigl(\gamma_{f_\theta} - \gamma_{f_{\theta^*}}\bigr)\bigr|.$$
Note that van de Geer (2008) sought to bound the supremum of such differences over ℱM, for which the contraction theorem of Ledoux and Talagrand (1991) (Theorem A.3 in van de Geer (2008)), valid for Lipschitz functions, was needed. That calculation does not apply to the Cox model because the Lipschitz property is lacking. However, the pointwise argument is adequate for our purpose because only the lasso estimator θ̂n, or the difference between θ̂n and the oracle, is of interest. Note the notational difference between an arbitrary θ* in the above Zθ(M) and the oracle θ*n.
Lemma 4.1. Under Assumptions A, D, and E, for all θ satisfying I(θ − θ*) ≤ M, we have EZθ(M) ≤ a̅nM.
Proof. By the symmetrization theorem, see e.g. van der Vaart and Wellner (1996) or Theorem A.2 in van de Geer (2008), for a class of only one function we have
For A we have
Applying Lemma A.1 in van de Geer (2008), we obtain
Thus we have
(4.1) |
For B, instead of using the contraction theorem that requires Lipschitz, we use the Mean Value Theorem:
where θ** is between θ and θ*, and
(4.2) |
satisfying
Since for all i,
following Lemma A.1 in van de Geer (2008), we obtain
(4.3) |
Combining (4.1) and (4.3), the upper bound for EZθ(M) is achieved.
We can now bound the tail probability of Zθ(M) using Bousquet's concentration theorem, stated as Theorem A.1 in van de Geer (2008).
Corollary 4.1. Under Assumptions A, D, and E, for all M > 0, r1 > 0 and all θ satisfying I(θ − θ*) ≤ M, it holds that
Proof. Using the triangle inequality and the Mean Value Theorem, we obtain
where θ** is between θ and θ*, and Fθ**(k, Y) is defined in (4.2). So we have
Therefore, in view of Bousquet’s concentration theorem and Lemma 4.1, for all M > 0 and r1 > 0,
Now for any θ satisfying I(θ − θ*) ≤ M, we bound
Here recall that τ is given in Assumption E. By the Mean Value Theorem, we have
(4.4) |
where θ** is between θ and θ* and, by (2.1), we have
(4.5) |
Lemma 4.2. Under Assumption E, we have
Proof. This is obtained directly from the inequality of Massart (1990) for the Kolmogorov statistic.
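For reference, the inequality being invoked is the Dvoretzky–Kiefer–Wolfowitz bound with Massart's tight constant: for the empirical distribution function 𝔽n of an iid sample of size n from a distribution function F, and any λ > 0,

$$P\Bigl(\sup_{t\in\mathbb{R}}\bigl|\mathbb{F}_n(t) - F(t)\bigr| > \lambda\Bigr) \;\le\; 2\,e^{-2n\lambda^2}.$$

Applied to the distribution of Y, this controls the deviation of n⁻¹Σᵢ 1(Yᵢ ≥ τ) from ξ = P(Y ≥ τ).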
Lemma 4.3. Under Assumptions A, D, and E, for all θ we have
(4.6) |
where W is a fixed constant.
Proof. For the class of functions indexed by t, ℱ = {1(y ≥ t)e^{fθ(x)}/Um : t ∈ [0, τ], y ∈ R, e^{fθ(x)} ≤ Um}, we calculate its bracketing number. For any ε satisfying 0 < ε < 1, let ti be the i-th of the ⌈1/ε⌉ quantiles of Y, where ⌈x⌉ is the smallest integer that is greater than or equal to x. Furthermore, take t0 = 0 and t⌈1/ε⌉ = +∞. For i = 1, ⋯, ⌈1/ε⌉, define brackets [Li, Ui] with

$$L_i(x, y) = \frac{1(y \ge t_i)\, e^{f_\theta(x)}}{U_m}, \qquad U_i(x, y) = \frac{1(y \ge t_{i-1})\, e^{f_\theta(x)}}{U_m},$$
such that Li(x, y) ≤ 1(y ≥ t)e^{fθ(x)}/Um ≤ Ui(x, y) when ti−1 < t ≤ ti. Since
we have , which yields
where . Thus, from Theorem 2.14.9 in van der Vaart and Wellner (1996), we have for any r > 0,
where W is a constant that only depends on K. Note that r²e^{−r²} is bounded by e^{−1}. With , we obtain (4.6).
Lemma 4.4. Under Assumptions A, D, and E, for all θ we have
(4.7) |
Proof. Consider the classes of functions indexed by t,
Using the argument in the proof of Lemma 4.3, we have
where , and then for any r > 0,
Thus we have
Let , so . Since
, we obtain (4.7).
Corollary 4.2. Under Assumptions A, D, and E, for all M > 0, r1 > 0, and all θ that satisfy I(θ − θ*) ≤ M, we have
(4.8) |
Proof. From (4.4) and (4.5) we have
where the events E1, E2 and E3 are defined as
Thus
and the result follows from Lemmas 4.2, 4.3 and 4.4.
Now with , we have the following results.
Lemma 4.5. Suppose Conditions I(b, δ) and II(b, δ, d) are met. Under Assumptions B and C, for all θ ∈ Θ with , it holds that
Proof. The proof is exactly the same as that of Lemma A.4 in van de Geer (2008), with the λn defined in Subsection 3.1.
Lemma 4.6. Suppose Conditions I(b, δ) and II(b, δ, d) are met. Consider any random θ̃ ∈ Θ with . Let 1 < d0 ≤ db. It holds that
Proof. The idea is similar to the proof of Lemma A.5 in van de Geer (2008). Let ℰ̃ = ℰ(fθ̃) and . We will use short notation: and . Since , on the set where and , we have
(4.9) |
By (4.8) we know that is bounded by with probability at least , then we have
Since I(θ̃) = I1(θ̃) + I2(θ̃) and , using the triangular inequality, we obtain
(4.10) |
Adding to both sides and from Lemma 4.5,
Because 0 < δ < 1, it follows that
Hence,
which yields the desired result.
Corollary 4.3. Suppose Conditions I(b, δ) and II(b, δ, d) are met. Consider any random θ̃ ∈ Θ with . Let 1 < d0 ≤ db. It holds that
Proof. Repeat Lemma 4.6 N times.
Lemma 4.7. Suppose Conditions I(b, δ) and II(b, δ, d) hold. If , where
then for any integer N, with probability at least
we have
Proof. Since the negative log partial likelihood ln(θ) and the lasso penalty are both convex with respect to θ, applying Corollary 4.3, we obtain the above inequality. This proof is similar to the proof of Lemma A.6 in van de Geer (2008).
Lemma 4.8. Suppose Conditions I(b, δ) and II(b, δ, d) are met. Let N1 ∈ N ≔ {1, 2, …} and N2 ∈ N ∪ {0}. With δ1 = (1 + b)^{−N1} and δ2 = (1 + b)^{−N2}, for any n, with probability at least
we have
where
Proof. The proof is the same as that of Lemma A.7 in van de Geer (2008), with a slightly different probability bound.
4.2. Proof of Theorem 3.1
Proof. The proof follows the same ideas as the proof of Theorem A.4 in van de Geer (2008), except for the pointwise arguments and slightly different probability bounds. Since this is our main result, we provide a detailed proof here despite the substantial overlap.
Define ℰ̂ ≔ ℰ(fθ̂n) and ; use the notation and ; set c ≔ δb/(1 − δ2). Consider the cases (a) c < d(δ1, δ2) and (b) c ≥ d(δ1, δ2).
(a) c < d(δ1, δ2). Let J be an integer satisfying (1 + b)^{J−1}c ≤ d(δ1, δ2) and (1 + b)^{J}c > d(δ1, δ2). We consider the cases (a1) and (a2) .
(a1) If , then
for some j ∈ {1, ⋯, J}. Let d0 = c(1 + b)^{j−1} ≤ d(δ1, δ2) ≤ db. From Corollary 4.1, with probability at least we have .
Since , from (4.9) we have
By (4.8), is bounded by with probability at least
Then we have
Since , and , by triangular inequality we obtain . From Lemma 4.5, . Hence, .
(a2) If , from (4.10) with d0 = c, with probability at least
we have
By the triangular inequality, Lemma 4.5 and (A4),
Hence,
Furthermore, by Lemma 4.8, we have with probability at least
that , where
(b) c ≥ d(δ1, δ2). On the set where , from equation (4.10) we have with probability at least
that
which is the same as (a2) and leads to the same result.
To summarize, let
Note that
Under case (a), we have
Under case (b),
We thus obtain the desired result.
Contributor Information
Shengchun Kong, Email: kongsc@umich.edu.
Bin Nan, Email: bnan@umich.edu.
References
- Andersen PK, Gill RD. Cox's regression model for counting processes: a large sample study. Ann. Statist. 1982;10:1100–1120.
- Bach F. Self-concordant analysis for logistic regression. Electronic Journal of Statistics. 2010;4:384–414.
- Bickel P, Ritov Y, Tsybakov A. Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 2009;37:1705–1732.
- Bradic J, Fan J, Jiang J. Regularization for Cox's proportional hazards model with NP-dimensionality. Ann. Statist. 2011;39:3092–3120. doi: 10.1214/11-AOS911.
- Bühlmann P. Boosting for high-dimensional linear models. Ann. Statist. 2006;34:559–583.
- Bunea F. Honest variable selection in linear and logistic regression models via ℓ1 and ℓ1+ℓ2 penalization. Electronic Journal of Statistics. 2008;2:1153–1194.
- Bunea F, Tsybakov AB, Wegkamp MH. Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics. 2007;1:169–194.
- Cox DR. Regression models and life tables (with discussion). J. Roy. Statist. Soc. B. 1972;34:187–220.
- Gaiffas S, Guilloux A. High-dimensional additive hazards models and the lasso. Electronic Journal of Statistics. 2012;6:522–546.
- Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics. 2005;21:3001–3008. doi: 10.1093/bioinformatics/bti422.
- Ledoux M, Talagrand M. Probability in Banach Spaces: Isoperimetry and Processes. Berlin: Springer; 1991.
- Martinussen T, Scheike TH. Covariate selection for the semiparametric additive risk model. Scandinavian Journal of Statistics. 2009;36:602–619.
- Massart P. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Ann. Probab. 1990;18:1269–1283.
- Tarigan B, van de Geer S. Classifiers of support vector machine type with ℓ1 complexity regularization. Bernoulli. 2006;12:1045–1076.
- Tibshirani R. Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. B. 1996;58:267–288.
- Tibshirani R. The Lasso method for variable selection in the Cox model. Statistics in Medicine. 1997;16:385–395. doi: 10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3.
- van de Geer S. High-dimensional generalized linear models and the lasso. Ann. Statist. 2008;36:614–645.
- van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes: With Applications to Statistics. New York: Springer; 1996.