Published in final edited form as: Stat. Sin. 2014;24(1):25–42. doi:10.5705/ss.2012.240

Non-Asymptotic Oracle Inequalities for the High-Dimensional Cox Regression via Lasso

Shengchun Kong 1, Bin Nan 1,*

Abstract

We consider finite sample properties of the regularized high-dimensional Cox regression via lasso. The existing literature focuses on linear models or generalized linear models with Lipschitz loss functions, where the empirical risk functions are summations of independent and identically distributed (iid) losses. The summands in the negative log partial likelihood function for censored survival data, however, are neither iid nor Lipschitz. We first approximate the negative log partial likelihood function by a sum of iid non-Lipschitz terms, then derive the non-asymptotic oracle inequalities for the lasso penalized Cox regression, using pointwise arguments to tackle the difficulties caused by the lack of iid Lipschitz losses.

Keywords and phrases: Cox regression, finite sample, lasso, oracle inequality, variable selection

1. Introduction

Since it was introduced by Tibshirani (1996), the lasso regularization method for high-dimensional regression models with sparse coefficients has received a great deal of attention in the literature. Properties of interest for such regression models include finite sample oracle inequalities. In the extensive literature on the lasso method, Bunea, Tsybakov, and Wegkamp (2007) and Bickel, Ritov, and Tsybakov (2009) derived oracle inequalities for prediction risk and estimation error in a general nonparametric regression model, including high-dimensional linear regression as a special case, and van de Geer (2008) provided oracle inequalities for generalized linear models with Lipschitz loss functions, e.g., logistic regression and classification with hinge loss. Bunea (2008) and Bach (2010) also considered the lasso regularized logistic regression. For censored survival data, the lasso penalty has been applied to the regularized Cox regression; see e.g. Tibshirani (1997) and Gui and Li (2005), among others. Recently, Bradic, Fan, and Jiang (2011) studied the asymptotic properties of the lasso regularized Cox model. To the best of our knowledge, however, its finite sample non-asymptotic statistical properties have not yet been established in the literature, largely due to the lack of iid Lipschitz losses in the partial likelihood. The lasso approach has nonetheless been studied extensively for other survival models; see e.g. Martinussen and Scheike (2009) and Gaiffas and Guilloux (2012), among others, for the additive hazards model.

We consider the non-asymptotic statistical properties of the lasso regularized high-dimensional Cox regression. Let T be the survival time and C the censoring time. Suppose we observe a sequence of iid observations (X_i, Y_i, Δ_i), i = 1, …, n, where X_i = (X_{i1}, ⋯, X_{im}) is the m-dimensional covariate vector in 𝒳, Y_i = T_i ∧ C_i, and Δ_i = 1{T_i ≤ C_i}. Due to a large amount of parallel material, we follow closely the notation in van de Geer (2008). Let

\[
\mathcal{F} = \Big\{ f_\theta(x) = \sum_{k=1}^m \theta_k x_k,\; \theta \in \Theta \subseteq \mathbb{R}^m \Big\}.
\]

Consider the Cox model (Cox (1972)):

\[
\lambda(t \mid X) = \lambda_0(t)\, e^{f_\theta(X)},
\]

where θ is the parameter of interest and λ0 is the unknown baseline hazard function. The negative log partial likelihood function for θ is

\[
l_n(\theta) = -\frac{1}{n}\sum_{i=1}^n \Big\{ f_\theta(X_i) - \log\Big[\frac{1}{n}\sum_{j=1}^n 1(Y_j \ge Y_i)\, e^{f_\theta(X_j)}\Big]\Big\}\Delta_i. \tag{1.1}
\]

The corresponding estimator with lasso penalty is denoted by

\[
\hat\theta_n \equiv \arg\min_{\theta \in \Theta} \big\{ l_n(\theta) + \lambda_n I(\theta) \big\},
\]

where $I(\theta) \equiv \sum_{k=1}^m \sigma_k|\theta_k|$ is the weighted ℓ1 norm of the vector θ ∈ R^m. van de Geer (2008) considered σ_k to be the square root of the second moment of the k-th covariate X_k, either at the population level (fixed) or at the sample level (random). For normalized X_k, σ_k = 1. We consider fixed weights σ_k, k = 1, ⋯, m. The results for random weights can be easily obtained from the case with fixed weights following van de Geer (2008), and we leave the detailed calculation to interested readers.
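For readers who prefer a computational view, the following minimal numpy sketch (ours, not from the paper; the function names and the toy data-generating step are purely illustrative assumptions) evaluates the criterion $l_n(\theta) + \lambda_n I(\theta)$ defined above.

```python
import numpy as np

def neg_log_partial_likelihood(theta, X, Y, Delta):
    # l_n(theta) in (1.1): X is n x m, Y the observed times, Delta the event indicators.
    n = X.shape[0]
    eta = X @ theta                             # f_theta(X_i) = sum_k theta_k * X_ik
    at_risk = (Y[None, :] >= Y[:, None]).astype(float)  # at_risk[i, j] = 1(Y_j >= Y_i)
    risk_avg = at_risk @ np.exp(eta) / n        # (1/n) sum_j 1(Y_j >= Y_i) exp(f_theta(X_j))
    return -np.mean((eta - np.log(risk_avg)) * Delta)

def lasso_objective(theta, X, Y, Delta, lam, sigma):
    # l_n(theta) + lambda_n * I(theta), with I the sigma_k-weighted l1 norm
    return neg_log_partial_likelihood(theta, X, Y, Delta) + lam * np.sum(sigma * np.abs(theta))

# Toy censored data, only to show the call signature.
rng = np.random.default_rng(0)
n, m = 50, 10
X = rng.uniform(-1.0, 1.0, size=(n, m))         # bounded covariates
T = rng.exponential(scale=np.exp(-X[:, 0]))     # hazard depends on the first covariate only
C = rng.exponential(scale=1.0, size=n)          # independent censoring times
Y, Delta = np.minimum(T, C), (T <= C).astype(float)
print(lasso_objective(np.zeros(m), X, Y, Delta, lam=0.1, sigma=np.ones(m)))
```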

Clearly the negative log partial likelihood (1.1) is a sum of non-iid random variables. For ease of calculation, consider an intermediate function as a “replacement” of the negative log partial likelihood function

\[
\tilde l_n(\theta) = -\frac{1}{n}\sum_{i=1}^n \big\{ f_\theta(X_i) - \log \mu(Y_i; f_\theta) \big\}\Delta_i \tag{1.2}
\]

that has the iid structure, but with an unknown population expectation

\[
\mu(t; f_\theta) = E_{X,Y}\big\{ 1(Y \ge t)\, e^{f_\theta(X)} \big\}.
\]

The negative log partial likelihood function (1.1) can then be viewed as a “working” model for the empirical loss function (1.2). The corresponding loss function is

\[
\gamma_{f_\theta} = \gamma\big(f_\theta(X), Y, \Delta\big) \equiv -\big\{ f_\theta(X) - \log \mu(Y; f_\theta) \big\}\Delta, \tag{1.3}
\]

with expected loss

\[
l(\theta) = -E_{Y,\Delta,X}\big[\big\{ f_\theta(X) - \log \mu(Y; f_\theta)\big\}\Delta\big] = P\gamma_{f_\theta}, \tag{1.4}
\]

where P denotes the distribution of (Y, Δ, X). Define the target function as

\[
\bar f \equiv \arg\min_{f \in \mathcal{F}} P\gamma_f \equiv f_{\bar\theta},
\]

where $\bar\theta = \arg\min_{\theta\in\Theta} P\gamma_{f_\theta}$. It is well known that Pγ_{f_θ} is convex with respect to θ for the regular Cox model, see for example Andersen and Gill (1982); thus the above minimum is unique if the Fisher information matrix of θ at θ̄ is non-singular. Define the excess risk of f by

\[
\mathcal{E}(f) \equiv P\gamma_f - P\gamma_{\bar f}.
\]

It is desirable to show similar non-asymptotic oracle inequalities for the Cox regression model as in, for example, van de Geer (2008) for generalized linear models. That is, with large probability,

\[
\mathcal{E}(f_{\hat\theta_n}) \le \mathrm{const.} \times \min_{\theta\in\Theta}\big\{\mathcal{E}(f_\theta) + \mathcal{V}_\theta\big\}.
\]

Here 𝒱_θ is called the "estimation error," which is typically proportional to $\lambda_n^2$ times the number of nonzero elements of θ.

Note that the summands in the negative log partial likelihood function (1.1) are not iid, and the intermediate loss function γ(·, Y, Δ) given in (1.3) is not Lipschitz. Hence the general result of van de Geer (2008), which requires iid Lipschitz loss functions, does not apply to the Cox regression. We tackle the problem using pointwise arguments to obtain oracle bounds for two types of errors: one between the empirical loss (1.2) and the expected loss (1.4), which avoids the Lipschitz requirement of van de Geer (2008), and one between the negative log partial likelihood (1.1) and the empirical loss (1.2), which establishes the iid approximation of the non-iid losses. These steps distinguish our work from that of van de Geer (2008); we rely on the Mean Value Theorem, with van de Geer's Lipschitz condition replaced by the similar, but much less restrictive, boundedness assumption for regression parameters in Bühlmann (2006).
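In other words, the proof controls the deviation of the partial likelihood criterion from its population counterpart through the decomposition

\[
l_n(\theta) - l(\theta) = \big[l_n(\theta) - \tilde l_n(\theta)\big] + \big[\tilde l_n(\theta) - l(\theta)\big],
\]

applied to differences at θ and at a fixed reference point θ*; the two brackets correspond to the quantities $R_\theta$ and $Z_\theta$ of Section 4, respectively.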

The article is organized as follows. In Section 2, we provide the assumptions that are used throughout the paper. In Section 3, we define several useful quantities, followed by the main result. We then provide a detailed proof in Section 4 by introducing a series of lemmas and corollaries useful for deriving the oracle inequalities for the Cox model. To avoid duplicating material as much as possible, we refer to the preliminaries and some results in van de Geer (2008) at various places in the proofs without providing much detail.

2. Assumptions

We impose five basic assumptions. Let ‖·‖ be the L2(P) norm and ‖·‖∞ the sup norm.

Assumption A. $K_m \equiv \max_{1\le k\le m} \|X_k/\sigma_k\|_\infty < \infty$.

Assumption B. There exist an η > 0 and a strictly convex increasing function G such that for all θ ∈ Θ with $\|f_\theta - \bar f\|_\infty \le \eta$, one has $\mathcal{E}(f_\theta) \ge G(\|f_\theta - \bar f\|)$.

In particular, G can be chosen as a quadratic function with some constant C_0, i.e., G(u) = u²/C_0; then the convex conjugate of G, denoted by H, which satisfies uv ≤ G(u) + H(v), is also quadratic.
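To spell out the quadratic case (a routine computation, included here for convenience; 𝒱_θ is the estimation error defined in Subsection 3.1):

\[
G(u) = \frac{u^2}{C_0} \;\Longrightarrow\; H(v) = \sup_{u \ge 0}\{uv - G(u)\} = \frac{C_0 v^2}{4},
\qquad\text{so}\qquad
\mathcal{V}_\theta = 2\delta H\Big(\frac{2\lambda_n\sqrt{D_\theta}}{\delta}\Big) = \frac{2C_0\lambda_n^2 D_\theta}{\delta},
\]

which is what makes the estimation error proportional to $\lambda_n^2$ times the number of nonzero coefficients, as claimed in the Introduction.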

Assumption C. There exists a function D(·) on the subsets of the index set {1, ⋯, m} such that for all 𝒦 ⊂ {1, ⋯, m} and for all θ ∈ Θ and θ̃ ∈ Θ, we have $\sum_{k\in\mathcal{K}} \sigma_k|\theta_k - \tilde\theta_k| \le \sqrt{D(\mathcal{K})}\,\|f_\theta - f_{\tilde\theta}\|$. Here, D(𝒦) is chosen to be the cardinality of 𝒦.

Assumption D. $L_m \equiv \sup_{\theta\in\Theta} \sum_{k=1}^m |\theta_k| < \infty$.

Assumption E. The observation time stops at a finite time τ > 0, with ξ ≔ P(Y ≥ τ) > 0.

Assumptions A, B, and C are identical to those in van de Geer (2008) with her ψ_k the identity function. Assumptions B and C can be easily verified in the random design setting where X is random (van de Geer (2008)), together with the usual assumption of a non-singular Fisher information matrix at θ̄ (and in its neighborhood) for the Cox model. Assumption D has a similar flavor to assumption (A2) in Bühlmann (2006) for the persistency property of the boosting method in high-dimensional linear regression models, but is much less restrictive in the sense that L_m is allowed to depend on m, in contrast with the fixed constant in Bühlmann (2006). Here it replaces the Lipschitz assumption in van de Geer (2008). Assumption E is commonly used for survival models with censored data; see for example Andersen and Gill (1982). A straightforward extension of Assumption E is to allow τ (thus ξ) to depend on n.

From Assumptions A and D, we have, for any θ ∈ Θ,

\[
e^{|f_\theta(X_i)|} \le e^{K_m L_m \sigma_{(m)}} \equiv U_m < \infty \tag{2.1}
\]

for all i, where $\sigma_{(m)} \equiv \max_{1\le k\le m} \sigma_k$. Note that U_m is allowed to depend on m.
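Indeed, (2.1) is just the chain of bounds obtained by combining the two assumptions:

\[
|f_\theta(X_i)| = \Big|\sum_{k=1}^m \theta_k X_{ik}\Big| \le \sum_{k=1}^m \sigma_k|\theta_k|\,\Big\|\frac{X_k}{\sigma_k}\Big\|_\infty \le K_m\,\sigma_{(m)} \sum_{k=1}^m |\theta_k| \le K_m L_m \sigma_{(m)}.
\]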

3. Main result

Let $I(\theta) \equiv \sum_{k=1}^m \sigma_k|\theta_k|$ be the weighted ℓ1 norm of θ. For any θ and θ̃ in Θ, denote

\[
I_1(\theta \mid \tilde\theta) \equiv \sum_{k:\,\tilde\theta_k \ne 0} \sigma_k|\theta_k|, \qquad I_2(\theta \mid \tilde\theta) \equiv I(\theta) - I_1(\theta \mid \tilde\theta).
\]

Consider the estimator

\[
\hat\theta_n \equiv \arg\min_{\theta\in\Theta}\big\{l_n(\theta) + \lambda_n I(\theta)\big\}.
\]

3.1. Useful quantities

We first define a set of useful quantities that are involved in the oracle inequalities.

  • $\bar a_n = 4a_n$, where $a_n = \sqrt{2K_m^2\log(2m)/n} + K_m\log(2m)/n$.

  • r1 > 0, b > 0, d > 1, and 1 > δ > 0 are arbitrary constants.

  • $d_b \equiv d\,\big(b + d(d-1)b^{-1}\big)$.

  • $\bar\lambda_{n,0} = \bar\lambda_{n,0}^A + \bar\lambda_{n,0}^B$, where
\[
\bar\lambda_{n,0}^A \equiv \bar\lambda_{n,0}^A(r_1) \equiv \bar a_n\Big(1 + 2r_1\sqrt{2(K_m^2 + \bar a_n K_m)} + \frac{4r_1^2\bar a_n K_m}{3}\Big), \qquad
\bar\lambda_{n,0}^B \equiv \bar\lambda_{n,0}^B(r_1) \equiv \frac{2K_m U_m^2}{\xi}\Big(2\bar a_n r_1 + \sqrt{\frac{\log(2m)}{n}}\Big).
\]
  • $\lambda_n \equiv (1 + b)\bar\lambda_{n,0}$.

  • $\delta_1 = (1 + b)^{-N_1}$ and $\delta_2 = (1 + b)^{-N_2}$, where $N_1 \in N \equiv \{1, 2, \ldots\}$ and $N_2 \in N \cup \{0\}$ are arbitrary.

  • $d(\delta_1,\delta_2) = 1 + \dfrac{1+(d^2-1)\delta_1}{(d-1)(1-\delta_1)}\,\delta_2$.

  • W is a fixed constant given in Lemma 4.3 for a class of empirical processes.

  • $D_\theta \equiv D(\{k : \theta_k \ne 0,\ k = 1, \ldots, m\})$ is the number of nonzero elements of θ, where D(·) is given in Assumption C.

  • $\mathcal{V}_\theta \equiv 2\delta H\big(2\lambda_n\sqrt{D_\theta}/\delta\big)$, where H is the convex conjugate of the function G defined in Assumption B.

  • $\theta_n^* \equiv \arg\min_{\theta\in\Theta}\{\mathcal{E}(f_\theta) + \mathcal{V}_\theta\}$.

  • $\varepsilon_n^* \equiv (1+\delta)\,\mathcal{E}(f_{\theta_n^*}) + \mathcal{V}_{\theta_n^*}$.

  • $\zeta_n^* \equiv \varepsilon_n^*/\bar\lambda_{n,0}$.

  • $\theta(\varepsilon_n^*) \equiv \arg\min_{\theta\in\Theta,\, I(\theta-\theta_n^*) \le d_b\zeta_n^*/b}\big\{\delta\,\mathcal{E}(f_\theta) - 2\lambda_n I_1(\theta-\theta_n^* \mid \theta_n^*)\big\}$.

In the above, the dependence of θn* on the sample size n is through 𝒱θ that involves the tuning parameter λn. We also impose conditions as in van de Geer (2008):

Condition I(b, δ). $\|f_{\theta_n^*} - \bar f\|_\infty \le \eta$.

Condition II(b, δ, d). $\|f_{\theta(\varepsilon_n^*)} - \bar f\|_\infty \le \eta$.

In both conditions, η is given in Assumption B.
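To make the interplay among these constants concrete, the following sketch (our own illustration; the chosen values of n, m, K_m, U_m, ξ, r_1, and b are arbitrary and carry no statistical meaning) evaluates $a_n$, $\bar a_n$, $\bar\lambda_{n,0}^A$, $\bar\lambda_{n,0}^B$, and $\lambda_n$:

```python
import numpy as np

def tuning_quantities(n, m, K_m, U_m, xi, r1, b):
    # a_n and bar(a)_n from Subsection 3.1
    a_n = np.sqrt(2 * K_m**2 * np.log(2 * m) / n) + K_m * np.log(2 * m) / n
    abar_n = 4 * a_n
    # the two components of bar(lambda)_{n,0}
    lamA = abar_n * (1 + 2 * r1 * np.sqrt(2 * (K_m**2 + abar_n * K_m))
                     + 4 * r1**2 * abar_n * K_m / 3)
    lamB = (2 * K_m * U_m**2 / xi) * (2 * abar_n * r1 + np.sqrt(np.log(2 * m) / n))
    lam_n = (1 + b) * (lamA + lamB)             # lambda_n = (1 + b) * bar(lambda)_{n,0}
    return {"a_n": a_n, "abar_n": abar_n, "lamA": lamA, "lamB": lamB, "lambda_n": lam_n}

for name, val in tuning_quantities(n=500, m=1000, K_m=1.0, U_m=np.e, xi=0.5, r1=1.0, b=1.0).items():
    print(f"{name} = {val:.4f}")
```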

3.2. Oracle inequalities

We now provide our theorem on oracle inequalities for the Cox model lasso estimator, with a detailed proof given in the next section. The key idea of the proof is to bound two differences: one between the empirical and expected losses of the working model (1.2), and one between the negative log partial likelihood (1.1) and the working empirical loss (1.2); these are denoted Z_θ and R_θ, respectively, in the next section.

Theorem 3.1. Suppose Assumptions A-E and Conditions I(b, δ) and II(b, δ, d) hold. With

\[
\Delta(b,\delta,\delta_1,\delta_2) \equiv \max\Big\{\frac{d(\delta_1,\delta_2)(1-\delta^2)}{\delta b},\; 1\Big\},
\]

we have, with probability at least

\[
1 - \log_{1+b}\Big\{\frac{(1+b)^2\,\Delta(b,\delta,\delta_1,\delta_2)}{\delta_1\delta_2}\Big\}\Big\{\Big(1+\frac{3}{10}W^2\Big)\exp\big(-n\bar a_n^2r_1^2\big) + 2\exp\big(-n\xi^2/2\big)\Big\},
\]

that

\[
\mathcal{E}(f_{\hat\theta_n}) \le \frac{1}{1-\delta}\,\varepsilon_n^* \qquad\text{and}\qquad I(\hat\theta_n - \theta_n^*) \le \frac{d(\delta_1,\delta_2)\,\zeta_n^*}{b}.
\]

4. Proofs

4.1. Preparations

Denote the empirical probability measure based on the sample {(X_i, Y_i, Δ_i) : i = 1, …, n} by P_n. Let ε_1, ⋯, ε_n be a Rademacher sequence, independent of the training data (X_1, Y_1, Δ_1), ⋯, (X_n, Y_n, Δ_n). For some fixed θ* ∈ Θ and some M > 0, denote ℱ_M ≔ {f_θ : θ ∈ Θ, I(θ − θ*) ≤ M}. Later we take θ* = θ_n^*, which is the case of interest. For any θ with I(θ − θ*) ≤ M, denote

\[
Z_\theta(M) \equiv \big|(P_n - P)\big[\gamma_{f_\theta} - \gamma_{f_{\theta^*}}\big]\big| = \big|[\tilde l_n(\theta) - l(\theta)] - [\tilde l_n(\theta^*) - l(\theta^*)]\big|.
\]

Note that van de Geer (2008) sought to bound $\sup_{\theta:\,I(\theta-\theta^*)\le M} Z_\theta(M)$; thus the contraction theorem of Ledoux and Talagrand (1991) (Theorem A.3 in van de Geer (2008)), which holds for Lipschitz functions, was needed. That calculation does not apply to the Cox model due to the lack of the Lipschitz property. However, the pointwise argument is adequate for our purpose, because only the lasso estimator θ̂_n, or the difference between the lasso estimator θ̂_n and the oracle θ_n^*, is of interest. Note the notational difference between an arbitrary θ* in the above Z_θ(M) and the oracle θ_n^*.

Lemma 4.1. Under Assumptions A, D, and E, for all θ satisfying I(θ − θ*) ≤ M, we have EZθ(M) ≤ a̅nM.

Proof. By the symmetrization theorem (see e.g. van der Vaart and Wellner (1996), or Theorem A.2 in van de Geer (2008)), applied to a class consisting of a single function, we have

\[
EZ_\theta(M) \le 2E\Big(\Big|\frac{1}{n}\sum_{i=1}^n\varepsilon_i\big\{[f_\theta(X_i) - \log\mu(Y_i;f_\theta)]\Delta_i - [f_{\theta^*}(X_i) - \log\mu(Y_i;f_{\theta^*})]\Delta_i\big\}\Big|\Big)
\]
\[
\le 2E\Big(\Big|\frac{1}{n}\sum_{i=1}^n\varepsilon_i\{f_\theta(X_i) - f_{\theta^*}(X_i)\}\Delta_i\Big|\Big) + 2E\Big(\Big|\frac{1}{n}\sum_{i=1}^n\varepsilon_i\{\log\mu(Y_i;f_\theta) - \log\mu(Y_i;f_{\theta^*})\}\Delta_i\Big|\Big) \equiv A + B.
\]

For A we have

\[
A \le 2\Big(\sum_{k=1}^m\sigma_k|\theta_k - \theta_k^*|\Big)\,E\Big(\max_{1\le k\le m}\Big|\frac{1}{n}\sum_{i=1}^n\varepsilon_i\Delta_iX_{ik}/\sigma_k\Big|\Big).
\]

Applying Lemma A.1 in van de Geer (2008), we obtain

\[
E\Big(\max_{1\le k\le m}\Big|\frac{1}{n}\sum_{i=1}^n\frac{\varepsilon_i\Delta_iX_{ik}}{\sigma_k}\Big|\Big) \le a_n.
\]

Thus we have

\[
A \le 2a_nM. \tag{4.1}
\]

For B, instead of using the contraction theorem, which requires the Lipschitz property, we use the Mean Value Theorem:

\[
\Big|\frac{1}{n}\sum_{i=1}^n\varepsilon_i\{\log\mu(Y_i;f_\theta) - \log\mu(Y_i;f_{\theta^*})\}\Delta_i\Big|
= \Big|\frac{1}{n}\sum_{i=1}^n\varepsilon_i\Delta_i\sum_{k=1}^m\frac{1}{\mu(Y_i;f_{\theta^{**}})}\iint_{\{y\ge Y_i\}\times\mathcal{X}}(\theta_k - \theta_k^*)\,x_k\,e^{f_{\theta^{**}}(x)}\,dP_{X,Y}(x,y)\Big|
\]
\[
= \Big|\sum_{k=1}^m\sigma_k(\theta_k - \theta_k^*)\,\frac{1}{n}\sum_{i=1}^n\frac{\varepsilon_i\Delta_i}{\mu(Y_i;f_{\theta^{**}})\,\sigma_k}\iint_{\{y\ge Y_i\}\times\mathcal{X}}x_k\,e^{f_{\theta^{**}}(x)}\,dP_{X,Y}(x,y)\Big|
\]
\[
\le \Big(\sum_{k=1}^m\sigma_k|\theta_k - \theta_k^*|\Big)\max_{1\le k\le m}\Big|\frac{1}{n}\sum_{i=1}^n\varepsilon_i\Delta_iF_{\theta^{**}}(k,Y_i)\Big|
\le M\max_{1\le k\le m}\Big|\frac{1}{n}\sum_{i=1}^n\varepsilon_i\Delta_iF_{\theta^{**}}(k,Y_i)\Big|,
\]

where θ** is between θ and θ*, and

\[
F_{\theta^{**}}(k,t) = \frac{E\big[1(Y\ge t)\,X_k\,e^{f_{\theta^{**}}(X)}\big]}{\mu(t;f_{\theta^{**}})\,\sigma_k} \tag{4.2}
\]

satisfying

\[
|F_{\theta^{**}}(k,t)| \le \frac{\|X_k/\sigma_k\|_\infty\,E\big[1(Y\ge t)\,e^{f_{\theta^{**}}(X)}\big]}{\mu(t;f_{\theta^{**}})} \le K_m.
\]

Since for all i,

\[
E[\varepsilon_i\Delta_iF_{\theta^{**}}(k,Y_i)] = 0, \qquad |\varepsilon_i\Delta_iF_{\theta^{**}}(k,Y_i)| \le K_m, \qquad\text{and}\qquad \frac{1}{n}\sum_{i=1}^nE[\varepsilon_i\Delta_iF_{\theta^{**}}(k,Y_i)]^2 \le \frac{1}{n}\sum_{i=1}^nE[F_{\theta^{**}}(k,Y_i)]^2 \le K_m^2,
\]

following Lemma A.1 in van de Geer (2008), we obtain

\[
B \le 2a_nM. \tag{4.3}
\]

Combining (4.1) and (4.3), the upper bound for EZθ(M) is achieved.
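The maximal inequality $E\max_{1\le k\le m}|\frac{1}{n}\sum_i\varepsilon_i\Delta_iX_{ik}/\sigma_k| \le a_n$ used above is easy to probe numerically. The Monte Carlo sketch below is our own check, under the simplifying assumptions σ_k = 1 and |X_ik| ≤ 1 (so K_m = 1), with an arbitrary 0/1 censoring indicator:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, reps = 200, 100, 2000
K_m = 1.0                                       # |X_ik / sigma_k| <= 1 here
a_n = np.sqrt(2 * K_m**2 * np.log(2 * m) / n) + K_m * np.log(2 * m) / n

vals = np.empty(reps)
for r in range(reps):
    X = rng.uniform(-1.0, 1.0, size=(n, m))     # bounded covariates
    Delta = rng.integers(0, 2, size=n)          # generic 0/1 indicators
    eps = rng.choice([-1.0, 1.0], size=n)       # Rademacher sequence
    vals[r] = np.abs((eps * Delta) @ X / n).max()

print(f"Monte Carlo estimate {vals.mean():.4f} <= a_n = {a_n:.4f}")
```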

We can now bound the tail probability of Z_θ(M) using Bousquet's concentration theorem, stated as Theorem A.1 in van de Geer (2008).

Corollary 4.1. Under Assumptions A, D, and E, for all M > 0, r1 > 0 and all θ satisfying I(θ − θ*) ≤ M, it holds that

\[
P\big(Z_\theta(M) \ge \bar\lambda_{n,0}^AM\big) \le \exp\big(-n\bar a_n^2r_1^2\big).
\]

Proof. Using the triangle inequality and the Mean Value Theorem, we obtain

\[
|\gamma_{f_\theta} - \gamma_{f_{\theta^*}}| \le |f_\theta(X) - f_{\theta^*}(X)|\,\Delta + |\log\mu(Y;f_\theta) - \log\mu(Y;f_{\theta^*})|\,\Delta
\le \sum_{k=1}^m\sigma_k|\theta_k - \theta_k^*|\,\frac{|X_k|}{\sigma_k} + |\log\mu(Y;f_\theta) - \log\mu(Y;f_{\theta^*})|
\]
\[
\le MK_m + \sum_{k=1}^m\sigma_k|\theta_k - \theta_k^*|\cdot\max_{1\le k\le m}|F_{\theta^{**}}(k,Y)| \le 2MK_m,
\]

where θ** is between θ and θ*, and Fθ**(k, Y) is defined in (4.2). So we have

\[
\|\gamma_{f_\theta} - \gamma_{f_{\theta^*}}\|_\infty \le 2MK_m, \qquad\text{and}\qquad P\big(\gamma_{f_\theta} - \gamma_{f_{\theta^*}}\big)^2 \le 4M^2K_m^2.
\]

Therefore, in view of Bousquet’s concentration theorem and Lemma 4.1, for all M > 0 and r1 > 0,

\[
P\Big(Z_\theta(M) \ge \bar a_nM\Big(1 + 2r_1\sqrt{2(K_m^2 + \bar a_nK_m)} + \frac{4r_1^2\bar a_nK_m}{3}\Big)\Big) \le \exp\big(-n\bar a_n^2r_1^2\big).
\]

Now for any θ satisfying I(θ − θ*) ≤ M, we bound

\[
R_\theta(M) := \big|[l_n(\theta) - \tilde l_n(\theta)] - [l_n(\theta^*) - \tilde l_n(\theta^*)]\big| \le \frac{1}{n}\sum_{i=1}^n\Bigg|\log\frac{\frac{1}{n}\sum_{j=1}^n1(Y_j\ge Y_i)\,e^{f_\theta(X_j)}}{\mu(Y_i;f_\theta)} - \log\frac{\frac{1}{n}\sum_{j=1}^n1(Y_j\ge Y_i)\,e^{f_{\theta^*}(X_j)}}{\mu(Y_i;f_{\theta^*})}\Bigg|\Delta_i
\]
\[
\le \sup_{0\le t\le\tau}\Bigg|\log\frac{\frac{1}{n}\sum_{j=1}^n1(Y_j\ge t)\,e^{f_\theta(X_j)}}{\mu(t;f_\theta)} - \log\frac{\frac{1}{n}\sum_{j=1}^n1(Y_j\ge t)\,e^{f_{\theta^*}(X_j)}}{\mu(t;f_{\theta^*})}\Bigg|.
\]

Here recall that τ is given in Assumption E. By the Mean Value Theorem, we have

\[
R_\theta(M) \le \sup_{0\le t\le\tau}\Bigg|\sum_{k=1}^m(\theta_k - \theta_k^*)\Bigg\{\frac{\sum_{j=1}^n1(Y_j\ge t)\,e^{f_{\theta^{**}}(X_j)}}{\mu(t;f_{\theta^{**}})}\Bigg\}^{-1}\Bigg\{\frac{\sum_{j=1}^n1(Y_j\ge t)\,X_{jk}\,e^{f_{\theta^{**}}(X_j)}}{\mu(t;f_{\theta^{**}})} - \frac{\sum_{j=1}^n1(Y_j\ge t)\,e^{f_{\theta^{**}}(X_j)}\,E\big[1(Y\ge t)\,X_k\,e^{f_{\theta^{**}}(X)}\big]}{\mu(t;f_{\theta^{**}})^2}\Bigg\}\Bigg|
\]
\[
= \sup_{0\le t\le\tau}\Bigg|\sum_{k=1}^m\sigma_k(\theta_k - \theta_k^*)\Bigg\{\frac{\sum_{j=1}^n1(Y_j\ge t)\,(X_{jk}/\sigma_k)\,e^{f_{\theta^{**}}(X_j)}}{\sum_{j=1}^n1(Y_j\ge t)\,e^{f_{\theta^{**}}(X_j)}} - \frac{E\big[1(Y\ge t)\,(X_k/\sigma_k)\,e^{f_{\theta^{**}}(X)}\big]}{E\big[1(Y\ge t)\,e^{f_{\theta^{**}}(X)}\big]}\Bigg\}\Bigg|
\]
\[
\le M\,\sup_{0\le t\le\tau}\Big[\frac{1}{n}\sum_{i=1}^n1(Y_i\ge t)\,e^{f_{\theta^{**}}(X_i)}\Big]^{-1}\sup_{0\le t\le\tau}\Bigg\{\max_{1\le k\le m}\Big|\frac{1}{n}\sum_{i=1}^n1(Y_i\ge t)\,(X_{ik}/\sigma_k)\,e^{f_{\theta^{**}}(X_i)} - E\big[1(Y\ge t)\,(X_k/\sigma_k)\,e^{f_{\theta^{**}}(X)}\big]\Big|
\]
\[
\qquad + K_m\Big|\frac{1}{n}\sum_{i=1}^n1(Y_i\ge t)\,e^{f_{\theta^{**}}(X_i)} - E\big[1(Y\ge t)\,e^{f_{\theta^{**}}(X)}\big]\Big|\Bigg\}, \tag{4.4}
\]

where θ** is between θ and θ* and, by (2.1), we have

\[
\sup_{0\le t\le\tau}\Big[\frac{1}{n}\sum_{i=1}^n1(Y_i\ge t)\,e^{f_{\theta^{**}}(X_i)}\Big]^{-1} \le U_m\Big[\frac{1}{n}\sum_{i=1}^n1(Y_i\ge\tau)\Big]^{-1}. \tag{4.5}
\]

Lemma 4.2. Under Assumption E, we have

\[
P\Big(\frac{1}{n}\sum_{i=1}^n1(Y_i\ge\tau) \le \frac{\xi}{2}\Big) \le 2e^{-n\xi^2/2}.
\]

Proof. This is obtained directly from Massart (1990) for the Kolmogorov statistic by taking $r = \xi\sqrt{n}/2$ in the following:

\[
P\Big(\frac{1}{n}\sum_{i=1}^n1(Y_i\ge\tau) \le \frac{\xi}{2}\Big) \le P\Big(\sqrt{n}\,\Big|\frac{1}{n}\sum_{i=1}^n1(Y_i\ge\tau) - \xi\Big| \ge r\Big) \le P\Big(\sup_{0\le t\le\tau}\sqrt{n}\,\Big|\frac{1}{n}\sum_{i=1}^n1(Y_i\ge t) - P(Y\ge t)\Big| \ge r\Big) \le 2e^{-2r^2}.
\]
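A quick simulation (ours; Y is taken exponential purely for illustration, so that ξ = P(Y ≥ τ) is known in closed form) confirms that the empirical tail frequency stays below the Massart bound $2e^{-n\xi^2/2}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, xi = 100, 20000, 0.5
tau = -np.log(xi)                              # P(Y >= tau) = exp(-tau) = xi for Y ~ Exp(1)
Y = rng.exponential(size=(reps, n))
freq = ((Y >= tau).mean(axis=1) <= xi / 2).mean()  # empirical P((1/n) sum 1(Y_i >= tau) <= xi/2)
print(f"empirical frequency {freq:.6f} vs bound {2 * np.exp(-n * xi**2 / 2):.6f}")
```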

Lemma 4.3. Under Assumptions A, D, and E, for all θ we have

\[
P\Big(\sup_{0\le t\le\tau}\Big|\frac{1}{n}\sum_{i=1}^n1(Y_i\ge t)\,e^{f_\theta(X_i)} - \mu(t;f_\theta)\Big| \ge U_m\bar a_nr_1\Big) \le \frac{1}{5}W^2e^{-n\bar a_n^2r_1^2}, \tag{4.6}
\]

where W is a fixed constant.

Proof. For the class of functions indexed by t, $\mathcal{F} = \{1(y \ge t)\,e^{f_\theta(x)}/U_m : t \in [0, \tau],\ y \in \mathbb{R},\ e^{f_\theta(x)} \le U_m\}$, we calculate its bracketing number. For any nontrivial ε satisfying 0 < ε < 1, let $t_i$ be the iε-th quantile of Y, so that

\[
P(Y \le t_i) = i\varepsilon, \qquad i = 1, \ldots, \lceil 1/\varepsilon\rceil - 1,
\]

where ⌈x⌉ is the smallest integer greater than or equal to x. Furthermore, take t_0 = 0 and $t_{\lceil 1/\varepsilon\rceil} = +\infty$. For i = 1, ⋯, ⌈1/ε⌉, define brackets [L_i, U_i] with

\[
L_i(x,y) = 1(y\ge t_i)\,e^{f_\theta(x)}/U_m, \qquad U_i(x,y) = 1(y > t_{i-1})\,e^{f_\theta(x)}/U_m
\]

such that $L_i(x, y) \le 1(y \ge t)\,e^{f_\theta(x)}/U_m \le U_i(x, y)$ when $t_{i-1} < t \le t_i$. Since

\[
\big\{E[U_i - L_i]^2\big\}^{1/2} = \Big\{E\Big[\frac{e^{f_\theta(X)}}{U_m}\big\{1(Y > t_{i-1}) - 1(Y \ge t_i)\big\}\Big]^2\Big\}^{1/2} \le \big\{P(t_{i-1} < Y \le t_i)\big\}^{1/2} \le \sqrt{\varepsilon},
\]

we have $N_{[\,]}(\sqrt{\varepsilon}, \mathcal{F}, L_2(P)) \le \lceil 1/\varepsilon\rceil \le 2/\varepsilon$, which yields

\[
N_{[\,]}\big(\varepsilon, \mathcal{F}, L_2(P)\big) \le 2\varepsilon^{-2} = (K/\varepsilon)^2,
\]

where $K = \sqrt{2}$. Thus, from Theorem 2.14.9 in van der Vaart and Wellner (1996), we have for any r > 0,

\[
P\Big(\sqrt{n}\sup_{0\le t\le\tau}\Big|\frac{1}{n}\sum_{i=1}^n\frac{1(Y_i\ge t)\,e^{f_\theta(X_i)}}{U_m} - \frac{\mu(t;f_\theta)}{U_m}\Big| \ge r\Big) \le \frac{1}{2}W^2r^2e^{-2r^2} \le \frac{1}{5}W^2e^{-r^2},
\]

where W is a constant that depends only on K. Note that $r^2e^{-r^2}$ is bounded by $e^{-1}$. With $r = \sqrt{n}\,\bar a_nr_1$, we obtain (4.6).
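The numerical step at the end is elementary calculus: the map $r \mapsto r^2e^{-r^2}$ attains its maximum $e^{-1}$ at $r^2 = 1$, so

\[
\frac{1}{2}W^2r^2e^{-2r^2} = \frac{1}{2}W^2\big(r^2e^{-r^2}\big)e^{-r^2} \le \frac{W^2}{2e}\,e^{-r^2} \le \frac{1}{5}W^2e^{-r^2},
\]

since $1/(2e) \approx 0.184 < 1/5$.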

Lemma 4.4. Under Assumptions A, D, and E, for all θ we have

\[
P\Big(\sup_{0\le t\le\tau}\max_{1\le k\le m}\Big|\frac{1}{n}\sum_{i=1}^n1(Y_i\ge t)\,\frac{X_{ik}}{\sigma_k}\,e^{f_\theta(X_i)} - E\Big[1(Y\ge t)\,\frac{X_k}{\sigma_k}\,e^{f_\theta(X)}\Big]\Big| \ge K_mU_m\Big[\bar a_nr_1 + \sqrt{\frac{\log(2m)}{n}}\Big]\Big) \le \frac{1}{10}W^2e^{-n\bar a_n^2r_1^2}. \tag{4.7}
\]

Proof. Consider the classes of functions indexed by t,

\[
\mathcal{G}_k = \big\{1(y\ge t)\,e^{f_\theta(x)}x_k/(\sigma_kK_mU_m) : t\in[0,\tau],\ y\in\mathbb{R},\ |e^{f_\theta(x)}x_k/\sigma_k| \le K_mU_m\big\}, \qquad k = 1, \ldots, m.
\]

Using the argument in the proof of Lemma 4.3, we have

\[
N_{[\,]}\big(\varepsilon, \mathcal{G}_k, L_2(P)\big) \le (K/\varepsilon)^2,
\]

where $K = \sqrt{2}$, and then for any r > 0,

\[
P\Big(\sqrt{n}\sup_{0\le t\le\tau}\Big|\frac{1}{n}\sum_{i=1}^n\frac{1(Y_i\ge t)\,e^{f_\theta(X_i)}X_{ik}}{\sigma_kK_mU_m} - E\Big[\frac{1(Y\ge t)\,e^{f_\theta(X)}X_k}{\sigma_kK_mU_m}\Big]\Big| \ge r\Big) \le \frac{1}{5}W^2e^{-r^2}.
\]

Thus we have

\[
P\Big(\sqrt{n}\sup_{0\le t\le\tau}\max_{1\le k\le m}\Big|\frac{1}{n}\sum_{i=1}^n1(Y_i\ge t)\,e^{f_\theta(X_i)}X_{ik}/(\sigma_kU_mK_m) - E\big[1(Y\ge t)\,e^{f_\theta(X)}X_k/(\sigma_kU_mK_m)\big]\Big| \ge r\Big)
\]
\[
\le \sum_{k=1}^mP\Big(\sqrt{n}\sup_{0\le t\le\tau}\Big|\frac{1}{n}\sum_{i=1}^n1(Y_i\ge t)\,e^{f_\theta(X_i)}X_{ik}/(\sigma_kU_mK_m) - E\big[1(Y\ge t)\,e^{f_\theta(X)}X_k/(\sigma_kU_mK_m)\big]\Big| \ge r\Big)
\]
\[
\le m\cdot\frac{1}{5}W^2e^{-r^2} = \frac{1}{10}W^2e^{\log(2m)-r^2}.
\]

Let $\log(2m) - r^2 = -n\bar a_n^2r_1^2$, so that $r = \sqrt{n\bar a_n^2r_1^2 + \log(2m)}$. Since

\[
\sqrt{\bar a_n^2r_1^2 + \frac{\log(2m)}{n}} \le \bar a_nr_1 + \sqrt{\frac{\log(2m)}{n}},
\]

we obtain (4.7).

Corollary 4.2. Under Assumptions A, D, and E, for all M > 0, r1 > 0, and all θ that satisfy I(θ − θ*) ≤ M, we have

\[
P\big(R_\theta(M) \ge \bar\lambda_{n,0}^BM\big) \le 2\exp\big(-n\xi^2/2\big) + \frac{3}{10}W^2\exp\big(-n\bar a_n^2r_1^2\big). \tag{4.8}
\]

Proof. From (4.4) and (4.5) we have

\[
P\big(R_\theta(M) \ge \bar\lambda_{n,0}^B\cdot M\big) \le P\big(E_1\cup E_2\cup E_3\big),
\]

where the events E1, E2 and E3 are defined as

\[
E_1 = \Big\{\frac{1}{n}\sum_{i=1}^n1(Y_i\ge\tau) \le \frac{\xi}{2}\Big\},
\]
\[
E_2 = \Big\{\sup_{0\le t\le\tau}\Big|\frac{1}{n}\sum_{i=1}^n1(Y_i\ge t)\,e^{f_{\theta^{**}}(X_i)} - \mu(t;f_{\theta^{**}})\Big| \ge U_m\bar a_nr_1\Big\},
\]
\[
E_3 = \Big\{\max_{1\le k\le m}\sup_{0\le t\le\tau}\Big|\frac{1}{n}\sum_{i=1}^n1(Y_i\ge t)\,\frac{X_{ik}}{\sigma_k}\,e^{f_{\theta^{**}}(X_i)} - E\Big[1(Y\ge t)\,\frac{X_k}{\sigma_k}\,e^{f_{\theta^{**}}(X)}\Big]\Big| \ge K_mU_m\Big(\bar a_nr_1 + \sqrt{\frac{\log(2m)}{n}}\Big)\Big\}.
\]

Thus

\[
P\big(R_\theta(M) \ge \bar\lambda_{n,0}^B\cdot M\big) \le P(E_1) + P(E_2) + P(E_3),
\]

and the result follows from Lemmas 4.2, 4.3 and 4.4.

Now with θ*=θn*, we have the following results.

Lemma 4.5. Suppose Conditions I(b, δ) and II(b, δ, d) are met. Under Assumptions B and C, for all θ ∈ Θ with $I(\theta - \theta_n^*) \le d_b\zeta_n^*/b$, it holds that

\[
2\lambda_nI_1(\theta - \theta_n^*\mid\theta_n^*) \le \delta\,\mathcal{E}(f_\theta) + \varepsilon_n^* - \mathcal{E}(f_{\theta_n^*}).
\]

Proof. The proof is exactly the same as that of Lemma A.4 in van de Geer (2008), with the λn defined in Subsection 3.1.

Lemma 4.6. Suppose Conditions I(b, δ) and II(b, δ, d) are met. Consider any random θ̃ ∈ Θ with $l_n(\tilde\theta) + \lambda_nI(\tilde\theta) \le l_n(\theta_n^*) + \lambda_nI(\theta_n^*)$. Let $1 < d_0 \le d_b$. It holds that

\[
P\Big(I(\tilde\theta - \theta_n^*) \le d_0\frac{\zeta_n^*}{b}\Big) \le P\Big(I(\tilde\theta - \theta_n^*) \le \Big(\frac{d_0 + b}{1 + b}\Big)\frac{\zeta_n^*}{b}\Big) + \Big(1 + \frac{3}{10}W^2\Big)\exp\big(-n\bar a_n^2r_1^2\big) + 2\exp\big(-n\xi^2/2\big).
\]

Proof. The idea is similar to the proof of Lemma A.5 in van de Geer (2008). Let $\tilde{\mathcal{E}} \equiv \mathcal{E}(f_{\tilde\theta})$ and $\mathcal{E}^* \equiv \mathcal{E}(f_{\theta_n^*})$. We use the short notation $I_1(\theta) \equiv I_1(\theta\mid\theta_n^*)$ and $I_2(\theta) \equiv I_2(\theta\mid\theta_n^*)$. Since $l_n(\tilde\theta) + \lambda_nI(\tilde\theta) \le l_n(\theta_n^*) + \lambda_nI(\theta_n^*)$, on the set where $I(\tilde\theta - \theta_n^*) \le d_0\zeta_n^*/b$ and $Z_{\tilde\theta}(d_0\zeta_n^*/b) \le d_0\zeta_n^*/b\cdot\bar\lambda_{n,0}^A$, we have

\[
R_{\tilde\theta}(d_0\zeta_n^*/b) \ge \big[l_n(\theta_n^*) + \lambda_nI(\theta_n^*)\big] - \big[l_n(\tilde\theta) + \lambda_nI(\tilde\theta)\big] - \lambda_nI(\theta_n^*) + \lambda_nI(\tilde\theta) - \big[\tilde l_n(\theta_n^*) - \tilde l_n(\tilde\theta)\big]
\]
\[
\ge -\lambda_nI(\theta_n^*) + \lambda_nI(\tilde\theta) - \big[\tilde l_n(\theta_n^*) - \tilde l_n(\tilde\theta)\big]
\ge -\lambda_nI(\theta_n^*) + \lambda_nI(\tilde\theta) - \big[l(\theta_n^*) - l(\tilde\theta)\big] - d_0\zeta_n^*/b\cdot\bar\lambda_{n,0}^A
\]
\[
= -\lambda_nI(\theta_n^*) + \lambda_nI(\tilde\theta) - \mathcal{E}^* + \tilde{\mathcal{E}} - d_0\bar\lambda_{n,0}^A\zeta_n^*/b. \tag{4.9}
\]

By (4.8), $R_{\tilde\theta}(d_0\zeta_n^*/b)$ is bounded by $d_0\bar\lambda_{n,0}^B\zeta_n^*/b$ with probability at least $1 - \frac{3}{10}W^2\exp(-n\bar a_n^2r_1^2) - 2\exp(-n\xi^2/2)$; then we have

\[
\tilde{\mathcal{E}} + \lambda_nI(\tilde\theta) \le \bar\lambda_{n,0}^Bd_0\zeta_n^*/b + \mathcal{E}^* + \lambda_nI(\theta_n^*) + \bar\lambda_{n,0}^Ad_0\zeta_n^*/b.
\]

Since $I(\tilde\theta) = I_1(\tilde\theta) + I_2(\tilde\theta)$ and $I(\theta_n^*) = I_1(\theta_n^*)$, using the triangle inequality, we obtain

\[
\tilde{\mathcal{E}} + (1+b)\bar\lambda_{n,0}I_2(\tilde\theta) \le \bar\lambda_{n,0}d_0\zeta_n^*/b + \mathcal{E}^* + (1+b)\bar\lambda_{n,0}I_1(\theta_n^*) - (1+b)\bar\lambda_{n,0}I_1(\tilde\theta)
\le \bar\lambda_{n,0}d_0\zeta_n^*/b + \mathcal{E}^* + (1+b)\bar\lambda_{n,0}I_1(\tilde\theta - \theta_n^*). \tag{4.10}
\]

Adding $(1+b)\bar\lambda_{n,0}I_1(\tilde\theta - \theta_n^*)$ to both sides and applying Lemma 4.5,

\[
\tilde{\mathcal{E}} + (1+b)\bar\lambda_{n,0}I(\tilde\theta - \theta_n^*) \le \bar\lambda_{n,0}d_0\frac{\zeta_n^*}{b} + \mathcal{E}^* + 2(1+b)\bar\lambda_{n,0}I_1(\tilde\theta - \theta_n^*) \le \big(\bar\lambda_{n,0}d_0 + b\bar\lambda_{n,0}\big)\frac{\zeta_n^*}{b} + \delta\tilde{\mathcal{E}} = (d_0 + b)\bar\lambda_{n,0}\frac{\zeta_n^*}{b} + \delta\tilde{\mathcal{E}}.
\]

Because 0 < δ < 1, it follows that

\[
I(\tilde\theta - \theta_n^*) \le \frac{d_0 + b}{1 + b}\cdot\frac{\zeta_n^*}{b}.
\]

Hence,

\[
P\Big(\Big\{I(\tilde\theta - \theta_n^*) \le d_0\frac{\zeta_n^*}{b}\Big\} \cap \Big\{Z_{\tilde\theta}(d_0\zeta_n^*/b) \le d_0\bar\lambda_{n,0}^A\frac{\zeta_n^*}{b}\Big\} \cap \Big\{R_{\tilde\theta}(d_0\zeta_n^*/b) \le d_0\bar\lambda_{n,0}^B\frac{\zeta_n^*}{b}\Big\}\Big) \le P\Big(I(\tilde\theta - \theta_n^*) \le \frac{d_0 + b}{1 + b}\cdot\frac{\zeta_n^*}{b}\Big),
\]

which yields the desired result.

Corollary 4.3. Suppose Conditions I(b, δ) and II(b, δ, d) are met. Consider any random θ̃ ∈ Θ with $l_n(\tilde\theta) + \lambda_nI(\tilde\theta) \le l_n(\theta_n^*) + \lambda_nI(\theta_n^*)$. Let $1 < d_0 \le d_b$. It holds that

\[
P\Big(I(\tilde\theta - \theta_n^*) \le d_0\frac{\zeta_n^*}{b}\Big) \le P\Big(I(\tilde\theta - \theta_n^*) \le \big[1 + (d_0 - 1)(1+b)^{-N}\big]\frac{\zeta_n^*}{b}\Big) + N\Big\{\Big(1 + \frac{3}{10}W^2\Big)\exp\big(-n\bar a_n^2r_1^2\big) + 2\exp\big(-n\xi^2/2\big)\Big\}.
\]

Proof. Repeat Lemma 4.6 N times.
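Each application of Lemma 4.6 replaces the constant $d_0$ by $(d_0 + b)/(1 + b)$; iterating this map contracts the constant geometrically toward 1:

\[
d_j = \frac{d_{j-1} + b}{1 + b} = 1 + \frac{d_{j-1} - 1}{1 + b}, \quad j = 1, \ldots, N, \qquad\Longrightarrow\qquad d_N = 1 + (d_0 - 1)(1 + b)^{-N},
\]

at the price of N additional copies of the probability term.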

Lemma 4.7. Suppose Conditions I(b, δ) and II(b, δ, d) hold. If $\tilde\theta_s = s\hat\theta_n + (1 - s)\theta_n^*$, where

\[
s = \frac{d\zeta_n^*}{d\zeta_n^* + bI(\hat\theta_n - \theta_n^*)},
\]

then for any integer N, with probability at least

\[
1 - N\Big\{\Big(1 + \frac{3}{10}W^2\Big)\exp\big(-n\bar a_n^2r_1^2\big) + 2\exp\big(-n\xi^2/2\big)\Big\},
\]

we have

\[
I(\tilde\theta_s - \theta_n^*) \le \big(1 + (d - 1)(1+b)^{-N}\big)\frac{\zeta_n^*}{b}.
\]

Proof. Since the negative log partial likelihood l_n(θ) and the lasso penalty are both convex with respect to θ, we have $l_n(\tilde\theta_s) + \lambda_nI(\tilde\theta_s) \le s[l_n(\hat\theta_n) + \lambda_nI(\hat\theta_n)] + (1-s)[l_n(\theta_n^*) + \lambda_nI(\theta_n^*)] \le l_n(\theta_n^*) + \lambda_nI(\theta_n^*)$; moreover, the choice of s guarantees $I(\tilde\theta_s - \theta_n^*) = sI(\hat\theta_n - \theta_n^*) \le d\zeta_n^*/b$. Applying Corollary 4.3 with $d_0 = d$, we obtain the above inequality. This proof is similar to the proof of Lemma A.6 in van de Geer (2008).

Lemma 4.8. Suppose Conditions I(b, δ) and II(b, δ, d) are met. Let $N_1 \in N \equiv \{1, 2, \ldots\}$ and $N_2 \in N \cup \{0\}$. With $\delta_1 = (1+b)^{-N_1}$ and $\delta_2 = (1+b)^{-N_2}$, for any n, with probability at least

\[
1 - (N_1 + N_2)\Big\{\Big(1 + \frac{3}{10}W^2\Big)\exp\big(-n\bar a_n^2r_1^2\big) + 2\exp\big(-n\xi^2/2\big)\Big\},
\]

we have

\[
I(\hat\theta_n - \theta_n^*) \le \frac{d(\delta_1, \delta_2)\,\zeta_n^*}{b},
\]

where

\[
d(\delta_1, \delta_2) = 1 + \frac{1 + (d^2 - 1)\delta_1}{(d - 1)(1 - \delta_1)}\,\delta_2.
\]

Proof. The proof is the same as that of Lemma A.7 in van de Geer (2008), with a slightly different probability bound.
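For completeness, the short calculation behind $d(\delta_1, \delta_2)$ runs as follows (this fills in the algebra; nothing here goes beyond van de Geer (2008)). Write $u \equiv bI(\hat\theta_n - \theta_n^*)/\zeta_n^*$, so that $I(\tilde\theta_s - \theta_n^*) = sI(\hat\theta_n - \theta_n^*) = (\zeta_n^*/b)\cdot du/(d + u)$. Lemma 4.7 with $N = N_1$ then gives $du/(d + u) \le 1 + (d - 1)\delta_1$, whence

\[
u \le \frac{d\{1 + (d-1)\delta_1\}}{d - 1 - (d-1)\delta_1} = \frac{d\{1 + (d-1)\delta_1\}}{(d-1)(1-\delta_1)} \equiv d', \qquad d' - 1 = \frac{1 + (d^2-1)\delta_1}{(d-1)(1-\delta_1)};
\]

one further application of Corollary 4.3 to $\hat\theta_n$, with $d_0 = d'$ and $N = N_2$, replaces $d'$ by $1 + (d' - 1)\delta_2 = d(\delta_1, \delta_2)$.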

4.2. Proof of Theorem 3.1

Proof. The proof follows the same ideas as the proof of Theorem A.4 in van de Geer (2008), with the exception of the pointwise arguments and slightly different probability bounds. Since this is our main result, we provide a detailed proof here despite the amount of overlap.

Define $\hat{\mathcal{E}} \equiv \mathcal{E}(f_{\hat\theta_n})$ and $\mathcal{E}^* \equiv \mathcal{E}(f_{\theta_n^*})$; use the notation $I_1(\theta) \equiv I_1(\theta\mid\theta_n^*)$ and $I_2(\theta) \equiv I_2(\theta\mid\theta_n^*)$; set $c \equiv \delta b/(1 - \delta^2)$. Consider the cases (a) $c < d(\delta_1, \delta_2)$ and (b) $c \ge d(\delta_1, \delta_2)$.

(a) c < d1, δ2). Let J be an integer satisfying (1 + b)J−1 cd1, δ2) and (1 + b)J c > d1, δ2). We consider the cases (a1) cζn*/b<I(θ^nθn*)d(δ1,δ2)ζn*/b and (a2) I(θ^nθn*)cζn*/b.

(a1) If $c\zeta_n^*/b < I(\hat\theta_n - \theta_n^*) \le d(\delta_1, \delta_2)\zeta_n^*/b$, then

\[
(1+b)^{j-1}c\,\frac{\zeta_n^*}{b} < I(\hat\theta_n - \theta_n^*) \le (1+b)^jc\,\frac{\zeta_n^*}{b}
\]

for some j ∈ {1, ⋯, J}. Let $d_0 = c(1+b)^{j-1} \le d(\delta_1, \delta_2) \le d_b$. From Corollary 4.1, with probability at least $1 - \exp(-n\bar a_n^2r_1^2)$ we have $Z_{\hat\theta_n}((1+b)d_0\zeta_n^*/b) \le (1+b)d_0\bar\lambda_{n,0}^A\zeta_n^*/b$.

Since $l_n(\hat\theta_n) + \lambda_nI(\hat\theta_n) \le l_n(\theta_n^*) + \lambda_nI(\theta_n^*)$, from (4.9) we have

\[
\hat{\mathcal{E}} + \lambda_nI(\hat\theta_n) \le R_{\hat\theta_n}\Big((1+b)d_0\frac{\zeta_n^*}{b}\Big) + \mathcal{E}^* + \lambda_nI(\theta_n^*) + (1+b)\bar\lambda_{n,0}^Ad_0\frac{\zeta_n^*}{b}.
\]

By (4.8), $R_{\hat\theta_n}((1+b)d_0\zeta_n^*/b)$ is bounded by $(1+b)\bar\lambda_{n,0}^Bd_0\zeta_n^*/b$ with probability at least

\[
1 - \frac{3}{10}W^2\exp\big(-n\bar a_n^2r_1^2\big) - 2\exp\big(-n\xi^2/2\big).
\]

Then we have

\[
\hat{\mathcal{E}} + (1+b)\bar\lambda_{n,0}I(\hat\theta_n) \le (1+b)\bar\lambda_{n,0}^Bd_0\frac{\zeta_n^*}{b} + \mathcal{E}^* + (1+b)\bar\lambda_{n,0}I(\theta_n^*) + (1+b)\bar\lambda_{n,0}^Ad_0\frac{\zeta_n^*}{b}
\le (1+b)\bar\lambda_{n,0}I(\hat\theta_n - \theta_n^*) + \mathcal{E}^* + (1+b)\bar\lambda_{n,0}I(\theta_n^*).
\]

Since $I(\hat\theta_n) = I_1(\hat\theta_n) + I_2(\hat\theta_n)$, $I(\hat\theta_n - \theta_n^*) = I_1(\hat\theta_n - \theta_n^*) + I_2(\hat\theta_n)$, and $I(\theta_n^*) = I_1(\theta_n^*)$, by the triangle inequality we obtain $\hat{\mathcal{E}} \le 2(1+b)\bar\lambda_{n,0}I_1(\hat\theta_n - \theta_n^*) + \mathcal{E}^*$. From Lemma 4.5, $\hat{\mathcal{E}} \le \delta\hat{\mathcal{E}} + \varepsilon_n^* - \mathcal{E}^* + \mathcal{E}^* = \delta\hat{\mathcal{E}} + \varepsilon_n^*$. Hence, $\hat{\mathcal{E}} \le \varepsilon_n^*/(1-\delta)$.

(a2) If $I(\hat\theta_n - \theta_n^*) \le c\zeta_n^*/b$, then from (4.10) with $d_0 = c$, with probability at least

\[
1 - \Big\{\Big(1 + \frac{3}{10}W^2\Big)\exp\big(-n\bar a_n^2r_1^2\big) + 2\exp\big(-n\xi^2/2\big)\Big\},
\]

we have

\[
\hat{\mathcal{E}} + (1+b)\bar\lambda_{n,0}I(\hat\theta_n) \le \frac{\delta}{1-\delta^2}\bar\lambda_{n,0}\zeta_n^* + \mathcal{E}^* + (1+b)\bar\lambda_{n,0}I(\theta_n^*).
\]

By the triangle inequality, Lemma 4.5, and the definition of $\varepsilon_n^*$ (which implies $\mathcal{E}^* \le \varepsilon_n^*/(1+\delta)$),

\[
\hat{\mathcal{E}} \le \frac{\delta}{1-\delta^2}\bar\lambda_{n,0}\zeta_n^* + \mathcal{E}^* + (1+b)\bar\lambda_{n,0}I_1(\hat\theta_n - \theta_n^*)
\le \frac{\delta}{1-\delta^2}\bar\lambda_{n,0}\frac{\varepsilon_n^*}{\bar\lambda_{n,0}} + \mathcal{E}^* + \frac{\delta}{2}\hat{\mathcal{E}} + \frac{1}{2}\varepsilon_n^* - \frac{1}{2}\mathcal{E}^*
\]
\[
= \Big(\frac{\delta}{1-\delta^2} + \frac{1}{2}\Big)\varepsilon_n^* + \frac{1}{2}\mathcal{E}^* + \frac{\delta}{2}\hat{\mathcal{E}}
\le \Big(\frac{\delta}{1-\delta^2} + \frac{1}{2}\Big)\varepsilon_n^* + \frac{1}{2(1+\delta)}\varepsilon_n^* + \frac{\delta}{2}\hat{\mathcal{E}}.
\]

Hence,

\[
\hat{\mathcal{E}} \le \frac{2}{2-\delta}\Big[\frac{\delta}{1-\delta^2} + \frac{1}{2} + \frac{1}{2(1+\delta)}\Big]\varepsilon_n^* = \frac{1}{1-\delta}\,\varepsilon_n^*.
\]
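That the bracket collapses to exactly $1/(1-\delta)$ is a direct computation over the common denominator $2(1-\delta^2)$:

\[
\frac{\delta}{1-\delta^2} + \frac{1}{2} + \frac{1}{2(1+\delta)} = \frac{2\delta + (1-\delta^2) + (1-\delta)}{2(1-\delta^2)} = \frac{2 + \delta - \delta^2}{2(1-\delta^2)} = \frac{(2-\delta)(1+\delta)}{2(1-\delta^2)} = \frac{2-\delta}{2(1-\delta)},
\]

and multiplying by $2/(2-\delta)$ gives $1/(1-\delta)$.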

Furthermore, by Lemma 4.8, we have with probability at least

\[
1 - (N_1 + N_2)\Big\{\Big(1 + \frac{3}{10}W^2\Big)\exp\big(-n\bar a_n^2r_1^2\big) + 2\exp\big(-n\xi^2/2\big)\Big\}
\]

that $I(\hat\theta_n - \theta_n^*) \le d(\delta_1, \delta_2)\zeta_n^*/b$, where

\[
N_1 = \log_{1+b}(1/\delta_1), \qquad N_2 = \log_{1+b}(1/\delta_2).
\]

(b) $c \ge d(\delta_1, \delta_2)$. On the set where $I(\hat\theta_n - \theta_n^*) \le d(\delta_1, \delta_2)\zeta_n^*/b$, from (4.10) we have with probability at least

\[
1 - \Big\{\Big(1 + \frac{3}{10}W^2\Big)\exp\big(-n\bar a_n^2r_1^2\big) + 2\exp\big(-n\xi^2/2\big)\Big\}
\]

that

\[
\hat{\mathcal{E}} + (1+b)\bar\lambda_{n,0}I(\hat\theta_n) \le \bar\lambda_{n,0}d(\delta_1,\delta_2)\frac{\zeta_n^*}{b} + \mathcal{E}^* + (1+b)\bar\lambda_{n,0}I(\theta_n^*) \le \frac{\delta}{1-\delta^2}\bar\lambda_{n,0}\zeta_n^* + \mathcal{E}^* + (1+b)\bar\lambda_{n,0}I(\theta_n^*),
\]

which is the same bound as in (a2) and leads to the same result.

To summarize, let

\[
A = \Big\{\hat{\mathcal{E}} \le \frac{1}{1-\delta}\varepsilon_n^*\Big\}, \qquad B = \Big\{I(\hat\theta_n - \theta_n^*) \le \frac{d(\delta_1,\delta_2)\,\zeta_n^*}{b}\Big\}.
\]

Note that

\[
J + 1 \le \log_{1+b}\Big(\frac{(1+b)^2\,d(\delta_1,\delta_2)}{c}\Big).
\]

Under case (a), we have

\[
P(A\cap B) = P(\mathrm{a1}) - P(A^c\cap \mathrm{a1}) + P(\mathrm{a2}) - P(A^c\cap \mathrm{a2})
\]
\[
\ge P(\mathrm{a1}) + P(\mathrm{a2}) - (J+1)\Big\{\Big(1+\frac{3}{10}W^2\Big)\exp\big(-n\bar a_n^2r_1^2\big) + 2\exp\big(-n\xi^2/2\big)\Big\}
= P(B) - (J+1)\Big\{\Big(1+\frac{3}{10}W^2\Big)\exp\big(-n\bar a_n^2r_1^2\big) + 2\exp\big(-n\xi^2/2\big)\Big\}
\]
\[
\ge 1 - (N_1+N_2+J+1)\Big\{\Big(1+\frac{3}{10}W^2\Big)\exp\big(-n\bar a_n^2r_1^2\big) + 2\exp\big(-n\xi^2/2\big)\Big\}
\]
\[
\ge 1 - \log_{1+b}\Big\{\frac{(1+b)^2}{\delta_1\delta_2}\cdot\frac{d(\delta_1,\delta_2)(1-\delta^2)}{\delta b}\Big\}\Big\{\Big(1+\frac{3}{10}W^2\Big)\exp\big(-n\bar a_n^2r_1^2\big) + 2\exp\big(-n\xi^2/2\big)\Big\}.
\]

Under case (b),

\[
P(A\cap B) = P(B) - P(A^c\cap B) \ge P(B) - \Big\{\Big(1+\frac{3}{10}W^2\Big)\exp\big(-n\bar a_n^2r_1^2\big) + 2\exp\big(-n\xi^2/2\big)\Big\}
\]
\[
\ge 1 - (N_1+N_2+2)\Big\{\Big(1+\frac{3}{10}W^2\Big)\exp\big(-n\bar a_n^2r_1^2\big) + 2\exp\big(-n\xi^2/2\big)\Big\}
= 1 - \log_{1+b}\Big\{\frac{(1+b)^2}{\delta_1\delta_2}\Big\}\Big\{\Big(1+\frac{3}{10}W^2\Big)\exp\big(-n\bar a_n^2r_1^2\big) + 2\exp\big(-n\xi^2/2\big)\Big\}.
\]

We thus obtain the desired result.

Contributor Information

Shengchun Kong, Email: kongsc@umich.edu.

Bin Nan, Email: bnan@umich.edu.

References

  1. Andersen PK, Gill RD. Cox's regression model for counting processes: a large sample study. Ann. Statist. 1982;10:1100–1120.
  2. Bach F. Self-concordant analysis for logistic regression. Electronic Journal of Statistics. 2010;4:384–414.
  3. Bickel P, Ritov Y, Tsybakov A. Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 2009;37:1705–1732.
  4. Bradic J, Fan J, Jiang J. Regularization for Cox's proportional hazards model with NP-dimensionality. Ann. Statist. 2011;39:3092–3120. doi:10.1214/11-AOS911.
  5. Bühlmann P. Boosting for high-dimensional linear models. Ann. Statist. 2006;34:559–583.
  6. Bunea F. Honest variable selection in linear and logistic regression models via ℓ1 and ℓ1+ℓ2 penalization. Electronic Journal of Statistics. 2008;2:1153–1194.
  7. Bunea F, Tsybakov AB, Wegkamp MH. Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics. 2007;1:169–194.
  8. Cox DR. Regression models and life tables (with discussion). J. Roy. Statist. Soc. B. 1972;34:187–220.
  9. Gaiffas S, Guilloux A. High-dimensional additive hazards models and the lasso. Electronic Journal of Statistics. 2012;6:522–546.
  10. Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics. 2005;21:3001–3008. doi:10.1093/bioinformatics/bti422.
  11. Ledoux M, Talagrand M. Probability in Banach Spaces: Isoperimetry and Processes. Berlin: Springer; 1991.
  12. Martinussen T, Scheike TH. Covariate selection for the semiparametric additive risk model. Scandinavian Journal of Statistics. 2009;36:602–619.
  13. Massart P. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Ann. Probab. 1990;18:1269–1283.
  14. Tarigan B, van de Geer S. Classifiers of support vector machine type with ℓ1 complexity regularization. Bernoulli. 2006;12:1045–1076.
  15. Tibshirani R. Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. B. 1996;58:267–288.
  16. Tibshirani R. The Lasso method for variable selection in the Cox model. Statistics in Medicine. 1997;16:385–395. doi:10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3.
  17. van de Geer S. High-dimensional generalized linear models and the lasso. Ann. Statist. 2008;36:614–645.
  18. van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes: With Applications to Statistics. New York: Springer; 1996.
