Abstract
We consider finite sample properties of the regularized high-dimensional Cox regression via lasso. The existing literature focuses on linear models or generalized linear models with Lipschitz loss functions, where the empirical risk functions are summations of independent and identically distributed (iid) losses. The summands in the negative log partial likelihood function for censored survival data, however, are neither iid nor Lipschitz. We first approximate the negative log partial likelihood function by a sum of iid non-Lipschitz terms, and then derive non-asymptotic oracle inequalities for the lasso penalized Cox regression, using pointwise arguments to tackle the difficulties caused by the lack of iid Lipschitz losses.
Keywords and phrases: Cox regression, finite sample, lasso, oracle inequality, variable selection
1. Introduction
Since it was introduced by Tibshirani (1996), the lasso regularization method for high-dimensional regression models with sparse coefficients has received a great deal of attention in the literature. Properties of interest for such regression models include finite sample oracle inequalities. Among the extensive literature on the lasso method, Bunea, Tsybakov, and Wegkamp (2007) and Bickel, Ritov, and Tsybakov (2009) derived oracle inequalities for prediction risk and estimation error in a general nonparametric regression model, including high-dimensional linear regression as a special case, and van de Geer (2008) provided oracle inequalities for generalized linear models with Lipschitz loss functions, e.g., logistic regression and classification with hinge loss. Bunea (2008) and Bach (2010) also considered the lasso regularized logistic regression. For censored survival data, the lasso penalty has been applied to regularized Cox regression, see e.g. Tibshirani (1997) and Gui and Li (2005), among others. Recently, Bradic, Fan, and Jiang (2011) studied the asymptotic properties of the lasso regularized Cox model. However, to the best of our knowledge, its finite sample non-asymptotic statistical properties have not yet been established in the literature, largely because the partial likelihood does not yield iid Lipschitz losses. Nonetheless, the lasso approach has been studied extensively for other survival models, see e.g. Martinussen and Scheike (2009) and Gaiffas and Guilloux (2012), among others, for the additive hazards model.
We consider the non-asymptotic statistical properties of the lasso regularized high-dimensional Cox regression. Let T be the survival time and C the censoring time. Suppose we observe a sequence of iid observations (Xi, Yi, Δi), i = 1, …, n, where Xi = (Xi1, ⋯, Xim) are the m-dimensional covariates in 𝒳, Yi = Ti ∧ Ci, and Δi = I{Ti≤Ci}. Due to a large amount of parallel material, we follow closely the notation in van de Geer (2008). Let

$$f_\theta(x) = \sum_{k=1}^{m} \theta_k x_k, \qquad \theta \in \Theta \subseteq \mathbb{R}^m,\ x \in \mathcal{X}.$$
Consider the Cox model (Cox (1972)):

$$\lambda(t \mid X) = \lambda_0(t)\, e^{f_\theta(X)},$$

where θ is the parameter of interest and λ0 is the unknown baseline hazard function. The negative log partial likelihood function for θ is
$$l_n(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\Delta_i\left\{ f_\theta(X_i) - \log\left[\frac{1}{n}\sum_{j=1}^{n} 1(Y_j \ge Y_i)\, e^{f_\theta(X_j)}\right]\right\}. \tag{1.1}$$
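For concreteness, a minimal numerical sketch of evaluating (1.1) with the linear predictor fθ(x) = θᵀx is given below; the NumPy implementation and array names are illustrative assumptions, not code from the paper.

```python
import numpy as np

def neg_log_partial_likelihood(theta, X, y, delta):
    """Negative log partial likelihood l_n(theta) as in (1.1).

    theta : (m,) coefficient vector
    X     : (n, m) covariate matrix
    y     : (n,) observed times Y_i = T_i ^ C_i
    delta : (n,) event indicators Delta_i = 1{T_i <= C_i}
    """
    n = X.shape[0]
    f = X @ theta                                        # linear predictor f_theta(X_i)
    at_risk = (y[None, :] >= y[:, None]).astype(float)   # at_risk[i, j] = 1{Y_j >= Y_i}
    log_avg_risk = np.log(at_risk @ np.exp(f) / n)       # log of the averaged risk-set sum
    return -np.mean(delta * (f - log_avg_risk))
```

Note that the risk-set average inside the logarithm couples all n observations, which is the source of the non-iid structure discussed next.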
The corresponding estimator with lasso penalty is denoted by

$$\hat\theta_n = \arg\min_{\theta\in\Theta}\left\{ l_n(\theta) + \lambda_n \sum_{k=1}^{m}\sigma_k|\theta_k|\right\},$$

where Σ_{k=1}^m σk|θk| is the weighted l1 norm of the vector θ ∈ Rm. van de Geer (2008) considered σk to be the square root of the second moment of the k-th covariate Xk, either at the population level (fixed) or at the sample level (random). For normalized Xk, σk = 1. We consider fixed weights σk, k = 1, ⋯, m. The results for random weights can be obtained from the case with fixed weights following van de Geer (2008), and we leave the detailed calculation to interested readers.
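A hedged computational sketch for the penalized estimator above, based on proximal gradient descent, is given below; this is not the algorithm used in the paper, and the fixed step size, iteration count, and function names are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, a):
    """Elementwise soft-thresholding: the proximal map of the weighted l1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - a, 0.0)

def lasso_cox(X, y, delta, lam, sigma, step=0.01, n_iter=5000):
    """Proximal gradient descent for l_n(theta) + lam * sum_k sigma_k |theta_k|."""
    n, m = X.shape
    at_risk = (y[None, :] >= y[:, None]).astype(float)   # at_risk[i, j] = 1{Y_j >= Y_i}
    theta = np.zeros(m)
    for _ in range(n_iter):
        w = np.exp(X @ theta)                            # exp(f_theta(X_j))
        denom = at_risk @ w                              # risk-set sums S_i
        risk_weight = at_risk.T @ (delta / denom)        # sum_i Delta_i 1{Y_j >= Y_i} / S_i
        grad = -(X.T @ (delta - risk_weight * w)) / n    # gradient of the neg. log partial likelihood
        theta = soft_threshold(theta - step * grad, step * lam * sigma)
    return theta
```

In practice one would add a convergence criterion and a line search (or use a dedicated coordinate-descent solver), and select λn either at the order suggested by the theory in Section 3 or by cross-validation.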
Clearly the negative log partial likelihood (1.1) is a sum of non-iid random variables. For ease of calculation, consider an intermediate function as a “replacement” of the negative log partial likelihood function
$$\tilde l_n(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\Delta_i\left\{ f_\theta(X_i) - \log \mu(Y_i; f_\theta)\right\} \tag{1.2}$$
that has the iid structure, but with an unknown population expectation

$$\mu(t; f_\theta) = E\left[1(Y \ge t)\, e^{f_\theta(X)}\right].$$
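To see the iid structure of (1.2) concretely: once μ(·; fθ) is fixed, the i-th summand depends only on (Xi, Yi, Δi). The purely illustrative sketch below approximates the unknown expectation by Monte Carlo, assuming (for illustration only) access to a large independent sample (X_mc, y_mc) from the same distribution.

```python
import numpy as np

def mu_hat(t, theta, X_mc, y_mc):
    """Monte Carlo approximation of mu(t; f_theta) = E[1(Y >= t) exp(f_theta(X))]."""
    return np.mean((y_mc >= t) * np.exp(X_mc @ theta))

def intermediate_loss(theta, X, y, delta, X_mc, y_mc):
    """Average of the iid losses in (1.2)."""
    f = X @ theta
    # one value of mu per observed time Y_i; each summand uses only (X_i, Y_i, Delta_i)
    mu = np.array([mu_hat(t, theta, X_mc, y_mc) for t in y])
    return -np.mean(delta * (f - np.log(mu)))
```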
The negative log partial likelihood function (1.1) can then be viewed as a “working” model for the empirical loss function (1.2). The corresponding loss function is
$$\gamma_{f_\theta}(X, Y, \Delta) = -\Delta\left\{ f_\theta(X) - \log \mu(Y; f_\theta)\right\}, \tag{1.3}$$
with expected loss
$$P\gamma_{f_\theta} = E\left[\gamma_{f_\theta}(X, Y, \Delta)\right], \tag{1.4}$$
where P denotes the distribution of (Y, Δ, X). Define the target function f̅ as

$$\bar f := f_{\bar\theta},$$

where θ̅ = arg minθ∈Θ Pγfθ. It is well known that Pγfθ is convex with respect to θ for the regular Cox model, see, for example, Andersen and Gill (1982); thus the above minimum is unique if the Fisher information matrix of θ at θ̅ is non-singular. Define the excess risk of f by

$$\mathcal{E}(f) := P\gamma_{f} - P\gamma_{\bar f}.$$
It is desirable to establish non-asymptotic oracle inequalities for the Cox regression model similar to those in, for example, van de Geer (2008) for generalized linear models. That is, with large probability,

$$\mathcal{E}(f_{\hat\theta_n}) \;\le\; \mathrm{const} \cdot \min_{\theta\in\Theta}\bigl\{\mathcal{E}(f_\theta) + \mathcal{V}_\theta\bigr\}.$$

Here 𝒱θ is called the "estimation error", which is typically proportional to λn² times the number of nonzero elements in θ.
Note that the summands in the negative log partial likelihood function (1.1) are not iid, and the intermediate loss function γ(·, Y, Δ) given in (1.3) is not Lipschitz. Hence the general result of van de Geer (2008), which requires iid Lipschitz loss functions, does not apply to the Cox regression. We tackle the problem using pointwise arguments to obtain oracle bounds for two types of errors: one between the empirical loss (1.2) and the expected loss (1.4), obtained without the Lipschitz requirement of van de Geer (2008), and the other between the negative log partial likelihood (1.1) and the empirical loss (1.2), which establishes the iid approximation of the non-iid losses. These steps distinguish our work from that of van de Geer (2008); we rely on the Mean Value Theorem, with van de Geer's Lipschitz condition replaced by the similar, but much less restrictive, boundedness assumption for regression parameters in Bühlmann (2006).
The article is organized as follows. In Section 2, we state the assumptions that are used throughout the paper. In Section 3, we define several useful quantities, followed by the main result. We then provide a detailed proof in Section 4 through a series of lemmas and corollaries useful for deriving the oracle inequalities for the Cox model. To avoid duplicating material as much as possible, we refer in places to the preliminaries and results of van de Geer (2008) in the proofs without repeating the details.
2. Assumptions
We impose five basic assumptions. Let ‖·‖ be the L2(P) norm and ‖·‖∞ the sup norm.
Assumption A. Km ≔ max1≤k≤m{‖Xk‖∞/σk} < ∞.
Assumption B. There exist an η > 0 and a strictly convex, increasing function G such that for all θ ∈ Θ with ‖fθ − f̅‖∞ ≤ η, one has ℰ(fθ) ≥ G(‖fθ − f̅‖).
In particular, G can be chosen as a quadratic function with some constant C0, i.e., G(u) = u²/C0; the convex conjugate of G, denoted by H and satisfying uv ≤ G(u) + H(v), is then also quadratic.
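For instance, with G(u) = u²/C0 the conjugate can be computed directly (a routine verification, not taken from the paper):

$$H(v) = \sup_{u\ge 0}\Bigl\{uv - \frac{u^2}{C_0}\Bigr\} = \frac{C_0 v^2}{4}, \qquad \text{attained at } u = \frac{C_0 v}{2},$$

so that uv ≤ u²/C0 + C0v²/4 for all u, v ≥ 0, i.e., uv ≤ G(u) + H(v) with both G and H quadratic.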
Assumption C. There exists a function D(·) on the subsets of the index set {1, ⋯, m} such that for all 𝒦 ⊂ {1, ⋯, m} and all θ, θ̃ ∈ Θ, we have

$$\sum_{k\in\mathcal{K}} \sigma_k\,\bigl|\theta_k - \tilde\theta_k\bigr| \;\le\; \sqrt{D(\mathcal{K})}\;\bigl\|f_\theta - f_{\tilde\theta}\bigr\|.$$

Here, D(𝒦) is chosen to be the cardinality of 𝒦.
Assumption D. Lm ≔ sup_{θ∈Θ} Σ_{k=1}^m |θk| < ∞.
Assumption E. The observation time stops at a finite time τ > 0, with ξ ≔ P(Y ≥ τ) > 0.
Assumptions A, B, and C are identical to those in van de Geer (2008) with her ψk the identity function. Assumptions B and C can be easily verified for the random design setting where X is random (van de Geer (2008)) together with the usual assumption of non-singular Fisher information matrix at θ̅ (and its neighborhood) for the Cox model. Assumption D has a similar flavor to the assumption (A2) in Bühlmann (2006) for the persistency property of boosting method in high-dimensional linear regression models, but is much less restrictive in the sense that Lm is allowed to depend on m in contrast with the fixed constant in Bühlmann (2006). Here it replaces the Lipschitz assumption in van de Geer (2008). Assumption E is commonly used for survival models with censored data, see for example, Andersen and Gill (1982). A straightforward extension of Assumption E is to allow τ (thus ξ) to depend on n.
From Assumptions A and D, we have, for any θ ∈ Θ,
$$e^{f_\theta(X_i)} \;\le\; e^{K_m L_m \sigma_{(m)}} \;=:\; U_m \tag{2.1}$$
for all i, where σ(m) = max1≤k≤m σk. Note that Um is allowed to depend on m.
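For completeness, (2.1) can be verified in one line from Assumptions A and D as stated above (under the ℓ1-bound reading of Assumption D):

$$\bigl|f_\theta(X_i)\bigr| = \Bigl|\sum_{k=1}^{m}\theta_k X_{ik}\Bigr| \le \Bigl(\max_{1\le k\le m}\|X_k\|_\infty\Bigr)\sum_{k=1}^{m}|\theta_k| \le K_m\,\sigma_{(m)}\,L_m,$$

since ‖Xk‖∞ ≤ Km σk ≤ Km σ(m) by Assumption A; exponentiating gives e^{fθ(Xi)} ≤ e^{Km Lm σ(m)} = Um.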
3. Main result
Let I(θ) ≔ Σ_{k=1}^m σk|θk| be the weighted l1 norm of θ. For any θ and θ̃ in Θ, denote

$$I_1(\theta \mid \tilde\theta) := \sum_{k:\,\tilde\theta_k \neq 0} \sigma_k|\theta_k|, \qquad I_2(\theta \mid \tilde\theta) := I(\theta) - I_1(\theta \mid \tilde\theta).$$
Consider the estimator

$$\hat\theta_n = \arg\min_{\theta\in\Theta}\bigl\{ l_n(\theta) + \lambda_n I(\theta)\bigr\}.$$
3.1. Useful quantities
We first define a set of useful quantities that are involved in the oracle inequalities.
a̅n = 4an, .
r1 > 0, b > 0, d > 1, and 1 > δ > 0 are arbitrary constants.
.
- , where
λn ≔ (1 + b)λ̅n,0.
δ1 = (1 + b)^{−N1} and δ2 = (1 + b)^{−N2} are arbitrary constants for some N1 and N2, where N1 ∈ N ≔ {1, 2, …} and N2 ∈ N ∪ {0}.
.
W is a fixed constant given in Lemma 4.3 for a class of empirical processes.
Dθ ≔ D({k : θk ≠ 0, k = 1, …, m}) is the number of nonzero θk’s, where D(·) is given in Assumption C.
, where H is the convex conjugate of function G defined in Assumption B.
.
.
.
In the above, the dependence on the sample size n is through 𝒱θ, which involves the tuning parameter λn. We also impose conditions as in van de Geer (2008):
Condition I(b, δ). .
Condition II(b, δ, d). .
In both conditions, η is given in Assumption B.
3.2. Oracle inequalities
We now provide our theorem on oracle inequalities for the lasso estimator in the Cox model, with a detailed proof given in the next section. The key idea of the proof is to bound two quantities: the empirical process error associated with the working model (1.2), and the error of approximating the negative log partial likelihood (1.1) by (1.2); these are denoted Zθ and Rθ, respectively, in the next section.
Theorem 3.1. Suppose Assumptions A-E and Conditions I(b, δ) and II(b, δ, d) hold. With
we have, with probability at least
that
4. Proofs
4.1. Preparations
Denote the empirical probability measure based on the sample {(Xi, Yi, Δi) : i = 1, …, n} by Pn. Let ε1, ⋯, εn be a Rademacher sequence, independent of the training data (X1, Y1, Δ1), ⋯, (Xn, Yn, Δn). For some fixed θ* ∈ Θ and some M > 0, denote ℱM ≔ {fθ : θ ∈ Θ, I(θ − θ*) ≤ M}. Later we take θ* to be the oracle θ*n, which is the case of interest. For any θ with I(θ − θ*) ≤ M, denote

$$Z_\theta(M) := \bigl|(P_n - P)\bigl(\gamma_{f_\theta} - \gamma_{f_{\theta^*}}\bigr)\bigr|.$$
Note that van de Geer (2008) sought to bound the supremum of such differences over ℱM, for which the contraction theorem of Ledoux and Talagrand (1991) (Theorem A.3 in van de Geer (2008)), valid for Lipschitz functions, was needed. That calculation does not apply to the Cox model because the Lipschitz property is lacking. However, the pointwise argument is adequate for our purpose because only the lasso estimator θ̂n, or the difference between θ̂n and the oracle, is of interest. Note the notational difference between an arbitrary θ* in the above Zθ(M) and the oracle θ*n.
Lemma 4.1. Under Assumptions A, D, and E, for all θ satisfying I(θ − θ*) ≤ M, we have EZθ(M) ≤ a̅nM.
Proof. By the symmetrization theorem, see e.g. van der Vaart and Wellner (1996) or Theorem A.2 in van de Geer (2008), for a class of only one function we have
For A we have
Applying Lemma A.1 in van de Geer (2008), we obtain
Thus we have
(4.1) |
For B, instead of using the contraction theorem that requires Lipschitz, we use the Mean Value Theorem:
where θ** is between θ and θ*, and
(4.2) |
satisfying
Since for all i,
following Lemma A.1 in van de Geer (2008), we obtain
(4.3) |
Combining (4.1) and (4.3), the upper bound for EZθ(M) is achieved.
We can now bound the tail probability of Zθ(M) using Bousquet's concentration theorem, stated as Theorem A.1 in van de Geer (2008).
Corollary 4.1. Under Assumptions A, D, and E, for all M > 0, r1 > 0 and all θ satisfying I(θ − θ*) ≤ M, it holds that
Proof. Using the triangle inequality and the Mean Value Theorem, we obtain
where θ** is between θ and θ*, and Fθ**(k, Y) is defined in (4.2). So we have
Therefore, in view of Bousquet’s concentration theorem and Lemma 4.1, for all M > 0 and r1 > 0,
Now for any θ satisfying I(θ − θ*) ≤ M, we bound
Here recall that τ is given in Assumption E. By the Mean Value Theorem, we have
(4.4) |
where θ** is between θ and θ* and, by (2.1), we have
(4.5) |
Lemma 4.2. Under Assumption E, we have
Proof. This is obtained directly from the inequality of Massart (1990) for the Kolmogorov statistic.
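For reference, the inequality being invoked is the Dvoretzky–Kiefer–Wolfowitz bound with Massart's tight constant: for the empirical distribution function 𝔽n of an iid sample of size n from a distribution function F, and any λ > 0,

$$P\Bigl(\sup_{t\in\mathbb{R}}\bigl|\mathbb{F}_n(t) - F(t)\bigr| > \lambda\Bigr) \;\le\; 2\,e^{-2n\lambda^2}.$$

Applied to the distribution of Y, this controls the deviation of n⁻¹Σᵢ 1(Yᵢ ≥ τ) from ξ = P(Y ≥ τ).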
Lemma 4.3. Under Assumptions A, D, and E, for all θ we have
(4.6) |
where W is a fixed constant.
Proof. For the class of functions indexed by t, ℱ = {1(y ≥ t)e^{fθ(x)}/Um : t ∈ [0, τ], y ∈ R, e^{fθ(x)} ≤ Um}, we calculate its bracketing number. For any ε satisfying 0 < ε < 1, let ti be the i-th of the ⌈1/ε⌉ quantiles of Y, where ⌈x⌉ is the smallest integer that is greater than or equal to x. Furthermore, take t0 = 0 and t⌈1/ε⌉ = +∞. For i = 1, ⋯, ⌈1/ε⌉, define brackets [Li, Ui] with

$$L_i(x, y) = \frac{1(y \ge t_i)\, e^{f_\theta(x)}}{U_m}, \qquad U_i(x, y) = \frac{1(y \ge t_{i-1})\, e^{f_\theta(x)}}{U_m},$$
such that Li(x, y) ≤ 1(y ≥ t)e^{fθ(x)}/Um ≤ Ui(x, y) when ti−1 < t ≤ ti. Since
we have , which yields
where . Thus, from Theorem 2.14.9 in van der Vaart and Wellner (1996), we have for any r > 0,
where W is a constant that only depends on K. Note that r²e^{−r²} is bounded by e^{−1}. With , we obtain (4.6).
Lemma 4.4. Under Assumptions A, D, and E, for all θ we have
(4.7) |
Proof. Consider the classes of functions indexed by t,
Using the argument in the proof of Lemma 4.3, we have
where , and then for any r > 0,
Thus we have
Let , so . Since
, we obtain (4.7).
Corollary 4.2. Under Assumptions A, D, and E, for all M > 0, r1 > 0, and all θ that satisfy I(θ − θ*) ≤ M, we have
(4.8) |
Proof. From (4.4) and (4.5) we have
where the events E1, E2 and E3 are defined as
Thus
and the result follows from Lemmas 4.2, 4.3 and 4.4.
Now with , we have the following results.
Lemma 4.5. Suppose Conditions I(b, δ) and II(b, δ, d) are met. Under Assumptions B and C, for all θ ∈ Θ with , it holds that
Proof. The proof is exactly the same as that of Lemma A.4 in van de Geer (2008), with the λn defined in Subsection 3.1.
Lemma 4.6. Suppose Conditions I(b, δ) and II(b, δ, d) are met. Consider any random θ̃ ∈ Θ with . Let 1 < d0 ≤ db. It holds that
Proof. The idea is similar to the proof of Lemma A.5 in van de Geer (2008). Let ℰ̃ = ℰ(fθ̃) and . We will use short notation: and . Since , on the set where and , we have
(4.9) |
By (4.8) we know that is bounded by with probability at least , then we have
Since I(θ̃) = I1(θ̃) + I2(θ̃) and , using the triangular inequality, we obtain
(4.10) |
Adding to both sides and from Lemma 4.5,
Because 0 < δ < 1, it follows that
Hence,
which yields the desired result.
Corollary 4.3. Suppose Conditions I(b, δ) and II(b, δ, d) are met. Consider any random θ̃ ∈ Θ with . Let 1 < d0 ≤ db. It holds that
Proof. Repeat Lemma 4.6 N times.
Lemma 4.7. Suppose Conditions I(b, δ) and II(b, δ, d) hold. If , where
then for any integer N, with probability at least
we have
Proof. Since the negative log partial likelihood ln(θ) and the lasso penalty are both convex with respect to θ, applying Corollary 4.3, we obtain the above inequality. This proof is similar to the proof of Lemma A.6 in van de Geer (2008).
Lemma 4.8. Suppose Conditions I(b, δ) and II(b, δ, d) are met. Let N1 ∈ N ≔ {1, 2, …} and N2 ∈ N ∪ {0}. With δ1 = (1 + b)^{−N1} and δ2 = (1 + b)^{−N2}, for any n, with probability at least
we have
where
Proof. The proof is the same as that of Lemma A.7 in van de Geer (2008), with a slightly different probability bound.
4.2. Proof of Theorem 3.1
Proof. The proof follows the same ideas as the proof of Theorem A.4 in van de Geer (2008), except for the pointwise arguments and slightly different probability bounds. Since this is our main result, we provide a detailed proof here despite the substantial overlap.
Define ℰ̂ ≔ ℰ(fθ̂n) and ; use the notation and ; set c ≔ δb/(1 − δ2). Consider the cases (a) c < d(δ1, δ2) and (b) c ≥ d(δ1, δ2).
(a) c < d(δ1, δ2). Let J be an integer satisfying (1 + b)^{J−1}c ≤ d(δ1, δ2) and (1 + b)^{J}c > d(δ1, δ2). We consider the cases (a1) and (a2) .
(a1) If , then
for some j ∈ {1, ⋯, J}. Let d0 = c(1 + b)^{j−1} ≤ d(δ1, δ2) ≤ db. From Corollary 4.1, with probability at least we have .
Since , from (4.9) we have
By (4.8), is bounded by with probability at least
Then we have
Since , and , by triangular inequality we obtain . From Lemma 4.5, . Hence, .
(a2) If , from (4.10) with d0 = c, with probability at least
we have
By the triangular inequality, Lemma 4.5 and (A4),
Hence,
Furthermore, by Lemma 4.8, we have with probability at least
that , where
(b) c ≥ d(δ1, δ2). On the set where , from equation (4.10) we have with probability at least
that
which is the same as (a2) and leads to the same result.
To summarize, let
Note that
Under case (a), we have
Under case (b),
We thus obtain the desired result.
Contributor Information
Shengchun Kong, Email: kongsc@umich.edu.
Bin Nan, Email: bnan@umich.edu.
References
- Andersen PK, Gill RD. Cox's regression model for counting processes: a large sample study. Ann. Statist. 1982;10:1100–1120.
- Bach F. Self-concordant analysis for logistic regression. Electronic Journal of Statistics. 2010;4:384–414.
- Bickel P, Ritov Y, Tsybakov A. Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 2009;37:1705–1732.
- Bradic J, Fan J, Jiang J. Regularization for Cox's proportional hazards model with NP-dimensionality. Ann. Statist. 2011;39:3092–3120. doi: 10.1214/11-AOS911.
- Bühlmann P. Boosting for high-dimensional linear models. Ann. Statist. 2006;34:559–583.
- Bunea F. Honest variable selection in linear and logistic regression models via ℓ1 and ℓ1+ℓ2 penalization. Electronic Journal of Statistics. 2008;2:1153–1194.
- Bunea F, Tsybakov AB, Wegkamp MH. Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics. 2007;1:169–194.
- Cox DR. Regression models and life tables (with discussion). J. Roy. Statist. Soc. B. 1972;34:187–220.
- Gaiffas S, Guilloux A. High-dimensional additive hazards models and the lasso. Electronic Journal of Statistics. 2012;6:522–546.
- Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics. 2005;21:3001–3008. doi: 10.1093/bioinformatics/bti422.
- Ledoux M, Talagrand M. Probability in Banach Spaces: Isoperimetry and Processes. Berlin: Springer; 1991.
- Martinussen T, Scheike TH. Covariate selection for the semiparametric additive risk model. Scandinavian Journal of Statistics. 2009;36:602–619.
- Massart P. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. Ann. Probab. 1990;18:1269–1283.
- Tarigan B, van de Geer S. Classifiers of support vector machine type with ℓ1 complexity regularization. Bernoulli. 2006;12:1045–1076.
- Tibshirani R. Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. B. 1996;58:267–288.
- Tibshirani R. The Lasso method for variable selection in the Cox model. Statistics in Medicine. 1997;16:385–395. doi: 10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3.
- van de Geer S. High-dimensional generalized linear models and the lasso. Ann. Statist. 2008;36:614–645.
- van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes: With Applications to Statistics. New York: Springer; 1996.