Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2023 Oct 18.
Published in final edited form as: J R Stat Soc Series B Stat Methodol. 2016 Dec 26;79(5):1415–1437. doi: 10.1111/rssb.12224

Testing and Confidence Intervals for High Dimensional Proportional Hazards Model

Ethan X Fang 1, Yang Ning 1, Han Liu 1,
PMCID: PMC10584375  NIHMSID: NIHMS847285  PMID: 37854943

Abstract

This paper proposes a decorrelation-based approach to test hypotheses and construct confidence intervals for the low dimensional component of high dimensional proportional hazards models. Motivated by the geometric projection principle, we propose new decorrelated score, Wald and partial likelihood ratio statistics. Without assuming model selection consistency, we prove the asymptotic normality of these test statistics, establish their semiparametric optimality. We also develop new procedures for constructing pointwise confidence intervals for the baseline hazard function and baseline survival function. Thorough numerical results are provided to back up our theory.

Keywords: Proportional hazards model, censored data, high dimensional inference, survival analysis, decorrelation method

1 Introduction

The proportional hazards model (Cox, 1972) is one of the most important tools for analyzing time to event data, and finds wide applications in epidemiology, medicine, economics, and sociology (Kalbfleisch and Prentice, 2011). This model is semiparametric by treating the baseline hazard function as an infinite dimensional nuisance parameter. To infer the finite dimensional parameter of interest, Cox (1972, 1975) proposes the partial likelihood approach which is invariant to the baseline hazard function. In low dimensional settings, Tsiatis (1981); Andersen and Gill (1982) have established the consistency and asymptotic normality of the maximum partial likelihood estimator.

In high dimensional settings when the number of covariates d is larger than the sample size n, the partial maximum likelihood estimation is an ill-posed problem. To solve this problem, we resort to the penalized estimators (Tibshirani, 1997; Fan and Li, 2002; Gui and Li, 2005). Under the condition d = o(n1/4), Cai et al. (2005) establish the oracle properties for the maximum penalized partial likelihood estimator using the SCAD penalty. Other types of estimation procedures and their theoretical properties are studied by Zhang and Lu (2007); Wang et al. (2009); Antoniadis et al. (2010); Zhao and Li (2012). In particular, under the ultra-high dimensional regime that d = o(exp(n/s)), Bradic et al. (2011); Huang et al. (2013); Kong and Nan (2014) establish the oracle properties and statistical error bounds of maximum penalized partial likelihood estimator, where s denotes the number of nonzero elements in the parametric component of the Cox model.

Though significant progress has been made towards developing the estimation theory. Little work exists on the inferential aspects (e.g., testing hypothesis or constructing confidence intervals) of high dimensional proportional hazard models. A notable exception is Bradic et al. (2011), who establish the limiting distribution of the oracle estimator. However, such a result hinges on model selection consistency, which is not always possible in applications. To the best of our knowledge, uncertainty assessment for low dimensional parameters of high dimensional proportional hazards model remains an open problem. This paper aims to close this gap by developing valid inferential procedures and theory for high dimensional proportional hazards models. In particular, we test hypotheses and construct confidence intervals for a scalar component of a d dimensional parameter vector1. Compared with existing work, our method does not require any types of irrepresentable condition or the minimal signal strength condition, thus is more practical in applications.

More specifically, we develop a unified inferential framework by extending the classical score, Wald and partial likelihood ratio tests to high dimensional hazards models. The key ingredient of our construction of these tests is a novel high dimensional decorrelation device of the score function. Theoretically, we establish the asymptotic distributions of these test statistics under the null. Using the same idea, we construct optimal confidence intervals for the parameters of interest. In addition, we consider the problems on inferring the baseline hazard and survival functions and separately establish their asymptotic normalities.

The rest of this paper is organized as follows. In Section 2, we provide some background on the proportional hazards model. In Section 3, we propose the methods for conducting hypothesis testing and constructing confidence intervals for low dimensional components of regression parameters. In Section 4, we provide theoretical analysis of the proposed methods. The inference on the baseline hazard function is studied in Section 5. In Section 6, we investigate the empirical performance of these methods. Section 7 contains the summary and discussions. More technical details and an extension to the multivariate failure time data are presented in the Appendix.

2 Background

We start with an introduction of notation. Let a = (a1, …, ad)T ∈ ℝd be a d dimensional vector and A = [ajk] ∈ ℝd×d be a d by d matrix. Let supp(a) = {j : aj ≠ 0}. For 0 < q < ∞, we define 0, q and vector norms as a0 = card(supp(a)), aq=(j=1dajq)1/q and a = max1≤jd|aj|. We matrix define the matrix -norm as the elementwise sup-norm that A = max1≤j,kd|ajk|. Let Id be the identity in ℝd×d. For a sequence of random variables {Xn}n=1 and a random variable Y, we denote Xn weakly converges to Y by XndY. We denote [n] = {1, …, n}.

2.1 Cox’s Proportional Hazards Model

We briefly review the Cox’s proportional hazards model. Let Q be the time to event; R be the censoring time, and X(t) = (X1(t), …, Xd(t))T be the d dimensional time dependent covariates at time t. We consider the non-informative censoring setting that Q and R are conditionally independent given X(t). Let W = min{Q, R} and Δ = 1{Q ≤ R} denote the observed survival time and censoring indicator. Let τ be the end of study time. We observe n independent copies of {(X(t), W, Δ) : 0 ≤ tτ}

{(Xi(t),Wi,Δi):0tτ}i[n].

We denote λ{t|X(t)} as the conditional hazard rate function at time t given the covariates X(t). Under the proportional hazards model, we assume that

λ{t|X(t)}=λ0(t)exp{XT(t)β},

where λ0(t) is an unknown baseline hazard rate function, and β ∈ ℝd is an unknown parameter.

2.2 Penalized Estimation

Following Andersen and Gill (1982), we introduce some counting process notation. For each i, let Ni(t) := 1{Wi ≤ t, Δi = 1} be the counting process, and Yi(t) := 1 {Wit} be the at risk process for subject i. Assume that the process Yi(t) is left continuous with its right-hand limits satisfying ℙ(Yi(t) = 1, 0 ≤ t ≤ τ) > Cτ for some positive constant Cτ. The negative log-partial likelihood is

L(β)=1n(i=1n0τXiT(u)βdNi(u)0τlog[i=1nYi(u)exp{XiT(u)β}]dN¯(u)),

where N¯(t)=i=1nNi(t).

When the dimension d is fixed and smaller than the sample size n, β can be estimated by the maximum partial likelihood estimator (Andersen and Gill, 1982). However, in high dimensional settings where n < d, the maximum partial likelihood estimator is not well defined. To solve this problem, Fan and Li (2002) impose the sparsity assumption and propose the penalized estimator

β^:=argminβd{(β)+Pλ(β)}, (2.1)

where Pλ() is a sparsity-inducing penalty function and λ is a tuning parameter. Bradic et al. (2011) and Huang et al. (2013) establish the rates of convergence and oracle properties of the maximum penalized partial likelihood estimators β^ using SCAD and Lasso penalties. For notational simplicity, we focus on the Lasso penalized estimator in this paper and indicate that similar properties hold for the SCAD penalty. Existing works generally impose the following assumptions.

Assumption 2.1

The difference of the covariates is uniformly bounded:

sup0tτmaxi,inmax1jd|Xij(t)Xij(t)|CX,

for some constant CX > 0.

Assumption 2.2

For any set S{1,,d} where |S|s and any vector v belonging to the cone, C(ξ,S)={vd:vSC1ξvS1} it holds that

κ(ξ,S;2(β))=inf0vC(ξ,S)s1/2{vT2L(β)v}1/2vS1λmin>0.

Note that the bounded covariate condition in Assumption 2.1, which is imposed by both Bradic et al. (2011) and Huang et al. (2013), holds in most real applications. Assumption 2.2 is known as the compatibility factor condition which is also used by Huang et al. (2013). This assumption essentially bounds the minimal eigenvalue of the Hessian matrix ∇2ℒ(β) from below for those directions within the cone C(ξ,S). In particular, the validity of this assumption has been verified in Theorem 4.1 of Huang et al. (2013). Under these assumptions, Huang et al. (2013) derive the rate of convergence of the Lasso estimator β^ under the 1-norm. More specifically, they prove that under Assumptions 2.1 and 2.2, if ‖β*‖0 = s and λn1logd, it holds that

β^β1=O(sλ), (2.2)

which establishes the estimation consistency in the high dimensional regime.

Additional Notations

For a vector u, we denote u⊗0 = 1, u⊗1 = u and u⊗2 = uuT. Denote

S(r)(t,β)=1ni=1nXir(t)Yi(t)exp{XiT(t)β}forr=0,1,2Z¯(t,β)=S(1)(t,β)S(0)(t,β),Vn(t,β)=i=1nYi(t)exp{Xi(t)Tβ}nS(0)(t,β){Xi(t)Z¯(t,β)}2=S(2)(t,β)S(0)(t,β)Z¯(t,β)2. (2.3)

The gradient of ℒ(β) is

L(β)=L(β)β=1ni=1n0τ{Xi(u)Z¯(u,β)}dNi(u), (2.4)

and the Hessian matrix of ℒ(β) is

2L(β)=1n0τVn(u,β)dN¯(u)=1n0τ{S(2)(u,β)S(0)(u,β)Z¯(u,β)2}dN¯(u). (2.5)

We denote the population versions of above defined quantities by

s(r)(t,β)=E[Y(t)X(t)rexp{X(t)Tβ}]forr=0,1,2;e(t,β)=s(1)(t,β)/s(0)(t,β), (2.6)

and

H(β)=E[0τ{s(2)(t,β)s(0)(t,β)e(t,β)2}dN(t)],andH=H(β), (2.7)

where H is the Fisher information matrix based on the partial likelihood.

3 Testing Hyptheses and Constructing Confidence Intervals

While estimation consistency has been established in high dimensions, it remains challenging to develop inferential procedures (e.g., confidence intervals and testing) for high dimensional proportional hazards model. In this section, we propose three novel hypothesis testing procedures. The proposed tests can be viewed as high dimensional counterparts of the conventional score, Wald, and partial likelihood ratio tests.

Hereafter, for notational simplicity, we partition the vector β as β = (α, θT)T, where α = β1 ∈ ℝ is the parameter of interest; θ = (β2, …, βd)T ∈ ℝd−1 is the vector of nuisance parameters, and we denote ℒ(β) by ℒ(α, θ). Let αα2L(β), αθ2L(β) and θθ2L(β) be the corresponding partitions of ∇2ℒ(β). Let Hαα, Hαθ and Hθθ be the corresponding partitions of H, where H is defined in (2.7). For instances, Hθa=H2:d,1d1 and θθ2L(β)=2:d,2:d2L(β)R(d1)×(d1). Throughout this paper, without loss of generality, we test the hypothesis H0: α = 0 versus H1: α ≠ 0. Note that the extension to tests for a multi-dimensional vector αd0, where d0 is fixed, is straightforward.

3.1 Decorrelated Score Test

In the classical low dimensional setting, we can exploit the profile partial score function

S(α)=αL(α,θ)|θ=θ^(α)

to conduct test, where θ^(α)=argminθL(α,θ) is the maximum partial likelihood estimator for θ with a fixed α. Under the null hypothesis that α = 0, when d is fixed while n goes to infinity, it holds that nS(0)dN(0,Hαα). If n(Hαα)1S2(0) is larger than the (1 − η)th quantile of a chi-squared distribution with one degree of freedom, we reject the null hypothesis. Classical asymptotic theory shows that this procedure controls type I error with significance level η.

However, in high dimensions, the profile partial score function S(α) with θ^(α) replaced by a penalized estimator, say the corresponding components of β^ in (2.1), does not yield a tractable limiting distribution due to the existence of a large number of nuisance parameters. To address this problem, we construct a new type of score function for α that is asymptotically normal even in high dimensions. The key component of our procedure is a high dimensional decorrelation device, aiming to handle the impact of the high dimensional nuisance vector.

More specifically, we propose a decorrelated score test for H0: α = 0. We first estimate θ by θ^ using the 1 penalized estimator β^ in (2.1). Next, we calculate a linear combination of the partial score function θL(0,θ^) to best approximate αL(0,θ^). The population version of the vector of coefficients in the best linear combination can be calculated as

w=argminE{αL(0,θ)wTθL(0,θ)}2=E{θL(0,θ)θL(0,θ)T}1E{θL(0,θ)αL(0,θ)}=Hθθ1Hθα, (3.1)

where the last equality is by the second Bartlett identity (Tsiatis, 1981). In fact, wTθℒ(0, θ) can be interpreted as the projection of ∇αℒ(0, θ) onto the linear span of the partial score function ∇θℒ(0, θ). In high dimensions, one cannot directly estimate w by the corresponding sample version since the problem is ill-posed. Motivated by the definition of w in (3.1), we estimate it by the Dantzig selector,

w^=wd1argminw1,subject toαθ2L(β^)wTθθ2L(β^)λ, (3.2)

where λ′ is a tuning parameter. Since w is of high dimension d − 1, we impose the sparsity condition on w. Given θ^ and w^, we propose a decorrelated score function for α as

U^(α,θ^)=αL(α,θ^)w^TθL(α,θ^). (3.3)

Geometrically, the decorrelated score function is approximately orthogonal to any component of the nuisance score function ∇θℒ(0, θ). This orthogonality property, which does not hold for the original score function αL(α,θ^), reduces the variability caused by the nuisance parameters. A geometric illustration of the decorrelation-based methods is provided in Figure 1, which also incorporates the illustration of the decorrelated Wald and partial likelihood ratio tests to be introduced in the following subsections. Technically, the uncertainty of estimating θ in the partial score function αL(α,θ^) can be reduced by subtracting the decorrelation term w^TθL(α,θ^). As will be shown in the next section, this is a key step to establish the result that the decorrelated score function U^(0,θ^) weakly converges to N(0, Hα|θ) under the null, where Hα|θ=HααHαθHθθ1Hθα. This further explains why the decorrelated score function U^(α,θ^) rather than the original score function αL(α,θ^) should be used as the inferential function in high dimensions. On the other hand, in the low dimensional setting, it can be shown that the decorrelated score function U^(α,θ^) is asymptotically equivalent to the profile partial score function S(α).

Figure 1.

Figure 1

Geometric illustration of the decorrelated score, Wald and partial likelihood ratio tests. The purple surface corresponds to the log-partial likelihood function. The orange plane is the tangent plane of the surface at point (α,θ^). The two red arrows in the orange plane represent ∇αℒ and ∇θℒ. The correlated score function in blue is the projection of ∇αℒ onto the space orthogonal to ∇θℒ. Given Lasso estimator α^, the decorrelated Wald estimator is α=α^δ, where δ={U^(α^,θ^)/α}1U^(α^,θ^). The decorrelated partial likelihood ratio test compares the log-partial likelihood function values at (α,θ^) and (α,θ^αw^).

To test if α* = 0, we need to standardize U^(0,θ^) in order to construct the test statistic. We estimate Hα|θ by

H^α|θ=αα2L(α^,θ^)w^Tθα2L(α^,θ^). (3.4)

Hence, we define the decorrelated score test statistic as

S^n=nH^α|θ1U^2(0,θ^),whereU^(0,θ^)andH^α|θare defined in(3.3)and(3.4). (3.5)

In the next section, we show that under the null, S^n converges weakly to a chi-squared distribution with one degree of freedom. Given a significance level η ∈ (0,1), the score test ψS(η) is

ψS(η)={0ifS^nχ12(1η)1otherwise, (3.6)

where χ12(1η) denotes the (1 − η)th quantile of a chi-squared random variable with one degree of freedom, and the null hypothesis α = 0 is rejected if and only if ψS(η) = 1.

3.2 Confidence Intervals and Decorrelated Wald Test

The decorrelated score test does not provide a confidence interval for α with a desired coverage probability. In low dimensions, by examing the limiting distribution of the maximum partial likelihood estimator, we can get a confidence interval for α (Andersen and Gill, 1982), which is equivalent to the classical Wald test. This subsection extends the classical Wald test for the proportional hazards model to high dimensional settings to construct confidence intervals for the parameters of interest.

The key idea of performing Wald test is to derive a regular estimator for α. Our procedure is based on the deccorelated score function U^(α,θ^) in (3.3). Since U^(α,θ^) serves as an approximately unbiased estimating equation for α, the root of the equation U^(α,θ^)=0 with respect to α defines an estimator for α*. However, searching for the root may be computationally intensive, especially when α is multi-dimensional. To reduce the computational cost, we exploit a closed-form estimator α obtained by linearizing U^(α,θ^)=0 at the initial estimator α^. More specifically, let β^=(α^,θ^T)T be the 1 penalized estimator in (2.1), we adopt the following one-step estimator,

α=α^{U^(α^,θ^)α}1U^(α^,θ^),whereU^(α^,θ^)=αL(α^,θ^)w^TθL(α^,θ^). (3.7)

In the next section, we prove that n(αα) converges weakly to N(0,Hα|θ1). Hence, let Z1−η/2 be the (1 − η/2)-th quantile of N(0, 1). We show that

[αn1/2Z1η/2H^α|θ1/2,α+n1/2Z1η/2H^α|θ1/2]

is a 100(1 − η)% confidence interval for α.

From the perspective of hypothesis testing, the decorrelated Wald test statistic for H0: α = 0 versus H1: α ≠ 0 is

W^n=nH^α|θα2,whereαandH^α|θare defined in(3.7)and(3.4),respectively. (3.8)

Consequently, the decorrelated Wald test at significance level η is

ψW(η)={0ifW^nχ12(1η),1otherwise, (3.9)

and the null hypothesis α = 0 is rejected if and only if ψW(η) = 1.

3.3 Decorrelated Partial Likelihood Ratio Test

In low dimsional settings, the partial likelihood ratio test statistic is PLRT=2n{L(0,θ^P(0))L(α^P,θ^P)} where θ^P(0)=argminθL(0,θ) and (α^P,θ^P)=argminα,θL(α,θ) are the maximum partial likelihood estimators under the null and alternative, respectively. Hence, PLRT evaluates the validity of the null hypothesis by comparing the partial likelihood under H0 with that under H1. Similar to the partial score test, the partial likelihood ratio test also fails in the high dimensional setting due to the presence of a large number of nuisance parameters. In this section, we propose a new version of the partial likelihood ratio test which is valid in high dimensions.

To handle the impact of high dimensional nuisance parameters, we define the (negative) decorrelated partial likelihood for α as Ldecor(α)=L(α,θ^αw^). The reason for this name is that the derivative of ℒdecor(α) with respect to α evaluated at α = 0 is identical to the decorrelated score function U^(0,θ^) in (3.3). The decorrelated partial likelihood ℒdecor(α) plays the same role as the profile partial likelihood L(α,θ^(α)) in the low dimensional setting. Hence, the decorrelated partial likelihood ratio test statistic is defined as

L^n=2n{Ldecor(0)Ldecor(α)},whereLdecor(α)=L(α,θ^αw^), (3.10)

and α is given in (3.7). As discussed in the previous subsection, α is a one-step approximation of the global minimizer of ℒdecor(α). Hence, the log-likelihood ratio L^n evaluates the validity of the null hypothesis by comparing the decorrelated partial likelihood under H0 with that under H1. This is a natural extension of the classical partial likelihood ratio test to the high dimensional setting.

In the next section, we show that L^n converges weakly to a chi-squared distribution with one degree of freedom. Therefore, a decorrelated partial likelihood ratio test with significance level η is

ψL(η)={0ifL^nχ12(1η)1otherwise, (3.11)

and ψL(η) = 1 indicates a rejection of the null hypothesis.

4 Asymptotic Properties

In this section, we derive the limiting distributions of the decorrelated test statistics under the null hypothesis. More detailed proofs are provided in Appendix A. In our analysis, we make the following regularity assumptions.

Assumption 4.1

The true hazard is uniformly bounded, i.e., supt[0,τ]maxi[n]|XiT(t)β|=O(1).

Assumption 4.2

It holds that w0 = s′ ≍ s, and supt[0,τ]maxi[n]|Xi,2:dT(t)w|=O(1).

Assumption 4.3

The Fisher information matrix is bounded, H=O(1), and its minimum eigenvalue is also bounded from below, Λmin(H) ≥ Ch > 0, which implies that Hα|θ=HααHαθHθθ1HθαCh.

To connect these assumptions with existing literature, Assumptions 4.1 and 4.2 extend Assumption (iv) of Theorem 3.3 in van de Geer et al. (2014a) to the proportional hazards model. In particular, the sparsity assumption of w ensures that the Dantzig selector w^ converges to w at a fast rate. Assumption 4.3 is related to the Fisher information matrix, which is essential even in low dimensional settings.

Our main result characterizes the asymptotic normality of the decorrelated score function U^(0,θ^) in (3.3) under the null.

Theorem 4.4

Under Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3, let λn1logd, λsn1logd and n−1/2s3 log d = o(1). Under the null hypothesis that α = 0, the decorrelated score function U^(0,θ^) defined in (3.3) satisfies

nU^(0,θ^)dZ,whereZ~N(0,Hα|θ), (4.1)

and Hα|θ=HααHαθHθθ1Hθα.

As we have discussed before, the limiting variance of the decorrelated score function can be estimated by H^α|θ=αα2L(α^,θ^)w^Tθα2L(α^,θ^). The next lemma shows the consistency of H^α|θ.

Lemma 4.5

Suppose Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3 hold. If λn1logd and λsn1logd, we have

|Hα|θH^α|θ|=O(s2logdn),

where H^α|θ is defined in (3.4).

By Theorem 4.4 and Lemma 4.5, the next corollary shows that under the null hypothesis, type I error of the decorrelated score test ψS(η) in (3.6) converges asymptotically to the significance level η. Let the associated p-value of the decorrelated score test be PS=2{1Φ(S^n)}, where Φ(·) is the cumulative distribution function of the standard normal random variable and S^n is the score test statistic defined in (3.5). The distribution of PS converges to a uniform distribution asymptotically.

Corollary 4.6

Suppose Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3 hold, λn1logd, λsn1logd and n−1/2s3 log d = o(1). The decorrelated score test and the its corresponding p-value satisfy

limx(ψS(η)=1|α=0)=η,andPSdUnif[0,1],whenα=0,

where Unif[0, 1] denotes a random variable uniformly distributed in [0, 1].

We then analyze the decorrelated Wald test under the null. We derive the limiting distribution of the one-step estimator α defined in (3.7) in the next theorem.

Theorem 4.7

Suppose Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3 hold, and λn1logd, λsn1logd, n−1/2s3 log d = o(1). When the null hypothesis α = 0 holds, the decorrelated estimator α satisfies

nαdZ,whereZ~N(0,Hα|θ1). (4.2)

Utilizing the asymptotic normality of α, we can establish the limiting type I error of ψW (η) in (3.9), in the next corollary. Note that, it is straightforward to generalize the result to be n(αα)dZ, where Z~N(0,Hα|θ1) for any α. This gives us a confidence interval of α.

Corollary 4.8

Under Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3, suppose λn1logd, λsn1logd and n−1/2s3 log d = o(1). The type I error of the decorrelated Wald test ψW(η) and its corresponding p-value PW=2{1Φ(W^n)} satisfy

limn(ψW(η)=1|α=0)=η,andPWdUnif[0,1]whenα=0.

In addition, an asymptotic (1 − η) × 100% confidence interval of α is

(αΦ1(1η/2)nH^α|θ,α+Φ1(1η/2)nH^α|θ).

Finally, we characterize the limiting distribution of the decorrelated partial likelihood ratio test statistic L^n introduced in (3.10).

Theorem 4.9

Suppose Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3 hold, λn1logd, λsn1logd and n−1/2s3 log d = o(1). If the null hypothesis α = 0 holds, the decorrelated likelihood ratio test statistic L^n in (3.10) satisfies

L^ndZχ,whereZχ~χ12. (4.3)

This theorem justifies the decorrelated partial likelihood ratio test ψL(η) in (3.11). Also, let the p-value associated with the decorrelated partial likelihood ratio test be PL=1F(L^n), where F(·) is the cumulative distribution function of χ12. Similar to Corollaries 4.6 and 4.8, we characterize the type I error of the test ψL(η) in (3.11) and its corresponding p-value below.

Corollary 4.10

Suppose Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3 hold, λn1logd, λsn1logd and n−1/2s3 log d = o(1). The type I error of the decorrelated partial likelihood ratio test ψL(η) with significance level η and its associated p-value PL satisfy

limx(ψL(η)=1|α=0)=η,andPLdUnif[0,1]whenα=0.

By Corollaries 4.6, 4.8 and 4.10, we see that the decorrelated score, Wald and partial likelihood ratio tests are asymptotically equivalent as summarized in the next corollary.

Corollary 4.11

Suppose Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3 hold, λn1logd, λsn1logd and n−1/2s3 log d = o(1). If the null hypothesis α* = 0 holds, the test statistics S^n in (3.5), W^n in (3.8), and L^n in (3.10) are asymptotically equivalent, i.e.,

S^n=W^n+o(1)=L^n+o(1).

To summarize this subsection, Corollaries 4.6, 4.8 and 4.10 characterize the asymptotic distributions of the proposed decorrelated test statistics under the scaling when n−1/2s3 log d = o(1) under the null hypothesis. It is known that Hα|θ is the semiparametric information lower bound for inferring α. Theorem 4.7 shows that α achieves the semiparametric information bound, which indicates the semiparametric efficiency of α. Using the asymptotic equivalence in Corollary 4.11, all of our test statistics are semiparametrically efficient (van der Vaart, 2000).

Remark 4.12

All the theoretical results in this section are still valid if we replace the Lasso penalty with nonconvex SCAD or MCP penalties as long as the consistency result (2.2) holds.

Remark 4.13

When the model is misspecified, we denote the oracle parameter as

βo=argminβE{L(β)},

where E is the expectation under the true model. Our proposed methods are still applicable to test if β1o=0 and construct confidence intervals for β1o.

Remark 4.14

Existing works mainly consider high dimensional inferences for linear and generalized models; see Lockhart et al. (2014); Chernozhukov et al. (2013); van de Geer et al. (2014b); Javanmard and Montanari (2013) and Zhang and Zhang (2014). More specifically, Lockhart et al. (2014) consider conditional inference, while we consider unconditional inference. The others propose estimators that are asymptotically normal. Compared with existing approaches, we provide a unified framework which are more general in two aspects: (i) Our framework can deal with nonconvex penalties, while it is unclear if existing works are still valid under nonconvex penalities. (ii) Our framework based on the decorrelated score function provides a natural approach to deal with the misspecified model. In contrast, most existing methods assume the model must be correct.

5 Inference on the Baseline Hazard Function

The baseline hazard function

Λ0(t)=0tλ0(u)du

is treated as a nuisance function in the log-partial likelihood method. In practice, inferences on the baseline hazard function is also of interest. To the best of our knowledge, estimating the baseline hazard function or the survival function and construct confidence intervals in high dimensions remains unexplored. In this section, we extend the decorrelation approach to construct confidence intervals for the baseline hazard function and the survival function. All the proof details are provided in Appendix B.

We consider the following Breslow-type estimator for the baseline hazard function. Given an 1-penalized estimator β^ derived from (2.1), the direct plug-in estimator for the baseline hazard function at time t is

Λ^0(t,β^)=0ti=1ndNi(u)i=1nYi(u)exp{XiT(u)β^}. (5.1)

Since the plug-in estimator β^ does not posses a tractable distribution, inference based on the estimator Λ^0(t,β^) is difficult. To handle this problem, we adopt the decorrelation approach as in the previous sections and estimate Λ0(t) by the sample version of Λ^0(t,β^){Λ0(t,β)}TH1L(β^), where

Λ0(t,β)=E0tdNi(u)S(0)(u,β),

and the gradient ∇Λ0(t, β*) is taken with respect to the corresponding β component, and H* is the Fisher information matrix defined in (2.7). Similar to Section 3.1, we directly estimate H1Λ^0(t,β^) by the following Dantzig selector

u^(t)=argminu(t)1,subject toΛ^0(t,β^)2L(β^)u(t)δ, (5.2)

where δ is a tuning parameter. It can be shown that the estimator u^(t) converges to u*(t) = H*−1∇Λ0(t, β*) under the following regularity assumption.

Assumption 5.1

It holds that u(t)0=ssfor all0tτ.

Note that Assumption 5.1 plays the same role as Assumption 4.2 in the previous section. Corollary B.2 in Appendix B characterizes the rate of convergence of u^(t). Hence, the decorrelated baseline hazard function estimator at time t is

Λ0(t,β^)=Λ^0(t,β^)u^(t)TL(β^),whereu^(t)is defined in(5.2). (5.3)

Based on the estimator (5.3), the survival function S0(t) = exp{−Λ0(t)} is estimated by S(t,β^)=exp{Λ0(t,β^)}. The main theorem of this section characterizes the asymptotic normality of Λ0(t,β^) and S(t,β^) as follows.

Theorem 5.2

Suppose Assumptions 2.1, 2.2, 4.1, 4.3 and 5.1 hold, λn1logd, δsn1logd and n−1/2s3 log d = o(1). We have, for any t ∈ [0,τ], the decorrelated baseline hazard function estimator Λ0(t,β^) in (5.3) satisfies

n{Λ0(t)Λ0(t,β^)}dZ,whereZ~N(0,σ12(t)+σ22(t)),

and

σ12(t)=0tλ0(u)duE[exp{XT(u)β}Y(u)]andσ22(t)=Λ0(t,β)TH1Λ0(t,β). (5.4)

The estimated survival function S(t,β^) satisfies

n{S(t,β^)S0(t)}dZ,whereZ~N(0,σ12(t)+σ22(t)exp(2Λ0(t))).

Given Theorem 5.2, we further need to estimate the limiting variances σ12(t) and σ22(t). To this end, we use

σ^12(t)=0tdΛ^0(u,β^)n1i=1nexp{XiT(u)β^}Yi(u)andσ^22(t)={Λ^0(t,β^)}Tu^(t),

where Λ^0(t,β^) is defined in (5.1).

We conclude this section by the following corollary which provides confidence intervals for Λ0(t) and S0(t).

Corollary 5.3

Suppose Assumptions 2.1, 2.2, 4.2, 4.3 and 5.1 hold, λn1logd, δsn1logd and n−1/2s3 log d = o(1). For any t > 0 and 0 < η < 1,

limx(|Λ0(t)Λ0(t,β^)|Φ1(1η/2){σ^12(t)+σ^22(t)}1/2n)=1η,

and

limx(|S0(t)S0(t,β^)|Φ1(1η/2){σ^12(t)+σ^22(t)}1/2exp{Λ0(t,β^)}n)=1η.

6 Numerical Results

This section reports numerical results of our proposed methods using both simulated and real data. We test the methods proposed in Section 3 and Section 5 by considering empirical behaviors for inferences on the individual regression coefficients βj’s and the baseline hazard function Λ0(t).

6.1 Inference on the Parametric Component

We first investigate empirical performances of the decorrelated score, Wald and partial likelihood ratio tests on the parametric component β as proposed in Section 3. To estimate β and w, we choose the tuning parameters λ by 10-fold cross-validation and set λ=12n1logd. We find that our simulation results are insensitive to the choice of λ′. We conduct decorrelated score, Wald and partial likelihood ratio tests for β1 which is set to be 0 under null hypothesis H0: β1 = 0 versus alternative Ha: β1 ≠= 0, where we set the significance level to be η = 0.05. In each setting, we simulate n = 150 independent samples from a multivariate Gaussian distribution Nd(0, Σ) for d = 100, 200, or 500, where Σ is a Toeplitz matrix with Σjk = ρ|jk| and ρ = 0.25, 0.4, 0.6 or 0.75. The cardinality of the active set s is either 2 or 3, and the regression coefficients in the active set are either all 1’s (Dirac) or drawn randomly from the uniform distribution Unif[0, 2]. We set the baseline hazard rate function to be identity. Thus, the i-th survival time follows an exponential distribution with mean exp(XiTβ). The i-th censoring time is independently generated from an exponential distribution with mean U×exp(XiTβ), where U ~ Unif[1, 3]. As discussed in Fan and Li (2002), this censoring scheme results in about 30% censored samples.

The above simulation is repeated 1,000 times. The empirical type I errors of the decorrelated score, Wald and partial likelihood ratio tests are summarized in Tables 1 and 2. We see that the empirical type I errors of all three tests are close to the desired 5% significance level, which supports our theoretical results. This observation holds for the whole range of ρ, s and d specified in the data generating procedures. In addition, as expected, the empirical type I errors further deviate from the significance level as d increases for all three tests, illustrating the effects of dimensionality d on finite sample performance.

Table 1.

Average Type I error of the decorrelated tests with η = 5% where (n, s) = (150, 2).

Method d ρ = 0.25 ρ = 0.4 ρ = 0.6 ρ = 0.75

Dirac Unif[0, 2] Dirac Unif[0, 2] Dirac Unif[0, 2] Dirac Unif[0, 2]
Score 100 5.1% 5.2% 5.1% 4.9% 5.2% 5.1% 4.9% 5.0%
200 5.2% 4.8% 5.3% 4.8% 5.3% 5.6% 4.7% 4.6%
500 6.1% 6.4% 5.5% 4.6% 4.2% 4.4% 3.9% 3.7%

Wald 100 5.2% 5.3% 5.1% 5.0% 5.2% 4.9% 5.0% 5.1%
200 5.4% 4.7% 5.3% 4.8% 4.6% 4.7% 4.3% 4.6%
500 6.3% 6.1% 5.9% 5.5% 5.8% 4.2% 4.5% 3.9%

PLRT 100 4.9% 4.8% 5.1% 5.2% 5.0% 5.2% 4.8% 4.7%
200 5.7% 5.5% 5.3% 5.5% 4.8% 5.6% 4.6% 4.5%
500 6.2% 6.2% 5.9% 5.3% 4.5% 4.2% 3.8% 3.6%

Table 2.

Average type I error of the decorrelated tests with η = 5% where (n, s) = (150, 3).

d ρ = 0.25 ρ = 0.4 ρ = 0.6 ρ = 0.75

Dirac Unif[0, 2] Dirac Unif[0, 2] Dirac Unif[0, 2] Dirac Unif[0, 2]
Score 100 5.2% 5.2% 4.8% 5.3% 5.3% 4.9% 5.3% 4.8%
200 5.2% 4.6% 4.7% 5.3% 5.4% 5.8% 4.5% 4.8%
500 6.3% 6.5% 5.8% 4.4% 5.2% 4.6% 3.6% 3.4%

Wald 100 5.1% 4.9% 5.3% 4.7% 5.2% 4.9% 5.0% 5.1%
200 4.8% 4.6% 4.9% 5.1% 5.2% 5.7% 4.2% 4.4%
500 6.5% 6.8% 6.2% 5.9% 5.1% 4.5% 3.9% 4.2%

PLRT 100 5.3% 5.2% 5.0% 5.3% 5.4% 5.2% 4.9% 4.8%
200 5.5% 5.3% 5.4% 4.6% 5.2% 5.7% 5.4% 4.3%
500 6.5% 6.3% 5.7% 5.5% 4.8% 4.1% 3.7% 3.2%

We also investigate the empirical power of the proposed tests. Instead of setting β1 = 0, we generate the data with β1 = 0.05, 0.1, 0.15, …, 0.55, following the same simulation scheme introduced above. We plot the rejection rates of the three decorrelated tests for testing H0 : β1 = 0 with significance level 0.05 and ρ = 0.25 in Figure 2. We see that when d = 100, the three tests share similar power. However, for larger d (e.g., d = 500), the decorrelated partial likelihood ratio test is the most powerful test. In addition, the Wald test is less effective for problems with higher dimensionality. Based on our simulation results, we recommend the decorrelated partial likelihood ratio test for inference in high dimensional problems.

Figure 2.

Figure 2

Empirical rejection rates of the decorrelated score, Wald and partial likelihood ratio tests on simulated data with different active set sizes and dimensionality.

6.2 Inference on the Baseline Hazard Function on Simulated Data

In this section, we demonstrate the empirical performance of the decorrelated inference procedure on the baseline hazard function Λ0(t) proposed as in Section 5. We consider three scenarios with Λ0(t) = t, t2/2 and t3/3. Note that when Λ0(t) = p−1tp, the survival time follows a Weibull distribution with shape parameter p and scale parameter {pexp(XiTβ)}1/p, i.e., W(p,{pexp(XiTβ)}1/p). We use the same data generating procedures for the covariate Xi’s, parameter β and censoring time R as in the previous subsection.

In each simulation, we construct 95% confidence intervals for Λ0(t) at t = 0.2 using the procedures proposed in Section 5. The simulation is repeated 1,000 times. The results for the empirical coverage probabilities of Λ0(t) are summarized in Tables 3 and 4. It is seen that the coverage probabilities are all between 93% and 97%, which matches our theoretical results.

Table 3.

Empirical coverage probability of 95% confidence intervals for Λ0(t) at t = 0.2 with (n, s) = (150, 2)

Λ0(t) d ρ = 0.25 ρ = 0.4 ρ = 0.6 ρ = 0.75

Dirac Unif[0, 2] Dirac Unif[0, 2] Dirac Unif[0, 2] Dirac Unif[0, 2]
t 100 95.3% 95.1% 94.7% 95.1% 95.2% 94.6% 95.4% 94.9%
200 95.5% 95.8% 95.7% 95.3% 94.6% 94.5% 94.4% 94.2%
500 95.9% 96.2% 95.5% 94.8% 94.3% 94.1% 93.7% 93.5%

t 2 100 95.1% 95.3% 95.2% 95.0% 95.4% 94.7% 95.2% 95.3%
200 95.5% 94.8% 95.4% 94.7% 94.6% 94.0% 94.4% 94.5%
500 96.6% 96.7% 96.1% 95.4% 94.9% 94.3% 93.8% 93.6%

t 3 100 95.2% 95.0% 95.1% 95.3% 94.8% 95.1% 95.2% 94.7%
200 95.4% 94.7% 94.6% 95.5% 95.2% 95.8% 94.6% 94.3%
500 96.6% 95.9% 96.3% 95.9% 94.5% 94.7% 93.6% 93.4%

Table 4.

Empirical coverage probability of 95% confidence intervals for Λ0(t) at t = 0.2 with (n, s) = (150, 3)

Λ0(t) d ρ = 0.25 ρ = 0.4 ρ = 0.6 ρ = 0.75

Dirac Unif[0, 2] Dirac Unif[0, 2] Dirac Unif[0, 2] Dirac Unif[0, 2]
t 100 95.1% 94.8% 94.8% 95.2% 95.3% 95.1% 94.8% 95.4%
200 95.6% 95.3% 95.4% 95.2% 94.7% 94.8% 94.2% 94.3%
500 96.2% 95.9% 95.8% 96.1% 95.2% 94.3% 93.3% 93.6%

t 2 100 95.3% 94.7% 95.3% 94.9% 94.5% 95.3% 95.4% 95.2%
200 94.7% 94.5% 95.4% 95.2% 94.1% 94.9% 94.3% 93.8%
500 96.5% 96.2% 95.8% 96.0% 95.5% 95.1% 93.2% 93.7%

t 3 100 95.0% 95.2% 94.6% 94.8% 95.1% 95.4% 94.9% 95.5%
200 95.3% 95.5% 95.2% 94.5% 94.3% 94.6% 93.8% 93.5%
500 95.9% 96.3% 95.7% 96.0% 95.4% 94.7% 93.6% 93.1%

To further examine the performance of our method, we conduct additional simulation studies by plotting the 95% confidence intervals of Λ0(t) at t = 0.05, 0.1, 0.15, …, 0.5, with Λ0(t) = t and t2/2. The results are presented in Figures 3 and 4.

Figure 3.

Figure 3

95% confidence intervals for the baseline hazard function at t = 0.05, 0.1, …, 0.5. The red solid line denotes the estimated baseline hazard function Λ(t), and blue dashed line denotes Λ0(t) = t.

Figure 4.

Figure 4

95% confidence intervals for the baseline hazard function at t = 0.05, 0.1, …, 0.5. The red solid line denotes the estimated baseline hazard function Λ(t), and the blue dashed line denotes Λ0(t) = t2/2.

6.3 Analyzing a Gene Expression Dataset

We apply the proposed testing procedures to analyze a genomic data set, which is collected from a diffuse large B-cell lymphoma study analyzed by Alizadeh et al. (2000). One of the goals in this study is to investigate how the gene expression levels in B-cell malignancies are associated with the survival time. The expression values for over 13,412 genes in B-cell malignancies are measured by microarray experiments. The data setcontains 40 patients with diffuse large B-cell lymphoma who are recruited and followed until death or the end of the study. A small proportion (≈5%) of the gene expression values are not well measured and are treated as missing values by Alizadeh et al. (2000). For simplicity, we impute the missing values of each gene by the median of the observed values of the same gene. The average survival time is 43.9 months and the censored rate is 55%. Since the sample size n = 40 is small, we conduct pre-screening by fitting univariate proportional hazards models and only keep d = 200 genes with the smallest p-values.

We apply the proposed score, Wald and partial likelihood ratio tests to the pre-screened data. The same strategy for choosing the tuning parameters as that in the simulation studies is adopted. We repeatedly apply the hypothesis tests for all parameters. To control the family-wise error rate due to the multiple testing, the p-values are adjusted by the Bonferroni’s method. To be more conservative, we only report the genes with adjusted p-values less than 0.05 by all of the three methods in Table 5. Many of the genes which are significant in the hypothesis tests are biologically related to lymphoma. For instance, the relation between lymphoma and genes FLT3 (Meierhoff et al., 1995), CDC10 (Di Gaetano et al., 2003), CHN2 (Nishiu et al., 2002) and Emv11 (Hiai et al., 2003) have been experimentally confirmed. This provides evidence that our methods can be used to discover scientific findings in applications involving high dimensional datasets.

Table 5.

Genes with the adjuste p-values less than 0.05 using score, Wald and partial likelihood ratio tests for the large B-cell lymphoma gene expression dataset.

Gene Score Wald PLRT
FLT3 1.01 × 10−2 2.86 × 10−2 1.72 × 10−2
GPD2 3.91 × 10−2 4.67 × 10−3 7.44 × 10−3
PTMAP1 7.86 × 10−3 4.84 × 10−3 3.75 × 10−3
CDC10 3.52 × 10−3 2.63 × 10−3 1.10 × 10−3
Emv11 4.96 × 10−3 2.77 × 10−4 3.49 × 10−4
CHN2 1.79 × 10−2 2.73 × 10−2 3.58 ×10−3
Ptger2 1.78 ×10−2 1.32 × 10−2 2.47 × 10−3
Swq1 4.04 × 10−3 4.21 × 10−2 3.67 × 10−2
Cntn2 4.05 × 10−3 4.84 × 10−2 4.03 × 10−2

7 Discussion

We proposed a novel decorrelation-based approach to conduct inference for both the parametric and nonparametric components of high dimensional Cox’s proportional hazards models. Unlike existing works, our methods do not require conditions on model selection consistency or minimal signal strength. Theoretical properties of the proposed methods are established. Extensive numerical investigations are conducted on the simulated and real datasets to examine the finite sample performances of our methods. To the best of our knowledge, this paper for the first time provides a unified framework on uncertainty assessment of high dimensional Cox’s proportional hazards models. Our methods can be extended to conduct inference for other high-dimensional survival models such as censored linear model (Müller and van de Geer, 2014) and additive hazards model (Lin and Lv, 2013).

In this paper, we focus on the Cox’s proportional hazards model for the univariate survival data. In practice, many biomedical studies involve multiple survival outcomes. For instance, in the Framingham Heart Study by Dawber (1980), both time to coronary heart disease and time to cerebrovascular accident are observed. How the inference can be drawn by jointly analyzing the multivariate survival data in the high dimensional setting remains largely unexplored. To address this problem, we extend the proposed hypothesis testing procedures to deal with the multivariate survival data. More details are presented in Appendix D.

The proposed methods involve two tuning parameters λ and λ′. The presence of multiple tuning parameters in the inferential procedures is encountered in many recent works even under high dimensional linear models (Chernozhukov et al., 2013; van de Geer et al., 2014b; Javanmard and Montanari, 2013; Zhang and Zhang, 2014). Theoretically, we establish the asymptotic normality of the test statistics when λn1logd and λsn1logd. Empirically, our numerical results suggest that cross-validation seems to be a practical procedure for the choice of λ. As an important future investigation, it is of interest to provide rigorous theoretical justification of practical procedures such as cross-validation for the choice of tuning parameters.

Supplementary Material

Supplemental Material

Acknowledgments

We thank Professor Bradic for providing very helpful comments. This research is partially supported by the grants NSF CAREER DMS 1454377, NSF IIS1408910, NSF IIS1332109, NIH R01MH102339, NIH R01GM083084, and NIH R01HG06841.

A Proofs in Section 4

In this section, we provide the detailed proofs in Section 4. We first provide a key lemma which characterizes the asymptotic normality of ℒ(β). This lemma is essential in our later proofs to derive the asymptotic distributions of the test statistics.

Lemma A.1

Under Assumptions 2.1, 4.2 and 4.3, for any vector v ∈ ℝd, if v0s and n1/2s3logd=o(1), it holds that

nvTL(β)vTHvdN(0,1),whereHis defined in(2.7).

Proof

Let Mi(t)=Ni(t)0tYi(u)λ0(u)du. By the definition of ∇ℒ(β*) in (2.4), we have

L(β)=1ni=1n0τ{Xi(u)Z¯n(u,β)}dMi(u)=1ni=1n0τ{Xi(u)e(u,β)}dMi(u)1ni=1n0τ{e(u,β)X¯n(u,β)}dMi(u),

Thus, by the identity H=nVar{L(β)}, we have

nvTL(β)vTHv=1nvTvTHvi=1n0τ{Xi(u)e(u,β)}dMi(u)S1nvTvTHvi=1n0τ{e(u,β)X¯n(u,β)}dMi(u)E.

For the first term S, denote by

ξi=vTvTHv0τ{Xi(u)e(u,β)}dMi(u).

We have E(ξi)=0, and Var(n−1/2S) = 1. Thus S is a sum of n independent random variables with mean 0. To get the asymptotic distribution of n−1/2S, we verify the Lyapunov condition. Indeed, we have

1n3/2i=1nE|vTvTHv0τ{Xi(u)e(u,β)}dMi(u)|3CCh3/2n3/2i=1ns3/2supu[0,τ]Xi(u)e(u,β)3=O(s3/2n1/2),

where the inequality follows by Assumption 4.3 for some constant C, and the equality holds by Lemma C.1 and Assumption 2.1. Thus, the Lyapunov condition holds by our scaling assumption that s3/2n−1/2 = o(1). Apply Lindeberg Feller Central Limit Theorem, we have n1/2SdN(0,1).

Next, we prove that the second term E = o(1). Since

E=1nvTvTHvi=1n0τ[{e(u,β)X¯n(u,β)}1dMi(u)]1ns1/2λminsupu[0,τ]e(u,β)X¯n(u,β)0τ|i=1n1dMi(u)|.

By Lemma C.1, it holds that supu[0,τ]e(u,β)X¯n(u,β)=O(n1logd). It holds that, for some constant C > 0,

ECn1λminslogdn0τ|i=1n1dMi(u)|.

It remains to bound the term 0τ|i=1n1dMi(u)|. By Theorem 2.11.9 and Example of 2.11.16 of van der Vaart and Wellner (1996), G¯(t):=n1/2i=1nMi(t) converges weakly to a tight Gaussian process G(t). Furthermore, by Strong Embedding Theorem of Shorack and Wellner (2009), there exists another probability space such that (S(0)(β,t),S(1)(β,t),G¯(t)) converges almost surely to (s(0)(β,t),s(1)(β,t),G(t)), where * indicates the existences in a new probability space. This implies that 0τ|dG(t)|=0τ|dG(t)|+o(1). We have, by our assumption n1slogd=o(1), the term E satisfies that

E=O(slogdn1n)=o(1).

Combining this with the result that n1/2SdN(0,1) concludes the proof. □

Next, we characterize the rate of convergence of the Dantzig selector w^ in (3.2) in the following lemma.

Lemma A.2

Under Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3, If λsn1logd, we have

w^w1=O(ssn1logd), (A.1)

where w^ and w* are defined in (3.2) and (3.1), respectively.

Proof

As shown in Lemma C.6, under Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3, the condition (C.7) in Lemma C.8 is satisfied for λsn1logd. Consequently, we have

w^w1=O(ssn1logd),

which concludes the proof. □

Proof of Theorem 4.4

To derive the asymptotic distribution of nU^(0,θ^), we start with decomposing U^(0,θ^) into several terms.

U(0,θ^)=αL(0,θ^)w^TθL(0,θ^)=αL(0,θ)+αθ2L(0,θ¯)(θ^θ){w^TθL(0,θ)+w^Tθθ2L(0,θ)(θ^θ)}=αL(0,θ)wTθL(0,θ)S+(ww^)TθL(0,θ)E1+{αθ2L(0,θ¯)w^Tθθ2L(0,θ)}(θ^θ)E2, (A.2)

where the second equality holds by the mean value theorem for some θ¯=θ+u(θ^θ), θ=θ+u(θ^θ) and u,u[0,1].

We consider the terms S, E1 and E2 separately. For the first term S, by Lemma A.1, taking v=(1,wT)T. We have,

nSdZ,whereZ~N(0,Hα|θ). (A.3)

For the term E1, we have,

E1w^w1θL(0,θ)=O(sλn1logd), (A.4)

where w^w1=O(sλ) by Lemma C.8, and θL(0,θ)=O(n1logd) by Lemma C.3.

For the term E2, we have,

E2={αθ2L(0,θ¯)HαθHθθ1θθ2L(0,θ)}(θ^θ)E21+(ww^)Tθθ2L(0,θ)(θ^θ)E22. (A.5)

Considering the terms E21 and E22 separately, first, we have,

E21={αθ2L(0,θ¯)Hαθ+HαθHαθHθθ1θθ2L(0,θ)}(θ^θ)αθ2L(0,θ¯)Hαθθ^θ1+|Hαθ(Id1Hθθ1θθ2L(0,θ))(θ^θ)|, (A.6)

where the inequality holds by Hölder’s inequality. For the first term in the above inequality, we have

αθ2L(0,θ¯)Hαθθ^θ1=O(s2λ2), (A.7)

since θ^θ1=O(sλ) by (2.2) and αθL(0,θ¯)Hαθ=O(sλ) by Lemma C.5.

For the second term in (A.6), by Hölder’s inequality, we have

|Hαθ(Id1Hθθ1θθ2L(0,θ))(θ^θ)|=|HαθHθθ1(Hθθθθ2L(0,θ))(θ^θ)|w1Hθθθθ2L(0,θ)θ^θ1=O(ss2λ2), (A.8)

where the last equality holds since w1=O(s) by Assumption 4.2, Hθθθθ2L(0,θ)=O(sλ) by Lemma C.5, and θ^θ1=O(sλ) by (2.2). Plugging (A.7) and (A.8) into (A.6), we have

|E21|=O(ss2λ2). (A.9)

For the second term E22 in (A.5), we have,

|E22|w^w1θθ2L(0,θ)θ^θ1=O(ssλλ), (A.10)

where we use the results that w^w1=O(sλ) by Lemma C.8, θ^θ1O(sλ) by (2.2), and θθ2L(0,θ)=O(1) by Lemma C.5.

Plugging (A.6) and (A.10) into (A.5), we have E2=O(n1ss2logd). Combining it with (A.4), we have

|E1|+|E2|=O(ss2logdn)=o(1n), (A.11)

where the last equality holds by the assumption that n−1/2s3 log d = o(1) and ss′. Combining (A.11), (A.3) and (A.2), our claim (4.1) holds as desired. □

Proof of Lemma 4.5

By the definition of Hα|θ and H^α|θ, we have

|Hα|θH^α|θ||Hαααα2L(α^,θ^)|E1+|HαθHθθ1Hθαw^Tθα2L(α^,θ^)|E2. (A.12)

We consider the two terms separately. For the first term E1, we have by Lemma C.5, E1=O(sλ). For the second term E2, we have,

E2=|HαθHθθ1Hθαw^Tθα2L(α^,θ^)|=|HαθHθθ1Hθαw^THθα+w^THθαw^Tθα2L(α^,θ^)||HαθHθθ1Hθαw^THθα|E21+|w^THθαw^Tθα2L(α^,θ^)|E22.

For the term E21, we have, by Hölder’s inequality,

E21HαθHθθ1w^T1Hθα=O(sλ), (A.13)

where the last inequality holds by the fact that HαθHθθ1w^T1=O(sλ), and Hθα=O(1) by Assumption 4.3.

For the second term E22, we have, by Hölder’s inequality,

E22w^1Hθαθα2L(α^,θ^)=O(ssλ), (A.14)

where the last equality holds by the assumption that w1=O(s), the result w^w=O(sλ) by (A.1) and by Lemma C.5 that Hθαθα2L(α^,θ^)=O(sλ).

Combining (A.13) and (A.14), we have, E2E21+E22=O(sλ). Together with the result that E1=O(s2λ), the claim holds as desired. □

Proof of Theorem 4.7

Based on our construction of α in (3.7), we have

α=α^{U^(α^,θ^)α}1U^(α^,θ^)=α^Hα|θ1U^(α^,θ^)+U^(α^,θ^)[Hα|θ1{U^(α^,θ^)α}1]R1=α^Hα|θ1{U^(0,θ^)+(α^0)U^(α¯,θ^)α}+R1=α^Hα|θ1U^(0,θ^)α^Hα|θ1Hα|θ+α^Hα|θ1{Hα|θU^(α¯,θ^)α}R2+R1=Hα|θ1U^(0,θ^)+R1+R2, (A.15)

where (A.15) holds by the mean value theorem for some α¯=uα^ and u ∈ [0, 1]. For the term R1, note that

|U^(α^,θ^)U^(0,θ^)|=|α^||U^(α¯,θ^)α|

where the equality holds by mean-value theorem with α¯=uα^ for some u ∈ [0, 1]. Under the null hypothesis α* = 0, by Theorem 3.2 of Huang et al. (2013), |α^α|β^β1=O(sλ). By regularity condition Hα|θ=O(1) and Lemma 4.5, it also holds that |U^(α¯,θ^)/α|=O(1). Thus, we have

|U^(α^,θ^)U^(0,θ^)|=O(sλ),and|U^(0,θ^)|=O(n1/2), (A.16)

where the second equality holds by Theorem 4.4. Thus, by triangle inequality, we have

|R1||U^(α^,θ^)U^(0,θ^)||Hα|θ1{U^(α^,θ^)α}1|+|U^(0,θ^)||Hα|θ1{U^(α^,θ^)α}1|=O(s3logdn),

where the last equality holds by (A.16) and Lemma 4.5.

For the term R2, we have,

|R2||α^Hα|θ1||Hα|θU^(α¯,θ^)α|=O(s3logdn),

where the last inequality holds by the fact that |α^|=O(sλ), |Hα|θ|=O(1) and Lemma 4.5.

Consequently, it holds that,

nαdZ,whereZ~N(0,Hα|θ1),

and the last equality follows by Theorem 4.4 and our the assumption that n−1/2s3 log d = o(1). The claim follows as desired. □

Proof of Theorem 4.9

We have

L(α,θ^αw^)L(0,θ^)=ααL(0,θ^)αw^TθL(0,θ^)+α22αα2L(α¯,θ^)+α22w^Tθθ2L(0,θ¯)w^α2w^TθL(α¯,θ^)=αU^(0,θ^)T1+α22{αα2L(α¯,θ^)+w^Tθθ2L(0,θ¯)w^2wTθα2L(α¯,θ¯)}T2, (A.17)

where the first equality follows by the mean-value theorem with α¯=u1α^, α¯=u2α^, θ¯=θ+u3(θ^θ), and θ¯=θ+u4(θ^θ) for some 0u1,u2,u3,u41.

We first look at the term T1. Under the null hypothesis α* = 0, nU^(0,θ^)dZ+o(1) and nα=Hα|θ1Z+o(1) by Theorems 4.4 and 4.7, respectively, where Z ~ N(0,Hα|θ). We have,

T1={n1/2Z+o(n1/2)}{n1/2Hα|θ1Z+o(n1/2)}=n1Z2Hα|θ1+o(n1). (A.18)

Next, we look at the term T2,

T2=α22(Hαα+HαθHθθ1Hθα2HαθHθθ1Hθα)T21+α22[{αα2L(α¯,θ^)Hαα}+{w^Tθθ2L(0,θ¯)w^wHθθw}2{wTθα2L(α¯,θ¯)Hαθw}]T22 (A.19)

It holds by Theorem 4.7 that nαdHαθ1Z. Together with the regularity condition Hα|θ=O(1), we have,

2nT21=nα2Hα|θdHα|θ1Z2. (A.20)

Considering the term T22, we have

T22=α22[{αα2L(α¯,θ^)Hαα}R1+{w^Tθθ2L(0,θ¯)w^wHθθw}R22{wTαθ2L(α¯,θ¯)wTHαθ}R3]. (A.21)

For the first term |R1|, we have, by Lemma C.5, |R1|=|αα2L(α¯,θ^)Hαα|=O(sλ). For the second term,

|R2|=|w^Tθθ2L(0,θ¯)w^wHθθw||(w^w)Tθθ2L(0,θ¯)(w^w)|+2|wθθ2L(0,θ¯)(w^w)|+|wT(θθ2L(0,θ¯)Hθθ)w|θθ2L(0,θ¯)w^w12+2w1θθ2L(0,θ¯)w^w1+w12θθ2L(0,θ¯)Hθθ=O(s2λ2)+O(s2λ)+O(s2sλ), (A.22)

where the last equality follows by (2.2), Lemma C.4, Lemma C.8 and the sparsity Assumption 4.1 of w*.

For the third term |R3|, we have

|R3|[|{αθ2L(α¯,θ¯)Hαθ}w^|+|Hαθ(w^w)|]2[|{αθ2L(α¯,θ¯)Hαθ}(w^w)|+|{αθ2L(α¯,θ¯)Hαθ}w|+|Hαθ(w^w)|]2αθ2L(α¯,θ¯)Hαθw^w1+2αθ2L(α¯,θ¯)Hαθw1+2Hαθw^w1.

Note that αθ2L(α¯,θ¯)Hαθw^w1=O(ssλλ) by Lemma C.8 and Lemma C.4, αθL(α¯,θ¯)Hαθw1=O(ssλ) by Lemma C.4 and Assumption 4.2, and Hαθw^w1=O(sλ) by Assumption 4.3 and Lemma C.8. We have |R3|=O(ssλ).

Combining the results above, we have,

T22=α22O(s2sλ)=O(s2slogdn3/2)=O(n1), (A.23)

where the second equality follows by Theorem 4.7 that α=O(n1/2) under the null hypothesis, and the last equality follows by the assumption that n−1/2ss2 log d = o(1).

Combining (A.20) and (A.23) with (A.19), we have

2nT2dHα|θ1Z2,whereZ~N(0,Hα|θ). (A.24)

Plugging (A.18) and (A.24) into (A.17), by Theorem 4.4,

2n{L(α,θ^αw^)L(0,θ^)}dZχ2,whereZχ~χ12,

which concludes the proof. □

B Proofs in Section 5

In this section, we provide detailed proofs in Section 5.

Lemma B.1

Under Assumptions 2.1, 2.2, 4.2, 4.3 and 5.1, Λ^0(t,β^)Λ0(t,β)=O(sn1logd).

Proof

By the definition of Λ^0(t,β^) in (5.1), we have,

Λ^0(t,β^)Λ0(t,β)=1n0tS(1)(u,β^)dN¯(u){S(0)(u,β^)}2+E0ts(1)(u,β)dN(u){s(0)(u,β)}2=O(slogdn),

where the last inequality follows by the same argument in Lemma C.5. □

A corollary of Lemma B.1 and Lemma C.8 follows immediately which characterizes the rate of convergence of u^(t).

Corollary B.2

Under Assumptions 2.1, 2.2, 4.2, 4.3 and 5.1, if δsn1logd we have,

u^(t)u(t)1=O(sslogdn).

Proof of Theorem 5.2

We first decompose n{Λ0(t)Λ0(t,β^)} into two terms that

n{Λ0(t)Λ0(t,β^)}=n{Λ0(t)Λ0(t,β)}I1(t)+n{Λ^0(t,β)Λ0(t,β^)}I2(t).

Let Mi(t)=Ni(t)0tYi(u)λ0(u)du. For the first term nI1(t), we have

nI1(t)=0tni=1ndMi(u)i=1nYi(u)exp{XiT(u)β}.

Since Mi(t) is a martingale, nI1(t) becomes a sum of martingale residuals. By Andersen and Gill (1982), we have, as n → ∞, nI1(t)dN(0,σ12(t)), where

σ12(t)=0tλ0(u)duE[exp{XT(u)β}Y(u)].

For the second term I2(t), we have, by mean value theorem, for some β=β+t(β^β), β=β+t(β^β) and 0 ≤ t, t′ ≤ 1,

I2(t)=Λ^0(t,β)Λ^0(t,β^)+{u^(t)}TL(β^)=(ββ^)TΛ^0(t,β)+{u^(t)}T{L(β)+2L(β)(β^β)}={u(t)}TL(β)+{ββ^}TΛ^0(t,β)+{u(t)}T2L(β)(β^β)R1+{u^(t)u(t)}T{L(β)+2L(β)(β^β)}R2.

Next, we consider the two terms R1 and R2. For the term R1, we have

R1=(ββ^)TΛ^0(t,β)+{u(t)}T2L(β)(β^β)=(ββ^)T[HH1Λ^0(t,β)2L(β)H1Λ^0(t,β)]={ββ^}T{Λ^0(t,β)Λ0(t,β)}R11+(ββ^)T[H2L(β)]H1Λ0(t,β)R12.

It holds that |R11|ββ^1Λ0(t,β)Λ^0(t,β)=O(s2n1logd) by (2.2) and Lemma B.1, and |R12|ββ^1H2L(β)u(t)1=O(ss2n1logd). Summing them up, by triangle inequality, we have |R1|=O(ss2n1logd).

For the term R2, we have

|R2|u^(t)u(t)1L(β)+u^(t)u(t)12L(β)β^β1=O(ssn1logd)+O(ss2n1logd),

where the last inequality holds by Lemma C.3 and C.5.

Meanwhile, by Lemma A.1, taking v = u(t), we have the term nuT(t)L(β)dN(0,σ22(t)), where σ22(t)=Λ0(t,β)TH1Λ0(t,β). Thus, we have,

nI2(t)dZ,whereZ~N(0,σ22(t)),

and σ22(t)=Λ0(t,β)TH1Λ0(t,β).

Following the standard martingale theory, the covariance between I1(t) and I2(t) is 0. Our claim holds as desired. □

C Technical Lemmas

In this section, we prove some concentration results of the sample gradient ∇ℒ(β) and sample Hessian matrix ∇2ℒ(β). The mathematical tools we use are mainly from empirical process theory.

We start from introducing the following notations. Let ‖·‖ℙ,r denote the Lr(ℙ)-norm. For any given ε > 0 and the function class ℱ, let N[](ε, ℱ, Lr(ℙ)) and N(ε, ℱ, L2(ℚ)) denote the bracketing number and the covering number, respectively. The quantifies log N[](ε, ℱ, Lr(ℙ)) and log N(ε, ℱ, L2(ℚ)) are called entropy with bracketing and entropy, respectively. In addition, let F be an envelope of ℱ where |f| ≤ F for all f ∈ ℱ. The bracketing integral and uniform entropy integral are defined as

J[](δ,F,Lr())=0δlogN[](ε,F,Lr())dε,

and

J(δ,F,L2)=0δlogsupN(εF,2,F,L2())dε,

respectively, where the supremum is taken over all probability measures ℚ with ‖F,2 > 0. Denote the empirical process by Gn(f)=n1/2(n)(f), where n(f)=n1i=1nf(Xi) and (f)=E(f(Xi)). The following three Lemmas characterize the bounds for the expected maximal empirical processes and the concentration of the maximal empirical processes.

Lemma C.1

Under Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3, there exist some constant C > 0, such that, for r = 0, 1, 2, with probability at least 1O(d3),

supt[0,τ]s(r)(t,β)S(r)(t,β)Clogdn,

where s(r)(t, β*) and S(r)(t, β*) are defined in (2.6) and (2.3).

Proof

We will only prove the case for r = 1, and the cases for r = 0 and 2 follow by the similar argument. For j = 1,…, d, let

Ej=supt[0,τ]|Sj(1)(t,β)sj(1)(t,β)|,

where Sj(1)(t,β) and sj(1)(t,β) denote the j-th component of S(1)(t, β*) and s(1)(t, β*), respectively. We will prove a concentration result of Ej.

First, we show the class of functions {Xj(t)Y (t) exp (XT(t)β*) : t ∈ [0, τ]} has bounded uniform entropy integral. By Lemma 9.10 of Kosorok (2007), the class ℱ = {Xj(t) : t ∈ [0, τ]} is a VC-hull class associated with a VC class of index 2. By Corollary 2.6.12 of van der Vaart and Wellner (1996), the entropy of the class ℱ satisfies log N(∈‖FQ,2, ℱ, L2(ℚ)) ≤ C′(1/∈) for some constant C′ > 0, and hence ℱ has the uniform entropy integral J(1,F,L2)01K(1/)d<. By the same argument, we have that {exp{X(t)T β*} : t ∈ [0, τ]} also has a uniform entropy integral. Meanwhile, by example 19.16 of van der Vaart and Wellner (1996), {Y (t) : t ∈ [0, τ]} is a VC class and hence has bounded uniform entropy integral. Thus, by Theorem 9.15 of Kosorok (2007), we have {Xj(t)Y(t)exp{X(t)Tβ*} : t ∈ [0, τ]} has bounded uniform entropy integral.

Next, taking the envelop F as supt ∈ [0, τ] |Xj(t)Y (t) exp {XT (t)β*}|, by Lemma 19.38 of van der Vaart (2000),

E(Ej)C1n1/2J(1,F,L2)F,2Cn1/2,

for some positive constants C1 and C. By McDiarmid’s inequality, we have, for any Δ > 0,

(EjCn1/2(1+Δ))(EjE(Ej)+n1/2CΔ)exp(C2Δ2L2),

for some positive constants $C_2$ and $L$. The desired result follows by taking $\Delta \asymp \sqrt{\log d}$ and applying a union bound over $j = 1, \dots, d$. □
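As an informal numerical check of Lemma C.1 (our own illustration, under a simulated design of our choosing, with $s^{(1)}$ approximated by an independent large sample), one can verify that $\max_j E_j$ tracks the rate $\sqrt{\log d / n}$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, tau = 100, 2.0
beta_star = np.zeros(d); beta_star[0] = 1.0       # true hazard exp(x_1)

def sample(n):
    X = rng.uniform(-1, 1, size=(n, d))           # bounded covariates (Assumption 2.1)
    T = rng.exponential(scale=np.exp(-X @ beta_star))
    return X, np.minimum(T, tau)                  # administrative censoring at tau

def S1(t, X, time):
    w = (time >= t) * np.exp(X @ beta_star)       # Y_i(t) exp(X_i' beta*)
    return (w[:, None] * X).mean(axis=0)

grid = np.linspace(0.0, tau, 50)
Xb, tb = sample(50_000)                           # large sample as a proxy for s^(1)
for n in (250, 1000, 4000):
    X, t = sample(n)
    E_max = max(np.abs(S1(u, X, t) - S1(u, Xb, tb)).max() for u in grid)
    print(n, round(E_max, 4), round(np.sqrt(np.log(d) / n), 4))
```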

Lemma C.2

Suppose Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3 hold, and $\lambda \asymp \sqrt{n^{-1}\log d}$. We have, for $r = 0, 1, 2$ and $t \in [0,\tau]$,

$$
\big\|S^{(r)}(t,\hat\beta) - S^{(r)}(t,\beta^*)\big\|_\infty = O_P\left(s\sqrt{\frac{\log d}{n}}\right).
$$

Proof

As in the previous lemma, we only prove the case $r = 1$; the other two cases follow by similar arguments. For $r = 1$, we have

$$
\begin{aligned}
\big\|S^{(1)}(t,\hat\beta) - S^{(1)}(t,\beta^*)\big\|_\infty &= \left\|\frac{1}{n}\sum_{i=1}^n Y_i(t)\big[\exp\{X_i^T(t)\hat\beta\} - \exp\{X_i^T(t)\beta^*\}\big]X_i(t)\right\|_\infty \\
&\le \max_i\big\{Y_i(t)\|X_i(t)\|_\infty \big|\exp\{X_i^T(t)\hat\beta\} - \exp\{X_i^T(t)\beta^*\}\big|\big\} \\
&\le C_X \max_i\big|\exp\{X_i^T(t)\beta^*\}\big[\exp\{X_i^T(t)(\hat\beta - \beta^*)\} - 1\big]\big| \quad \text{(C.1)} \\
&\le C_X C_1 \max_i\|X_i(t)\|_\infty\|\hat\beta - \beta^*\|_1 = O_P\left(s\sqrt{\frac{\log d}{n}}\right), \quad \text{(C.2)}
\end{aligned}
$$

where (C.1) holds by Assumption 2.1 for some constant $C_X > 0$; (C.2) holds by Assumption 4.1 that $X_i^T(t)\beta^* = O(1)$ and the bound $\exp(|x|) \le 1 + 2|x|$ for all sufficiently small $|x|$; and the last equality holds by (2.2). The claim follows as desired. □

Lemma C.3

Under Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3, there exists a positive constant $C$ such that, with probability at least $1 - O(d^{-3})$,

$$
\|\nabla\mathcal{L}(\beta^*)\|_\infty \le C\sqrt{\frac{\log d}{n}}.
$$

Proof

By definition, we have, for all j = 1, …, d,

$$
\nabla_j\mathcal{L}(\beta^*) = -\frac{1}{n}\sum_{i=1}^n\int_0^\tau\{X_{ij}(u) - \bar X_j(u,\beta^*)\}\,dM_i(u) = \frac{1}{n}\sum_{i=1}^n\int_0^\tau \bar X_j(u,\beta^*)\,dM_i(u) - \frac{1}{n}\sum_{i=1}^n\int_0^\tau X_{ij}(u)\,dM_i(u). \quad \text{(C.3)}
$$

For the first term, we have for all t ∈ [0, τ],

$$
\bar X_j(t,\beta^*) - e_j(t,\beta^*) = \frac{S_j^{(1)}(t,\beta^*) - s_j^{(1)}(t,\beta^*)}{S^{(0)}(t,\beta^*)} - \frac{s_j^{(1)}(t,\beta^*)\{S^{(0)}(t,\beta^*) - s^{(0)}(t,\beta^*)\}}{S^{(0)}(t,\beta^*)\, s^{(0)}(t,\beta^*)}. \quad \text{(C.4)}
$$

By Assumption 2.1 and the fact that $\mathbb{P}\{Y(\tau) > 0\} > 0$, we have $\sup_{t\in[0,\tau]}|\bar X_j(t,\beta^*) - e_j(t)| \le C_1$ for some constant $C_1 > 0$. In addition,

$$
\left|\frac{1}{n}\sum_{i=1}^n\int_0^\tau \bar X_j(u,\beta^*)\,dM_i(u)\right| \le \sup_{f\in\mathcal{F}_j}\left|\frac{1}{n}\sum_{i=1}^n\int_0^\tau f(u)\,dM_i(u)\right|,
$$

where $\mathcal{F}_j$ denotes the class of functions $f: [0,\tau]\to\mathbb{R}$ with uniformly bounded variation satisfying $\sup_{t\in[0,\tau]}|f(t) - e_j(t)| \le \delta_1$ for some $\delta_1$. By constructing balls centered at piecewise constant functions on a regular grid, one can show that the covering number of $\mathcal{F}_j$ satisfies $N(\varepsilon, \mathcal{F}_j, \|\cdot\|_\infty) \le (C_2\varepsilon^{-1})^{C_3\varepsilon^{-1}}$ for some positive constants $C_2, C_3$. Let $\mathcal{G}_j = \{\int_0^{\cdot} f(t)\,dM(t) : f\in\mathcal{F}_j\}$. Note that for any $f_1, f_2 \in \mathcal{F}_j$,

$$
\left|\int_0^\tau \{f_1(t) - f_2(t)\}\,dM(t)\right| \le \sup_{u\in[0,\tau]}|f_1(u) - f_2(u)|\int_0^\tau |dM(t)|.
$$

By Theorem 2.7.11 of van der Vaart and Wellner (1996), the bracketing number of the class $\mathcal{G}_j$ satisfies $N_{[\,]}(2\varepsilon\|F\|_{\mathbb{P},2}, \mathcal{G}_j, L_2(\mathbb{P})) \le N(\varepsilon, \mathcal{F}_j, \|\cdot\|_\infty) \le (C_2\varepsilon^{-1})^{C_3\varepsilon^{-1}}$, where $F := \int_0^\tau |dM(t)|$. Hence, $\mathcal{G}_j$ has a bounded bracketing integral. An application of Corollary 19.35 of van der Vaart (2000) yields

$$
\mathbb{E}\left(\sup_{f\in\mathcal{F}_j}\left|\frac{1}{n}\sum_{i=1}^n\int_0^\tau f(u)\,dM_i(u)\right|\right) \le n^{-1/2} C_4
$$

for some constant C4 > 0. Then, by McDiarmid’s inequality,

$$
\mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^n\int_0^\tau \bar X_j(u,\beta^*)\,dM_i(u)\right| > t\right) \le \mathbb{P}\left(\sup_{f\in\mathcal{F}_j}\left|\frac{1}{n}\sum_{i=1}^n\int_0^\tau f(u)\,dM_i(u)\right| > t\right) \le \exp\left(-\frac{nt^2}{C_5}\right),
$$

for some constant $C_5$. By the union bound, with probability at least $1 - O(d^{-3})$,

$$
\left|\frac{1}{n}\sum_{i=1}^n\int_0^\tau \bar X_j(u,\beta^*)\,dM_i(u)\right| \le C\sqrt{\frac{\log d}{n}} \quad \text{for all } j = 1, \dots, d.
$$

Note that the second term of (C.3) is a sum of i.i.d. mean-zero bounded random variables. By Hoeffding's inequality and the union bound, with probability at least $1 - O(d^{-3})$,

$$
\left|\frac{1}{n}\sum_{i=1}^n\int_0^\tau X_{ij}(u)\,dM_i(u)\right| \le C\sqrt{\frac{\log d}{n}} \quad \text{for all } j = 1, \dots, d,
$$

for some constant C. The claim follows as desired. □

Lemma C.4

Under Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3, there exists a positive constant $C$ such that, with probability at least $1 - O(d^{-1})$,

$$
\max_{j,k=1,\dots,d}\big|\nabla^2_{jk}\mathcal{L}(\beta^*) - H^*_{jk}\big| \le C\sqrt{\frac{\log d}{n}}. \quad \text{(C.5)}
$$

Proof

By the definitions of ∇2ℒ(β*) and H* in (2.5) and (2.7), we have

$$
\begin{aligned}
\nabla^2\mathcal{L}(\beta^*) - H^* &= \underbrace{\frac{1}{n}\int_0^\tau\left\{\frac{S^{(2)}(t,\beta^*)}{S^{(0)}(t,\beta^*)} - \frac{s^{(2)}(t,\beta^*)}{s^{(0)}(t,\beta^*)}\right\}d\bar N(t)}_{T_1} + \underbrace{\frac{1}{n}\int_0^\tau \frac{s^{(2)}(t,\beta^*)}{s^{(0)}(t,\beta^*)}\,d\bar N(t) - \mathbb{E}\left[\int_0^\tau \frac{s^{(2)}(t,\beta^*)}{s^{(0)}(t,\beta^*)}\,dN(t)\right]}_{T_2} \\
&\quad + \underbrace{\frac{1}{n}\int_0^\tau\left\{e(t,\beta^*)^{\otimes 2} - \bar Z(t,\beta^*)^{\otimes 2}\right\}d\bar N(t)}_{T_3} + \underbrace{\mathbb{E}\left[\int_0^\tau e(t,\beta^*)^{\otimes 2}\,dN(t)\right] - \frac{1}{n}\int_0^\tau e(t,\beta^*)^{\otimes 2}\,d\bar N(t)}_{T_4}.
\end{aligned}
$$

For the term $T_1$, we have, with probability at least $1 - O(d^{-1})$,

$$
\|T_1\|_{\max} \le \sup_{t\in[0,\tau]}\left\|\frac{S^{(2)}(t,\beta^*)}{S^{(0)}(t,\beta^*)} - \frac{s^{(2)}(t,\beta^*)}{s^{(0)}(t,\beta^*)}\right\|_{\max}\cdot\frac{1}{n}\int_0^\tau d\bar N(t) \le C_1\sqrt{\frac{\log d}{n}},
$$

where the last inequality follows by Lemma C.1. Next, by Assumption 2.1, we have

$$
\left\|\frac{s^{(2)}(t,\beta^*)}{s^{(0)}(t,\beta^*)}\right\|_{\max} < \infty.
$$

Consequently, each entry of $T_2$ is a sum of i.i.d. mean-zero bounded random variables. Hoeffding's inequality gives that, with probability at least $1 - O(d^{-1})$, $\|T_2\|_{\max} \le C_2\sqrt{n^{-1}\log d}$. The terms $T_3$ and $T_4$ can be bounded similarly. The claim follows as desired. □

Lemma C.5

Under Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3, let $\hat\beta$ be the estimator of $\beta^*$ obtained from (2.1), satisfying the bound in (2.2) that $\|\hat\beta - \beta^*\|_1 = O_P(s\lambda)$ with $\lambda = O(\sqrt{n^{-1}\log d})$. Then, for any $\bar\beta = \beta^* + u(\hat\beta - \beta^*)$ with $u \in [0,1]$,

$$
\|\nabla^2\mathcal{L}(\bar\beta)\|_{\max} = O_P(1), \quad \text{and} \quad \|\nabla^2\mathcal{L}(\bar\beta) - H^*\|_{\max} = O_P\left(s\sqrt{\frac{\log d}{n}}\right).
$$

Proof

Let $\xi = \max_{u\ge 0}\max_{i,i'}|\Delta^T\{X_i(u) - X_{i'}(u)\}|$, where $\Delta = \bar\beta - \beta^*$. By Lemma 3.2 of Huang et al. (2013), it holds that

$$
\exp(-2\xi)\nabla^2\mathcal{L}(\beta^*) \preceq \nabla^2\mathcal{L}(\bar\beta) \preceq \exp(2\xi)\nabla^2\mathcal{L}(\beta^*), \quad \text{(C.6)}
$$

where $A \preceq B$ means that $B - A$ is positive semidefinite.

Note that the diagonal elements of a positive semidefinite matrix are nonnegative. In addition, for a positive semidefinite matrix $A \in \mathbb{R}^{d\times d}$, it is easy to see that $\|A\|_{\max} = \max_{1\le j\le d} a_{jj}$, since $|a_{jk}| \le \sqrt{a_{jj}a_{kk}}$. We have

$$
\exp(-2\xi)\|\nabla^2\mathcal{L}(\beta^*)\|_{\max} \le \|\nabla^2\mathcal{L}(\bar\beta)\|_{\max} \le \exp(2\xi)\|\nabla^2\mathcal{L}(\beta^*)\|_{\max}.
$$

By (2.2), $\|\hat\beta - \beta^*\|_1 = O_P(s\lambda)$, which implies $\|\bar\beta - \beta^*\|_1 = O_P(s\lambda)$ since $\bar\beta$ lies on the line segment connecting $\beta^*$ and $\hat\beta$. Hence $\xi = O_P(s\lambda)$. By the triangle inequality,

$$
\|\nabla^2\mathcal{L}(\bar\beta) - H^*\|_{\max} \le \underbrace{\|\nabla^2\mathcal{L}(\bar\beta) - \nabla^2\mathcal{L}(\beta^*)\|_{\max}}_{E_1} + \underbrace{\|\nabla^2\mathcal{L}(\beta^*) - H^*\|_{\max}}_{E_2}.
$$

We consider the two terms separately. For the first term $E_1$, by (C.6) and a Taylor expansion of $\exp(2\xi)$,

$$
\|\nabla^2\mathcal{L}(\bar\beta) - \nabla^2\mathcal{L}(\beta^*)\|_{\max} \le 2\xi\|\nabla^2\mathcal{L}(\beta^*)\|_{\max} + o_P(\xi).
$$

Since $\xi = O_P(s\lambda)$, and by Assumption 4.3, we have

$$
\|\nabla^2\mathcal{L}(\bar\beta) - \nabla^2\mathcal{L}(\beta^*)\|_{\max} = O_P(s\lambda),
$$

and hence $E_1 = O_P(s\sqrt{n^{-1}\log d})$ since $\lambda \asymp \sqrt{n^{-1}\log d}$. In addition, $E_2 = O_P(\sqrt{n^{-1}\log d})$ by Lemma C.4. Together, these bounds also imply that $\|\nabla^2\mathcal{L}(\bar\beta)\|_{\max} = O_P(1)$. □

Lemma C.6

Under Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3, it holds that

$$
\big\|\nabla^2_{\alpha\theta}\mathcal{L}(\hat\beta) - w^{*T}\nabla^2_{\theta\theta}\mathcal{L}(\hat\beta)\big\|_\infty = O_P\left(s\sqrt{\frac{\log d}{n}}\right).
$$

Proof

By the triangle inequality, we have

$$
\big\|\nabla^2_{\alpha\theta}\mathcal{L}(\hat\beta) - w^{*T}\nabla^2_{\theta\theta}\mathcal{L}(\hat\beta)\big\|_\infty \le \underbrace{\big\|H^*_{\alpha\theta} - w^{*T}H^*_{\theta\theta}\big\|_\infty}_{E_1} + \underbrace{\big\|\nabla^2_{\theta\alpha}\mathcal{L}(\hat\beta) - H^*_{\theta\alpha}\big\|_\infty}_{E_2} + \underbrace{\big\|w^{*T}\{H^*_{\theta\theta} - \nabla^2_{\theta\theta}\mathcal{L}(\hat\beta)\}\big\|_\infty}_{E_3}.
$$

It is seen that $E_1 = 0$ by the definition of $w^* = H^{*-1}_{\theta\theta}H^*_{\theta\alpha}$ in (3.1). In addition, $E_2 = O_P(s\sqrt{n^{-1}\log d})$ by Lemma C.5. For the term $E_3$, we have

$$
E_3 \le \underbrace{\big\|w^{*T}\{\nabla^2_{\theta\theta}\mathcal{L}(\hat\beta) - \nabla^2_{\theta\theta}\mathcal{L}(\beta^*)\}\big\|_\infty}_{E_{31}} + \underbrace{\big\|w^{*T}\{\nabla^2_{\theta\theta}\mathcal{L}(\beta^*) - H^*_{\theta\theta}\}\big\|_\infty}_{E_{32}}.
$$

For the term E31, by the definition of ∇2ℒ(·) in (2.5), we have

$$
w^{*T}\{\nabla^2_{\theta\theta}\mathcal{L}(\hat\beta) - \nabla^2_{\theta\theta}\mathcal{L}(\beta^*)\} = \underbrace{w^{*T}\left\{\frac{1}{n}\sum_{i=1}^n\int_0^\tau\left[\frac{S^{(2)}(t,\hat\beta)}{S^{(0)}(t,\hat\beta)} - \frac{S^{(2)}(t,\beta^*)}{S^{(0)}(t,\beta^*)}\right]dN_i(t)\right\}_{\theta\theta}}_{T_1} + \underbrace{w^{*T}\left\{\frac{1}{n}\sum_{i=1}^n\int_0^\tau\left[\bar Z(t,\beta^*)^{\otimes 2} - \bar Z(t,\hat\beta)^{\otimes 2}\right]dN_i(t)\right\}_{\theta\theta}}_{T_2}.
$$

For the term T1, we have

$$
T_1 = \frac{1}{n}\sum_{i=1}^n\int_0^\tau \frac{S^{(0)}(t,\beta^*)\,w^{*T}S^{(2)}_{\theta\theta}(t,\hat\beta) - S^{(0)}(t,\hat\beta)\,w^{*T}S^{(2)}_{\theta\theta}(t,\beta^*)}{S^{(0)}(t,\hat\beta)\,S^{(0)}(t,\beta^*)}\,dN_i(t).
$$

For ease of notation, in the rest of the proof, let $\hat S^{(r)}(t) := S^{(r)}(t,\hat\beta)$ and $S^{*(r)}(t) := S^{(r)}(t,\beta^*)$ for $r = 0, 1, 2$. For the $k$-th component of $T_1$, we have

$$
T_{1,k} = \frac{1}{n}\sum_{i=1}^n\int_0^\tau \frac{S^{*(0)}(t)\,n^{-1}\sum_{i'=1}^n Y_{i'}(t)\exp\{X_{i'}^T(t)\hat\beta\}\,w^{*T}X_{i',\theta}(t)X_{i',k}(t) - \hat S^{(0)}(t)\,n^{-1}\sum_{i'=1}^n Y_{i'}(t)\exp\{X_{i'}^T(t)\beta^*\}\,w^{*T}X_{i',\theta}(t)X_{i',k}(t)}{\hat S^{(0)}(t)\,S^{*(0)}(t)}\,dN_i(t).
$$

Consequently, it holds that

$$
\begin{aligned}
|T_{1,k}| &\le \left|\frac{1}{n}\sum_{i=1}^n\int_0^\tau \frac{\{S^{*(0)}(t) - \hat S^{(0)}(t)\}\,n^{-1}\sum_{i'=1}^n Y_{i'}(t)\exp\{X_{i'}^T(t)\hat\beta\}\,w^{*T}X_{i',\theta}(t)X_{i',k}(t)}{\hat S^{(0)}(t)\,S^{*(0)}(t)}\,dN_i(t)\right| \\
&\quad + \left|\frac{1}{n}\sum_{i=1}^n\int_0^\tau \frac{\hat S^{(0)}(t)\,n^{-1}\sum_{i'=1}^n Y_{i'}(t)\big[\exp\{X_{i'}^T(t)\hat\beta\} - \exp\{X_{i'}^T(t)\beta^*\}\big]\,w^{*T}X_{i',\theta}(t)X_{i',k}(t)}{\hat S^{(0)}(t)\,S^{*(0)}(t)}\,dN_i(t)\right| \\
&= O_P\big(s\sqrt{n^{-1}\log d}\big),
\end{aligned}
$$

where the last equality holds by Assumptions 2.1 and 4.1, under which $X_i^T(t)\beta^*$ is bounded and $S^{*(0)}(t)$ is bounded away from 0, and by Lemma C.2, which gives $\|\hat S^{(r)}(t) - S^{*(r)}(t)\|_\infty = O_P(s\sqrt{n^{-1}\log d})$.

The term $T_2$ can be bounded by a similar argument, and the claim follows as desired. □

Lemma C.7

Under Assumptions 2.1 and 2.2, and if $n^{-1/2}s^3\log d = o(1)$, the restricted eigenvalue (RE) condition holds for the sample Hessian matrix $\nabla^2\mathcal{L}(\hat\beta)$. Specifically, for vectors in the cone $\mathcal{C} = \{v : \|v_{S^c}\|_1 \le \xi\|v_S\|_1\}$, we have

$$
v^T\nabla^2\mathcal{L}(\hat\beta)v \ge \frac{1}{2}\kappa^2(\xi, |S|; \nabla^2\mathcal{L}(\beta^*))\,\|v\|_2^2 \quad \text{for all } v\in\mathcal{C}.
$$

Proof

By Lemma 3.2 of Huang et al. (2013), we have $\exp(-2\xi_b)\nabla^2\mathcal{L}(\beta) \preceq \nabla^2\mathcal{L}(\beta + b)$, where $\xi_b = \max_{u\ge 0}\max_{i,i'}|b^T\{X_i(u) - X_{i'}(u)\}|$. Let $b = \hat\beta - \beta^*$. By Assumption 2.1 that $|X_{ik}(u) - X_{i'k'}(u)| \le C_X$, and by (2.2) that $\|\hat\beta - \beta^*\|_1 = O_P(s\lambda)$, we have $\xi_b = O_P(s\sqrt{n^{-1}\log d})$. Under the scaling assumption $n^{-1/2}s^3\log d = o(1)$, we have $\xi_b \le \frac{1}{2}\log 2$ with probability tending to one, so $\exp(-2\xi_b) \ge 1/2$ and $\nabla^2\mathcal{L}(\hat\beta) \succeq \frac{1}{2}\nabla^2\mathcal{L}(\beta^*)$. Since this quadratic-form bound holds over all of $\mathbb{R}^d$, it holds in particular over the cone $\mathcal{C}$, and the claim follows as desired. □

Lemma C.8

Under Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3, if

$$
\big\|\nabla^2_{\theta\alpha}\mathcal{L}(\hat\beta) - w^{*T}\nabla^2_{\theta\theta}\mathcal{L}(\hat\beta)\big\|_\infty \le \lambda, \quad \text{(C.7)}
$$

then the Dantzig selector $\hat w$ defined in (3.2) satisfies

$$
\|\hat w - w^*\|_1 \le \frac{16\lambda s'}{\kappa^2(1, s'; \nabla^2\mathcal{L}(\beta^*))}.
$$

Proof

We first show that the vector $\hat\Delta = \hat w - w^*$ belongs to the cone $\mathcal{C} = \{v : \|v_{S^c}\|_1 \le \|v_S\|_1\}$. Under assumption (C.7), $w^*$ is feasible for (3.2), so $\|\hat w\|_1 \le \|w^*\|_1$ by the optimality of the Dantzig selector. Hence

$$
\|\hat w_S\|_1 + \|\hat w_{S^c}\|_1 \le \|w^*_S\|_1,
$$

where we use the fact that $\|w^*_{S^c}\|_1 = 0$.

By triangle inequality, we have

$$
\|w^*_S\|_1 \le \|\hat w_S\|_1 + \|\hat\Delta_S\|_1.
$$

Combining the two displays above, and noting that $\hat\Delta_{S^c} = \hat w_{S^c}$, we obtain

$$
\|\hat\Delta_{S^c}\|_1 \le \|\hat\Delta_S\|_1. \quad \text{(C.8)}
$$

Meanwhile, by the feasibility of both $\hat w$ and $w^*$ for the Dantzig selector, we have

$$
\big\|\nabla^2_{\theta\theta}\mathcal{L}(\hat\beta)\hat\Delta\big\|_\infty \le \big\|\nabla^2_{\theta\alpha}\mathcal{L}(\hat\beta) - w^{*T}\nabla^2_{\theta\theta}\mathcal{L}(\hat\beta)\big\|_\infty + \big\|\nabla^2_{\theta\alpha}\mathcal{L}(\hat\beta) - \hat w^T\nabla^2_{\theta\theta}\mathcal{L}(\hat\beta)\big\|_\infty \le 2\lambda. \quad \text{(C.9)}
$$

By (C.8) and (C.9), we have

$$
\hat\Delta^T\nabla^2_{\theta\theta}\mathcal{L}(\hat\beta)\hat\Delta \le \|\hat\Delta\|_1\big\|\nabla^2_{\theta\theta}\mathcal{L}(\hat\beta)\hat\Delta\big\|_\infty \le 2\lambda\|\hat\Delta\|_1 \le 4\lambda\|\hat\Delta_S\|_1.
$$

By Lemma C.7, it holds that

$$
\hat\Delta^T\nabla^2_{\theta\theta}\mathcal{L}(\hat\beta)\hat\Delta \ge \frac{1}{2}\kappa^2(1, s'; \nabla^2\mathcal{L}(\beta^*))\,\|\hat\Delta_S\|_2^2,
$$

which, together with $\|\hat\Delta_S\|_1 \le \sqrt{s'}\|\hat\Delta_S\|_2$, implies that

$$
\hat\Delta^T\nabla^2_{\theta\theta}\mathcal{L}(\hat\beta)\hat\Delta \ge \frac{1}{2}\kappa^2(1, s'; \nabla^2\mathcal{L}(\beta^*))\,s'^{-1}\|\hat\Delta_S\|_1^2.
$$

Consequently, we have

$$
\|\hat\Delta_S\|_1 \le \frac{8\lambda s'}{\kappa^2(1, s'; \nabla^2\mathcal{L}(\beta^*))}.
$$

By (C.8), it holds that

$$
\|\hat\Delta\|_1 \le 2\|\hat\Delta_S\|_1 \le \frac{16\lambda s'}{\kappa^2(1, s'; \nabla^2\mathcal{L}(\beta^*))},
$$

as desired. □
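For illustration, the Dantzig selector in (3.2) is a linear program and can be solved directly. The following sketch is our own illustration (variable names are ours); it uses the standard split $w = u - v$ with $u, v \ge 0$ to cast the $\ell_1$ objective as an LP:

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_w(hess, delta):
    """Solve min ||w||_1 s.t. ||h - H w||_inf <= delta, where `hess` is the full
    d x d sample Hessian, alpha is coordinate 0, H = hess[1:,1:], h = hess[1:,0]."""
    H, h = hess[1:, 1:], hess[1:, 0]
    p = len(h)
    c = np.ones(2 * p)                    # objective: sum(u) + sum(v) = ||w||_1
    A = np.block([[-H, H], [H, -H]])      # encodes -delta <= h - H(u - v) <= delta
    b = np.concatenate([delta - h, delta + h])
    res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None)] * (2 * p), method="highs")
    return res.x[:p] - res.x[p:]          # w_hat = u - v
```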

D Extensions to Multivariate Failure Time Data

In real applications, it is also of interest to study multivariate failure time outcomes. For example, Cai et al. (2005) consider the time to coronary heart disease and the time to cerebrovascular accident, where the primary sampling unit is the family. A multivariate model takes advantage of the fact that failure times for subjects within the same family are likely to be correlated. In this section, we extend our method to conduct inference in the high dimensional multivariate failure time setting.

To be specific, assume there are n independent clusters (families). Each cluster i contains $M_i$ subjects, and for each subject, K types of failure may occur. It is reasonable to assume that K is fixed and does not increase with the dimensionality d or the sample size n. For example, Cai et al. (2005) study the time to coronary heart disease and the time to cerebrovascular accident, where K = 2. Denote by $X_{ikm}(t)$ the covariates of the kth failure type of subject m in cluster i at time t. The marginal hazards model is

$$
\Lambda_{ikm}\{t \mid X_{ikm}(t)\} = \Lambda_{0k}(t)\exp\{X_{ikm}^T(t)\beta^*\},
$$

where the baseline hazard functions $\Lambda_{0k}(t)$ are treated as nuisance parameters; the model is known as the mixed baseline hazards model. Under this model, our inference procedures are based on the pseudo-partial likelihood approach, since the working model does not assume any correlation among the different failure times within each cluster. The log pseudo-partial likelihood loss function is

$$
\mathcal{L}(\beta) = -\frac{1}{n}\left[\sum_{k=1}^K\sum_{i=1}^n\sum_{m=1}^{M_i}\int_0^\tau X_{ikm}^T(u)\beta\,dN_{ikm}(u) - \sum_{k=1}^K\int_0^\tau\log\left[\sum_{i=1}^n\sum_{m=1}^{M_i}Y_{ikm}(u)\exp\{X_{ikm}^T(u)\beta\}\right]d\bar N_k(u)\right],
$$

where $Y_{ikm}(t)$ and $N_{ikm}(t)$ denote the at-risk indicator and the number of observed failure events of the kth type on subject m in cluster i by time t, and $\bar N_k = \sum_{i=1}^n\sum_{m=1}^{M_i} N_{ikm}$ for each k. The penalized maximum pseudo-partial likelihood estimator is

$$
\hat\beta = \arg\min_{\beta\in\mathbb{R}^d}\mathcal{L}(\beta) + P_\lambda(\beta). \quad \text{(D.1)}
$$

To connect the multivariate failure time model with Cox's proportional hazards model, first observe that we can drop the index m: for each pair (i, m) with $i \in \{1, \dots, n\}$ and $m \in \{1, \dots, M_i\}$, map (i, m) to the single index $i' = \sum_{j=1}^{i-1} M_j + m$, with the convention $\sum_{j=1}^{0} M_j = 0$. This mapping is a bijection, and the penalized estimator remains the same after relabeling. Thus, without loss of generality, we assume $M_i = 1$ for all i and drop the index m. Next, observe that the loss function $\mathcal{L}(\beta)$ is decomposable:

$$
\mathcal{L}(\beta) = \sum_{k=1}^K\mathcal{L}^{(k)}(\beta),
$$

where

$$
\mathcal{L}^{(k)}(\beta) = -\frac{1}{n}\left[\sum_{i=1}^n\int_0^\tau X_{ik}^T(u)\beta\,dN_{ik}(u) - \int_0^\tau\log\left[\sum_{i=1}^n Y_{ik}(u)\exp\{X_{ik}^T(u)\beta\}\right]d\bar N_k(u)\right].
$$

Thus, the loss function of the multivariate failure time model decomposes into a sum of K loss functions of Cox's proportional hazards models. However, extending the inference from the Cox model to the multivariate failure time model is not trivial, since the loss function is derived from a pseudo-likelihood.
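The decomposition can be made concrete in code. The sketch below is our own illustration (assuming time-independent covariates, distinct event times within each type, and data already flattened over the (cluster, member) index as described above); it evaluates $\mathcal{L}(\beta)$ as a sum of K univariate Cox losses:

```python
import numpy as np

def cox_loss(beta, X, time, status):
    """Per-type loss L^{(k)}(beta): negative log partial likelihood with
    time-independent covariates and no tied event times."""
    eta = X @ beta
    loss = 0.0
    for i in np.flatnonzero(status):                    # observed failures
        at_risk = time >= time[i]
        loss += (np.log(np.exp(eta[at_risk]).sum()) - eta[i]) / len(time)
    return loss

def multivariate_loss(beta, data):
    """L(beta) = sum_k L^{(k)}(beta); `data` is a list of K triples
    (X_k, time_k, status_k), one per failure type."""
    return sum(cox_loss(beta, X, t, s) for X, t, s in data)
```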

First, we extend the estimation procedure to the multivariate failure time model in the high dimensional setting, taking $P_\lambda(\beta) = \lambda\|\beta\|_1$. It is not difficult to show that (2.2) holds for the multivariate failure time model. An alternative approach is to estimate $\beta^*$ using each failure type k separately. Specifically, we construct the estimator $\hat\beta$ by

$$
\hat\beta = K^{-1}\sum_{k=1}^K\hat\beta^{(k)}, \quad \text{where } \hat\beta^{(k)} = \arg\min_{\beta^{(k)}}\mathcal{L}^{(k)}(\beta^{(k)}) + \lambda\|\beta^{(k)}\|_1 \text{ for all } k.
$$

Since each $\hat\beta^{(k)}$ satisfies $\|\hat\beta^{(k)} - \beta^*\|_1 = O_P(\lambda s)$ by (2.2), it is readily seen that $\|\hat\beta - \beta^*\|_1 = O_P(\lambda s)$.
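A sketch of this alternative estimator follows (our own illustration; the simple proximal-gradient solver, step size and iteration count are our choices rather than part of the paper):

```python
import numpy as np

def cox_grad(beta, X, time, status):
    """Gradient of the per-type loss L^{(k)} (distinct event times assumed)."""
    n, d = X.shape
    g = np.zeros(d)
    risk = np.exp(X @ beta)
    for i in np.flatnonzero(status):
        w = risk * (time >= time[i])                   # Y_j(t_i) exp(X_j' beta)
        g -= (X[i] - (w[:, None] * X).sum(axis=0) / w.sum()) / n
    return g

def lasso_cox(X, time, status, lam, step=0.5, iters=500):
    """Proximal gradient for beta^(k) = argmin L^{(k)}(beta) + lam * ||beta||_1."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        z = beta - step * cox_grad(beta, X, time, status)
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return beta

def averaged_estimator(data, lam):
    """beta_hat = K^{-1} sum_k beta_hat^(k), one lasso fit per failure type."""
    return np.mean([lasso_cox(X, t, s, lam) for X, t, s in data], axis=0)
```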

We extend the decorrelated score, Wald and partial likelihood ratio tests to the multivariate failure time model. We first introduce some notation. For k = 1, …, K,

$$
S_k^{(r)}(t,\beta) = \frac{1}{n}\sum_{i=1}^n X_{ik}(t)^{\otimes r}Y_{ik}(t)\exp\{X_{ik}^T(t)\beta\}, \quad r = 0, 1, 2, \quad \text{and} \quad \bar Z_{kn}(t,\beta) = \frac{S_k^{(1)}(t,\beta)}{S_k^{(0)}(t,\beta)},
$$

with corresponding population versions

$$
s_k^{(r)}(t,\beta) = \mathbb{E}\big[Y_{ik}(t)X_{ik}(t)^{\otimes r}\exp\{X_{ik}^T(t)\beta\}\big], \quad r = 0, 1, 2, \quad \text{and} \quad e_k(t,\beta) = \frac{s_k^{(1)}(t,\beta)}{s_k^{(0)}(t,\beta)}.
$$

Next, we derive the gradient and the Hessian matrix of the loss function at the point $\beta$:

$$
\nabla\mathcal{L}(\beta) = -\frac{1}{n}\sum_{k=1}^K\sum_{i=1}^n\int_0^\tau\{X_{ik}(u) - \bar Z_{kn}(u,\beta)\}\,dN_{ik}(u),
$$

and

$$
\nabla^2\mathcal{L}(\beta) = \frac{1}{n}\sum_{k=1}^K\int_0^\tau\left\{\frac{S_k^{(2)}(u,\beta)}{S_k^{(0)}(u,\beta)} - \bar Z_{kn}(u,\beta)^{\otimes 2}\right\}d\bar N_k(u).
$$

The population version of the gradient and Hessian matrix are

$$
g(\beta) = -\sum_{k=1}^K\mathbb{E}\left[\int_0^\tau\{X_k(u) - e_k(u,\beta)\}\,dN_k(u)\right],
$$

and

$$
H(\beta) = \sum_{k=1}^K\mathbb{E}\left[\int_0^\tau\left\{\frac{s_k^{(2)}(u,\beta)}{s_k^{(0)}(u,\beta)} - e_k(u,\beta)^{\otimes 2}\right\}dN_k(u)\right].
$$

For notational simplicity, let $H^* = H(\beta^*)$.

Note that, utilizing the decomposable structure and similar arguments, the concentration results in Appendix C hold for the empirical gradient and Hessian matrix. We estimate the decorrelation vector $w^* = H^{*-1}_{\theta\theta}H^*_{\theta\alpha}$ by the following Dantzig selector:

$$
\hat w = \arg\min\|w\|_1, \quad \text{subject to } \big\|\nabla^2_{\theta\alpha}\mathcal{L}(0,\hat\theta) - w^T\nabla^2_{\theta\theta}\mathcal{L}(0,\hat\theta)\big\|_\infty \le \delta, \quad \text{(D.2)}
$$

where $\delta$ is a tuning parameter. The rate of convergence of $\hat w$ follows by a similar argument as in Lemma C.8.

We first introduce the decorrelated score test for the multivariate failure time model. Suppose the null hypothesis is $H_0: \alpha^* = 0$ and the alternative is $H_a: \alpha^* \neq 0$. The decorrelated score function is constructed similarly to (3.3):

$$
\hat U_M(0,\hat\theta) = \nabla_\alpha\mathcal{L}(0,\hat\theta) - \hat w^T\nabla_\theta\mathcal{L}(0,\hat\theta). \quad \text{(D.3)}
$$
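A sketch of the resulting test (our own illustration; here `grad` is the full gradient $\nabla\mathcal{L}(0,\hat\theta)$ with the $\alpha$ coordinate first, `w_hat` solves (D.2), and `Omega_hat` is a plug-in estimate of $\Omega$, cf. Remark D.3 below):

```python
import numpy as np
from scipy.stats import norm

def decorrelated_score_test(grad, w_hat, Omega_hat, n, level=0.05):
    """Decorrelated score statistic (D.3), studentized by the plug-in
    sigma^2 = Omega_aa - 2 w'Omega_ta + w'Omega_tt w."""
    U = grad[0] - w_hat @ grad[1:]
    sigma2 = (Omega_hat[0, 0] - 2 * w_hat @ Omega_hat[1:, 0]
              + w_hat @ Omega_hat[1:, 1:] @ w_hat)
    z = np.sqrt(n) * U / np.sqrt(sigma2)
    return z, abs(z) > norm.ppf(1 - level / 2)   # statistic, reject H0 at `level`?
```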

The main technical difference between the multivariate failure time model and the univariate Cox model is that the loss function of the Cox model is a log profile likelihood, so Bartlett's identity $\mathrm{Var}\{\nabla\mathcal{L}(\beta^*)\} = \mathbb{E}\{\nabla^2\mathcal{L}(\beta^*)\}$ holds; in the multivariate case, this identity fails. We need the following lemma, which is analogous to Lemma A.1. We omit the proof details to avoid repetition.

Lemma D.1

For any vector $v \in \mathbb{R}^d$ with $\|v\|_0 \le s'$, if $n^{-1/2}s'\log d = o(1)$, it holds that

$$
\frac{\sqrt{n}\,v^T\nabla\mathcal{L}(\beta^*)}{\sqrt{v^T\Omega v}} \to_d N(0,1), \quad \text{where } \Omega = \mathrm{Var}\{\sqrt{n}\,\nabla\mathcal{L}(\beta^*)\} \in \mathbb{R}^{d\times d}.
$$

By a similar argument as in Theorem 4.4, we derive the asymptotic normality of $\hat U_M(0,\hat\theta)$ in the next theorem.

Theorem D.2

Suppose that Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3 hold. Let $\hat U_M(0,\hat\theta)$ be defined in (D.3). Under the null hypothesis that $\alpha^* = 0$, and if $\lambda \asymp \sqrt{n^{-1}\log d}$, $\delta \asymp s\sqrt{n^{-1}\log d}$ and $n^{-1/2}s^3\log d = o(1)$, we have

$$
\sqrt{n}\,\hat U_M(0,\hat\theta) \to_d Z, \quad \text{where } Z \sim N(0,\sigma^2) \text{ and } \sigma^2 = \Omega_{\alpha\alpha} - 2w^{*T}\Omega_{\theta\alpha} + w^{*T}\Omega_{\theta\theta}w^*.
$$

Proof

By the definition of $\hat U_M(0,\hat\theta)$ and the mean value theorem, we have, for some $z, z' \in [0,1]$, $\bar\theta = \theta^* + z(\hat\theta - \theta^*)$ and $\bar\theta' = \theta^* + z'(\hat\theta - \theta^*)$,

$$
\begin{aligned}
\hat U_M(0,\hat\theta) &= \nabla_\alpha\mathcal{L}(0,\hat\theta) - \hat w^T\nabla_\theta\mathcal{L}(0,\hat\theta) \\
&= \nabla_\alpha\mathcal{L}(0,\theta^*) + \nabla^2_{\alpha\theta}\mathcal{L}(0,\bar\theta)(\hat\theta - \theta^*) - \big\{\hat w^T\nabla_\theta\mathcal{L}(0,\theta^*) + \hat w^T\nabla^2_{\theta\theta}\mathcal{L}(0,\bar\theta')(\hat\theta - \theta^*)\big\} \\
&= \underbrace{\nabla_\alpha\mathcal{L}(0,\theta^*) - w^{*T}\nabla_\theta\mathcal{L}(0,\theta^*)}_{S} + \underbrace{(w^* - \hat w)^T\nabla_\theta\mathcal{L}(0,\theta^*)}_{E_1} + \underbrace{\big\{\nabla^2_{\alpha\theta}\mathcal{L}(0,\bar\theta) - \hat w^T\nabla^2_{\theta\theta}\mathcal{L}(0,\bar\theta')\big\}(\hat\theta - \theta^*)}_{E_2}.
\end{aligned}
$$

Using Lemma D.1 with $v = (1, -w^{*T})^T$, and by the assumption that $\|w^*\|_0 \le s'$, it holds that

$$
\sqrt{n}\,S \to_d Z, \quad \text{where } Z \sim N(0,\sigma^2) \text{ and } \sigma^2 = \Omega_{\alpha\alpha} - 2w^{*T}\Omega_{\theta\alpha} + w^{*T}\Omega_{\theta\theta}w^*.
$$

Following a similar proof to that of Theorem 4.4 and utilizing the separable structure of the multivariate failure time model, we have $\sqrt{n}\,E_1 = o_P(1)$ and $\sqrt{n}\,E_2 = o_P(1)$. This concludes our proof. □

Remark D.3

Under the assumptions of Theorem D.2, the plug-in estimator $\hat\sigma^2 = \hat\Omega_{\alpha\alpha} - 2\hat w^T\hat\Omega_{\theta\alpha} + \hat w^T\hat\Omega_{\theta\theta}\hat w$ converges to $\sigma^2$ at the rate $O_P(ss'\sqrt{n^{-1}\log d}) = o_P(1)$.

Next, we extend the decorrelated Wald test to the multivariate failure time model, which allows us to construct confidence intervals for $\alpha^*$. We first estimate $\beta^*$ by the $\ell_1$-penalized estimator $\hat\beta = (\hat\alpha, \hat\theta)$. Let

$$
\bar\alpha_M = \hat\alpha - \left\{\frac{\partial\hat U_M(\hat\alpha,\hat\theta)}{\partial\alpha}\right\}^{-1}\hat U_M(\hat\alpha,\hat\theta).
$$
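A sketch of this one-step construction and the resulting confidence interval (our own illustration; we approximate $\partial\hat U_M/\partial\alpha$ by the plug-in $\hat\gamma^2 = \hat H_{\alpha\alpha} - \hat w^T\hat H_{\theta\alpha}$, and `sigma2` is the plug-in variance of Remark D.3):

```python
import numpy as np
from scipy.stats import norm

def decorrelated_wald(alpha_hat, U_hat, hess, w_hat, sigma2, n, level=0.05):
    """One-step estimator alpha_bar and an asymptotic (1 - level) CI for alpha*;
    `hess` is the full sample Hessian with the alpha coordinate first, and
    `U_hat` is the decorrelated score evaluated at (alpha_hat, theta_hat)."""
    gamma2 = hess[0, 0] - w_hat @ hess[1:, 0]     # gamma^2 = H_aa - w' H_ta
    alpha_bar = alpha_hat - U_hat / gamma2
    half = norm.ppf(1 - level / 2) * np.sqrt(sigma2) / (gamma2 * np.sqrt(n))
    return alpha_bar, (alpha_bar - half, alpha_bar + half)
```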

We derive the asymptotic normality of $\bar\alpha_M$ in the next theorem.

Theorem D.4

Suppose Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3 hold. For $\lambda \asymp \sqrt{n^{-1}\log d}$, $\delta \asymp s\sqrt{n^{-1}\log d}$ and $n^{-1/2}s^3\log d = o(1)$, under the null hypothesis that $\alpha^* = 0$, we have

$$
\sqrt{n}\,\bar\alpha_M \to_d Z, \quad \text{where } Z \sim N(0, \sigma^2/\gamma^4),
$$

and $\sigma^2 = \Omega_{\alpha\alpha} - 2w^{*T}\Omega_{\theta\alpha} + w^{*T}\Omega_{\theta\theta}w^*$, $\gamma^2 = H^*_{\alpha\alpha} - w^{*T}H^*_{\theta\alpha}$.

Proof

By the definition of $\bar\alpha_M$, we have

$$
\begin{aligned}
\bar\alpha_M &= \hat\alpha - \left[\gamma^{-2} - \gamma^{-2} + \left\{\frac{\partial\hat U_M(\hat\alpha,\hat\theta)}{\partial\alpha}\right\}^{-1}\right]\hat U_M(\hat\alpha,\hat\theta) \\
&= \hat\alpha - \gamma^{-2}\left\{\hat U_M(0,\hat\theta) + (\hat\alpha - 0)\frac{\partial\hat U_M(\bar\alpha,\hat\theta)}{\partial\alpha}\right\} + \left[\gamma^{-2} - \left\{\frac{\partial\hat U_M(\hat\alpha,\hat\theta)}{\partial\alpha}\right\}^{-1}\right]\hat U_M(\hat\alpha,\hat\theta) \\
&= \underbrace{-\gamma^{-2}\hat U_M(0,\hat\theta)}_{S} + \underbrace{\hat\alpha\gamma^{-2}\left\{\gamma^2 - \frac{\partial\hat U_M(\bar\alpha,\hat\theta)}{\partial\alpha}\right\}}_{R_1} + \underbrace{\hat U_M(\hat\alpha,\hat\theta)\left[\gamma^{-2} - \left\{\frac{\partial\hat U_M(\hat\alpha,\hat\theta)}{\partial\alpha}\right\}^{-1}\right]}_{R_2},
\end{aligned}
$$

where the second equality holds by the mean value theorem for some $\bar\alpha = \upsilon\hat\alpha$ with $\upsilon \in [0,1]$. For the first term, $\sqrt{n}\,S \to_d Z$ with $Z \sim N(0, \sigma^2/\gamma^4)$ by Theorem D.2. In addition, $\sqrt{n}\,R_1 = o_P(1)$ and $\sqrt{n}\,R_2 = o_P(1)$ by similar arguments as in Theorem 4.7. This concludes the proof. □

Finally, we extend the decorrelated partial likelihood ratio test to the multivariate failure time model. The test statistic is

$$
2n\big\{\mathcal{L}(0,\hat\theta) - \mathcal{L}(\bar\alpha_M, \hat\theta - \bar\alpha_M\hat w)\big\}.
$$

Under the null hypothesis, the test statistic converges to a weighted chi-squared distribution, as shown in the following theorem.
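A sketch of the resulting test (our own illustration; `loss` evaluates $\mathcal{L}$ at a full coefficient vector with the $\alpha$ coordinate first, and the critical value uses the weighted $\chi^2_1$ limit with plug-in weight $\hat\sigma^2/\hat\gamma^2$):

```python
import numpy as np
from scipy.stats import chi2

def decorrelated_plr_test(loss, theta_hat, alpha_bar, w_hat, sigma2, gamma2,
                          n, level=0.05):
    """Decorrelated partial likelihood ratio test against the weighted chi^2_1."""
    null_val = loss(np.concatenate(([0.0], theta_hat)))
    alt_val = loss(np.concatenate(([alpha_bar], theta_hat - alpha_bar * w_hat)))
    stat = 2 * n * (null_val - alt_val)
    crit = (sigma2 / gamma2) * chi2.ppf(1 - level, df=1)
    return stat, stat > crit                      # statistic, reject H0 at `level`?
```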

Theorem D.5

Suppose Assumptions 2.1, 2.2, 4.1, 4.2 and 4.3 hold. If $\lambda \asymp \sqrt{n^{-1}\log d}$, $\delta \asymp s\sqrt{n^{-1}\log d}$ and $n^{-1/2}s^3\log d = o(1)$, under the null hypothesis $\alpha^* = 0$, we have

$$
2n\big\{\mathcal{L}(0,\hat\theta) - \mathcal{L}(\bar\alpha_M, \hat\theta - \bar\alpha_M\hat w)\big\} \to_d \frac{\sigma^2}{\gamma^2}Z_\chi, \quad \text{where } Z_\chi \sim \chi_1^2,
$$

and $\sigma^2 = \Omega_{\alpha\alpha} - 2w^{*T}\Omega_{\theta\alpha} + w^{*T}\Omega_{\theta\theta}w^*$, $\gamma^2 = H^*_{\alpha\alpha} - w^{*T}H^*_{\theta\alpha}$.

Proof

We have, by the mean value theorem, for some $\bar\alpha_1 = v_1\bar\alpha_M$, $\bar\alpha_2 = v_2\bar\alpha_M$, $\bar\theta_1 = \theta^* + v_3(\hat\theta - \theta^*)$ and $\bar\theta_2 = \theta^* + v_4(\hat\theta - \theta^*)$ with $0 \le v_1, v_2, v_3, v_4 \le 1$,

$$
\begin{aligned}
\mathcal{L}(\bar\alpha_M, \hat\theta - \bar\alpha_M\hat w) - \mathcal{L}(0,\hat\theta) &= \bar\alpha_M\nabla_\alpha\mathcal{L}(0,\hat\theta) - \bar\alpha_M\hat w^T\nabla_\theta\mathcal{L}(0,\hat\theta) \\
&\quad + \frac{\bar\alpha_M^2}{2}\big\{\nabla^2_{\alpha\alpha}\mathcal{L}(\bar\alpha_1,\hat\theta) + \hat w^T\nabla^2_{\theta\theta}\mathcal{L}(0,\bar\theta_1)\hat w - 2\hat w^T\nabla^2_{\theta\alpha}\mathcal{L}(\bar\alpha_2,\bar\theta_2)\big\} \\
&= \underbrace{\bar\alpha_M\hat U_M(0,\hat\theta)}_{L} + \underbrace{\frac{\bar\alpha_M^2}{2}\big\{\nabla^2_{\alpha\alpha}\mathcal{L}(\bar\alpha_1,\hat\theta) + \hat w^T\nabla^2_{\theta\theta}\mathcal{L}(0,\bar\theta_1)\hat w - 2\hat w^T\nabla^2_{\theta\alpha}\mathcal{L}(\bar\alpha_2,\bar\theta_2)\big\}}_{E}.
\end{aligned}
$$

We first consider the term $L$. By Theorem D.2, $\hat U_M(0,\hat\theta) = O_P(n^{-1/2})$, and by Theorem D.4, $\bar\alpha_M = -\gamma^{-2}\hat U_M(0,\hat\theta) + o_P(n^{-1/2})$. Hence

$$
L = -\gamma^{-2}\hat U_M(0,\hat\theta)^2 + o_P(n^{-1}).
$$

Next, we look at the term E,

$$
E = \underbrace{\frac{\bar\alpha_M^2}{2}\big(H^*_{\alpha\alpha} + H^*_{\alpha\theta}H^{*-1}_{\theta\theta}H^*_{\theta\alpha} - 2H^*_{\alpha\theta}H^{*-1}_{\theta\theta}H^*_{\theta\alpha}\big)}_{E_1} + \underbrace{\frac{\bar\alpha_M^2}{2}\Big[\big\{\nabla^2_{\alpha\alpha}\mathcal{L}(\bar\alpha_1,\hat\theta) - H^*_{\alpha\alpha}\big\} + \big\{\hat w^T\nabla^2_{\theta\theta}\mathcal{L}(0,\bar\theta_1)\hat w - w^{*T}H^*_{\theta\theta}w^*\big\} - 2\big\{\hat w^T\nabla^2_{\theta\alpha}\mathcal{L}(\bar\alpha_2,\bar\theta_2) - w^{*T}H^*_{\theta\alpha}\big\}\Big]}_{E_2}.
$$

By Theorem D.4, it holds that $2nE_1 \to_d (\sigma^2/\gamma^2)Z_\chi$. In addition, by similar arguments as in Theorem 4.9, $E_2 = o_P(n^{-1})$. Thus, we have

$$
2n\big\{\mathcal{L}(0,\hat\theta) - \mathcal{L}(\bar\alpha_M, \hat\theta - \bar\alpha_M\hat w)\big\} \to_d \frac{\sigma^2}{\gamma^2}Z_\chi, \quad \text{where } Z_\chi \sim \chi_1^2,
$$

which concludes our proof. □

Footnotes

1

It is straightforward to extend the setting from a univariate scalar parameter to a multivariate parameter vector.

Contributor Information

Ethan X. Fang, Email: xingyuan@princeton.edu.

Yang Ning, Email: yangning@princeton.edu.

Han Liu, Email: hanliu@princeton.edu.

References

  1. Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000;403:503–511.
  2. Andersen PK, Gill RD. Cox's regression model for counting processes: a large sample study. Ann Statist. 1982;10:1100–1120.
  3. Antoniadis A, Fryzlewicz P, Letué F. The Dantzig selector in Cox's proportional hazards model. Scand J Stat. 2010;37:531–552.
  4. Bradic J, Fan J, Jiang J. Regularization for Cox's proportional hazards model with NP-dimensionality. Ann Statist. 2011;39:3092–3120.
  5. Cai J, Fan J, Li R, Zhou H. Variable selection for multivariate failure time data. Biometrika. 2005;92:303–316.
  6. Chernozhukov V, Chetverikov D, Kato K. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Ann Statist. 2013;41:2786–2819.
  7. Cox DR. Regression models and life-tables. J R Stat Soc Ser B Stat Methodol. 1972;34:187–220.
  8. Cox DR. Partial likelihood. Biometrika. 1975;62:269–276.
  9. Dawber TR. The Framingham Study: The Epidemiology of Atherosclerotic Disease. Vol. 84. Harvard University Press; Cambridge: 1980.
  10. Di Gaetano N, Cittera E, Nota R, Vecchi A, Grieco V, Scanziani E, Botto M, Introna M, Golay J. Complement activation determines the therapeutic activity of rituximab in vivo. J Immunol. 2003;171:1581–1587.
  11. Fan J, Li R. Variable selection for Cox's proportional hazards model and frailty model. Ann Statist. 2002;30:74–99.
  12. Gui J, Li H. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics. 2005;21:3001–3008.
  13. Hiai H, Tsuruyama T, Yamada Y. Pre-B lymphomas in SL/Kh mice: A multi-factorial disease model. Cancer Science. 2003;94:847–850.
  14. Huang J, Sun T, Ying Z, Yu Y, Zhang C-H. Oracle inequalities for the Lasso in the Cox model. Ann Statist. 2013;41:1142–1165.
  15. Javanmard A, Montanari A. Confidence intervals and hypothesis testing for high-dimensional statistical models. NIPS. 2013:1187–1195.
  16. Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. Vol. 360. John Wiley & Sons; 2011.
  17. Kong S, Nan B. Non-asymptotic oracle inequalities for the high-dimensional Cox regression via Lasso. Stat Sinica. 2014;24:25–42.
  18. Kosorok MR. Introduction to Empirical Processes and Semiparametric Inference. Springer; 2007.
  19. Lin W, Lv J. High-dimensional sparse additive hazards regression. J Amer Statist Assoc. 2013;108:247–264.
  20. Lockhart R, Taylor J, Tibshirani RJ, Tibshirani R. A significance test for the Lasso. Ann Statist. 2014;42:413–468.
  21. Meierhoff G, Dehmel U, Gruss H, Rosnet O, Birnbaum D, Quentmeier H, Dirks W, Drexler H. Expression of FLT3 receptor and FLT3-ligand in human leukemia-lymphoma cell lines. Leukemia. 1995;9:1368–1372.
  22. Müller P, van de Geer S. Censored linear model in high dimensions. 2014. arXiv:1405.0579.
  23. Nishiu M, Yanagawa R, Nakatsuka S-i, Yao M, Tsunoda T, Nakamura Y, Aozasa K. Microarray analysis of gene-expression profiles in diffuse large B-cell lymphoma: Identification of genes related to disease progression. Cancer Science. 2002;93:894–901.
  24. Shorack GR, Wellner JA. Empirical Processes with Applications to Statistics. Vol. 59. SIAM; 2009.
  25. Tibshirani R. The Lasso method for variable selection in the Cox model. Stat Med. 1997;16:385–395.
  26. Tsiatis AA. A large sample study of Cox's regression model. Ann Statist. 1981;9:93–108.
  27. van de Geer S, Bühlmann P, Ritov Y, Dezeure R. On asymptotically optimal confidence regions and tests for high-dimensional models. Ann Statist. 2014;42:1166–1202.
  28. van der Vaart AW. Asymptotic Statistics. Cambridge University Press; 2000.
  29. van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. Springer; 1996.
  30. Wang S, Nan B, Zhu N, Zhu J. Hierarchically penalized Cox regression with grouped variables. Biometrika. 2009;96:307–322.
  31. Zhang C-H, Zhang SS. Confidence intervals for low dimensional parameters in high dimensional linear models. J R Stat Soc Ser B Stat Methodol. 2014;76:217–242.
  32. Zhang HH, Lu W. Adaptive Lasso for Cox's proportional hazards model. Biometrika. 2007;94:691–703.
  33. Zhao SD, Li Y. Principled sure independence screening for Cox models with ultra-high-dimensional covariates. J Multivariate Anal. 2012;105:397–411.
