Asymptotic Behavior of Cox’s Partial Likelihood and its Application to Variable Selection

Runze Li; Jian-Jian Ren; Guangren Yang; Ye Yu

doi:10.5705/ss.202016.0401

Author manuscript; available in PMC: 2019 Oct 1.

Published in final edited form as: Stat Sin. 2018 Oct;28(4):2713–2731. doi: 10.5705/ss.202016.0401

PMCID: PMC6168090 NIHMSID: NIHMS866285 PMID: 30294192

Abstract

For theoretical properties of variable selection procedures for Cox’s model, we study the asymptotic behavior of partial likelihood for the Cox model. We find that the partial likelihood does not behave like an ordinary likelihood, whose sample average typically tends to its expected value, a finite number, in probability. Under some mild conditions, we prove that the sample average of partial likelihood tends to infinity at the rate of the logarithm of the sample size, in probability. We apply the asymptotic results on the partial likelihood to study tuning parameter selection for penalized partial likelihood. We find that the penalized partial likelihood with the generalized cross-validation (GCV) tuning parameter proposed in Fan and Li (2002) enjoys the model selection consistency property, despite the fact that GCV, AIC and C_p, equivalent in the context of linear regression models, are not model selection consistent. Our empirical studies via Monte Carlo simulation and a data example confirm our theoretical findings.

Key words and phrases: Akaike information criterion, Bayesian information criterion, LASSO penalized partial likelihood, SCAD, variable selection

1. Introduction

The Cox model (Cox (1972)) has been the most popular model in survival data analysis during the past decades, and the partial likelihood (Cox (1975)) is perhaps the most commonly used technique for the analysis of right censored data. In practice, many risk factors and covariates are available for the initial analysis, so an important task is to identify the significant risk factors and covariates. Variable selection is a useful technique in the analysis of survival data in the presence of many covariates. Classical variable selection criteria for linear regression models, such as AIC (Akaike (1974)) and BIC (Schwarz (1978)), can be extended to the Cox model by replacing the log-likelihood with the log-partial likelihood. The LASSO (Tibshirani (1996)) variable selection technique has been extended to the Cox model (Tibshirani (1997); Zhang and Lu (2007); Zou (2008)). Nonconcave partial likelihood variable selection procedures have been developed for the Cox model (Fan and Li (2002); Bradic, Fan, and Jiang (2011)). To investigate the theoretical properties of these procedures, we have to study the asymptotic behavior of the partial likelihood.

There has been little work on the asymptotic behavior of the partial likelihood, though the asymptotic properties of the partial likelihood estimator have been extensively studied (Tsiatis (1981); Andersen and Gill (1982); Takemi and Toshinari (1984)). Under mild regularity conditions, the maximum partial likelihood estimator behaves the same as the ordinary maximum likelihood estimator of i.i.d. random samples in terms of asymptotic consistency, asymptotic normality and asymptotic efficiency. See, for example, Murphy and van der Vaart (2000). In this paper, we first study the asymptotic behavior of the partial likelihood, and prove that the ‘sample average’ of the partial likelihood diverges to infinity at the rate of the logarithm of the sample size. This clearly indicates that the Cox partial likelihood does not behave like an ordinary likelihood, for which, under mild regularity conditions, the sample average of the likelihood function converges to its expectation (a finite value) in probability as the sample size tends to infinity.

With the aid of the asymptotic property of partial likelihood, we study the selection of regularization parameter in penalized partial likelihood for variable selection. Tibshirani (1997) proposed penalized partial likelihood with LASSO penalty for the Cox model. Fan and Li (2002) proposed the partial likelihood with the SCAD penalty for the Cox models, and showed that under certain regularity conditions, the resulting estimate enjoys the oracle property. Zhang and Lu (2007) and Zou (2008) further proposed adaptive LASSO for the Cox model to improve the SCAD procedure in terms of computational efficiency, while retaining the oracle property. The oracle property depends on the choice of the regularization parameter in penalized partial likelihood. It is well known that the regularization parameter controls the model complexity of the selected models, and plays a crucial role in these variable selection procedures. The issue of regularization parameter selection for penalized partial likelihood has not been systematically studied, in part because the asymptotic behavior of partial likelihood was not well understood. Wang, Li, and Tsai (2007) studied the selection of regularization parameter in the SCAD penalized least squares for linear regression models. They showed that with a positive probability, the generalized cross-validation (GCV, Craven and Wahba (1979)) selector yields an over-fitted model, and therefore this procedure does not enjoy the oracle property.

In this paper, we prove that the GCV selector for the SCAD method for the Cox model enjoys model selection consistency, in contrast to its model selection inconsistency in the least squares setting as demonstrated in Wang, Li, and Tsai (2007). Although GCV is equivalent to AIC and C_p in the context of linear regression models, AIC and C_p yield overfitted models with positive probability, and thus are not model selection consistent.

The rest of this paper is organized as follows. Section 2 studies the asymptotic behavior of the partial likelihood of the Cox model. We study the regularization parameter selection for the penalized partial likelihood in Section 3. Simulation study and a data example are presented in Section 4. Proofs are given in the Appendix.

2. Asymptotic Behavior of Cox’s Partial Likelihood

Let T and X = (X₁, ···, X_d)^T be the survival time and associated d-dimensional vector of covariates, respectively. Consider the Cox proportional hazard regression model:

h (t ∣ x) = h_{0} (t) exp (x^{T} β),

(2.1)

where β is the regression coefficient vector, and h(t | x) is the conditional hazard function of T given X = x with h₀(t) as an arbitrary baseline hazard function. Suppose that (T₁, x₁), …, (T_n, x_n) is a random sample of (T, X), and the observed right censored survival data are as follows: (V₁, δ₁, x₁), …, (V_n, δ_n, x_n), where V_i = min{T_i, C_i}, δ_i = I{T_i ≤ C_i}, and C_i is the right censoring variable independent of T_i given X = x_i. Without loss of generality, assume that there are no ties among the observed continuous random variables V_i. The log-partial likelihood function of the observed data is

ℓ_{c} (β) = \sum_{i = 1}^{n} δ_{i} x_{i}^{T} β - \sum_{i = 1}^{n} δ_{i} log (\sum_{j = 1}^{n} I {V_{j} \geq V_{i}} exp (x_{j}^{T} β)) .

(2.2)

(Cox (1975)). The goal is to study the asymptotic behavior of ℓ_c(β). We first illustrate the different behaviors of the log-partial likelihood and the likelihood of an i.i.d. sample by an example.
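As an illustration only, (2.2) can be evaluated directly with NumPy. The sketch below is not part of the original analysis: the function name `log_partial_likelihood` and its argument layout are assumptions made here, and it presumes there are no ties among the V_i.

```python
import numpy as np

def log_partial_likelihood(beta, V, delta, X):
    """Evaluate the log-partial likelihood (2.2), assuming no ties among V."""
    beta = np.atleast_1d(np.asarray(beta, dtype=float))
    V = np.asarray(V, dtype=float)
    delta = np.asarray(delta, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(V), -1)
    eta = X @ beta                      # linear predictors x_i^T beta
    order = np.argsort(-V)              # indices sorted by decreasing observed time
    # A running log-sum-exp over the sorted sample gives, for each subject i,
    # log sum_{j: V_j >= V_i} exp(x_j^T beta), i.e. the log of the risk-set sum.
    log_risk = np.empty(len(V))
    log_risk[order] = np.logaddexp.accumulate(eta[order])
    return float(np.sum(delta * (eta - log_risk)))
```

Sorting by decreasing V_i lets one cumulative pass produce every risk-set sum, so the cost is O(n log n) rather than the O(n²) of a literal double sum.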

Example 1

Suppose that we have an i.i.d. random sample {Y₁, …, Y_n} from a population with probability density/mass function f(y; θ), so $ℓ (θ) = \sum_{i = 1}^{n} log {f (Y_{i}; θ)}$ is the log-likelihood function. By the weak law of large numbers, n⁻¹ℓ(θ) → E log{f(Y; θ)} in probability under mild regularity conditions. Furthermore, under mild regularity conditions, the maximum partial likelihood estimator, the maximizer of ℓ_c(β), behaves the same as the ordinary maximum likelihood estimator, the maximizer of ℓ(θ), in terms of asymptotic consistency, asymptotic normality and asymptotic efficiency. See, for example, Murphy and van der Vaart (2000). Here, we numerically illustrate that

- n^{- 1} ℓ_{c} (β) \to \infty as n \to \infty .

(2.3)

We generated a random sample of size n from the proportional hazard model

h (t ∣ x) = h_{0} (t) exp (X β),

where h₀(t) ≡ 1, β = 1 and X ~ N(0, 1). The censoring variable C was generated from an exponential distribution with mean U, so the average censoring rate varies with U. We list several values of U in Table 2.1 together with their corresponding average censoring rates, 1 − EI(T ≤ C) ≙ 1 − ρ₁, and take 10 different values of n ranging from 4 (= 2²) to 1024 (= 2¹⁰). Figure 2.1 depicts the scatter plot of log(n) versus −n⁻¹ℓ_c for a typical sample at each value of U listed in Table 2.1. Figure 2.1 clearly suggests that −n⁻¹ℓ_c increases at the log(n) rate.

Table 2.1.

Values of U and the corresponding Average Censoring Rates (1 − ρ₁) together with μ₀ ≙ E{I(T ≤ C)X}.

U	10.00	5.00	2.75	1.80	1.20	0.80	0.50	0.30
(1 − ρ₁)	0.1222	0.2055	0.2991	0.3797	0.4613	0.5485	0.6393	0.7311
μ₀	0.0968	0.1429	0.1775	0.1971	0.2007	0.2028	0.1921	0.1652

Figure 2.1 — Plot of log(n) versus −n⁻¹ℓ_c. ‘○’ is the scatter plot of log(n) versus −n⁻¹ℓ_c based on a typical simulated data set. The solid line in each plot is log(n)ρ̂₁ − β^T μ̂₀ with β = 1, where ρ̂₁ is an estimate of EI {T ≤ C} and μ̂₀ is an estimate of E{I{T ≤ C}X}.
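The experiment in Example 1 can be approximated with the following minimal sketch. It is not the code used for Figure 2.1: the helper names `simulate` and `neg_avg_log_pl`, the random seed, the choice U = 2.75, and the grid of powers of two for n are assumptions made for illustration. For each n it prints −n⁻¹ℓ_c at β = 1 next to the reference value log(n)ρ̂₁ − βμ̂₀ that defines the solid line in the figure.

```python
import numpy as np

def neg_avg_log_pl(beta, V, delta, x):
    """-n^{-1} l_c(beta) for a scalar covariate x, assuming no ties among V."""
    eta = beta * x
    order = np.argsort(-V)                       # decreasing observed times
    log_risk = np.empty(len(V))
    log_risk[order] = np.logaddexp.accumulate(eta[order])  # log risk-set sums
    return -np.mean(delta * (eta - log_risk))

def simulate(n, U, beta=1.0, rng=None):
    """Draw (V, delta, x) with h0(t) = 1, X ~ N(0, 1), and C ~ Exp(mean U)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(n)
    T = rng.exponential(scale=np.exp(-beta * x))  # hazard exp(x * beta)
    C = rng.exponential(scale=U, size=n)          # censoring time with mean U
    return np.minimum(T, C), (T <= C).astype(float), x

rng = np.random.default_rng(0)
for k in range(2, 11):
    n = 2 ** k
    V, delta, x = simulate(n, U=2.75, rng=rng)
    observed = neg_avg_log_pl(1.0, V, delta, x)
    reference = np.log(n) * delta.mean() - 1.0 * np.mean(delta * x)
    print(n, round(observed, 3), round(reference, 3))
```

If the log(n) rate holds, the gap between the two printed columns should stay bounded as n grows while both columns increase.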

We next show that −n⁻¹ℓ_c(β) tends to infinity at the rate of log(n) using techniques related to empirical processes. Let

G_{n} (v, x) = n^{- 1} \sum_{i = 1}^{n} I {V_{i} \leq v, x_{i} \leq x}, \quad H_{n} (v) = n^{- 1} \sum_{i = 1}^{n} I {V_{i} \leq v, δ_{i} = 1},

(2.4)

with G(v, x) and H(v) as the limits of the empirical distribution functions G_n(v, x) and H_n(v), respectively. Take μ₀ = E{I{T ≤ C}X}, W(t) = ∫∫_{v ≥ t} exp(x^T β) dG(v, x), and ρ₁ = EI{T ≤ C}. The proof of the following theorem is given in Appendix A.

Theorem 1

If (V_i, δ_i, x_i), i = 1, ···, n, is a random sample from the Cox model (2.1) and the censoring time C_i is independent of T_i given x_i, then the following statements hold.

(a) If X has bounded support, then
$- n^{- 1} ℓ_{c} (β) = ρ_{1} log n - μ_{0}^{T} β + O_{p} (1)$ as n → ∞. (2.5)
(b) If $μ_{1} = \int_{0}^{\infty} log W (t) d H (t)$ is well-defined, E|X_j| < ∞ for all j = 1, ···, d, and 0 < E exp(X^T β) < ∞, then
$- n^{- 1} ℓ_{c} (β) = ρ_{1} log n - μ_{0}^{T} β + μ_{1} + o_{p} (1) .$ (2.6)

When there is no censoring, it can be shown that W(t) = f_T(t)/h₀(t) and $μ_{1} = \int_{0}^{\infty} log [f_{T} (t) / h_{0} (t)] {d F}_{T} (t)$ , where f_T(t) and F_T(t) are the probability density and cumulative distribution function of T in (2.1), respectively. Thus, the assumption about μ₁ holds for many distributions, such as the exponential distribution.
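As a sanity check on (2.6), consider the simplest special case. The setting is an assumption chosen here for illustration: no censoring, β = 0, and h₀(t) ≡ 1, so that T is standard exponential and every risk set is simply the set of subjects still under observation. The expansion can then be verified by hand:

```latex
\begin{aligned}
\rho_1 &= 1, \qquad \mu_0^{T}\beta = 0, \qquad
W(t) = f_T(t)/h_0(t) = e^{-t}, \\
\mu_1 &= \int_0^\infty \log\bigl(e^{-t}\bigr)\, e^{-t}\, dt
       = -\int_0^\infty t\, e^{-t}\, dt = -1, \\
-n^{-1}\ell_c(0) &= n^{-1}\sum_{i=1}^{n} \log(n - i + 1)
       = n^{-1}\log(n!) = \log n - 1 + o(1).
\end{aligned}
```

The last equality is Stirling's formula, and it agrees with the value ρ₁ log n − μ₀^T β + μ₁ = log n − 1 given by (2.6).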

Remark

From the proof of this theorem, the leading term ρ₁ log(n) comes from $log (n) (\frac{1}{n} \sum_{i = 1}^{n} δ_{i})$ , which does not depend on the regression coefficient β and does not affect the first and second order derivatives of the partial likelihood function. As the asymptotic normality of the maximum partial likelihood estimator relies on the first and second order derivatives, the divergent behavior of the partial likelihood function does not impact the asymptotic normality of the partial likelihood estimator. On the other hand, the tuning parameter selector for penalized partial likelihood, studied in the next section, depends on the partial likelihood function itself. As a result, the asymptotic behavior of the partial likelihood function directly affects the property of the tuning parameter selector.

3. Tuning parameter selector in penalized partial likelihood

Take the penalized partial likelihood to be

ℓ_{c} (β) - n \sum_{j = 1}^{d} p_{λ} (∣ β_{j} ∣),

(2.7)

where d is the dimension of β, p_λ(·) is a penalty function with a tuning parameter λ (or more generally, λ_js). The penalized partial likelihood estimate of β maximizes (2.7) with respect to β. Denote by β₀ the true value of β, and let $β_{0} = {(β_{10}, \dots, β_{d 0})}^{T} = {(β_{10}^{T}, β_{20}^{T})}^{T}$ . Without loss of generality, we take β₂₀ = 0 with all components of β₁₀ nonzero. Under some regularity conditions, Fan and Li (2002) showed that the nonconcave penalized likelihood estimator $\hat{β} = {({\hat{β}}_{1}^{T}, {\hat{β}}_{2}^{T})}^{T}$ possesses the oracle property: with probability tending to 1, for a certain choice of p_{λ_n}(·), we have β̂₂ = 0 and

\sqrt{n} ({\hat{β}}_{1} - β_{10}) \to N {0, I_{1}^{- 1} (β_{10}, 0)},

where I₁(β₁₀, 0) is the Fisher information matrix for β₁ knowing β₂ = 0.

The oracle property depends on the choice of the tuning parameter. Thus, the selection of tuning parameter is fundamental in the penalized likelihood procedure. Wang, Li, and Tsai (2007) studied the selection of the tuning parameter for penalized least squares for linear regression models. They showed that the GCV tuning parameter of Fan and Li (2001) cannot yield an oracle estimator. The issue of tuning parameter selection for the penalized partial likelihood has not been studied. Based on the asymptotic results about the partial likelihood, we show that the GCV tuning parameter selector for (2.7) possesses model selection consistency, in contrast to the model selection inconsistency of the GCV tuning parameter selector in the penalized least squares setting.

Let β̂_λ be the penalized partial likelihood estimator with tuning parameter λ. Define the GCV statistic to be

GCV (λ) = \frac{- ℓ_{c} ({\hat{β}}_{λ})}{n {1 - {df}_{λ} / n}^{2}} .

(2.8)

The GCV tuning parameter selector is λ̂_GCV = argmin_λ{GCV(λ)}, where the degrees of freedom df_λ is set to be the number of nonzero penalized partial likelihood estimates corresponding to the tuning parameter λ. It can be shown that with probability tending to one, the effective number of parameters in Fan and Li (2002) is df_λ by using related techniques in Zhang, Li, and Tsai (2010).

We define the corresponding AIC and BIC statistics as

AIC (λ) = - 2 ℓ_{c} ({\hat{β}}_{λ}) + 2 {df}_{λ}

(2.9)

BIC (λ) = - 2 ℓ_{c} ({\hat{β}}_{λ}) + log (n) {df}_{λ},

(2.10)

with the AIC and BIC tuning parameter selectors

{\hat{λ}}_{AIC} = {argmin}_{λ} {AIC (λ)} and {\hat{λ}}_{BIC} = {argmin}_{λ} {BIC (λ)}

When t lies in a neighborhood of 0, (1 − t)⁻² ≈ 1 + 2t, so, taking t = df_λ/n, when n is large enough,

2 n GCV (λ) \approx - 2 ℓ_{c} ({\hat{β}}_{λ}) + 4 (- ℓ_{c} ({\hat{β}}_{λ}) / n) {df}_{λ} .

If −ℓ_c(β̂_λ)/(n log(n)) → EI{T ≤ C} as n → ∞, then

2 n GCV (λ) \approx - 2 ℓ_{c} ({\hat{β}}_{λ}) + 4 ρ_{1} log (n) {df}_{λ} .

(2.11)

For ρ₁ ≥ 1/4, the GCV tuning parameter can yield a sparser model than the one selected by the BIC-tuning parameter selector, as is seen in the simulation study in Section 4.
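The first-order approximation behind (2.11) can be checked numerically. The sketch below is illustrative only: the function names are assumptions, and the log-partial likelihood value is a hypothetical number constructed so that −ℓ_c/(n log n) equals an assumed ρ₁ = 0.65, not a value from any fitted model. It uses the expansion (1 − t)⁻² ≈ 1 + 2t with t = df_λ/n.

```python
import math

def gcv(loglik, df, n):
    """GCV statistic (2.8): -l_c / (n * (1 - df/n)^2)."""
    return -loglik / (n * (1.0 - df / n) ** 2)

def two_n_gcv_first_order(loglik, df, n):
    """First-order approximation to 2n * GCV: -2 l_c + 4 (-l_c / n) df."""
    return -2.0 * loglik + 4.0 * (-loglik / n) * df

n, rho1 = 400, 0.65                      # assumed sample size and uncensored rate
loglik = -rho1 * n * math.log(n)         # hypothetical l_c, for illustration only
for df in (3, 6, 12):
    exact = 2 * n * gcv(loglik, df, n)
    approx = two_n_gcv_first_order(loglik, df, n)
    print(df, round(exact, 2), round(approx, 2))

# Per-parameter penalty implied by GCV relative to the BIC penalty log(n).
penalty_ratio = 4.0 * (-loglik / n) / math.log(n)
print(penalty_ratio)
```

Under these assumptions the penalty ratio is 4ρ₁, which exceeds 1 exactly when ρ₁ > 1/4; this is the sense in which GCV can penalize model size more heavily than BIC.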

3.1. Definition and Notation

We first need to define the candidate models considered in model selection. Let ᾱ = {1, ···, d} denote the set of labels of the predictors in the full model. A subset α of ᾱ then represents a candidate model that includes the predictors labelled by α. For each candidate model α, its model size and the corresponding coefficients are df_α and β_α. Each tuning parameter λ in the penalty function thus results in a selected model α_λ with model size df_{α_λ} and the corresponding coefficients β̂_λ. The collection of all candidate models is denoted by 𝓐.

For any given model α, we are able to obtain its non-penalized estimates ${\hat{β}}_{α}^{★}$ by maximizing the corresponding partial likelihood ℓ_c(β). Similarly, for any selected model α_λ obtained from penalized partial likelihood with given λ, we are able to obtain the corresponding non-penalized estimates ${\hat{β}}_{α_{λ}}^{★}$ .

To study the asymptotic behaviors of the tuning parameter selectors for penalized partial likelihood, we define a general tuning parameter selector

{GIC}_{κ_{n}} (\hat{β}) = - 2 ℓ_{c} (\hat{β}) + κ_{n} {df}_{\hat{β}},

(2.12)

where β̂ is the parameter estimator and df_β̂ is the corresponding degrees of freedom associated with β̂. Here κ_n is a positive number that indexes different variable selection criteria. When κ_n = 2, GIC_{κ_n} is the AIC at (2.9), and when κ_n = log(n), GIC_{κ_n} is the BIC at (2.10).
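To make the role of κ_n in (2.12) concrete, the following sketch selects a tuning parameter from a small solution path. Everything in it is hypothetical: the function names, the assumed sample size n = 200, and the (λ, ℓ_c, df) triples are made-up numbers used only to show the mechanics, not output from a fitted Cox model.

```python
import math

def gic(loglik, df, kappa_n):
    """GIC statistic (2.12): -2 l_c + kappa_n * df."""
    return -2.0 * loglik + kappa_n * df

def select_by_gic(path, kappa_n):
    """Return the lambda minimizing GIC over a path of (lambda, loglik, df)."""
    return min(path, key=lambda entry: gic(entry[1], entry[2], kappa_n))[0]

n = 200  # assumed sample size
# Made-up path: larger lambda gives a sparser model and a lower log-partial likelihood.
path = [(0.02, -500.0, 6), (0.05, -501.2, 4), (0.10, -503.5, 3), (0.20, -520.0, 2)]

lambda_aic = select_by_gic(path, kappa_n=2.0)           # AIC-type selector
lambda_bic = select_by_gic(path, kappa_n=math.log(n))   # BIC-type selector
print(lambda_aic, lambda_bic)
```

On this toy path the heavier penalty κ_n = log(n) picks a larger λ, and hence a smaller model, than κ_n = 2.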

3.2. Theoretical Property

In this section, we assume that the set of candidate models contains the unique true model and that the number of parameters in the full model is finite. Assume that the coefficients of the unique true model α₀ in 𝓐 are nonzero. Therefore, any candidate model α ⊉ α₀ is an underfitted model while any model α ⊃ α₀ is an overfitted model. We partition the tuning parameters into

Ω_{-} = {λ : α_{λ} ⊉ α_{0}}, Ω_{0} = {λ : α_{λ} = α_{0}} and Ω_{+} = {λ : α_{λ} \supset α_{0}} .

We need the following conditions.

(E1) λ_max depends on n and satisfies λ_max → 0 as n → ∞.
(E2) There exists a constant m such that the penalty p_λ(ξ) satisfies $p_{λ}^{'} (ξ) = 0$ for ξ > mλ.
(E3) If λ_n → 0 as n → ∞, then the penalty function satisfies
$\underset{n \to \infty}{lim inf} \underset{ξ ↓ 0}{lim inf} \sqrt{n} p_{λ}^{'} (ξ) \to \infty .$
(E4) For any candidate model α ∈ 𝓐, there exists c_α > 0 such that $- n^{- 1} ℓ_{c} ({\hat{β}}_{α}^{★}) - log (n) ρ_{1} \to c_{α}$ . In addition, for any underfitted model α ⊉ α₀, c_α > c_{α₀}.

Conditions (E1)–(E3) are conditions on the penalty while condition (E4) is the technical condition needed to investigate the asymptotic properties of the tuning parameter selectors for penalized partial likelihood. Condition (E1) indicates that a smaller tuning parameter is required if the sample size is large; (E2) implies that the penalty is chosen to yield an asymptotically unbiased estimator; (E3) is used to study the oracle property of the penalized estimator; (E4) assures that the underfitted model yields a larger model deviance than that of the true model.
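As a concrete instance of a penalty meeting (E2), the sketch below codes the derivative of the SCAD penalty as given in Fan and Li (2001), p′_λ(ξ) = λ{I(ξ ≤ λ) + (aλ − ξ)₊ / ((a − 1)λ) I(ξ > λ)} for ξ ≥ 0. The function name and the default a = 3.7 (the value suggested by Fan and Li (2001)) are choices made here for illustration.

```python
import numpy as np

def scad_derivative(xi, lam, a=3.7):
    """Derivative p'_lambda(xi) of the SCAD penalty for xi >= 0 and a > 2."""
    xi = np.asarray(xi, dtype=float)
    flat_part = (xi <= lam).astype(float)                 # equals 1 on [0, lam]
    taper_part = np.maximum(a * lam - xi, 0.0) / ((a - 1.0) * lam) * (xi > lam)
    return lam * (flat_part + taper_part)

lam = 0.1
print(scad_derivative([0.0, 0.05, 0.2, 0.37, 0.5], lam))
```

The derivative vanishes for ξ > aλ, so (E2) holds with m = a. Near zero the derivative equals λ, so for this penalty (E3) amounts to requiring √n λ_n → ∞.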

Theorem 2

Suppose that the partial likelihood function of the Cox model satisfies Conditions (A)–(D) in Fan and Li (2002) and that Conditions (E1)–(E4) hold.

(A) If there exists a positive constant M such that κ_n < M, then the tuning parameter λ̂ obtained by minimizing GIC_{κ_n}(λ) satisfies P{λ̂ ∈ Ω₋} → 0 and P{λ̂ ∈ Ω₊} > 0.
(B) If κ_n → ∞ and $κ_{n} / \sqrt{n} \to 0$ , then the tuning parameter λ̂ obtained by minimizing GIC_{κ_n}(λ) satisfies P{α_λ̂ = α₀} → 1.
(C) If ρ₁ > 0, then the tuning parameter λ̂ obtained by minimizing the GCV score defined in (2.8) satisfies P{α_λ̂ = α₀} → 1.

The proof of Theorem 2 is given in the supplement (Li, Ren, Yang and Yu (2016)).

Here, Theorem 2(A) implies that the GIC_{κ_n} selector with bounded κ_n tends to overfit without considering which penalty function is used, while Theorem 2(B) indicates that the GIC_{κ_n} selector with diverging κ_n enables us to identify the true model consistently. Thus, the penalized partial likelihood with diverging κ_n possesses the oracle property. Theorem 2(C) implies that the penalized partial likelihood estimator with the GCV selector also possesses the oracle property. This is quite different from penalized least squares for the linear regression model; as shown in Wang, Li and Tsai (2007), the GCV selector for the penalized least squares with linear model results in an overfitted model with positive probability.

4. Numerical Results

We assessed the finite sample performance of the proposed procedures. Since various comparisons among penalized partial likelihood methods with different penalties, such as the LASSO and SCAD, already exist, our simulation studies focused on comparisons among different tuning parameter selectors for penalized partial likelihood with the SCAD penalty. For simplicity, we refer to the SCAD penalized partial likelihood with κ_n = 2 and log(n) in the GIC_{κ_n} tuning parameter selector as SCAD-AIC and SCAD-BIC, respectively. Similarly, we refer to the SCAD method with the GCV as SCAD-GCV. The best subset selection with AIC and BIC criteria for the Cox model are denoted by AIC and BIC in this section, respectively. In our simulation, we employed the local linear approximation algorithm (LLA, Zou and Li (2008)) to compute the parameter estimates of the SCAD penalized partial likelihood function.

Example 4.1

We adapted the model structure in Fan and Li (2002) to generate the data with sample sizes n = 100, 200, and 400 from the Cox model with hazard function

h (t ∣ x) = h_{0} (t) exp (x^{T} β),

where h₀(t) ≡ 1, β = (0.8, 0, 0, 1, 0, 0, 0, 0, 0, 0.6, 0, 0)^T, and x had a 12-dimensional normal distribution, with the correlation between x_i and x_j as 0.5^{|i−j|}. Accordingly, μ(x^T β) = exp(−x^T β). The censoring distribution was exponential with mean U exp(−x^T β), where U was sampled from a uniform distribution over [1, 3]. Consequently, the average censoring percentage was 35%. We include the case with no censoring as a benchmark. For each scenario, we conducted 1000 simulations.
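The data-generating mechanism described above might be coded as follows. This is a sketch under stated assumptions, not the simulation code behind the reported tables: the function name `simulate_example`, the zero mean and unit variances of x, and the choice to draw U independently for each subject are all assumptions made here.

```python
import numpy as np

def simulate_example(n, rng, censoring=True):
    """Draw (V, delta, X) from the Cox model of Example 4.1 with h0(t) = 1."""
    beta = np.array([0.8, 0, 0, 1, 0, 0, 0, 0, 0, 0.6, 0, 0], dtype=float)
    d = beta.size
    idx = np.arange(d)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])   # corr(x_i, x_j) = 0.5^|i-j|
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    eta = X @ beta
    T = rng.exponential(scale=np.exp(-eta))              # E(T | x) = exp(-x^T beta)
    if censoring:
        U = rng.uniform(1.0, 3.0, size=n)                # assumed: one U per subject
        C = rng.exponential(scale=U * np.exp(-eta))      # mean U * exp(-x^T beta)
    else:
        C = np.full(n, np.inf)                           # benchmark without censoring
    V = np.minimum(T, C)
    delta = (T <= C).astype(int)
    return V, delta, X

rng = np.random.default_rng(2018)
V, delta, X = simulate_example(400, rng)
print("censoring rate:", 1 - delta.mean())
```

Under this design the conditional censoring probability given U is 1/(1 + U), so the average censoring rate is E{1/(1 + U)} = log(2)/2 ≈ 0.347, in line with the 35% quoted in the text.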

To assess finite sample performance, we report the percentage of models correctly fitted, underfitted, and overfitted with 1, 2, 3, 4, 5 or more parameters by five variable selection procedures, as well as the simulated data fitted with the true model over 1000 simulations. We report the average number of zero coefficients that were correctly (C) and incorrectly (IC) identified in the selected models over 1000 simulations. To compare model fittings, we calculated the model error for the new observation (V, δ, x),

ME (\hat{β}) = E_{x} {μ (x^{T} β) - μ (x^{T} \hat{β})}^{2},

where the expectation is taken with respect to the new observed covariate vector x, and μ(x^T β) = E(T|x, β). We report the median of the relative model error (MRME) over 1000 simulations, where the relative model error is defined as RME = ME/ME_full, and ME_full is the model error calculated by fitting the data with the full model.

In Fan and Li (2002), it was shown that

ME (\hat{β}) = E_{x} {μ (x^{T} β) - μ (x^{T} \hat{β})}^{2} = E_{x} {exp (- x^{T} β) - exp (- x^{T} \hat{β})}^{2} .

By using the moment generating function of the multinormal distribution, we can simplify this to

ME (\hat{β}) = exp (2 {\hat{β}}^{T} Σ \hat{β}) + exp (2 β^{T} Σ β) - 2 exp {\frac{1}{2} {(\hat{β} + β)}^{T} Σ (\hat{β} + β)},

where Σ is the covariance matrix of x. We use this formula to calculate model errors for our simulations.
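The closed form above is simple to code. In the sketch below the function name `model_error` is an assumption; the formula itself relies on x being mean-zero normal with covariance Σ, so that E exp(a^T x) = exp(a^T Σ a / 2).

```python
import numpy as np

def model_error(beta_hat, beta, Sigma):
    """Closed-form ME(beta_hat) for x ~ N(0, Sigma) and mu(u) = exp(-u)."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    beta = np.asarray(beta, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    s = beta_hat + beta
    return (np.exp(2.0 * beta_hat @ Sigma @ beta_hat)    # E exp(-2 x^T beta_hat)
            + np.exp(2.0 * beta @ Sigma @ beta)          # E exp(-2 x^T beta)
            - 2.0 * np.exp(0.5 * s @ Sigma @ s))         # cross term

# One-dimensional check with Sigma = 1, beta = 0.5, beta_hat = 0.
print(model_error([0.0], [0.5], [[1.0]]))
```

The relative model error then follows as the ratio of `model_error` for a selected model to `model_error` for the full-model fit.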

Table 2.2 gives the results for the uncensored case, and shows that the MRME of SCAD-BIC/GCV is smaller than that of SCAD-AIC. As the sample size increases, the MRME of SCAD-BIC/GCV approaches that of the oracle estimator, whereas the MRME of SCAD-AIC remains at the same level. Interestingly, SCAD-BIC and SCAD-AIC have smaller MRME than that of the best subset selection with BIC and AIC, respectively.

Table 2.2.

Simulation results for the Cox model (No Censoring)

n	Method	MRME (%)	Zeros C	Zeros IC	Under (%)	Exact (%)	Over 1 (%)	Over 2 (%)	Over 3 (%)	Over 4 (%)	Over ≥ 5 (%)
100	SCAD-AIC	45.75	7.255	0.001	0.1	37.5	15.3	16.1	13.1	8.5	9.4
	SCAD-BIC	20.90	8.576	0.003	0.3	74.0	15.7	5.2	3.8	0.9	0.1
	SCAD-GCV	17.29	8.940	0.059	5.6	89.2	4.9	0.3	0.0	0.0	0.0
	AIC	52.52	7.349	0.001	0.1	20.1	29.4	26.9	15.3	6.0	2.2
	BIC	25.68	8.666	0.004	0.4	72.5	22.6	3.4	1.1	0.0	0.0
	Oracle	15.73	9.000	0.000	0.0	100.0	0.0	0.0	0.0	0.0	0.0

200	SCAD-AIC	58.53	7.591	0.000	0.0	46.6	15.9	13.8	8.5	7.3	7.9
	SCAD-BIC	36.33	8.867	0.000	0.0	90.1	7.3	1.8	0.8	0.0	0.0
	SCAD-GCV	33.96	8.995	0.003	0.3	99.2	0.5	0.0	0.0	0.0	0.0
	AIC	66.37	7.506	0.000	0.0	23.1	32.2	24.8	13.8	4.5	1.6
	BIC	41.89	8.781	0.000	0.0	81.2	16.1	2.3	0.4	0.0	0.0
	Oracle	33.95	9.000	0.000	0.0	100.0	0.0	0.0	0.0	0.0	0.0

400	SCAD-AIC	68.14	7.553	0.000	0.0	45.1	14.7	15.4	9.3	9.1	6.4
	SCAD-BIC	44.10	8.936	0.000	0.0	94.7	4.4	0.7	0.2	0.0	0.0
	SCAD-GCV	42.33	8.999	0.000	0.0	99.9	0.1	0.0	0.0	0.0	0.0
	AIC	74.71	7.530	0.000	0.0	22.6	33.4	25.7	12.2	5.0	1.1
	BIC	47.38	8.875	0.000	0.0	88.6	10.5	0.7	0.2	0.0	0.0
	Oracle	42.30	9.000	0.000	0.0	100.0	0.0	0.0	0.0	0.0	0.0

Table 2.2 also shows that SCAD-BIC/GCV has a higher probability of correctly estimating the true zero coefficients to zero than does SCAD-AIC. However, SCAD-BIC/GCV was more prone than SCAD-AIC to incorrectly set the three nonzero coefficients to zero when the sample size was small, and SCAD-GCV was more aggressive than SCAD-BIC with larger values in “IC” columns. In addition, SCAD-BIC/GCV had a much higher probability of correctly identifying the true model.

For the censored case, Table 2.3 shows findings similar to those presented in Table 2.2. Accordingly, SCAD-BIC/GCV was superior to SCAD-AIC in both identifying the true model, and in reducing the model error and complexity. When the data was 35% censored, all methods declined slightly in their efficacy, while the relative performance of SCAD-BIC/GCV versus SCAD-AIC remained the same as in the uncensored case. This is consistent with our theoretical analysis in Section 3.

Table 2.3.

Simulation results for the Cox model (35% Censoring)

n	Method	MRME (%)	Zeros C	Zeros IC	Under (%)	Exact (%)	Over 1 (%)	Over 2 (%)	Over 3 (%)	Over 4 (%)	Over ≥ 5 (%)
100	SCAD-AIC	42.43	7.235	0.012	1.2	33.4	18.8	16.6	12.0	8.5	9.5
	SCAD-BIC	21.42	8.491	0.060	5.8	63.4	19.7	7.3	2.4	1.1	0.3
	SCAD-GCV	19.04	8.800	0.153	13.6	71.6	12.3	2.1	0.3	0.1	0.0
	AIC	50.03	7.370	0.016	1.6	20.4	30.0	25.9	13.6	6.5	2.0
	BIC	23.45	8.648	0.036	3.6	68.8	23.7	3.5	0.4	0.0	0.0
	Oracle	14.35	9.000	0.000	0.0	100.0	0.0	0.0	0.0	0.0	0.0

200	SCAD-AIC	59.24	7.535	0.000	0.0	42.3	19.5	13.2	10.2	7.7	7.1
	SCAD-BIC	35.53	8.841	0.000	0.0	87.4	9.8	2.3	0.5	0.0	0.0
	SCAD-GCV	32.48	8.963	0.006	0.6	95.9	3.3	0.2	0.0	0.0	0.0
	AIC	64.64	7.513	0.000	0.0	22.8	35.5	21.5	12.1	6.9	1.2
	BIC	37.90	8.830	0.000	0.0	84.8	13.5	1.6	0.1	0.0	0.0
	Oracle	31.45	9.000	0.000	0.0	100.0	0.0	0.0	0.0	0.0	0.0

400	SCAD-AIC	69.31	7.552	0.000	0.0	41.5	19.5	14.7	10.6	7.4	6.3
	SCAD-BIC	45.07	8.920	0.000	0.0	93.2	5.7	1.0	0.1	0.0	0.0
	SCAD-GCV	42.75	8.993	0.000	0.0	99.4	0.5	0.1	0.0	0.0	0.0
	AIC	73.64	7.547	0.000	0.0	23.8	33.7	24.7	11.6	4.5	1.7
	BIC	48.85	8.856	0.000	0.0	86.8	12.0	1.2	0.0	0.0	0.0
	Oracle	43.47	9.000	0.000	0.0	100.0	0.0	0.0	0.0	0.0	0.0

Example 4.2

(Heart attack data) We applied the proposed regularization parameter selection procedures to the heart attack data set used in Hosmer and Lemeshow (1999). The data were collected in the Worcester Heart Attack Study which describes trends over time in survival rates following hospital admission for acute myocardial infarction. The total length of follow-up on the admission of 481 hospital patients was recorded for years 1975, 1978, 1981, 1984, 1986, and 1988. Among those patients, 249 died and the rest were censored at the rate of 48%.

To model survival time, Hosmer and Lemeshow (1999) suggested fitting the Cox proportional hazards model with five explanatory variables: x₁-age; x₂-cpk (peak cardiac enzymes in international units); x₃-sex (male=0 and female=1); x₄-chf (left heart failure complications, yes=1 and no=0); x₅-miord (MI order, first=0 and recurrent=1). In addition to these variables, we included the six interactions between the two continuous variables (age and cpk) and the three indicator variables (sex, chf, and miord). Thus, there were 11 variables in our full model. We applied the penalized partial likelihood approach. The resulting regularization parameters selected by SCAD-AIC, SCAD-BIC, and SCAD-GCV were 0.0533, 0.0878, and 0.1326, respectively. The corresponding tuning parameter selector curves are depicted in Figure 2.2.

Figure 2.2. The left panel is the GIC score with κ_n = 2 versus λ, the middle panel is the GIC score with κ_n = log(n) versus λ, and the right panel is the GCV score versus λ.

Table 2.4 presents the maximum partial likelihood estimates (MPLE) from the full model as well as the SCAD-AIC/BIC/GCV parameter estimates, together with their standard errors. The full model contained six insignificant variables (x₂, x₃, and x₇ to x₁₀) at level 0.05, and SCAD-AIC included two insignificant variables (x₆ and x₁₀) at level 0.05. In contrast, the four variables x₁, x₄, x₅, and x₁₁ selected by SCAD-BIC were significant at level 0.05. For this data set, SCAD-GCV appears to be overly aggressive in that it excludes x₅ and x₁₁.

Table 2.4.

Estimates and Standard Errors for Heart Attack Data

	MPLE	SCAD-AIC	SCAD-BIC	SCAD-GCV
age (x₁)	0.60(0.13)	0.56(0.09)	0.43(0.07)	0.41(0.05)
cpk (x₂)	0.03(0.14)	0(-)	0(-)	0(-)
sex (x₃)	0.17(0.14)	0(-)	0(-)	0(-)
chf (x₄)	0.80(0.14)	0.80(0.13)	0.80(0.14)	0.82(0.13)
miord (x₅)	0.42(0.14)	0.43(0.13)	0.41(0.13)	0(-)
age*sex (x₆)	−0.29(0.14)	−0.22(0.13)	0(-)	0(-)
age*chf (x₇)	−0.07(0.15)	0(-)	0(-)	0(-)
age*miord (x₈)	0.03(0.15)	0(-)	0(-)	0(-)
cpk*sex (x₉)	−0.16(0.16)	0(-)	0(-)	0(-)
cpk*chf (x₁₀)	0.19(0.15)	0.19(0.09)	0(-)	0(-)
cpk*miord (x₁₁)	0.29(0.15)	0.25(0.12)	0.21(0.05)	0(-)

Based on Table 2.4, the p-values of the partial likelihood ratio test for examining the SCAD-AIC, SCAD-BIC, and SCAD-GCV models versus the full model are 0.6752, 0.1749, and 0.0034, respectively. Consequently, there is no evidence of lack of fit in the SCAD-BIC model. The SCAD-GCV model may be too aggressive, consistent with our simulation results that GCV tends to underfit when the sample size is not large enough.

5. A tribute to Peter Hall

Professor Peter Hall made wide ranging and ground-breaking contributions to many statistical fields and played major leadership roles throughout the statistical profession. He was a true scholar, and a mentor and friend of many of us. We grieve his loss.

Runze Li (RL) had the great fortune to learn from Peter and interact with him directly when they jointly served as Editors of the Annals of Statistics from 2013 to 2015. As an eminent scientist, Peter was an extremely kind, modest and optimistic person. Peter was always super fast, and handled whatever came to him promptly. His speed was unbeatable. Once, RL was asked to review a grant proposal by an international grant agency within a tight deadline. When RL sent back his report the next day, he was told that Peter’s report had already been received.

Professor Peter Hall had a huge influence on RL’s research on variable selection and feature screening, although he never collaborated with Peter on a paper. Many of RL’s works were inspired by Peter’s ideas. For example, Hall and Miller (2009) proposed using generalized correlation to conduct feature screening and the use of the bootstrap to quantify the uncertainty of feature ranking. Motivated by this work, Li, Zhong, and Zhu (2012) proposed using distance correlation for feature screening.

Professor Peter Hall will be remembered forever as a legendary statistician, a great scholar, beloved colleague, mentor and friend, and his work will continue to have a far-reaching impact on statistical methodology and theory.

Acknowledgments

The authors would like to thank the editors Professors Raymond J. Carroll and Qiwei Yao for organizing this special issue. Runze Li is grateful to the editors for their invitation and their constructive comments on an earlier version of this paper. Li’s research is supported by National Institute on Drug Abuse grants P50 DA039838, P50 DA036107, and R01 DA039854, National Science Foundation (NSF) grant DMS 1512422, and National Library of Medicine grant T32 LM012415. His research was also partially supported by National Natural Science Foundation of China grants 11690014 and 11690015. Ren’s research is supported by NSF grants DMS 0905772, DMS 1232424, and DMS 1407461. Yang’s research was supported by the NNSFC grant 11471086, the National Social Science Foundation of China grant 16BTJ032, the Fundamental Research Funds for the Central Universities 15JNQM019 and 21615452, the National Statistical Scientific Research Center Projects 2015LD02, the China Scholarship Council 201506785010 and Science and Technology Program of Guangzhou 2016201604030074. All authors equally contributed to this paper and are listed in alphabetical order. Guangren Yang is the corresponding author. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NSF, the NIDA, or the NIH.

Appendices

Appendix A: Proof of Theorem 1

Without loss of generality, assume that there are no ties among V_i’s in the observed data, and that

V_{1} < V_{2} < \dots < V_{n} .

(A.1)

This simplifies n⁻¹ℓ_c(β) to

n^{- 1} ℓ_{c} (β) = n^{- 1} β^{T} \sum_{i = 1}^{n} δ_{i} x_{i} - n^{- 1} \sum_{i = 1}^{n} δ_{i} log (exp (x_{i}^{T} β) + \dots + exp (x_{n}^{T} β)) .

(A.2)
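As a minimal computational sketch of (A.2), and not the authors' code, the average log partial likelihood can be evaluated in one backward pass that accumulates the risk-set sums exp(x_i^T β) + ... + exp(x_n^T β). The function name `avg_log_partial_likelihood` and its argument layout are hypothetical choices made for illustration, and the sketch assumes no ties among the V_i's, as in (A.1).

```python
import math

def avg_log_partial_likelihood(times, events, covariates, beta):
    """Return n^{-1} * l_c(beta) as in (A.2), assuming no ties in `times`."""
    n = len(times)
    # Sort subjects so that V_1 < V_2 < ... < V_n, as in (A.1).
    order = sorted(range(n), key=lambda i: times[i])
    # Linear predictors x_i^T beta, listed in increasing order of V_i.
    eta = [sum(b * x for b, x in zip(beta, covariates[i])) for i in order]
    delta = [events[i] for i in order]

    total = 0.0
    risk_sum = 0.0  # exp(eta_i) + ... + exp(eta_n), built from the last subject backwards
    for i in range(n - 1, -1, -1):
        risk_sum += math.exp(eta[i])
        if delta[i]:
            total += eta[i] - math.log(risk_sum)
    return total / n

# Toy data (made up for illustration): three subjects, one covariate, beta = 0.
example = avg_log_partial_likelihood([2.0, 1.0, 3.0], [1, 1, 0], [[0.0], [0.0], [0.0]], [0.0])
```

For large linear predictors a log-sum-exp accumulation would be numerically safer; the plain running sum is kept here so that the code mirrors (A.2) term by term.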

It follows by the Weak Law of Large Numbers (WLLN) that ${(n^{- 1} \sum_{i = 1}^{n} δ_{i} x_{i})}^{T} β = μ_{0}^{T} β + o_{P} (1)$ . Let

R_{n} = n^{- 1} \sum_{i = 1}^{n} δ_{i} log (exp (x_{i}^{T} β) + \dots + exp (x_{n}^{T} β)) .

(A.3)

Thus,

- n^{- 1} ℓ_{c} (β) = - μ_{0}^{T} β + R_{n} + o_{P} (1) .

(A.4)

From (A.1), we have

\begin{array}{l} exp (x_{i}^{T} β) + \dots + exp (x_{n}^{T} β) = \sum_{j = 1}^{n} I \{V_{j} \geq V_{i}\} exp (x_{j}^{T} β) \\ = \int \int I \{v \geq V_{i}\} exp (x^{T} β) d \{\sum_{j = 1}^{n} I \{V_{j} \leq v, x_{j} \leq x\}\} \\ = \int \int_{v \geq V_{i}} exp (x^{T} β) d \{\sum_{j = 1}^{n} I \{V_{j} \leq v, x_{j} \leq x\}\} = n W_{n} (V_{i}), \end{array}

(A.5)

where $W_{n} (t) = \int \int_{v \geq t} exp (x^{T} β) d G_{n} (v, x)$ with G_n(v, x) given in (2.4). Since δ_i is a binary random variable, the central limit theorem gives $n^{- 1} \sum_{i = 1}^{n} δ_{i} = ρ_{1} + O_{P} (1 / \sqrt{n})$. With $A_{n} = \int_{0}^{\infty} log [W_{n} (t)] d H_{n} (t)$, it follows that

\begin{array}{l} R_{n} = n^{- 1} \sum_{i = 1}^{n} δ_{i} log (n W_{n} (V_{i})) = n^{- 1} \int_{0}^{\infty} log (n W_{n} (t)) d \{\sum_{i = 1}^{n} δ_{i} I \{V_{i} \leq t\}\} \\ = \int_{0}^{\infty} \{log n + log (W_{n} (t))\} d H_{n} (t) = (log n) (H_{n} (\infty) - H_{n} (0)) + A_{n} \\ = (log n) (n^{- 1} \sum_{i = 1}^{n} δ_{i}) + A_{n} = ρ_{1} log n + (log n) (n^{- 1} \sum_{i = 1}^{n} δ_{i} - ρ_{1}) + A_{n} \\ = ρ_{1} log n + O_{P} (n^{- 1 / 2} log n) + A_{n} = ρ_{1} log n + A_{n} + o_{P} (1) . \end{array}

(A.6)
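The first three equalities in (A.6) are purely algebraic: R_n equals (log n) times the observed event fraction plus A_n, with A_n = n^{-1} Σ δ_i log W_n(V_i). The sketch below checks this identity numerically on simulated inputs. The helper name `r_n_and_a_n`, the uniform linear predictors, and the 70% event probability are illustrative assumptions and are not taken from the paper; the inputs are treated as already ordered by increasing V_i.

```python
import math
import random

def r_n_and_a_n(eta, delta):
    """Return (R_n, A_n) for linear predictors `eta` and event indicators
    `delta`, both listed in increasing order of V_i (no ties)."""
    n = len(eta)
    r_n = 0.0
    a_n = 0.0
    risk_sum = 0.0  # equals n * W_n(V_i) at step i
    for i in range(n - 1, -1, -1):
        risk_sum += math.exp(eta[i])
        if delta[i]:
            r_n += math.log(risk_sum)      # log(n * W_n(V_i))
            a_n += math.log(risk_sum / n)  # log(W_n(V_i))
    return r_n / n, a_n / n

random.seed(0)
n = 2000
eta = [random.uniform(-1.0, 1.0) for _ in range(n)]          # bounded x_i^T beta (assumed)
delta = [1 if random.random() < 0.7 else 0 for _ in range(n)]  # assumed event probability
r_n, a_n = r_n_and_a_n(eta, delta)
event_fraction = sum(delta) / n
# Difference between the two sides of the identity; only rounding error should remain.
gap = r_n - (event_fraction * math.log(n) + a_n)
```

Because the identity is exact, `gap` should vanish up to floating-point rounding for any input; the stochastic part of (A.6) is only the replacement of the event fraction by ρ₁.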

To prove Part (a), we next deal with A_n. Since

(n - i + 1) min_{i \leq j \leq n} exp (x_{j}^{T} β) \leq \sum_{j = i}^{n} exp (x_{j}^{T} β) \leq (n - i + 1) max_{i \leq j \leq n} exp (x_{j}^{T} β)

and X has bounded support, it follows that

\begin{array}{l} A_{n} = n^{- 1} \sum_{i = 1}^{n} δ_{i} log (W_{n} (V_{i})) = n^{- 1} \sum_{i = 1}^{n} δ_{i} log (\frac{exp (x_{i}^{T} β) + \dots + exp (x_{n}^{T} β)}{n}) \\ = n^{- 1} O_{P} (\sum_{i = 1}^{n} log (\frac{n - i + 1}{n})) = n^{- 1} O_{P} (log (n! / n^{n})) = O_{P} (1) . \end{array}

(A.7)

The last equality is due to Stirling’s formula, which gives log(n!/nⁿ) = −n + O(log n), and this completes the proof of (a).
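The Stirling step in (A.7) can be illustrated numerically: n^{-1} log(n!/nⁿ) stays bounded and approaches −1 as n grows. The sketch below uses `math.lgamma` to evaluate log n! without overflow; the function name `scaled_log_factorial_ratio` and the chosen sample sizes are illustrative assumptions.

```python
import math

def scaled_log_factorial_ratio(n):
    """Return n^{-1} * log(n! / n^n), using lgamma(n + 1) = log(n!)."""
    return (math.lgamma(n + 1) - n * math.log(n)) / n

# Evaluate at a few sample sizes; the values lie in (-1, 0) and decrease toward -1.
values = {n: scaled_log_factorial_ratio(n) for n in (10, 1000, 100000)}
```

Since the ratio is bounded, multiplying the O_P bound in (A.7) by n^{-1} leaves an O_P(1) term, which is all Part (a) requires.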

For Part (b), it suffices to show that

A_{n} \overset{P}{\to} μ_{1}, as n \to \infty .

(A.8)

From (2.4), we know that H_n(v) is the empirical process of a random sample of V_i’s with δ_i = 1. Thus, $\| H_{n} - H \| = sup_{v} | H_{n} (v) - H (v) | = O_{p} (n^{- 1 / 2})$ by the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality (van der Vaart (1998)), since EI{δ_i = 1} = ρ₁ > 0. Hence, from (2.4), (A.5), and integration by parts, we have

\begin{array}{l} A_{n} = \int_{0}^{V_{n}} log [W_{n} (t)] d H_{n} (t) = B_{n} + \int_{0}^{V_{n}} log [W_{n} (t)] d [H_{n} (t) - H (t)] \\ = B_{n} + [H_{n} (t) - H (t)] log [W_{n} (t)] \big|_{0}^{V_{n}} - \int_{0}^{V_{n}} [H_{n} (t) - H (t)] d \{log [W_{n} (t)]\} \\ = B_{n} + [H_{n} (V_{n}) - H (V_{n})] log [W_{n} (V_{n})] - \int_{0}^{V_{n}} [H_{n} (t) - H (t)] d \{log [W_{n} (t)]\} \\ = B_{n} + [H_{n} (V_{n}) - H (V_{n})] log \{exp (x_{n}^{T} β) / n\} - \int_{0}^{V_{n}} [H_{n} (t) - H (t)] d \{log [W_{n} (t)]\} \\ = B_{n} + O_{p} (\frac{log n}{\sqrt{n}}) - \int_{0}^{V_{n}} [H_{n} (t) - H (t)] d \{log [W_{n} (t)]\}, \end{array}

(A.9)

where $B_{n} = \int_{0}^{V_{n}} log [W_{n} (t)] d H (t)$, and the $O_{p} (log n / \sqrt{n})$ term uses the fact that $x_{n}^{T} β = O_{P} (1)$, which holds since E(|X_j|) < ∞ for all j = 1, ···, p by assumption. From (2.4) and (A.5), we have

\begin{array}{l} | \int_{0}^{V_{n}} [H_{n} (t) - H (t)] d \{log [W_{n} (t)]\} | \leq \| H_{n} - H \| \times | log W_{n} (V_{n}) - log W_{n} (0) | \\ = O_{p} (n^{- 1 / 2}) log (\frac{exp (x_{1}^{T} β) + \dots + exp (x_{n}^{T} β)}{exp (x_{n}^{T} β)}) . \end{array}

By the assumption in Part (b) and the WLLN, $\frac{1}{n} \sum_{i = 1}^{n} exp (x_{i}^{T} β) \overset{P}{\to} E exp (X^{T} β)$. This implies that $log \{\sum_{i = 1}^{n} exp (x_{i}^{T} β)\} - log n = O_{P} (1)$. Furthermore, $log \{exp (x_{n}^{T} β)\} = x_{n}^{T} β = O_{P} (1)$. Thus,

log (\frac{exp (x_{1}^{T} β) + \dots + exp (x_{n}^{T} β)}{exp (x_{n}^{T} β)}) = O_{P} (log n) .

It then follows that

| \int_{0}^{V_{n}} [H_{n} (t) - H (t)] d \{log [W_{n} (t)]\} | = O_{p} (\frac{log n}{\sqrt{n}}) .

(A.10)
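The bound (A.10) rests on the n^{-1/2} rate of sup_v |H_n(v) − H(v)|, where H_n(t) = n^{-1} Σ δ_i I{V_i ≤ t}. A minimal simulation sketch of that rate follows. It assumes, purely for illustration, independent Exp(1) failure and censoring times, for which H(v) = (1 − e^{−2v})/2; this design, the seed, and the function name `sup_distance` are not from the paper.

```python
import math
import random

def true_h(v):
    """H(v) = P(V <= v, delta = 1) under the assumed Exp(1)/Exp(1) design."""
    return 0.5 * (1.0 - math.exp(-2.0 * v))

def sup_distance(n, rng):
    """Return sup_v |H_n(v) - H(v)| for one simulated sample of size n."""
    event_times = []
    for _ in range(n):
        t = rng.expovariate(1.0)  # latent failure time (illustrative choice)
        c = rng.expovariate(1.0)  # latent censoring time (illustrative choice)
        if t <= c:                # delta_i = 1, so V_i = t adds a jump of 1/n to H_n
            event_times.append(t)
    event_times.sort()
    m = len(event_times)
    sup = abs(m / n - 0.5)        # gap in the limit v -> infinity
    for k, v in enumerate(event_times, start=1):
        # H_n jumps from (k - 1)/n to k/n at v; check both sides of the jump.
        sup = max(sup, abs(k / n - true_h(v)), abs((k - 1) / n - true_h(v)))
    return sup

rng = random.Random(2018)
# sqrt(n) * sup distance should stay of order one as n grows.
scaled = {n: math.sqrt(n) * sup_distance(n, rng) for n in (500, 5000)}
```

If the scaled distances stay of order one across sample sizes, multiplying by the O_P(log n) factor derived above gives the (log n)/√n rate in (A.10).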

By (A.9)–(A.10), $A_{n} = B_{n} + o_{P} (1)$. Therefore, (A.8) follows from the assumption about μ₁ and the Dominated Convergence Theorem, which give

B_{n} \overset{P}{\to} μ_{1}, as n \to \infty .

(A.11)

Footnotes

Supplemental Materials

The proof of Theorem 2 is in the supplemental materials of this paper.

References

Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974;19:716–723.
Andersen PK, Gill RD. Cox’s regression model for counting processes: a large sample study. The Annals of Statistics. 1982;10:1100–1120.
Bradic J, Fan J, Jiang J. Regularization for Cox’s proportional hazards model with NP-dimensionality. The Annals of Statistics. 2011;39:3092–3120. doi: 10.1214/11-AOS911.
Cox DR. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
Cox DR. Partial likelihood. Biometrika. 1975;62:269–276.
Craven P, Wahba G. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik. 1979;31:377–403.
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
Fan J, Li R. Variable selection for Cox’s proportional hazards model and frailty model. The Annals of Statistics. 2002;30:74–99.
Hall P, Miller H. Using generalized correlation to effect variable selection in very high dimensional problems. Journal of Computational and Graphical Statistics. 2009;18:533–550.
Hosmer DW, Lemeshow S. Applied Survival Analysis: Regression Modeling of Time to Event Data. John Wiley & Sons Inc; New York, NY: 1999.
Li R, Ren J-J, Yang G, Yu Y. Supplement to “Asymptotic behavior of Cox’s partial likelihood and its application to variable selection”. 2016. doi: 10.5705/ss.202016.0401.
Li R, Zhong W, Zhu L. Feature screening via distance correlation learning. Journal of the American Statistical Association. 2012;107:1129–1139. doi: 10.1080/01621459.2012.695654.
Murphy SA, van der Vaart AW. On profile likelihood. Journal of the American Statistical Association. 2000;95:449–465.
Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978;6:461–464.
Takemi Y, Toshinari K. Maximum full and partial likelihood estimators in the proportional hazard model. Annals of the Institute of Statistical Mathematics. 1984;36:363–373.
Tibshirani R. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
Tibshirani R. The lasso method for variable selection in the Cox model. Statistics in Medicine. 1997;16:385–395. doi: 10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3.
Tsiatis AA. A large sample study of Cox’s regression model. The Annals of Statistics. 1981;9:93–108.
van der Vaart AW. Asymptotic Statistics. Cambridge University Press; 1998.
Wang H, Li R, Tsai CL. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568. doi: 10.1093/biomet/asm053.
Zhang Y, Li R, Tsai CL. Regularization parameter selections via generalized information criterion. Journal of the American Statistical Association. 2010;105:312–323. doi: 10.1198/jasa.2009.tm08013.
Zhang H, Lu W. Adaptive LASSO for Cox’s proportional hazards model. Biometrika. 2007;94:691–703.
Zou H. A note on path-based variable selection in the penalized proportional hazards model. Biometrika. 2008;95:241–247.
Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models (with discussion). The Annals of Statistics. 2008;36:1509–1566. doi: 10.1214/009053607000000802.

