Abstract
To study the theoretical properties of variable selection procedures for the Cox model, we examine the asymptotic behavior of its partial likelihood. We find that the partial likelihood does not behave like an ordinary likelihood, whose sample average typically tends to its expected value, a finite number, in probability. Under some mild conditions, we prove that the sample average of partial likelihood tends to infinity at the rate of the logarithm of the sample size, in probability. We apply the asymptotic results on the partial likelihood to study tuning parameter selection for penalized partial likelihood. We find that the penalized partial likelihood with the generalized cross-validation (GCV) tuning parameter proposed in Fan and Li (2002) enjoys the model selection consistency property, despite the fact that GCV, AIC and Cp, equivalent in the context of linear regression models, are not model selection consistent. Our empirical studies via Monte Carlo simulation and a data example confirm our theoretical findings.
Key words and phrases: Akaike information criterion, Bayesian information criterion, LASSO penalized partial likelihood, SCAD, variable selection
1. Introduction
The Cox model (Cox (1972)) has been the most popular model in survival data analysis during the past decades, and the partial likelihood (Cox (1975)) is perhaps the most commonly used technique for the analysis of right censored data. In practice, many risk factors and covariates are available for the initial analysis, so an important task is to identify the significant risk factors and covariates. Variable selection is a useful technique in the analysis of survival data in the presence of many covariates. Classical variable selection criteria for linear regression models, such as AIC (Akaike (1974)) and BIC (Schwarz (1978)), can be extended to the Cox model by replacing the log-likelihood with the log-partial likelihood. The LASSO (Tibshirani (1996)) variable selection technique has been extended to the Cox model (Tibshirani (1997); Zhang and Lu (2007); Zou (2008)). Nonconcave penalized partial likelihood variable selection procedures have been developed for the Cox model (Fan and Li (2002); Bradic, Fan, and Jiang (2011)). To investigate the theoretical properties of these procedures, we have to study the asymptotic behavior of the partial likelihood.
There has been little work on the asymptotic behavior of the partial likelihood, though the asymptotic properties of the partial likelihood estimator have been extensively studied (Tsiatis (1981); Andersen and Gill (1982); Takemi and Toshinari (1984)). Under mild regularity conditions, the maximum partial likelihood estimator behaves the same as the ordinary maximum likelihood estimator of i.i.d. random samples in terms of asymptotic consistency, asymptotic normality and asymptotic efficiency. See, for example, Murphy and van der Vaart (2000). In this paper, we first study the asymptotic behavior of the partial likelihood, and prove that the ‘sample average’ of the partial likelihood diverges to infinity at the rate of the logarithm of the sample size. This clearly indicates that the Cox partial likelihood does not behave like an ordinary likelihood: under mild regularity conditions, the sample average of the ordinary log-likelihood function converges to its expectation (a finite value) in probability as the sample size tends to infinity.
With the aid of the asymptotic property of partial likelihood, we study the selection of regularization parameter in penalized partial likelihood for variable selection. Tibshirani (1997) proposed penalized partial likelihood with LASSO penalty for the Cox model. Fan and Li (2002) proposed the partial likelihood with the SCAD penalty for the Cox models, and showed that under certain regularity conditions, the resulting estimate enjoys the oracle property. Zhang and Lu (2007) and Zou (2008) further proposed adaptive LASSO for the Cox model to improve the SCAD procedure in terms of computational efficiency, while retaining the oracle property. The oracle property depends on the choice of the regularization parameter in penalized partial likelihood. It is well known that the regularization parameter controls the model complexity of the selected models, and plays a crucial role in these variable selection procedures. The issue of regularization parameter selection for penalized partial likelihood has not been systematically studied, in part because the asymptotic behavior of partial likelihood was not well understood. Wang, Li, and Tsai (2007) studied the selection of regularization parameter in the SCAD penalized least squares for linear regression models. They showed that with a positive probability, the generalized cross-validation (GCV, Craven and Wahba (1979)) selector yields an over-fitted model, and therefore this procedure does not enjoy the oracle property.
In this paper, we prove that the GCV selector for the SCAD method for the Cox model enjoys model selection consistency, in contrast to its model selection inconsistency in the least squares setting as demonstrated in Wang, Li, and Tsai (2007). Although GCV is equivalent to AIC and Cp in the context of linear regression models, AIC and Cp yield an overfitted model with a positive probability, and thus are not model selection consistent.
The rest of this paper is organized as follows. Section 2 studies the asymptotic behavior of the partial likelihood of the Cox model. We study the regularization parameter selection for the penalized partial likelihood in Section 3. Simulation study and a data example are presented in Section 4. Proofs are given in the Appendix.
2. Asymptotic Behavior of Cox’s Partial Likelihood
Let T and X = (X1, ···, Xd)^T be the survival time and the associated d-dimensional vector of covariates, respectively. Consider the Cox proportional hazards regression model:
$$h(t \mid x) = h_0(t)\exp(x^T\beta), \qquad (2.1)$$
where β is the regression coefficient vector, and h(t | x) is the conditional hazard function of T given X = x with h0(t) as an arbitrary baseline hazard function. Suppose that (T1, x1), …, (Tn, xn) is a random sample of (T, X), and the observed right censored survival data are (V1, δ1, x1), …, (Vn, δn, xn), where Vi = min{Ti, Ci}, δi = I{Ti ≤ Ci}, and Ci is the right censoring variable, independent of Ti given X = xi. Without loss of generality, assume that there are no ties among the observed continuous random variables Vi. The log-partial likelihood function of the observed data is
$$\ell_c(\beta) = \sum_{i=1}^{n} \delta_i \Big[ x_i^T\beta - \log\Big\{ \sum_{j:\, V_j \ge V_i} \exp(x_j^T\beta) \Big\} \Big] \qquad (2.2)$$
(Cox (1975)). The goal is to study the asymptotic behavior of ℓc(β). We first illustrate the different behaviors of the log-partial likelihood and the likelihood of an i.i.d. sample by an example.
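As a concrete reading of (2.2), the following minimal Python sketch evaluates the log-partial likelihood in the standard Cox (1975) form with risk sets {j : Vj ≥ Vi}, assuming no tied observed times. The function name and argument layout are illustrative choices, not part of the paper.

```python
import math

def log_partial_likelihood(beta, times, events, covariates):
    """Cox log-partial likelihood, assuming no ties among the observed times.

    beta: list of d coefficients; times: observed V_i; events: delta_i (0/1);
    covariates: list of n covariate vectors x_i, each of length d.
    """
    n = len(times)
    # Linear predictors x_i^T beta.
    eta = [sum(b * x for b, x in zip(beta, xi)) for xi in covariates]
    # Visit subjects from the largest observed time to the smallest, so the
    # running sum of exp(eta_j) always covers the risk set {j : V_j >= V_i}.
    order = sorted(range(n), key=lambda i: times[i], reverse=True)
    risk_sum = 0.0
    loglik = 0.0
    for i in order:
        risk_sum += math.exp(eta[i])
        if events[i]:
            loglik += eta[i] - math.log(risk_sum)
    return loglik
```

With β = 0 and no censoring, the value reduces to −log(n!), so −n^{−1}ℓc is roughly log(n) − 1 by Stirling's formula; this already hints at the logarithmic growth studied below.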
Example 1
Suppose that we have an i.i.d. random sample {Y1, …, Yn} from a population with probability density/mass function f(y; θ), so that ℓ(θ) = Σ_{i=1}^{n} log f(Yi; θ) is the log-likelihood function. By the weak law of large numbers, n^{−1}ℓ(θ) → E log{f(Y; θ)} in probability under mild regularity conditions. Furthermore, under mild regularity conditions, the maximum partial likelihood estimator, the maximizer of ℓc(β), behaves the same as the ordinary maximum likelihood estimator, the maximizer of ℓ(θ), in terms of asymptotic consistency, asymptotic normality and asymptotic efficiency. See, for example, Murphy and van der Vaart (2000). Here, we numerically illustrate that
| (2.3) |
We generated a random sample of size n from the proportional hazards model
$$h(t \mid x) = h_0(t)\exp(x\beta),$$
where h0(t) ≡ 1, β = 1, and X ~ N(0, 1). The censoring variable C was generated from an exponential distribution with mean U, so the average censoring rate varies with U. We list several values of U in Table 2.1 together with their corresponding average censoring rates, 1 − E I(T ≤ C) ≙ 1 − ρ1, and take 10 different values of n ranging from 4 (= 2^2) to 1024 (= 2^10). Figure 2.1 depicts the scatter plot of log(n) versus −n^{−1}ℓc for a set of typical samples generated with the values of U listed in Table 2.1. Figure 2.1 clearly suggests that −n^{−1}ℓc increases at the rate log(n).
Table 2.1.
Values of U and the corresponding average censoring rates (1 − ρ1), together with μ0 ≙ E{I(T ≤ C) X}.
| U | 10.00 | 5.00 | 2.75 | 1.80 | 1.20 | 0.80 | 0.50 | 0.30 |
|---|---|---|---|---|---|---|---|---|
| (1 − ρ1) | 0.1222 | 0.2055 | 0.2991 | 0.3797 | 0.4613 | 0.5485 | 0.6393 | 0.7311 |
| μ0 | 0.0968 | 0.1429 | 0.1775 | 0.1971 | 0.2007 | 0.2028 | 0.1921 | 0.1652 |
Figure 2.1.
Plot of log(n) versus −n−1ℓc. ‘○’ is the scatter plot of log(n) versus −n−1ℓc based on a typical simulated data set. The solid line in each plot is log(n)ρ̂1 − βT μ̂0 with β = 1, where ρ̂1 is an estimate of EI {T ≤ C} and μ̂0 is an estimate of E{I{T ≤ C}X}.
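The experiment in Example 1 can be imitated with a short simulation. The sketch below is not the authors' code: it uses only Python's standard library, draws one sample per n, and the default censoring mean U = 2.75 and the seeds are arbitrary illustrative choices.

```python
import math
import random

def neg_avg_log_pl(n, beta=1.0, censor_mean=2.75, seed=0):
    """Simulate one sample as in Example 1; return (-l_c(beta)/n, event rate)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        t = rng.expovariate(math.exp(beta * x))  # hazard h0(t) exp(beta x), h0 = 1
        c = rng.expovariate(1.0 / censor_mean)   # exponential censoring, mean U
        rows.append((min(t, c), t <= c, x))
    # Largest observed time first, so the running sum covers the risk set.
    rows.sort(key=lambda r: r[0], reverse=True)
    risk_sum, loglik, n_events = 0.0, 0.0, 0
    for _, delta, x in rows:
        risk_sum += math.exp(beta * x)
        if delta:
            loglik += beta * x - math.log(risk_sum)
            n_events += 1
    return -loglik / n, n_events / n

if __name__ == "__main__":
    for k in range(2, 11):
        n = 2 ** k
        value, rho1_hat = neg_avg_log_pl(n, seed=k)
        print(f"n={n:5d}  log(n)={math.log(n):6.3f}  -l_c/n={value:7.3f}  "
              f"rho1_hat={rho1_hat:5.3f}")
```

Plotting the printed −ℓc/n values against log(n) should give a roughly linear trend with slope close to the observed event rate, in line with the pattern described for Figure 2.1.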
We next show that −n^{−1}ℓc(β) tends to infinity at the rate of log(n), using techniques related to empirical processes. Let
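$$G_n(v, x) = \frac{1}{n}\sum_{i=1}^{n} I\{V_i \le v,\ x_i \le x\}, \qquad H_n(v) = \frac{1}{n}\sum_{i=1}^{n} \delta_i I\{V_i \le v\}, \qquad (2.4)$$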
with G(v, x) and H(v) as the limits of the empirical distribution functions Gn(v, x) and Hn(v), respectively. Take μ0 = E{I{T ≤ C}X}, W(t) = ∫ ∫v≥t exp (xT β) dG(v, x), and ρ1 = EI{T ≤ C}. The proof of the following theorem is given in Appendix A.
Theorem 1
If (Vi, δi, xi), i = 1, ···, n, is a random sample from the Cox model (2.1) and the censoring time Ci is independent of Ti given xi, then the following statements hold.
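- (a) If X has a finite bounded support, then
$$-n^{-1}\ell_c(\beta) = \rho_1 \log(n) + O_p(1). \qquad (2.5)$$
- (b) If $\mu_1 = \int_0^\infty \log\{W(t)\}\,dH(t)$ is well-defined, E|Xj| < ∞ for all j = 1, ···, d, and 0 < E exp(X^T β) < ∞, then
$$-n^{-1}\ell_c(\beta) = \rho_1 \log(n) - \beta^T\mu_0 + \mu_1 + o_p(1). \qquad (2.6)$$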
When there is no censoring, it can be shown that W(t) = fT(t)/h0(t) and μ1 = ∫ log{fT(t)/h0(t)} dFT(t), where fT(t) and FT(t) are the probability density and cumulative distribution function of T in (2.1), respectively. Thus, the assumption about μ1 holds for many distributions, such as the exponential distribution.
Remark
From the proof of this theorem, the leading term ρ1 log(n) comes from a term that does not depend on the regression coefficient β and therefore does not affect the first and second order derivatives of the partial likelihood function. As the asymptotic normality of the maximum partial likelihood estimator relies on the first and second order derivatives, the divergent behavior of the partial likelihood function does not impact the asymptotic normality of the partial likelihood estimator. On the other hand, the tuning parameter selector for penalized partial likelihood, studied in the next section, depends on the partial likelihood function itself. As a result, the asymptotic behavior of the partial likelihood function directly affects the properties of the tuning parameter selector.
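The origin of the log(n) term can be seen from a short decomposition. The display below is a sketch under the standard form of the log-partial likelihood with risk sets {j : Vj ≥ Vi}; the hatted quantities and Wn are introduced here for illustration, with Wn matching the definition used in Appendix A when Gn places mass 1/n on each observation.

```latex
\begin{align*}
-n^{-1}\ell_c(\beta)
  &= -\frac{1}{n}\sum_{i=1}^{n}\delta_i x_i^T\beta
     + \frac{1}{n}\sum_{i=1}^{n}\delta_i
       \log\Big\{\sum_{j:\,V_j\ge V_i}\exp(x_j^T\beta)\Big\} \\
  &= -\beta^T\hat\mu_0 + \hat\rho_1\log(n)
     + \frac{1}{n}\sum_{i=1}^{n}\delta_i\log W_n(V_i),
\end{align*}
where
\[
\hat\mu_0 = \frac{1}{n}\sum_{i=1}^{n}\delta_i x_i,\qquad
\hat\rho_1 = \frac{1}{n}\sum_{i=1}^{n}\delta_i,\qquad
W_n(t) = \frac{1}{n}\sum_{j=1}^{n} I\{V_j\ge t\}\exp(x_j^T\beta).
\]
```

The second line uses the identity that the risk-set sum equals n Wn(Vi). Only the middle term grows with n, and it is free of β, which is why the score and Hessian of ℓc(β) are unaffected by the divergence.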
3. Tuning Parameter Selector in Penalized Partial Likelihood
Take the penalized partial likelihood to be
$$\ell_c(\beta) - n\sum_{j=1}^{d} p_\lambda(|\beta_j|), \qquad (2.7)$$
where d is the dimension of β, and pλ(·) is a penalty function with a tuning parameter λ (or more generally, λj's). The penalized partial likelihood estimate of β maximizes (2.7) with respect to β. Denote by β0 the true value of β, and let β0 = (β10^T, β20^T)^T. Without loss of generality, we take β20 = 0 with all components of β10 nonzero. Under some regularity conditions, Fan and Li (2002) showed that the nonconcave penalized likelihood estimator possesses the oracle property: with probability tending to 1, for a certain choice of pλn(·), we have β̂2 = 0 and
$$\sqrt{n}\,(\hat\beta_1 - \beta_{10}) \to N\{0,\ I_1^{-1}(\beta_{10}, 0)\} \quad \text{in distribution},$$
where I1(β10, 0) is the Fisher information matrix for β1 knowing β2 = 0.
The oracle property depends on the choice of the tuning parameter. Thus, the selection of tuning parameter is fundamental in the penalized likelihood procedure. Wang, Li, and Tsai (2007) studied the selection of the tuning parameter for penalized least squares for linear regression models. They showed that the GCV tuning parameter of Fan and Li (2001) cannot yield an oracle estimator. The issue of tuning parameter selection for the penalized partial likelihood has not been studied. Based on the asymptotic results about the partial likelihood, we show that the GCV tuning parameter selector for (2.7) possesses model selection consistency, in contrast to the model selection inconsistency of the GCV tuning parameter selector in the penalized least squares setting.
Let β̂λ be the penalized partial likelihood estimator with tuning parameter λ. Define the GCV statistic to be
$$\mathrm{GCV}(\lambda) = \frac{-\ell_c(\hat\beta_\lambda)}{n\,(1 - df_\lambda/n)^2}. \qquad (2.8)$$
The tuning parameter λ̂GCV = argminλ{GCV(λ)} is then selected, where the degrees of freedom dfλ is set to be the number of nonzero penalized partial likelihood estimates corresponding to the tuning parameter λ. It can be shown that, with probability tending to one, the effective number of parameters in Fan and Li (2002) equals dfλ, by using related techniques in Zhang, Li, and Tsai (2010).
We define the corresponding AIC and BIC statistics as
| (2.9) |
| (2.10) |
with the AIC and BIC tuning parameter selectors
When t lies in a neighborhood of 0, (1 − t)^{−2} ≈ 1 + 2t, so, when n is large enough,
If −ℓc(β̂λ)/(n log(n)) → EI{T ≤ C} as n → ∞, then
| (2.11) |
For ρ1 ≥ 1/4, the GCV tuning parameter can yield a sparser model than the one selected by the BIC-tuning parameter selector, as is seen in the simulation study in Section 4.
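The comparison between GCV and BIC can be checked numerically. The sketch below assumes the GCV form of Fan and Li (2002), −ℓc/[n(1 − df/n)²], and takes BIC and AIC to be −2ℓc + log(n)·df and −2ℓc + 2·df; these forms and the function names are assumptions made for illustration, since the scaling used in (2.9) and (2.10) may differ by a constant factor that does not change the minimizer.

```python
import math

def gcv(loglik, df, n):
    """GCV score, assuming the form -loglik / [n (1 - df/n)^2]."""
    return -loglik / (n * (1.0 - df / n) ** 2)

def gcv_linearized(loglik, df, n):
    """First-order expansion of GCV using (1 - t)^(-2) ~ 1 + 2t with t = df/n."""
    return (-loglik / n) * (1.0 + 2.0 * df / n)

def penalty_per_parameter(rho1, n):
    """Approximate cost of one extra parameter on the -loglik scale.

    GCV: -loglik * 2/n ~ 2 * rho1 * log(n) when -loglik/(n log n) -> rho1.
    BIC: log(n)/2 (from -2 loglik + log(n) df).  AIC: 1 (from -2 loglik + 2 df).
    """
    return {"GCV": 2.0 * rho1 * math.log(n), "BIC": 0.5 * math.log(n), "AIC": 1.0}
```

For example, `penalty_per_parameter(0.65, 400)` gives a GCV penalty larger than the BIC penalty, while at ρ1 = 1/4 the two coincide, which is the threshold stated above.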
3.1. Definition and Notation
We first need to define the candidate models considered in model selection. Let ᾱ = {1, ···, d} denote the labels of the predictors in the full model. Any subset α of ᾱ then represents a candidate model that includes the predictors labelled by α. For each candidate model α, its model size and the corresponding coefficients are dfα and βα. Each tuning parameter λ in the penalty function therefore results in a selected model αλ with model size dfαλ and the corresponding coefficients β̂λ. The collection of all candidate models is denoted by 𝓐.
For any given model α, we are able to obtain its non-penalized estimates by maximizing the corresponding partial likelihood ℓc(β). Similarly, for any selected model αλ obtained from penalized partial likelihood with given λ, we are able to obtain the corresponding non-penalized estimates .
To study the asymptotic behaviors of the tuning parameter selectors for penalized partial likelihood, we define a general tuning parameter selector
| (2.12) |
where β̂ is the parameter estimator and dfβ̂ is the corresponding degrees of freedom associated with β̂. Here κn is a positive number that indexes different variable selection criteria. When κn = 2, GICκn is the AIC at (2.9), and when κn = log(n), GICκn is the BIC at (2.10).
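To make the role of κn concrete, here is a small sketch of GIC-based selection over a grid of tuning parameters. It assumes the form GICκn = {−2ℓc(β̂) + κn·df}/n; the 1/n scaling is an assumption and does not affect the minimizer. The candidate values in the usage note are hypothetical numbers chosen only to show how AIC and BIC can disagree.

```python
import math

def gic(loglik, df, n, kappa):
    """Generalized information criterion, assuming (-2 loglik + kappa * df) / n."""
    return (-2.0 * loglik + kappa * df) / n

def select_by_gic(candidates, n, kappa):
    """Return the lambda minimizing GIC.

    candidates: list of (lam, loglik, df) tuples, one per tuning parameter,
    where loglik is the log-partial likelihood at the fitted coefficients
    and df is the number of nonzero coefficients.
    """
    best = min(candidates, key=lambda c: gic(c[1], c[2], n, kappa))
    return best[0]

# Hypothetical fits for n = 200: larger lambda gives a sparser model.
fits = [(0.05, -500.0, 6), (0.10, -504.0, 3), (0.30, -530.0, 1)]
lam_aic = select_by_gic(fits, n=200, kappa=2.0)            # kappa_n = 2
lam_bic = select_by_gic(fits, n=200, kappa=math.log(200))  # kappa_n = log(n)
```

With these illustrative numbers the bounded penalty κn = 2 keeps the six-variable fit, while κn = log(n) moves to the three-variable fit.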
3.2. Theoretical Property
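In this section, we assume that the set of candidate models contains the unique true model and that the number of parameters in the full model is finite. Assume that the coefficients of the unique true model α0 in 𝓐 are nonzero. Therefore, any candidate model α ⊉ α0 is an underfitted model, while any model α ⊃ α0 other than α0 itself is an overfitted model. We partition the tuning parameters into
$$\Omega_- = \{\lambda : \alpha_\lambda \not\supseteq \alpha_0\}, \qquad \Omega_0 = \{\lambda : \alpha_\lambda = \alpha_0\}, \qquad \Omega_+ = \{\lambda : \alpha_\lambda \supset \alpha_0,\ \alpha_\lambda \ne \alpha_0\}.$$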
We need the following conditions.
- (E1) λmax depends on n and satisfies λmax → 0 as n → ∞.
- (E2) There exists a constant m such that the penalty pλ(ξ) satisfies for ξ > mλ.
- (E3) If λn → 0 as n → ∞, then the penalty function satisfies
- (E4) For any candidate model α ∈ 𝓐, there exists cα > 0 such that . In addition, for any underfitted model α ⊉ α0, cα > cα0.
Conditions (E1)–(E3) are conditions on the penalty, while condition (E4) is a technical condition needed to investigate the asymptotic properties of the tuning parameter selectors for penalized partial likelihood. Condition (E1) indicates that a smaller tuning parameter is required when the sample size is large; (E2) implies that the penalty is chosen to yield an asymptotically unbiased estimator; (E3) is used to study the oracle property of the penalized estimator; (E4) ensures that an underfitted model yields a larger model deviance than that of the true model.
Theorem 2
Suppose that the partial likelihood function of the Cox model satisfies Conditions (A)–(D) in Fan and Li (2002) and that Conditions (E1)–(E4) hold.
- (A) If there exists a positive constant M such that κn < M, then the tuning parameter λ̂ obtained by minimizing GICκn(λ) satisfies P{λ̂ ∈ Ω−} → 0 and P{λ̂ ∈ Ω+} > 0.
- (B) If κn → ∞ and , then the tuning parameter λ̂ obtained by minimizing GICκn(λ) satisfies P{αλ̂ = α0} → 1.
- (C) If ρ1 > 0, then the tuning parameter λ̂ obtained by minimizing the GCV score defined in (2.8) satisfies P{αλ̂ = α0} → 1.
The proof of Theorem 2 is given in the supplement (Li, Ren, Yang and Yu (2016)).
Here, Theorem 2(A) implies that the GICκn selector with bounded κn tends to overfit regardless of which penalty function is used, while Theorem 2(B) indicates that the GICκn selector with diverging κn enables us to identify the true model consistently. Thus, the penalized partial likelihood with diverging κn possesses the oracle property. Theorem 2(C) implies that the penalized partial likelihood estimator with the GCV selector also possesses the oracle property. This is quite different from penalized least squares for the linear regression model; as shown in Wang, Li, and Tsai (2007), the GCV selector for penalized least squares in the linear model results in an overfitted model with positive probability.
4. Numerical Results
We assessed the finite sample performance of the proposed procedures. Since various comparisons among penalized partial likelihood methods with different penalties, such as the LASSO and SCAD, already exist, our simulation studies focused on comparisons among different tuning parameter selectors for penalized partial likelihood with the SCAD penalty. For simplicity, we refer to the SCAD penalized partial likelihood with κn = 2 and κn = log(n) in the GICκn tuning parameter selector as SCAD-AIC and SCAD-BIC, respectively. Similarly, we refer to the SCAD method with the GCV selector as SCAD-GCV. The best subset selection procedures with the AIC and BIC criteria for the Cox model are denoted by AIC and BIC in this section, respectively. In our simulation, we employed the local linear approximation (LLA) algorithm (Zou and Li (2008)) to compute the parameter estimates of the SCAD penalized partial likelihood function.
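For readers unfamiliar with the SCAD penalty, the sketch below evaluates its derivative, which is the quantity an LLA step uses as a coefficient-specific weight in a weighted L1 problem. The formula and the default a = 3.7 follow the usual statement of SCAD in Fan and Li (2001); they are recalled here as an assumption rather than taken from this paper, and the function names are illustrative.

```python
def scad_derivative(theta, lam, a=3.7):
    """Derivative p'_lambda(|theta|) of the SCAD penalty for a > 2.

    Equals lam on [0, lam], decays linearly to 0 on (lam, a*lam],
    and is 0 beyond a*lam, so large coefficients are not shrunk.
    """
    theta = abs(theta)
    if theta <= lam:
        return lam
    return max(a * lam - theta, 0.0) / (a - 1.0)

def lla_weights(beta_init, lam, a=3.7):
    """LLA weights: the penalty is replaced by sum_j w_j |beta_j|,
    with w_j = p'_lambda(|beta_init_j|) from an initial estimate."""
    return [scad_derivative(b, lam, a) for b in beta_init]
```

For instance, `lla_weights([0.0, 0.2, 1.0], lam=0.1)` assigns the full weight 0.1 to the zero coefficient, a reduced weight to the moderate one, and weight 0 to the large one.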
Example 4.1
We adapted the model structure in Fan and Li (2002) to generate data with sample sizes n = 100, 200, and 400 from the Cox model with hazard function
$$h(t \mid x) = h_0(t)\exp(x^T\beta),$$
where h0(t) ≡ 1, β = (0.8, 0, 0, 1, 0, 0, 0, 0, 0, 0.6, 0, 0)^T, and x had a 12-dimensional normal distribution with the correlation between xi and xj equal to 0.5^{|i−j|}. Accordingly, μ(x^T β) = exp(−x^T β). The censoring distribution was exponential with mean U exp(−x^T β), where U was sampled from a uniform distribution over [1, 3]. Consequently, the average censoring percentage was 35%. We include the case with no censoring as a benchmark. For each scenario, we conducted 1000 simulations.
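The data-generating design of this example can be sketched with the standard library alone. The code below is an illustration, not the authors' implementation: it assumes zero means and unit variances for x (the text only specifies the correlations) and produces the 0.5^{|i−j|} correlation through an AR(1) recursion.

```python
import math
import random

BETA = [0.8, 0, 0, 1, 0, 0, 0, 0, 0, 0.6, 0, 0]

def draw_covariates(rng, d=12, rho=0.5):
    """Mean-zero, unit-variance normal vector with corr(x_i, x_j) = rho^|i-j|."""
    x = [rng.gauss(0.0, 1.0)]
    scale = math.sqrt(1.0 - rho * rho)
    for _ in range(d - 1):
        x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
    return x

def simulate(n, seed=0, censoring=True):
    """Return a list of (V, delta, x) triples from the Cox model with h0 = 1."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        x = draw_covariates(rng)
        eta = sum(b * xj for b, xj in zip(BETA, x))
        t = rng.expovariate(math.exp(eta))          # mean exp(-eta)
        if censoring:
            u = rng.uniform(1.0, 3.0)
            c = rng.expovariate(math.exp(eta) / u)  # mean u * exp(-eta)
            sample.append((min(t, c), int(t <= c), x))
        else:
            sample.append((t, 1, x))
    return sample
```

Under this design the probability of censoring given U is 1/(1 + U), whose average over U ~ Uniform[1, 3] is log(2)/2 ≈ 0.347, which agrees with the stated 35% censoring.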
To assess finite sample performance, we report the percentage of models correctly fitted, underfitted, and overfitted with 1, 2, 3, 4, 5 or more parameters by five variable selection procedures, as well as the simulated data fitted with the true model over 1000 simulations. We report the average number of zero coefficients that were correctly (C) and incorrectly (IC) identified in the selected models over 1000 simulations. To compare model fittings, we calculated the model error for the new observation (V, δ, x),
where the expectation is taken with respect to the new observed covariate vector x, and μ(xT β) = E(T|x, β). We report the median of the relative model error (MRME) over 1000 simulations, where the relative model error is defined as RME = ME/MEfull, and MEfull is the model error calculated by fitting the data with the full model.
In Fan and Li (2002), it was shown that
By using the moment generating function of the multinormal distribution, we can simplify this to
where Σ is the covariance matrix of x. We use this formula to calculate model errors for our simulations.
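The closed form can be sketched as follows. The code assumes that the model error is ME(β̂) = E{exp(−x^T β̂) − exp(−x^T β)}², with x mean-zero normal with covariance Σ and β̂ held fixed; this reading is inferred from μ(x^T β) = exp(−x^T β) and is an assumption, as are the function names. Expanding the square and applying the normal moment generating function E exp(a^T x) = exp(a^T Σ a / 2) gives three exponential terms.

```python
import math

def quad_form(u, sigma, v):
    """Compute u^T Sigma v for plain Python lists."""
    return sum(u[i] * sigma[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def ar1_cov(d, rho=0.5):
    """Covariance matrix with entries rho^|i-j| (unit variances assumed)."""
    return [[rho ** abs(i - j) for j in range(d)] for i in range(d)]

def model_error(beta_hat, beta0, sigma):
    """E{exp(-x^T beta_hat) - exp(-x^T beta0)}^2 for x ~ N(0, Sigma)."""
    s = [a + b for a, b in zip(beta_hat, beta0)]
    return (math.exp(2.0 * quad_form(beta_hat, sigma, beta_hat))
            - 2.0 * math.exp(0.5 * quad_form(s, sigma, s))
            + math.exp(2.0 * quad_form(beta0, sigma, beta0)))
```

A relative model error is then the ratio of `model_error` for a selected fit to `model_error` for the full-model fit on the same simulated data set.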
Table 2.2 gives the results for the uncensored case, and shows that the MRME of SCAD-BIC/GCV is smaller than that of SCAD-AIC. As the sample size increases, the MRME of SCAD-BIC/GCV approaches that of the oracle estimator, whereas the MRME of SCAD-AIC remains at the same level. Interestingly, SCAD-BIC and SCAD-AIC have smaller MRME than that of the best subset selection with BIC and AIC, respectively.
Table 2.2.
Simulation results for the Cox model (No Censoring)
| n | Method | MRME (%) | Zeros: C | Zeros: IC | Under (%) | Exact (%) | Overfitted (%): 1 | Overfitted (%): 2 | Overfitted (%): 3 | Overfitted (%): 4 | Overfitted (%): ≥ 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | SCAD-AIC | 45.75 | 7.255 | 0.001 | 0.1 | 37.5 | 15.3 | 16.1 | 13.1 | 8.5 | 9.4 |
| | SCAD-BIC | 20.90 | 8.576 | 0.003 | 0.3 | 74.0 | 15.7 | 5.2 | 3.8 | 0.9 | 0.1 |
| | SCAD-GCV | 17.29 | 8.940 | 0.059 | 5.6 | 89.2 | 4.9 | 0.3 | 0.0 | 0.0 | 0.0 |
| | AIC | 52.52 | 7.349 | 0.001 | 0.1 | 20.1 | 29.4 | 26.9 | 15.3 | 6.0 | 2.2 |
| | BIC | 25.68 | 8.666 | 0.004 | 0.4 | 72.5 | 22.6 | 3.4 | 1.1 | 0.0 | 0.0 |
| | Oracle | 15.73 | 9.000 | 0.000 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 200 | SCAD-AIC | 58.53 | 7.591 | 0.000 | 0.0 | 46.6 | 15.9 | 13.8 | 8.5 | 7.3 | 7.9 |
| | SCAD-BIC | 36.33 | 8.867 | 0.000 | 0.0 | 90.1 | 7.3 | 1.8 | 0.8 | 0.0 | 0.0 |
| | SCAD-GCV | 33.96 | 8.995 | 0.003 | 0.3 | 99.2 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 |
| | AIC | 66.37 | 7.506 | 0.000 | 0.0 | 23.1 | 32.2 | 24.8 | 13.8 | 4.5 | 1.6 |
| | BIC | 41.89 | 8.781 | 0.000 | 0.0 | 81.2 | 16.1 | 2.3 | 0.4 | 0.0 | 0.0 |
| | Oracle | 33.95 | 9.000 | 0.000 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 400 | SCAD-AIC | 68.14 | 7.553 | 0.000 | 0.0 | 45.1 | 14.7 | 15.4 | 9.3 | 9.1 | 6.4 |
| | SCAD-BIC | 44.10 | 8.936 | 0.000 | 0.0 | 94.7 | 4.4 | 0.7 | 0.2 | 0.0 | 0.0 |
| | SCAD-GCV | 42.33 | 8.999 | 0.000 | 0.0 | 99.9 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 |
| | AIC | 74.71 | 7.530 | 0.000 | 0.0 | 22.6 | 33.4 | 25.7 | 12.2 | 5.0 | 1.1 |
| | BIC | 47.38 | 8.875 | 0.000 | 0.0 | 88.6 | 10.5 | 0.7 | 0.2 | 0.0 | 0.0 |
| | Oracle | 42.30 | 9.000 | 0.000 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Table 2.2 also shows that SCAD-BIC/GCV has a higher probability of correctly estimating the true zero coefficients to zero than does SCAD-AIC. However, SCAD-BIC/GCV was more prone than SCAD-AIC to incorrectly set the three nonzero coefficients to zero when the sample size was small, and SCAD-GCV was more aggressive than SCAD-BIC with larger values in “IC” columns. In addition, SCAD-BIC/GCV had a much higher probability of correctly identifying the true model.
For the censored case, Table 2.3 shows findings similar to those presented in Table 2.2. Accordingly, SCAD-BIC/GCV was superior to SCAD-AIC both in identifying the true model and in reducing the model error and complexity. When the data were 35% censored, all methods declined slightly in their efficacy, while the relative performance of SCAD-BIC/GCV versus SCAD-AIC remained the same as in the uncensored case. This is consistent with our theoretical analysis in Section 3.
Table 2.3.
Simulation results for the Cox model (35% Censoring)
| n | Method | MRME (%) | Zeros: C | Zeros: IC | Under (%) | Exact (%) | Overfitted (%): 1 | Overfitted (%): 2 | Overfitted (%): 3 | Overfitted (%): 4 | Overfitted (%): ≥ 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | SCAD-AIC | 42.43 | 7.235 | 0.012 | 1.2 | 33.4 | 18.8 | 16.6 | 12.0 | 8.5 | 9.5 |
| | SCAD-BIC | 21.42 | 8.491 | 0.060 | 5.8 | 63.4 | 19.7 | 7.3 | 2.4 | 1.1 | 0.3 |
| | SCAD-GCV | 19.04 | 8.800 | 0.153 | 13.6 | 71.6 | 12.3 | 2.1 | 0.3 | 0.1 | 0.0 |
| | AIC | 50.03 | 7.370 | 0.016 | 1.6 | 20.4 | 30.0 | 25.9 | 13.6 | 6.5 | 2.0 |
| | BIC | 23.45 | 8.648 | 0.036 | 3.6 | 68.8 | 23.7 | 3.5 | 0.4 | 0.0 | 0.0 |
| | Oracle | 14.35 | 9.000 | 0.000 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 200 | SCAD-AIC | 59.24 | 7.535 | 0.000 | 0.0 | 42.3 | 19.5 | 13.2 | 10.2 | 7.7 | 7.1 |
| | SCAD-BIC | 35.53 | 8.841 | 0.000 | 0.0 | 87.4 | 9.8 | 2.3 | 0.5 | 0.0 | 0.0 |
| | SCAD-GCV | 32.48 | 8.963 | 0.006 | 0.6 | 95.9 | 3.3 | 0.2 | 0.0 | 0.0 | 0.0 |
| | AIC | 64.64 | 7.513 | 0.000 | 0.0 | 22.8 | 35.5 | 21.5 | 12.1 | 6.9 | 1.2 |
| | BIC | 37.90 | 8.830 | 0.000 | 0.0 | 84.8 | 13.5 | 1.6 | 0.1 | 0.0 | 0.0 |
| | Oracle | 31.45 | 9.000 | 0.000 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 400 | SCAD-AIC | 69.31 | 7.552 | 0.000 | 0.0 | 41.5 | 19.5 | 14.7 | 10.6 | 7.4 | 6.3 |
| | SCAD-BIC | 45.07 | 8.920 | 0.000 | 0.0 | 93.2 | 5.7 | 1.0 | 0.1 | 0.0 | 0.0 |
| | SCAD-GCV | 42.75 | 8.993 | 0.000 | 0.0 | 99.4 | 0.5 | 0.1 | 0.0 | 0.0 | 0.0 |
| | AIC | 73.64 | 7.547 | 0.000 | 0.0 | 23.8 | 33.7 | 24.7 | 11.6 | 4.5 | 1.7 |
| | BIC | 48.85 | 8.856 | 0.000 | 0.0 | 86.8 | 12.0 | 1.2 | 0.0 | 0.0 | 0.0 |
| | Oracle | 43.47 | 9.000 | 0.000 | 0.0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Example 4.2
(Heart attack data) We applied the proposed regularization parameter selection procedures to the heart attack data set used in Hosmer and Lemeshow (1999). The data were collected in the Worcester Heart Attack Study which describes trends over time in survival rates following hospital admission for acute myocardial infarction. The total length of follow-up on the admission of 481 hospital patients was recorded for years 1975, 1978, 1981, 1984, 1986, and 1988. Among those patients, 249 died and the rest were censored at the rate of 48%.
To model survival time, Hosmer and Lemeshow (1999) suggested fitting the Cox proportional hazards model with five explanatory variables: x1-age; x2-cpk (peak cardiac enzymes in international units); x3-sex (male=0 and female=1); x4-chf (left heart failure complications, yes=1 and no=0); x5-miord (MI order, first=0 and recurrent=1). In addition to these variables, we included the six interactions between the two continuous variables (age and cpk) and the three indicator variables (sex, chf, and miord). Thus, there were 11 variables in our full model. We applied the penalized partial likelihood approach. The regularization parameters selected by SCAD-AIC, SCAD-BIC, and SCAD-GCV were 0.0533, 0.0878, and 0.1326, respectively. The corresponding tuning parameter selector curves are depicted in Figure 2.2.
Figure 2.2.
The left panel is the GIC score with κn = 2 versus λ, the middle panel is the GIC score with κn = log(n) versus λ, and the right panel is the GCV score versus λ.
Table 2.4 presents the maximum partial likelihood estimates (MPLE) from the full model as well as the SCAD-AIC/BIC/GCV parameter estimates, together with their standard errors. The full model contained six insignificant variables (x2, x3, and x7 to x10) at level 0.05, and SCAD-AIC included two insignificant variables (x6 and x10) at level 0.05. In contrast, the four variables x1, x4, x5, and x11 selected by SCAD-BIC were significant at level 0.05. For this data set, SCAD-GCV appears to be overly aggressive in that it excludes x5 and x11.
Table 2.4.
Estimates and Standard Errors for Heart Attack Data
| Variable | MPLE | SCAD-AIC | SCAD-BIC | SCAD-GCV |
|---|---|---|---|---|
| age (x1) | 0.60(0.13) | 0.56(0.09) | 0.43(0.07) | 0.41(0.05) |
| cpk (x2) | 0.03(0.14) | 0(-) | 0(-) | 0(-) |
| sex (x3) | 0.17(0.14) | 0(-) | 0(-) | 0(-) |
| chf (x4) | 0.80(0.14) | 0.80(0.13) | 0.80(0.14) | 0.82(0.13) |
| miord (x5) | 0.42(0.14) | 0.43(0.13) | 0.41(0.13) | 0(-) |
| age*sex (x6) | −0.29(0.14) | −0.22(0.13) | 0(-) | 0(-) |
| age*chf (x7) | −0.07(0.15) | 0(-) | 0(-) | 0(-) |
| age*miord (x8) | 0.03(0.15) | 0(-) | 0(-) | 0(-) |
| cpk*sex (x9) | −0.16(0.16) | 0(-) | 0(-) | 0(-) |
| cpk*chf (x10) | 0.19(0.15) | 0.19(0.09) | 0(-) | 0(-) |
| cpk*miord (x11) | 0.29(0.15) | 0.25(0.12) | 0.21(0.05) | 0(-) |
Based on Table 2.4, the p-values of the partial likelihood ratio tests of the SCAD-AIC, SCAD-BIC, and SCAD-GCV models versus the full model are 0.6752, 0.1749, and 0.0034, respectively. Consequently, there is no evidence of lack of fit in the SCAD-BIC model. The SCAD-GCV model may be too aggressive, consistent with our simulation results that GCV tends to underfit when the sample size is not large enough.
5. A Tribute to Peter Hall
Professor Peter Hall made wide ranging and ground-breaking contributions to many statistical fields and played major leadership roles throughout the statistical profession. He was a true scholar, and a mentor and friend of many of us. We grieve his loss.
Runze Li (RL) had the great fortune to learn from Peter and interact with him directly when they jointly served as Editors of the Annals of Statistics from 2013 to 2015. As an eminent scientist, Peter was an extremely kind, modest and optimistic person. Peter was always super fast, and handled whatever came to him promptly. His speed was unbeatable. Once, RL was asked to review a grant proposal by an international grant agency within a tight deadline. When RL sent back his report the next day, he was told that Peter’s report had already been received.
Professor Peter Hall had a huge influence on RL’s research on variable selection and feature screening, although he never collaborated with Peter on a paper. Many of RL’s works were inspired by Peter’s ideas. For example, Hall and Miller (2009) proposed using generalized correlation to conduct feature screening and the use of the bootstrap to quantify the uncertainty of feature ranking. Motivated by this work, Li, Zhong, and Zhu (2012) proposed using distance correlation for feature screening.
Professor Peter Hall will be remembered forever as a legendary statistician, a great scholar, beloved colleague, mentor and friend, and his work will continue to have a far-reaching impact on statistical methodology and theory.
Acknowledgments
The authors would like to thank the editors Professors Raymond J. Carroll and Qiwei Yao for organizing this special issue. Runze Li is grateful to the editors for their invitation and their constructive comments on an earlier version of this paper. Li’s research is supported by National Institute on Drug Abuse grants P50 DA039838, P50 DA036107, and R01 DA039854, National Science Foundation (NSF) grant DMS 1512422, and National Library of Medicine grant T32 LM012415. His research was also partially supported by National Nature Science Foundation of China grants 11690014 and 11690015. Ren’s research is supported by NSF grants DMS 0905772, DMS 1232424, and DMS 1407461. Yang’s research was supported by the NNSFC grant 11471086, the National Social Science Foundation of China grant 16BTJ032, the Fundamental Research Funds for the Central Universities 15JNQM019 and 21615452, the National Statistical Scientific Research Center Projects 2015LD02, the China Scholarship Council 201506785010 and Science and Technology Program of Guangzhou 2016201604030074. All authors equally contributed to this paper and are listed in alphabetic order. Guangren Yang is the corresponding author. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NSF, the NIDA, or the NIH.
Appendices
Appendix A: Proof of Theorem 1
Without loss of generality, assume that there are no ties among Vi’s in the observed data, and that
| (A.1) |
This simplifies n−1ℓc(β) to
| (A.2) |
It follows by the Weak Law of Large Numbers (WLLN) that . Let
| (A.3) |
Thus,
| (A.4) |
From (A.1), we have
| (A.5) |
where Wn(t) = ∫ ∫v≥t exp(xT β) dGn(v, x) with Gn(v, x) given in (2.4). Here δi is a binary random variable, . With , it follows that
| (A.6) |
To prove Part (a), we next deal with An. Since
and X has a finite bounded support, it follows
| (A.7) |
The last equality is due to Stirling’s formula, and this completes the proof of (a).
For Part (b), it suffices to show that
| (A.8) |
From (2.4), we know that Hn(v) is the empirical process of a random sample of Vi’s with δi = 1. Thus, ||Hn − H|| = supv|Hn(v) − H(v)| = Op(n^{−1/2}) by the DKW inequality (van der Vaart (1998)) since EI{δi = 1} = ρ1 > 0. Hence, from (2.4), (A.5), and integration by parts, we have
| (A.9) |
where by using the fact that , since E(|Xj|) < ∞ for all j = 1, ···, d by assumption. From (2.4) and (A.5), we have
By the assumption in Part (b) and the WLLN, . This implies that . Furthermore, . Thus,
It then follows that
| (A.10) |
Therefore, (A.8) follows from (A.9)–(A.10), the assumption about μ1, and the Dominated Convergence Theorem. Thus,
| (A.11) |
Footnotes
The proof of Theorem 2 is in the supplemental materials of this paper.
References
- Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974;19:716–723.
- Andersen PK, Gill RD. Cox’s regression model for counting processes: a large sample study. The Annals of Statistics. 1982;10:1100–1120.
- Bradic J, Fan J, Jiang J. Regularization for Cox’s proportional hazards model with NP-dimensionality. The Annals of Statistics. 2011;39:3092–3120. doi: 10.1214/11-AOS911.
- Cox DR. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
- Cox DR. Partial likelihood. Biometrika. 1975;62:269–276.
- Craven P, Wahba G. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik. 1979;31:377–403.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Li R. Variable selection for Cox’s proportional hazards model and frailty model. The Annals of Statistics. 2002;30:74–99.
- Hall P, Miller H. Using generalized correlation to effect variable selection in very high dimensional problems. Journal of Computational and Graphical Statistics. 2009;18:533–550.
- Hosmer DW, Lemeshow S. Applied Survival Analysis: Regression Modeling of Time to Event Data. John Wiley & Sons Inc; New York, NY: 1999.
- Li R, Ren J-J, Yang G, Yu Y. Supplement to “Asymptotic behavior of Cox’s partial likelihood and its application to variable selection”. 2016. doi: 10.5705/ss.202016.0401.
- Li R, Zhong W, Zhu L. Feature screening via distance correlation learning. Journal of the American Statistical Association. 2012;107:1129–1139. doi: 10.1080/01621459.2012.695654.
- Murphy SA, van der Vaart AW. On profile likelihood. Journal of the American Statistical Association. 2000;95:449–465.
- Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978;6:461–464.
- Takemi Y, Toshinari K. Maximum full and partial likelihood estimators in the proportional hazard model. Annals of the Institute of Statistical Mathematics. 1984;36:363–373.
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Tibshirani R. The lasso method for variable selection in the Cox model. Statistics in Medicine. 1997;16:385–395. doi: 10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3.
- Tsiatis AA. A large sample study of Cox’s regression model. The Annals of Statistics. 1981;9:93–108.
- van der Vaart AW. Asymptotic Statistics. Cambridge U. Press; 1998.
- Wang H, Li R, Tsai CL. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568. doi: 10.1093/biomet/asm053.
- Zhang Y, Li R, Tsai CL. Regularization parameter selections via generalized information criterion. Journal of the American Statistical Association. 2010;105:312–323. doi: 10.1198/jasa.2009.tm08013.
- Zhang H, Lu W. Adaptive LASSO for Cox’s proportional hazards model. Biometrika. 2007;94:691–703.
- Zou H. A note on path-based variable selection in the penalized proportional hazards model. Biometrika. 2008;95:241–247.
- Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models (with discussion). The Annals of Statistics. 2008;36:1509–1566. doi: 10.1214/009053607000000802.