Summary
Penalization methods have been shown to yield both consistent variable selection and oracle parameter estimation under correct model specification. In this article, we study such methods under model misspecification, where the assumed form of the regression function is incorrect, including generalized linear models for uncensored outcomes and the proportional hazards model for censored responses. Estimation with the adaptive least absolute shrinkage and selection operator, lasso, penalty is proven to achieve sparse estimation of regression coefficients under misspecification. The resulting estimators are selection consistent, asymptotically normal and oracle, where the selection is based on the limiting values of the parameter estimators obtained using the misspecified model without penalization. We further derive conditions under which the penalized estimators from the misspecified model may yield selection consistency under the true model. The robustness is explored numerically via simulation and an application to the Wisconsin Epidemiological Study of Diabetic Retinopathy.
Keywords: Least false parameter, Model misspecification, Oracle property, Penalization, Selection consistency, Shrinkage estimation, Variable selection
1. Introduction
Variable selection has attracted much attention recently due to its importance in constructing parsimonious models with superior predictive performance. Traditional variable selection algorithms include forward selection, backward elimination and best subset selection. Model selection criteria such as Mallows' Cp (Mallows, 1973), Akaike's information criterion (Akaike, 1973), and the Bayesian information criterion (Schwarz, 1978) have been heavily utilized in conjunction with such procedures. While these methods are conceptually simple and widely used in practice, the lack of a general theoretical justification across a range of algorithms and the unstable empirical performance of the algorithms have motivated the development of alternative model selection techniques. Penalization methods, available in general regression settings, simultaneously yield sparse models, by shrinking some coefficients to zero, and estimates of the nonzero coefficients. Methods exist for generalized linear models, e.g., the least absolute shrinkage and selection operator (Tibshirani, 1996), the smoothly clipped absolute deviation (Fan & Li, 2001) and the adaptive lasso (Zou, 2006), and have been extended to the proportional hazards model for censored data (e.g., Tibshirani, 1997; Fan & Li, 2002; Zhang & Lu, 2007). With the proper choice of the tuning parameter, the smoothly clipped absolute deviation and adaptive lasso estimators can be shown to achieve selection consistency and asymptotic normality, with asymptotic variance equalling that of the oracle procedure in which the zero coefficients are known a priori.
A key assumption for the shrinkage methods is that the regression model is correctly specified. To our knowledge, the robustness of the variable selection to misspecification has not been rigorously examined. The conditions yielding selection consistency in the true model are unclear, as is the extent to which penalization procedures may be oracle under misspecification. Our approach builds on earlier work for misspecified models, adapting theoretical results for unpenalized estimation using likelihood (White, 1982) and partial likelihood (Lin & Wei, 1989; Sasieni, 1993) to establish the asymptotic properties of the corresponding penalized estimators from the adaptive lasso. The quantities being estimated, the so-called least false parameters, are defined implicitly as the maximizers of the asymptotic limits of the penalized likelihoods. In §§ 2 and 3, we show that selection consistency may be achieved for the nonzero least false parameters. Moreover, under certain conditions (Li & Duan, 1989; Kosorok et al., 2004), selection consistency for the least false parameters provides selection consistency for the parameters in the true model. The oracle property may be achieved, in the sense that the variances of the penalized estimators equal those of the oracle procedure, in which the nonzero least false parameters are known a priori.
2. Variable selection for misspecified generalized linear models
Let V1,..., Vn be independently and identically distributed random vectors, where Vi = (Xi, Yi), Xi is a p-dimensional vector of covariates and Yi is a response variable. Here p is assumed to be fixed. In generalized linear models (McCullagh & Nelder, 1989), it is assumed that the true model has density function f {y; g(x′β)}c(x), where β is a p-dimensional vector of unknown regression coefficients, c(x) is the p-variate density function of X, f {y; g(x′β)} is a conditional density function of Y given X = x and g is a specified link function. The loglikelihood is Σ_{i=1}^n ℓ(Vi; β), where ℓ(υ; β) = log f {y; g(x′β)}. Let β̃(1) denote its maximizer.
To select variables, we adopt the adaptive lasso (Zou, 2006). Specifically, we consider the penalized loglikelihood function
Σ_{i=1}^n ℓ(Vi; β) − λn Σ_{j=1}^p ŵnj |βj|,
where ŵnj = |β̃(1)j|^{−γ} for some γ > 0, β̃(1)j is the jth component of β̃(1) and λn > 0 is a tuning parameter. Let β̂(1) denote the maximizer of the penalized loglikelihood.
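As a concrete illustration of how such an estimator can be computed (this sketch is ours, not part of the paper; it assumes the glmnet package, uses γ = 1 and the BIC-type tuning described in § 4, and the variable names are illustrative), the weights ŵnj can be supplied through the penalty.factor argument:

```r
# Adaptive lasso for a (possibly misspecified) binary regression: a minimal sketch.
library(glmnet)

set.seed(1)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta0 <- c(1, -0.9, 0, 0, -0.8, rep(0, 5))
y <- rbinom(n, 1, pnorm(0.5 + drop(X %*% beta0)))     # true probit model (cf. Scenario 2 in Section 4)

beta_tilde <- coef(glm(y ~ X, family = binomial))[-1] # unpenalized fit of the working logistic model
gamma <- 1
w <- abs(beta_tilde)^(-gamma)                         # adaptive weights w_j = |beta_tilde_j|^{-gamma}

fit <- glmnet(X, y, family = "binomial", penalty.factor = w)
bic <- deviance(fit) + fit$df * log(n)                # one common BIC variant over the lambda path
beta_hat <- as.vector(coef(fit, s = fit$lambda[which.min(bic)]))[-1]
selected <- which(beta_hat != 0)                      # indices of covariates kept by the adaptive lasso
```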
Next, we study the theoretical properties of the adaptive lasso estimator β̂(1) under model misspecification. We first introduce some notation. Define ℓ̇(υ; β) = ∂ℓ(υ; β)/∂β, ℓ̈(υ; β) = ∂²ℓ(υ; β)/(∂β∂β′), A(1)(β) = E{ℓ̈(V; β)} and B(1)(β) = E{ℓ̇(V; β) ℓ̇(V; β)′}. In addition, following White (1982), we assume the following conditions.
Condition 1. The functions f {y; g(x′β)} are continuous in β for all β ∈ ℬ, where ℬ is a compact set.
Condition 2. The functions satisfy (i) E{log c(X)} < ∞; (ii) E{log fo(Y|X)} exists, where fo is the true conditional density; (iii) E{ℓ(V ; β)} has a unique maximizer β* and β* is interior to ℬ.
Condition 3. The likelihood function ℓ(υ; β) is twice differentiable with respect to β for all β ∈ ℬ, and the components of |ℓ(υ; β)|, |ℓ̇(υ; β) ℓ̇(υ; β)′| and |ℓ̈(υ; β)| are dominated by integrable functions.
Condition 4. The matrix A(1) ≡ A(1)(β*) is nonsingular and negative definite, and B(1) ≡ B(1)(β*) is nonsingular.
Condition 5. For all j, k, l = 1,..., p, and all β in some neighbourhood of β*, |∂³ℓ(υ; β)/∂βj∂βk∂βl| are dominated by integrable functions.
Conditions 1–5 ensure the consistency and asymptotic normality of the unpenalized estimator β̃(1). The quantity β* is the least false parameter to which β̃(1) converges. This value corresponds to the maximizer of the asymptotic limit of the unpenalized loglikelihood. In general, β* does not correspond to a well-defined parameter in the true conditional distribution fo. Additional assumptions, like those described below, are needed to link the least false parameter for the misspecified model to the true regression parameter in f0.
Without loss of generality, we can write β* = (β*1′, β*2′)′, where β*1 is a p1-dimensional vector of nonzero parameters and β*2 is a p2 = (p − p1)-dimensional vector of zero parameters. Accordingly, we write β̂(1) = (β̂(1)1′, β̂(1)2′)′. In the Appendix, we establish the following theorem.
Theorem 1. Assume that Conditions 1–5 hold, and n^{−1/2}λn → 0 and n^{(γ−1)/2}λn → ∞ as n → ∞. Then pr(β̂(1)2 = 0) → 1 and
n^{1/2}(β̂(1)1 − β*1) → N(0, A11^{−1} B11 A11^{−1})
in distribution, where A11 and B11 are the upper left p1 × p1 submatrices of A(1)(β*) and B(1)(β*), respectively.
With fixed p, sparseness implies that there are zero components in the least false parameter, although the theory permits p2 = 0, where all components are nonzero. The notion of an oracle estimator is defined in terms of such sparseness. With growing p, sparseness generally refers to the proportion of zero parameters, which typically grows as some function of n. Such sparseness is critical to the consistency of penalized estimation procedures, which break down if there are too many nonzero parameters. The complications that arise with growing p are described in § 5.
For fixed p, Theorem 1 gives that if the least false parameter β* is sparse, then the adaptive lasso procedure is selection consistent and achieves sparsity with probability going to one asymptotically. Moreover, the estimator β̂(1)1 of the nonzero parameters in β* is asymptotically normal with variance equal to that of the oracle estimator, in which β*2 = 0 is known a priori. The variance generally has the sandwich form described in White (1982) and may be estimated by replacing the unknown quantities with empirical estimates. If the generalized linear model is correctly specified, then under mild regularity conditions (e.g., White, 1982, Assumption A7), −A(1)(β0) = B(1)(β0) = I, where I is the Fisher information matrix of the true model and β* = β0, the true regression coefficients. Hence, Theorem 1 generalizes Zou (2006, Theorem 4).
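To make the sandwich formula concrete, the following sketch (ours, continuing the numerical example above; the refit on the selected covariates is one simple plug-in choice and not necessarily the authors' implementation) estimates A(1) and B(1) empirically for a working logistic model:

```r
# Plug-in sandwich variance for the selected coefficients in the working logistic model.
fit_sel <- glm(y ~ X[, selected], family = binomial)
Xs  <- cbind(1, X[, selected])                         # design matrix including intercept
mu  <- fitted(fit_sel)
A_hat <- -crossprod(Xs * (mu * (1 - mu)), Xs) / n      # empirical A: average Hessian of the loglikelihood
B_hat <- crossprod(Xs * (y - mu)) / n                  # empirical B: average outer product of scores
V_hat <- solve(A_hat) %*% B_hat %*% solve(A_hat) / n   # sandwich covariance of the estimates
sqrt(diag(V_hat))                                      # robust standard errors
```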
Next, we study the relationship between the true model fo and β* when the true model follows a generalized linear model but the fitted loglikelihood uses a misspecified link function. To be specific, we reformulate the posited model using the exponential family with canonical parameter θ, that is, the conditional distribution of Y given X = x is given by f(y | θ) = h(y) exp{yθ − ψ(θ)} with θ = β′x. Note that E(Y | X = x) = ψ̇(θ) in the posited model, where ψ̇ is the derivative of ψ. Moreover, assume that the true conditional distribution of Y given X = x, fo, also follows a one-parameter family Hθ0(y) with θ0 = β′0x, where β0 is the true regression parameter. Here the family {Hθ} can be arbitrary and unknown.
We make the following additional assumptions (Li & Duan, 1989).
Condition 6. The p-variate density function c(x) of X is nondegenerate in ℝp.
Condition 7. The conditional expectation E(β′X | β′0X = β′0x) exists and is linear in β′0x for all β ∈ ℝp.
Corollary 1. Assume that ψ(θ) is strictly convex, ℬ is nonempty and convex in ℝp, and Conditions 1–5 and 6–7 hold. Then β* = α*β0 for some nonzero scalar α*.
Corollary 1 is a consequence of Li & Duan (1989, Theorem 2.1). It gives that the zero coefficients in β* are the same as those in β0. Condition 7 holds when the covariate distribution is elliptically symmetric, as with the multivariate normal distribution. Combining this result with Theorem 1 yields that the adaptive lasso estimator achieves sparsity and selection consistency for the true model, i.e., correctly identifies the zero and nonzero parameters in β0 with probability tending to unity, even under model misspecification.
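A quick numerical illustration of Corollary 1 (ours, not from the paper): with multivariate normal covariates and a probit truth fitted with a logistic model, the unpenalized estimates are approximately proportional to β0, so the zero coefficients remain approximately zero.

```r
# Illustration of Corollary 1: least false parameter proportional to beta0 under link misspecification.
set.seed(2)
n <- 50000
X <- matrix(rnorm(n * 3), n, 3)                        # elliptically symmetric covariates (Condition 7 holds)
beta0 <- c(1, -0.5, 0)
y <- rbinom(n, 1, pnorm(drop(X %*% beta0)))            # true probit link
beta_tilde <- coef(glm(y ~ X, family = binomial))[-1]  # fitted with the logistic link
round(beta_tilde, 3)                                   # third coefficient is close to zero
beta_tilde[1:2] / beta0[1:2]                           # ratios roughly equal: the scalar alpha*
```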
3. Variable selection for the misspecified Cox model
Let Ti be the survival time and Ci be the censoring time of subject i. Define Yi = min(Ti, Ci) and Δi = I(Ti ⩽ Ci). The observed data consist of V1,..., Vn, where Vi = (Xi, Yi, Δi) and the covariate vector Xi is defined as before. We assume that Ti is independent of Ci given Xi. The proportional hazards model (Cox, 1972) assumes that the conditional hazard function satisfies λ(t | X) = λ(t) exp(β′X), where λ(t) is an unspecified baseline hazard function and β is a p-dimensional vector of unknown regression coefficients.
Define S(r)(β, t) = n^{−1} Σ_{i=1}^n I(Yi ⩾ t) Xi⊗r exp(β′Xi) and s(r)(β, t) = E{S(r)(β, t)} for r = 0, 1, 2, where for a column vector a, a⊗0 = 1, a⊗1 = a and a⊗2 = aa′. The adaptive lasso estimator β̂(2) maximizes the penalized log partial likelihood
ℓn(2)(β) − λn Σ_{j=1}^p ŵnj |βj|, where ℓn(2)(β) = Σ_{i=1}^n Δi [β′Xi − log{n S(0)(β, Yi)}]
and ŵnj = |β̃(2)j|^{−γ}. The unpenalized partial likelihood estimator β̃(2) maximizes ℓn(2)(β).
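A minimal computational sketch (ours, not the authors' code; it assumes the survival and glmnet packages, and the BIC-type criterion over the lambda path is one convention) of the adaptive lasso for the Cox model:

```r
# Adaptive lasso for the (possibly misspecified) Cox model with partial-likelihood weights.
library(survival)
library(glmnet)

adalasso_cox <- function(X, time, status, gamma = 1) {
  beta_tilde <- coef(coxph(Surv(time, status) ~ X))    # unpenalized partial likelihood fit
  w <- abs(beta_tilde)^(-gamma)                        # adaptive weights
  fit <- glmnet(X, Surv(time, status), family = "cox", penalty.factor = w)
  bic <- deviance(fit) + fit$df * log(nrow(X))         # BIC-type criterion over the lambda path
  as.vector(coef(fit, s = fit$lambda[which.min(bic)]))
}
```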
Let τ > 0 be a constant such that pr(Y ⩾ τ) > 0. Define the limiting log partial likelihood
ℓ(2)(β) = ∫_0^τ E{β′X − log s(0)(β, t) | Y = t, Δ = 1} dP(2)(t),
where P(2)(t) = pr(Y ⩽ t, Δ = 1), and let A(2)(β) = ∂²ℓ(2)(β)/(∂β∂β′) and B(2)(β) denote the limiting covariance matrix of the normalized partial likelihood score (Lin & Wei, 1989). The least false parameter β* is the maximizer of ℓ(2)(β). To account for misspecification of the proportional hazards model, we make the following assumptions (Sasieni, 1993).
Condition 8. The expectation E{exp(X′β)} < ∞ for all β ∈ ℬ.
Condition 9. There is no pair (α, ϕ), with α ∈ ℝp nonzero and ϕ : ℝ → ℝ a monotone decreasing function, such that, for P(2)-almost all t < τ, α′XΔ = ϕ(Y)Δ and α′X I(Y ⩾ t) ⩽ ϕ(t) almost surely.
Conditions 8–9 ensure the consistency and asymptotic normality of the estimator β̃(2) under model misspecification. In particular, Condition 9 was originally proposed by Sasieni (1993), who studied the proportional hazards model under model misspecification. It ensures that as the sample size gets larger, there exists a unique maximizer of the partial likelihood under model misspecification. The condition is needed because of the potential lack of convexity of the partial likelihood loss function under model misspecification. This differs from simpler M-estimation problems, like those in § 2, in which the loss function is guaranteed to be convex. One should recognize that if the proportional hazards model does not hold, then the estimand β* will depend on the distribution of T given X, the distribution of C given X and the marginal distribution of X. Establishing the relationship between the parameters in β* and those in the true model is thus more challenging than for the generalized linear model.
Similarly to before, we can write β* = (β*1′, β*2′)′, where β*1 is a p1-dimensional vector of nonzero parameters and β*2 is a p2 = (p − p1)-dimensional vector of zero parameters. Analogously, we write β̂(2) = (β̂(2)1′, β̂(2)2′)′. In the Appendix, we prove the following.
Theorem 2. Assume that Conditions 8–9 hold, and n^{−1/2}λn → 0 and n^{(γ−1)/2}λn → ∞ as n → ∞. Then pr(β̂(2)2 = 0) → 1 and
n^{1/2}(β̂(2)1 − β*1) → N(0, A11^{−1} B11 A11^{−1})
in distribution, where A11 and B11 are the upper left p1 × p1 submatrices of A(2)(β*) and B(2)(β*), respectively.
Theorem 2 implies that the penalized partial likelihood estimator is sparse with probability tending to one when p1 < p. Moreover, the nonzero parameter estimators are asymptotically normal, with the same variance as the oracle partial likelihood procedure with the zero components in β* known. A robust variance estimator may be obtained using the plug-in formulas in Lin & Wei (1989).
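In practice, one convenient way to obtain such a robust variance (a sketch under our assumptions, using a refit on the selected covariates rather than the penalized fit itself) is the robust option of coxph, which implements the Lin & Wei (1989) sandwich estimator:

```r
# Robust (sandwich) standard errors for the nonzero Cox coefficients.
library(survival)
# X, time, status as before; 'selected' indexes the covariates kept by the adaptive lasso
fit_sel <- coxph(Surv(time, status) ~ X[, selected], robust = TRUE)
summary(fit_sel)$coefficients        # the "robust se" column gives the Lin & Wei standard errors
```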
If the Cox model is correctly specified, then β* = β0, and −A(2)(β0) = B(2)(β0) = I, the log partial likelihood information matrix of the Cox model. Theorem 2 thus generalizes Zhang & Lu (2007, Theorem 2) for the case with γ = 1. This follows since Sasieni (1993, Lemmas 2.3 and 7.3–7.5) and Conditions 6–7 imply that the conditions of Andersen & Gill (1982) hold. Hence, by Andersen & Gill (1982, Theorem 3.2), A(2)(β0) = −B(2)(β0) and the result follows.
Under a misspecified model, the relationship between β* and β0 is quite complicated, depending on the joint distribution of the failure time, the censoring time and the covariate. We now consider a special case where the true failure time distribution satisfies a one-parameter proportional hazards frailty model (Kosorok et al., 2004). The true conditional hazard function is
λ(t | X, W) = W λ0(t) exp(β′0X),
where λ0(t) is an unspecified baseline hazard function and W is a positive continuous frailty variable with density fW(·) that is independent of X and has E(W) = 1. Let LW(t) = E{exp(−tW)} be the Laplace transform of W.
Additional conditions beyond Conditions 6–7 on the covariate distribution are needed. As in Kosorok et al. (2004), we assume the following conditions.
Condition 10. The censoring variable C is independent of T.
Condition 11. The true density of T given X exists and is bounded over [0, τ ] almost surely, and pr(T > τ|X) > 0 almost surely.
Condition 12. Both LW(ct) and c/LW(ct) are decreasing functions of c for each finite t > 0.
Condition 10 strengthens the usual conditional independence assumption. Condition 12 is a regularity condition on the frailty distribution that relaxes those in Kosorok et al. (2004), which have been shown to hold for many common distributions.
Corollary 2. Assume that ℬ is nonempty and convex in ℝp, and that Conditions 6–7 and 8–12 hold. Then β* = α*β0 for some nonzero scalar α*.
Corollary 2 follows immediately from Kosorok et al. (2004, Proposition 5). Theorem 2 and Corollary 2 imply that under model misspecification the adaptive lasso estimator may achieve selection consistency for the regression coefficients β0 in the true model.
4. Numerical studies
4.1. Simulations
We conducted simulations to study the variable selection performance of the adaptive lasso estimators under various model misspecification scenarios. Specifically, we consider the following scenarios.
Scenario 1. The true model is given by Y ∼ N{exp(β′0X), 0.5²}, but we fit a linear model of Y on X with the identity link function.
Scenario 2. The true model is given by pr(Y = 1 | X) = Φ(0.5 + β′0X), where Φ(·) is the cumulative distribution function of the standard normal distribution, but we fit a logistic regression of Y on X.
Scenario 3. The true model for the failure time is from a class of linear transformation models: H(T) = −β′0X + ∊, where H(·) is an unspecified monotone increasing function and the error term ∊ has a known density and is independent of X. Here we consider three error distributions: e∊ follows a Pareto(r) distribution with r = 1 and r = 2, and a standard log-normal distribution, where the Pareto(1) error corresponds to the proportional odds model. The linear transformation model with Pareto(r) error is equivalent to the one-parameter frailty model considered in § 3, with the frailty W following the gamma distribution with mean 1 and variance r. For each model, we fit the standard proportional hazards model.
For all scenarios, we consider a ten-dimensional covariate vector X = (X1,..., X10)′ generated from a multivariate normal distribution with mean 0, variance 1 and correlation between Xj and Xk given by 0.5^{|j−k|} for j ≠ k. The linearity Condition 7 is satisfied for multivariate normal covariates. We set β0 = (1.0, −0.9, 0, 0, −0.8, 0, 0, 0, 0, 0)′. In Scenario 3, H(t) = log{0.5(e^{2t} − 1)} and the censoring time C is generated from a uniform distribution on [0, τ], where τ is chosen to obtain 15% censoring. For Scenario 1, the effect size, defined as the absolute value of the full model estimates divided by the standard errors of the estimates for the important variables, ranges from 1.5 to 3.5, while for Scenarios 2 and 3 it ranges from 2 to 5.5. The adaptive lasso estimators with γ = 1 are computed using the R packages lars (R Development Core Team, 2012) for Scenario 1 and glmnet for Scenarios 2 and 3. The tuning parameter is chosen by minimizing bic.
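For concreteness, the following sketch (ours; names such as gen_scenario3 are not from the paper, and the censoring upper limit shown is illustrative rather than the value used by the authors) generates data from Scenario 3 with Pareto(r) error via the gamma-frailty representation noted above:

```r
# Generate Scenario 3 data: linear transformation model with Pareto(r) error,
# via the equivalent gamma frailty model with mean 1 and variance r.
library(MASS)

gen_scenario3 <- function(n, r = 1, cens_upper = 2) {
  p <- 10
  Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))                    # correlation 0.5^{|j-k|}
  X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
  beta0 <- c(1.0, -0.9, 0, 0, -0.8, rep(0, 5))
  W <- rgamma(n, shape = 1 / r, scale = r)                  # gamma frailty, mean 1, variance r
  Lambda0 <- -log(runif(n)) / (W * exp(drop(X %*% beta0)))  # Lambda0(T), since S(t|X,W) = exp{-W Lambda0(t) e^{X'beta0}}
  Tt <- 0.5 * log(1 + 2 * Lambda0)                          # invert Lambda0(t) = 0.5(exp(2t) - 1), i.e. H(t) = log Lambda0(t)
  C  <- runif(n, 0, cens_upper)                             # the paper picks the upper limit to give 15% censoring
  list(X = X, time = pmin(Tt, C), status = as.numeric(Tt <= C))
}
```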
For Scenario 1, we take n = 50, 100, 200, while for Scenarios 2 and 3, we take n = 100, 200, 400. For each setting, we run 500 simulations. The variable selection performance is summarized by the average number of correct zeros, the average number of incorrect zeros and the frequency that each variable is selected. For comparison, we also include the selection results using three alternative methods, the standard lasso, forward selection with aic tuning and forward selection with bic tuning. The results for Scenarios 1 and 2 are reported in Table 1 and for Scenario 3 in Table 2. As the sample size increases, the variable selection performance of the adaptive lasso estimators quickly improves: the average number of correct zeros is close to 7, the average number of incorrect zeros is close to 0 and the selection frequencies of the three important variables approach 500 while those for the unimportant variables decrease quickly. This confirms the theoretical finding that the adaptive lasso may have the variable selection consistency property under model misspecification when Condition 7 holds. In addition, the adaptive lasso and forward selection with bic tuning methods have very comparable selection performance under model misspecification, and are much better than the standard lasso and forward selection with aic tuning, which tend to select more unimportant variables even when sample size increases. This suggests that, under certain conditions, forward selection with bic tuning may also enjoy the selection consistency property under model misspecification, which needs to be formally investigated in future research.
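The summary statistics reported in Tables 1 and 2 can be computed as in the following sketch (ours; B denotes a replicate-by-coefficient matrix of fitted coefficient vectors):

```r
# Selection performance summaries: correct zeros, incorrect zeros and selection frequency.
selection_summary <- function(B, beta0) {
  sel  <- B != 0                                             # which coefficients were selected, per replicate
  zero <- beta0 == 0
  list(CZ   = mean(rowSums(!sel[, zero, drop = FALSE])),     # average number of correct zeros
       IZ   = mean(rowSums(!sel[, !zero, drop = FALSE])),    # average number of incorrect zeros
       freq = colSums(sel))                                  # selection frequency of each covariate
}
```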
Table 1.
Variable selection results for normal covariates: Scenarios 1 and 2. Columns X1–X10 give the number of times each covariate was selected out of 500 replicates

| n | Method | CZ | IZ | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Scenario 1 | | | | | | | | | | | | | |
| 50 | L | 5.59 | 0.12 | 487 | 460 | 148 | 130 | 495 | 94 | 81 | 74 | 90 | 89 |
| | AL | 6.49 | 0.12 | 487 | 471 | 44 | 46 | 483 | 33 | 28 | 34 | 35 | 34 |
| | FRA | 5.55 | 0.03 | 498 | 494 | 116 | 115 | 495 | 99 | 92 | 92 | 102 | 111 |
| | FRB | 6.47 | 0.04 | 496 | 490 | 53 | 39 | 492 | 37 | 27 | 40 | 35 | 35 |
| 100 | L | 5.81 | 0.01 | 500 | 496 | 130 | 114 | 500 | 94 | 64 | 60 | 60 | 71 |
| | AL | 6.76 | 0.01 | 500 | 498 | 19 | 17 | 497 | 26 | 15 | 18 | 10 | 15 |
| | FRA | 5.86 | 0.01 | 500 | 499 | 98 | 90 | 498 | 89 | 77 | 77 | 67 | 72 |
| | FRB | 6.73 | 0.01 | 500 | 498 | 24 | 21 | 496 | 23 | 16 | 18 | 18 | 15 |
| 200 | L | 5.91 | 0.00 | 500 | 499 | 119 | 114 | 500 | 88 | 61 | 64 | 41 | 60 |
| | AL | 6.90 | 0.00 | 500 | 499 | 6 | 5 | 500 | 11 | 8 | 7 | 6 | 9 |
| | FRA | 5.82 | 0.00 | 500 | 500 | 84 | 80 | 500 | 86 | 84 | 103 | 67 | 88 |
| | FRB | 6.83 | 0.00 | 500 | 499 | 16 | 11 | 500 | 15 | 14 | 9 | 10 | 10 |
| Scenario 2 | | | | | | | | | | | | | |
| 100 | L | 4.88 | 0.04 | 491 | 492 | 176 | 177 | 497 | 154 | 126 | 140 | 137 | 149 |
| | AL | 6.38 | 0.06 | 492 | 483 | 62 | 46 | 493 | 43 | 39 | 45 | 36 | 39 |
| | FRA | 5.62 | 0.01 | 500 | 499 | 114 | 106 | 497 | 91 | 82 | 96 | 103 | 97 |
| | FRB | 6.64 | 0.02 | 499 | 497 | 32 | 33 | 492 | 20 | 19 | 19 | 26 | 29 |
| 200 | L | 4.96 | 0.00 | 500 | 500 | 173 | 167 | 500 | 152 | 156 | 118 | 124 | 132 |
| | AL | 6.76 | 0.00 | 500 | 500 | 19 | 13 | 500 | 18 | 20 | 18 | 18 | 13 |
| | FRA | 5.81 | 0.00 | 500 | 500 | 86 | 72 | 500 | 92 | 94 | 73 | 87 | 89 |
| | FRB | 6.82 | 0.00 | 500 | 500 | 17 | 11 | 500 | 12 | 11 | 11 | 16 | 13 |
| 400 | L | 5.12 | 0.00 | 500 | 500 | 159 | 153 | 500 | 147 | 130 | 92 | 124 | 133 |
| | AL | 6.84 | 0.00 | 500 | 500 | 6 | 11 | 500 | 10 | 17 | 9 | 13 | 16 |
| | FRA | 5.76 | 0.00 | 500 | 500 | 88 | 80 | 500 | 101 | 91 | 78 | 83 | 101 |
| | FRB | 6.89 | 0.00 | 500 | 500 | 6 | 5 | 500 | 10 | 11 | 4 | 7 | 13 |
CZ, average number of correct zeros; IZ, average number of incorrect zeros; L, lasso; AL, adaptive lasso; FRA, forward selection with aic tuning; FRB, forward selection with bic tuning.
Table 2.
Variable selection results for normal covariates: Scenario 3. Columns X1–X10 give the number of times each covariate was selected out of 500 replicates

| n | Method | CZ | IZ | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Scenario 3: Pareto(1) error | | | | | | | | | | | | | |
| 100 | L | 5.39 | 0.39 | 427 | 418 | 131 | 142 | 461 | 106 | 108 | 101 | 103 | 115 |
| | AL | 6.41 | 0.50 | 422 | 401 | 50 | 50 | 426 | 45 | 47 | 26 | 43 | 36 |
| | FRA | 5.34 | 0.06 | 498 | 491 | 116 | 132 | 481 | 126 | 107 | 114 | 121 | 114 |
| | FRB | 6.50 | 0.25 | 469 | 459 | 43 | 51 | 449 | 38 | 30 | 26 | 30 | 32 |
| 200 | L | 5.21 | 0.02 | 496 | 492 | 146 | 155 | 500 | 142 | 108 | 109 | 117 | 118 |
| | AL | 6.51 | 0.05 | 496 | 482 | 33 | 36 | 496 | 37 | 33 | 37 | 40 | 31 |
| | FRA | 5.51 | 0.01 | 500 | 497 | 101 | 115 | 499 | 102 | 97 | 103 | 115 | 111 |
| | FRB | 6.64 | 0.02 | 499 | 496 | 30 | 21 | 495 | 31 | 18 | 32 | 27 | 22 |
| 400 | L | 5.42 | 0.00 | 499 | 499 | 149 | 129 | 500 | 127 | 86 | 97 | 95 | 108 |
| | AL | 6.74 | 0.00 | 500 | 500 | 18 | 19 | 500 | 26 | 18 | 12 | 16 | 20 |
| | FRA | 5.53 | 0.00 | 500 | 500 | 97 | 114 | 500 | 114 | 98 | 95 | 103 | 116 |
| | FRB | 6.80 | 0.00 | 500 | 500 | 10 | 19 | 500 | 16 | 10 | 10 | 15 | 18 |
| Scenario 3: Pareto(2) error | | | | | | | | | | | | | |
| 100 | L | 6.23 | 1.70 | 210 | 175 | 65 | 76 | 266 | 61 | 39 | 46 | 54 | 45 |
| | AL | 6.58 | 1.65 | 227 | 197 | 35 | 29 | 251 | 39 | 26 | 25 | 35 | 22 |
| | FRA | 5.37 | 0.53 | 449 | 397 | 128 | 133 | 390 | 116 | 104 | 111 | 113 | 111 |
| | FRB | 6.46 | 1.14 | 332 | 275 | 47 | 60 | 324 | 44 | 25 | 34 | 32 | 30 |
| 200 | L | 5.77 | 0.64 | 385 | 351 | 103 | 115 | 443 | 94 | 59 | 74 | 90 | 80 |
| | AL | 6.51 | 0.65 | 395 | 358 | 39 | 41 | 420 | 42 | 24 | 38 | 33 | 30 |
| | FRA | 5.48 | 0.10 | 496 | 482 | 109 | 115 | 474 | 111 | 89 | 109 | 120 | 105 |
| | FRB | 6.60 | 0.45 | 440 | 401 | 51 | 37 | 435 | 25 | 15 | 26 | 24 | 21 |
| 400 | L | 5.50 | 0.06 | 490 | 479 | 134 | 140 | 500 | 126 | 81 | 87 | 91 | 92 |
| | AL | 6.59 | 0.09 | 487 | 476 | 28 | 36 | 493 | 41 | 26 | 23 | 31 | 20 |
| | FRA | 5.57 | 0.00 | 499 | 500 | 99 | 103 | 500 | 117 | 95 | 105 | 95 | 100 |
| | FRB | 6.82 | 0.04 | 496 | 491 | 15 | 17 | 493 | 15 | 11 | 11 | 9 | 14 |
| Scenario 3: normal error | | | | | | | | | | | | | |
| 100 | L | 4.77 | 0.00 | 500 | 500 | 188 | 161 | 500 | 154 | 146 | 153 | 154 | 158 |
| | AL | 6.57 | 0.00 | 500 | 500 | 35 | 28 | 500 | 30 | 34 | 32 | 23 | 32 |
| | FRA | 5.37 | 0.00 | 500 | 500 | 127 | 111 | 500 | 109 | 120 | 115 | 111 | 122 |
| | FRB | 6.53 | 0.00 | 500 | 500 | 40 | 33 | 500 | 30 | 40 | 33 | 26 | 31 |
| 200 | L | 4.80 | 0.00 | 500 | 500 | 193 | 183 | 500 | 156 | 139 | 143 | 135 | 150 |
| | AL | 6.80 | 0.00 | 500 | 500 | 16 | 18 | 500 | 15 | 10 | 12 | 11 | 20 |
| | FRA | 5.48 | 0.00 | 500 | 500 | 112 | 111 | 500 | 96 | 103 | 111 | 108 | 117 |
| | FRB | 6.70 | 0.00 | 500 | 500 | 25 | 25 | 500 | 21 | 9 | 20 | 26 | 31 |
| 400 | L | 4.90 | 0.00 | 500 | 500 | 181 | 170 | 500 | 152 | 144 | 130 | 119 | 153 |
| | AL | 6.82 | 0.00 | 500 | 500 | 14 | 15 | 500 | 13 | 15 | 14 | 11 | 9 |
| | FRA | 5.39 | 0.00 | 500 | 500 | 115 | 123 | 500 | 103 | 120 | 114 | 102 | 130 |
| | FRB | 6.72 | 0.00 | 500 | 500 | 21 | 23 | 500 | 18 | 21 | 21 | 14 | 20 |
CZ, average number of correct zeros; IZ, average number of incorrect zeros; L, lasso; AL, adaptive lasso; FRA, forward selection with aic tuning; FRB, forward selection with bic tuning.
Next, we conduct simulations when the linearity Condition 7 is violated, with a skewed distribution for the covariates in X. Conditionally on a positive random variable α, the jth (j = 1,..., 10) covariate is generated from an exponential distribution with mean α, where α is generated from a gamma distribution with mean 1 and variance σ². We choose σ² = 0 or 1, where σ² = 0 implies that Xj and Xk are mutually independent for j ≠ k. We consider Scenarios 1 and 2 described above. The simulation results are given in the Supplementary Material. For Scenario 1 with independent covariates, the adaptive lasso estimators tend to miss some important covariates, while with correlated covariates they tend to select more unimportant covariates, with no improvement as the sample size increases. For Scenario 2, the selection performance of the adaptive lasso estimators improves when the sample size increases and is very comparable with that for the normal covariates reported in Table 1. This suggests that even when the linearity Condition 7 is violated, the adaptive lasso may have the variable selection consistency property under certain types of model misspecification. As in the previous simulations, the adaptive lasso and forward selection with bic tuning methods have very comparable selection performance, and they are better than the standard lasso and forward selection with aic tuning for Scenario 2 by selecting fewer unimportant variables. However, for Scenario 1, the advantage of the adaptive lasso and forward selection with bic tuning methods is not obvious since they also tend to miss more important variables.
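The skewed covariates can be generated as in the sketch below (ours; it takes α to be drawn independently for each subject and shared across that subject's ten covariates, which is how we read the description above):

```r
# Skewed, possibly correlated covariates that violate the linearity Condition 7.
gen_skewed_X <- function(n, p = 10, sigma2 = 1) {
  alpha <- if (sigma2 == 0) rep(1, n) else rgamma(n, shape = 1 / sigma2, scale = sigma2)  # mean 1, variance sigma2
  matrix(rexp(n * p, rate = 1), n, p) * alpha    # row i: independent exponentials with mean alpha_i
}
```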
Additional simulations were also conducted to compare the selection performance of the adaptive lasso estimates under the true and misspecified models, to consider models with varying effect sizes and to check the asymptotic distributions established in Theorem 1. The results are given in the Supplementary Material. The results show that if Condition 7 is satisfied, the selection performance of adaptive lasso is very similar under the true and misspecified models. In addition, for moderate sample sizes, the sampling distribution of the penalized estimators agrees reasonably well with the limiting distributions.
4.2. Application to diabetic retinopathy data
To further illustrate the robustness of the adaptive lasso method under model misspecification, we analyse data from the Wisconsin Epidemiological Study of Diabetic Retinopathy. In this study, the baseline examination of patients with retinopathy occurred in 1980–1982, with additional data collected at 4-, 10-, 14- and 20-year follow-ups. Study details may be found in Klein et al. (1984, 1989, 1998). The current analysis employs the binary responses of 648 subjects coding the status of 4-year progression of retinopathy. As in previous analyses (Wahba et al., 1995; Zhang et al., 2004), we consider 14 potential risk factors, comprising nine continuous covariates, namely the duration of diabetes at baseline examination, glycosylated haemoglobin, body mass index, systolic blood pressure, retinopathy level, pulse rate in 30 s, insulin dose per day, years of school completed and intraocular pressure, and five categorical covariates, namely smoking status, sex, use of at least one aspirin for at least 3 months while diabetic, family history of diabetes and marital status.
We fit binary regression models with different link functions g in the model for the probability of retinopathy progression at 4 years, E(Y | X) = g(β′X). Five link functions are examined: probit, log-log and g(u) = 1 − (1 + r e^u)^{−1/r} with r = 0.5, 1 and 2, where r = 1 corresponds to the logistic link. The adaptive lasso method with γ = 1 is used for variable selection, with the tuning parameter chosen using bic. The adaptive lasso estimates for the regression parameters under the different link functions are obtained using the least squares approximation method of Wang & Leng (2007), and the nonzero components of the adaptive lasso estimates are summarized in Table 3. In addition, the standard errors for the nonzero estimates, obtained from the asymptotic distribution established in Theorem 1, are given in parentheses. The adaptive lasso estimators select the same two covariates, glycosylated haemoglobin and body mass index, under all five link functions, although the estimated covariate effects may differ. The adaptive lasso method thus exhibits quite robust selection performance across models, even though Condition 7 would seem to be violated owing to the categorical covariates. This is further exemplified by comparing the ratios of the three nonzero coefficients between any two models, which are rather constant, as predicted by Corollary 1 when Condition 7 is satisfied. Increased glycosylated haemoglobin and increased body mass index increase the risk of disease progression, with the relative effects being comparable across models.
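The least squares approximation can be implemented along the following lines (a sketch under our assumptions, not the authors' code: the pseudo-data construction and the BIC variant are one possible convention, the intercept would in practice be left unpenalized, and V_tilde denotes an estimate of the covariance matrix of the unpenalized estimate):

```r
# Adaptive lasso via the least squares approximation of Wang & Leng (2007): a minimal sketch.
library(glmnet)

lsa_adalasso <- function(beta_tilde, V_tilde, n, gamma = 1) {
  R <- chol(solve(V_tilde))                       # so (b - beta_tilde)' V^{-1} (b - beta_tilde) = ||ystar - R b||^2
  ystar <- drop(R %*% beta_tilde)                 # pseudo response
  w <- abs(beta_tilde)^(-gamma)                   # adaptive weights
  fit <- glmnet(R, ystar, penalty.factor = w, intercept = FALSE, standardize = FALSE)
  rss <- colMeans((ystar - predict(fit, R))^2)    # pseudo residual mean squares over the lambda path
  bic <- n * log(rss) + fit$df * log(n)           # one BIC variant
  as.vector(coef(fit, s = fit$lambda[which.min(bic)]))[-1]
}
```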
Table 3.
Estimation and selection results for diabetic retinopathy data; columns correspond to the five link functions

| Covariates | r = 0.5 | r = 1 | r = 2 | Probit | Log-Log |
|---|---|---|---|---|---|
| Intercept | −5.176 (0.679) | −6.011 (0.738) | −7.567 (0.933) | −3.637 (0.607) | −4.335 (0.649) |
| Gly | 0.391 (0.044) | 0.468 (0.050) | 0.617 (0.068) | 0.284 (0.038) | 0.311 (0.040) |
| BMI | 0.025 (0.019) | 0.032 (0.020) | 0.046 (0.023) | 0.019 (0.018) | 0.019 (0.018) |
Gly, glycosylated haemoglobin; BMI, body mass index.
5. Remarks
While the theoretical findings in the current paper were established for the adaptive lasso under model misspecification, we anticipate that other shrinkage methods will have similar robustness properties as long as the sparsity of the least false parameter is equivalent to that in the true parameter. The objective function, not the penalty function, is critical in determining whether the necessary conditions are met. If Corollary 1 holds for the generalized linear model or Corollary 2 holds for the proportional hazards model, any suitably penalized likelihood or partial likelihood estimator will achieve sparsity, selection consistency and the oracle property under misspecification. Further work is needed to rigorously establish such results for other variable selection techniques, like the smoothly clipped absolute deviation.
In an as yet unpublished 2009 report from Cornell University, G. V. Rocha, X. Wang and B. Yu studied penalized M-estimation using an L1 penalty, such as the lasso, and established selection consistency for the true model under Gaussian covariates. They do not consider whether their estimators are oracle, in the sense of having the same limiting distribution as that of an estimator for which the zero parameters are known a priori. Our results go further in several respects. First, we study penalized estimation with right censoring under misspecified proportional hazards models, which does not fit into the M-estimation framework of the report by Rocha et al., and obtain selection consistency for the true model under certain conditions on the covariate and censoring distributions. Secondly, we establish selection consistency for M-estimators under weaker conditions than in the report by Rocha et al. Thirdly, we obtain a general oracle result for M-estimators and misspecified proportional hazards models that does not require selection consistency.
The setting considered in the current paper is that where the number of regression parameters p is fixed and small relative to the sample size n, which is assumed to grow. There is recent interest in the case where p grows with n at some rate. Here, one must first establish that the least false parameter from the penalized estimation procedure is well defined under the model being estimated. This will presumably require sparseness, in the sense that the number of nonzero parameters in the least false parameter must grow at a certain rate. Such sparseness may hold under the fitted model even when it does not hold under the true model. Conversely, it is possible that sparseness holds under the true model but that the penalized least false parameter is not well defined owing to a lack of sparseness under the misspecified model.
With large p, once the existence of the least false parameter has been established, one may study the properties of the corresponding penalized estimator. One may derive conditions such that this estimator is oracle, in the sense of having the same large sample properties as an estimator in which the zero components of the least false parameter are known a priori. Finally, one can determine whether the nonzero components of the least false parameter match those of the regression parameter from the true model. If so, then consistent variable selection is achievable. The primary technical difficulty is showing that there is sufficient information to permit the definition of the least false parameter from the penalized estimation procedure, which is defined implicitly as the limiting value of the resulting estimator. In situations where p grows too fast, this limiting value may not exist with correctly specified models, as discussed in Fan & Peng (2004), Lam & Fan (2008) and Zou & Zhang (2009). One can adapt such proofs, being careful to state the necessary regularity conditions under a potentially misspecified model. This warrants further study.
Acknowledgments
We would like to thank Michael Kosorok for helpful discussions regarding misspecification of frailty models. We also thank the editor, an associate editor and three referees for very insightful comments. Lu and Fine’s research was supported by the National Cancer Institute.
Appendix
Proof of Theorem 1. The proof consists of three steps. The first step studies the behaviour of the penalized loglikelihood at β = β* + n^{−1/2}u for fixed u. The second step proves the asymptotic normality of β̂(1). The third step shows the sparsity property, i.e., that pr(β̂(1)2 = 0) → 1.
Define
Hn(u) = Σ_{i=1}^n {ℓ(Vi; β* + n^{−1/2}u) − ℓ(Vi; β*)} − λn Σ_{j=1}^p ŵnj(|β*j + n^{−1/2}uj| − |β*j|). (A1)
Let ûn = argmax_u Hn(u); then ûn = n^{1/2}(β̂(1) − β*).
In the following, we study the asymptotic behaviour of (A1) under the generalized linear model. By Taylor expansion, we obtain
Σ_{i=1}^n {ℓ(Vi; β* + n^{−1/2}u) − ℓ(Vi; β*)} = n^{−1/2} u′ Σ_{i=1}^n ℓ̇(Vi; β*) + (2n)^{−1} u′ {Σ_{i=1}^n ℓ̈(Vi; β̌)} u, (A2)
for some β̌ between β* + n^{−1/2}u and β*.
By White (1982, Theorem 3.2), n^{−1/2} Σ_{i=1}^n ℓ̇(Vi; β*) → N(0, B(1)) in distribution. By the law of large numbers, n^{−1} Σ_{i=1}^n ℓ̈(Vi; β*) → A(1) in probability. By Condition 5, we obtain that n^{−1} Σ_{i=1}^n ℓ̈(Vi; β̌) → A(1) in probability. It follows from White (1982, Theorem 2.2) that β̃(1) → β* in probability. Thus, if β*j ≠ 0,
ŵnj = |β̃(1)j|^{−γ} → |β*j|^{−γ}
in probability, and thus
λn ŵnj(|β*j + n^{−1/2}uj| − |β*j|) → 0
in probability. If β*j = 0, then n^{1/2}(|β*j + n^{−1/2}uj| − |β*j|) = |uj| and n^{−1/2}λn ŵnj = n^{(γ−1)/2}λn |n^{1/2}β̃(1)j|^{−γ}, where n^{1/2}β̃(1)j = Op(1) by White (1982, Theorem 3.2). Thus, similar to Zou (2006), we obtain that
λn Σ_{j=1}^p ŵnj(|β*j + n^{−1/2}uj| − |β*j|) → 0 if uj = 0 for all j ∉ 𝒜, and → ∞ otherwise, (A3)
in probability. Summarizing, for every u, Hn(u) → H(u) in probability, where
H(u) = u1′W + u1′A11u1/2 if uj = 0 for all j ∉ 𝒜, and H(u) = −∞ otherwise. (A4)
Here 𝒜 is the index set of nonzero components in the least false parameter β*, u1 is the vector of the first p1 components of u and W ∼ N(0, B11).
Next, we show that ûn → û in probability, where û is the maximizer of H(u). Note that Hn(u) and H(u) are stochastic processes with sample paths that are upper semicontinuous and H(u) possesses a unique maximum at û = {(−A11^{−1}W)′, 0′}′. The inverse of A11 is well defined since A11 is symmetric and negative definite. To establish that ûn = Op(1), we will show that, with probability tending to one, there is a local maximizer β̂(1) of the penalized loglikelihood such that ‖β̂(1) − β*‖ = Op(n^{−1/2}). In particular, for any ɛ > 0, there exists a constant C such that
pr{ sup_{‖u‖ = C} Hn(u) < Hn(0) } ⩾ 1 − ɛ. (A5)
By Taylor expansion of (A2), we obtain
Hn(u) = n^{−1/2} u′ Σ_{i=1}^n ℓ̇(Vi; β*) + (2n)^{−1} u′ {Σ_{i=1}^n ℓ̈(Vi; β̌)} u − λn Σ_{j=1}^p ŵnj(|β*j + n^{−1/2}uj| − |β*j|).
Note that n^{−1/2} Σ_{i=1}^n ℓ̇(Vi; β*) = Op(1), n^{−1} Σ_{i=1}^n ℓ̈(Vi; β̌) → A(1) in probability, ||β*j + n^{−1/2}uj| − |β*j|| ⩽ n^{−1/2}|uj|, and that n^{−1/2}λn → 0, and
λn Σ_{j∈𝒜} ŵnj(|β*j + n^{−1/2}uj| − |β*j|) → 0
in probability. Since A(1) is negative definite by Condition 4, taking C large enough, the quadratic term (2n)^{−1} u′{Σ_{i=1}^n ℓ̈(Vi; β̌)}u dominates the remaining terms on ‖u‖ = C, and thus (A5) holds and ûn is uniformly tight. Hence, all the conditions of the argmax theorem (Kosorok, 2008, Theorem 14.1) hold, and consequently ûn → û in probability. It follows that ûn1 → û1 = −A11^{−1}W and ûn2 → 0 in probability, and hence n^{1/2}(β̂(1)1 − β*1) → N(0, A11^{−1}B11A11^{−1}) in distribution.
The sparsity proof generalizes the proof of Theorem 4 in Zou (2006). For all j ∈ 𝒜, the asymptotic normality above gives that pr(j ∈ 𝒜n) → 1, where 𝒜n is the index set of nonzero components in the adaptive lasso estimate β̂(1). Thus, it suffices to show that for every j′ ∉ 𝒜, pr(j′ ∈ 𝒜n) → 0. Consider the event j′ ∈ 𝒜n. By the Karush–Kuhn–Tucker optimality conditions, we have
Σ_{i=1}^n ℓ̇j′(Vi; β̂(1)) = λn ŵnj′ sgn(β̂(1)j′),
where ℓ̇j′ denotes the j′th component of ℓ̇. Thus, n^{−1/2}|Σ_{i=1}^n ℓ̇j′(Vi; β̂(1))| = n^{−1/2}λn ŵnj′. By Taylor expansion,
n^{−1/2} Σ_{i=1}^n ℓ̇j′(Vi; β̂(1)) = n^{−1/2} Σ_{i=1}^n ℓ̇j′(Vi; β*) + {n^{−1} Σ_{i=1}^n ℓ̈j′(Vi; β̌)} n^{1/2}(β̂(1) − β*),
where β̌ is between β* and β̂(1) and ℓ̈j′ denotes the j′th row of ℓ̈. By Condition 3, and the fact that n^{1/2}(β̂(1) − β*) converges in distribution to a normal random vector, both terms on the right-hand side converge in distribution to normal random variables. By Condition 5, n^{−1} Σ_{i=1}^n ℓ̈j′(Vi; β̌) converges in probability. On the other hand, n^{−1/2}λn ŵnj′ = n^{(γ−1)/2}λn |n^{1/2}β̃(1)j′|^{−γ} → ∞ in probability. Thus, pr(j′ ∈ 𝒜n) → 0 and the proof is complete.
Proof of Theorem 2. Define Hn(u) as in (A1), with Σ_{i=1}^n ℓ(Vi; ·) replaced by the log partial likelihood ℓn(2)(·) and ŵnj = |β̃(2)j|^{−γ}. Then
Hn(u) = n^{−1/2} u′ ℓ̇n(2)(β*) + (2n)^{−1} u′ ℓ̈n(2)(β̌) u − λn Σ_{j=1}^p ŵnj(|β*j + n^{−1/2}uj| − |β*j|), (A6)
where β̌ is on a line segment between β* + n^{−1/2}u and β*, and ℓ̇n(2) and ℓ̈n(2) denote the first and second derivatives of ℓn(2). It follows from Lin & Wei (1989) that n^{−1/2}ℓ̇n(2)(β*) → N(0, B(2)) in distribution and that n^{−1}ℓ̈n(2)(β̌) → A(2) in probability. The asymptotic behaviour of the penalty term is given by (A3), where the facts that ŵnj → |β*j|^{−γ} in probability when β*j ≠ 0 and that n^{1/2}β̃(2)j = Op(1) when β*j = 0 follow from Sasieni (1993, Corollary 4.1). Therefore, for every u, Hn(u) → H(u) in probability, where H(u) is given by (A4), with A(1) and B(1) replaced by A(2) and B(2).
We would like to show that ûn → û in probability, where û is the maximizer of H(u). If ûn = Op(1), then the argmax theorem (Kosorok, 2008, Theorem 14.1) can be applied, as in the proof for the generalized linear model. We need that, for any given ɛ > 0, there exists a constant C such that
pr{ sup_{‖u‖ = C} Hn(u) < Hn(0) } ⩾ 1 − ɛ. (A7)
By Taylor expansion of (A6), we obtain
Hn(u) ⩽ n^{−1/2} u′ ℓ̇n(2)(β*) + (2n)^{−1} u′ ℓ̈n(2)(β̌) u − λn Σ_{j∈𝒜} ŵnj(|β*j + n^{−1/2}uj| − |β*j|).
Note that n^{−1/2}ℓ̇n(2)(β*) = Op(1), n^{−1}ℓ̈n(2)(β̌) → A(2) in probability and λn Σ_{j∈𝒜} ŵnj(|β*j + n^{−1/2}uj| − |β*j|) → 0 in probability. Since −A(2)(β*) is positive definite (see Sasieni, 1993, Lemma 7.3), taking C large enough, the quadratic term dominates the other two terms on ‖u‖ = C, and thus (A7) holds and ûn is uniformly tight.
For sparsity, it suffices to show that for every j′ ∉ 𝒜, pr(j′ ∈ 𝒜n) → 0. Consider the event j′ ∈ 𝒜n. By the Karush–Kuhn–Tucker optimality conditions, we have
ℓ̇n(2)j′(β̂(2)) = λn ŵnj′ sgn(β̂(2)j′),
where ℓ̇n(2)j′ denotes the j′th component of ℓ̇n(2). Thus,
n^{−1/2}|ℓ̇n(2)j′(β̂(2))| = n^{−1/2}λn ŵnj′.
By Taylor expansion, we have
n^{−1/2} ℓ̇n(2)j′(β̂(2)) = n^{−1/2} ℓ̇n(2)j′(β*) + {n^{−1} ℓ̈n(2)j′(β̌)} n^{1/2}(β̂(2) − β*),
where β̌ is between β* and β̂(2) and ℓ̈n(2)j′ denotes the j′th row of ℓ̈n(2). Combining earlier approximations yields
n^{−1/2} ℓ̇n(2)(β*) → N(0, B(2))
in distribution, and
n^{−1} ℓ̈n(2)(β̌) → A(2)
in probability. By Sasieni (1993, Corollary 4.1), n^{1/2}(β̂(2) − β*) converges in distribution to a normal random vector. By applying Slutsky's theorem, we obtain that both terms on the right-hand side of the Taylor expansion converge in distribution to normal random variables. On the other hand, n^{−1/2}λn ŵnj′ = n^{(γ−1)/2}λn |n^{1/2}β̃(2)j′|^{−γ} → ∞ in probability. Thus, pr(j′ ∈ 𝒜n) → 0 and the proof is complete.
Supplementary material
Supplementary material available at Biometrika online includes additional simulation study results.
References
- Akaike H. Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika. 1973;60:255–65.
- Andersen PK, Gill RD. Cox's regression model for counting processes: a large sample study. Ann Statist. 1982;10:1100–20.
- Cox DR. Regression models and life-tables (with discussion). J. R. Statist. Soc. B. 1972;34:187–220.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Statist Assoc. 2001;96:1348–60.
- Fan J, Li R. Variable selection for Cox's proportional hazards model and frailty model. Ann Statist. 2002;30:74–99.
- Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. Ann Statist. 2004;32:928–61.
- Klein R, Klein B, Moss SE, Davis MD, DeMets DL. The Wisconsin epidemiologic study of diabetic retinopathy II: prevalence and risk of diabetes when age at diagnosis is less than 30 years. Arch Ophthal. 1984;102:520–6. doi: 10.1001/archopht.1984.01040030398010.
- Klein R, Klein B, Moss SE, Davis MD, DeMets DL. The Wisconsin epidemiologic study of diabetic retinopathy IX: four-year incidence and progression of diabetic retinopathy when age at diagnosis is less than 30 years. Arch Ophthal. 1989;107:237–43. doi: 10.1001/archopht.1989.01070010243030.
- Klein R, Klein B, Moss S, Cruickshanks K. The Wisconsin epidemiologic study of diabetic retinopathy XVII: the 14-year incidence and progression of diabetic retinopathy and associated risk factors in type 1 diabetes. Ophthalmology. 1998;105:1801–15. doi: 10.1016/S0161-6420(98)91020-X.
- Kosorok MR. Introduction to Empirical Processes and Semiparametric Inference. New York: Springer; 2008.
- Kosorok MR, Lee BL, Fine JP. Robust inference for univariate proportional hazards frailty regression models. Ann Statist. 2004;32:1448–91.
- Lam C, Fan J. Profile-kernel likelihood inference with diverging number of parameters. Ann Statist. 2008;36:2232–60. doi: 10.1214/07-AOS544.
- Li K-C, Duan N. Regression analysis under link violation. Ann Statist. 1989;17:1009–52.
- Lin DY, Wei LJ. The robust inference for the Cox proportional hazards model. J Am Statist Assoc. 1989;84:1074–8.
- Mallows CL. Some comments on Cp. Technometrics. 1973;15:661–75.
- McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. New York: Chapman and Hall; 1989.
- R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2012. ISBN 3-900051-07-0, http://www.R-project.org.
- Sasieni P. Some new estimators for Cox regression. Ann Statist. 1993;21:1721–59.
- Schwarz G. Estimating the dimension of a model. Ann Statist. 1978;6:461–4.
- Tibshirani RJ. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B. 1996;58:267–88.
- Tibshirani RJ. The lasso method for variable selection in the Cox model. Statist Med. 1997;16:385–95. doi: 10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3.
- Wahba G, Wang Y, Gu C, Klein R, Klein B. Smoothing spline ANOVA for exponential families, with application to the Wisconsin epidemiological study of diabetic retinopathy. Ann Statist. 1995;23:1865–95.
- Wang H, Leng C. Unified lasso estimation with least squares approximation. J Am Statist Assoc. 2007;102:1039–48.
- White H. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50:1–25.
- Zhang HH, Lu W. Adaptive lasso for Cox's proportional hazards model. Biometrika. 2007;94:691–703.
- Zhang HH, Wahba G, Lin Y, Voelker M, Ferris M, Klein R, Klein B. Variable selection and model building via likelihood basis pursuit. J Am Statist Assoc. 2004;99:659–72.
- Zou H. The adaptive lasso and its oracle properties. J Am Statist Assoc. 2006;101:1418–29.
- Zou H, Zhang HH. On the adaptive elastic-net with a diverging number of parameters. Ann Statist. 2009;37:1733–51. doi: 10.1214/08-AOS625.