Abstract
We propose a general strategy for variable selection in semiparametric regression models by penalizing appropriate estimating functions. Important applications include semiparametric linear regression with censored responses and semiparametric regression with missing predictors. Unlike the existing penalized maximum likelihood estimators, the proposed penalized estimating functions may not pertain to the derivatives of any objective functions and may be discrete in the regression coefficients. We establish a general asymptotic theory for penalized estimating functions and present suitable numerical algorithms to implement the proposed estimators. In addition, we develop a resampling technique to estimate the variances of the estimated regression coefficients when the asymptotic variances cannot be evaluated directly. Simulation studies demonstrate that the proposed methods perform well in variable selection and variance estimation. We illustrate our methods using data from the Paul Coverdell Stroke Registry.
Keywords: Accelerated failure time model, Buckley-James estimator, Censoring, Least absolute shrinkage and selection operator, Least squares, Linear regression, Missing data, Smoothly clipped absolute deviation
1. INTRODUCTION
A major challenge in regression analysis is to decide which predictors among many potential ones are to be included in the model. It is customary to use stepwise selection and subset selection. But these procedures are unstable and ignore the stochastic errors introduced by the selection process. Several methods, including bridge regression (Frank and Friedman 1993), least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996), smoothly clipped absolute deviation (SCAD) (Fan and Li 2001), elastic net (EN) (Zou and Hastie 2005), and adaptive lasso (ALASSO) (Zou 2006), have been proposed to select variables and estimate their regression coefficients simultaneously. These methods can be cast in the framework of penalized least squares and likelihood.
Consider the linear regression model

$$Y_i = \beta^T x_i + \varepsilon_i, \qquad i = 1, \ldots, n, \tag{1}$$

where Yi is the response variable, xi is a d-vector of predictors for the ith subject, β is a d-vector of regression coefficients, and (ε1, …, εn) are independent and identically distributed errors. For simplicity, assume that the εi's have mean 0. Define $l(\beta) = \|y - X\beta\|^2$, where $y = (Y_1, \ldots, Y_n)^T$ and $X = (x_1, \ldots, x_n)^T$. Then the penalized least squares estimator of β is the minimizer of the objective function

$$\frac{1}{2}\, l(\beta) + n \sum_{j=1}^{d} p_\lambda(|\beta_j|), \tag{2}$$

where pλ(·) is a penalty function. Appropriate choices of pλ (detailed in Sec. 2) yield the aforementioned variable selection procedures. For likelihood-based models, the penalized maximum likelihood estimator is obtained by setting l(β) to the minus log-likelihood.
For many semiparametric problems, the estimation of regression coefficients (without the task of variable selection) does not pertain to the minimization of any objective function. Important examples include weighted estimating equations for missing data (Robins, Rotnitzky, and Zhao 1994; Tsiatis 2006) and the Buckley–James estimator for semiparametric linear regression with censored responses (Buckley and James 1979). Another example arises from Lin and Ying’s (2001) semiparametric regression analysis of longitudinal data. For this example, Fan and Li (2004) proposed a variable selection method by incorporating the SCAD penalty into Lin and Ying’s estimator. They noted that their estimator may be cast in the form of (2), so that their earlier results (Fan and Li 2001) for penalized least squares could be applied. In this article we go beyond specific problems and provide a very general theory for a broad class of penalized estimating functions. In this regard, only Fu’s (2003) work on generalized estimating equations (GEEs) (Liang and Zeger 1986) with bridge penalty (Frank and Friedman 1993; Knight and Fu 2000) is similar. That work deals only with smooth estimating functions, whereas our theory applies to very general, possibly discrete estimating functions. In addition, we present general computational strategies.
The remainder of the article is organized as follows. We present our penalized estimating functions in Section 2, paying special attention to the aforementioned missing-data and censored-data problems. We state the asymptotic results in Section 3 and address implementation issues in Section 4. We report the results of our simulation studies in Section 5 and apply the methods to real data in Section 6.
2. PENALIZED ESTIMATING FUNCTIONS
2.1 General Setting
Suppose that $U(\beta) \equiv (U_1(\beta), \ldots, U_d(\beta))^T$ is an estimating function for $\beta \equiv (\beta_1, \ldots, \beta_d)^T$ based on a random sample of size n. For maximum likelihood estimation, U(β) is simply the score function. We are interested mainly in the situations in which U(β) is not a score function or the derivative of any objective function. A penalized estimating function is defined as

$$U^P(\beta) = U(\beta) - n\, q_\lambda(|\beta|)\, \mathrm{sgn}(\beta),$$

where $q_\lambda(|\beta|) = (q_{\lambda,1}(|\beta_1|), \ldots, q_{\lambda,d}(|\beta_d|))^T$, the $q_{\lambda,j}(\cdot)$, j = 1, …, d, are coefficient-dependent continuous functions, and the second term is the componentwise product of qλ and sgn(β). In most cases, $q_{\lambda,j}(x) = dp_{\lambda,j}(x)/dx$ for some penalty function pλ,j, and the functions qλ,j, j = 1, …, d, are the same for all d components of qλ(|β|), that is, qλ,j = qλ,k, j ≠ k. When the functions qλ,j, j = 1, …, d, do not vary with j, we drop the subscript j for simplicity.

When $q_\lambda = p'_\lambda$, we consider five penalty functions: (a) the LASSO penalty (Tibshirani 1996, 1997), $p_\lambda(|\theta|) = \lambda|\theta|$; (b) the hard thresholding penalty, $p_\lambda(|\theta|) = \lambda^2 - (|\theta| - \lambda)^2 I(|\theta| < \lambda)$; (c) the SCAD penalty (Fan and Li 2001, 2002, 2004), defined by

$$p'_\lambda(|\theta|) = \lambda \left\{ I(|\theta| < \lambda) + \frac{(a\lambda - |\theta|)_+}{(a - 1)\lambda}\, I(|\theta| \ge \lambda) \right\}$$

for a > 2; (d) the EN penalty (Zou and Hastie 2005), $p_\lambda(|\theta|) = \lambda_1|\theta| + \lambda_2\theta^2$; and (e) the ALASSO penalty (Zou 2006), $p_{\lambda,j}(|\theta|) = \lambda|\theta|\omega_j$, for a known data-driven weight ωj. In our applications we use the weight $\omega_j = 1/|\hat\beta^{(0)}_j|$, j = 1, …, d, where $\hat\beta^{(0)} \equiv (\hat\beta^{(0)}_1, \ldots, \hat\beta^{(0)}_d)^T$ refers to the d-vector of regression coefficient estimates obtained from solving the original estimating equation, U(β) = 0.
The hard thresholding penalty is important because it corresponds to best subset selection and stepwise deletion in certain cases. The LASSO (Tibshirani 1996, 1997) is one of the most popular shrinkage estimators, but it has some deficiencies; in particular, it is inconsistent for certain designs (Meinshausen and Bühlmann 2006; Zou 2006). Fan and Li (2001, 2002) attempted to avoid such deficiencies by constructing a new penalty function (SCAD) that results in an estimator that achieves an oracle property: that is, the estimator has the same limiting distribution as an estimator that knows the true model a priori. Recently, Zou (2006) introduced ALASSO, which, like SCAD, achieves the oracle property and may have numerical advantages for some problems. Finally, Zou and Hastie (2005) introduced the mixture penalty EN to effectively select “grouped” variables; this penalty is popular in the statistical analysis of large data sets.
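The five penalties enter the estimating function only through their derivatives qλ. The following is a minimal sketch of those derivatives, assuming the standard SCAD derivative of Fan and Li (2001) and assuming the penalized estimating function has the form U(β) − n qλ(|β|) sgn(β). The function names are illustrative and are not taken from any package.

```python
import numpy as np

def q_lasso(theta, lam):
    """LASSO: p(|t|) = lam * |t|, so q(|t|) = lam."""
    t = np.abs(np.asarray(theta, dtype=float))
    return np.full_like(t, lam)

def q_hard(theta, lam):
    """Hard thresholding: q(|t|) = 2 * (lam - |t|) for |t| < lam, else 0."""
    t = np.abs(np.asarray(theta, dtype=float))
    return 2.0 * (lam - t) * (t < lam)

def q_scad(theta, lam, a=3.7):
    """SCAD: q = lam on [0, lam], then (a*lam - |t|)_+ / (a - 1)."""
    t = np.abs(np.asarray(theta, dtype=float))
    tail = np.maximum(a * lam - t, 0.0) / (a - 1.0)
    return np.where(t <= lam, lam, tail)

def q_en(theta, lam1, lam2):
    """Elastic net: p(|t|) = lam1*|t| + lam2*t^2, so q(|t|) = lam1 + 2*lam2*|t|."""
    t = np.abs(np.asarray(theta, dtype=float))
    return lam1 + 2.0 * lam2 * t

def q_alasso(theta, lam, weights):
    """Adaptive LASSO: q_j = lam * w_j, with data-driven weights w_j."""
    t = np.abs(np.asarray(theta, dtype=float))
    return lam * np.asarray(weights, dtype=float) * np.ones_like(t)

def penalized_estfun(U, beta, q_vals, n):
    """Assumed form: U^P(beta) = U(beta) - n * q(|beta|) * sgn(beta), componentwise."""
    return np.asarray(U, dtype=float) - n * np.asarray(q_vals) * np.sign(beta)
```

The default a = 3.7 follows the choice the article adopts from Fan and Li.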
2.2 Application to Censored Data
Censoring is a common phenomenon in scientific studies (see Kalbfleisch and Prentice 2002, p. 12). The presence of censoring causes major complications in implementation of the penalized least squares approach, because the values of the Yi are unknown for the censored observations. The problem is much simpler for the proportional hazards regression because the partial likelihood (Cox 1972) plays essentially the same role as the standard likelihood (Tibshirani 1997; Fan and Li 2002; Cai, Fan, and Zhou 2005). However, the proportional hazards model may not be appropriate in some applications, especially when the response variable does not pertain to failure time.
Let Yi and Ci denote the response variable and censoring variable for the ith subject, i = 1, …, n. The data consist of (Ỹi, Δi, xi), i = 1, …, n, where Ỹi = min(Yi, Ci), Δi = I (Yi ≤ Ci) and xi is a d-vector of predictors. We relate Yi to xi through the semiparametric linear regression model given in (1), where εi are independent and identically distributed with an unspecified distribution function F (·). We assume that Yi is independent of Ci conditional on xi. When the response variable pertains to failure time, both Yi and Ci are commonly measured on the log scale, and model (1) is called the accelerated failure time model (Kalbfleisch and Prentice 2002, p. 44).
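Clearly,

$$E\{\Delta_i Y_i + (1 - \Delta_i)\, E(Y_i \mid \Delta_i = 0, \tilde Y_i, x_i) \mid x_i\} = \alpha + \beta^T x_i$$

and

$$E(Y_i \mid \Delta_i = 0, \tilde Y_i, x_i) = \beta^T x_i + \frac{\int_{e_i(\beta)}^{\infty} u\, dF(u)}{1 - F\{e_i(\beta)\}},$$

where α = E(εi) and $e_i(\beta) = \tilde Y_i - \beta^T x_i$. Thus Buckley and James (1979) proposed the estimating function for β,

$$U(\beta) = \sum_{i=1}^{n} (x_i - \bar x)\{\xi_i(\beta) - \beta^T x_i\}, \tag{3}$$

where $\bar x = n^{-1}\sum_{i=1}^{n} x_i$,

$$\xi_i(\beta) = \Delta_i \tilde Y_i + (1 - \Delta_i)\left[\frac{\int_{e_i(\beta)}^{\infty} u\, d\hat F(u; \beta)}{1 - \hat F\{e_i(\beta); \beta\}} + \beta^T x_i\right],$$

and F̂(t; β) is the Kaplan–Meier estimator of F(t) based on {ei(β), Δi}, i = 1, …, n. If Δi = 1 for all i, then the penalized estimating function UP(β) corresponding to (3) becomes the penalized least squares estimating function arising from (2). Thus the penalized Buckley–James estimator is a direct generalization of the penalized least squares estimator to censored data.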
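As an illustration of how a Buckley–James type estimating function can be evaluated, the sketch below imputes each censored response by its linear predictor plus the Kaplan–Meier conditional mean of the residuals beyond the censored residual. It assumes no tied residuals and adopts the common convention of treating the largest residual as uncensored; both are assumptions of this sketch, and the function names are illustrative.

```python
import numpy as np

def buckley_james_xi(beta, y_obs, delta, X):
    """Imputed responses: observed value if uncensored, otherwise x'beta plus
    the Kaplan-Meier conditional mean of residuals larger than e_i(beta)."""
    beta = np.asarray(beta, dtype=float)
    e = y_obs - X @ beta                        # residuals e_i(beta)
    order = np.argsort(e, kind="stable")
    e_s = e[order]
    d_s = np.asarray(delta, dtype=float)[order].copy()
    d_s[-1] = 1.0                               # assumption: largest residual treated as uncensored
    n = len(e)
    at_risk = n - np.arange(n)                  # n, n-1, ..., 1 (no ties assumed)
    surv = np.cumprod(1.0 - d_s / at_risk)      # 1 - F_hat at each sorted residual
    mass = np.concatenate(([1.0], surv[:-1])) - surv   # Kaplan-Meier jump sizes
    # Sums over residuals strictly larger than each sorted residual.
    num = np.concatenate((np.cumsum((mass * e_s)[::-1])[::-1][1:], [0.0]))
    den = np.concatenate((np.cumsum(mass[::-1])[::-1][1:], [0.0]))
    safe_den = np.where(den > 0, den, 1.0)
    cond_mean = np.where(den > 0, num / safe_den, e_s)
    imputed_s = np.where(d_s == 1.0, e_s, cond_mean)
    imputed = np.empty(n)
    imputed[order] = imputed_s                  # back to the original ordering
    return X @ beta + imputed

def buckley_james_estfun(beta, y_obs, delta, X):
    """Sum over i of (x_i - x_bar) * {xi_i(beta) - x_i'beta}."""
    beta = np.asarray(beta, dtype=float)
    xi = buckley_james_xi(beta, y_obs, delta, X)
    Xc = X - X.mean(axis=0)
    return Xc.T @ (xi - X @ beta)
```

With no censoring, the function reduces to the centered least squares normal equations, which gives a quick sanity check. Because the Kaplan–Meier weights change with the ordering of the residuals, the function is piecewise constant in parts and therefore discrete in β.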
2.3 Application to Missing Data
It often is difficult to have complete data on all study subjects. Let Ri be the missingness indicator for the ith subject, with the event {Ri = ∞} indicating that the ith subject has complete data. The observed data for the ith subject are Gr (Zi), where Gr (·) is the missingness operator acting on the full data Zi of the ith subject when Ri = r. In simple linear regression, for example, we may consider only Ri ∈ {1, 2, ∞} corresponding to G1(Zi) = {Yi}, G2(Zi) = {xi}, and G∞(Zi) = {Yi, xi} = Zi. The observed data are represented as {Ri, GRi (Zi), i = 1, …, n}. We focus on monotone missingness and make two assumptions: (a) P (Ri = ∞|Zi = z) > κ > 0 and (b) P (Ri = r|Zi = z) = P (Ri = r |Gr (z) = gr).
Consider the semiparametric linear regression model given in (1). The weighted complete-case estimating function takes the form

$$U(\beta) = \sum_{i=1}^{n} \frac{I(R_i = \infty)}{\pi(\infty, Z_i)}\, s_i(\beta),$$

where $s_i(\beta) = x_i(Y_i - \alpha - \beta^T x_i)$ and π(r, Gr(z)) = P(Ri = r | Gr(z) = gr). To improve efficiency, we adopt the strategy of Robins et al. (1994) and propose the estimating function
where λ̃r {Gr (Zi), η} = {1 + exp[−μr {Gr (Zi), η}]}−1, μr {Gr (Zi), η} is a linear predictor based on Gr (Zi) and η, , and Ẽ{si (β)|Gr (Zi)} is the conditional expectation of si (β) given Gr (Zi) under a posited parametric submodel for the full data-generating process.
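For orientation, the simpler weighted complete-case estimating function can be sketched as follows. Only subjects with complete data (R = ∞) contribute, each weighted by the inverse of its probability of being completely observed. The argument names, including `pi_complete`, are hypothetical, and the probabilities are taken as given rather than estimated from a missingness model.

```python
import numpy as np

def weighted_complete_case_estfun(beta, alpha, y, X, R, pi_complete):
    """Inverse-probability-weighted sum of s_i(beta) = x_i (Y_i - alpha - x_i'beta)
    over complete cases. R uses np.inf to mark complete data; rows of X for
    incomplete cases may contain NaN and are never touched."""
    beta = np.asarray(beta, dtype=float)
    complete = np.isinf(R)
    Xc = X[complete]
    resid = y[complete] - alpha - Xc @ beta
    weights = 1.0 / np.asarray(pi_complete, dtype=float)[complete]
    return Xc.T @ (weights * resid)
```

When every subject is complete and all weights equal 1, the function is the ordinary least squares normal equation for the slope coefficients.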
3. ASYMPTOTIC RESULTS
Fan and Li (2001) showed that the penalized least squares estimator minimizing (2), or more generally the penalized maximum likelihood estimator, with the SCAD or hard thresholding penalty behaves asymptotically as if the true model is known a priori—the so-called oracle property. We show that this property holds for a very broad class of penalized estimating functions, of which the Buckley–James and weighted estimating functions with the SCAD and hard thresholding penalty functions are special cases.
Let β0 ≡ (β01, …, β0d)T denote the true value of β. Without loss of generality, suppose that β0j ≠ 0 for j ≤ s and β0j = 0 for j > s. We impose the following conditions:
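C.1. There exists a nonsingular matrix A such that for any given constant M,

$$\sup_{|\beta - \beta_0| \le M n^{-1/2}} \left| n^{-1/2} U(\beta) - n^{-1/2} U(\beta_0) - n^{1/2} A(\beta - \beta_0) \right| = o_p(1).$$

Furthermore, $n^{-1/2} U(\beta_0) \to_d N(0, V)$ for V a d × d matrix.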
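C.2. The penalty function qλn(·) has the following properties:

(a) For nonzero fixed θ, $\lim n^{1/2} q_{\lambda_n}(|\theta|) = 0$ and $\lim q'_{\lambda_n}(|\theta|) = 0$.

(b) For any M > 0, $\lim n^{1/2} \inf_{|\theta| \le M n^{-1/2}} q_{\lambda_n}(|\theta|) = \infty$.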
Remark 1
Condition C.1 is not unusual and is satisfied by many commonly used estimating functions. This condition is implied by standard conditions for Z-estimators (van der Vaart and Wellner 1996, thm. 3.3).
Remark 2
Condition C.2 pertains to the choices of the penalty function and regularization parameter. This condition is key to obtaining the oracle property. In particular, condition C.2a prevents the jth element of the penalized estimating function from being dominated by the penalty term, qλn(|βj|) sgn(βj), for βj0 ≠ 0, because $n^{1/2} q_{\lambda_n}(|\beta_j|)$ vanishes. But if βj0 = 0, then condition C.2b implies that $n^{1/2} q_{\lambda_n}(|\beta_j|)\, \mathrm{sgn}(\beta_j)$ diverges to +∞ or −∞, depending on the sign of βj in a small neighborhood of βj0. Thus the jth element of the penalized estimating function is dominated by the penalty term, so that any consistent solution, say β̂, to the estimating equation UP(β) = 0 must satisfy β̂j = 0.
Remark 3
Condition C.2 is satisfied by several commonly used penalties with proper choices of the regularization parameter λn:
Under the hard penalty [i.e., $q_{\lambda_n}(|\theta|) = 2(\lambda_n - |\theta|) I(|\theta| < \lambda_n)$], it is straightforward to verify that condition C.2 holds if λn → 0 and $n^{1/2}\lambda_n \to \infty$.
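Under the SCAD penalty, that is,

$$q_{\lambda_n}(|\theta|) = \lambda_n \left\{ I(|\theta| < \lambda_n) + \frac{(a\lambda_n - |\theta|)_+}{(a - 1)\lambda_n}\, I(|\theta| \ge \lambda_n) \right\}$$

with a > 2, it is easy to see that if we choose λn → 0 and $n^{1/2}\lambda_n \to \infty$, then condition C.2 holds because $q_{\lambda_n}(|\theta|) = q'_{\lambda_n}(|\theta|) = 0$ for θ ≠ 0 and n sufficiently large, and $n^{1/2} q_{\lambda_n}(|\theta|) = n^{1/2}\lambda_n \to \infty$ for $|\theta| \le M n^{-1/2}$.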
For the ALASSO penalty, we assume that , nλn → ∞ and qλn (|θ|) = λnŵ for some data-dependent weight ŵ. First, n1/2 qλn (|θ|) = n1/2λnŵ → 0 and for |ŵ| < ∞ and θ ≠ 0. Second, to obtain sparsity, we require that the weights be sufficiently large for θ sufficiently small, say |θ| < Mn−1/2. For simplicity, suppose that the data-dependent weights are defined as ŵ = |θ̃|−γ for some γ > 0 and θ̃ pertaining to the solutions to the unpenalized estimating equations. Then, trivially, , which implies that , as desired. In this article we chose γ = 1 but Zou (2006, remarks 1 and 2) noted that other weights may be useful.
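When $q_{\lambda_n}(|\theta|) = \lambda_n/|\theta|$, condition C.2 is satisfied if $n^{1/2}\lambda_n \to 0$ and nλn → ∞. To see this, note that for θ ≠ 0, $n^{1/2} q_{\lambda_n}(|\theta|) = n^{1/2}\lambda_n/|\theta| \to 0$ and $q'_{\lambda_n}(|\theta|) = -\lambda_n/\theta^2 \to 0$, and that $n^{1/2} \inf_{|\theta| \le M n^{-1/2}} q_{\lambda_n}(|\theta|) = n\lambda_n/M \to \infty$. An anonymous referee pointed out that $q_{\lambda_n}(|\theta|) = \lambda_n/|\theta|$ pertains to $p_{\lambda_n}(|\theta|) = \lambda_n \log(|\theta|)$ on the original scale.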
Condition C.2 does not hold for the LASSO and EN penalty functions.
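The two requirements of condition C.2, as they are discussed in Remarks 2 and 3, can be checked numerically for a given penalty and rate. The sketch below evaluates n^{1/2} q(|θ|) at a fixed nonzero θ and n^{1/2} times the smallest value of q over |θ| ≤ M n^{-1/2}, using the illustrative rate λn = n^{-1/4} (so that λn → 0 and n^{1/2}λn → ∞). The helper names and the choice of rate are assumptions made for illustration.

```python
import numpy as np

def q_scad(theta, lam, a=3.7):
    """SCAD derivative (Fan and Li 2001)."""
    t = np.abs(np.asarray(theta, dtype=float))
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

def q_lasso(theta, lam):
    """LASSO derivative: constant lam."""
    t = np.abs(np.asarray(theta, dtype=float))
    return np.full_like(t, lam)

def c2_terms(q, n, lam_n, theta_fixed=1.0, M=1.0, grid=201):
    """Return (sqrt(n) * q(|theta_fixed|), sqrt(n) * min of q over |theta| <= M / sqrt(n))."""
    root_n = np.sqrt(n)
    term_a = root_n * float(q(theta_fixed, lam_n))
    near_zero = np.linspace(0.0, M / root_n, grid)
    term_b = root_n * float(np.min(q(near_zero, lam_n)))
    return term_a, term_b

for n in (1e4, 1e6):
    lam_n = n ** -0.25
    print(int(n), "SCAD:", c2_terms(q_scad, n, lam_n), "LASSO:", c2_terms(q_lasso, n, lam_n))
```

For SCAD the first term is zero once aλn falls below |θ| while the second grows with n. For LASSO both terms equal n^{1/2}λn, so they cannot tend to 0 and to ∞ at the same time, which is why the condition fails.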
To accommodate discrete estimating functions such as (3), we provide a formal definition of the solution to the penalized estimating equation. An estimator β̂= (β̂1, …, β̂d)T is called a zero-crossing to the penalized estimating equation if, for j =1, …, d,
where ej is the j th canonical unit vector. In addition, an estimator β̂ is called an approximate zero-crossing if
If UP is continuous, then the zero-crossing is an exact solution to the penalized estimating equation.
The following theorem states the main theoretical results regarding the proposed penalized estimators, including the existence of a root-n–consistent estimator, the sparsity of the estimator, and the asymptotic normality of the estimator.
Theorem 1
Define the number of nonzero coefficients s = #{j |βj0 ≠ 0}. Under conditions C.1 and C.2, the following results hold:
There exists a root-n–consistent approximate zero-crossing of UP (β), that is, β̂ = β0 + Op (n−1/2), such that β̂ is an approximate zero-crossing of UP (β).
For any root-n–consistent approximate zero-crossing of UP (β), denoted by β̂ ≡ (β̂1, …, β̂d)T, limn P (β̂j = 0 for j > s) = 1. Moreover, if we write β̂1 = (β̂1, …, β̂s)T and β01 = (β01, …, β0s)T, then
where A11, Σ11, and V11 are the first s × s submatrices of A, , and V, and bn = −(qλn(|β01|) × sgn(β01),…, qλn(|β0s|) sgn(β0s))T.
Let and U1(β) denote the first s-components of UP(β) and U(β), and let , where β1 denotes the first s-components of β and β2 denote the second (d −s)-components of β; that is, without loss of generality, β2 = 0. If is continuous in β1, then there exists β̂1 such that
that is, the solution is exact.
The proof of Theorem 1 is relegated to Appendix A. The asymptotic results for penalized weighted estimators readily follow from this theorem. Applying this theorem to the penalized Buckley–James estimators, we obtain the following result.
Corollary 1
Assume that condition C.2 holds in addition to the following three conditions:
D.1. There exists a constant c0 such that P(Ỹ − βT x < c0) < 1 for all β in some neighborhood of β0.
D.2. The random variable x has compact support.
D.3. F has finite Fisher information for location.
Then the conclusions of Theorem 1 follow.
Remark 4
Corollary 1 implies that the penalized Buckley–James estimators with the penalty functions satisfying condition C.2 have the oracle property. Conditions D.1–D.3 are the regularity conditions given by Ritov (1990, p. 306) to ensure that condition C.1 holds. The expressions for A and V were given by Ritov (1990) and Lai and Ying (1991a). The matrix V is directly estimable from the data, whereas A is not, because the latter involves the unknown density of the error term ε.
Remark 5
A result similar to Corollary 1 exists for the adaptive estimators presented in Section 2.3—namely, the penalized weighted estimators with SCAD, hard thresholding, and ALASSO penalties also have an oracle property. Technical conditions needed to obtain a strongly consistent estimator sequence and hence establish condition C.1 are given by Robins et al. (1994). Such technical conditions are assumed throughout the text of Tsiatis (2006), for example. The matrices A and V may be calculated directly; examples were given by Tsiatis (2006, chaps. 10 and 11).
Theorem 1 implies that the asymptotic covariance matrix of β̂1 is
and that a consistent estimator is given by
Other authors (e.g., Fu 2003) used the following alternative estimator for cov(β̂1):
Using the sandwich matrix Ω̃ actually produces a standard error estimate for the entire vector β̂, that is, both nonzero and zero coefficient estimates. By contrast, Ω̂11 implicitly fixes the zero coefficient estimates at their asymptotic value. In this article we use Ω̂11, in agreement with earlier work on variable selection by Fan and Li (2001, 2002, 2004), Cai et al. (2005), and Zou (2006). Note that the matrix Ω̂11 can be readily calculated when A and V can be evaluated directly. For discrete estimating functions such as the Buckley–James estimating function, A cannot be estimated reliably from the data. To solve this problem, we propose a resampling procedure.
Let denote the components of UP(β) corresponding to the regression coefficients with nonzero penalized estimating function estimates, and define as the solution to the estimating equation
(4) |
where (G1, …, Gn) are independent standard normal variables and (W11, …, W1n) are as given in Appendix B. In Appendix B we show that the conditional distribution of given the observed data is the same in the limit as the unconditional distribution of n1/2(β̂−β01) Thus we may estimate the covariance matrix of β̂1 and construct confidence intervals for individual regression coefficients using the empirical distribution of .
4. IMPLEMENTATION
In this article we use a majorize-minorize (MM) algorithm to estimate the penalized regression coefficients (Hunter and Li 2005). The MM algorithm may be viewed as a Fisher scoring (or Newton–Raphson) type algorithm for solving a perturbed penalized estimating equation and is closely related to the local quadratic algorithm (Tibshirani 1996; Fan and Li 2001). Using condition C.1 and the local quadratic approximations for penalty functions (Fan and Li 2001, sec. 3.3), the MM algorithm is
where β̂(0) is the solution to U(β) = 0 and
for ε a small number (ε = 10⁻⁶ in our examples). This algorithm requires that the estimating function U(β) be continuous, so that the asymptotic slope matrix A can be evaluated directly, as in the missing-data example. For general estimating functions, we propose the iterative algorithm
where β̂(0) is a minimizer of ||U(β)||. For the penalized Buckley–James estimator, there is a simple iterative algorithm,
where β̂(0) is the original Buckley–James estimator and ξ(β) = [ξ1(β), …, ξn(β)]T. In each algorithm, we iterate until convergence; the final solution is an approximate solution to the penalized estimating equation UP(β) = 0. To improve numerical stability, we standardize each predictor to have mean 0 and variance 1.
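To make the iteration concrete for the simplest case, the sketch below applies a local quadratic approximation to uncensored least squares with the SCAD penalty. It assumes the perturbed weights q(|βj|)/(ε + |βj|) on the diagonal and the unpenalized least squares fit as the starting value. This is a sketch for U(β) = Xᵀ(y − Xβ) only, not the article's general algorithm, and the function names are illustrative.

```python
import numpy as np

def q_scad(theta, lam, a=3.7):
    """SCAD derivative (Fan and Li 2001)."""
    t = np.abs(np.asarray(theta, dtype=float))
    return np.where(t <= lam, lam, np.maximum(a * lam - t, 0.0) / (a - 1.0))

def mm_penalized_ls(y, X, lam, a=3.7, eps=1e-6, n_iter=200):
    """Iterate beta <- (X'X + n * Sigma(beta))^{-1} X'y, where
    Sigma(beta) = diag{ q(|beta_j|) / (eps + |beta_j|) }."""
    n = X.shape[0]
    XtX = X.T @ X
    Xty = X.T @ y
    beta = np.linalg.solve(XtX, Xty)          # starting value: solves U(beta) = 0
    for _ in range(n_iter):
        sigma = q_scad(beta, lam, a) / (eps + np.abs(beta))
        beta = np.linalg.solve(XtX + n * np.diag(sigma), Xty)
    return beta

rng = np.random.default_rng(0)
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
X = rng.normal(size=(500, 8))
y = X @ beta_true + rng.normal(size=500)
beta_hat = mm_penalized_ls(y, X, lam=0.3)
print(np.round(beta_hat, 3))
```

Because of the ε perturbation, coefficients are driven to values very close to 0 rather than exactly 0, so in practice a small threshold is used to declare a coefficient zero.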
We need to choose λ for LASSO, ALASSO, and hard thresholding penalty functions, (a, λ) for the SCAD penalty and (λ1, λ2) for the EN penalty. Fan and Li (2001, 2002) showed that the choice of a ≡ 3.7 performs well in a variety of situations; we use their suggestion throughout our numerical analyses. Zou and Hastie (2005) showed that the EN estimator is equivalent to an ℓ1-penalty on augmented data. In the rest of this section, we include the subscript λ on β̂ (i.e., β̂λ) to stress the dependence of the estimator on the regularization parameter λ. In the case of EN penalty, it is understood that cross-validation is two-dimensional.
For uncensored data, Tibshirani (1996) and Fan and Li (2001) suggested the following generalized cross-validation (GCV) statistic (Wahba 1985):

$$\mathrm{GCV}(\lambda) = \frac{\mathrm{RSS}(\lambda)/n}{\{1 - d(\lambda)/n\}^2},$$

where RSS(λ) is the residual sum of squares $\|y - X\hat\beta_\lambda\|^2$, and d(λ) is the effective number of parameters, that is, $d(\lambda) = \mathrm{tr}[\{\hat A + \Sigma_\lambda(\hat\beta_\lambda)\}^{-1} \hat A^T]$. Note that the intercept is omitted in RSS(λ), because y may be centered at its sample mean. When the Yi's are potentially censored, d(λ) still may be considered the effective number of parameters; however, RSS(λ) is unknown. We propose estimating n⁻¹RSS(λ) by
where K̂(t) is the Kaplan–Meier estimator for K(t) = P(C > t), and . For missing data, we propose estimating n−1 RSS(λ) by
Both proposals are based on large-sample arguments: ν̂(λ) is a consistent estimator for lim n⁻¹RSS(λ) for fixed λ, under conditional independence between the censoring and failure time distributions for censored outcome data, and under the MAR assumption for missing data (cf. Tsiatis 2006, chap. 6). Thus our GCV statistic is

$$\mathrm{GCV}(\lambda) = \frac{\hat\nu(\lambda)}{\{1 - d(\lambda)/n\}^2},$$

and we select $\hat\lambda = \arg\min_\lambda \mathrm{GCV}(\lambda)$.
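The selection rule can be sketched in a few lines, assuming the usual form GCV(λ) = ν(λ)/{1 − d(λ)/n}² with ν(λ) an estimate of n⁻¹RSS(λ) and d(λ) = tr[{Â + Σλ}⁻¹Âᵀ]. The inputs `A_hat` and `sigma_diag` (the diagonal of Σλ at the fitted coefficients) are hypothetical names and are assumed to have been computed beforehand.

```python
import numpy as np

def effective_df(A_hat, sigma_diag):
    """d(lambda) = tr[ {A_hat + Sigma_lambda}^{-1} A_hat' ]."""
    A_hat = np.asarray(A_hat, dtype=float)
    return float(np.trace(np.linalg.solve(A_hat + np.diag(sigma_diag), A_hat.T)))

def gcv_statistic(nu_hat, A_hat, sigma_diag, n):
    """Assumed form: nu_hat / {1 - d(lambda)/n}^2."""
    d_eff = effective_df(A_hat, sigma_diag)
    return nu_hat / (1.0 - d_eff / n) ** 2

def select_lambda(lambdas, gcv_values):
    """Return the candidate lambda with the smallest GCV value."""
    return lambdas[int(np.argmin(gcv_values))]
```

With no penalty (Σλ = 0) the effective number of parameters equals the number of predictors, and it shrinks as the penalty weights grow.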
5. SIMULATION STUDIES
5.1 Censored Data
We simulated 1,000 data sets of size n from the model

$$Y_i = \beta^T x_i + \sigma\varepsilon_i,$$

where $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$ and εi and xi are independent standard normal with the correlation between the jth and kth components of x equal to $.5^{|j-k|}$. This model was considered by Tibshirani (1996) and Fan and Li (2001). We set the censoring distribution to be uniform(0, τ), where τ was chosen to yield approximately 30% censoring. We compared the model error, $\mathrm{ME} \equiv (\hat\beta - \beta)^T E(xx^T)(\hat\beta - \beta)$, of the proposed penalized estimator with that of the original Buckley–James estimator using the median relative model error (MRME). We also compared the average numbers of regression coefficients that are correctly or incorrectly shrunk to 0. The results are presented in Table 1, where oracle pertains to the situation in which we know a priori which coefficients are nonzero.
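A data-generating sketch for this design follows. It assumes the linear model Y = βᵀx + σε with correlation .5^{|j−k|} among the predictors, and it tunes τ by bisection on a large Monte Carlo sample, which is one possible way to reach roughly 30% censoring. The tuning routine and all function names are illustrative.

```python
import numpy as np

def ar1_cov(d, rho=0.5):
    """Covariance with entries rho^{|j-k|}; equals E(xx') for mean-zero x."""
    idx = np.arange(d)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def tune_tau(beta, sigma, target=0.30, n_mc=50000, seed=0):
    """Bisection for tau so that P(Y > C) is about `target`, with C ~ uniform(0, tau)."""
    rng = np.random.default_rng(seed)
    d = len(beta)
    x = rng.multivariate_normal(np.zeros(d), ar1_cov(d), size=n_mc)
    y = x @ beta + sigma * rng.normal(size=n_mc)
    u = rng.uniform(size=n_mc)
    lo, hi = 1e-6, 1e3
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if np.mean(y > mid * u) > target:   # too much censoring: increase tau
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def simulate_censored(n, beta, sigma, tau, rng):
    """Return (observed response min(Y, C), indicator I(Y <= C), predictors)."""
    d = len(beta)
    x = rng.multivariate_normal(np.zeros(d), ar1_cov(d), size=n)
    y = x @ beta + sigma * rng.normal(size=n)
    c = rng.uniform(0.0, tau, size=n)
    return np.minimum(y, c), (y <= c).astype(int), x

def model_error(beta_hat, beta, cov_x):
    """ME = (beta_hat - beta)' E(xx') (beta_hat - beta)."""
    diff = np.asarray(beta_hat, dtype=float) - np.asarray(beta, dtype=float)
    return float(diff @ cov_x @ diff)
```

The MRME for a method is then the median, over simulated data sets, of the ratio of its model error to that of the unpenalized estimator.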
Table 1.
Method | MRME (%) | Avg. no. of 0's, correct (C) | Avg. no. of 0's, incorrect (I) |
---|---|---|---|
n = 50, σ = 3 | | | |
SCAD | 69.48 | 4.73 | .35 |
Hard | 73.41 | 4.30 | .17 |
LASSO | 66.16 | 3.99 | .11 |
ALASSO | 57.77 | 4.40 | .17 |
EN | 76.48 | 3.54 | .08 |
Oracle | 32.76 | 5 | 0 |
n = 50, σ = 1 | | | |
SCAD | 40.11 | 4.78 | .01 |
Hard | 69.79 | 4.18 | .01 |
LASSO | 64.48 | 3.97 | .01 |
ALASSO | 48.21 | 4.90 | .01 |
EN | 95.55 | 3.49 | 0 |
Oracle | 31.30 | 5 | 0 |
The performance of the proposed estimator with the SCAD, hard thresholding, and ALASSO penalties approached that of the oracle estimator as n increased. When the signal-to-noise ratio was large (e.g., large n or small σ), oracle methods (SCAD, hard thresholding, ALASSO) outperformed LASSO and EN in terms of model error and model complexity. On the other hand, LASSO and EN tended to perform better than the oracle methods as σ/n increased.
Table 2 reports the results on the accuracy of the proposed resampling technique in estimating the variances of the nonzero estimated regression coefficients. The standard deviation (SD) is the median absolute deviation of the estimated regression coefficients divided by .6745. The median of the standard error estimates, denoted by SDm, gauges the performance of the resampling procedure. Evidently, the resampling procedure yielded reasonable standard error estimates, particularly for large n.
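The robust SD used here is straightforward to compute. A minimal sketch, taking the median absolute deviation about the median and dividing by .6745 so that the result is comparable to a standard deviation under normality (the function name is illustrative):

```python
import numpy as np

def robust_sd(estimates):
    """Median absolute deviation about the median, divided by 0.6745."""
    est = np.asarray(estimates, dtype=float)
    return float(np.median(np.abs(est - np.median(est))) / 0.6745)
```

Applied to the simulated estimates of a single coefficient, this gives the SD column, which is then compared with the median of the resampling-based standard errors (SDm).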
Table 2.
Method | β̂1: SD | β̂1: SDm | β̂2: SD | β̂2: SDm | β̂5: SD | β̂5: SDm |
---|---|---|---|---|---|---|
SCAD | .145 | .129 | .135 | .128 | .128 | .114 |
Hard | .151 | .130 | .145 | .129 | .138 | .119 |
LASSO | .160 | .134 | .145 | .143 | .161 | .130 |
ALASSO | .149 | .132 | .130 | .133 | .133 | .113 |
EN | .172 | .113 | .151 | .111 | .155 | .103 |
Oracle | .144 | .129 | .136 | .126 | .143 | .111 |
NOTE: SD refers to the median absolute deviation of the estimated regression coefficients divided by .6745; SDm, to the median of the standard error estimates. The table entries are for a sample size n = 100 and (error) standard deviation σ = 1.
5.2 Missing Data
We simulated 1,000 datasets of size n from the model
where εi and xi are independent standard normal with the correlation between the jth and kth components of x equal to $.5^{|j-k|}$. We considered two scenarios:
Model 1:
and
Model 2:
For a random design X, define the theoretical R2
For σ = 1 and 2, both models 1 and 2 have theoretical R² = .89 and .67, respectively. Although models 1 and 2 have the same theoretical R², they have differing numbers of nonzero coefficients; the number of nonzero coefficients over the total number of coefficients (i.e., d = 10) in a given model is sometimes referred to as the model fraction. The model fraction in model 1 is .6, whereas model 2 has a model fraction of .3. We simulated data such that subjects fall into one of three categories: R = 1 means that the subject was missing (x1, x2), R = 2 means that the subject was missing x1, and R = ∞ means that the subject had complete data. The observed data {R, GR(Z)} were generated in the following sequence of steps:
Simulate a Bernoulli random variable B1 with probability λ̃1{G1(Zi), η}.
If B1 = 1, then set R = 1; otherwise, continue.
Simulate a Bernoulli random variable B2 with probability λ̃2{G2(Zi), η}.
If B2 = 1, then set R = 2; otherwise, set R = ∞.
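The steps above can be sketched as follows. The probabilities for the two Bernoulli draws are passed in directly, and the logistic link mirrors the form {1 + exp(−μ)}⁻¹. The function names are illustrative, and any linear predictors supplied to `expit` are the user's own choice.

```python
import numpy as np

def expit(mu):
    """Logistic function {1 + exp(-mu)}^{-1}."""
    return 1.0 / (1.0 + np.exp(-np.asarray(mu, dtype=float)))

def simulate_missingness(lam1, lam2, rng):
    """Sequential draws: R = 1 if B1 = 1; otherwise R = 2 if B2 = 1;
    otherwise R = infinity (complete data)."""
    lam1 = np.asarray(lam1, dtype=float)
    lam2 = np.asarray(lam2, dtype=float)
    b1 = rng.random(lam1.shape) < lam1        # B1 ~ Bernoulli(lam1)
    b2 = rng.random(lam2.shape) < lam2        # B2 ~ Bernoulli(lam2), used only when B1 = 0
    return np.where(b1, 1.0, np.where(b2, 2.0, np.inf))
```

For instance, constant probabilities .2 and .25 give P(R = 1) = .2, P(R = 2) = .8 × .25 = .2, and P(R = ∞) = .6, that is, about 40% missing split evenly between the two patterns.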
We formulated the missingness process by logistic models
and
where
and
These models yielded approximately 40% missing with subjects falling in the R = 1 and R = 2 categories in roughly equal proportions.
Table 3 presents the numerical results with n = 250. Oracle methods (SCAD, hard thresholding, ALASSO) performed better than LASSO and EN in terms of relative model error and complexity when there were a few strong predictors of response, as in model 2; however, oracle methods tended to perform worse than LASSO and EN when there were many weakly significant predictors, as in model 1 with σ = 2.
Table 3.
Method | Model 1: MRME (%) | Model 1: avg. no. of 0's (C) | Model 1: avg. no. of 0's (I) | Model 2: MRME (%) | Model 2: avg. no. of 0's (C) | Model 2: avg. no. of 0's (I) |
---|---|---|---|---|---|---|
σ = 1 | | | | | | |
SCAD | 81.79 | 3.35 | .21 | 42.60 | 5.56 | 0 |
Hard | 82.38 | 3.37 | .25 | 48.73 | 5.79 | .01 |
LASSO | 87.88 | 2.42 | .09 | 66.49 | 4.11 | 0 |
ALASSO | 82.24 | 3.55 | .23 | 37.74 | 6.25 | 0 |
EN | 85.59 | 2.38 | .08 | 70.56 | 3.92 | 0 |
σ = 2 | | | | | | |
SCAD | 93.64 | 3.33 | .69 | 48.73 | 5.92 | .02 |
Hard | 90.10 | 3.70 | 1.12 | 46.24 | 6.37 | .05 |
LASSO | 82.29 | 2.54 | .40 | 59.96 | 4.56 | .02 |
ALASSO | 82.01 | 3.37 | .70 | 48.87 | 6.08 | .02 |
EN | 88.62 | 2.55 | .44 | 66.17 | 4.63 | .03 |
NOTE: For σ = 1 and σ = 2, models 1 and 2 have theoretical R² = .89 and .67, respectively; however, the number of nonzero coefficients is six in model 1 but only three in model 2.
6. THE PAUL COVERDELL STROKE REGISTRY
The Paul Coverdell National Acute Stroke Registry collects demographic, quantitative, and qualitative factors related to acute stroke care in four prototype states: Georgia, Massachusetts, Michigan, and Ohio (Paul Coverdell Prototype Registries Writing Group 2005). The goals of the registry include gaining a better understanding of factors associated with stroke and generally improving the quality of acute stroke care in the United States. For the purpose of illustration, we consider a subset of 800 patients with hemorrhagic or ischemic stroke from the Georgia prototype registry. Our data set includes nine predictors and a hospital length of stay (LOS) endpoint, defined as the number of days from hospital admission to hospital discharge. Conclusions from analyses like ours would be important to investigators in health policy and management, for example. The complete registry data for all four prototypes consist of several thousand hospital admissions and have not been released publicly. A more comprehensive analysis is ongoing.
Our data include the following nine predictors: Glasgow coma scale (GCS; 3–15, with 15 representing excellent health), serum albumin, creatinine, glucose, age, sex (1 if male), race (1 if white), whether or not the patient was admitted to the intensive care unit (ICU; 1 if yes), and stroke subtype (1 if hemorrhagic; 0 if ischemic). Of the 800 patients, 419 (52.4%) had complete data (i.e., R = ∞), 94 (11.8%) were missing both GCS and serum albumin (i.e., R = 1), and 287 (35.9%) were missing only GCS (i.e., R = 2).
Table 4 presents estimates for the nuisance parameter η in the stroke data. We see that the subjects missing both GCS and albumin (i.e., R = 1) tended to have higher creatinine and glucose levels but were less likely to be admitted to the ICU on admission to the hospital. Ischemic stroke and ICU admission were strongly associated with missing GCS score (i.e., R = 2) only. Because the missingness mechanism is related to other important prognostic variables, this is mild evidence that the missing completely at random (MCAR) assumption is not well supported, and variable selection techniques based on such an assumption may lead to incorrect conclusions. Our analyses using methods described in Section 2 assuming data missing at random (MAR) are displayed in Table 5.
Table 4.
Covariate | η1 | η2 |
---|---|---|
(int) | −2.342(.152) | .478(.082) |
Albumin | | −.112(.089) |
Creatinine | −.492(.291) | −.101(.091) |
Sex | −.172(.113) | .043(.079) |
Glucose | −.286(.164) | −.067(.084) |
ICU | −.470(.155) | −.304(.091) |
Age | .045(.124) | .006(.087) |
Type | −.101(.144) | −.213(.094) |
Race | .084(.122) | −.034(.085) |
LOS | −.007(.140) | −.045(.092) |
Table 5.
Predictor | Full | SCAD | Hard | LASSO | ALASSO | EN |
---|---|---|---|---|---|---|
GCS | −.762(.327) | −.603(.434) | −.864(.587) | −.681(.480) | −.584(.400) | −.628(.424) |
Albumin | −1.142(.306) | −.958(.450) | −1.043(.486) | −.984(.425) | −.876(.402) | −.882(.387) |
Creatinine | −.726(.331) | −.372(.177) | −.734(.347) | −.529(.255) | −.365(.179) | −.402(.199) |
Sex | −.007(.288) | 0(−) | 0(−) | 0(−) | 0(−) | 0(−) |
Glucose | −.312(.310) | 0(−) | 0(−) | −.140(.165) | 0(−) | −.030(.039) |
ICU | 1.861(.323) | 2.043(.442) | 1.970(.469) | 1.807(.419) | 1.947(.415) | 1.771(.392) |
Age | −.696(.324) | −.293(.203) | −.678(.465) | −.586(.369) | −.405(.260) | −.516(.312) |
Type | .553(.333) | .200(.155) | 0(−) | .448(.335) | .213(.158) | .381(.273) |
Race | −1.316(.315) | −1.403(.374) | −1.320(.366) | −1.216(.331) | −1.242(.332) | −1.151(.310) |
We use λ̂ = (.28, .63, .11, .16) for the SCAD, Hard, LASSO, and ALASSO estimates, and use (λ̂1, λ̂2) = (.34, .9) for the EN estimates. Table 5 presents the regression coefficient estimates for the stroke data. Higher levels of albumin and creatinine are strongly related to shorter LOS, whereas admission to the ICU is associated with longer LOS. Older patients tend to have shorter LOS than younger patients; this is most easily explained by the fact that many older stroke patients quickly die in the hospital because their bodies are too weak to recover. Patients with hemorrhagic strokes have longer recovery periods and thus longer LOS. White stroke patients tend to have shorter LOS than non-whites. Finally, sex and glucose are weak predictors of LOS. The LASSO and EN estimates tend to retain more predictors in the final model and thus have more complex models compared with the other penalized estimators. Among the SCAD, Hard, and ALASSO estimates, SCAD and ALASSO yielded similar coefficient estimates, whereas the Hard thresholding estimates yielded the sparsest model. Our methods yielded models that appear to have reasonable scientific interpretation and do not make a strong MCAR assumption, an assumption that is not supported by the data.
7. REMARKS
We have developed a general methodology for selecting variables and simultaneously estimating their regression coefficients in semiparametric models. This development overcomes two major challenges that are not present with any of the existing variable selection methods. First, UP (β) may not correspond to the derivative of an objective function or to quasi-likelihood, so that the mathematical arguments used by previous authors to establish the asymptotic properties of penalized maximum likelihood or penalized GEE estimators do not apply. Second, UP (β) may be discrete in β, which entails considerable theoretical and computational challenges. In particular, the variances of the estimated regression coefficients cannot be evaluated directly, and we have developed a novel resampling procedure, which also can be used for variance estimation without the need for variable selection. Our simulation results indicate that the resampling method works well for modest sample sizes.
Rank estimators (Prentice 1978; Tsiatis 1990; Wei, Ying, and Lin 1990; Lai and Ying 1991b; Ying 1993) provide potential alternatives to the Buckley–James estimator but are computationally more demanding to implement (cf. Johnson 2008). In general, rank-estimating functions do not correspond to the derivatives of any objective functions. This is also true of estimating functions for many other semiparametric problems. In all of those situations, we can use Theorem 1 to establish the asymptotic properties of the corresponding variable selection procedures and use the proposed resampling technique to estimate the variances of the selected variables.
The proportional hazards and accelerated failure time models cannot hold simultaneously unless the error distribution is extreme value. Thus, it is useful to have variable selection methods for both models at one’s disposal, because one model may fit the data better than the other. A major advantage of model (1) is that the regression coefficients have a direct physical interpretation. The hazard ratio can be an awkward concept, especially when the response variable does not pertain to failure time.
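One direction of the first claim can be checked directly. The following sketch assumes, for illustration only, that the response in model (1) is the logarithm of a failure time $\tilde T$ with covariate vector $Z$, and writes $\lambda_0$ for the hazard function of $e^{\varepsilon}$. The symbols $\tilde T$, $Z$, $\lambda_0$, $h_0$, $\alpha$, $\kappa$, and $\gamma$ are notation introduced here and may differ from the notation used elsewhere in the article.

```latex
\begin{align*}
\log \tilde T = \beta^T Z + \varepsilon
  \;&\Longrightarrow\;
  \lambda(t \mid Z) = e^{-\beta^T Z}\,\lambda_0\!\left(t\,e^{-\beta^T Z}\right), \\
\lambda_0(u) = \alpha\kappa\,u^{\kappa-1}
  \;&\Longrightarrow\;
  \lambda(t \mid Z) = \alpha\kappa\,t^{\kappa-1}\,e^{-\kappa\beta^T Z}
  = h_0(t)\,e^{\gamma^T Z},
  \qquad h_0(t) = \alpha\kappa\,t^{\kappa-1},\quad \gamma = -\kappa\beta .
\end{align*}
```

A Weibull hazard for $e^{\varepsilon}$ means that $\varepsilon$ follows an extreme value distribution, so in this case the accelerated failure time model is also a proportional hazards model with log hazard ratios $-\kappa\beta$. The converse, that no other error distribution allows both models to hold, is a standard result; see, for example, Kalbfleisch and Prentice (2002).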
Acknowledgments
This research was supported by National Institutes of Health grants P30 ES10126, T32 ES007018, and R03 AI068484 (B.J.); R37 GM047845 (D.L.); and R01 CA082659 (D.L. and D.Z.). The authors thank Paul Weiss for preparing the stroke data set.
APPENDIX A: PROOF OF THEOREM 1
To prove part a, we consider , where . Because $n^{1/2} q_{\lambda_n}(|\beta_{0j}|) \to 0$, $j = 1, \ldots, s$, under condition C.2a and $\hat\beta = \beta_0 + O_p(n^{-1/2})$, we have
Under condition C.2b, for $j = s + 1, \ldots, d$, and are dominated by $-n^{1/2} q_{\lambda_n}(\varepsilon)$ and $n^{1/2} q_{\lambda_n}(\varepsilon)$, so they have opposite signs when $\varepsilon$ goes to 0. Therefore, $\hat\beta$ is an approximate zero-crossing by definition.
To prove part b, we consider the sets in the probability space $C_j = \{\hat\beta_j \ne 0\}$, $j = s + 1, \ldots, d$. It suffices to show that for any $\varepsilon > 0$, when $n$ is sufficiently large, $P(C_j) < \varepsilon$. Because $\hat\beta_j = O_p(n^{-1/2})$, there exists some $M$ such that when $n$ is large enough,
Using the $j$th component of the penalized estimating function and the definition of the approximate zero-crossing, we obtain that on the set $\{\hat\beta_j \ne 0, |\hat\beta_j| < M n^{-1/2}\}$,
where $A_j$ is the $j$th row of $A$. The first three terms on the right side are of order $O_p(1)$. As a result, there exists some $M'$ such that for large $n$,
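Because by condition C.2b, $\hat\beta_j \ne 0$ and $|\hat\beta_j| < M n^{-1/2}$ imply that $n^{1/2} q_{\lambda_n}(|\hat\beta_j|) > M'$ for large $n$. Thus $P(\hat\beta_j \ne 0, |\hat\beta_j| < M n^{-1/2}) = P(\hat\beta_j \ne 0, |\hat\beta_j| < M n^{-1/2}, n^{1/2} q_{\lambda_n}(|\hat\beta_j|) > M')$. Therefore, $P(C_j) < \varepsilon/2 + P(\hat\beta_j \ne 0, |\hat\beta_j| < M n^{-1/2}, n^{1/2} q_{\lambda_n}(|\hat\beta_j|) > M') < \varepsilon$.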
To prove the second part of part b, because
after the Taylor series expansion of the last term, we conclude that
To prove part c, we consider $\beta_1 \in \mathbb{R}^s$ on the boundary of a ball around $\beta_{01}$, that is, $\beta_1 = \beta_{01} + n^{-1/2} u$ with $|u| = r$ for a fixed constant $r$. From the penalized estimating function , we have
where is between $\beta_j$ and $\beta_{0j}$ for $j = 1, \ldots, s$. Because $A_{11}$ is nonsingular, the second term on the right side is larger than $a_0 r^2 n^{-1/2}$, where $a_0$ is the smallest eigenvalue of . The first term is of order $r\,O_p(n^{-1/2})$. Because , the third term is dominated by the second term. Therefore, for any $\varepsilon$, if we choose $r$ sufficiently large so that for large $n$, the probability that the absolute value of the first term is larger than the second term is less than $\varepsilon$, we then have
Applying the Brouwer fixed-point theorem to the continuous function , we see that implies that has a solution within this ball or, equivalently, has a solution within this ball. That is, we can choose an exact solution to with $\hat\beta = \beta_0 + O_p(n^{-1/2})$. Thus $\hat\beta$ is a zero-crossing of $U^P(\beta)$.
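Conditions C.2a and C.2b, which drive parts a and b of the proof, can be checked numerically for a specific penalty. The sketch below does this for the SCAD penalty derivative of Fan and Li (2001), assuming the conventional constant a = 3.7 and a tuning sequence λn = n^(-1/4). That rate, the coefficient value 0.5, the constant M = 1, and the function names are illustrative choices made here, not values taken from the article.

```python
import math


def scad_derivative(theta, lam, a=3.7):
    """SCAD penalty derivative q_lambda(theta) for theta >= 0 (Fan and Li 2001)."""
    if theta <= lam:
        return lam
    # Linear decay on (lam, a*lam], exactly zero beyond a*lam.
    return max(a * lam - theta, 0.0) / (a - 1.0)


def scaled_penalty(n, theta, a=3.7):
    """n^{1/2} * q_{lambda_n}(theta) under the ASSUMED rate lambda_n = n^{-1/4}."""
    lam_n = n ** -0.25
    return math.sqrt(n) * scad_derivative(theta, lam_n, a)


beta_0j = 0.5  # hypothetical nonzero true coefficient
M = 1.0        # hypothetical constant defining the n^{-1/2} neighborhood of zero

for n in (10**2, 10**4, 10**6):
    # C.2a: the scaled penalty at a fixed nonzero coefficient should vanish.
    at_signal = scaled_penalty(n, beta_0j)
    # C.2b: the scaled penalty within M * n^{-1/2} of zero should diverge.
    # SCAD is constant on [0, lambda_n], so the endpoint gives the infimum here.
    near_zero = scaled_penalty(n, M / math.sqrt(n))
    print(f"n={n:>8d}  at beta_0j: {at_signal:8.3f}  near zero: {near_zero:8.3f}")
```

Under these assumptions the first column is exactly zero once 3.7 λn falls below 0.5, whereas the second column grows like n^(1/4), which is the contrast that the proof exploits.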
APPENDIX B: CONDITIONAL DISTRIBUTION OF
Here we justify the resampling procedure for the penalized Buckley–James estimator. Similar justifications can be made for other estimators. Under conditions D.1–D.3, we have the following asymptotic linear expansion for the penalized Buckley–James estimating function:
(B.1)
In addition,
where w1i comprises the components of wi corresponding to β1, and wi, i = 1, …, n, as given by Lin and Wei (1992), are n independent mean-0 random vectors. Replacing the unknown quantities in wi with their sample estimators yields Wi. Recall that satisfies , where W1i comprises the components of Wi corresponding to β̂1. Applying (B.1) to β̂1 and yields
The conclusion then follows.
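The mechanics of this kind of argument can be illustrated in a setting far simpler than the Buckley–James estimator. The sketch below assumes a one-predictor linear model with no intercept, no censoring, and no penalty; it uses the least-squares estimating function U(b) = Σ x_i (y_i − b x_i) and takes the multipliers G_i to be independent standard normal variables. The data-generating values, the number of resamples, and all function names are hypothetical choices made for illustration; this is not the authors' implementation.

```python
import random
import statistics


def simulate(n, beta, rng):
    """Hypothetical data: y = beta * x + heteroscedastic normal noise."""
    xs = [rng.uniform(0.5, 2.0) for _ in range(n)]
    ys = [beta * x + rng.gauss(0.0, 0.5 * x) for x in xs]
    return xs, ys


def resample_estimates(xs, ys, n_resamples, rng):
    """Solve U(b*) = sum_i W_i * G_i for each draw of multipliers G_i."""
    sxx = sum(x * x for x in xs)
    beta_hat = sum(x * y for x, y in zip(xs, ys)) / sxx
    # W_i: estimated summands of the estimating function at beta_hat.
    w = [x * (y - beta_hat * x) for x, y in zip(xs, ys)]
    draws = []
    for _ in range(n_resamples):
        perturbation = sum(wi * rng.gauss(0.0, 1.0) for wi in w)
        # Here U(b) = -sxx * (b - beta_hat), so the perturbed root is explicit.
        draws.append(beta_hat - perturbation / sxx)
    # Sandwich variance that the resampled roots should reproduce.
    sandwich_var = sum(wi * wi for wi in w) / sxx**2
    return beta_hat, draws, sandwich_var


rng = random.Random(2008)
xs, ys = simulate(200, 1.5, rng)
beta_hat, draws, sandwich_var = resample_estimates(xs, ys, 2000, rng)
resampling_var = statistics.variance(draws)
print(f"beta_hat            = {beta_hat:.3f}")
print(f"resampling variance = {resampling_var:.5f}")
print(f"sandwich variance   = {sandwich_var:.5f}")
```

Conditional on the data, the variance of the resampled roots should agree with the sandwich variance up to Monte Carlo error. For estimating functions that are discrete in the coefficients, where no explicit root exists, an asymptotic linear expansion such as (B.1) plays the role of the exact linearity used in this toy example.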
Contributor Information
Brent A. Johnson, Assistant Professor, Department of Biostatistics, Emory University, Atlanta, GA 30322 (E-mail: bajohn3@emory.edu)
D. Y. Lin, Dennis Gillings Distinguished Professor (E-mail: lin@bios.unc.edu)
Donglin Zeng, Associate Professor (E-mail: dzeng@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599.
References
- Buckley J, James I. Linear Regression With Censored Data. Biometrika. 1979;66:429–436.
- Cai J, Fan J, Li R, Zhou H. Variable Selection for Multivariate Failure Time Data. Biometrika. 2005;92:303–316. doi: 10.1093/biomet/92.2.303.
- Cox DR. Regression Models and Life-Tables (with discussion). Journal of the Royal Statistical Society, Ser B. 1972;34:187–202.
- Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Li R. Variable Selection for Cox’s Proportional Hazards Model and Frailty Model. The Annals of Statistics. 2002;30:74–99.
- Fan J, Li R. New Estimation and Model Selection Procedures for Semi-parametric Modeling in Longitudinal Data Analysis. Journal of the American Statistical Association. 2004;99:710–723.
- Frank IE, Friedman JH. A Statistical View of Some Chemometrics Regression Tools. Technometrics. 1993;35:109–148.
- Fu WJ. Penalized Estimating Equations. Biometrics. 2003;59:126–132. doi: 10.1111/1541-0420.00015.
- Hunter DR, Li R. Variable Selection Using MM Algorithms. The Annals of Statistics. 2005;33:1617–1642. doi: 10.1214/009053605000000200.
- Johnson BA. Variable Selection in Semiparametric Linear Regression With Censored Data. Journal of the Royal Statistical Society, Ser B. 2008;70:351–370.
- Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. 2nd ed. Hoboken, NJ: Wiley; 2002.
- Knight K, Fu W. Asymptotics for Lasso-Type Estimators. The Annals of Statistics. 2000;28:1356–1378.
- Lai TL, Ying Z. Large Sample Theory of a Modified Buckley–James Estimator for Regression Analysis With Censored Data. The Annals of Statistics. 1991a;19:1370–1402.
- Lai TL, Ying Z. Rank Regression Methods for Left-Truncated and Right Censored Data. The Annals of Statistics. 1991b;19:531–556.
- Liang KY, Zeger SL. Longitudinal Data Analysis Using Generalized Linear Models. Biometrika. 1986;73:13–22.
- Lin JS, Wei LJ. Linear Regression Analysis for Multivariate Failure Time Observations. Journal of the American Statistical Association. 1992;87:1091–1097.
- Lin DY, Ying Z. Semiparametric and Nonparametric Regression Analysis of Longitudinal Data (with discussion). Journal of the American Statistical Association. 2001;96:103–126.
- Meinshausen N, Bühlmann P. Variable Selection and High-Dimensional Graphs With the Lasso. The Annals of Statistics. 2006;34:1436–1462.
- Paul Coverdell Prototype Registries Writing Group. Acute Stroke Care in the US: Results From 4 Pilot Prototypes of the Paul Coverdell National Acute Stroke Registry. Stroke. 2005;36:1232–1240. doi: 10.1161/01.STR.0000165902.18021.5b.
- Prentice RL. Linear Rank Tests With Right-Censored Data. Biometrika. 1978;65:167–179.
- Ritov Y. Estimation in a Linear Regression Model With Censored Data. The Annals of Statistics. 1990;18:303–328.
- Robins JM, Rotnitzky A, Zhao LP. Estimation of Regression Coefficients When Some Regressors Are not Always Observed. Journal of the American Statistical Association. 1994;89:846–866.
- Tibshirani RJ. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Ser B. 1996;58:267–288.
- Tibshirani RJ. The Lasso Method for Variable Selection in the Cox Model. Statistics in Medicine. 1997;16:385–395. doi: 10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3.
- Tsiatis AA. Estimating Regression Parameters Using Linear Rank Tests for Censored Data. The Annals of Statistics. 1990;18:354–372.
- Tsiatis AA. Semiparametric Theory and Missing Data. New York: Springer; 2006.
- van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. New York: Springer; 1996.
- Wahba G. A Comparison of GCV and GML for Choosing the Smoothing Parameter in the Generalized Spline Smoothing Problem. The Annals of Statistics. 1985;13:1378–1402.
- Wei LJ, Ying Z, Lin DY. Regression Analysis of Censored Survival Data Based on Rank Tests. Biometrika. 1990;77:845–851.
- Ying Z. A Large Sample Study of Rank Estimation for Censored Regression Data. The Annals of Statistics. 1993;21:76–99.
- Zou H. The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association. 2006;101:1418–1429.
- Zou H, Hastie T. Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society, Ser B. 2005;67:301–320.