Penalized Estimating Functions and Variable Selection in Semiparametric Regression Models

Brent A Johnson; D Y Lin; Donglin Zeng

doi:10.1198/016214508000000184

Author manuscript; available in PMC: 2010 Apr 6.

Published in final edited form as: J Am Stat Assoc. 2008 Jun 1;103(482):672–680. doi: 10.1198/016214508000000184

PMCID: PMC2850080 NIHMSID: NIHMS103682 PMID: 20376193

Abstract

We propose a general strategy for variable selection in semiparametric regression models by penalizing appropriate estimating functions. Important applications include semiparametric linear regression with censored responses and semiparametric regression with missing predictors. Unlike the existing penalized maximum likelihood estimators, the proposed penalized estimating functions may not pertain to the derivatives of any objective functions and may be discrete in the regression coefficients. We establish a general asymptotic theory for penalized estimating functions and present suitable numerical algorithms to implement the proposed estimators. In addition, we develop a resampling technique to estimate the variances of the estimated regression coefficients when the asymptotic variances cannot be evaluated directly. Simulation studies demonstrate that the proposed methods perform well in variable selection and variance estimation. We illustrate our methods using data from the Paul Coverdell Stroke Registry.

Keywords: Accelerated failure time model, Buckley-James estimator, Censoring, Least absolute shrinkage and selection operator, Least squares, Linear regression, Missing data, Smoothly clipped absolute deviation

1. INTRODUCTION

A major challenge in regression analysis is to decide which predictors among many potential ones are to be included in the model. It is customary to use stepwise selection and subset selection. But these procedures are unstable and ignore the stochastic errors introduced by the selection process. Several methods, including bridge regression (Frank and Friedman 1993), least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996), smoothly clipped absolute deviation (SCAD) (Fan and Li 2001), elastic net (EN) (Zou and Hastie 2005), and adaptive lasso (ALASSO) (Zou 2006), have been proposed to select variables and estimate their regression coefficients simultaneously. These methods can be cast in the framework of penalized least squares and likelihood.

Consider the linear regression model

Y_{i} = β^{T} x_{i} + ε_{i}, i = 1, \dots, n,

(1)

where Y_i is the response variable, x_i is a d-vector of predictors for the ith subject, β is a d-vector of regression coefficients, and (ε₁, …, ε_n) are independent and identically distributed errors. For simplicity, assume that the ε_i’s have means 0. Define l(β) = ||y − Xβ||², where y = (Y₁, …, Y_n)^T and X = (x₁, …, x_n)^T. Then the penalized least squares estimator of β is the minimizer of the objective function

l (β) + n \sum_{j = 1}^{d} p_{λ} (∣ β_{j} ∣),

(2)

where p_λ(·) is a penalty function. Appropriate choices of p_λ (detailed in Sec. 2) yield the aforementioned variable selection procedures. For likelihood-based models, the penalized maximum likelihood estimator is obtained by setting l(β) to the minus log-likelihood.

For many semiparametric problems, the estimation of regression coefficients (without the task of variable selection) does not pertain to the minimization of any objective function. Important examples include weighted estimating equations for missing data (Robins, Rotnitzky, and Zhao 1994; Tsiatis 2006) and the Buckley–James estimator for semiparametric linear regression with censored responses (Buckley and James 1979). Another example arises from Lin and Ying’s (2001) semiparametric regression analysis of longitudinal data. For this example, Fan and Li (2004) proposed a variable selection method by incorporating the SCAD penalty into Lin and Ying’s estimator. They noted that their estimator may be cast in the form of (2), so that their earlier results (Fan and Li 2001) for penalized least squares could be applied. In this article we go beyond specific problems and provide a very general theory for a broad class of penalized estimating functions. In this regard, only Fu’s (2003) work on generalized estimating equations (GEEs) (Liang and Zeger 1986) with bridge penalty (Frank and Friedman 1993; Knight and Fu 2000) is similar. That work deals only with smooth estimating functions, whereas our theory applies to very general, possibly discrete estimating functions. In addition, we present general computational strategies.

The remainder of the article is organized as follows. We present our penalized estimating functions in Section 2, paying special attention to the aforementioned missing-data and censored-data problems. We state the asymptotic results in Section 3 and address implementation issues in Section 4. We report the results of our simulation studies in Section 5 and apply the methods to real data in Section 6.

2. PENALIZED ESTIMATING FUNCTIONS

2.1 General Setting

Suppose that U(β) ≡ (U₁(β), …, U_d (β))^T is an estimating function for β ≡ (β₁, …, β_d)^T based on a random sample of size n. For maximum likelihood estimation, U(β) is simply the score function. We are interested mainly in the situations in which U(β) is not a score function or the derivative of any objective function. A penalized estimating function is defined as

U^{P} (β) = U (β) - n q_{λ} (∣ β ∣) sgn (β),

where q_λ(|β|) = (q_{λ,1}(|β₁|), …, q_{λ,d}(|β_d|))^T, the q_{λ,j}(·), j = 1, …, d, are coefficient-dependent continuous functions, and the second term is the componentwise product of q_λ and sgn(β). In most cases, $q_{λ, j} = p_{λ, j}^{'}$ for some penalty function p_{λ,j}, and the functions q_{λ,j} are the same for all d components of q_λ(|β|), that is, q_{λ,j} = q_{λ,k} for j ≠ k. When the functions q_{λ,j}, j = 1, …, d, do not vary with j, we drop the subscript j for simplicity.

When $q_{λ} = p_{λ}^{'}$ , we consider five penalty functions: (a) the LASSO penalty (Tibshirani 1996, 1997), p_λ (|θ|) = λ|θ|; (b) the hard thresholding penalty, p_λ (|θ|) = λ² − (|θ| − λ)²I(|θ|<λ); (c) the SCAD penalty (Fan and Li 2001, 2002, 2004), defined by

p_{λ}^{'} (∣ θ ∣) = λ {I (∣ θ ∣ < λ) + \frac{{(a λ - ∣ θ ∣)}_{+}}{(a - 1) λ} I (∣ θ ∣ \geq λ)}

for a > 2; (d) the EN penalty (Zou and Hastie 2005), p_λ (|θ|) = λ₁|θ| + λ₂θ²; and (e) the ALASSO penalty (Zou 2006), p_λ,j (|θ|) = λ|θ|ω_j, for a known data-driven weight ω_j. In our applications we use the weight $ω_{j} = 1 / ∣ {\tilde{β}}_{j}^{o} ∣$ , j = 1, …, d, where ${\tilde{β}}^{o} = {({\tilde{β}}_{1}^{o}, \dots, {\tilde{β}}_{d}^{o})}^{T}$ refers to the d-vector of regression coefficient estimates obtained from solving the original estimating equation, U(β) = 0.
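As a minimal sketch, the derivatives q_λ = p′_λ of the five penalties listed above can be written as NumPy functions. The function names and vectorized signatures are illustrative assumptions and are not taken from the authors' software; `beta_init` stands in for the unpenalized solution used in the ALASSO weight.

```python
import numpy as np

def q_lasso(theta, lam):
    """LASSO: p(|t|) = lam*|t|, so q(|t|) = lam for every t."""
    t = np.abs(np.asarray(theta, dtype=float))
    return lam * np.ones_like(t)

def q_hard(theta, lam):
    """Hard thresholding: derivative of lam^2 - (|t| - lam)^2 I(|t| < lam)."""
    t = np.abs(np.asarray(theta, dtype=float))
    return 2.0 * (lam - t) * (t < lam)

def q_scad(theta, lam, a=3.7):
    """SCAD derivative: lam below lam, linear decay to 0 at a*lam, then 0."""
    t = np.abs(np.asarray(theta, dtype=float))
    tail = np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam)
    return lam * np.where(t < lam, 1.0, tail)

def q_en(theta, lam1, lam2):
    """Elastic net: derivative of lam1*|t| + lam2*t^2 with respect to |t|."""
    t = np.abs(np.asarray(theta, dtype=float))
    return lam1 + 2.0 * lam2 * t

def q_alasso(theta, lam, beta_init):
    """Adaptive LASSO with weights 1/|beta_init_j| (beta_init is hypothetical input)."""
    t = np.abs(np.asarray(theta, dtype=float))
    w = 1.0 / np.abs(np.asarray(beta_init, dtype=float))
    return lam * w * np.ones_like(t)
```

Only SCAD and hard thresholding vanish for large |θ|; the LASSO and EN derivatives stay bounded away from zero, so large coefficients continue to be shrunk.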

The hard thresholding penalty is important because it corresponds to best subset selection and stepwise deletion in certain cases. The LASSO (Tibshirani 1996, 1997) is one of the most popular shrinkage estimators, but it has some deficiencies; in particular, it is inconsistent for certain designs (Meinshausen and Bühlmann 2006; Zou 2006). Fan and Li (2001, 2002) attempted to avoid such deficiencies by constructing a new penalty function (SCAD) that results in an estimator that achieves an oracle property: that is, the estimator has the same limiting distribution as an estimator that knows the true model a priori. Recently, Zou (2006) introduced ALASSO, which, like SCAD, achieves the oracle property and may have numerical advantages for some problems. Finally, Zou and Hastie (2005) introduced the mixture penalty EN to effectively select “grouped” variables; this penalty is popular in the statistical analysis of large data sets.

2.2 Application to Censored Data

Censoring is a common phenomenon in scientific studies (see Kalbfleisch and Prentice 2002, p. 12). The presence of censoring causes major complications in implementation of the penalized least squares approach, because the values of the Y_i are unknown for the censored observations. The problem is much simpler for the proportional hazards regression because the partial likelihood (Cox 1972) plays essentially the same role as the standard likelihood (Tibshirani 1997; Fan and Li 2002; Cai, Fan, and Zhou 2005). However, the proportional hazards model may not be appropriate in some applications, especially when the response variable does not pertain to failure time.

Let Y_i and C_i denote the response variable and censoring variable for the ith subject, i = 1, …, n. The data consist of (Ỹ_i, Δ_i, x_i), i = 1, …, n, where Ỹ_i = min(Y_i, C_i), Δ_i = I (Y_i ≤ C_i) and x_i is a d-vector of predictors. We relate Y_i to x_i through the semiparametric linear regression model given in (1), where ε_i are independent and identically distributed with an unspecified distribution function F (·). We assume that Y_i is independent of C_i conditional on x_i. When the response variable pertains to failure time, both Y_i and C_i are commonly measured on the log scale, and model (1) is called the accelerated failure time model (Kalbfleisch and Prentice 2002, p. 44).

Clearly,

E {Δ_{i} Y_{i} + (1 - Δ_{i}) E (Y_{i} ∣ Δ_{i} = 0) ∣ x_{i}} = α + β^{T} x_{i}

and

E (Y_{i} ∣ Δ_{i} = 0) = β^{T} x_{i} + \frac{\int_{e_{i} (β)}^{\infty} {1 - F (s)} d s}{1 - F {e_{i} (β)}},

where α = E(ε_i) and e_i (β) = Ỹ_i − β^Tx_i. Thus Buckley and James (1979) proposed the estimating function for β,

U (β) = \sum_{i = 1}^{n} x_{i} {ξ_{i} (β) - β^{T} x_{i}},

(3)

where

ξ_{i} (β) = Δ_{i} Y_{i} + (1 - Δ_{i}) [β^{T} x_{i} + \frac{\int_{e_{i} (β)}^{\infty} {1 - \hat{F} (s; β)} d s}{1 - \hat{F} {e_{i} (β); β}}],

and F̂(t; β) is the Kaplan–Meier estimator of F (t) based on {e_i (β), Δ_i}, i = 1, …, n. If Δ_i = 1 for all i, then the penalized estimating function U^P (β) corresponding to (3) becomes the penalized least squares estimating function arising from (2). Thus the penalized Buckley–James estimator is a direct generalization of the penalized least squares estimator to censored data.

2.3 Application to Missing Data

It often is difficult to have complete data on all study subjects. Let R_i be the missingness indicator for the ith subject, with the event {R_i = ∞} indicating that the ith subject has complete data. The observed data for the ith subject are G_r (Z_i), where G_r (·) is the missingness operator acting on the full data Z_i of the ith subject when R_i = r. In simple linear regression, for example, we may consider only R_i ∈ {1, 2, ∞} corresponding to G₁(Z_i) = {Y_i}, G₂(Z_i) = {x_i}, and G_∞(Z_i) = {Y_i, x_i} = Z_i. The observed data are represented as {R_i, G_{R_i} (Z_i), i = 1, …, n}. We focus on monotone missingness and make two assumptions: (a) P (R_i = ∞|Z_i = z) > κ > 0 and (b) P (R_i = r|Z_i = z) = P (R_i = r |G_r (z) = g_r).

Consider the semiparametric linear regression model given in (1). The weighted complete-case estimating function takes the form

S (β) = \sum_{i = 1}^{n} \frac{I (R_{i} = \infty) s_{i} (β)}{π (\infty, Z_{i})},

where s_i (β) = x_i (Y_i − α − β^T x_i) and π(r, G_r (z)) = P (R_i = r|G_r (z) = g_r). To improve efficiency, we adopt the strategy of Robins et al. (1994) and propose the estimating function

U (β) = S (β) - \sum_{i = 1}^{n} \sum_{r} [\frac{I (R_{i} = r) - {\tilde{λ}}_{r} {G_{r} (Z_{i}), η} I (R_{i} \geq r)}{\tilde{π} {r, G_{r} (Z_{i}), η}}] \times \tilde{E} {s_{i} (β) ∣ G_{r} (Z_{i})},

where λ̃_r {G_r (Z_i), η} = {1 + exp[−μ_r {G_r (Z_i), η}]}⁻¹, μ_r {G_r (Z_i), η} is a linear predictor based on G_r (Z_i) and η, $\tilde{π} {r, G_{r} (Z_{i}), η} = \prod_{m = 1}^{r} {\tilde{λ}}_{m} {G_{m} (Z_{i}), η}$ , and Ẽ{s_i (β)|G_r (Z_i)} is the conditional expectation of s_i (β) given G_r (Z_i) under a posited parametric submodel for the full data-generating process.

3. ASYMPTOTIC RESULTS

Fan and Li (2001) showed that the penalized least squares estimator minimizing (2), or more generally the penalized maximum likelihood estimator, with the SCAD or hard thresholding penalty behaves asymptotically as if the true model is known a priori—the so-called oracle property. We show that this property holds for a very broad class of penalized estimating functions, of which the Buckley–James and weighted estimating functions with the SCAD and hard thresholding penalty functions are special cases.

Let β₀ ≡ (β₀₁, …, β_{0d})^T denote the true value of β. Without loss of generality, suppose that β_{0j} ≠ 0 for j ≤ s and β_{0j} = 0 for j > s. We impose the following conditions:

C.1. There exists a nonsingular matrix A such that for any given constant M,

$sup_{∣ β - β_{0} ∣ \leq M n^{- 1 / 2}} ∣ n^{- 1 / 2} U (β) - n^{- 1 / 2} U (β_{0}) - n^{1 / 2} A (β - β_{0}) ∣ = o_{p} (1) .$

Furthermore, n^{−1/2} U(β₀) →_d N(0, V) for V a d × d matrix.
C.2. The penalty function q_{λ_n}(·) has the following properties:
(a) For nonzero fixed θ, lim n^{1/2} q_{λ_n}(|θ|) = 0 and $lim q_{λ_{n}}^{'} (∣ θ ∣) = 0$ .
(b) For any M > 0, $\sqrt{n} {inf}_{∣ θ ∣ \leq M n^{- 1 / 2}} q_{λ_{n}} (∣ θ ∣) \to \infty$ .

Remark 1

Condition C.1 is not unusual and is satisfied by many commonly used estimating functions. This condition is implied by standard conditions for Z-estimators (van der Vaart and Wellner 1996, thm. 3.3).

Remark 2

Condition C.2 pertains to the choices of the penalty function and regularization parameter. This condition is key to obtaining the oracle property. In particular, condition C.2a prevents the jth element of the penalized estimating function from being dominated by the penalty term, q_{λ_n}(|β_j|) sgn(β_j), for β_{0j} ≠ 0, because $\sqrt{n} q_{λ_{n}} (∣ β_{j} ∣) sgn (β_{j})$ vanishes. But if β_{0j} = 0, then condition C.2b implies that $\sqrt{n} q_{λ_{n}} (∣ β_{j} ∣) sgn (β_{j})$ diverges to +∞ or −∞, depending on the sign of β_j in a small neighborhood of β_{0j}. Thus the jth element of the penalized estimating function is dominated by the penalty term, so that any consistent solution, say β̂, to the estimating equation U^P(β) = 0 must satisfy β̂_j = 0.

Remark 3

Condition C.2 is satisfied by several commonly used penalties with proper choices of the regularization parameter λ_n:

Under the hard penalty [i.e., q_{λ_n} (|θ|) = 2(λ_n − |θ|)I (|θ| < λ_n)], it is straightforward to verify that condition C.2 holds if λ_n → 0 and $\sqrt{n} λ_{n} \to \infty$ .
Under the SCAD penalty, that is,
$q_{λ_{n}} (∣ θ ∣) = λ_{n} {I (∣ θ ∣ < λ_{n}) + \frac{{(a λ_{n} - ∣ θ ∣)}_{+}}{(a - 1) λ_{n}} I (∣ θ ∣ \geq λ_{n})},$

with a > 2, it is easy to see that if we choose λ_n → 0 and $\sqrt{n} λ_{n} \to \infty$ , then condition C.2 holds because, for all sufficiently large n, $\sqrt{n} q_{λ_{n}} (∣ θ ∣) = q_{λ_{n}}^{'} (∣ θ ∣) = 0$ for fixed θ ≠ 0 and $\sqrt{n} {inf}_{∣ θ ∣ \leq M n^{- 1 / 2}} q_{λ_{n}} (∣ θ ∣) = \sqrt{n} λ_{n}$ .
For the ALASSO penalty, we assume that $\sqrt{n} λ_{n} \to 0$ , nλ_n → ∞, and q_{λ_n}(|θ|) = λ_n ŵ for some data-dependent weight ŵ. First, n^{1/2} q_{λ_n}(|θ|) = n^{1/2} λ_n ŵ → 0 and $q_{λ_{n}}^{'} (∣ θ ∣) = 0$ for |ŵ| < ∞ and θ ≠ 0. Second, to obtain sparsity, we require that the weights be sufficiently large for θ sufficiently small, say |θ| < Mn^{−1/2}. For simplicity, suppose that the data-dependent weights are defined as ŵ = |θ̃|^{−γ} for some γ > 0 and θ̃ pertaining to the solutions to the unpenalized estimating equations. Then, trivially, $\sqrt{n} (\tilde{θ} - θ_{0}) = O_{p} (1)$ , which implies that $\sqrt{n} {inf}_{∣ θ ∣ \leq M n^{- 1 / 2}} λ_{n} \hat{w} = M^{- 1} n λ_{n} \to \infty$ , as desired. In this article we chose γ = 1, but Zou (2006, remarks 1 and 2) noted that other weights may be useful.
When q_{λ_n}(|θ|) = λ_n/|θ|, condition C.2 is satisfied if $\sqrt{n} λ_{n} \to 0$ and nλ_n → ∞. To see this, note that $\sqrt{n} q_{λ_{n}} (∣ θ ∣) = \sqrt{n} λ_{n} / ∣ θ ∣ \to 0, q_{λ_{n}}^{'} (∣ θ ∣) = - λ_{n} / ∣ θ ∣^{2} \to 0$ for θ ≠ 0, and $\sqrt{n} \times {inf}_{∣ θ ∣ \leq M n^{- 1 / 2}} λ_{n} / ∣ θ ∣ = M^{- 1} n λ_{n} \to \infty$ . An anonymous referee pointed out that q_{λ_n}(|θ|) = λ_n/|θ| pertains to p_{λ_n}(|θ|) = λ_n log(|θ|) on the original scale.
Condition C.2 does not hold for the LASSO and EN penalty functions.
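The two parts of condition C.2 can be checked numerically for the SCAD penalty. The sketch below assumes the particular rate λ_n = n^{−1/4} (one choice satisfying λ_n → 0 and √n λ_n → ∞), a fixed nonzero θ = 1, and M = 2; the function names and the grid used to approximate the infimum are illustrative assumptions.

```python
import numpy as np

def q_scad(theta, lam, a=3.7):
    """SCAD derivative q_lambda(|theta|)."""
    t = np.abs(np.asarray(theta, dtype=float))
    tail = np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam)
    return lam * np.where(t < lam, 1.0, tail)

def check_c2(n, M=2.0, theta_fixed=1.0):
    """Return (sqrt(n) * q at a fixed nonzero theta,
               sqrt(n) * min of q over a grid on |theta| <= M / sqrt(n))."""
    lam_n = n ** -0.25                      # assumed rate for the regularization parameter
    root_n = np.sqrt(n)
    grid = np.linspace(0.0, M / root_n, 201)
    part_a = root_n * q_scad(theta_fixed, lam_n)
    part_b = root_n * np.min(q_scad(grid, lam_n))
    return float(part_a), float(part_b)

for n in (10**2, 10**4, 10**6):
    print(n, check_c2(n))
```

The first component drops to zero once aλ_n falls below the fixed |θ|, while the second grows like √n λ_n, which is the behavior the condition asks for.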

To accommodate discrete estimating functions such as (3), we provide a formal definition of the solution to the penalized estimating equation. An estimator β̂= (β̂₁, …, β̂_d)^T is called a zero-crossing to the penalized estimating equation if, for j =1, …, d,

\underset{ε \to 0 +}{\overline{\lim}} n^{- 1} U_{j}^{P} (\hat{β} + ε e_{j}) U_{j}^{P} (\hat{β} - ε e_{j}) \leq 0,

where e_j is the j th canonical unit vector. In addition, an estimator β̂ is called an approximate zero-crossing if

\underset{n \to \infty}{\overline{\lim}} \underset{ε \to 0 +}{\overline{\lim}} n^{- 1} U_{j}^{P} (\hat{β} + ε e_{j}) U_{j}^{P} (\hat{β} - ε e_{j}) \leq 0.

If U^P is continuous, then the zero-crossing is an exact solution to the penalized estimating equation.
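To make the zero-crossing definition concrete, the sketch below checks the sign-change condition with a small finite ε standing in for the limit. The step-function estimating function `U_step` (a sum of indicators minus 1/2, whose zero-crossing is the sample median) is a toy example chosen for illustration and is not one of the estimating functions studied in this article.

```python
import numpy as np

def is_zero_crossing(U, beta_hat, eps=1e-8):
    """Check U_j(b + eps*e_j) * U_j(b - eps*e_j) <= 0 for every coordinate j."""
    beta_hat = np.atleast_1d(np.asarray(beta_hat, dtype=float))
    for j in range(beta_hat.size):
        step = np.zeros_like(beta_hat)
        step[j] = eps
        upper = U(beta_hat + step)[j]
        lower = U(beta_hat - step)[j]
        if upper * lower > 0:
            return False
    return True

y = np.array([1.0, 2.0, 4.0, 7.0, 9.0])

def U_step(beta):
    """Discrete estimating function: takes only half-integer values, so it has no exact root."""
    return np.array([np.sum((y > beta[0]) - 0.5)])

print(is_zero_crossing(U_step, [4.0]))   # sign changes at the sample median
print(is_zero_crossing(U_step, [3.0]))   # no sign change here
```

Because `U_step` never equals zero, an exact solution does not exist, yet the zero-crossing is still well defined.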

The following theorem states the main theoretical results regarding the proposed penalized estimators, including the existence of a root-n–consistent estimator, the sparsity of the estimator, and the asymptotic normality of the estimator.

Theorem 1

Define the number of nonzero coefficients s = #{j: β_{0j} ≠ 0}. Under conditions C.1 and C.2, the following results hold:

(a) There exists a root-n–consistent approximate zero-crossing of U^P(β); that is, there exists β̂ = β₀ + O_p(n^{−1/2}) such that β̂ is an approximate zero-crossing of U^P(β).
(b) For any root-n–consistent approximate zero-crossing of U^P(β), denoted by β̂ ≡ (β̂₁, …, β̂_d)^T, lim_n P(β̂_j = 0 for j > s) = 1. Moreover, if we write β̂₁ = (β̂₁, …, β̂_s)^T and β₀₁ = (β₀₁, …, β_{0s})^T, then

$n^{1 / 2} (A_{11} + \Sigma_{11}) {{\hat{β}}_{1} - β_{01} + {(A_{11} + \Sigma_{11})}^{- 1} b_{n}} \to_{d} N (0, V_{11}),$

where A₁₁, Σ₁₁, and V₁₁ are the first s × s submatrices of A, $diag {- q_{λ_{n}}^{'} (∣ β_{0} ∣) sgn (β_{0})}$ , and V, respectively, and b_n = −(q_{λ_n}(|β₀₁|) sgn(β₀₁), …, q_{λ_n}(|β_{0s}|) sgn(β_{0s}))^T.
(c) Let $U_{1}^{P} (β)$ and U₁(β) denote the first s components of U^P(β) and U(β), and let $β = {(β_{1}^{T}, β_{2}^{T})}^{T}$ , where β₁ denotes the first s components of β and β₂ denotes the remaining d − s components of β; that is, without loss of generality, β₂ = 0. If $U_{1} ({(β_{1}^{T}, 0^{T})}^{T})$ is continuous in β₁, then there exists β̂₁ such that

$U_{1}^{P} ({({\hat{β}}_{1}^{T}, 0^{T})}^{T}) = 0;$

that is, the solution is exact.

The proof of Theorem 1 is relegated to Appendix A. The asymptotic results for penalized weighted estimators readily follow from this theorem. Applying this theorem to the penalized Buckley–James estimators, we obtain the following result.

Corollary 1

Assume that condition C.2 holds in addition to the following three conditions:

D.1. There exists a constant c₀ such that P(Ỹ − β^T x < c₀) < 1 for all β in some neighborhood of β₀.
D.2. The random variable x has compact support.
D.3. F has finite Fisher information for location.

Then the conclusions of Theorem 1 follow.

Remark 4

Corollary 1 implies that the penalized Buckley–James estimators with the penalty functions satisfying condition C.2 have the oracle property. Conditions D.1–D.3 are the regularity conditions given by Ritov (1990, p. 306) to ensure that condition C.1 holds. The expressions for A and V were given by Ritov (1990) and Lai and Ying (1991a). The matrix V is directly estimable from the data, whereas A is not, because the latter involves the unknown density of the error term ε.

Remark 5

A result similar to Corollary 1 exists for the adaptive estimators presented in Section 2.3—namely, the penalized weighted estimators with SCAD, hard thresholding, and ALASSO penalties also have an oracle property. Technical conditions needed to obtain a strongly consistent estimator sequence and hence establish condition C.1 are given by Robins et al. (1994). Such technical conditions are assumed throughout the text of Tsiatis (2006), for example. The matrices A and V may be calculated directly; examples were given by Tsiatis (2006, chaps. 10 and 11).

Theorem 1 implies that the asymptotic covariance matrix of β̂₁ is

Ω_{11} = n^{- 1} {(A_{11} + \Sigma_{11})}^{- 1} V_{11} {(A_{11} + \Sigma_{11})}^{- 1}

and that a consistent estimator is given by

{\hat{Ω}}_{11} = n^{- 1} {({\hat{A}}_{11} + {\hat{\Sigma}}_{11})}^{- 1} {\hat{V}}_{11} {({\hat{A}}_{11} + {\hat{\Sigma}}_{11})}^{- 1} .
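When estimates of A₁₁, Σ₁₁, and V₁₁ are available, the plug-in covariance estimator is a one-line matrix computation. The sketch below implements the displayed sandwich form exactly as written; the function name and the toy inputs are assumptions for illustration.

```python
import numpy as np

def sandwich_cov(A11, Sigma11, V11, n):
    """n^{-1} (A11 + Sigma11)^{-1} V11 (A11 + Sigma11)^{-1}, as in the display above."""
    bread = np.linalg.inv(np.asarray(A11, dtype=float) + np.asarray(Sigma11, dtype=float))
    return bread @ np.asarray(V11, dtype=float) @ bread / n

# Toy inputs (hypothetical): two nonzero coefficients, n = 100 subjects.
A11 = 2.0 * np.eye(2)
Sigma11 = np.zeros((2, 2))
V11 = 4.0 * np.eye(2)
cov_hat = sandwich_cov(A11, Sigma11, V11, n=100)
std_err = np.sqrt(np.diag(cov_hat))
```

Standard errors for the nonzero coefficients are the square roots of the diagonal of the returned matrix.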

Other authors (e.g., Fu 2003) used the following alternative estimator for cov(β̂₁):

\hat{cov} ({\hat{β}}_{1}) = {\tilde{Ω}}_{11}, \tilde{Ω} = n^{- 1} [{(\hat{A} + \hat{\Sigma})}^{- 1} \hat{V} {(\hat{A} + \hat{\Sigma})}^{- 1}] .

Using the sandwich matrix Ω̃ actually produces a standard error estimate for the entire vector β̂, that is, for both nonzero and zero coefficient estimates. On the other hand, Ω̂₁₁ implicitly sets $\hat{var} ({\hat{β}}_{2}) = 0$ , its asymptotic value. In this article we use Ω̂₁₁, in agreement with earlier work on variable selection by Fan and Li (2001, 2002, 2004), Cai et al. (2005), and Zou (2006). Note that the matrix Ω̂₁₁ can be readily calculated when A and V can be evaluated directly. For discrete estimating functions such as the Buckley–James estimating function, A cannot be estimated reliably from the data. To solve this problem, we propose a resampling procedure.

Let $U_{1}^{P} (β)$ denote the components of U^P(β) corresponding to the regression coefficients with nonzero penalized estimating function estimates, and define ${\hat{β}}_{1}^{*}$ as the solution to the estimating equation

U_{1}^{P} (β) = \sum_{i = 1}^{n} W_{1 i} G_{i},

(4)

where (G₁, …, G_n) are independent standard normal variables and (W₁₁, …, W_{1n}) are as given in Appendix B. In Appendix B we show that the conditional distribution of $n^{1 / 2} ({\hat{β}}_{1}^{*} - {\hat{β}}_{1})$ given the observed data is the same in the limit as the unconditional distribution of n^{1/2}(β̂₁ − β₀₁). Thus we may estimate the covariance matrix of β̂₁ and construct confidence intervals for individual regression coefficients using the empirical distribution of ${\hat{β}}_{1}^{*}$ .

4. IMPLEMENTATION

In this article we use a majorize-minorize (MM) algorithm to estimate the penalized regression coefficients (Hunter and Li 2005). The MM algorithm may be viewed as a Fisher scoring (or Newton–Raphson) type algorithm for solving a perturbed penalized estimating equation and is closely related to the local quadratic algorithm (Tibshirani 1996; Fan and Li 2001). Using condition C.1 and the local quadratic approximations for penalty functions (Fan and Li 2001, sec. 3.3), the MM algorithm is

{\hat{β}}^{(k + 1)} = {\hat{β}}^{(k)} + {A ({\hat{β}}^{(k)}) + \Sigma_{λ} ({\hat{β}}^{(k)})}^{- 1} U^{P} ({\hat{β}}^{(k)}), k \geq 0,

where β̂⁽⁰⁾ is the solution to U(β) = 0 and

\Sigma_{λ} (β) = diag {q_{λ} (∣ β_{1} ∣) / (ε + ∣ β_{1} ∣), \dots, q_{λ} (∣ β_{d} ∣) / (ε + ∣ β_{d} ∣)}

for ∊ a small number (∊= 10⁻⁶ in our examples). This algorithm requires that the estimating function U(β) be continuous, so that the asymptotic slope matrix A can be evaluated directly, as in the missing-data example. For general estimating functions, we propose the iterative algorithm

{\hat{β}}^{(k + 1)} = arg min_{β} ∣ ∣ U (β) - n \Sigma_{λ} ({\hat{β}}^{(k)}) β ∣ ∣, k \geq 0,

where β̂⁽⁰⁾ is a minimizer of ||U(β)||. For the penalized Buckley–James estimator, there is a simple iterative algorithm,

{\hat{β}}^{(k + 1)} = {X^{T} X + n \Sigma_{λ} ({\hat{β}}^{(k)})}^{- 1} X^{T} ξ ({\hat{β}}^{(k)}), k \geq 0,

where β̂⁽⁰⁾ is the original Buckley–James estimator and ξ(β) = [ξ₁(β), …, ξ_n(β)]^T. In each algorithm, we iterate until convergence; the final solution is an approximate solution to the penalized estimating equation U^P(β) = 0. To improve numerical stability, we standardize each predictor to have mean 0 and variance 1.
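In the uncensored special case (Δ_i = 1 for all i), ξ(β) reduces to y and the last iteration becomes a ridge-type update for penalized least squares. The sketch below illustrates only that special case, with the SCAD penalty; with censoring, `y` would be replaced at each step by the Buckley–James imputed response ξ(β̂^(k)), which is not implemented here. The function names, the convergence tolerance, and the final rule that sets tiny coefficients to exactly zero are assumptions for illustration.

```python
import numpy as np

def q_scad(theta, lam, a=3.7):
    """SCAD derivative q_lambda(|theta|)."""
    t = np.abs(np.asarray(theta, dtype=float))
    tail = np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam)
    return lam * np.where(t < lam, 1.0, tail)

def penalized_ls_scad(X, y, lam, a=3.7, eps=1e-6, tol=1e-8,
                      max_iter=500, zero_tol=1e-4):
    """Iterate beta <- (X'X + n*Sigma_lambda(beta))^{-1} X'y until convergence."""
    n, d = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    beta = np.linalg.solve(XtX, Xty)              # beta^(0): unpenalized least squares
    for _ in range(max_iter):
        Sigma = np.diag(q_scad(beta, lam, a) / (eps + np.abs(beta)))
        beta_new = np.linalg.solve(XtX + n * Sigma, Xty)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    beta[np.abs(beta) < zero_tol] = 0.0           # assumed thresholding of near-zero values
    return beta
```

The perturbation ε keeps Σ_λ finite as a coefficient approaches zero, which is why the iterates shrink toward zero without ever reaching it exactly and a final thresholding step is used.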

We need to choose λ for LASSO, ALASSO, and hard thresholding penalty functions, (a, λ) for the SCAD penalty and (λ₁, λ₂) for the EN penalty. Fan and Li (2001, 2002) showed that the choice of a ≡ 3.7 performs well in a variety of situations; we use their suggestion throughout our numerical analyses. Zou and Hastie (2005) showed that the EN estimator is equivalent to an ℓ₁-penalty on augmented data. In the rest of this section, we include the subscript λ on β̂ (i.e., β̂_λ) to stress the dependence of the estimator on the regularization parameter λ. In the case of EN penalty, it is understood that cross-validation is two-dimensional.

For uncensored data, Tibshirani (1996) and Fan and Li (2001) suggested the following generalized cross-validation (GCV) statistic (Wahba 1985):

G C V^{†} (λ) = \frac{RSS (λ) / n}{{1 - d (λ) / n}^{2}},

where RSS(λ) is the residual sum of squares ||y − Xβ̂_λ||², and d(λ) is the effective number of parameters, that is, d(λ) = tr[{Â + Σ_λ(β̂_λ)}⁻¹Â^T]. Note that the intercept is omitted in RSS(λ), because y may be centered at $n^{- 1} \sum_{i = 1}^{n} Y_{i}$ . When the Y_i’s are potentially censored, d(λ) still may be considered the effective number of parameters; however, RSS(λ) is unknown. We propose estimating n⁻¹ RSS(λ) by

\hat{ν} (λ) = \frac{\sum_{i = 1}^{n} Δ_{i} {(Y_{i} - \hat{α} - {\hat{β}}_{λ}^{T} x_{i})}^{2} / \hat{K} (Y_{i})}{\sum_{i = 1}^{n} Δ_{i} / \hat{K} (Y_{i})},

where K̂(t) is the Kaplan–Meier estimator for K(t) = P(C > t), and $\hat{α} = n^{- 1} \sum_{i = 1}^{n} ξ_{i} ({\hat{β}}^{(0)})$ . For missing data, we propose estimating n⁻¹ RSS(λ) by

\hat{ν} (λ) = \frac{\sum_{i = 1}^{n} I (R_{i} = \infty) {(Y_{i} - {\hat{β}}_{λ}^{T} x_{i})}^{2} / \tilde{π} (\infty, Z_{i}, \hat{η})}{\sum_{i = 1}^{n} I (R_{i} = \infty) / \tilde{π} (\infty, Z_{i}, \hat{η})} .

Both proposals are based on large-sample arguments: namely, ν̂(λ) is a consistent estimator for lim n⁻¹ RSS(λ) for fixed λ under conditional independence between censoring and failure time distribution, for censored outcome data, and under the MAR assumption for missing data (cf. Tsiatis 2006, chap. 6). Thus our GCV statistic is

GCV (λ) = \frac{\hat{ν} (λ)}{{1 - d (λ) / n}^{2}},

and we select λ̂ = arg min_λGCV(λ).
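The pieces of the GCV criterion for the missing-data case can be sketched as follows. The inverse-probability-weighted ν̂(λ), the trace formula for d(λ), and the grid search follow the displays above; the function names, the argument layout, and the assumption that the fitted complete-case probabilities `pi_complete` are supplied by the user are illustrative.

```python
import numpy as np

def nu_hat_ipw(y, X, beta_hat, complete, pi_complete):
    """Weighted estimate of RSS/n using complete cases, each weighted by 1/pi."""
    complete = np.asarray(complete, dtype=bool)
    w = complete / np.asarray(pi_complete, dtype=float)
    # Rows with missing predictors contribute nothing, whatever their residual is.
    resid_sq = np.where(complete, (y - X @ beta_hat) ** 2, 0.0)
    return np.sum(w * resid_sq) / np.sum(w)

def effective_df(A_hat, Sigma_lam):
    """d(lambda) = tr[(A_hat + Sigma_lambda)^{-1} A_hat^T]."""
    return np.trace(np.linalg.solve(A_hat + Sigma_lam, A_hat.T))

def gcv(nu_hat, d_eff, n):
    """GCV(lambda) = nu_hat / (1 - d(lambda)/n)^2."""
    return nu_hat / (1.0 - d_eff / n) ** 2

def select_lambda(lams, nu_hats, d_effs, n):
    """Return the lambda on the grid with the smallest GCV score."""
    scores = [gcv(v, d, n) for v, d in zip(nu_hats, d_effs)]
    return lams[int(np.argmin(scores))]
```

For the EN penalty the same search would run over a two-dimensional grid of (λ₁, λ₂) pairs rather than a single sequence.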

5. SIMULATION STUDIES

5.1 Censored Data

We simulated 1,000 data sets of size n from the model

Y_{i} = β^{T} x_{i} + σ ε_{i}, i = 1, \dots, n,

where β = (3, 1.5, 0, 0, 2, 0, 0, 0)^T and ε_i and x_i are independent standard normal with the correlation between the jth and kth components of x equal to .5^{|j−k|}. This model was considered by Tibshirani (1996) and Fan and Li (2001). We set the censoring distribution to be uniform(0,τ), where τ was chosen to yield approximately 30% censoring. We compared the model error, ME ≡ (β̂−β)^T E(xx^T)(β̂−β), of the proposed penalized estimator with that of the original Buckley–James estimator using the median relative model error (MRME). We also compared the average numbers of regression coefficients that are correctly or incorrectly shrunk to 0. The results are presented in Table 1, where oracle pertains to the situation in which we know a priori which coefficients are nonzero.
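A minimal sketch of this simulation design is given below. It generates one censored data set and evaluates the model error for a candidate estimate. The value of τ is left as an argument because the value used to reach about 30% censoring is not reported here; the function names and the literal comparison of Y with a uniform(0, τ) censoring time are assumptions.

```python
import numpy as np

def ar1_cov(d, rho=0.5):
    """Covariance of x: entry (j, k) equals rho^{|j-k|}."""
    idx = np.arange(d)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def simulate_censored(n, beta, sigma, tau, rng):
    """Draw (x, observed response, censoring indicator) from the stated design."""
    d = len(beta)
    x = rng.multivariate_normal(np.zeros(d), ar1_cov(d), size=n)
    y = x @ beta + sigma * rng.standard_normal(n)
    c = rng.uniform(0.0, tau, size=n)        # censoring times; tau is user-supplied
    y_obs = np.minimum(y, c)
    delta = (y <= c).astype(int)             # 1 if the response is observed
    return x, y_obs, delta

def model_error(beta_hat, beta, cov_x):
    """ME = (beta_hat - beta)^T E(xx^T) (beta_hat - beta); E(xx^T) = cov_x since E(x) = 0."""
    diff = np.asarray(beta_hat, dtype=float) - np.asarray(beta, dtype=float)
    return float(diff @ cov_x @ diff)
```

The relative model error of a penalized fit is then its ME divided by the ME of the unpenalized Buckley–James fit on the same data set, and MRME is the median of that ratio over replicates.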

Table 1.

Simulation results on model selection with censored data: MRME and the average number of correct (C) and incorrect (I) 0’s

		Average number of 0’s
Method	MRME (%)	C	I
n = 50, σ = 3
SCAD	69.48	4.73	.35
Hard	73.41	4.30	.17
LASSO	66.16	3.99	.11
ALASSO	57.77	4.40	.17
EN	76.48	3.54	.08
Oracle	32.76	5	0
n = 50, σ = 1
SCAD	40.11	4.78	.01
Hard	69.79	4.18	.01
LASSO	64.48	3.97	.01
ALASSO	48.21	4.90	.01
EN	95.55	3.49	0
Oracle	31.30	5	0

The performance of the proposed estimator with the SCAD, hard thresholding, and ALASSO penalties approached that of the oracle estimator as n increased. When the signal-to-noise ratio was large (e.g., large n or small σ), oracle methods (SCAD, hard thresholding, ALASSO) outperformed LASSO and EN in terms of model error and model complexity. On the other hand, LASSO and EN tended to perform better than the oracle methods as σ/n increased.

Table 2 reports the results on the accuracy of the proposed resampling technique in estimating the variances of the nonzero estimated regression coefficients. The standard deviation (SD) pertains to the median absolute deviation of the estimated regression coefficients divided by .6745. The median of the standard error estimates, denoted by SD_m, gauges the performance of the resampling procedure. Evidently, the resampling procedure yielded reasonable standard error estimates, particularly for large n.

Table 2.

Simulation results on standard error estimation for the nonzero coefficients (β₁, β₂, β₅) in least squares regression with censored data

	β̂₁		β̂₂		β̂₅
	SD	SD_m	SD	SD_m	SD	SD_m
SCAD	.145	.129	.135	.128	.128	.114
Hard	.151	.130	.145	.129	.138	.119
LASSO	.160	.134	.145	.143	.161	.130
ALASSO	.149	.132	.130	.133	.133	.113
EN	.172	.113	.151	.111	.155	.103
Oracle	.144	.129	.136	.126	.143	.111

NOTE: SD refers to the median absolute deviation of the estimated regression coefficients divided by .6745; SD_m, to the median of the standard error estimates. The table entries are for a sample size n = 100 and (error) standard deviation σ = 1.

5.2 Missing Data

We simulated 1,000 datasets of size n from the model

Y_{i} = β^{T} x_{i} + σ ε_{i}, i = 1, \dots, n,

where ε_i and x_i are independent standard normal with the correlation between the jth and kth components of x equal to .5^{|j−k|}. We considered two scenarios:

Model 1:

β = {(.25, .5, 0, 0, .75, 1.5, .75, 0, 0, 1)}^{T}

and

Model 2:

β = {(0, 1.25, 0, 0, 0, 2, 0, 0, 0, 1.5)}^{T} .

For a random design X, define the theoretical R²

R^{2} = \frac{β_{0}^{T} E (x x^{T}) β_{0}}{β_{0}^{T} E (x x^{T}) β_{0} + σ^{2}} .
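The theoretical R² can be evaluated directly for the two coefficient vectors above, taking E(xx^T) to be the matrix with (j, k) entry .5^{|j−k|} stated for this design. The function name below is an assumption for illustration.

```python
import numpy as np

def theoretical_r2(beta, sigma, rho=0.5):
    """R^2 = b' E(xx') b / (b' E(xx') b + sigma^2) with E(xx')_{jk} = rho^{|j-k|}."""
    beta = np.asarray(beta, dtype=float)
    idx = np.arange(beta.size)
    cov_x = rho ** np.abs(idx[:, None] - idx[None, :])
    signal = beta @ cov_x @ beta
    return signal / (signal + sigma ** 2)

model1 = [.25, .5, 0, 0, .75, 1.5, .75, 0, 0, 1]
model2 = [0, 1.25, 0, 0, 0, 2, 0, 0, 0, 1.5]

for name, b in (("model 1", model1), ("model 2", model2)):
    for sigma in (1.0, 2.0):
        print(name, sigma, round(theoretical_r2(b, sigma), 3))
```

Running the loop gives values close to one another across the two models at each σ, even though the models differ in how many coefficients are nonzero.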

For σ = 1 and 2, both models 1 and 2 have theoretical R² = .89 and .67, respectively. Although models 1 and 2 have the same theoretical R², they have differing numbers of nonzero coefficients; the number of nonzero coefficients over the total number of coefficients (i.e., d = 10) in a given model is sometimes referred to as the model fraction. The model fraction in model 1 is .6, whereas model 2 has a model fraction of .3. We simulated data such that subjects fall into one of three categories: R = 1 means that the subject was missing (x₁, x₂), R = 2 means that the subject was missing x₁, and R = ∞ means that the subject had complete data. The observed data {R, G_R(Z)} were generated in the following sequence of steps:

Simulate a Bernoulli random variable B₁ with probability λ̃₁{G₁(Z_i), η}.
If B₁ = 1, then set R = 1; otherwise, continue.
Simulate a Bernoulli random variable B₂ with probability λ̃₂{G₂(Z_i), η}.
If B₂ = 1, then set R = 2; otherwise, set R = ∞.

We formulated the missingness process by logistic models

logit {\tilde{λ}}_{1} {G_{1} (Z_{i})} = η_{10} + η_{11} Y_{i} + \sum_{j = 3}^{10} η_{1 j} x_{i j}

and

logit {\tilde{λ}}_{2} {G_{2} (Z_{i})} = η_{20} + η_{21} Y_{i} + \sum_{j = 2}^{10} η_{2 j} x_{i j},

where

η_{1} = {(- 6, .75, 0, 0, 1.25, 1.5, 1.25, 0, 0, 1.25)}^{T}

and

η_{2} = {(- 1.5, .5, 1.5, 0, 0, .5, .5, .5, 0, 0, .5)}^{T} .

These models yielded approximately 40% missing with subjects falling in the R = 1 and R = 2 categories in roughly equal proportions.
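The sequential generation of the missingness indicator can be sketched as follows, assuming that the first logistic model uses (Y, x₃, …, x₁₀) with coefficient vector η₁ and the second uses (Y, x₂, …, x₁₀) with coefficient vector η₂, in the order intercept, Y, predictors. The function names are illustrative, and `np.inf` is used to code R = ∞.

```python
import numpy as np

ETA1 = np.array([-6, .75, 0, 0, 1.25, 1.5, 1.25, 0, 0, 1.25])    # intercept, Y, x3..x10
ETA2 = np.array([-1.5, .5, 1.5, 0, 0, .5, .5, .5, 0, 0, .5])     # intercept, Y, x2..x10

def expit(u):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-u))

def simulate_missingness(y, x, rng):
    """Return R in {1, 2, inf}: try R = 1 first, then R = 2, else complete data."""
    n = len(y)
    lam1 = expit(ETA1[0] + ETA1[1] * y + x[:, 2:] @ ETA1[2:])
    lam2 = expit(ETA2[0] + ETA2[1] * y + x[:, 1:] @ ETA2[2:])
    b1 = rng.random(n) < lam1                 # Bernoulli(lam1)
    b2 = rng.random(n) < lam2                 # Bernoulli(lam2), used only when b1 is False
    return np.where(b1, 1.0, np.where(b2, 2.0, np.inf))

def mask_predictors(x, R):
    """Blank out x1 and x2 when R = 1, and x1 alone when R = 2."""
    x_obs = x.astype(float).copy()
    x_obs[R == 1.0, :2] = np.nan
    x_obs[R == 2.0, 0] = np.nan
    return x_obs
```

Drawing the second Bernoulli variable for every subject and discarding it when R = 1 has the same distribution as the stepwise description and keeps the code vectorized.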

Table 3 presents the numerical results with n = 250. Oracle methods (SCAD, hard thresholding, ALASSO) performed better than LASSO and EN in terms of relative model error and complexity when there were a few strong predictors of response, as in model 2; however, oracle methods performed worse than LASSO and EN when there were many weakly significant predictors, as in model 1.

Table 3.

Simulation results on model selection with missing data: MRME and the average number of correct (C) and incorrect (I) 0’s

	Model 1			Model 2
		Average number of 0’s			Average number of 0’s
Method	MRME (%)	C	I	MRME (%)	C	I
σ= 1
SCAD	81.79	3.35	.21	42.60	5.56	0
Hard	82.38	3.37	.25	48.73	5.79	.01
LASSO	87.88	2.42	.09	66.49	4.11	0
ALASSO	82.24	3.55	.23	37.74	6.25	0
EN	85.59	2.38	.08	70.56	3.92	0
σ= 2
SCAD	93.64	3.33	.69	48.73	5.92	.02
Hard	90.10	3.70	1.12	46.24	6.37	.05
LASSO	82.29	2.54	.40	59.96	4.56	.02
ALASSO	82.01	3.37	.70	48.87	6.08	.02
EN	88.62	2.55	.44	66.17	4.63	.03


NOTE: For σ = 1 and σ = 2, models 1 and 2 have theoretical R² = .89 and .67, respectively; however, the number of nonzero coefficients is six in model 1 but only three in model 2.

6. THE PAUL COVERDELL STROKE REGISTRY

The Paul Coverdell National Acute Stroke Registry collects demographic, quantitative, and qualitative factors related to acute stroke care in four prototype states: Georgia, Massachusetts, Michigan, and Ohio (Paul Coverdell Prototype Registries Writing Group 2005). The goals of the registry include gaining a better understanding of factors associated with stroke and generally improving the quality of acute stroke care in the United States. For the purpose of illustration, we consider a subset of 800 patients with hemorrhagic or ischemic stroke from the Georgia prototype registry. Our data set includes nine predictors and a hospital length of stay (LOS) endpoint, defined as the number of days from hospital admission to hospital discharge. Conclusions from analyses like ours would be important to investigators in health policy and management, for example. The complete registry data for all four prototypes consist of several thousand hospital admissions and have not been released publicly. A more comprehensive analysis is ongoing.

Our data include the following nine predictors: Glasgow coma scale (GCS; 3–15, with 15 representing excellent health), serum albumin, creatinine, glucose, age, sex (1 if male), race (1 if white), whether or not the patient was admitted to the intensive care unit (ICU; 1 if yes), and stroke subtype (1 if hemorrhagic; 0 if ischemic). Of the 800 patients, 419 (52.4%) had complete data (i.e., R = ∞), 94 (11.8%) were missing both GCS and serum albumin (i.e., R = 1), and 287 (35.9%) were missing only GCS (i.e., R = 2).

Table 4 presents estimates for the nuisance parameter η in the stroke data. We see that the subjects missing both GCS and albumin (i.e., R = 1) tended to have higher creatinine and glucose levels but were less likely to be admitted to the ICU on admission to the hospital. Ischemic stroke and ICU admission were strongly associated with missing GCS score (i.e., R = 2) only. Because the missingness mechanism is related to other important prognostic variables, this is mild evidence that the missing completely at random (MCAR) assumption is not well supported, and variable selection techniques based on such an assumption will lead to incorrect conclusions. Our analyses using methods described in Section 2 assuming data missing at random (MAR) are displayed in Table 5.

Table 4.

Estimates of η in the stroke data, where η pertains to the parameters in the coarsening models λ̃₁{G₁(Z)} and λ̃₂{G₂(Z)}

	η₁	η₂
(int)	−2.342_(.152)	.478_(.082)
Albumin		−.112_(.089)
Creatinine	−.492_(.291)	−.101_(.091)
Sex	−.172_(.113)	.043_(.079)
Glucose	−.286_(.164)	−.067_(.084)
ICU	−.470_(.155)	−.304_(.091)
Age	.045_(.124)	.006_(.087)
Type	−.101_(.144)	−.213_(.094)
Race	.084_(.122)	−.034_(.085)
LOS	−.007_(.140)	−.045_(.092)
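As a rough reading aid for Table 4, the estimates can be divided by their standard errors. The sketch below does this for the rows discussed in the text; treating the ratio as an approximately standard normal Wald statistic is an assumption made here for illustration, not a procedure described in the paper, and the dictionary name `ETA_HAT` is hypothetical.

```python
# (estimate, standard error) pairs transcribed from Table 4 for selected rows.
ETA_HAT = {
    ("eta1", "Creatinine"): (-0.492, 0.291),
    ("eta1", "Glucose"): (-0.286, 0.164),
    ("eta1", "ICU"): (-0.470, 0.155),
    ("eta2", "ICU"): (-0.304, 0.091),
    ("eta2", "Type"): (-0.213, 0.094),
}


def wald_ratio(estimate, std_error):
    """Estimate divided by its standard error."""
    return estimate / std_error


ratios = {key: wald_ratio(*pair) for key, pair in ETA_HAT.items()}

for (model, covariate), z in ratios.items():
    print(f"{model:5s} {covariate:11s} z = {z:6.2f}")
```

Under that normal approximation, the ICU coefficients in both coarsening models and the stroke-type coefficient in the second model exceed 2 in absolute value, whereas the creatinine and glucose ratios in the first model are closer to 1.7.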


Table 5.

Estimated regression coefficients and their standard errors in the stroke data

	Full	SCAD	Hard	LASSO	ALASSO	EN
GCS	−.762_(.327)	−.603_(.434)	−.864_(.587)	−.681_(.480)	−.584_(.400)	−.628_(.424)
Albumin	−1.142_(.306)	−.958_(.450)	−1.043_(.486)	−.984_(.425)	−.876_(.402)	−.882_(.387)
Creatinine	−.726_(.331)	−.372_(.177)	−.734_(.347)	−.529_(.255)	−.365_(.179)	−.402_(.199)
Sex	−.007_(.288)	0₍₋₎	0₍₋₎	0₍₋₎	0₍₋₎	0₍₋₎
Glucose	−.312_(.310)	0₍₋₎	0₍₋₎	−.140_(.165)	0₍₋₎	−.030_(.039)
ICU	1.861_(.323)	2.043_(.442)	1.970_(.469)	1.807_(.419)	1.947_(.415)	1.771_(.392)
Age	−.696_(.324)	−.293_(.203)	−.678_(.465)	−.586_(.369)	−.405_(.260)	−.516_(.312)
Type	.553_(.333)	.200_(.155)	0₍₋₎	.448_(.335)	.213_(.158)	.381_(.273)
Race	−1.316_(.315)	−1.403_(.374)	−1.320_(.366)	−1.216_(.331)	−1.242_(.332)	−1.151_(.310)


We use λ̂ = (.28, .63, .11, .16) for the SCAD, Hard, LASSO, and ALASSO estimates, respectively, and use (λ̂₁, λ̂₂) = (.34, .9) for the EN estimates. Table 5 presents the regression coefficient estimates for the stroke data. Higher levels of albumin and creatinine are strongly related to shorter LOS, whereas admission to the ICU is associated with longer LOS. Older patients tend to have shorter LOS than younger patients; this is most easily explained by the fact that many older stroke patients quickly die in the hospital because their bodies are too weak to recover. Patients with hemorrhagic strokes have longer recovery periods and thus longer LOS. White stroke patients tend to have shorter LOS than non-whites. Finally, sex and glucose are weak predictors of LOS. The LASSO and EN estimates tend to retain more predictors in the final model and thus have more complex models compared with the other penalized estimators. Among the SCAD, Hard, and ALASSO estimates, SCAD and ALASSO yielded similar coefficient estimates, whereas the Hard thresholding estimates yielded the sparsest model. Our methods yielded models that appear to have reasonable scientific interpretation and do not make a strong MCAR assumption, an assumption that is not supported by the data.
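The sparsity comparison in the preceding paragraph can be read directly off Table 5. The short sketch below transcribes the point estimates (standard errors omitted) and lists which predictors each method sets to zero; the names `TABLE5` and `dropped` are hypothetical and used only for this illustration.

```python
PREDICTORS = ["GCS", "Albumin", "Creatinine", "Sex", "Glucose",
              "ICU", "Age", "Type", "Race"]

# Point estimates transcribed from Table 5, in the order of PREDICTORS.
TABLE5 = {
    "SCAD":   [-.603, -.958, -.372, 0, 0, 2.043, -.293, .200, -1.403],
    "Hard":   [-.864, -1.043, -.734, 0, 0, 1.970, -.678, 0, -1.320],
    "LASSO":  [-.681, -.984, -.529, 0, -.140, 1.807, -.586, .448, -1.216],
    "ALASSO": [-.584, -.876, -.365, 0, 0, 1.947, -.405, .213, -1.242],
    "EN":     [-.628, -.882, -.402, 0, -.030, 1.771, -.516, .381, -1.151],
}


def dropped(method):
    """Predictors whose coefficient is exactly zero for the given method."""
    return [p for p, b in zip(PREDICTORS, TABLE5[method]) if b == 0]


for method in TABLE5:
    print(f"{method:6s} drops {len(dropped(method))}: {dropped(method)}")
```

LASSO and EN remove only sex, SCAD and ALASSO remove sex and glucose, and hard thresholding additionally removes stroke type, which matches the ordering of model complexity described in the text.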

7. REMARKS

We have developed a general methodology for selecting variables and simultaneously estimating their regression coefficients in semiparametric models. This development overcomes two major challenges that are not present with any of the existing variable selection methods. First, U^P (β) may not correspond to the derivative of an objective function or to quasi-likelihood, so that the mathematical arguments used by previous authors to establish the asymptotic properties of penalized maximum likelihood or penalized GEE estimators do not apply. Second, U^P (β) may be discrete in β, which entails considerable theoretical and computational challenges. In particular, the variances of the estimated regression coefficients cannot be evaluated directly, and we have developed a novel resampling procedure, which also can be used for variance estimation without the need for variable selection. Our simulation results indicate that the resampling method works well for modest sample sizes.

Rank estimators (Prentice 1978; Tsiatis 1990; Wei, Ying, and Lin 1990; Lai and Ying 1991b; Ying 1993) provide potential alternatives to the Buckley–James estimator but are computationally more demanding to implement (cf. Johnson 2008). In general, rank-estimating functions do not correspond to the derivatives of any objective functions. This is also true of estimating functions for many other semiparametric problems. In all of those situations, we can use Theorem 1 to establish the asymptotic properties of the corresponding variable selection procedures and use the proposed resampling technique to estimate the variances of the selected variables.

The proportional hazards and accelerated failure time models cannot hold simultaneously unless the error distribution is extreme value. Thus, it is useful to have variable selection methods for both models at one’s disposal, because one model may fit the data better than another. A major advantage of model (1) is that the regression coefficients have a direct physical interpretation. Hazard ratio can be an awkward concept, especially when the response variable does not pertain to failure time.

Acknowledgments

This research was supported by National Institutes of Health grants P30 ES10126, T32 ES007018, and R03 AI068484 (B.J.); R37 GM047845 (D.L.); and R01 CA082659 (D.L. and D.Z.). The authors thank Paul Weiss for preparing the stroke data set.

APPENDIX A: PROOF OF THEOREM 1

To prove part a, we consider $\hat{β} = {({\hat{β}}_{1}^{T}, 0^{T})}^{T}$, where ${\hat{β}}_{1} = β_{01} + n^{- 1} A_{11}^{- 1} U_{1} (β_{0})$. Because $n^{1 / 2} q_{λ_{n}} (∣ β_{0 j} ∣) \to 0$, j = 1, …, s, under condition C.2a and $\hat{β} = β_{0} + O_{p} (n^{- 1 / 2})$, we have

n^{- 1 / 2} U_{j}^{P} (\hat{β} \pm ε e_{j}) = o_{P} (1) - n^{1 / 2} q_{λ_{n}} (∣ {\hat{β}}_{j} \pm ε ∣) = o_{p} (1) .

Under condition C.2b, for j = s + 1, …, d, $n^{- 1 / 2} U_{j}^{P} (\hat{β} + ε e_{j})$ and $n^{- 1 / 2} U_{j}^{P} (\hat{β} - ε e_{j})$ are dominated by $- n^{1 / 2} q_{λ_{n}} (ε)$ and $n^{1 / 2} q_{λ_{n}} (ε)$, so they have opposite signs when ε goes to 0. Therefore, β̂ is an approximate zero-crossing by definition.

To prove part b, we consider the sets in the probability space C_j = {β̂_j ≠ 0}, j = s + 1, …, d. It suffices to show that for any ε > 0, when n is sufficiently large, P(C_j) < ε. Because $\hat{β}_{j} = O_{p} (n^{- 1 / 2})$, there exists some M such that when n is large enough,

P (C_{j}) < ε / 2 + P {{\hat{β}}_{j} \neq 0, ∣ {\hat{β}}_{j} ∣ < M n^{- 1 / 2}} .

Using the jth component of the penalized estimating function and the definition of the approximate zero-crossing, we obtain that on the set $\{ {\hat{β}}_{j} \neq 0, ∣ {\hat{β}}_{j} ∣ < M n^{- 1 / 2} \}$,

o_{p} (1) = \{ n^{- 1 / 2} U_{j} (β_{0}) + n^{1 / 2} A_{j} (\hat{β} - β_{0}) + o_{p} (1) - n^{1 / 2} q_{λ_{n}} (∣ {\hat{β}}_{j} ∣) sgn ({\hat{β}}_{j}) \}^{2},

where A_j is the j th row of A. The first three terms on the right side are of order O_p(1). As a result, there exists some M′ such that for large n,

P ({\hat{β}}_{j} \neq 0, ∣ {\hat{β}}_{j} ∣ < M n^{- 1 / 2}, n^{1 / 2} q_{λ_{n}} (∣ {\hat{β}}_{j} ∣) > M^{'}) < ε / 2.

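Because $\sqrt{n} \inf_{∣ θ ∣ \leq M n^{- 1 / 2}} q_{λ_{n}} (∣ θ ∣) \to \infty$ as n → ∞ by condition C.2b, ${\hat{β}}_{j} \neq 0$ and $∣ {\hat{β}}_{j} ∣ < M n^{- 1 / 2}$ imply that $n^{1 / 2} q_{λ_{n}} (∣ {\hat{β}}_{j} ∣) > M^{'}$ for large n. Thus $P ({\hat{β}}_{j} \neq 0, ∣ {\hat{β}}_{j} ∣ < M n^{- 1 / 2}) = P ({\hat{β}}_{j} \neq 0, ∣ {\hat{β}}_{j} ∣ < M n^{- 1 / 2}, n^{1 / 2} q_{λ_{n}} (∣ {\hat{β}}_{j} ∣) > M^{'})$. Therefore, $P (C_{j}) < ε / 2 + P ({\hat{β}}_{j} \neq 0, ∣ {\hat{β}}_{j} ∣ < M n^{- 1 / 2}, n^{1 / 2} q_{λ_{n}} (∣ {\hat{β}}_{j} ∣) > M^{'}) < ε$.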
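The two rate requirements on the penalty derivative used in parts a and b can be checked numerically for a concrete penalty. The sketch below uses the SCAD derivative of Fan and Li (2001) with a = 3.7; the tuning sequence λ_n = n^{−1/4}, the nonzero coefficient 0.5, and the constant M = 2 are illustrative assumptions, not choices made in the paper.

```python
import numpy as np


def scad_q(theta, lam, a=3.7):
    """SCAD penalty derivative q_lam(|theta|) from Fan and Li (2001)."""
    theta = np.abs(np.asarray(theta, dtype=float))
    tail = np.maximum(a * lam - theta, 0.0) / ((a - 1.0) * lam)
    return lam * np.where(theta <= lam, 1.0, tail)


def rate_check(n, beta0=0.5, M=2.0):
    """Return the two quantities appearing in conditions C.2a and C.2b.

    lam_n = n**(-1/4) is a hypothetical sequence with lam_n -> 0 and
    sqrt(n) * lam_n -> infinity.
    """
    lam = n ** -0.25
    # sqrt(n) * q(|beta_0j|) for a nonzero coefficient: should tend to 0.
    at_signal = np.sqrt(n) * scad_q(beta0, lam)
    # sqrt(n) * inf over |theta| <= M / sqrt(n) of q: should diverge.
    grid = np.linspace(0.0, M / np.sqrt(n), 201)
    near_zero = np.sqrt(n) * scad_q(grid, lam).min()
    return float(at_signal), float(near_zero)


for n in (10**2, 10**4, 10**6):
    print(n, rate_check(n))
```

For the larger sample sizes the first quantity is exactly zero, because 0.5 exceeds aλ_n and the SCAD derivative vanishes there, while the second equals √n λ_n and keeps growing. This is the behavior the argument relies on: the penalty leaves nonzero coefficients essentially unpenalized and dominates the estimating function in an n^{−1/2} neighborhood of zero.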

To prove the second part of part b, because

\begin{array}{l} o_{p} (1) = n^{- 1 / 2} U_{1} (β_{0}) + n^{1 / 2} A_{11} ({\hat{β}}_{1} - β_{01}) \\ - n^{1 / 2} q_{λ_{n}} (∣ {\hat{β}}_{1} ∣) sgn ({\hat{β}}_{1}), \end{array}

after the Taylor series expansion of the last term, we conclude that

\begin{array}{l} n^{1 / 2} (A_{11} + Σ_{11}) \{ {\hat{β}}_{1} - β_{01} + {(A_{11} + Σ_{11})}^{- 1} b_{n} \} \\ = - n^{- 1 / 2} {(U_{1} (β_{0}), …, U_{s} (β_{0}))}^{T} + o_{p} (1) \to_{d} N (0, V_{11}) . \end{array}

To prove part c, we consider β₁ ∈ R^s on the boundary of a ball around β₀₁, that is, $β_{1} = β_{01} + n^{- 1 / 2} u$ with |u| = r for a fixed constant r. From the penalized estimating function $U_{1}^{P}$, we have

\begin{array}{l} n^{- 1 / 2} {(β_{1} - β_{01})}^{T} A_{11}^{T} U_{1}^{P} (β) \\ = {(β_{1} - β_{01})}^{T} A_{11}^{T} \{ n^{- 1 / 2} U_{1} (β) - n^{1 / 2} q_{λ_{n}} (∣ β_{1} ∣) sgn (β_{1}) \} \\ = O_{p} (∣ β_{1} - β_{01} ∣) + n^{1 / 2} {(β_{1} - β_{01})}^{T} A_{11}^{T} A_{11} (β_{1} - β_{01}) \\ - n^{1 / 2} {(β_{1} - β_{01})}^{T} A_{11}^{T} diag \{ q_{λ_{n}}^{'} (∣ β_{j}^{*} ∣) sgn (β_{0 j}) \} (β_{1} - β_{01}), \end{array}

where $β_{j}^{*}$ is between β_j and β₀_j for j = 1, …, s. Because A₁₁ is nonsingular, the second term on the right side is larger than $a_{0} r^{2} n^{- 1 / 2}$, where a₀ is the smallest eigenvalue of $A_{11}^{T} A_{11}$. The first term is of order $r O_{p} (n^{- 1 / 2})$. Because ${max}_{j} q_{λ_{n}}^{'} (∣ β_{j}^{*} ∣) \to 0$, the third term is dominated by the second term. Therefore, for any ε, if we choose r sufficiently large so that for large n, the probability that the absolute value of the first term is larger than the second term is less than ε, we then have

P [min_{∣ β_{1} - β_{01} ∣ = n^{- 1 / 2} r} {(β_{1} - β_{01})}^{T} A_{11}^{T} U_{1}^{P} ({(β_{1}^{T}, 0^{T})}^{T}) > 0] > 1 - ε .

Applying the Brouwer fixed-point theorem to the continuous function $U_{1}^{P} ({(β_{1}^{T}, 0^{T})}^{T})$, we see that ${min}_{∣ β_{1} - β_{01} ∣ = n^{- 1 / 2} r} {(β_{1} - β_{01})}^{T} A_{11}^{T} U_{1}^{P} ({(β_{1}^{T}, 0^{T})}^{T}) > 0$ implies that $A_{11}^{T} U_{1}^{P} ({(β_{1}^{T}, 0^{T})}^{T}) = 0$ has a solution within this ball or, equivalently, $U_{1}^{P} ({(β_{1}^{T}, 0^{T})}^{T}) = 0$ has a solution within this ball. That is, we can choose an exact solution $\hat{β} = {({\hat{β}}_{1}^{T}, 0^{T})}^{T}$ to $U_{1}^{P} (β) = 0$ with $\hat{β} = β_{0} + O_{p} (n^{- 1 / 2})$. Thus β̂ is a zero-crossing of U^P (β).

APPENDIX B: CONDITIONAL DISTRIBUTION OF $({\hat{β}}_{1}^{*} - {\hat{β}}_{1})$

Here we justify the resampling procedure for the penalized Buckley–James estimator. Similar justifications can be made for other estimators. Under conditions D.1–D.3, we have the following asymptotic linear expansion for the penalized Buckley–James estimating function:

\begin{array}{l} n^{- 1 / 2} U_{1}^{P} (β) = n^{- 1 / 2} U_{1}^{P} (β_{0}) + (A_{11} + Σ_{11}) n^{1 / 2} (β_{1} - β_{01}) \\ + o (max \{ 1, n^{1 / 2} \| β_{1} - β_{01} \| \}) . \end{array}

(B.1)

In addition,

n^{- 1 / 2} U_{1} (β_{0}) = n^{- 1 / 2} \sum_{i = 1}^{n} w_{1 i} + o (1),

where w₁_i comprises the components of w_i corresponding to β₁, and w_i, i = 1, …, n, as given by Lin and Wei (1992), are n independent mean-0 random vectors. Replacing the unknown quantities in w_i with their sample estimators yields W_i. Recall that ${\hat{β}}_{1}^{*}$ satisfies $U_{1}^{P} ({\hat{β}}_{1}^{*}) = \sum_{i = 1}^{n} W_{1 i} G_{i}$ , where W₁_i comprises the components of W_i corresponding to β̂₁. Applying (B.1) to β̂₁ and ${\hat{β}}_{1}^{*}$ yields

n^{- 1 / 2} \sum_{i = 1}^{n} W_{1 i} G_{i} = (A_{11} + Σ_{11}) n^{1 / 2} ({\hat{β}}_{1}^{*} - {\hat{β}}_{1}) + o (1) .

The conclusion then follows.
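The last display suggests a simple numerical illustration of why the resampling distribution mimics the sampling distribution. The sketch below works with the linearized relation only: it assumes a known matrix H standing in for A₁₁ + Σ₁₁, independent standard normal perturbations G_i, and made-up values of W₁ᵢ. The actual procedure solves the perturbed estimating equation and does not require H; the function names are hypothetical.

```python
import numpy as np


def resample_linearized(W1, H, n_draws, rng):
    """Draw beta1* - beta1_hat from the linearized expansion.

    W1 : (n, s) array whose rows play the role of W_{1i}.
    H  : (s, s) nonsingular matrix standing in for A11 + Sigma11.
    Each draw solves H * n * delta = sum_i W_{1i} G_i with G_i ~ N(0, 1).
    """
    n = W1.shape[0]
    G = rng.standard_normal((n_draws, n))
    perturbed_sums = G @ W1  # row b holds sum_i W_{1i} G_i for draw b
    return np.linalg.solve(H, perturbed_sums.T).T / n


def sandwich(W1, H):
    """Conditional covariance of the linearized draws given the data."""
    n = W1.shape[0]
    H_inv = np.linalg.inv(H)
    return H_inv @ (W1.T @ W1) @ H_inv.T / n**2


rng = np.random.default_rng(1)
W1 = rng.standard_normal((200, 2))      # placeholder influence terms
H = np.array([[2.0, 0.3], [0.1, 1.5]])  # placeholder slope matrix
deltas = resample_linearized(W1, H, 20000, rng)
empirical = np.cov(deltas, rowvar=False)
target = sandwich(W1, H)
```

Here the empirical covariance of the draws agrees with the sandwich form H⁻¹(ΣᵢW₁ᵢW₁ᵢᵀ)H⁻ᵀ/n² up to Monte Carlo error, which is the sense in which the conditional distribution of the resampled estimator can be used for variance estimation.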

Contributor Information

Brent A. Johnson, Assistant Professor, Department of Biostatistics, Emory University, Atlanta, GA 30322 (E-mail: bajohn3@emory.edu)

D. Y. Lin, Dennis Gillings Distinguished Professor (E-mail: lin@bios.unc.edu)

Donglin Zeng, Associate Professor (E-mail: dzeng@bios.unc.edu), Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599.

References

Buckley J, James I. Linear Regression With Censored Data. Biometrika. 1979;66:429–436.
Cai J, Fan J, Li R, Zhou H. Variable Selection for Multivariate Failure Time Data. Biometrika. 2005;92:303–316. doi: 10.1093/biomet/92.2.303.
Cox DR. Regression Models and Life-Tables (with discussion). Journal of the Royal Statistical Society, Ser B. 1972;34:187–202.
Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association. 2001;96:1348–1360.
Fan J, Li R. Variable Selection for Cox’s Proportional Hazards Model and Frailty Model. The Annals of Statistics. 2002;30:74–99.
Fan J, Li R. New Estimation and Model Selection Procedures for Semi-parametric Modeling in Longitudinal Data Analysis. Journal of the American Statistical Association. 2004;99:710–723.
Frank IE, Friedman JH. A Statistical View of Some Chemometrics Regression Tools. Technometrics. 1993;35:109–148.
Fu WJ. Penalized Estimating Equations. Biometrics. 2003;59:126–132. doi: 10.1111/1541-0420.00015.
Hunter DR, Li R. Variable Selection Using MM Algorithms. The Annals of Statistics. 2005;33:1617–1642. doi: 10.1214/009053605000000200.
Johnson BA. Variable Selection in Semiparametric Linear Regression With Censored Data. Journal of the Royal Statistical Society, Ser B. 2008;70:351–370.
Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. 2nd ed. Hoboken, NJ: Wiley; 2002.
Knight K, Fu W. Asymptotics for Lasso-Type Estimators. The Annals of Statistics. 2000;28:1356–1378.
Lai TL, Ying Z. Large Sample Theory of a Modified Buckley–James Estimator for Regression Analysis With Censored Data. The Annals of Statistics. 1991a;19:1370–1402.
Lai TL, Ying Z. Rank Regression Methods for Left-Truncated and Right Censored Data. The Annals of Statistics. 1991b;19:531–556.
Liang KY, Zeger SL. Longitudinal Data Analysis Using Generalized Linear Models. Biometrika. 1986;73:13–22.
Lin JS, Wei LJ. Linear Regression Analysis for Multivariate Failure Time Observations. Journal of the American Statistical Association. 1992;87:1091–1097.
Lin DY, Ying Z. Semiparametric and Nonparametric Regression Analysis of Longitudinal Data (with discussion). Journal of the American Statistical Association. 2001;96:103–126.
Meinshausen N, Bühlmann P. Variable Selection and High-Dimensional Graphs With the Lasso. The Annals of Statistics. 2006;34:1436–1462.
Paul Coverdell Prototype Registries Writing Group. Acute Stroke Care in the US: Results From 4 Pilot Prototypes of the Paul Coverdell National Acute Stroke Registry. Stroke. 2005;36:1232–1240. doi: 10.1161/01.STR.0000165902.18021.5b.
Prentice RL. Linear Rank Tests With Right-Censored Data. Biometrika. 1978;65:167–179.
Ritov Y. Estimation in a Linear Regression Model With Censored Data. The Annals of Statistics. 1990;18:303–328.
Robins JM, Rotnitzky A, Zhao LP. Estimation of Regression Coefficients When Some Regressors Are not Always Observed. Journal of the American Statistical Association. 1994;89:846–866.
Tibshirani RJ. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Ser B. 1996;58:267–288.
Tibshirani RJ. The Lasso Method for Variable Selection in the Cox Model. Statistics in Medicine. 1997;16:385–395. doi: 10.1002/(sici)1097-0258(19970228)16:4<385::aid-sim380>3.0.co;2-3.
Tsiatis AA. Estimating Regression Parameters Using Linear Rank Tests for Censored Data. The Annals of Statistics. 1990;18:354–372.
Tsiatis AA. Semiparametric Theory and Missing Data. New York: Springer; 2006.
van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. New York: Springer; 1996.
Wahba G. A Comparison of GCV and GML for Choosing the Smoothing Parameter in the Generalized Spline Smoothing Problem. The Annals of Statistics. 1985;13:1378–1402.
Wei LJ, Ying Z, Lin DY. Regression Analysis of Censored Survival Data Based on Rank Tests. Biometrika. 1990;77:845–851.
Ying Z. A Large Sample Study of Rank Estimation for Censored Regression Data. The Annals of Statistics. 1993;21:76–99.
Zou H. The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association. 2006;101:1418–1429.
Zou H, Hastie T. Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society, Ser B. 2005;67:301–320.
