VARIABLE SELECTION FOR REGRESSION MODELS WITH MISSING DATA

Ramon I Garcia; Joseph G Ibrahim; Hongtu Zhu

Author manuscript; available in PMC: 2010 Mar 23.

Published in final edited form as: Stat Sin. 2010 Jan;20(1):149–165.

PMCID: PMC2844735 NIHMSID: NIHMS157521 PMID: 20336190

Abstract

We consider the variable selection problem for a class of statistical models with missing data, including missing covariate and/or response data. We investigate the smoothly clipped absolute deviation penalty (SCAD) and adaptive LASSO and propose a unified model selection and estimation procedure for use in the presence of missing data. We develop a computationally attractive algorithm for simultaneously optimizing the penalized likelihood function and estimating the penalty parameters. Particularly, we propose to use a model selection criterion, called the IC_Q statistic, for selecting the penalty parameters. We show that the variable selection procedure based on IC_Q automatically and consistently selects the important covariates and leads to efficient estimates with oracle properties. The methodology is very general and can be applied to numerous situations involving missing data, from covariates missing at random in arbitrary regression models to nonignorably missing longitudinal responses and/or covariates. Simulations are given to demonstrate the methodology and examine the finite sample performance of the variable selection procedures. Melanoma data from a cancer clinical trial is presented to illustrate the proposed methodology.

Key words and phrases: EM algorithm, IC_Q, missing data, penalized likelihood, variable selection

1. Introduction

Variable selection procedures based on penalized likelihood methods have received much attention in the recent literature (Bickel and Li (2006)). Some notable methods include the Lasso, Smoothly Clipped Absolute Deviation penalty (SCAD) (Fan and Li (2001)), and Adaptive Lasso (ALASSO) (Zou (2006)), among many others. These methods have been successfully applied to generalized linear models and robust linear regression (Fan and Li (2001)), and to semiparametric models including Cox’s proportional hazards model (Fan and Li (2002, 2004)). Moreover, under an appropriate choice of the penalty parameter, these variable selection procedures can produce efficient estimates with oracle properties (Fan and Li (2001)). The penalty parameters are typically selected by minimizing some criterion with respect to the penalty parameter. Commonly used criteria include generalized cross-validation (GCV) and the Bayesian Information Criterion (BIC). It has been shown that BIC can identify the true model consistently, whereas GCV cannot (Wang, Li and Tsai (2007)). Ideally, one would like to use a criterion that results in appropriate choices of the penalty parameter so that the penalized likelihood estimates possess oracle properties. However, to the best of our knowledge, a general and easy-to-compute penalty and variable selection procedure is not currently available for missing data problems.

Missing data are a common problem in various settings, including surveys, clinical trials, and longitudinal studies. Responses and/or covariates may be missing, and statistical models for handling the missing data often depend on the missing data mechanism, such as data not missing at random (NMAR), also referred to as nonignorable missingness. For example, when there are NMAR covariates, one must specify both the covariate distribution and the missing data mechanism in the likelihood function. These additional distributions introduce additional parameters that need to be taken into consideration in model selection. It is common to use some model selection criterion, such as AIC or BIC, based on the observed data log-likelihood to select a small set of variables. For instance, one might use AIC (or BIC) to select a small subset of ‘covariates’ that best predicts the outcome of interest. However, even in the absence of missing data, model selection criteria such as AIC can become infeasible for variable selection in linear regression models with a large number of covariates (Fan and Li (2001, 2002)). More discussion on the drawbacks of best subset selection can be found in Fan and Li (2001).

Performing variable selection in statistical models for missing data problems raises several new statistical challenges, underscoring the need for methodological development. In many missing data problems, the observed data log-likelihood is computationally intractable because it requires evaluating high dimensional integrals that have no closed form. These integrals can be approximated, but the accuracy of the approximation is essentially impossible to assess in many cases. Thus, it can be infeasible to directly maximize the observed data log-likelihood function, along with the SCAD or ALASSO penalties, to select important variables and calculate their estimates. Furthermore, computing the GCV and BIC to select the penalty parameter also requires computing the intractable likelihood function and running an optimization algorithm for each penalty parameter, which can be computationally intensive for missing data problems. Thus, it is also critical to develop a new penalty selection criterion that is easy to compute in missing data problems.

The aim of this paper is to develop variable selection and penalty selection procedures, along with the SCAD and ALASSO penalties, for a class of statistical models in missing data problems, including generalized linear models with missing covariates and/or responses, random effects models, and latent variable models. We reformulate the penalty parameters in the SCAD and ALASSO as a hyperparameter in the model, and then we use the EM algorithm to simultaneously optimize the penalized likelihood function and estimate the penalty parameters. In addition, we also develop an alternative method based on optimizing a new criterion, which we call the IC_Q criterion, to select penalty parameters. The variable selection and penalty selection procedures developed here are very general and can be applied to numerous situations involving missing data and/or random effects and latent variables. Under some regularity conditions, we establish the asymptotic properties (e.g., oracle properties) of the penalized maximum likelihood estimator and the consistency of the IC_Q-based penalty selection procedure.

The rest of the paper is organized as follows. Section 2 gives the general development of algorithms for maximizing the penalized likelihood function and selecting penalty parameters in missing data problems. Section 3 characterizes the asymptotic properties of the penalized maximum likelihood (ML) estimator and the IC_Q penalty selection procedure. Section 4 presents a simulation study involving missing at random (MAR) covariates in linear models in order to examine the finite sample performance of the penalized ML estimates using various penalty parameter selection procedures, and then analyzes a Melanoma dataset with the proposed methodology. We conclude the paper with some discussion in Section 5.

2. Variable Selection for Regression Models with Missing Data

2.1. Model formulation

For notational simplicity, we focus on data with MAR or NMAR covariates; however, the methods developed below can be adapted to data with both missing responses and covariates (see Ibrahim, Lipsitz and Chen (2001)). Suppose there are n independent observations (x₁, z₁, y₁), …, (x_n, z_n, y_n), where y_i is the response variable, z_i is a q × 1 vector of partially observed covariates, and x_i is a (p−q)×1 vector of completely observed covariates. Let z_m,i and z_o,i, respectively, denote the missing and observed components of z_i. We use the q × 1 random vector r_i to indicate the missingness of z_i, where the k^th component r_ik = 1 when z_ik is observed and r_ik = 0 when z_ik is missing. We denote the complete and observed data of subject i by D_c,i and D_o,i, respectively, and the entire complete and observed data by D_c and D_o, respectively.

When the covariates are NMAR, the complete data likelihood is the product of the joint distribution of (y_i, z_i, r_i) given x_i, denoted by f (y_i, z_i, r_i|x_i), which is typically specified as a product of three conditional distributions as

f (D_{c}) = \prod_{i = 1}^{n} f (y_{i}, z_{i}, r_{i} ∣ x_{i}, η) = \prod_{i = 1}^{n} f (y_{i} ∣ x_{i}, z_{i}, β, τ) f (z_{i} ∣ x_{i}, α) f (r_{i} ∣ y_{i}, x_{i}, z_{i}, ξ),

(2.1)

where η = (β, τ, α, ξ) are the parameters corresponding to the response model, covariate distribution, and missing data mechanism. We use the generic label f(u₁|u₂) throughout to denote the conditional distribution of u₁ given u₂. If the covariates are MAR, then the missing data mechanism, f(r_i|y_i, x_i, z_i, ξ), can be omitted from (2.1).

As in generalized linear models (see McCullagh and Nelder (1989, Chap. 2)), we assume that the conditional distribution of y_i given (x_i, z_i), denoted by f(y_i|x_i, z_i, β, τ), satisfies

E [y_{i} ∣ x_{i}, z_{i}; β, τ] = μ_{i} = g ((x_{i}^{T}, z_{i}^{T}) β),

(2.2)

where τ denotes the additional parameters in f(y_i|x_i, z_i, β, τ), g(·) is a known link function, and β = (β₁, …, β_p)^T is a p × 1 vector of regression coefficients. In practice, it is common to assume that the distribution of y_i given (x_i, z_i) belongs to the exponential family, such as the binomial, normal, or Poisson (Little and Schluchter (1985), and Ibrahim and Lipsitz (1996)).

We model the missing-data mechanism for NMAR covariates according to either a joint log-linear model for f(r_i|y_i, x_i, z_i, ξ) or a product of a sequence of one-dimensional conditionals as in Ibrahim, Chen and Lipsitz (1999). Finally, we assume that the covariate distribution f(z_i|x_i, α) is also modeled via a sequence of one-dimensional conditional distributions as in Ibrahim, Chen and Lipsitz (1999), and is given by

f (z_{i} ∣ x_{i}, α) = f (z_{i q} ∣ z_{i (q - 1)}, \dots, z_{i 1}, x_{i}, α) \times \dots \times f (z_{i 1} ∣ x_{i}, α),

where we assume a specific order of conditioning.

2.2. Penalized likelihood for variable selection

In the variable selection problem, our objective is to identify nonzero components of β in (2.2) and simultaneously estimate parameters, while accounting for the missing covariate data. We propose to maximize the penalized likelihood function given by

P (η ∣ λ) = \sum_{i = 1}^{n} log f (D_{o, i} ∣ η) - n \sum_{j = 1}^{p} p_{λ_{j}} (∣ β_{j} ∣),

(2.3)

where λ = (λ₁, …, λ_p)^T, λ_j is the penalty parameter corresponding to the j-th regression coefficient β_j, and f (D_o,i|η) = ∫ f(y_i, z_i, r_i|x_i, η)dz_m,i is the observed-data likelihood function of the i-th observation. The penalty function, p_{λ_j} (·), is a nonnegative, nondecreasing, and differentiable function on (0, ∞) (Fan and Li (2001) and Zou (2006)). These properties ensure that the maximization of (2.3) results in estimates of β which are shrunk to zero if they are small. The covariates whose estimates are zero are the insignificant predictors of the response variable, whereas the covariates whose estimates are not zero are the statistically significant predictors. By maximizing (2.3), one can select significant predictors and estimate parameters simultaneously while accounting for the missing data. This approach is in sharp contrast to stepwise selection procedures and Bayesian procedures (George and McCulloch (1993), and Yang, Belin and Boscardin (2005)), which ignore stochastic errors inherited in the selection phase during estimation of the ‘best’ model (Fan and Li (2002)).

In (2.3), the parameters τ, α, and ξ are not penalized, so they are not shrunk to zero even though their actual values may be small. In this sense, variable selection does not occur in the covariate distribution and the missing data mechanism. However, care must be taken in the specification of these distributions since certain specifications can lead to identifiability issues for estimating α, ξ, and thus β.

Because the observed-data log-likelihood function usually involves intractable integration, we use the EM algorithm to compute the penalized maximum likelihood estimate of η, denoted by η̂_λ, for each λ (Dempster, Laird and Rubin (1977)). At the s-th iteration, given η^(s), the E step is to evaluate the Q-function given by

\begin{array}{l} Q_{λ} (η ∣ η^{(s)}) = E [log f (D_{c} ∣ η) ∣ D_{o}, η^{(s)}] - n \sum_{j = 1}^{p} p_{λ_{j}} (∣ β_{j} ∣) \\ = Q (η ∣ η^{(s)}) - n \sum_{j = 1}^{p} p_{λ_{j}} (∣ β_{j} ∣) \\ = Q_{1} (β, τ ∣ η^{(s)}) - n \sum_{j = 1}^{p} p_{λ_{j}} (∣ β_{j} ∣) + Q_{2} (α ∣ η^{(s)}) + Q_{3} (ξ ∣ η^{(s)}) \\ = Q_{1, λ} (β, τ ∣ η^{(s)}) + Q_{2} (α ∣ η^{(s)}) + Q_{3} (ξ ∣ η^{(s)}), \end{array}

where

\begin{array}{l} Q_{3} (ξ ∣ η^{(s)}) = \int \sum_{i = 1}^{n} log [f (r_{i} ∣ y_{i}, x_{i}, z_{i}, ξ)] f (z_{m, i} ∣ x_{i}, z_{o, i}, y_{i}, r_{i}, η^{(s)}) d z_{m, i}, \\ Q_{2} (α ∣ η^{(s)}) = \int \sum_{i = 1}^{n} log [f (z_{i} ∣ x_{i}, α)] f (z_{m, i} ∣ x_{i}, z_{o, i}, y_{i}, r_{i}, η^{(s)}) d z_{m, i}, \\ Q_{1, λ} (β, τ ∣ η^{(s)}) = \int \sum_{i = 1}^{n} log [f (y_{i} ∣ x_{i}, z_{i}, β, τ)] f (z_{m, i} ∣ x_{i}, z_{o, i}, y_{i}, r_{i}, η^{(s)}) d z_{m, i} \\ - n \sum_{j = 1}^{p} p_{λ_{j}} (∣ β_{j} ∣) . \end{array}

The M step of the algorithm involves maximizing Q_{1,λ}(β, τ|η^(s)), Q₂(α|η^(s)), and Q₃(ξ|η^(s)) independently. Maximizing Q_λ(η|η^(s)) with respect to (α, τ, ξ) can be done using standard maximization algorithms, such as Newton-Raphson (Little and Schluchter (1985), and Ibrahim and Lipsitz (1996)). However, it is difficult to maximize Q_{1,λ}(β, τ^(s)|η^(s)) with respect to β, because it is nondifferentiable and nonconcave (Zou and Li (2008)).

To maximize Q_{1,λ}(β, τ^(s)|η^(s)) with respect to β, we approximate Q₁(β, τ^(s)|η^(s)) using a second order Taylor’s series expansion centered at β^(s). Using this approximation, Q_{1,λ}(β, τ^(s)|η^(s)) resembles a penalized weighted least squares regression, so algorithms used for maximizing penalized least squares can be applied. Such algorithms include the local quadratic approximation algorithm (LQA) (Fan and Li (2001)), the best convex minorization-maximization algorithm (MM) (Hunter and Li (2005)), and the local linear approximation algorithm (LLA) (Zou and Li (2008)). We use the local linear approximation method to maximize Q_{1,λ}(β, τ^(s)|η^(s)), because it has been shown to reduce the computational cost of maximizing penalized likelihoods (Zou and Li (2008)). Even though an approximation is used for Q_{1,λ}(β, τ^(s)|η^(s)), the maximizer of the approximation, denoted β^(s+1), satisfies Q_{1,λ}(β^(s+1), τ^(s)|η^(s)) ≥ Q_{1,λ}(β^(s), τ^(s)|η^(s)). Therefore, using the ECM algorithm (Meng and Rubin (1993)), we can obtain an η^(s+1) such that Q_λ(η^(s+1)|η^(s)) ≥ Q_λ(η^(s)|η^(s)), rather than directly maximizing Q_λ(η|η^(s)). We iterate this process until it converges and denote the value at convergence by η̂_λ. Thus, η̂_λ maximizes the penalized observed data log-likelihood.
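The penalized weighted least squares step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the second order expansion of Q₁ has already been reduced to a quadratic form −½(β − b)ᵀA(β − b), where the matrix `A` (minus the Hessian at β^(s)) and the vector `b` (the unpenalized Newton update) are supplied by the caller, and it assumes the local linear approximation has replaced each penalty by a weighted absolute value with weight `weights[j]` (for example p′_λ(|β_j^(s)|) for SCAD, or λ_j for ALASSO). The function names and the coordinate descent solver are illustrative choices.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lla_step(A, b, weights, n, n_sweeps=200, tol=1e-10):
    """Minimize 0.5*(beta-b)'A(beta-b) + n*sum_j weights[j]*|beta_j|.

    A       : p x p symmetric positive definite matrix (list of lists),
              standing in for minus the Hessian of Q_1 at beta^(s) (assumed)
    b       : length-p list, the unpenalized Newton update (assumed)
    weights : length-p list of nonnegative LLA weights (assumed)
    n       : sample size multiplying the penalty, as in (2.3)
    """
    p = len(b)
    beta = list(b)  # start from the unpenalized update
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            # A[j] . (b - beta) with the j-th coordinate's own term added back
            r_j = sum(A[j][k] * (b[k] - beta[k]) for k in range(p)) + A[j][j] * beta[j]
            new_j = soft_threshold(r_j, n * weights[j]) / A[j][j]
            max_change = max(max_change, abs(new_j - beta[j]))
            beta[j] = new_j
        if max_change < tol:  # stop once a full sweep no longer moves beta
            break
    return beta
```

With an identity `A`, the update reduces to coordinate-wise soft-thresholding of `b` at `n * weights[j]`, which makes the shrinkage-to-zero behaviour easy to see: small entries of `b` are set exactly to zero while large ones are shifted toward zero.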

2.3. Penalty selection procedure

To ensure that η̂_λ has oracle properties, the penalty parameter λ has to be appropriately selected. Two commonly used criteria for selecting the penalty parameter include the GCV and BIC criteria. These criteria cannot be easily computed in the presence of missing data because they are often functions of the missing data, and thus involve intractable integrals. Moreover, it has been shown that even for the linear model, the GCV can lead to significant overfitting (Wang, Li and Tsai (2007)).

We propose two methods to select the penalty parameter: an IC_Q criterion and a random effects penalty estimation method. The IC_Q criterion selects the optimal λ by minimizing

{IC}_{Q} (λ) = - 2 Q ({\hat{η}}_{λ} ∣ {\hat{η}}_{0}) + {\hat{c}}_{n} ({\hat{η}}_{λ}),

where ${\hat{η}}_{0} = \underset{η}{argmax} \sum_{i = 1}^{n} log f (D_{o, i} ∣ η)$ is the unpenalized maximum likelihood estimate under the full model, and ĉ_n(η) is a function of the data and the fitted model. For instance, if ĉ_n equals twice the total number of parameters, then we obtain an AIC-type criterion; alternatively, we obtain a BIC-type criterion when ĉ_n(η) = dim(η) × log n. Moreover, in the absence of missing data, we just obtain the usual AIC or BIC criteria. In practice, it is easy to compute IC_Q for different λ because we only need samples from f (z_m,i|y_i, x_i, z_o,i, η̂₀) to approximate Q(η̂_λ|η̂₀) at each λ.
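As a rough sketch of how the IC_Q criterion could be evaluated over a grid of penalty values, the snippet below assumes that Q(η̂_λ|η̂₀) has already been approximated for each candidate λ (for instance by averaging complete-data log-likelihoods over draws of the missing covariates) and that a dimension has been assigned to each fitted model. The helper names `mc_q`, `ic_q`, and `select_lambda`, and the tuple layout of `candidates`, are hypothetical.

```python
import math


def mc_q(complete_loglik, imputations):
    """Monte Carlo approximation of Q: average the complete-data
    log-likelihood over draws of the missing data (draws assumed to come
    from the conditional distribution evaluated at the unpenalized MLE)."""
    return sum(complete_loglik(d) for d in imputations) / len(imputations)


def ic_q(q_value, n_params, n_obs, kind="bic"):
    """IC_Q = -2 * Q(eta_hat_lambda | eta_hat_0) + c_n."""
    if kind == "bic":
        c_n = n_params * math.log(n_obs)  # BIC-type: dim(eta) * log n
    elif kind == "aic":
        c_n = 2.0 * n_params              # AIC-type: 2 * dim(eta)
    else:
        raise ValueError("kind must be 'bic' or 'aic'")
    return -2.0 * q_value + c_n


def select_lambda(candidates, n_obs, kind="bic"):
    """candidates: list of (lam, q_value, n_params) tuples.
    Returns the lam whose fitted model minimizes IC_Q."""
    best = min(candidates, key=lambda c: ic_q(c[1], c[2], n_obs, kind))
    return best[0]
```

For example, `select_lambda([(0.0, -100.0, 8), (0.1, -101.0, 5), (0.5, -120.0, 2)], n_obs=40)` trades the small loss in Q at λ = 0.1 against three fewer parameters; the numbers in this call are made up purely to show the interface.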

The random effects penalty estimator is calculated under the assumption that the regression coefficients β are distributed as random effects in a hierarchical model. The parameter λ can be regarded as a parameter in the distribution of β, denoted by f(β|λ, n). Then, λ can be estimated by maximizing the marginal likelihood given by

\int \left[\prod_{i = 1}^{n} \int f (y_{i}, z_{i}, r_{i} ∣ x_{i}, η) d z_{m, i}\right] f (β ∣ λ, n) d β = \int \left[\prod_{i = 1}^{n} f (D_{o, i} ∣ η)\right] f (β ∣ λ, n) d β,

(2.4)

where

f (β ∣ λ, n) = \prod_{j = 1}^{p} \frac{\exp \{- n p_{λ_{j}} (∣ β_{j} ∣)\}}{C (λ_{j}, n)},

(2.5)

in which C(λ_j, n) is the normalizing constant of exp(−np_{λ_j} (|β_j|)). The resulting estimate of λ, denoted by λ̂_RE, from the maximization of (2.4) is the random effects penalty estimator. The EM algorithm can be used to calculate λ̂_RE by treating the regression coefficients as missing data in the marginal likelihood.

We consider the SCAD and ALASSO penalties as follows. For ALASSO,

p_{λ_{j}} (∣ β_{j} ∣) = λ_{j} ∣ β_{j} ∣

for j = 1, …, p. Typical values chosen are λ_j = λ₀|β̂_j|^{−γ}, where β̂_j is the unpenalized ML estimate and γ > 0 is a pre-specified positive scalar. In contrast, the SCAD penalty (Fan and Li (2001)) is a nonconcave function defined by p_λ(0) = 0 and for |β| > 0,

p_{λ}^{'} (∣ β ∣) = λ 1 (∣ β ∣ \leq λ) + \frac{{(a λ - ∣ β ∣)}_{+}}{a - 1} 1 (∣ β ∣ > λ),

where 1(·) denotes the indicator function, t₊ denotes the positive part of t, and a = 3.7. Because the function exp(−np_λ(|β|)) for the SCAD penalty is not proper, we use a truncated version of p_λ(|β|) to define the density f (β|λ, n). For SCAD, we have

f (β ∣ λ, n) C (λ, n) = \begin{cases} \exp (- n λ ∣ β ∣), & ∣ β ∣ < λ, \\ \exp \left(\frac{n [∣ β ∣^{2} - 2 a λ ∣ β ∣ + λ^{2}]}{2 (a - 1)}\right), & λ \leq ∣ β ∣ \leq a λ, \\ \exp \left(\frac{- n (a + 1) λ^{2}}{2}\right), & a λ \leq ∣ β ∣ \leq ∣ \bar{β} ∣, \\ 0, & ∣ β ∣ > ∣ \bar{β} ∣, \end{cases}

where β̄ is arbitrarily large. For the ALASSO penalty, this truncation is not necessary because exp(−np_λ(|β|)) is proper.
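The SCAD quantities above translate directly into code. The sketch below implements the derivative p′_λ, the penalty p_λ obtained by integrating that derivative from 0 (with p_λ(0) = 0), and the unnormalized truncated density f(β|λ, n)C(λ, n). The function names and the `beta_bar` argument (the truncation point β̄) are illustrative.

```python
import math

A_SCAD = 3.7  # value of a used in the text


def scad_derivative(beta_abs, lam, a=A_SCAD):
    """p'_lambda(|beta|) for |beta| > 0."""
    if beta_abs <= lam:
        return lam
    return max(a * lam - beta_abs, 0.0) / (a - 1.0)


def scad_penalty(beta_abs, lam, a=A_SCAD):
    """p_lambda(|beta|), the integral of scad_derivative from 0 to |beta|."""
    if beta_abs <= lam:
        return lam * beta_abs
    if beta_abs <= a * lam:
        return -(beta_abs ** 2 - 2.0 * a * lam * beta_abs + lam ** 2) / (2.0 * (a - 1.0))
    return (a + 1.0) * lam ** 2 / 2.0


def scad_kernel(beta, lam, n, beta_bar, a=A_SCAD):
    """Unnormalized truncated density f(beta | lam, n) * C(lam, n):
    exp(-n * p_lambda(|beta|)) on [-beta_bar, beta_bar], zero outside."""
    if abs(beta) > abs(beta_bar):
        return 0.0
    return math.exp(-n * scad_penalty(abs(beta), lam, a))
```

The penalty is continuous at both knots: it equals λ² at |β| = λ and (a + 1)λ²/2 at |β| = aλ, after which it is flat. That flat tail is why the untruncated function exp(−np_λ(|β|)) does not integrate to a finite value.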

A closed form expression of λ̂_RE is unavailable for both the ALASSO and SCAD penalties. However, for the ALASSO penalty, a closed form expression of the conditional maximizer of the log-likelihood function with respect to λ is available. This allows a straightforward implementation of the ECM algorithm to estimate λ. For the SCAD penalty, we use the Newton-Raphson algorithm along with the ECM algorithm to estimate λ̂_RE.
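To indicate why a closed form is plausible in the ALASSO case, the following worked derivation is a sketch under stated assumptions, not necessarily the exact update used by the authors. It assumes λ_j = λ₀w_j with fixed weights w_j = |β̂_j|^{−γ}, so that only the scalar λ₀ is estimated, and it writes E^{(s)} for the conditional expectation of the E step given the observed data and the current parameter values.

```latex
% Normalizing constant of exp(-n \lambda_j |\beta_j|) over the real line:
C(\lambda_j, n) = \int_{-\infty}^{\infty} \exp(-n \lambda_j |\beta_j|)\, d\beta_j
               = \frac{2}{n \lambda_j},
\qquad
f(\beta_j \mid \lambda_j, n) = \frac{n \lambda_j}{2} \exp(-n \lambda_j |\beta_j|).

% Expected log density of \beta as a function of \lambda_0 (assumed parametrization):
E^{(s)}\bigl[\log f(\beta \mid \lambda, n)\bigr]
  = \sum_{j=1}^{p} \log\frac{n \lambda_0 w_j}{2}
    - n \lambda_0 \sum_{j=1}^{p} w_j\, E^{(s)}\bigl[|\beta_j|\bigr].

% Setting the derivative with respect to \lambda_0 to zero:
\frac{p}{\lambda_0} - n \sum_{j=1}^{p} w_j\, E^{(s)}\bigl[|\beta_j|\bigr] = 0
\quad\Longrightarrow\quad
\lambda_0^{(s+1)} = \frac{p}{n \sum_{j=1}^{p} w_j\, E^{(s)}\bigl[|\beta_j|\bigr]}.
```

Under these assumptions the conditional maximization over λ₀ needs only the expected absolute values of the coefficients, which can be approximated from Monte Carlo draws within each EM iteration.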

3. Theoretical Results

In this section, we establish the asymptotic theory of penalized likelihood estimators and the consistency of the penalty selection procedure based on IC_Q. Suppose that $β = {(β_{(1)}^{T}, β_{(2)}^{T})}^{T}$, where β₍₁₎ and β₍₂₎ are, respectively, p₁ × 1 and p₂ × 1 subvectors. Let $β^{*} = {(β_{(1)}^{* T}, β_{(2)}^{* T})}^{T}$ denote the true value of β. Without loss of generality, we assume that $β_{(2)}^{*} = 0$ and that each component of $β_{(1)}^{*}$ is nonzero.

Let S = {j₁, …, j_d} be a candidate model containing the j₁th, …, j_dth covariates. Thus, S_F = {1, …, p} and S_T = {1, …, p₁} denote the full and true covariate models, respectively. If S misses at least one important covariate, then S is referred to as an underfitted model; however, if S ⊃ S_T and S ≠ S_T, then S is an overfitted model. Assume that we only consider the selected covariates in S. The unpenalized and penalized ML estimates of η, denoted by η̂_S and η̂_λ, respectively, are

{\hat{η}}_{S} = \underset{η : β_{j} \neq 0, \forall j \in S}{argmax} \sum_{i = 1}^{n} log f (D_{o, i} ∣ η) \quad \text{and} \quad {\hat{η}}_{λ} = \underset{η}{argmax} P (η ∣ λ),

where η̂_{S_F} = η̂₀.

Theorem 1

Under assumptions (C1)–(C7) stated in the online supplement, we have

(i) η̂_λ − η* = O_p(n^{−1/2}) as n → ∞, where ${\hat{η}}_{λ} = {({\hat{β}}_{(1) λ}^{T}, {\hat{β}}_{(2) λ}^{T}, {\hat{τ}}_{λ}^{T}, {\hat{α}}_{λ}^{T}, {\hat{ξ}}_{λ}^{T})}^{T}$ and η* is the true value of η.
(ii) Sparsity: P(β̂_(2)λ = 0) → 1.
(iii) Asymptotic normality: ${({\hat{β}}_{(1) λ}^{T}, {\hat{τ}}_{λ}^{T}, {\hat{α}}_{λ}^{T}, {\hat{ξ}}_{λ}^{T})}^{T}$ is asymptotically normal with mean and covariance defined in the online supplement.

The proof of Theorem 1 is given in the online supplement at http://www.stat.sinica.edu.tw/statistica. Theorem 1 states that, with an appropriate choice of the penalty λ, there exists a root-n consistent estimator of η, η̂_λ, and that this estimator possesses the sparsity property, i.e., β̂_(2)λ = 0 with probability tending to one. Theorem 1(iii) shows that η̂_λ is asymptotically normal. An expression for the asymptotic covariance matrix of η̂_λ can be obtained using Louis’s method (Louis (1983)). These estimates are given in the online supplement.

We investigate whether the IC_Q(λ) criterion can consistently select the correct model. For each λ ∈ R^{p+}, β̂_λ naturally defines a candidate model S_λ = {j: β̂_λj ≠ 0}. Generally, S_λ can be underfitted, overfitted, or true. Therefore, R^{p+} can be partitioned into three mutually exclusive regions $R_{u}^{p +} = \{λ \in R^{p +} : S_{λ} ⊅ S_{T}\}$, $R_{t}^{p +} = \{λ \in R^{p +} : S_{λ} = S_{T}\}$, and $R_{o}^{p +} = \{λ \in R^{p +} : S_{λ} \supset S_{T}, S_{λ} \neq S_{T}\}$. Furthermore, we can always choose a reference penalty parameter sequence $\{λ_{n} \in R^{p +}\}_{n = 1}^{\infty}$ that satisfies the conditions necessary for Theorem 1 to hold. Thus, S_{λ_n} = S_T with probability converging to one. To select the better of two candidate models, we first calculate

{dIC}_{Q} (λ_{2}, λ_{1}) = {IC}_{Q} (λ_{2}) - {IC}_{Q} (λ_{1}) = 2 Q ({\hat{η}}_{λ_{1}} ∣ {\hat{η}}_{0}) - {\hat{c}}_{n} ({\hat{η}}_{λ_{1}}) - 2 Q ({\hat{η}}_{λ_{2}} ∣ {\hat{η}}_{0}) + {\hat{c}}_{n} ({\hat{η}}_{λ_{2}}) .

We assume S_{λ₂} ⊃ S_{λ₁} and choose the model resulting from using the penalty value λ₁ (i.e., S_{λ₁}) if dIC_Q(λ₂, λ₁) ≥ 0; otherwise we choose model S_{λ₂}.

Define $δ_{Q} (λ_{2}, λ_{1}) = E [Q (η_{S_{λ_{1}}}^{*} ∣ η^{*})] - E [Q (η_{S_{λ_{2}}}^{*} ∣ η^{*})]$ and δ_c(λ₂, λ₁) = ĉ_n(η̂_λ₂) − ĉ_n(η̂_λ₁), in which $η_{S}^{*}$ is defined in the online supplement.

Theorem 2

Under assumptions (C1)–(C7) in the Appendix of the online supplement, we have the following results.

(a) If, for all λ ∈ R_u^{p+}, lim inf_n δ_Q(λ, 0)/n > 0 and δ_c(λ, 0) = o_p(n), then dIC_Q(λ, 0) > 0 in probability for all λ ∈ R_u^{p+}.
(b) If $E [Q (η_{S_{λ_{1}}}^{*} ∣ {\hat{η}}_{0})] - E [Q (η_{S_{λ_{2}}}^{*} ∣ {\hat{η}}_{0})] = O_{p} (n^{1 / 2})$ and $Q ({\hat{η}}_{λ_{t}} ∣ {\hat{η}}_{0}) - E [Q (η_{S_{λ_{t}}}^{*} ∣ {\hat{η}}_{0})] = O_{p} (n^{1 / 2})$ for t = 1, 2, then dIC_Q(λ₂, λ₁) > 0 in probability as $n^{- 1 / 2} δ_{c} (λ_{2}, λ_{1}) \overset{p}{\to} \infty$.
(c) If Q(η̂_λ₁|η̂₀) − Q(η̂_λ₂|η̂₀) = O_p(1), then dIC_Q(λ₂, λ₁) > 0 in probability as $δ_{c} (λ_{2}, λ_{1}) \overset{p}{\to} \infty$.

The proof of Theorem 2 is given in the online supplement. Theorem 2 has some important implications. Theorem 2(a) shows that IC_Q(λ) chooses all significant covariates with probability tending to 1. Because $S_{0} \subset R_{t}^{p} \cup R_{o}^{p}$ , minimizing IC_Q(λ) will not select a λ with S_λ ⊅ S_T, because dIC_Q(λ, 0) > 0 in probability for such λ. Generally, the most commonly used ĉ_n(η), such as 2dim(η), dim(η) log(n), and K log log(n) (K > 0), satisfy the condition δ_c(λ, 0) = o_p(n). The condition $\underset{n}{lim inf} n^{- 1} δ_{Q} (λ, 0) > 0$ ensures that IC_Q(λ) chooses a model with large $E [Q (η_{S}^{*} ∣ η^{*})]$ . This condition is analogous to Condition 2 in Wang, Li and Tsai (2007), which elucidates the effect of models that underfit. Because $n^{- 1} E [Q (η^{*} ∣ η^{*})] - n^{- 1} E [Q (η_{S}^{*} ∣ η^{*})]$ can be written as

\begin{array}{l} n^{- 1} \sum_{i = 1}^{n} log f (D_{o, i} ∣ η^{*}) - n^{- 1} \sum_{i = 1}^{n} log f (D_{o, i} ∣ η_{S}^{*}) \\ + n^{- 1} E [H (η^{*} ∣ η^{*})] - n^{- 1} E [H (η_{S}^{*} ∣ η^{*})], \end{array}

where

H (η ∣ η_{1}) = \int \sum_{i = 1}^{n} log [f (z_{m, i} ∣ x_{i}, z_{o, i}, y_{i}, r_{i}, η)] f (z_{m, i} ∣ x_{i}, z_{o, i}, y_{i}, r_{i}, η_{1}) d z_{m, i},

it then follows from Jensen’s inequality that n⁻¹δ_Q(λ, 0) ≥ 0. Thus, if a model S_λ misses a significant covariate, it is reasonable to assume lim inf_n n⁻¹δ_Q(λ, 0) is greater than zero.

If λ₁ and λ₂ have the same average $n^{- 1} E [Q (η_{S_{λ}}^{*} ∣ η^{*})]$ , that is, lim inf_n n⁻¹ δ_Q(λ₂, λ₁) = 0, then Theorem 2(b) and (c) indicate that IC_Q(λ) picks out the smaller model S_{λ₁} when δ_c(λ₂, λ₁) increases to ∞ at a certain rate (e.g., log(n)). For example, for the BIC-type criterion, δ_c(λ₂, λ₁) = [dim(S_{λ₂}) − dim(S_{λ₁})] log(n) ≥ log(n), since we assume S_{λ₂} ⊃ S_{λ₁}. However, the AIC-type criterion ĉ_n(η) = 2 × dim(η) does not satisfy this condition. Thus, similar to the standard AIC, IC_Q with ĉ_n(η) = 2 × dim(η) tends to overfit.
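A small worked comparison may help make the contrast concrete. Suppose the larger model contains exactly one more covariate than the smaller one, and take the sample sizes n = 40 and n = 60 purely as illustrative values.

```latex
% BIC-type penalty: the gap grows with n
\delta_c(\lambda_2, \lambda_1) = (1)\log n:
\qquad \log 40 \approx 3.69, \qquad \log 60 \approx 4.09, \qquad \log n \to \infty .

% AIC-type penalty: the gap is constant in n
\delta_c(\lambda_2, \lambda_1) = 2 \times 1 = 2 \quad \text{for every } n .
```

The BIC-type gap diverges, so it eventually dominates an O_p(1) difference in Q as required by Theorem 2(c), whereas the AIC-type gap stays fixed at 2 and therefore cannot rule out the larger model asymptotically.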

4. Numerical Studies

4.1. Example 1: simulation study

We demonstrate the performance of the penalized ML estimates using our proposed penalty estimators via simulations and compare them to the unpenalized ML estimate. Our objectives for these simulations were to (1) compare the performance of the random effects and the IC_Q penalty estimators, (2) compare the performance of the SCAD and ALASSO penalty functions, and (3) determine how the comparisons in (1) and (2) differ in the complete data and missing covariate settings.

To do this, we simulated datasets consisting of n observations from the model y = u^T β* + σε, where β* = (3, 1.5, 0, 0, 2, 0, 0, 0)^T and the components of u = (u₁, …, u₈)^T and ε are standard normal. The correlation between u_i and u_j is ρ^{|i−j|} with ρ = 0.5. This model was used in Fan and Li (2001). We considered three settings: (n = 40, σ = 3), (n = 40, σ = 1), and (n = 60, σ = 1). For each of them, two sets of 100 datasets were simulated, one with complete data and another with missing covariate data. For the datasets with missing data, the covariates z_i = (u_{1i}, u_{2i}) were taken to be MAR and x_i = (u_{3i}, …, u_{8i}) were completely observed. The covariate distribution is given by [z_i|x_i] ~ N₂(μ_i, Σ) for i = 1, …, n, where μ_i = (μ_{1i}, μ_{2i}), $μ_{s i} = α_{s 0} + \sum_{j = 1}^{5} α_{s j} x_{i j}$ for s = 1, 2, and Σ is an unstructured 2 × 2 covariance matrix. The missing data mechanism used was f(r_{i1}, r_{i2}|y_i, x_i, φ) = f(r_{i1}|r_{i2}, y_i, x_i, φ₁) f(r_{i2}|y_i, x_i, φ₂), where f(r_{i1}|r_{i2}, y_i, x_i, φ₁) and f(r_{i2}|y_i, x_i, φ₂) are logistic regressions whose parameters φ₁ and φ₂ were selected such that 65% of the observations had complete data.
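A simplified version of this data-generating process can be sketched as follows. The response model, β*, ρ, and the split into z and x follow the description above. The missingness model is a deliberate simplification and an assumption of this sketch: each of the two partially observed covariates is observed independently with a logistic probability depending only on y, and the coefficients in `phi` are hypothetical and are not tuned to reproduce the 65% complete-case rate.

```python
import math
import random

BETA_STAR = [3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]


def draw_covariates(rng, p=8, rho=0.5):
    """Standard normal u with corr(u_i, u_j) = rho^|i-j|, via an AR(1) recursion."""
    u = [rng.gauss(0.0, 1.0)]
    scale = math.sqrt(1.0 - rho ** 2)  # keeps every component at unit variance
    for _ in range(1, p):
        u.append(rho * u[-1] + scale * rng.gauss(0.0, 1.0))
    return u


def simulate(n, sigma, seed=0, phi=(1.0, 0.3)):
    """Return a list of n records with keys y, z, x, r.

    z holds (u_1, u_2) with None for missing entries, x holds (u_3, ..., u_8),
    and r holds the observation indicators (1 = observed, 0 = missing).
    phi = (intercept, slope on y) is a hypothetical MAR logistic model.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        u = draw_covariates(rng)
        y = sum(b * v for b, v in zip(BETA_STAR, u)) + sigma * rng.gauss(0.0, 1.0)
        r = []
        for _ in range(2):
            # MAR: the observation probability depends only on the fully observed y
            p_obs = 1.0 / (1.0 + math.exp(-(phi[0] + phi[1] * y)))
            r.append(1 if rng.random() < p_obs else 0)
        z = [u[k] if r[k] == 1 else None for k in range(2)]
        data.append({"y": y, "z": z, "x": u[2:], "r": r})
    return data
```

Calling `simulate(40, 3.0)`, `simulate(40, 1.0)`, and `simulate(60, 1.0)` with different seeds would give datasets in the three (n, σ) settings; the complete-data counterpart is obtained by ignoring `r` and using the full vector u.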

For each simulated dataset, the penalized ML estimate using the SCAD and ALASSO penalties was computed using the random effects and IC_Q penalty estimates. These estimates are denoted as SCAD-RE, SCAD-IC_Q, ALASSO-RE, and ALASSO-IC_Q, respectively. For the IC_Q estimate, the BIC-type criterion, c_n(η) = dim(η) log n, was used. In the analysis of the datasets with no missing covariates, the IC_Q criterion is equivalent to BIC. For the random effects penalty estimator, 2,000 Monte Carlo iterations were used within each iteration of EM. Since the EM algorithm can be sensitive to starting values, the algorithm was initiated from multiple starting values to ensure the overall global maximum was achieved by the algorithm. For the ALASSO penalty, we set λ_j = λ₀|β̂_j₀|⁻¹, where β̂_j₀ is the unpenalized ML estimate and for the SCAD penalty we let λ_j = λ₀, for all j, where in both cases λ₀ was estimated using the penalty estimation methods.

In addition to the penalized estimates, the unpenalized ML estimate of the model selected by the simultaneously impute and select (SIAS) method of Yang, Belin and Boscardin (2005) was computed. SIAS implements the stochastic search variable selection (SSVS) method of George and McCulloch (1993) in the presence of missing covariates. SIAS is a fully Bayesian method which does not require model enumeration or computation of marginal likelihoods, so it may be easier to implement than other fully Bayesian methods. In the analysis of the datasets with no missing covariates, SIAS is equivalent to SSVS. Details of the implementation of SIAS are given in the online supplement.

For each estimate β̂_λ, the model error, ME(β̂_λ) = (β̂_λ − β*)^T E(uu^T)(β̂_λ − β*), was computed, along with the ratio of the model error of the penalized ML estimate to that of the unpenalized ML estimate, ME(β̂_λ)/ME(β̂₀). The median of these ratios over the 100 simulated datasets, denoted as MRME, is reported. The MRME of the true model, denoted as ‘oracle’, is also reported. In addition, the average number of zero coefficients correctly estimated to be zero and the average number of nonzero coefficients incorrectly estimated to be zero are reported in the columns ‘Correct’ and ‘Incorrect’, respectively.
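These summary measures are simple to compute once the fitted coefficient vectors are available. The sketch below assumes E(uu^T) is the correlation matrix with entries ρ^{|i−j|} implied by the simulation design; the function names are illustrative.

```python
import statistics


def ar1_cov(p, rho=0.5):
    """E(u u^T) for standard normal u with corr(u_i, u_j) = rho^|i-j|."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]


def model_error(beta_hat, beta_star, cov):
    """ME = (beta_hat - beta_star)^T cov (beta_hat - beta_star)."""
    d = [bh - bs for bh, bs in zip(beta_hat, beta_star)]
    p = len(d)
    return sum(d[i] * cov[i][j] * d[j] for i in range(p) for j in range(p))


def mrme(penalized_fits, unpenalized_fits, beta_star, cov):
    """Median over datasets of ME(penalized) / ME(unpenalized)."""
    ratios = [
        model_error(bp, beta_star, cov) / model_error(bu, beta_star, cov)
        for bp, bu in zip(penalized_fits, unpenalized_fits)
    ]
    return statistics.median(ratios)


def zero_counts(beta_hat, beta_star):
    """Return (correct, incorrect): true zeros estimated as zero, and
    true nonzeros estimated as zero."""
    correct = sum(1 for bh, bs in zip(beta_hat, beta_star) if bs == 0 and bh == 0)
    incorrect = sum(1 for bh, bs in zip(beta_hat, beta_star) if bs != 0 and bh == 0)
    return correct, incorrect
```

Averaging the two entries of `zero_counts` over the simulated datasets gives the ‘Correct’ and ‘Incorrect’ columns; with β* as in the simulation, ‘Correct’ can be at most 5 and ‘Incorrect’ at most 3.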

The results indicate that when the noise level is high (σ = 3), the ALASSO-RE and SCAD-IC_Q estimates have the smallest model error while the SCAD-RE has the highest. When the noise level is reduced (σ = 1), or the sample size is large (n = 60), the SCAD-RE estimate has the smallest model error. For the estimates, MRME values greater than one indicate that the estimate performs worse than the unpenalized ML estimate, values near one indicate it performs as well as the unpenalized ML estimate, and values near the ‘oracle’ MRME value indicate optimal performance. The SCAD-RE performed poorly when the noise level was high; however, it was optimal when either the noise level was small or the sample size was large. The ALASSO-RE estimate overfit substantially, since ‘Correct’ averaged significantly less than 5, indicating a tendency to not set insignificant coefficients to zero. The SIAS estimate performed as well as the unpenalized ML estimate when the noise level was large and covariates were missing; however, it outperformed the ML estimate when either the noise level was high, the sample size was large, or all the covariates were fully observed. ‘Correct’ averages and ‘Incorrect’ averages that are both high indicate that the estimate is more likely to set coefficients to zero rather than not. This was the case with the SIAS and SCAD-RE estimates when the noise level was large. Comparing the analysis of no missing covariate data to the analysis with missing covariate data shows that for all the estimates, the estimation error increased, overfitting increased, and underfitting increased.

4.2. Example 2: melanoma data

To further illustrate our proposed methods, we consider data on n = 286 patients from a phase III two arm clinical trial conducted by the Eastern Cooperative Oncology Group. The results from this study have been reported in Kirkwood, Strawderman, Ernstoff, Smith, Borden and Blum (1996). Patients in this trial were randomized to one of two treatment arms: high dose interferon or observation. Interferon is suggested to have a significant effect on disease-free survival. Here, disease-free survival is defined as the time from randomization until progression of tumor or death, whichever comes first. In this analysis, several prognostic factors were identified as important predictors of survival. Among these factors are z₁ = Breslow thickness (in mm), z₂ = size of primary (in cm²), z₃ = type of primary tumor (two levels: superficial spreading, other), x₁ = age (in years), x₂ = pathological group (two levels: previous recurrence and other), and x₃ = treatment (two levels: high dose interferon and observation). Of these six covariates, three had missing data, while the rest of the covariates and the response variable were completely observed. The three covariates with missing data were Breslow thickness, size, and type. Logarithms of Breslow thickness and size were used in this analysis to achieve approximate normality of these covariates in the covariate distribution. The dataset had a total missing data fraction of 28.7%. The outcome variable, y_i, was taken here to be binary, and was assigned a 1 if the patient had an overall survival greater than or equal to 0.55 years, and 0 otherwise. There were no censored cases that had an overall survival below 0.55 years.

To analyze these data, a logistic regression model was used for y_i|x_i, β with E(y_i|x_i, β) = exp(γ_i)/(1 + exp(γ_i)), where γ_i = (1, z_i, x_i)^T β, z_i = (z_i₁, z_i₂, z_i₃)^T, x_i = (x_i₁, x_i₂, x_i₃)^T, and β = (β₀, β₁, …, β₆). For the missing covariates, we assume they are MAR and have the covariate distribution

f (z_{i} ∣ x_{i}; α) = f (z_{i 3} ∣ z_{i 1}, z_{i 2}, x_{i}; α_{3}) f (z_{i 1}, z_{i 2} ∣ x_{i}; α_{1}, α_{2})

for i = 1, …, n. Since x_i is completely observed, it is conditioned on throughout. We take (z_i₁, z_i₂|x_i) ~ N₂(μ_i, Σ), where μ_i = (μ_i₁, μ_i₂) and $μ_{i s} = α_{s 0} + \sum_{j = 1}^{3} α_{s j} x_{i j}$ for s = 1,2, i = 1, …, n, and Σ is an unstructured 2 × 2 covariance matrix. A logistic regression model was used for z_i₃ conditional on (z_i₁, z_i₂, x_i). We computed the same estimates as in the simulations. The statistical model used for the SIAS method is given in the online supplement.
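To make the model specification above concrete, the following minimal sketch evaluates the complete-data log-likelihood contribution of a single patient, that is, the sum of the log response density and the log of the factored covariate density. It is an illustration under stated assumptions rather than the implementation used for the analysis: the function names, the argument layout (`alpha12` as a 2 × 4 matrix of regression coefficients for μ_i, `alpha3` as a length-6 vector for the logistic model of z_i₃), and the 0/1 coding of the binary variables are all hypothetical.

```python
import numpy as np

def log_bvn(z, mu, Sigma):
    """Log density of the bivariate normal N2(mu, Sigma) evaluated at z."""
    z, mu, Sigma = (np.asarray(a, dtype=float) for a in (z, mu, Sigma))
    d = z - mu
    quad = d @ np.linalg.solve(Sigma, d)          # (z - mu)' Sigma^{-1} (z - mu)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (2.0 * np.log(2.0 * np.pi) + logdet + quad)

def log_bernoulli_logit(y, eta):
    """Log pmf of a 0/1 outcome whose success probability is expit(eta)."""
    return y * eta - np.logaddexp(0.0, eta)       # y*eta - log(1 + exp(eta))

def complete_data_loglik(y, z, x, beta, alpha12, Sigma, alpha3):
    """log f(y | z, x; beta) + log f(z3 | z1, z2, x; alpha3)
       + log f(z1, z2 | x; alpha1, alpha2, Sigma) for one patient."""
    z = np.asarray(z, dtype=float)                # (z1, z2, z3), z3 coded 0/1
    x = np.asarray(x, dtype=float)                # (x1, x2, x3)
    design_x = np.concatenate(([1.0], x))         # (1, x1, x2, x3)
    mu = np.asarray(alpha12, dtype=float) @ design_x
    gamma = np.concatenate(([1.0], z, x)) @ np.asarray(beta, dtype=float)
    eta3 = np.concatenate(([1.0], z[:2], x)) @ np.asarray(alpha3, dtype=float)
    return (log_bernoulli_logit(y, gamma)
            + log_bernoulli_logit(z[2], eta3)
            + log_bvn(z[:2], mu, Sigma))

# Hypothetical parameter values and data for one patient (illustration only).
beta = np.zeros(7)
alpha12 = np.zeros((2, 4))
alpha3 = np.zeros(6)
Sigma = np.eye(2)
value = complete_data_loglik(1, [0.0, 0.0, 1.0], [0.0, 0.0, 0.0],
                             beta, alpha12, Sigma, alpha3)
```

When some components of z_i are missing, a quantity of this form would be averaged over the conditional distribution of the missing components given the observed data, which is the role played by the E-step of an EM-type algorithm.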

The results are presented in Table 4.2. The predictors identified as significant differed across the estimation methods. In the missing data analysis, the ALASSO and SIAS estimates identified treatment as a significant predictor while the SCAD estimates did not. The ALASSO-IC_Q estimate identified treatment and pathology as significant, while the ALASSO-RE estimate identified treatment, pathology, and age as significant. According to the unpenalized ML analysis, treatment and pathology are the only predictors that are possibly significant, since their p-values are near or below the cutoff value of 0.05 for significance. However, neither of these predictors was strongly significant. Therefore, a possible explanation for the differences in the results of the various estimation methods is that these methods may not discriminate very well between models that include or exclude treatment and pathology. The results of the unpenalized maximum likelihood analysis coincided with the results of the ALASSO-IC_Q and SIAS estimates. As with the simulations, the ALASSO-RE estimate tended to overfit, since it identified age as significant even though its p-value was greater than 0.05, and the SCAD-RE estimate tended to set coefficients to 0, since it did not identify any predictors as significant. The estimate of the regression coefficient for treatment decreased from 1.117 in the complete case analysis to 0.839 in the missing data analysis. This change caused the SCAD-IC_Q estimate to identify treatment as significant in the complete case analysis but not in the missing data analysis.

Table 4.2.

Estimates of Melanoma Data

Missing Data Estimate
Variable	SCAD-RE	SCAD-IC_Q	ALASSO-RE	ALASSO-IC_Q	SIAS	MLE (p value)
Intercept	2.132	2.132	2.421	2.280	1.774	2.638 (<0.001)
Breslow	0.000	0.000	0.000	0.000	0.000	−0.217 (0.332)
Size	0.000	0.000	0.000	0.000	0.000	−0.052 (0.798)
Type	0.000	0.000	0.000	0.000	0.000	−0.161 (0.730)
Age	0.000	0.000	−0.267	0.000	0.000	−0.325 (0.146)
Pathology	0.000	0.000	−0.845	−0.454	0.000	−1.061 (0.039)
Treatment	0.000	0.000	0.737	0.322	0.827	0.839 (0.043)

Complete Case Estimate
Variable	SCAD-RE	SCAD-IC_Q	ALASSO-RE	ALASSO-IC_Q	SIAS	MLE (p value)
Intercept	2.085	1.609	2.043	1.820	1.609	2.210 (<0.001)
Breslow	0.000	0.000	−0.081	0.000	0.000	−0.222 (0.400)
Size	0.000	0.000	0.000	0.000	0.000	−0.089 (0.650)
Type	0.000	0.000	0.000	0.000	0.000	0.235 (0.650)
Age	0.000	0.000	−0.113	0.000	0.000	−0.232 (0.356)
Pathology	0.000	0.000	−0.578	0.000	0.000	−0.945 (0.086)
Treatment	0.000	1.173	1.003	0.572	1.173	1.117 (0.028)


5. Discussion

We have proposed a general method to simultaneously perform model selection and estimation in the presence of missing data. We have shown that under regularity conditions and appropriate rates of the penalty parameter, the penalized estimate possesses oracle properties. We have introduced two computationally attractive methods for estimating the penalty parameters. We have shown that under an appropriate choice of ĉ_n(η), the IC_Q penalty estimate chooses all the significant predictors in probability. Simulation results show that the SCAD penalty function with the random effects penalty estimate performs well when the noise level is small, whereas it performs poorly when the noise level is large. Overall, the SCAD performed better when it was used with the random effects penalty estimator, whereas the ALASSO performed better when it was used with the IC_Q criterion. The ALASSO penalty function with the random effects penalty estimate showed significant overfit in the finite sample simulations, and this overfit was also present in the melanoma data analyses. The results of the melanoma data analysis indicate that when predictors are not strongly significant, the results from penalized likelihood maximization may differ depending on the penalty functions and penalty selection methods that are used.

One of the disadvantages of penalized likelihood methods is that they do not provide a measure of model uncertainty, i.e., the probability of selecting each model in the model space. Other methods, such as Bayesian model averaging (Hoeting, Madigan, Raftery and Volinsky (1999)), SIAS, or Bayesian methods in general provide estimates of posterior model probabilities. However, implementation of fully Bayesian methods can be difficult in many cases, since it requires specifying priors for all of the parameters in the response model, covariate distribution (and missing data mechanism under NMAR) which encompass all the models in the model space, as well as calculating marginal likelihoods and enumerating all the models in the model space. Alternatively, the SIAS method is easier to implement but, unlike penalized ML maximization, it does not give an estimate of the parameters of the ‘best’ model. Moreover, the results of the linear regression simulations indicated that the SCAD-RE estimate outperforms SIAS when either the noise level is small or the sample size is large.

Many aspects of this work warrant further research and investigation. One major issue is to carry out variable selection using IC_Q under different modeling situations, such as generalized linear mixed models with nonignorable missing response and/or covariate data, semiparametric survival models with missing covariate data (such as the Cox model and frailty models), measurement error models, and partially linear models with missing covariates and/or responses. Throughout this paper, we made an implicit assumption that the response model does not depend on whether a covariate is observed or missing. That is, we have assumed a single response model for the covariate whether it is missing or not. If we have a different response model for the observed and missing parts of the covariate, then the methods developed in this paper would not be able to detect whether the missing part of a covariate is significant. In this scenario, other statistical methods, such as propensity score methods, may be useful (Kang and Schafer (2007)), but applying these methods to variable selection problems requires further developments both computationally and theoretically. We will formally investigate these issues in our future work.

Supplementary Material

Supplementary data: NIHMS157521-supplement-Supplmentary_data.pdf (155.3 KB, PDF)

Table 4.1.

Simulation results of linear regression model with no missing data and covariates missing at random comparing SCAD and ALASSO penalty functions with random effects and IC_Q penalty estimates.

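Entries are for no missing data, with entries for covariates missing at random (MAR) in parentheses.
Model	Method	MRME	# of 0 coefficients: Correct	# of 0 coefficients: Incorrect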
n = 40, σ = 3	SCAD-RE	1.111 (1.203)	4.91 (4.90)	0.97 (0.98)
	SCAD-IC_Q	0.625 (0.745)	4.53 (4.48)	0.33 (0.45)
	ALASSO-RE	0.632 (0.690)	3.23 (3.42)	0.09 (0.13)
	ALASSO-IC_Q	0.681 (0.771)	4.31 (4.23)	0.28 (0.35)
	SIAS	0.765 (1.004)	4.81 (4.87)	0.55 (0.77)
	Oracle	0.256 (0.305)	5.00 (5.00)	0.00 (0.00)
n = 40, σ = 1	SCAD-RE	0.285 (0.316)	4.34 (4.49)	0.01 (0.01)
	SCAD-IC_Q	0.333 (0.549)	4.64 (4.15)	0.00 (0.00)
	ALASSO-RE	0.472 (0.543)	3.45 (3.23)	0.00 (0.00)
	ALASSO-IC_Q	0.404 (0.572)	4.58 (4.10)	0.00 (0.00)
	SIAS	0.321 (0.360)	4.82 (4.79)	0.00 (0.00)
	Oracle	0.273 (0.258)	5.00 (5.00)	0.00 (0.00)
n = 60, σ = 1	SCAD-RE	0.322 (0.351)	4.54 (4.62)	0.00 (0.00)
	SCAD-IC_Q	0.375 (0.386)	4.86 (4.73)	0.00 (0.00)
	ALASSO-RE	0.517 (0.495)	3.47 (3.53)	0.00 (0.00)
	ALASSO-IC_Q	0.425 (0.447)	4.83 (4.70)	0.00 (0.00)
	SIAS	0.461 (0.387)	4.70 (4.82)	0.00 (0.00)
	Oracle	0.310 (0.356)	5.00 (5.00)	0.00 (0.00)


Contributor Information

Ramon I. Garcia, Email: rgarcia@bios.unc.edu.

Joseph G. Ibrahim, Email: ibrahim@bios.unc.edu.

Hongtu Zhu, Email: hzhu@bios.unc.edu.

References

Bickel PJ, Li B. Regularization in statistics. Test. 2006;76:271–344.
Dempster AP, Laird NM, Rubin DB. Maximum likelihood for incomplete data via the EM algorithm. J Roy Statist Soc Ser B. 1977;39:1–38.
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360.
Fan J, Li R. Variable selection for Cox’s proportional hazards model and frailty model. Ann Statist. 2002;30(1):74–99.
Fan J, Li R. New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. J Amer Statist Assoc. 2004;99:710–723.
George EI, McCulloch RE. Variable selection via Gibbs sampling. J Amer Statist Assoc. 1993;88:881–889.
Hoeting JA, Madigan D, Raftery AE, Volinsky CT. Bayesian model averaging: a tutorial. Statist Sci. 1999;14:382–417.
Hunter DR, Li R. Variable selection using MM algorithms. Ann Statist. 2005;33:1617–1642. doi: 10.1214/009053605000000200.
Ibrahim JG, Chen MH, Lipsitz SR. Monte Carlo EM for missing covariates in parametric regression models. Biometrics. 1999;55:591–596. doi: 10.1111/j.0006-341x.1999.00591.x.
Ibrahim JG, Lipsitz SR. Parameter estimation from incomplete data in binomial regression when the missing data mechanism is nonignorable. Biometrics. 1996;52:1071–1078.
Ibrahim JG, Lipsitz SR, Chen MH. Missing responses in generalized linear mixed models when the missing data mechanism is nonignorable. Biometrika. 2001;88:551–564.
Kang JDY, Schafer JL. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Statist Sci. 2007;22:523–539.
Kirkwood JM, Strawderman MH, Ernstoff MS, Smith TJ, Borden EC, Blum RH. Interferon alfa-2b adjuvant therapy of high-risk resected cutaneous melanoma: the Eastern Cooperative Oncology Group trial EST 1684. Journal of Clinical Oncology. 1996;14:7–17. doi: 10.1200/JCO.1996.14.1.7.
Little RJA, Schluchter M. Maximum likelihood estimation for mixed continuous and categorical data with missing values. Biometrika. 1985;72:497–512.
Louis TA. Finding the observed information matrix when using the EM algorithm. J Roy Statist Soc Ser B. 1983;44:226–233.
McCullagh P, Nelder JA. Generalized Linear Models. 2. Chapman and Hall; London: 1989.
Meng XL, Rubin DB. Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika. 1993;80:267–78.
Wang H, Li R, Tsai CL. Tuning parameter selector for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568. doi: 10.1093/biomet/asm053.
Yang X, Belin TR, Boscardin WJ. Imputation and variable selection in linear regression models with missing covariates. Biometrics. 2005;61:498–506. doi: 10.1111/j.1541-0420.2005.00317.x.
Zou H. The adaptive lasso and its oracle properties. J Amer Statist Assoc. 2006;101:1418–1429.
Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models. Ann Statist. 2008;36:1509–1533. doi: 10.1214/009053607000000802.
