Abstract
We consider the variable selection problem for a class of statistical models with missing data, including missing covariate and/or response data. We investigate the smoothly clipped absolute deviation penalty (SCAD) and adaptive LASSO and propose a unified model selection and estimation procedure for use in the presence of missing data. We develop a computationally attractive algorithm for simultaneously optimizing the penalized likelihood function and estimating the penalty parameters. Particularly, we propose to use a model selection criterion, called the ICQ statistic, for selecting the penalty parameters. We show that the variable selection procedure based on ICQ automatically and consistently selects the important covariates and leads to efficient estimates with oracle properties. The methodology is very general and can be applied to numerous situations involving missing data, from covariates missing at random in arbitrary regression models to nonignorably missing longitudinal responses and/or covariates. Simulations are given to demonstrate the methodology and examine the finite sample performance of the variable selection procedures. Melanoma data from a cancer clinical trial is presented to illustrate the proposed methodology.
Key words and phrases: EM algorithm, ICQ, missing data, penalized likelihood, variable selection
1. Introduction
Variable selection procedures based on penalized likelihood methods have received much attention in the recent literature (Bickel and Li (2006)). Some notable methods include the Lasso, the Smoothly Clipped Absolute Deviation penalty (SCAD) (Fan and Li (2001)), and the Adaptive Lasso (ALASSO) (Zou (2006)), among many others. These methods have been successfully applied to generalized linear models and robust linear regression (Fan and Li (2001)), and to semiparametric models including Cox’s proportional hazards model (Fan and Li (2002, 2004)). Moreover, under an appropriate choice of the penalty parameter, these variable selection procedures can produce efficient estimates with oracle properties (Fan and Li (2001)). The penalty parameter is typically selected by minimizing some criterion over candidate values of the penalty parameter. Commonly used criteria include generalized cross-validation (GCV) and the Bayesian Information Criterion (BIC). It has been shown that BIC can identify the true model consistently, whereas GCV cannot (Wang, Li and Tsai (2007)). Ideally, one would like to use a criterion that results in appropriate choices of the penalty parameter so that the penalized likelihood estimates can possess oracle properties. However, to the best of our knowledge, a general and easy-to-compute penalty and variable selection procedure is not currently available for missing data problems.
Missing data are a common problem in various settings, including surveys, clinical trials, and longitudinal studies. Responses and/or covariates may be missing, and statistical models for handling the missing data often depend on the missing data mechanism, such as data not missing at random (NMAR), also referred to as nonignorable missingness. For example, when there are NMAR covariates, one must specify both the covariate distribution and the missing data mechanism in the likelihood function. These additional distributions bring additional parameters into the model that need to be taken into consideration in model selection. It is common to use some model selection criterion, such as AIC or BIC, based on the observed data log-likelihood to select a small set of variables. For instance, one might use AIC (or BIC) to select a small subset of ‘covariates’ that best predicts the outcome of interest. However, even in the absence of missing data, model selection criteria, such as AIC, can become infeasible for variable selection in linear regression models with a large number of covariates (Fan and Li (2001, 2002)). More discussion on the drawbacks of best subset selection can be found in Fan and Li (2001).
Performing variable selection in statistical models for missing data problems raises several new statistical challenges, underscoring the need for methodological development. In many missing data problems, the observed data log-likelihood is computationally intractable because it requires evaluating high dimensional integrals that do not have a closed form. These integrals can be approximated, but the accuracy of the approximation is essentially impossible to assess in many cases. Thus, it can be infeasible to directly maximize the observed data log-likelihood function, along with the SCAD or ALASSO penalties, to select important variables and calculate their estimates. Furthermore, computing the GCV and BIC to select the penalty parameter also requires computing the intractable likelihood function and running an optimization algorithm for each penalty parameter, which can be computationally intensive for missing data problems. Thus, it is also critical to develop a new penalty selection criterion that is easy to compute in missing data problems.
The aim of this paper is to develop variable selection and penalty selection procedures, along with the SCAD and ALASSO penalties, for a class of statistical models in missing data problems, including generalized linear models with missing covariates and/or responses, random effects models, and latent variable models. We reformulate the penalty parameters in the SCAD and ALASSO as a hyperparameter in the model, and then we use the EM algorithm to simultaneously optimize the penalized likelihood function and estimate the penalty parameters. In addition, we also develop an alternative method based on optimizing a new criterion, which we call the ICQ criterion, to select penalty parameters. The variable selection and penalty selection procedures developed here are very general and can be applied to numerous situations involving missing data and/or random effects and latent variables. Under some regularity conditions, we establish the asymptotic properties (e.g., oracle properties) of the penalized maximum likelihood estimator and the consistency of the ICQ-based penalty selection procedure.
The rest of the paper is organized as follows. Section 2 gives the general development of algorithms for maximizing the penalized likelihood function and selecting penalty parameters in missing data problems. Section 3 characterizes the asymptotic properties of the penalized maximum likelihood (ML) estimator and the ICQ penalty selection procedure. Section 4 presents a simulation study involving missing at random (MAR) covariates in linear models in order to examine the finite sample performance of the penalized ML estimates using various penalty parameter selection procedures, and then analyzes a Melanoma dataset with the proposed methodology. We conclude the paper with some discussion in Section 5.
2. Variable Selection for Regression Models with Missing Data
2.1. Model formulation
For notational simplicity, we focus on data with MAR or NMAR covariates; however, the methods developed below can be adapted to data with both missing responses and covariates (see Ibrahim, Lipsitz and Chen (2001)). Suppose there are n independent observations (x1, z1, y1), …, (xn, zn, yn), where yi is the response variable, zi is a q × 1 vector of partially observed covariates, and xi is a (p−q)×1 vector of completely observed covariates. Let zm,i and zo,i, respectively, denote the missing and observed components of zi. We use the q × 1 random vector ri to indicate the missingness of zi, where the kth component rik = 1 when zik is observed and rik = 0 when zik is missing. We denote the complete and observed data of subject i by Dc,i and Do,i, respectively, and the entire complete and observed data by Dc and Do, respectively.
When the covariates are NMAR, the complete data likelihood is the product of the joint distribution of (yi, zi, ri) given xi, denoted by f (yi, zi, ri|xi), which is typically specified as a product of three conditional distributions as
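$$f(y_i, z_i, r_i \mid x_i, \eta) = f(y_i \mid x_i, z_i, \beta, \tau)\, f(z_i \mid x_i, \alpha)\, f(r_i \mid y_i, x_i, z_i, \xi), \tag{2.1}$$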
where η = (β, τ, α, ξ) collects the parameters of the response model, the covariate distribution, and the missing data mechanism. We use the generic label f(u1|u2) throughout to denote the conditional distribution of u1 given u2. If the covariates are MAR, then the missing data mechanism, f(ri|yi, xi, zi, ξ), can be omitted from (2.1).
As in generalized linear models (see McCullagh and Nelder (1989, Chap. 2)), we assume that the conditional distribution of yi given (xi, zi), denoted by f(yi|xi, zi, β, τ), satisfies
$$E(y_i \mid x_i, z_i) = g\big((z_i^T, x_i^T)\,\beta\big), \tag{2.2}$$
where τ denotes the additional parameters in f(yi|xi, zi, β, τ), g(·) is a known link function, and β = (β1, …, βp)T is a p × 1 vector of regression coefficients. In practice, it is common to assume that yi given (xi, zi) belongs to the exponential family, such as the binomial, normal, or Poisson distributions (Little and Schluchter (1985), and Ibrahim and Lipsitz (1996)).
We model the missing-data mechanism for NMAR covariates according to either a joint log-linear model for f(ri|yi, xi, zi, ξ) or a product of a sequence of one-dimensional conditionals as in Ibrahim, Chen and Lipsitz (1999). Finally, we assume that the covariate distribution f(zi|xi, α) is also modeled via a sequence of one-dimensional conditional distributions as in Ibrahim, Chen and Lipsitz (1999), and is given by

$$f(z_{i1}, \ldots, z_{iq} \mid x_i, \alpha) = f(z_{iq} \mid z_{i1}, \ldots, z_{i,q-1}, x_i, \alpha) \times \cdots \times f(z_{i2} \mid z_{i1}, x_i, \alpha)\, f(z_{i1} \mid x_i, \alpha),$$
where we assume a specific order of conditioning.
2.2. Penalized likelihood for variable selection
In the variable selection problem, our objective is to identify nonzero components of β in (2.2) and simultaneously estimate parameters, while accounting for the missing covariate data. We propose to maximize the penalized likelihood function given by
$$\sum_{i=1}^{n} \log f(D_{o,i} \mid \eta) - n \sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|), \tag{2.3}$$
where λ = (λ1, …, λp)T, λj is the penalty parameter corresponding to the j-th regression coefficient βj, and f(Do,i|η) = ∫ f(yi, zi, ri|xi, η)dzm,i is the observed-data likelihood of the i-th observation. The penalty function, pλj(·), is a nonnegative, nondecreasing, and differentiable function on (0, ∞) (Fan and Li (2001) and Zou (2006)). These properties ensure that the maximization of (2.3) results in estimates of β which are shrunk to zero if they are small. Covariates whose estimated coefficients are zero are treated as insignificant predictors of the response variable, whereas covariates with nonzero estimates are treated as statistically significant predictors. By maximizing (2.3), one can select significant predictors and estimate parameters simultaneously while accounting for the missing data. This approach is in sharp contrast to stepwise selection procedures and Bayesian procedures (George and McCulloch (1993), and Yang, Belin and Boscardin (2005)), which ignore stochastic errors inherited in the selection phase during estimation of the ‘best’ model (Fan and Li (2002)).
In (2.3), the parameters τ, α, and ξ are not penalized, so they are not shrunk to zero even though their actual values may be small. In this sense, variable selection does not occur in the covariate distribution and the missing data mechanism. However, care must be taken in the specification of these distributions since certain specifications can lead to identifiability issues for estimating α, ξ, and thus β.
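Because the observed-data log-likelihood function usually involves intractable integration, we use the EM algorithm to compute the penalized maximum likelihood estimate of η, denoted by η̂λ, for each λ (Dempster, Laird and Rubin (1977)). At the s-th iteration, given η(s), the E step is to evaluate the Q-function given by

$$Q_\lambda(\eta \mid \eta^{(s)}) = Q_{1,\lambda}(\beta, \tau \mid \eta^{(s)}) + Q_2(\alpha \mid \eta^{(s)}) + Q_3(\xi \mid \eta^{(s)}),$$

where

$$\begin{aligned}
Q_{1,\lambda}(\beta, \tau \mid \eta^{(s)}) &= Q_1(\beta, \tau \mid \eta^{(s)}) - n \sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|), \\
Q_1(\beta, \tau \mid \eta^{(s)}) &= \sum_{i=1}^{n} E\big[\log f(y_i \mid x_i, z_i, \beta, \tau) \mid D_{o,i}, \eta^{(s)}\big], \\
Q_2(\alpha \mid \eta^{(s)}) &= \sum_{i=1}^{n} E\big[\log f(z_i \mid x_i, \alpha) \mid D_{o,i}, \eta^{(s)}\big], \\
Q_3(\xi \mid \eta^{(s)}) &= \sum_{i=1}^{n} E\big[\log f(r_i \mid y_i, x_i, z_i, \xi) \mid D_{o,i}, \eta^{(s)}\big].
\end{aligned}$$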
The M step of the algorithm involves maximizing Q1,λ(β, τ |η(s)), Q2(α|η(s)), and Q3(ξ|η(s)), independently. Maximizing Qλ(η|η(s)) with respect to (α, τ, ξ) can be done using standard maximization algorithms, such as Newton-Raphson (Little and Schluchter (1985), and Ibrahim and Lipsitz (1996)). However, it is difficult to maximize Q1,λ(β, τ(s)|η(s)) with respect to β, because it is nondifferentiable and nonconcave (Zou and Li (2008)).
To maximize Q1,λ(β, τ(s)|η(s)) with respect to β, we approximate Q1(β, τ(s)|η(s)) using a second order Taylor series expansion centered at β(s). Using this approximation, Q1,λ(β, τ(s)|η(s)) resembles a penalized weighted least squares regression, so algorithms used for maximizing penalized least squares can be applied. Such algorithms include the local quadratic approximation algorithm (LQA) (Fan and Li (2001)), the best convex minorization-maximization algorithm (MM) (Hunter and Li (2005)), and the local linear approximation algorithm (LLA) (Zou and Li (2008)). We use the local linear approximation method to maximize Q1,λ(β, τ(s)|η(s)), because it has been shown to reduce the computational cost of maximizing penalized likelihoods (Zou and Li (2008)). Even though an approximation is used for Q1,λ(β, τ(s)|η(s)), the maximizer of this function, denoted β(s+1), satisfies Q1,λ(β(s+1), τ(s)|η(s)) ≥ Q1,λ(β(s), τ(s)|η(s)). Therefore, using the ECM algorithm (Meng and Rubin (1993)), we can obtain an η(s+1) such that Qλ(η(s+1)|η(s)) ≥ Qλ(η(s)|η(s)), rather than directly maximizing Qλ(η|η(s)). We iterate this process until it converges and denote the value at convergence by η̂λ. Thus, η̂λ maximizes the penalized observed data log-likelihood.
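As a rough illustration of the β update described above, the sketch below performs one LLA step on a quadratic (second order Taylor) surrogate of Q1 with a SCAD penalty. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`scad_derivative`, `soft_threshold`, `lla_step`), the arguments `grad` and `neg_hess` (standing for the gradient and negative Hessian of Q1 at β(s)), and the cyclic coordinate-descent solver are all hypothetical choices made for illustration.

```python
import numpy as np


def scad_derivative(beta, lam, a=3.7):
    """First derivative of the SCAD penalty at |beta| (Fan and Li, 2001)."""
    b = np.abs(np.asarray(beta, dtype=float))
    return lam * (b <= lam) + np.maximum(a * lam - b, 0.0) / (a - 1.0) * (b > lam)


def soft_threshold(z, t):
    """Scalar soft-thresholding, the solution of a one-dimensional lasso problem."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lla_step(grad, neg_hess, beta_s, lam, n, n_sweeps=500, tol=1e-10):
    """One LLA update for beta (illustrative sketch).

    Maximizes the surrogate
        grad'd - 0.5 * d' neg_hess d - n * sum_j w_j |beta_j|,  d = beta - beta_s,
    where w_j = p'_lam(|beta_s_j|) are the LLA weights, by cyclic coordinate descent.
    """
    beta_s = np.asarray(beta_s, dtype=float)
    grad = np.asarray(grad, dtype=float)
    neg_hess = np.asarray(neg_hess, dtype=float)
    w = scad_derivative(beta_s, lam)  # linearized penalty weights
    beta = beta_s.copy()
    for _ in range(n_sweeps):
        beta_old = beta.copy()
        for j in range(beta.size):
            d = beta - beta_s
            # Partial gradient with coordinate j's own contribution removed.
            r = grad[j] - neg_hess[j] @ d + neg_hess[j, j] * d[j]
            z = beta_s[j] + r / neg_hess[j, j]
            beta[j] = soft_threshold(z, n * w[j] / neg_hess[j, j])
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```

For a linear model with complete data and known σ, one could take `grad = X.T @ (y - X @ beta_s) / sigma**2` and `neg_hess = X.T @ X / sigma**2`; with missing covariates these would be replaced by their conditional expectations from the E step. Small coefficients receive the full weight λ and are thresholded to zero, while large coefficients receive weights near zero and are left almost unshrunk.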
2.3. Penalty selection procedure
To ensure that η̂λ has oracle properties, the penalty parameter λ has to be appropriately selected. Two commonly used criteria for selecting the penalty parameter include the GCV and BIC criteria. These criteria cannot be easily computed in the presence of missing data because they are often functions of the missing data, and thus involve intractable integrals. Moreover, it has been shown that even for the linear model, the GCV can lead to significant overfitting (Wang, Li and Tsai (2007)).
We propose two methods to select the penalty parameter: an ICQ criterion and a random effects penalty estimation method. The ICQ criterion selects the optimal λ by minimizing

$$\mathrm{IC}_Q(\lambda) = -2\,Q(\hat\eta_\lambda \mid \hat\eta_0) + \hat c_n(\hat\eta_\lambda),$$

where η̂0 is the unpenalized maximum likelihood estimate under the full model, and ĉn(η) is a function of the data and the fitted model. For instance, if ĉn(η) equals twice the total number of parameters, then we obtain an AIC-type criterion; alternatively, we obtain a BIC-type criterion when ĉn(η) = dim(η) × log n. Moreover, in the absence of missing data, we recover the usual AIC or BIC criteria. In practice, it is easy to compute ICQ for different λ because we only need samples from f(zm,i|yi, xi, zo,i, η̂0) to approximate Q(η̂λ|η̂0) at each λ.
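The Monte Carlo approximation mentioned above can be sketched as follows. This is an illustrative form only: the number of draws K and the superscript (k) indexing the draws are notation introduced here, and Q is taken to be the conditional expectation of the complete-data log-likelihood.

```latex
Q(\hat\eta_\lambda \mid \hat\eta_0)
  \approx \sum_{i=1}^{n} \frac{1}{K} \sum_{k=1}^{K}
    \log f\big(y_i, z_{o,i}, z_{m,i}^{(k)}, r_i \mid x_i, \hat\eta_\lambda\big),
\qquad
z_{m,i}^{(k)} \sim f\big(z_{m,i} \mid y_i, x_i, z_{o,i}, \hat\eta_0\big).
```

Because the draws depend on η̂0 and not on λ, the same K samples per subject can be reused across the whole grid of candidate λ values; only the log-density has to be re-evaluated at each η̂λ.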
The random effects penalty estimator is calculated under the assumption that the regression coefficients β are distributed as random effects in a hierarchical model. The parameter λ can be regarded as a parameter in the distribution of β, denoted by f(β|λ, n). Then, λ can be estimated by maximizing the marginal likelihood given by
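$$\int \Big\{ \prod_{i=1}^{n} f(D_{o,i} \mid \eta) \Big\}\, f(\beta \mid \lambda, n)\, d\beta, \tag{2.4}$$

where

$$f(\beta \mid \lambda, n) = \prod_{j=1}^{p} C(\lambda_j, n)\, \exp\{-n\, p_{\lambda_j}(|\beta_j|)\}, \tag{2.5}$$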
in which C(λj, n) is the normalizing constant of exp(−npλj (|βj|)). The resulting estimate of λ, denoted by λ̂RE, from the maximization of (2.4) is the random effects penalty estimator. The EM algorithm can be used to calculate λ̂RE by treating the regression coefficients as missing data in the marginal likelihood.
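We consider the SCAD and ALASSO penalties as follows. For ALASSO,

$$p_{\lambda_j}(|\beta_j|) = \lambda_j |\beta_j|$$

for j = 1, …, p. Typical values chosen are λj = λ0|β̂j|^(−γ), where β̂j is the unpenalized ML estimate and γ > 0 is a pre-specified positive scalar. In contrast, the SCAD penalty (Fan and Li (2001)) is a nonconcave function defined by pλ(0) = 0 and for |β| > 0,

$$p'_\lambda(|\beta|) = \lambda\, 1(|\beta| \le \lambda) + \frac{(a\lambda - |\beta|)_+}{a - 1}\, 1(|\beta| > \lambda),$$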
where 1(·) denotes the indicator function, t+ denotes the positive part of t, and a = 3.7. Because the function exp(−npλ(|β|)) for the SCAD penalty is not proper, we use a truncated version of pλ(|β|) to define the density f(β|λ, n). For SCAD, we have

$$f(\beta_j \mid \lambda_j, n) = C(\lambda_j, n)\, \exp\{-n\, p_{\lambda_j}(|\beta_j|)\}\, 1(|\beta_j| \le \bar\beta),$$
where β̄ is arbitrarily large. For the ALASSO penalty, this truncation is not necessary because exp(−npλ(|β|)) is proper.
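The two penalties can be written down in a few lines. The sketch below is illustrative only: the function names are hypothetical, and the closed form used for the SCAD penalty is obtained by integrating the derivative given above from 0 to |β| with pλ(0) = 0.

```python
import numpy as np


def scad_derivative(beta, lam, a=3.7):
    """SCAD derivative: lam on [0, lam], linearly decaying to 0 on (lam, a*lam]."""
    b = np.abs(np.asarray(beta, dtype=float))
    return lam * (b <= lam) + np.maximum(a * lam - b, 0.0) / (a - 1.0) * (b > lam)


def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty, i.e. the integral of scad_derivative from 0 to |beta|."""
    b = np.abs(np.asarray(beta, dtype=float))
    linear = lam * b                                             # |beta| <= lam
    quadratic = (2 * a * lam * b - b**2 - lam**2) / (2 * (a - 1))  # lam < |beta| <= a*lam
    constant = lam**2 * (a + 1) / 2                              # |beta| > a*lam
    return np.where(b <= lam, linear, np.where(b <= a * lam, quadratic, constant))


def alasso_penalty(beta, lam0, beta_mle, gamma=1.0):
    """Adaptive lasso penalty with weights lam_j = lam0 * |beta_mle_j|^(-gamma)."""
    lam_j = lam0 * np.abs(np.asarray(beta_mle, dtype=float)) ** (-gamma)
    return lam_j * np.abs(np.asarray(beta, dtype=float))
```

The SCAD penalty is flat beyond aλ, which is why exp(−npλ(|β|)) does not integrate to a finite value without truncation, whereas the ALASSO penalty grows linearly and yields a proper (Laplace-type) density.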
A closed form expression of λ̂RE is unavailable for both the ALASSO and SCAD penalties. But for the ALASSO penalty, a closed form expression of the conditional maximizer of the log-likelihood function with respect to λ is available. This allows a straightforward implementation of the ECM algorithm to estimate λ. For the SCAD penalty, we use the Newton-Raphson algorithm along with the ECM algorithm to estimate λ̂RE.
3. Theoretical Results
In this section, we establish the asymptotic theory of penalized likelihood estimators and the consistency of the penalty selection procedure based on ICQ. Suppose that $\beta = (\beta_{(1)}^T, \beta_{(2)}^T)^T$, where β(1) and β(2) are, respectively, p1 × 1 and p2 × 1 subvectors. Let $\beta^* = (\beta_{(1)}^{*T}, \beta_{(2)}^{*T})^T$ denote the true value of β. Without loss of generality, we assume that $\beta_{(2)}^* = 0$ and each of the components of $\beta_{(1)}^*$ is not zero.
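Let $\mathcal{S}$ = {j1, …, jd} be a candidate model containing the j1th, …, jdth covariates. Thus, $\mathcal{S}_F$ = {1, …, p} and $\mathcal{S}_T$ = {1, …, p1} denote the full and true covariate models, respectively. If $\mathcal{S}$ misses at least one important covariate, that is, $\mathcal{S} \not\supseteq \mathcal{S}_T$, then $\mathcal{S}$ is referred to as an underfitted model; however, if $\mathcal{S} \supseteq \mathcal{S}_T$ and $\mathcal{S} \neq \mathcal{S}_T$, then $\mathcal{S}$ is an overfitted model. Assume that we only consider the selected covariates in $\mathcal{S}$. The unpenalized and penalized ML estimates of η, denoted by η̂S and η̂λ, respectively, are

$$\hat\eta_{\mathcal{S}} = \arg\max_{\eta:\ \beta_j = 0,\ j \notin \mathcal{S}}\ \sum_{i=1}^{n} \log f(D_{o,i} \mid \eta), \qquad \hat\eta_\lambda = \arg\max_{\eta} \Big\{ \sum_{i=1}^{n} \log f(D_{o,i} \mid \eta) - n \sum_{j=1}^{p} p_{\lambda_j}(|\beta_j|) \Big\},$$

where $\hat\eta_{\mathcal{S}_F}$ = η̂0.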
Theorem 1
Under assumptions (C1)–(C7) stated in the online supplement, we have
(i) η̂λ − η* = Op(n^(−1/2)) as n → ∞, where η* is the true value of η.
(ii) Sparsity: P(β̂(2)λ = 0) → 1.
(iii) Asymptotic normality: η̂λ, suitably centered and scaled, is asymptotically normal with mean and covariance defined in the online supplement.
The proof of Theorem 1 is given in the online supplement at http://www.stat.sinica.edu.tw/statistica. The theorem states that, under an appropriate choice of the penalty λ, there exists a root-n consistent estimator of η, η̂λ, and that this estimator possesses the sparsity property, i.e., β̂(2)λ = 0 with probability tending to one. Theorem 1(iii) establishes that η̂λ is asymptotically normal. An expression for the asymptotic covariance matrix of η̂λ can be obtained using Louis’s method (Louis (1983)). These estimates are given in the online supplement.
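We investigate whether the ICQ(λ) criterion can consistently select the correct model. For each λ ∈ $R^p_+$, β̂λ naturally defines a candidate model $\mathcal{S}_\lambda$ = {j: β̂λj ≠ 0}. Generally, $\mathcal{S}_\lambda$ can be either underfitted, overfitted, or true. Therefore, $R^p_+$ can be partitioned into three mutually exclusive regions according to whether $\mathcal{S}_\lambda$ is underfitted, true, or overfitted. Furthermore, we can always choose a reference penalty parameter sequence that satisfies the conditions necessary for Theorem 1 to hold. Thus, the model selected under this reference sequence equals the true model with probability converging to one. To select a better model, we first calculate

$$d\mathrm{IC}_Q(\lambda_2, \lambda_1) = \mathrm{IC}_Q(\lambda_2) - \mathrm{IC}_Q(\lambda_1).$$

We assume $\mathcal{S}_{\lambda_2} \supset \mathcal{S}_{\lambda_1}$ and choose the model resulting from using the penalty value λ1 (i.e., $\mathcal{S}_{\lambda_1}$) if dICQ(λ2, λ1) ≥ 0; otherwise we choose model $\mathcal{S}_{\lambda_2}$.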
Define δc(λ2, λ1) = ĉn(η̂λ2) − ĉn(η̂λ1). The companion quantity δQ(λ2, λ1) used below is built from terms defined in the online supplement.
Theorem 2
Under assumptions (C1)–(C7) in the Appendix of the online supplement, we have the following results.
(a) If, for all λ yielding an underfitted model, lim infn δQ(λ, 0)/n > 0 and δc(λ, 0) = op(n), then dICQ(λ, 0) > 0 in probability for all such λ.
(b) If … and … for t = 1, 2, then dICQ(λ2, λ1) > 0 in probability as ….
(c) If Q(η̂λ1|η̂0) − Q(η̂λ2|η̂0) = Op(1), then dICQ(λ2, λ1) > 0 in probability as δc(λ2, λ1) → ∞.
The proof of Theorem 2 is given in the online supplement. Theorem 2 has some important implications. Theorem 2(a) shows that ICQ(λ) chooses all significant covariates with probability tending to 1. Because the full model, obtained at λ = 0, is always a candidate, minimizing ICQ(λ) will not select a λ that yields an underfitted model, since dICQ(λ, 0) > 0 in probability for such λ. Generally, the most commonly used ĉn(η), such as 2dim(η), dim(η) log(n), and K log log(n) (K > 0), satisfy the condition δc(λ, 0) = op(n). The condition lim infn n^(−1)δQ(λ, 0) > 0 ensures that ICQ(λ) chooses a model with large Q(η̂λ|η̂0). This condition is analogous to Condition 2 in Wang, Li and Tsai (2007), which elucidates the effect of models that underfit. Because δQ(λ, 0) can be written as … where … it then follows from Jensen’s inequality that n^(−1)δQ(λ, 0) ≥ 0. Thus, if a model $\mathcal{S}_\lambda$ misses a significant covariate, it is reasonable to assume lim infn n^(−1)δQ(λ, 0) is greater than zero.
If λ1 and λ2 yield the same limiting average Q-function value, that is, lim infn n^(−1)δQ(λ2, λ1) = 0, then Theorem 2(b) and (c) indicate that ICQ(λ) picks out the smaller model $\mathcal{S}_{\lambda_1}$ when δc(λ2, λ1) increases to ∞ at a certain rate (e.g., log(n)). For example, for the BIC-type criterion, δc(λ2, λ1) = [dim($\mathcal{S}_{\lambda_2}$) − dim($\mathcal{S}_{\lambda_1}$)] log(n) ≥ log(n), since we assume $\mathcal{S}_{\lambda_2} \supset \mathcal{S}_{\lambda_1}$. However, the AIC-type criterion ĉn(η) = 2 × dim(η) does not satisfy this condition. Thus, similar to the standard AIC, ICQ with ĉn(η) = 2 × dim(η) tends to overfit.
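To make the role of δc concrete, the short derivation below assumes the working form ICQ(λ) = −2Q(η̂λ|η̂0) + ĉn(η̂λ) for the criterion of Section 2.3 and writes d_t for the number of parameters in the model selected by λt. The symbol d_t is notation introduced here for illustration.

```latex
\begin{aligned}
d\mathrm{IC}_Q(\lambda_2, \lambda_1)
  &= \mathrm{IC}_Q(\lambda_2) - \mathrm{IC}_Q(\lambda_1) \\
  &= 2\big\{ Q(\hat\eta_{\lambda_1} \mid \hat\eta_0) - Q(\hat\eta_{\lambda_2} \mid \hat\eta_0) \big\}
     + \delta_c(\lambda_2, \lambda_1), \\
\text{BIC-type:}\quad \delta_c(\lambda_2, \lambda_1) &= (d_2 - d_1)\,\log n \;\to\; \infty
  \quad (d_2 > d_1), \\
\text{AIC-type:}\quad \delta_c(\lambda_2, \lambda_1) &= 2\,(d_2 - d_1)
  \quad \text{(bounded in } n\text{)}.
\end{aligned}
```

When the difference in Q is Op(1), the sign of dICQ(λ2, λ1) is eventually determined by δc. With n = 60 and one extra covariate, for instance, the BIC-type term is log 60 ≈ 4.09 and keeps growing with n, whereas the AIC-type term stays at 2, so it cannot dominate an Op(1) difference with probability tending to one.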
4. Numerical Studies
4.1. Example 1: simulation study
We demonstrate the performance of the penalized ML estimates using our proposed penalty estimators via simulations and compare them to the unpenalized ML estimate. Our objective for these simulations was to (1) compare the performance of the random effects and the ICQ penalty estimators, (2) compare the performance of the SCAD and ALASSO penalty functions, and (3) determine how the comparisons in (1) and (2) differ in the complete data and missing covariate settings.
To do this, we simulated datasets consisting of n observations from the model y = uT β* + σε, where β* = (3, 1.5, 0, 0, 2, 0, 0, 0)T and the components of u = (u1, …, u8) and ε are standard normal. The correlation between ui and uj is ρ^|i−j| with ρ = 0.5. This model was used in Fan and Li (2001). We considered three settings: (n = 40, σ = 3), (n = 40, σ = 1), and (n = 60, σ = 1). For each of them, two sets of 100 datasets were simulated, one with complete data and another with missing covariate data. For the datasets with missing data, the missing covariates zi = (u1i, u2i) were taken to be MAR and xi = (u3i, …, u8i) were completely observed. The covariate distribution is given by [zi|xi] ~ N2(μi, Σ) for i = 1, …, n, where μi = (μ1i, μ2i), μsi = … for s = 1, 2, and Σ is an unstructured 2 × 2 covariance matrix. The missing data mechanism used was f(ri1, ri2|yi, xi, φ) = f(ri1|yi, xi, φ1)f(ri2|ri1, yi, xi, φ2), where f(ri1|yi, xi, φ1) and f(ri2|ri1, yi, xi, φ2) are logistic regressions whose parameters φ1 and φ2 were selected such that 65% of the observations had complete data.
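The complete-data part of this design can be generated as in the sketch below. It is an illustration under stated assumptions: the function name `simulate_dataset` and the `seed` argument are hypothetical, and the missingness step is omitted because the logistic regression parameters φ1 and φ2 are not reported here.

```python
import numpy as np


def simulate_dataset(n, sigma, rho=0.5, seed=0):
    """Draw one complete dataset from y = u' beta* + sigma * eps (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    beta_star = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])
    p = beta_star.size
    idx = np.arange(p)
    # Correlation between u_i and u_j is rho^|i-j|; variances are one.
    corr = rho ** np.abs(idx[:, None] - idx[None, :])
    u = rng.multivariate_normal(np.zeros(p), corr, size=n)
    eps = rng.standard_normal(n)
    y = u @ beta_star + sigma * eps
    return u, y, beta_star, corr
```

The three settings in the text correspond to `simulate_dataset(40, 3.0)`, `simulate_dataset(40, 1.0)`, and `simulate_dataset(60, 1.0)`, each repeated with 100 different seeds.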
For each simulated dataset, the penalized ML estimate using the SCAD and ALASSO penalties was computed using the random effects and ICQ penalty estimates. These estimates are denoted as SCAD-RE, SCAD-ICQ, ALASSO-RE, and ALASSO-ICQ, respectively. For the ICQ estimate, the BIC-type criterion, cn(η) = dim(η) log n, was used. In the analysis of the datasets with no missing covariates, the ICQ criterion is equivalent to BIC. For the random effects penalty estimator, 2,000 Monte Carlo iterations were used within each iteration of EM. Since the EM algorithm can be sensitive to starting values, the algorithm was initiated from multiple starting values to increase the chance that the global maximum was reached. For the ALASSO penalty, we set λj = λ0|β̂j0|^(−1), where β̂j0 is the unpenalized ML estimate, and for the SCAD penalty we let λj = λ0 for all j, where in both cases λ0 was estimated using the penalty estimation methods.
In addition to the penalized estimates, the unpenalized ML estimate of the model selected by the simultaneously impute and select (SIAS) method of Yang, Belin and Boscardin (2005) was computed. SIAS implements the stochastic search variable selection (SSVS) method of George and McCulloch (1993) in the presence of missing covariates. SIAS is a fully Bayesian method which does not require model enumeration or computation of marginal likelihoods, so it may be easier to implement than other fully Bayesian methods. In the analysis of the datasets with no missing covariates, SIAS is equivalent to SSVS. Details of the implementation of SIAS are given in the online supplement.
For each estimate β̂λ, the model error, ME(β̂λ) = (β̂λ − β*)T E(uuT)(β̂λ − β*), was computed, together with the ratio of the model error of the penalized ML estimate to that of the unpenalized ML estimate, ME(β̂λ)/ME(β̂0). The median of these ratios over the 100 simulated datasets, denoted as MRME, is reported. The MRME of the true model, denoted as ‘oracle’, is also reported. In addition, the average number of zero coefficients correctly estimated to be zero and the average number of nonzero coefficients incorrectly estimated to be zero are reported. These are reported in the columns ‘Correct’ and ‘Incorrect’, respectively.
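These summary measures can be computed as in the following sketch. The function names are hypothetical, and `second_moment` stands for E(uuT), which under the simulation design above is the matrix with entries ρ^|i−j|.

```python
import numpy as np


def model_error(beta_hat, beta_star, second_moment):
    """ME = (beta_hat - beta*)' E(uu') (beta_hat - beta*)."""
    d = np.asarray(beta_hat, dtype=float) - np.asarray(beta_star, dtype=float)
    return float(d @ second_moment @ d)


def median_relative_model_error(me_penalized, me_unpenalized):
    """Median over datasets of ME(penalized) / ME(unpenalized)."""
    ratios = np.asarray(me_penalized, dtype=float) / np.asarray(me_unpenalized, dtype=float)
    return float(np.median(ratios))


def count_zeros(beta_hat, beta_star):
    """Return ('Correct', 'Incorrect') zero counts for one fitted coefficient vector."""
    est_zero = np.asarray(beta_hat, dtype=float) == 0
    true_zero = np.asarray(beta_star, dtype=float) == 0
    correct = int(np.sum(est_zero & true_zero))      # true zeros estimated as zero
    incorrect = int(np.sum(est_zero & ~true_zero))   # nonzero coefficients set to zero
    return correct, incorrect
```

With β* as in the simulation, the best possible value of ‘Correct’ is 5 and the best possible value of ‘Incorrect’ is 0; averaging the two counts over the 100 datasets gives the reported columns.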
The results indicate that when the noise level is high (σ = 3), the ALASSO-RE and SCAD-ICQ estimates have the smallest model error while the SCAD-RE has the highest. When the noise level is reduced (σ = 1), or the sample size is large (n = 60), the SCAD-RE estimate has the smallest model error. For the estimates, MRME values greater than one indicate that the estimate performs worse than the unpenalized ML estimate, values near one indicate it performs as well as the unpenalized ML estimate, while values near the ‘oracle’ MRME value indicate optimal performance. The SCAD-RE performed poorly when the noise level was high; however, it is optimal when either the noise level is small or the sample size is large. The ALASSO-RE estimate had substantial overfit, since ‘Correct’ averaged significantly less than 5, indicating a tendency to not set insignificant coefficients to zero. The SIAS estimate performed as well as the unpenalized ML estimate when the noise level was large and covariates were missing; however, it outperformed the ML estimate when either the noise level was low, the sample size was large, or all the covariates were fully observed. ‘Correct’ averages and ‘Incorrect’ averages that are both high indicate that the estimate is more likely to set coefficients to zero rather than not. This was the case with the SIAS and SCAD-RE estimates when the noise level was large. Comparing the analysis of no missing covariate data to the analysis with missing covariate data shows that for all the estimates, the estimation error increased, overfitting increased, and underfitting increased.
4.2. Example 2: melanoma data
To further illustrate our proposed methods, we consider data on n = 286 patients from a phase III two arm clinical trial conducted by the Eastern Cooperative Oncology Group. The results from this study have been reported in Kirkwood, Strawderman, Ernstoff, Smith, Borden and Blum (1996). Patients in this trial were randomized to one of two treatment arms: high dose interferon or observation. Interferon is suggested to have a significant effect on disease-free survival. Here, disease-free survival is defined as the time from randomization until progression of tumor or death, whichever comes first. In this analysis, several prognostic factors were identified as important predictors of survival. Among these factors are z1 = Breslow thickness (in mm), z2 = size of primary (in cm²), z3 = type of primary tumor (two levels: superficial spreading, other), x1 = age (in years), x2 = pathological group (two levels: previous recurrence and other), and x3 = treatment (two levels: high dose interferon and observation). Of these six covariates, three had missing data while the rest of the covariates and the response variable were completely observed. The three covariates with missing data were Breslow thickness, size, and type. Logarithms of Breslow thickness and size were used in this analysis to achieve approximate normality of these covariates in the covariate distribution. The dataset had a total missing data fraction of 28.7%. The outcome variable, yi, was taken here to be binary, and was assigned a 1 if the patient had an overall survival greater than or equal to 0.55 years, and 0 otherwise. There were no censored cases that had an overall survival below 0.55 years.
To analyze these data, a logistic regression model was used for yi|xi, β with E(yi|xi, β) = exp(γi)/(1 + exp(γi)), where γi = (1, zi, xi)T β, zi = (zi1, zi2, zi3)T, xi = (xi1, xi2, xi3)T, and β = (β0, β1, …, β6). For the missing covariates, we assume they are MAR and have the covariate distribution

$$f(z_{i1}, z_{i2}, z_{i3} \mid x_i, \alpha) = f(z_{i3} \mid z_{i1}, z_{i2}, x_i, \alpha)\, f(z_{i1}, z_{i2} \mid x_i, \alpha)$$

for i = 1, …, n. Since xi is completely observed, it is conditioned on throughout. We take (zi1, zi2|xi) ~ N2(μi, Σ), where μi = (μi1, μi2), μis = … for s = 1, 2, i = 1, …, n, and Σ is an unstructured 2 × 2 covariance matrix. A logistic regression model was used for zi3 conditional on (zi1, zi2, xi). The same estimates as those computed in the simulations were computed. The statistical model used for the SIAS method is given in the online supplement.
The results are presented in Table 4.2. The predictors identified as significant differed across the estimation methods. In the missing data analysis, the ALASSO and SIAS estimates identified treatment as a significant predictor while the SCAD estimates did not. The ALASSO-ICQ estimate identified treatment and pathology as significant, while the ALASSO-RE estimate identified treatment, pathology and age as significant. According to the unpenalized ML analysis, treatment and pathology are the only predictors which are possibly significant since their p-values are near or below the cutoff value of 0.05 for significance. However, neither of these predictors was strongly significant. Therefore, a possible explanation for the differences in the results of the various estimation methods is that these methods may not be able to discriminate very well between models that include or exclude treatment and pathology. The results of the unpenalized maximum likelihood analysis coincided with the results of the ALASSO-ICQ and SIAS estimates. As with the simulations, the ALASSO-RE estimate tended to overfit since it identified age as significant even though its p-value was greater than 0.05, and the SCAD-RE estimate tended to set coefficients to 0 since it did not identify any predictors as significant. The estimate of the regression coefficient for treatment decreased from 1.117 in the complete case analysis to 0.839 in the missing data analysis. This change caused the SCAD-ICQ estimate to identify treatment as significant in the complete case analysis but not significant in the missing data analysis.
Table 4.2. Estimates of Melanoma Data

Missing Data Estimate

| Variable | SCAD RE | SCAD ICQ | ALASSO RE | ALASSO ICQ | SIAS | MLE (p value) |
|---|---|---|---|---|---|---|
| Intercept | 2.132 | 2.132 | 2.421 | 2.280 | 1.774 | 2.638 (<0.001) |
| Breslow | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | −0.217 (0.332) |
| Size | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | −0.052 (0.798) |
| Type | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | −0.161 (0.730) |
| Age | 0.000 | 0.000 | −0.267 | 0.000 | 0.000 | −0.325 (0.146) |
| Pathology | 0.000 | 0.000 | −0.845 | −0.454 | 0.000 | −1.061 (0.039) |
| Treatment | 0.000 | 0.000 | 0.737 | 0.322 | 0.827 | 0.839 (0.043) |

Complete Case Estimate

| Variable | SCAD RE | SCAD ICQ | ALASSO RE | ALASSO ICQ | SIAS | MLE (p value) |
|---|---|---|---|---|---|---|
| Intercept | 2.085 | 1.609 | 2.043 | 1.820 | 1.609 | 2.210 (<0.001) |
| Breslow | 0.000 | 0.000 | −0.081 | 0.000 | 0.000 | −0.222 (0.400) |
| Size | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | −0.089 (0.650) |
| Type | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.235 (0.650) |
| Age | 0.000 | 0.000 | −0.113 | 0.000 | 0.000 | −0.232 (0.356) |
| Pathology | 0.000 | 0.000 | −0.578 | 0.000 | 0.000 | −0.945 (0.086) |
| Treatment | 0.000 | 1.173 | 1.003 | 0.572 | 1.173 | 1.117 (0.028) |
5. Discussion
We have proposed a general method to simultaneously perform model selection and estimation in the presence of missing data. We have shown that under regularity conditions and appropriate rates of the penalty parameter, the penalized estimate possesses oracle properties. We have introduced two computationally attractive methods for estimating the penalty parameters. We have shown that under an appropriate choice of ĉn(η), the ICQ penalty estimate chooses all the significant predictors with probability tending to one. Simulation results show that the SCAD penalty function with the random effects penalty estimate performs well when the noise level is small, whereas it performs poorly when the noise level is large. Overall, the SCAD performed better when it was used with the random effects penalty estimator whereas the ALASSO performed better when it was used with the ICQ criterion. The ALASSO penalty function with the random effects penalty estimate showed significant overfit in the finite sample simulations and this overfit was also present in the Melanoma data analyses. The results of the Melanoma data analysis indicate that when predictors are not strongly significant, the results from penalized likelihood maximization may differ depending on the penalty functions and penalty selection methods which are used.
One of the disadvantages of penalized likelihood methods is that they do not provide a measure of model uncertainty, i.e., the probability of selecting each model in the model space. Other methods, such as Bayesian model averaging (Hoeting, Madigan, Raftery and Volinsky (1999)), SIAS, or Bayesian methods in general, provide estimates of posterior model probabilities. However, implementation of fully Bayesian methods can be difficult in many cases, since it requires specifying priors for all of the parameters in the response model and the covariate distribution (and the missing data mechanism under NMAR), which encompass all the models in the model space, as well as calculating marginal likelihoods and enumerating all the models in the model space. Alternatively, the SIAS method is easier to implement but, unlike penalized likelihood maximization, it does not give an estimate of the parameters of the ‘best’ model. Moreover, the results of the linear regression simulations indicated that the SCAD-RE estimate outperforms SIAS when either the noise level is small or the sample size is large.
Many aspects of this work warrant further research and investigation. One major issue is to carry out variable selection using ICQ under different modeling situations, such as generalized linear mixed models with nonignorable missing response and/or covariate data, semiparametric survival models with missing covariate data (such as the Cox model as well as frailty models), measurement error models, and partially linear models with missing covariates and/or responses. Throughout this paper, we made an implicit assumption that the response model does not depend on whether a covariate is observed or missing. That is, we have assumed a single response model for the covariate whether it is missing or not. If we have a different response model for the observed and missing parts of the covariate, then the methods developed in this paper would not be able to detect whether the missing part of a covariate is significant. In this scenario, other statistical methods, such as propensity score methods, may be useful (Kang and Schafer (2007)), but applying these methods to variable selection problems requires further development both computationally and theoretically. We will formally investigate these issues in our future work.
Supplementary Material
Table 4.1. Simulation results of linear regression model with no missing data and with covariates missing at random, comparing SCAD and ALASSO penalty functions with random effects and ICQ penalty estimates. Entries are given as: no missing data (MAR).

| Model | Method | MRME | # of 0 coefficients: Correct | # of 0 coefficients: Incorrect |
|---|---|---|---|---|
| n = 40, σ = 3 | SCAD-RE | 1.111 (1.203) | 4.91 (4.90) | 0.97 (0.98) |
| | SCAD-ICQ | 0.625 (0.745) | 4.53 (4.48) | 0.33 (0.45) |
| | ALASSO-RE | 0.632 (0.690) | 3.23 (3.42) | 0.09 (0.13) |
| | ALASSO-ICQ | 0.681 (0.771) | 4.31 (4.23) | 0.28 (0.35) |
| | SIAS | 0.765 (1.004) | 4.81 (4.87) | 0.55 (0.77) |
| | Oracle | 0.256 (0.305) | 5.00 (5.00) | 0.00 (0.00) |
| n = 40, σ = 1 | SCAD-RE | 0.285 (0.316) | 4.34 (4.49) | 0.01 (0.01) |
| | SCAD-ICQ | 0.333 (0.549) | 4.64 (4.15) | 0.00 (0.00) |
| | ALASSO-RE | 0.472 (0.543) | 3.45 (3.23) | 0.00 (0.00) |
| | ALASSO-ICQ | 0.404 (0.572) | 4.58 (4.10) | 0.00 (0.00) |
| | SIAS | 0.321 (0.360) | 4.82 (4.79) | 0.00 (0.00) |
| | Oracle | 0.273 (0.258) | 5.00 (5.00) | 0.00 (0.00) |
| n = 60, σ = 1 | SCAD-RE | 0.322 (0.351) | 4.54 (4.62) | 0.00 (0.00) |
| | SCAD-ICQ | 0.375 (0.386) | 4.86 (4.73) | 0.00 (0.00) |
| | ALASSO-RE | 0.517 (0.495) | 3.47 (3.53) | 0.00 (0.00) |
| | ALASSO-ICQ | 0.425 (0.447) | 4.83 (4.70) | 0.00 (0.00) |
| | SIAS | 0.461 (0.387) | 4.70 (4.82) | 0.00 (0.00) |
| | Oracle | 0.310 (0.356) | 5.00 (5.00) | 0.00 (0.00) |
Contributor Information
Ramon I. Garcia, Email: rgarcia@bios.unc.edu.
Joseph G. Ibrahim, Email: ibrahim@bios.unc.edu.
Hongtu Zhu, Email: hzhu@bios.unc.edu.
References
- Bickel PJ, Li B. Regularization in statistics. Test. 2006;15:271–344.
- Dempster AP, Laird NM, Rubin DB. Maximum likelihood for incomplete data via the EM algorithm. J Roy Statist Soc Ser B. 1977;39:1–38.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360.
- Fan J, Li R. Variable selection for Cox’s proportional hazards model and frailty model. Ann Statist. 2002;30(1):74–99.
- Fan J, Li R. New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. J Amer Statist Assoc. 2004;99:710–723.
- George EI, McCulloch RE. Variable selection via Gibbs sampling. J Amer Statist Assoc. 1993;88:881–889.
- Hoeting JA, Madigan D, Raftery AE, Volinsky CT. Bayesian model averaging: a tutorial. Statist Sci. 1999;14:382–417.
- Hunter DR, Li R. Variable selection using MM algorithms. Ann Statist. 2005;33:1617–1642. doi: 10.1214/009053605000000200.
- Ibrahim JG, Chen MH, Lipsitz SR. Monte Carlo EM for missing covariates in parametric regression models. Biometrics. 1999;55:591–596. doi: 10.1111/j.0006-341x.1999.00591.x.
- Ibrahim JG, Lipsitz SR. Parameter estimation from incomplete data in binomial regression when the missing data mechanism is nonignorable. Biometrics. 1996;52:1071–1078.
- Ibrahim JG, Lipsitz SR, Chen MH. Missing responses in generalized linear mixed models when the missing data mechanism is nonignorable. Biometrika. 2001;88:551–564.
- Kang JDY, Schafer JL. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Statist Sci. 2007;22:523–539.
- Kirkwood JM, Strawderman MH, Ernstoff MS, Smith TJ, Borden EC, Blum RH. Interferon alfa-2b adjuvant therapy of high-risk resected cutaneous melanoma: the Eastern Cooperative Oncology Group trial EST 1684. Journal of Clinical Oncology. 1996;14:7–17. doi: 10.1200/JCO.1996.14.1.7.
- Little RJA, Schluchter M. Maximum likelihood estimation for mixed continuous and categorical data with missing values. Biometrika. 1985;72:497–512.
- Louis TA. Finding the observed information matrix when using the EM algorithm. J Roy Statist Soc Ser B. 1982;44:226–233.
- McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. Chapman and Hall; London: 1989.
- Meng XL, Rubin DB. Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika. 1993;80:267–278.
- Wang H, Li R, Tsai CL. Tuning parameter selector for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568. doi: 10.1093/biomet/asm053.
- Yang X, Belin TR, Boscardin WJ. Imputation and variable selection in linear regression models with missing covariates. Biometrics. 2005;61:498–506. doi: 10.1111/j.1541-0420.2005.00317.x.
- Zou H. The adaptive lasso and its oracle properties. J Amer Statist Assoc. 2006;101:1418–1429.
- Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models. Ann Statist. 2008;36:1509–1533. doi: 10.1214/009053607000000802.
