Abstract
Modern regularization and variable selection methods such as lasso and Bayesian variable selection are important tools for psychological researchers to reduce the risk of overfitting, improve prediction in future samples, and increase model interpretability. Although missing data are common in psychological data, it is not straightforward to combine principled methods for addressing missing data with these modern variable selection methods. This challenge is well-illustrated in a recent paper by Gunn and colleagues (2022) with a comparison of three approaches for combining lasso with multiple imputation to address missing data. The surveyed approaches yield markedly different results in terms of the predictors selected. Their findings underscore limitations of the lasso for the purpose of variable selection. In this paper we show how to implement a Bayesian variable selection method, Stochastic Search Variable Selection (SSVS), with multiply imputed data. SSVS is a principled and consistent method for variable selection, and we demonstrate advantages relative to lasso in an example dataset and simulation study. It is straightforward to apply an impute-then-combine strategy for SSVS using existing software.
Keywords: Bayesian variable selection, SSVS, missing data, multiple imputation, regression
Translational Abstract:
Psychological researchers often analyze many potential predictors, which increases the risk of identifying relationships that do not replicate in future samples. Methods such as the lasso are widely used to reduce this risk by selecting a smaller set of important predictors. However, psychological data also frequently contain missing values, and combining variable-selection methods with approaches for handling missing data is challenging. Recent work comparing ways to combine the lasso with multiple imputation—a common method for addressing missing data—shows that different approaches can lead to very different conclusions about which predictors are important. In this paper, we demonstrate how to apply a Bayesian variable-selection method, Stochastic Search Variable Selection (SSVS), with multiply imputed data. Using both an example dataset and simulations, we show that this approach provides a principled and consistent way to identify important predictors and can offer advantages over the lasso, while remaining straightforward to implement with existing software.
Model regularization and variable selection methods are becoming more widespread in social science research (McNeish, 2015), following trends in other disciplines. An important outstanding challenge is how to use such methods when there is missing data. This paper demonstrates how stochastic search variable selection (SSVS), a fully Bayesian variable selection method, can be straightforwardly combined with multiple imputation (MI) to handle missing data. This approach addresses complications that arise when combining the more commonly used least absolute shrinkage and selection operator or lasso (Tibshirani, 1996) with multiple imputation (Gunn et al., 2022), and offers additional advantages for research questions in psychology.
The challenge of selecting a meaningful set of predictors for a given outcome is a longstanding issue in psychological research. When choosing a regression modeling strategy, it is important to match the strategy to the goals of the analysis. In psychology, researchers are often concerned with explanation, or making inferences from their models to inform theory, specifically whether or how predictors are related to an outcome of interest. This is in contrast to modeling with a primary focus on prediction. Variable selection consistency is therefore an important characteristic of methods used with the goal of explanation. Classical stepwise regression methods have been used in psychology since the 1960s (e.g., Dennerill, 1964; Sengstake, 1965) to select a subset of predictors based on stepwise significance tests. More recent stepwise methods select subsets using information-based criteria (e.g., AIC or BIC). However, there has been longstanding recognition that such methods are prone to overfitting and capitalization on chance, and therefore to poor replicability and explanatory value (Henderson & Denison, 1989). These limitations have motivated interest in more principled approaches to variable selection that can support valid statistical inference.
One widely adopted alternative to stepwise techniques is lasso regularization, which trades a reduction in estimator variance for a bias in regression coefficients that shrinks coefficients towards zero. Because the lasso penalty shrinks some coefficients to exactly zero, the method performs simultaneous regularization and selection, and has exploded in popularity across disciplines including psychology. However, lasso has several well-recognized limitations that may restrict its applicability for explanatory research in psychology. The method may over-shrink meaningfully large coefficients (Polson & Scott, 2011), does not perform well when predictors are highly correlated (Zou & Hastie, 2005), and does not asymptotically select the correct set of predictors except under restrictive conditions (Fan & Li, 2001). These limitations have motivated several extensions and alternatives to lasso (e.g., Zou, 2006; Zou & Hastie, 2005). Despite these limitations, lasso is still frequently and widely used, including for the goal of variable selection in psychology (McNeish, 2015).
Alternatively, variable selection can be approached directly in a Bayesian framework using prior specifications to explicitly model uncertainty in the predictor set (van Erp et al., 2019). Stochastic search variable selection (SSVS) is one fully Bayesian variable selection technique that can be used to estimate the probability that each predictor should be included in the model (Mitchell & Beauchamp, 1988). Unlike lasso, which produces a single selected model at a given tuning parameter value, SSVS provides predictor-specific marginal inclusion probabilities (MIPs) that quantify uncertainty about whether each predictor belongs in the model. These MIPs are estimated as the proportion of times each predictor is selected for inclusion during MCMC sampling. Full posterior distributions are also available for each regression coefficient, providing information that aligns well with explanatory research goals in psychology. Relative to lasso, SSVS improves variable selection accuracy and adaptively shrinks coefficients, addressing some of lasso’s key limitations (S. A. Bainter et al., 2023).
All of the variable selection methods discussed so far assume complete cases. However, missing data is prevalent in social sciences research. Common software programs default to listwise deletion, a generally unacceptable strategy for dealing with missing data as it can greatly decrease the available sample size and bias results depending on the missingness mechanism (see next section for definitions) (Schafer & Graham, 2002). As the number of candidate predictors increases, the likelihood of complete data for each case tends to decrease. In the extreme, it may be that no cases have complete data. Another frequently used approach is a single imputation method, such as mean substitution, which is also unacceptable because single imputation ignores uncertainty in the missing values, potentially biasing point estimates and standard errors.
Widely acceptable modern methods for handling missing data are maximum likelihood estimation using the EM algorithm, and multiple imputation, both of which can be easily applied to a standard regression analysis. Both approaches are effective and require less stringent assumptions compared with complete case analysis. For many common model types either approach is accessible and useful for overcoming the disadvantages of complete case analysis and accounting for uncertainty in missing values. The same is not yet true for methods that perform variable selection, and care must be taken when combining uncertainty in the model (i.e. which predictors are included) with the uncertainty in missing values. For example, the most common method for tuning the lasso solution is to use cross-validation, which requires splitting the data into training and test sets and choosing a solution that optimizes an appropriate index of prediction error. It remains an open question how the steps of imputation, cross-validation, and splitting into training and test sets should be combined (Gunn et al., 2022). In their recent paper, Gunn et al. (2022) compared three approaches for fitting a lasso when using multiple imputation to handle missing data. The approaches were compared in the context of a motivating example, predicting depression severity from a set of candidate predictors including demographics, social determinants, and risk and protective factors in a sample of adolescents. They found that the results, especially in terms of predictors selected, varied greatly depending on which approach was used. In contrast, SSVS can be straightforwardly applied using an impute-then-select approach (Yang et al., 2005) because the Bayesian framework naturally propagates uncertainty across imputed datasets without requiring cross-validation or train-test splits.
In this paper, we demonstrate how SSVS can be used with multiply imputed data using an impute-then-select approach (Yang et al., 2005). We focus on SSVS as a principled alternative to lasso that is both more suitable for research questions in psychology and more straightforward to implement with multiple imputation. We extend recent comparisons of SSVS and lasso approaches (S. A. Bainter et al., 2023; van Erp et al., 2019) to the context of missing data and multiple imputation. For our comparison we use the same example data and imputations used by Gunn et al. (2022), together with a targeted simulation study to illustrate how results are expected to differ depending on the missingness mechanism.
Modern Variable Selection Approaches
Regularization and the LASSO
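The standard linear regression model for an outcome $y_i$ for individual $i$ and a set of $p$ predictors $x_{i1}, \ldots, x_{ip}$ can be written as

$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i \tag{1}$$

with intercept $\beta_0$, slope coefficients $\beta_j$, and errors $\varepsilon_i \sim N(0, \sigma^2)$. For a given model, the usual way of obtaining estimates is to find the Ordinary Least Squares (OLS) solution. The OLS parameter estimates are determined by minimizing the loss function

$$L_{\text{OLS}}(\beta_0, \beta_1, \ldots, \beta_p) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \tag{2}$$

where $N$ is the total number of observations, $y_i$ is the observed score on the outcome variable for the $i$th observation, $x_{ij}$ is that observation's score on predictor $j$, $\beta_0$ is the intercept, $\beta_j$ is the slope coefficient for predictor $j$, and $p$ is the total number of predictors. OLS estimation provides a solution that is the best linear unbiased estimator (BLUE). However, the OLS estimates tend to have high variance, particularly when the number of predictors is large relative to the sample size, and can therefore yield high prediction error when applied to additional samples from the same population.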
The least absolute shrinkage and selection operator (lasso; Tibshirani, 1996) and other regularization methods add a penalty term to the OLS loss function that shrinks coefficient estimates towards zero. This biased estimation produces estimates that are less variable across samples and results in lower out-of-sample prediction error. The lasso penalty is the sum of the absolute values of the coefficients, weighted by a penalty parameter $\lambda \ge 0$:

$$L_{\text{lasso}}(\beta_0, \beta_1, \ldots, \beta_p) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \tag{3}$$

For $\lambda = 0$, the lasso solution is identical to OLS. Because the penalty term applies equally to all predictors regardless of scale, predictors are standardized prior to estimation. A notable feature of the lasso penalty is that some coefficients are set to exactly zero, which removes the corresponding predictors from the model; the method therefore performs simultaneous regularization and variable selection.
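To make the effect of the penalty concrete, the following is a minimal sketch of cyclic coordinate descent for a penalized sum-of-squares loss of the form in Equation 3, written in plain Python. It is an illustration under stated assumptions rather than the implementation used in any particular package: the simulated data, the function names (`soft_threshold`, `standardize`, `lasso_cd`), and the fixed number of sweeps are all hypothetical choices made for this example.

```python
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def standardize(values):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

def lasso_cd(X_cols, y, lam, n_sweeps=200):
    """Minimize sum_i (y_i - b0 - sum_j b_j x_ij)^2 + lam * sum_j |b_j|.

    X_cols is a list of standardized predictor columns. The number of
    sweeps is fixed for simplicity (an assumption, not a convergence test).
    """
    n, p = len(y), len(X_cols)
    b0 = sum(y) / n                       # intercept when predictors are centered
    beta = [0.0] * p
    resid = [yi - b0 for yi in y]         # residuals under the current coefficients
    ss = [sum(v * v for v in col) for col in X_cols]
    for _ in range(n_sweeps):
        for j, xj in enumerate(X_cols):
            # inner product of x_j with the partial residual (x_j's own term added back)
            rho = sum(v * (r + beta[j] * v) for v, r in zip(xj, resid))
            new_bj = soft_threshold(rho, lam / 2.0) / ss[j]
            delta = new_bj - beta[j]
            resid = [r - delta * v for v, r in zip(xj, resid)]
            beta[j] = new_bj
    return b0, beta

# Hypothetical simulated data: predictors 0 and 2 matter, 1 and 3 do not.
rng = random.Random(1)
n = 200
raw = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(4)]
X_cols = [standardize(col) for col in raw]
y = [1.0 * X_cols[0][i] + 0.5 * X_cols[2][i] + rng.gauss(0, 1) for i in range(n)]

y_bar = sum(y) / n
# Smallest penalty at which every coefficient is exactly zero
lam_max = 2.0 * max(abs(sum(v * (yi - y_bar) for v, yi in zip(col, y))) for col in X_cols)

b0_ols, beta_ols = lasso_cd(X_cols, y, lam=0.0)              # no penalty: OLS
b0_mid, beta_mid = lasso_cd(X_cols, y, lam=0.5 * lam_max)    # intermediate penalty
b0_null, beta_null = lasso_cd(X_cols, y, lam=lam_max)        # all coefficients zero
print(beta_ols, beta_mid, beta_null)
```

With the penalty set to zero the sketch reproduces the OLS fit, and as the penalty grows coefficients are shrunk and eventually set to exactly zero, which is the selection behavior described above.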
The lasso loss function does not have a closed-form solution outside of the orthogonal design case. Lasso estimates for a given $\lambda$ are obtained using efficient algorithms such as Least Angle Regression (Tibshirani et al., 2004) or the coordinate descent algorithm (Friedman et al., 2009). The value of the penalty term $\lambda$ must also be chosen, and an optimal value is usually selected using cross-validation. K-fold cross-validation splits the data into K folds (for example, K = 10). For each fold k = 1, …, K, lasso coefficients are estimated for a specific value of $\lambda$ using the other K − 1 folds, and the estimates are used to predict values in the kth fold. A measure of fit such as mean squared error is calculated for each fold and then averaged across folds for that value of $\lambda$. This process is repeated for a range of values of $\lambda$, typically a series of 100 values, and an “optimal” value of $\lambda$ may be chosen by minimizing the cross-validation error. Less commonly, $\lambda$ is chosen by minimizing an information criterion (e.g., AIC, BIC) in the sample. However, these information criteria require degrees of freedom, which are ill-defined for the lasso (Zou et al., 2007).
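The cross-validation procedure just described can be sketched as follows. This is a simplified illustration, not a substitute for established software: the lasso fitter is a bare-bones coordinate descent routine, the grid contains eight penalty values rather than the 100 typically used, K is set to 5 to keep the example fast, and the data are simulated. All function and variable names are hypothetical.

```python
import random

def soft_threshold(z, t):
    return z - t if z > t else z + t if z < -t else 0.0

def fit_lasso(X, y, lam, n_sweeps=50):
    """Coordinate descent for sum of squared errors + lam * sum_j |b_j|.
    X is a list of rows; the intercept is left unpenalized."""
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    cols = [[row[j] - x_means[j] for row in X] for j in range(p)]
    ss = [sum(v * v for v in col) for col in cols]
    resid = [yi - y_mean for yi in y]
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j, xj in enumerate(cols):
            rho = sum(v * (r + beta[j] * v) for v, r in zip(xj, resid))
            new_bj = soft_threshold(rho, lam / 2.0) / ss[j]
            delta = new_bj - beta[j]
            resid = [r - delta * v for v, r in zip(xj, resid)]
            beta[j] = new_bj
    b0 = y_mean - sum(b * m for b, m in zip(beta, x_means))
    return b0, beta

def predict(b0, beta, row):
    return b0 + sum(b * v for b, v in zip(beta, row))

def kfold_cv_error(X, y, lam, K=5, seed=0):
    """Held-out mean squared error for one penalty value, averaged over K folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)          # same fold assignment for every lam
    folds = [idx[k::K] for k in range(K)]
    fold_mse = []
    for k in range(K):
        held_out = set(folds[k])
        train = [i for i in idx if i not in held_out]
        b0, beta = fit_lasso([X[i] for i in train], [y[i] for i in train], lam)
        errs = [(y[i] - predict(b0, beta, X[i])) ** 2 for i in folds[k]]
        fold_mse.append(sum(errs) / len(errs))
    return sum(fold_mse) / K

# Hypothetical simulated data: two true predictors among six candidates.
rng = random.Random(42)
n, p = 100, 6
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [1.0 * row[0] + 0.5 * row[1] + rng.gauss(0, 1) for row in X]

lam_grid = [0.0, 5.0, 10.0, 20.0, 40.0, 80.0, 160.0, 320.0]
cv_errors = {lam: kfold_cv_error(X, y, lam) for lam in lam_grid}
best_lam = min(cv_errors, key=cv_errors.get)
print(best_lam, cv_errors[best_lam])
```

Because the selected penalty depends on how the folds are drawn, a different seed can lead to a different selected model, which is one source of the instability discussed later when cross-validation is combined with multiple imputation.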
In addition to the cross-validation procedure performed to select an optimal value of $\lambda$, the final model is evaluated by assessing generalization error with new data (E. E. Chen & Wojcik, 2016). Ideally this evaluation would happen with a new sample of data, but in practice the overall sample is usually split into training and test sets before performing K-fold cross-validation within the training set to select an optimal value of $\lambda$. The predictive performance is then evaluated in the holdout test set of the sample. The main advantage of the lasso is a decrease in prediction error in new samples for the tradeoff of biased (shrunken) parameter estimates. Other advantages that have contributed to its widespread use are the property of automatic variable selection and efficient algorithms in easily accessible software (Tibshirani, 2011).
However, the lasso has some notable shortcomings. Lasso regression performs poorly when predictors are correlated—tending to select only one of a group of correlated predictors (Zou & Hastie, 2005). In addition, the lasso penalty may over-penalize meaningfully large coefficients (Polson & Scott, 2011). The lasso is also not consistent in selecting the correct predictor set unless certain restrictive conditions are met (Fan & Li, 2001). It is possible that the property of consistent variable selection may not be of central importance if the primary motivation of the analysis is prediction rather than explanation (Yarkoni & Westfall, 2017). However, in practice it is tempting for researchers to interpret the lasso solution, and the lasso is often used and interpreted in an explanatory way, for example, to make inferences about which predictors are important for understanding treatment response or to predict psychological symptoms. Moreover, if an analysis is focused purely on prediction, many methods outperform the lasso in terms of prediction, especially Bayesian model averaging methods (Porwal & Raftery, 2022).
Other regularization methods have been developed to overcome limitations of lasso. The elastic net incorporates the ridge and lasso penalties to select correlated sets of predictors (Zou & Hastie, 2005). The adaptive lasso was developed to asymptotically select the true set of predictors (Zou, 2006). To selectively shrink coefficients, Fan and Li (2001) developed the Smoothly Clipped Absolute Deviation (SCAD) penalty. Regardless of the penalty, cross-validation is most often used to arrive at a selected model, and an important challenge is how to deal with missing data within this process.
LASSO with Missing Data
Missing data is prevalent in social sciences research and may occur for a number of reasons. For example a participant may skip pages in a survey, a device may malfunction and not record data, or a participant may not provide a usable biological sample. Missing data can negatively impact results by reducing statistical power and biasing parameter estimates. Missing data are commonly categorized into three general missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) (Rubin, 1976).
Missingness is categorized as MCAR if the likelihood of data being missing is completely unrelated to any observed or unobserved values in the data, for example, if a participant accidentally skips the back side of a two-sided questionnaire. Missingness is MAR if the probability of missingness is related to observed values in the data but, given those observed values, not to the missing values themselves; for example, questions about income may be more likely to be skipped by younger participants. MNAR missingness means that the probability of missingness is related to the missing values themselves. This would be the case if a question asking about substance use is more likely to be skipped by respondents who engage in substance use.
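The three mechanisms can be illustrated with a small simulation. The sketch below is purely hypothetical: the variables (`age`, `income`), the logistic missingness models, and their coefficients are assumptions chosen for illustration and are not taken from any data set discussed in this paper.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(2024)
n = 5000

# Hypothetical standardized variables: age is fully observed,
# income is the variable that will have missing values.
age = [rng.gauss(0, 1) for _ in range(n)]
income = [0.5 * a + rng.gauss(0, 1) for a in age]

def apply_missingness(values, prob_missing):
    """Replace values[i] with None with probability prob_missing[i]."""
    return [None if rng.random() < p else v for v, p in zip(values, prob_missing)]

# MCAR: every value has the same chance of being missing.
income_mcar = apply_missingness(income, [0.2] * n)

# MAR: missingness depends only on observed age (younger -> more missing).
income_mar = apply_missingness(income, [logistic(-1.4 - 1.0 * a) for a in age])

# MNAR: missingness depends on the income value itself (higher -> more missing).
income_mnar = apply_missingness(income, [logistic(-1.4 + 1.0 * v) for v in income])

def observed_mean(values):
    obs = [v for v in values if v is not None]
    return sum(obs) / len(obs)

full_mean = sum(income) / n
for label, column in [("MCAR", income_mcar), ("MAR", income_mar), ("MNAR", income_mnar)]:
    rate = sum(v is None for v in column) / n
    print(label, "missing rate:", round(rate, 3),
          "complete-case mean minus full mean:", round(observed_mean(column) - full_mean, 3))
```

Under MCAR the complete-case mean stays close to the full-sample mean, whereas under MAR and MNAR it drifts away from it. The practical difference is that under MAR the drift can be corrected by conditioning on the observed variable (here, age), while under MNAR it cannot be corrected from the observed data alone.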
Missing data is considered the rule rather than the exception in social science research, but the issue is likely to be compounded in variable selection applications. Larger sets of predictors make it less likely for individual observations to have complete data, especially if data are pooled from different sources (e.g., self-report, wearable devices, biological measures, behavioral data). Although modern methods for handling missing data are well-developed, relatively little research has explored their application in the context of variable selection. The simplest method for dealing with missing data, or rather for not dealing with it, is to restrict the analysis to complete cases only, also known as listwise deletion. Complete case analysis can be extremely inefficient, discarding potentially large proportions of the data, and it makes the restrictive assumption that any missing values are MCAR; if the missingness is not MCAR, results may be biased. The two primary methods for more appropriately dealing with missing data can be broadly referred to as full information maximum likelihood (FIML) and multiple imputation (MI) approaches (Schafer & Graham, 2002).
Both methods make use of the available information in complete and incomplete cases, negate bias in results if the causes of missingness are related to observed values that have been incorporated into the analysis, and in some cases the methods yield theoretically equivalent results (Enders, 2006). FIML estimation provides estimates given the available data and incorporating uncertainty of missing values under distributional assumptions. While FIML estimation is widely available for many types of analyses, such as the families of generalized linear models and latent variable models, it is not currently accessible for estimating the lasso with missing data. This is an ongoing area of research (e.g. Garcia et al., 2010; Sabbe et al., 2013), but not currently programmed into software for general use. It is a nontrivial challenge to estimate a variable selection model allowing uncertainty in both the model and the missing values (Jiang et al., 2015).
Whereas FIML handles missing values within the model estimation process, MI addresses missing data in multiple stages. First, a number of copies of the data are created, and within each copy the missing values are imputed with appropriate plausible values. Second, an analysis model is applied to each imputed data set. Finally, parameter estimates and standard errors are pooled over the imputations to obtain results that appropriately account for uncertainty in the unobserved values while retaining available information in the observed data (Rubin, 2004). MI is a flexible overarching framework with many options for forming the imputation model and drawing plausible values. Two common frameworks are joint modeling, where all variables are imputed simultaneously under a single multivariate distribution, and fully conditional specification (FCS), wherein each variable is imputed separately from its own conditional model in an iterative sequence. Joint modeling has the theoretical advantage of drawing imputations from a coherent joint distribution, but in practice FCS is more flexible and easier to implement, especially with mixed variable types. For a given application, selecting an appropriate MI procedure requires considering multiple factors, including the distributions of the incomplete variables and the planned analysis model.
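The final pooling stage is commonly carried out with Rubin's rules, which combine the average within-imputation variance with the between-imputation variance of the estimates. A minimal sketch for a single parameter follows; the function name `pool_rubin` and the five estimates and squared standard errors are made-up values used only to show the arithmetic.

```python
import math

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets using Rubin's rules.

    estimates: point estimate from each imputed data set
    variances: squared standard error from each imputed data set
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                                   # pooled point estimate
    within = sum(variances) / m                                  # average sampling variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1) # variance across imputations
    total = within + (1.0 + 1.0 / m) * between                   # total variance
    return {"estimate": q_bar, "se": math.sqrt(total),
            "within": within, "between": between}

# Hypothetical results for one regression coefficient from m = 5 imputations
pooled = pool_rubin(
    estimates=[0.42, 0.38, 0.45, 0.40, 0.35],
    variances=[0.010, 0.012, 0.011, 0.009, 0.013],
)
print(pooled)
```

The between-imputation component is what carries the uncertainty about the missing values into the pooled standard error; single imputation omits it, which is why single-imputation standard errors are too small.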
Some recent studies have explored integrating lasso with multiple imputation to address missingness (Gunn et al., 2022; Long & Johnson, 2015; Takada et al., 2019; Thao & Geskus, 2019; Zhao & Long, 2016). One known consideration is that any candidate predictors for a variable selection model should also be included in the imputation model in order for the models to be compatible (Van Buuren et al., 2006). However, as demonstrated by Gunn et al. (2022), there are important considerations for how the steps of lasso estimation should be combined with the steps of MI.
Specifically, Gunn et al. compared a separate approach, a stacked approach, and the MI-LASSO. The separate approach involves splitting each imputed data set into training and test sets, performing K-fold cross-validation to select a value for $\lambda$ within the training sample of each imputed data set, validating each imputation-specific lasso model in the corresponding test set to calculate a model performance measure such as MSE, and finally pooling the MSE values across the imputations. Within the separate approach there are many additional decisions to consider, such as whether to use an average or otherwise pooled value of $\lambda$ for all imputations or imputation-specific values, and how to pool zero and nonzero coefficients across imputations. Additionally, an inclusion frequency (IF) threshold must be chosen to create a decision rule for predictors included in some but not all imputed solutions. All of these decision points invite the possibility of different solutions in terms of variables selected.
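The role of the inclusion frequency threshold in the separate approach can be illustrated with a short sketch. The predictor names and coefficients below are invented for illustration, and the helper functions are hypothetical; the point is only that the selected set depends on the threshold chosen.

```python
def inclusion_frequency(coefs_by_imputation):
    """Proportion of imputed data sets in which each lasso coefficient is nonzero.

    coefs_by_imputation: list with one dict per imputed data set,
    mapping predictor name -> estimated lasso coefficient.
    """
    m = len(coefs_by_imputation)
    names = list(coefs_by_imputation[0])
    return {name: sum(1 for fit in coefs_by_imputation if fit[name] != 0.0) / m
            for name in names}

def select_by_threshold(frequencies, threshold):
    """Keep predictors whose inclusion frequency meets or exceeds the threshold."""
    return sorted(name for name, f in frequencies.items() if f >= threshold)

# Hypothetical lasso solutions from four imputed data sets
fits = [
    {"age": 0.12, "sleep": -0.30, "stress": 0.45, "income": 0.00},
    {"age": 0.00, "sleep": -0.28, "stress": 0.47, "income": 0.00},
    {"age": 0.10, "sleep": -0.31, "stress": 0.44, "income": 0.05},
    {"age": 0.00, "sleep": -0.29, "stress": 0.46, "income": 0.00},
]

freqs = inclusion_frequency(fits)
selected_half = select_by_threshold(freqs, 0.5)   # included in at least half of the fits
selected_all = select_by_threshold(freqs, 1.0)    # included in every fit
print(freqs, selected_half, selected_all)
```

In this toy example the hypothetical predictor `age` is retained under a 50% threshold but dropped under a 100% threshold, so two analysts applying the same separate approach to the same imputations could report different selected models.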
The two other approaches considered, termed the stacked approach and the MI-LASSO, are applied to combined or “stacked” imputed training and test sets rather than to each imputed data set separately. For these methods, each imputed data set is split into training and test sets, and the training and test sets are stacked into one long training data set and one long test data set. For the stacked approach, the lasso is applied using K-fold cross-validation to the stacked training data set to select an optimal value of $\lambda$, and the resulting model is applied to the stacked test data set to compute a model performance measure. This approach will select a much larger set of nonzero predictors, because the stacked data sets are much larger than the individual data sets and there is no correction applied to account for this duplication of the data. The third procedure considered, MI-LASSO, analyzes the stacked training set using the group lasso penalty (Q. Chen & Wang, 2013). The group lasso was initially motivated to select or exclude sets of coefficients as a group, such as main effects together with an interaction (Yuan & Lin, 2006). For MI-LASSO, the group lasso penalty is used to estimate the lasso for all imputed training sets jointly, so that the imputation-specific regression coefficients for each predictor are either included as a group or all set to zero.
In their comprehensive comparison, Gunn and colleagues highlighted the complexity of applying lasso with multiply imputed data and provided a tutorial for each approach. They showed that the predictors selected varied considerably according to approach used, provided guidance for future research in this area, and noted that the pros and cons of each approach do not highlight a clear recommended approach. Besides each method differing in terms of the computational and analytical complexity, a major concern was model interpretability. The methods for combining MI were compared using lasso, but their results are generalizable to related regularization approaches based on cross-validation, such as the elastic net. In the next section we present Bayesian SSVS as an alternative variable selection approach, with distinct advantages relative to lasso, including for use with multiple imputation.
Bayesian Variable Selection Approaches
Many alternative approaches for regularization and variable selection are available in a Bayesian framework. Bayesian estimation includes a prior distribution for each parameter in the model, which contains information about uncertainty in the values of parameter estimates, that is combined with the probability distribution of new data to yield the posterior distribution, which is used for inference. The posterior distribution is characterized using Markov Chain Monte Carlo (MCMC) sampling. MCMC estimation can be thought of as Monte Carlo integration using Markov chains; these algorithms provide a flexible approach to systematically sample from the target posterior distribution. While often more computationally intensive, there are several advantages to approaching variable selection, including with missing data, from a Bayesian framework.
Bayesian modeling readily incorporates the principle of regularization via the prior distribution. There are infinite possibilities for prior specification, but a few cases highlight connections between frameworks. For a given value of $\lambda$, the lasso estimates can be obtained by placing independent Laplace (i.e., double-exponential) priors on the regression coefficients (Park & Casella, 2008) and obtaining the maximum a posteriori (MAP) estimates. However, a value of $\lambda$ must still be selected, and within a Bayesian framework, rather than selecting an optimal value using cross-validation, the natural choice is to put a prior distribution on $\lambda$. Although this approach has been termed the “Bayesian lasso”, this labeling obscures the fact that the MAP estimates using this procedure are not equivalent to the lasso estimates, because the resulting solution no longer optimizes the lasso-penalized sum of squared errors.
Stochastic Search Variable Selection
Whereas the lasso can be used to perform simultaneous selection and regularization, these aims may also be addressed more directly and independently by specifying a discrete mixture prior for the regression coefficients with one component tightly centered around zero and a second diffuse component with high variance, termed a “spike-and-slab” prior. This indicates the prior belief that each coefficient is expected to either have a value of zero (or near-zero) or be meaningfully large, and membership in these two mixture components can itself be assigned a prior distribution and estimated. This is done by defining an indicator variable $\gamma_j$ for each predictor, so that $\gamma_j = 0$ and $\gamma_j = 1$ indicate absence or presence of predictor $j$, respectively. Just as for the lasso approaches, for the prior to apply to all predictors regardless of scale, the predictor variables are standardized before estimation. The linear model is then expressed as

$$y_i = \beta_0 + \sum_{j=1}^{p} \gamma_j \beta_j x_{ij} + \varepsilon_i \tag{4}$$
and the prior for $\beta_j$, if we choose to set the spike as exactly zero, can be written as

$$\beta_j \mid \gamma_j \sim (1 - \gamma_j)\,\delta_0 + \gamma_j\, N(0, c^2) \tag{5}$$

where $\delta_0$ is a point mass at zero, the slab variance $c^2$ is chosen to accommodate meaningfully large coefficients, and the prior probability of inclusion is $P(\gamma_j = 1) = \pi_j$. If $\pi_j = 0.5$ for each predictor, this reflects the prior belief that half of the predictors should be included. Diffuse priors are also set for the nuisance parameters, for example:
| (6) |
Following MCMC estimation, Marginal Inclusion Probabilities (MIPs) can be calculated as the proportion of times each predictor was selected for inclusion in the model. More specifically, the MIP is the proportion of MCMC samples with $\gamma_j = 1$ for each predictor $j$. The MIPs provide an estimate of the probability that each predictor has a nonzero effect; there is no analogous quantity using lasso-type approaches with model optimization. This approach, Stochastic Search Variable Selection (SSVS), has been extensively compared to lasso methods for research in psychology (S. A. Bainter et al., 2023). Compared to lasso, SSVS is consistent, is more interpretable, and also minimizes overfitting by marginalizing over the distribution of possible models. The above prior specification is the default in the SSVS R package (S. A. Bainter et al., 2022).
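To show how MIPs arise from MCMC output, the following is a toy Gibbs sampler for a spike-and-slab regression with a point-mass spike at zero. It is a sketch under explicit assumptions and is not the algorithm implemented in the SSVS R package: the slab is taken to be Normal with variance 10, the residual precision is given a Gamma(0.01, 0.01) prior, the prior inclusion probability is 0.5, the outcome is centered so that no intercept is sampled, and each inclusion indicator is drawn with its coefficient integrated out. The data are simulated and all names are hypothetical.

```python
import math
import random

def standardize(values):
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]

def ssvs_gibbs(X_cols, y, slab_var=10.0, prior_incl=0.5, a=0.01, b=0.01,
               n_iter=1500, burn_in=500, seed=1):
    """Return the marginal inclusion probability (MIP) of each predictor.

    X_cols: standardized predictor columns; y: centered outcome.
    All prior settings are illustrative assumptions.
    """
    rng = random.Random(seed)
    n, p = len(y), len(X_cols)
    ss = [sum(v * v for v in col) for col in X_cols]
    beta = [0.0] * p               # holds gamma_j * beta_j; zero means excluded
    resid = list(y)                # y minus the current linear predictor
    sigma2 = 1.0
    incl_counts = [0] * p
    prior_log_odds = math.log(prior_incl / (1.0 - prior_incl))
    for it in range(n_iter):
        for j, xj in enumerate(X_cols):
            # inner product of x_j with the residual that excludes predictor j
            rho = sum(v * (r + beta[j] * v) for v, r in zip(xj, resid))
            prec = ss[j] / sigma2 + 1.0 / slab_var      # conditional posterior precision
            # log Bayes factor for inclusion with beta_j integrated out
            log_bf = -0.5 * math.log(slab_var * prec) + (rho / sigma2) ** 2 / (2.0 * prec)
            log_odds = max(-700.0, min(700.0, prior_log_odds + log_bf))
            prob_incl = 1.0 / (1.0 + math.exp(-log_odds))
            include = rng.random() < prob_incl
            # draw beta_j from its conditional posterior if included, else set to zero
            new_bj = rng.gauss(rho / (sigma2 * prec), prec ** -0.5) if include else 0.0
            delta = new_bj - beta[j]
            resid = [r - delta * v for v, r in zip(xj, resid)]
            beta[j] = new_bj
            if include and it >= burn_in:
                incl_counts[j] += 1
        # residual variance: inverse-gamma update given the current residuals
        rss = sum(r * r for r in resid)
        sigma2 = 1.0 / rng.gammavariate(a + n / 2.0, 1.0 / (b + rss / 2.0))
    # MIP = proportion of retained draws in which the predictor was included
    return [c / (n_iter - burn_in) for c in incl_counts]

# Hypothetical simulated data: only the first of four predictors has an effect.
data_rng = random.Random(7)
n = 150
X_cols = [standardize([data_rng.gauss(0, 1) for _ in range(n)]) for _ in range(4)]
y_raw = [0.8 * X_cols[0][i] + data_rng.gauss(0, 1) for i in range(n)]
y_mean = sum(y_raw) / n
y = [v - y_mean for v in y_raw]

mips = ssvs_gibbs(X_cols, y)
print([round(m, 3) for m in mips])
```

The returned values are simply the post-burn-in proportions of draws in which each indicator equals one, which is the MIP calculation described above. In applied work an established implementation with convergence diagnostics should be preferred to a sketch like this one.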
The marginal distributions of the coefficients can also be examined to obtain model-averaged estimates; however, interpreting these estimates can be challenging, since the meaning of a coefficient may depend on the set of variables included in each model (Forte et al., 2018). Different prior choices and computation strategies for Bayesian variable selection are reviewed by O’Hara and Sillanpää (2009). Van Erp et al. (2019) compared the variable selection accuracy and prediction error of Bayesian penalized regression methods including SSVS, Bayesian lasso, and the horseshoe prior (Carvalho et al., 2010). In a review of Bayesian model selection methods emphasizing out-of-sample performance, Piironen and Vehtari (2017) highlight the performance of projection predictive variable selection. Because the Bayesian approach characterizes uncertainty in the model and parameters through the posterior distribution, rather than through cross-validation, the process of combining MI with SSVS is also more straightforward.
Multiple Imputation and Stochastic Search Variable Selection
As for lasso regularization, the literature on applying multiple imputation with Bayesian variable selection is sparse. There are two general strategies for combining Bayesian variable selection methods such as SSVS with multiple imputation, which have been called “impute, then select” (ITS) and “simultaneously impute and select” (SIAS) (Yang et al., 2005). ITS involves the steps of initially performing multiple imputation, applying Bayesian variable selection within each imputed data set, and applying multiple imputation combining rules (Rubin, 2004) to pool results. SIAS, also called data augmentation, implements multiple imputation simultaneously with variable selection by incorporating both into the MCMC sampling process. Yang et al. (2005) compared SIAS and ITS for SSVS and found both approaches were effective in mitigating bias due to MAR missingness and loss of power due to complete case analysis. In their study, SIAS resulted in slightly better performance than ITS in terms of lower Monte Carlo standard errors. However, their comparison used only five imputations for ITS, and more imputations would be expected to improve these Monte Carlo standard errors and decrease the observed difference between methods.
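The pooling step of ITS can be sketched for MIPs as follows. Averaging each predictor's MIP across the imputed data sets is assumed here as one simple pooling choice for illustration; the predictor names and MIP values are invented, and the function `pool_mips` is hypothetical rather than part of any package.

```python
def pool_mips(mips_by_imputation):
    """Average each predictor's MIP across imputed data sets.

    mips_by_imputation: list with one dict per imputed data set,
    mapping predictor name -> MIP estimated within that data set.
    Simple averaging is an assumption made for this illustration.
    """
    m = len(mips_by_imputation)
    names = list(mips_by_imputation[0])
    return {name: sum(fit[name] for fit in mips_by_imputation) / m for name in names}

# Hypothetical MIPs from running SSVS separately on three imputed data sets
mips_by_imputation = [
    {"sleep": 0.98, "stress": 0.91, "age": 0.12},
    {"sleep": 0.96, "stress": 0.85, "age": 0.20},
    {"sleep": 1.00, "stress": 0.94, "age": 0.16},
]

pooled_mips = pool_mips(mips_by_imputation)
print(pooled_mips)
```

Unlike the inclusion frequency rule needed for the lasso, no cross-validation, train-test split, or per-imputation zero/nonzero decision is involved: each imputed data set contributes a probability, and the probabilities are combined directly.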
Multiple imputation has also been applied for variable selection using the horseshoe prior and a SIAS approach, but this approach is also not implemented in any available software (Zhang & Kim, 2024). It should be noted that, as shown in Jiang et al. (2015), issues can still arise when variable selection is performed after multiple imputation. Specifically, when the imputation is informed by a correct (but not necessarily efficient) model, a double-dipping phenomenon can arise that makes variable selection more difficult. However, an impute-then-combine strategy is straightforward to implement using existing software. We demonstrate combining SSVS with MI under ITS using the example provided by Gunn and colleagues (2022) and a small simulation study.
Real Data Example
For an applied example we use the same original data and set of imputations provided by Gunn et al. (2022). The example data are from a controlled trial to evaluate interventions to improve HIV prevention and mental health outcomes among high-risk youth, conducted through the Adolescent Medicine Trials Network (study protocol 149; UCLA IRB #16–001674-AM-00006). The full study is described in Swendeman et al. (2019). The example uses baseline data from the study to predict depression symptoms for the sample of 1,486 adolescents aged 14 to 24 years (M = 20.89, SD = 2.15). Recent depression symptoms were measured using the nine-item Patient Health Questionnaire (PHQ-9; Kroenke et al., 2001). The candidate predictor set consisted of 46 variables, including demographics and risk and protective factors. Across the 47 variables (the outcome and the 46 predictors), the rate of missingness varied between 0 and 12%. However, only about two thirds of the sample (1,004 participants, 68%) had complete observations on all 47 variables. After dummy coding the categories of a nominal predictor (i.e., race/ethnicity), there were a total of 29 binary and 20 continuous candidate predictors.
Multiple imputation of the missing values was done using fully conditional specification in Blimp 2.1 software. Following recent recommendations, imputations were generated (Von Hippel, 2020). The imputation model contained the same set of variables as the variable selection analysis, with no interaction effects, nonlinear terms, or auxiliary variables. Convergence of the MCMC algorithm was monitored by examining the potential scale reduction factors (PSRF Brooks & Gelman, 1998; Gelman et al., 2013) for each parameter; a cutoff PSRF below 1.10 for all parameters was used to suggest convergence (Gelman et al., 2013). A burn-in period of 1,000 iterations satisfied this threshold, and a thinning interval of 1,000 iterations was used to generate the 50 imputed data sets which were shared as supplementary material with their article. We first review the results obtained for each approach using lasso and MI.
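The convergence criterion described above can be illustrated with a minimal sketch of the potential scale reduction factor for a single parameter. This is the basic Gelman-Rubin form written in plain Python, not Blimp's implementation; the function name `psrf` and the input layout (one list of draws per chain) are assumptions made for illustration.

```python
from statistics import mean, variance


def psrf(chains):
    """Basic potential scale reduction factor for one parameter.

    chains: list of equal-length lists of posterior draws, one list per chain
    (hypothetical input layout, chosen for this sketch).
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least two chains of equal length")
    chain_means = [mean(c) for c in chains]
    # W: average within-chain variance; B: between-chain variance scaled by n.
    w = mean(variance(c) for c in chains)
    b = n * variance(chain_means)
    # Pooled estimate of the posterior variance, then the ratio to W.
    var_plus = (n - 1) / n * w + b / n
    return (var_plus / w) ** 0.5
```

Under the rule used in the text, a parameter would be flagged as not yet converged when `psrf(chains)` is 1.10 or larger; chains that overlap closely give values near 1.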
Summary of Results Combining Lasso and MI
The lasso results obtained by Gunn and colleagues (2022) using listwise deletion and each of the three strategies explored for combining lasso with MI are reproduced in the columns on the left of Table 1. The bottom row of Table 1 shows that the number of candidate predictors selected was 16 using listwise deletion and ranged from 20 to 47 (41% to 96%) depending on the strategy used for combining lasso with MI. The stacked approach included nearly all of the candidate predictors in the solution, and the grouped lasso resulted in the fewest predictors chosen. All three MI strategies improved the predictive performance of the lasso relative to listwise deletion, and the stacked approach marginally outperformed the others. However, performance among the MI approaches was similar even though the solutions differed in the predictors selected.
Table 1.
Comparison of Lasso and SSVS methods with listwise deletion and MI for depression example
| Predictor | Lasso: Listwise | Lasso: Separate (IF%) | Lasso: Stacked | Lasso: MI-LASSO | SSVS: Listwise (MIP) | SSVS: MI (MIP) |
|---|---|---|---|---|---|---|
| Age | 0 | 0 (0) | 0.01 | 0 | 0.00 (.03) | 0.00 (.03) |
| AUDIT-C | 0 | 0 (0) | −0.01 | 0 | 0.00 (.07) | 0.00 (.03) |
| Count of unique drugs used | 0 | 0 (0) | −0.04 | 0 | 0.00 (.04) | 0.00 (.03) |
| Rumination on something bad | 0.39 | 0.37 (100) | 0.45 | 0.37 | 0.51 (.99) | 0.49 (1.0) |
| SF-12 calm and peaceful | −0.21 | −0.24 (100) | −0.26 | −0.24 | −0.03 (.16) | −0.18 (.74) |
| SF-12 sad and blue | 0.44 | 0.34 (100) | 0.35 | 0.34 | 0.44 (1.0) | 0.34 (.99) |
| SF-12 emotional problems | 0.60 | 0.42 (100) | −0.41 | 0.41 | 0.58 (1.0) | 0.43 (1.0) |
| SF-12 energy | 0.40 | −0.38 (100) | −0.41 | −0.38 | −0.57 (1.0) | −0.50 (1.0) |
| GAD-7 anxiety | 0.47 | 0.53 (100) | 0.52 | 0.53 | 0.50 (1.0) | 0.53 (1.0) |
| SR of mental health | −0.15 | −0.20 (100) | −0.21 | −0.20 | −0.17 (.89) | −0.22 (1.0) |
| SR of physical health | −0.12 | −0.05 (100) | −0.07 | −0.05 | −0.01 (.13) | 0.00 (.05) |
| SR of living situation | 0 | −0.01 (78) | −0.04 | −0.01 | 0.00 (.02) | −0.01 (.10) |
| SR of ability to live drug free | 0 | 0 (0) | 0.00 | 0 | 0.00 (.01) | 0.00 (.03) |
| SR of social network | 0 | −0.00 (2) | −0.01 | 0 | 0.00 (.03) | 0.00 (.03) |
| SR of sexual relationships | 0 | 0 (0) | 0.02 | 0 | −0.01 (.13) | 0.00 (.04) |
| Social help | 0 | −0.01 (82) | −0.05 | −0.01 | 0.00 (.04) | 0.00 (.04) |
| Emotional support | 0 | 0 (0) | 0.06 | 0 | 0.00 (.05) | 0.00 (.04) |
| Ability to make new friends | 0 | −0.01 (74) | −0.03 | 0.00 | 0.00 (.06) | 0.00 (.04) |
| Frequency of social media use | 0 | 0.00 (2) | 0.04 | 0 | 0.00 (.04) | 0.00 (.03) |
| Frequency of dating app use | 0 | 0.00 (28) | 0.05 | 0 | 0.00 (.06) | 0.00 (.05) |
| Los Angeles | 0.25 | 0.71 (100) | 0.87 | 0.70 | 0.14 (.27) | 0.81 (.98) |
| Black/African American | −0.02 | −0.10 (100) | −0.14 | −0.09 | −0.02 (.08) | −0.01 (.05) |
| Latinx | 0 | 0 (0) | 0 | 0 | 0.00 (.05) | 0.00 (.03) |
| White | 0 | 0.00 (12) | 0.31 | 0 | 0.00 (.03) | 0.01 (.04) |
| Other race/ethnicity | 0 | 0.00 (2) | 0.16 | 0 | 0.00 (.05) | 0.00 (.03) |
| Female at birth | 0 | 0 (0) | −0.14 | 0 | 0.00 (.04) | 0.00 (.03) |
| Cisgender | −0.58 | −0.42 (100) | −0.59 | −0.41 | −0.62 (.67) | −0.16 (.26) |
| Heterosexual | 0 | 0 (0) | 0.01 | 0 | 0.00 (.04) | 0.00 (.03) |
| Employed | −0.30 | −0.25 (100) | −0.50 | −0.24 | −0.13 (.24) | −0.10 (.23) |
| Income below poverty line | 0 | 0 (0) | 0.00 | 0 | 0.00 (.04) | 0.00 (.03) |
| Has health insurance | 0 | −0.01 (16) | −0.17 | 0 | 0.00 (.04) | −0.01 (.05) |
| Has health care provider | 0 | 0 (0) | −0.06 | 0 | 0.00 (.04) | −0.01 (.05) |
| Medical utilization | 0.38 | 0.40 (100) | 0.66 | 0.39 | 0.63 (.87) | 0.65 (.95) |
| Received ER/urgent care | 0 | 0 (0) | −0.22 | 0 | 0.00 (.04) | 0.00 (.03) |
| Participated in substance abuse program | 0 | −0.04 (66) | −0.43 | −0.00 | 0.00 (.04) | −0.07 (.16) |
| Participated in HIV prevention program | 0 | 0 (0) | −0.04 | 0 | −0.07 (.16) | −0.03 (.09) |
| Ever homeless | 0 | 0 (0) | 0 | 0 | 0.00 (.04) | 0.00 (.04) |
| Ever incarcerated | 0 | 0 (0) | 0.09 | 0 | 0.00 (.04) | 0.00 (.03) |
| Experienced partner violence | 0 | 0 (0) | −0.04 | 0 | 0.00 (.03) | 0.00 (.03) |
| Exchanged sex for money | 0 | 0 (0) | 0.01 | 0 | 0.00 (.04) | 0.01 (.04) |
| Attempted suicide | 0.32 | 0.58 (100) | 0.70 | 0.58 | 0.43 (.62) | 0.84 (.98) |
| Hospitalized for mental health problems | 0 | 0 (0) | −0.02 | 0 | 0.00 (.05) | 0.00 (.03) |
| Sexually abused | 0.42 | 0.22 (100) | 0.38 | 0.22 | 0.07 (.15) | 0.01 (.05) |
| Sex w/ person 5+ years older before 16 | 0 | 0 (0) | −0.08 | 0 | 0.00 (.05) | 0.00 (.03) |
| Ever been robbed | 0.17 | 0.12 (98) | 0.28 | 0.12 | 0.02 (.10) | 0.01 (.06) |
| Seen serious injury or death | 0 | 0 (0) | −0.19 | 0 | 0.00 (.05) | 0.01 (.04) |
| Family member was murdered | 0 | 0.01 (40) | 0.39 | 0 | 0.00 (.04) | 0.04 (.12) |
| Used drugs during last sexual encounter | 0 | 0.00 (6) | −0.21 | 0 | 0.00 (.03) | 0.00 (.03) |
| Ever smoked | 0 | 0.01 (46) | 0.30 | 0 | 0.02 (.05) | 0.02 (.07) |
| # of predictors selected | 16 | 20* | 47 | 20 | 9 | 10 |
Note. Results for all lasso method conditions are reproduced from Table 2 within Gunn et al. (2022). Estimates in bold were selected for a given method.
* Selected with an IF threshold of .50.
Within a strategy, intermediate choices such as the number of imputations also affect the set of predictors chosen. It is reasonable to infer that the predictors selected using the stacked approach will depend on the number of imputations; fewer imputations would have resulted in fewer predictors selected. The predictors selected from the separate approach depend on the choice of IF threshold. If predictors with an IF > 0 are included (i.e., predictors selected in any of the imputations), 29 predictors are selected. A more conservative IF threshold of .50 (i.e., predictors selected in at least half of the imputed training sets) results in 20 variables selected, and 15 variables were selected in all imputed data sets. There is a bias-variance trade-off in the choice of threshold. More conservative IF thresholds will select fewer noise variables at the cost of increased bias, and a threshold of .5 balances this bias-variance trade-off (Thao & Geskus, 2019). Note also that the precision of the IFs depends on the number of imputations.
The in-depth review and step-by-step tutorial of the different approaches by Gunn and colleagues (2022) provide an important foundation for methodological advances related to variable selection with missing data. Their comparison, as well as previous simulation studies (Thao & Geskus, 2019; Zhao & Long, 2016), has not revealed a preferred approach, and as demonstrated by these results, the choice of approach and of intermediate settings has a marked impact on results.
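The dependence of the separate approach on the IF threshold can be made concrete with a small sketch. It assumes the lasso has already been fit in each imputed data set and that the coefficients are available as one dictionary per imputation; the function names and toy values are hypothetical and are not taken from Gunn et al. (2022).

```python
def inclusion_frequencies(coef_by_imputation):
    """Share of imputed data sets in which each predictor has a nonzero coefficient.

    coef_by_imputation: list of dicts mapping predictor name -> lasso coefficient.
    """
    m = len(coef_by_imputation)
    names = coef_by_imputation[0].keys()
    return {
        name: sum(1 for coefs in coef_by_imputation if coefs[name] != 0) / m
        for name in names
    }


def select_by_if(freqs, threshold=0.5):
    """Predictors selected in at least `threshold` of the imputed data sets."""
    return sorted(name for name, f in freqs.items() if f >= threshold)


# Toy coefficients from four imputed data sets (made-up values).
toy = [
    {"x1": 0.4, "x2": 0.0, "x3": 0.1},
    {"x1": 0.5, "x2": 0.0, "x3": 0.0},
    {"x1": 0.3, "x2": 0.2, "x3": 0.0},
    {"x1": 0.4, "x2": 0.0, "x3": 0.2},
]
freqs = inclusion_frequencies(toy)        # x1: 1.0, x2: 0.25, x3: 0.5
selected_half = select_by_if(freqs, 0.5)  # ["x1", "x3"]
selected_all = select_by_if(freqs, 1.0)   # ["x1"]
```

With only four imputations the IFs can take just five distinct values, which illustrates why their precision depends on the number of imputations.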
Results Combining SSVS with MI
We applied SSVS using the SSVS R package (S. A. Bainter et al., 2022) and combined results using the ITS approach. We used the default prior specification and number of MCMC iterations (20,000 with 5,000 discarded as warm-up) to obtain posterior summaries of the MIPs and regression coefficients across 15,000 MCMC iterations. The default prior inclusion probability for each predictor was p = .5, and the priors for the regression coefficients and inverse variance were diffuse and correspond to the equations given above. We ran SSVS for the complete cases (listwise deletion) and in each of the 50 provided imputed data sets for the applied example. The results using listwise deletion and the pooled results across imputations are in the last two columns of Table 1. With listwise deletion, 9 predictors had MIPs above .5, corresponding to the cutoff for the median probability model (Barbieri & Berger, 2004). Comparing estimates for lasso and SSVS using listwise deletion, the estimates for predictors with larger MIPs are generally larger than the corresponding lasso estimates, while predictors with lower MIPs have smaller coefficients than under lasso. These results demonstrate the selective shrinkage of the prior used for SSVS. Using a MIP threshold of .5 results in notably fewer predictors selected using SSVS than using lasso, and the predictors selected with SSVS were a subset of those with nonzero coefficients in the lasso solution.
The pooled SSVS results after MI included 10 predictors with MIPs above .5, and again these selected predictors were a subset of those included in the lasso solutions. Figure 1 shows a plot of the MIPs for each predictor with listwise deletion and MI and shows the range of values across the 50 imputed data sets. This figure also shows which predictors would be selected with a more or less conservative threshold for inclusion. Because the MIPs are influenced by the prior inclusion probability, here chosen as .5, it can be useful to examine this pattern of results graphically rather than relying on an automatic cutoff. These results demonstrate the relatively straightforward ITS procedure for combining MI with SSVS using existing software.
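The pooling step of the ITS procedure can be sketched as follows, assuming SSVS has already been run separately in each imputed data set. The sketch averages the model-averaged coefficients and MIPs across imputations and applies the .5 cutoff; the input layout and function names are hypothetical and do not correspond to the interface of the SSVS R package.

```python
from statistics import mean


def pool_ssvs(results):
    """Pool per-imputation SSVS summaries by averaging across imputations.

    results: list (one entry per imputed data set) of dicts mapping
    predictor name -> (model-averaged coefficient, MIP).
    """
    pooled = {}
    for name in results[0].keys():
        coefs = [r[name][0] for r in results]
        mips = [r[name][1] for r in results]
        pooled[name] = {
            "coef": mean(coefs),
            "mip": mean(mips),
            # Range of MIPs across imputations, useful for plotting.
            "mip_range": (min(mips), max(mips)),
        }
    return pooled


def select_by_mip(pooled, threshold=0.5):
    """Predictors whose pooled MIP exceeds the threshold."""
    return [name for name, p in pooled.items() if p["mip"] > threshold]


# Toy summaries from three imputed data sets (made-up values).
toy = [
    {"rumination": (0.50, 1.00), "age": (0.00, 0.03)},
    {"rumination": (0.48, 0.99), "age": (0.00, 0.02)},
    {"rumination": (0.49, 1.00), "age": (0.01, 0.04)},
]
pooled = pool_ssvs(toy)
selected = select_by_mip(pooled)
```

Keeping the range of MIPs alongside the mean makes it easy to see which predictors sit close to the cutoff in some imputations, which is the kind of pattern a graphical summary is meant to reveal.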
Figure 1.

Marginal inclusion probabilities for each predictor with listwise deletion and MI. Mean estimates are shown as different colors and symbols by each method with 95% interval across 50 imputed data sets.
The solution obtained using SSVS in this example is more conservative in terms of predictors selected than all lasso methods. This solution is also more interpretable for several reasons. Most importantly, as described above, SSVS is consistent for variable selection, making it more appropriate for inference. Many predictors included by lasso methods but not SSVS had trivially small coefficients, within ±.01, but several predictors omitted by SSVS with MI had larger lasso coefficients. For example, the separate lasso and MI-lasso included substantial effects for lower depression in youth who reported cisgender status relative to gender minority status and in those who reported employment. The probability that each predictor should be included is directly estimated from the MIPs, providing added information to aid in interpreting the importance of predictors. SSVS performs more adaptive shrinkage of coefficients, inducing less bias in meaningfully large effects. In this case, Bayesian variable selection simplifies inference and can be straightforwardly combined with MI.
The Bayesian approach relies on its own set of assumptions, most notably the choice of priors. The prior specification for SSVS is explicitly designed for variable selection. Here we assumed the same prior inclusion probability for each predictor. Additionally, the posterior distribution on the parameters is characterized using MCMC estimation which can be more computationally intensive than the optimization and cross-validation used to select a lasso solution. To evaluate the use of MI with SSVS in a larger variety of conditions we performed the following targeted simulation study. In this study we also evaluate and compare the lasso MI approaches.
Simulation Study
Design
Our simulation design was similar to that used by Yang and colleagues (2005). To investigate the performance of SSVS with MI using an impute-then-combine approach, we simulated data for p = 10 explanatory variables, X1 to X10, from a multivariate normal distribution with a compound symmetric correlation matrix with a common off-diagonal correlation ρ. Four predictors, X1, X2, X6, and X7, were true effects, with β1 = β6 = 1 and β2 = β7 = 2 (the True Value column of Table 2), and all other βj = 0. The response variable was simulated from a linear regression model given the predictor variables and residual standard deviation σ. We simulated 500 replications with N = 100 per sample. Thus our population generating model included mild correlation among the predictors and varied the magnitude of true effects.
Missingness was induced under two mechanisms (MCAR and MAR) and two proportions of missingness (40% and 65%). For each condition the first five predictors, X1 to X5, were completely observed and missingness was introduced for X6 to X10 using the ampute() function from the mice package (Buuren & Groothuis-Oudshoorn, 2011). MCAR missingness was induced by randomly deleting values of X6 to X10, for an average of 40% and 65% incomplete cases. For MAR missingness, the ampute() function induces missing data patterns in specified combinations of variables based on values of other variables. Missingness is modeled as a logistic probability distribution applied to weighted sum scores, which are computed from a user-defined weight matrix. The weighted sum scores can be modeled by varying types of distributions to assign larger probabilities of missingness to low, average, high, or extreme (high and low) sum scores. We induced missingness in X6 to X10 based on values of completely observed predictors, which were equally weighted, with higher sum scores corresponding to higher probabilities of missingness. Code including full weight and pattern matrices is available in supplementary materials (S. Bainter, 2026).
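The logic of MAR amputation from weighted sum scores can be illustrated with a simplified sketch. This is not the ampute() algorithm and does not use the weight and pattern matrices from the supplementary materials: it assumes a single missingness pattern (all target columns missing together), a logistic function of the standardized sum score, and an intercept chosen by bisection to reach the target proportion. The function name, arguments, and the choice of which columns drive missingness are all hypothetical.

```python
import math
import random
from statistics import mean, pstdev


def mar_ampute(rows, predictors_of_missingness, targets, prop, weights=None, seed=0):
    """Return a copy of `rows` with the `targets` columns set to None for cases
    drawn with probability increasing in a weighted sum of observed columns."""
    rng = random.Random(seed)
    weights = weights or [1.0] * len(predictors_of_missingness)
    scores = [
        sum(w * row[j] for w, j in zip(weights, predictors_of_missingness))
        for row in rows
    ]
    mu, sd = mean(scores), pstdev(scores)
    if sd == 0:
        raise ValueError("sum scores are constant; missingness would be MCAR")
    z = [(s - mu) / sd for s in scores]

    def avg_prob(intercept):
        return mean(1 / (1 + math.exp(-(intercept + zi))) for zi in z)

    # Bisection on the intercept so the expected share of incomplete cases is `prop`.
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if avg_prob(mid) < prop:
            lo = mid
        else:
            hi = mid
    intercept = (lo + hi) / 2

    amputed = []
    for row, zi in zip(rows, z):
        new_row = list(row)
        # Higher sum scores give a higher probability of missingness.
        if rng.random() < 1 / (1 + math.exp(-(intercept + zi))):
            for j in targets:
                new_row[j] = None
        amputed.append(new_row)
    return amputed
```

Because the probability of missingness depends only on columns that remain fully observed, the resulting mechanism is MAR rather than MNAR; setting all weights to zero is not supported here, since it would reduce the mechanism to MCAR.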
After simulating the complete data and missingness conditions, we first applied SSVS and lasso estimation to the complete datasets and, using complete case analysis (i.e., listwise deletion), to each of the four missing data conditions. We ran SSVS and lasso using the same functions and specifications as in the real data example and averaged results across replications. We then performed MI using the mice R package (Buuren & Groothuis-Oudshoorn, 2011), which implements fully conditional specification, to impute missing values in each replication within each condition. We used the default imputation method for numeric variables in mice, predictive mean matching (PMM). PMM is a semi-parametric method that predicts each missing value and draws the imputation from the observed values closest to that prediction. PMM is considered a good default since it preserves the original data distribution and avoids implausible or out-of-range imputations while remaining robust to model misspecification (Morris et al., 2014). For all conditions, imputations were generated. After multiple imputation, we obtained results for each of the four lasso methods and using SSVS on each imputed dataset. As suggested by Thao and Geskus (2019), we used an IF threshold of .50 for the separate lasso approach. For all methods we examined the averaged coefficients in each condition and rates of inclusion of true effects and null effects. For all coefficients we calculated average bias as the difference between the average estimate and the true value. For true coefficients we also examined average relative bias, a scale-free measure of bias, as the ratio of bias to the true value.
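The bias measures can be written out as a short sketch. It uses the sign convention of the tables (estimate minus true value), so negative values indicate shrinkage toward zero for positive effects; the function name and the example values are made up for illustration.

```python
from statistics import mean


def bias_summary(estimates, true_value):
    """Average estimate, bias (estimate minus true), and relative bias.

    estimates: one coefficient estimate per replication.
    Relative bias is undefined for null effects and is returned as None.
    """
    avg_est = mean(estimates)
    bias = avg_est - true_value
    relative_bias = bias / true_value if true_value != 0 else None
    return {"avg_est": avg_est, "bias": bias, "relative_bias": relative_bias}


# Made-up estimates of a true effect of 1.0 across three replications.
summary = bias_summary([0.9, 1.0, 0.8], true_value=1.0)
```

Here the average estimate is 0.9, so both the bias and the relative bias are approximately −0.1, that is, the effect is underestimated by about 10%.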
Inclusion was defined as a non-zero coefficient for lasso according to each method. For SSVS we pooled model-averaged coefficients and MIPs across imputations and calculated inclusion based on averaged MIP > .5.
Results
SSVS
The simulation results for SSVS with complete data and using listwise deletion are shown in Table 2; the ranges of coefficient estimates are also shown in Figure 2. For complete data, the average MIPs for true effects were .88 and .99 for the medium and large coefficients, respectively. The average MIP for null effects was .14, with an average estimate of .01. The medium effects had a negative relative bias of −.07, and large effects were unbiased on average. For the MCAR conditions using listwise deletion, the average MIP for medium true effects decreased to .73 with 40% missing and .58 with 65% missing. The average MIP for large effects was .99 with 40% missing and .94 with 65% missing. MIPs for the null effects increased as missingness increased, up to .23 in the highest missingness condition. Bias for true effects increased with the proportion of missingness, to −.27 average relative bias for medium effects and −.02 average relative bias for large effects with 65% missingness. As expected, the standard deviation of the estimates increased as missingness increased.
Table 2.
SSVS results averaged across replications with complete data and listwise deletion by missingness pattern
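| Variable | True Value | Complete Data: Avg Est | Complete Data: SD | Complete Data: Est − True | Complete Data: Avg MIP | MCAR 40% Listwise: Avg Est | MCAR 40% Listwise: SD | MCAR 40% Listwise: Est − True | MCAR 40% Listwise: Avg MIP | MCAR 65% Listwise: Avg Est | MCAR 65% Listwise: SD | MCAR 65% Listwise: Est − True | MCAR 65% Listwise: Avg MIP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|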
| X1 | 1.00 | 1.00 | 0.13 | 0.00 | 1.00 | 0.99 | 0.16 | −0.01 | 1.00 | 0.97 | 0.26 | −0.03 | 0.97 |
| X2 | 2.00 | 2.00 | 0.17 | 0.00 | 1.00 | 2.00 | 0.22 | 0.00 | 1.00 | 1.97 | 0.30 | −0.02 | 1.00 |
| X3 | 0.00 | 0.00 | 0.04 | 0.00 | 0.07 | 0.00 | 0.05 | 0.00 | 0.09 | 0.00 | 0.07 | −0.00 | 0.11 |
| X4 | 0.00 | 0.00 | 0.03 | 0.00 | 0.07 | 0.00 | 0.05 | 0.00 | 0.09 | −0.00 | 0.07 | 0.00 | 0.11 |
| X5 | 0.00 | 0.00 | 0.03 | 0.00 | 0.07 | 0.00 | 0.05 | 0.00 | 0.09 | 0.00 | 0.07 | −0.00 | 0.11 |
| X6 | 1.00 | 1.00 | 0.13 | 0.00 | 1.00 | 0.99 | 0.17 | −0.01 | 1.00 | 0.98 | 0.27 | −0.02 | 0.97 |
| X7 | 2.00 | 1.99 | 0.17 | −0.01 | 1.00 | 1.99 | 0.23 | −0.01 | 1.00 | 1.94 | 0.31 | −0.06 | 1.00 |
| X8 | 0.00 | 0.00 | 0.02 | 0.00 | 0.06 | 0.00 | 0.03 | 0.00 | 0.08 | 0.00 | 0.05 | −0.00 | 0.11 |
| X9 | 0.00 | 0.00 | 0.03 | 0.00 | 0.07 | 0.00 | 0.05 | 0.00 | 0.09 | 0.00 | 0.07 | 0.00 | 0.11 |
| X10 | 0.00 | 0.00 | 0.03 | 0.00 | 0.07 | 0.00 | 0.04 | 0.00 | 0.09 | −0.00 | 0.07 | 0.00 | 0.12 |
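| Variable | True Value | Complete Data: Avg Est | Complete Data: SD | Complete Data: Est − True | Complete Data: Avg MIP | MAR 40% Listwise: Avg Est | MAR 40% Listwise: SD | MAR 40% Listwise: Est − True | MAR 40% Listwise: Avg MIP | MAR 65% Listwise: Avg Est | MAR 65% Listwise: SD | MAR 65% Listwise: Est − True | MAR 65% Listwise: Avg MIP |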
| X1 | 1.00 | 1.00 | 0.13 | 0.00 | 1.00 | 0.70 | 0.18 | −0.30 | 1.00 | 0.86 | 0.25 | −0.14 | 1.00 |
| X2 | 2.00 | 2.00 | 0.17 | 0.00 | 1.00 | 2.23 | 0.22 | 0.23 | 1.00 | 2.06 | 0.30 | 0.06 | 1.00 |
| X3 | 0.00 | 0.00 | 0.04 | 0.00 | 0.07 | 0.05 | 0.04 | 0.05 | 0.19 | 0.03 | 0.07 | 0.03 | 0.12 |
| X4 | 0.00 | 0.00 | 0.03 | 0.00 | 0.07 | −0.01 | 0.04 | −0.01 | 0.08 | 0.01 | 0.06 | 0.01 | 0.08 |
| X5 | 0.00 | 0.00 | 0.03 | 0.00 | 0.07 | 0.01 | 0.04 | 0.01 | 0.08 | 0.02 | 0.05 | 0.02 | 0.09 |
| X6 | 1.00 | 1.00 | 0.13 | 0.00 | 1.00 | 0.71 | 0.19 | −0.29 | 1.00 | 1.01 | 0.25 | 0.01 | 1.00 |
| X7 | 2.00 | 1.99 | 0.17 | −0.01 | 1.00 | 2.08 | 0.24 | 0.08 | 1.00 | 2.38 | 0.31 | 0.38 | 1.00 |
| X8 | 0.00 | 0.00 | 0.02 | 0.00 | 0.06 | 0.01 | 0.03 | 0.01 | 0.06 | 0.01 | 0.05 | 0.01 | 0.06 |
| X9 | 0.00 | 0.00 | 0.03 | 0.00 | 0.07 | 0.00 | 0.03 | 0.00 | 0.04 | −0.02 | 0.05 | −0.02 | 0.09 |
| X10 | 0.00 | 0.00 | 0.03 | 0.00 | 0.07 | 0.00 | 0.05 | 0.00 | 0.04 | 0.00 | 0.07 | 0.00 | 0.05 |
Note. MIP = Marginal Inclusion Probability, MCAR = Missingness completely at random, MAR = Missing at random
Figure 2.

Mean estimates from SSVS with 95% intervals across replications for each predictor using listwise deletion for each missingness condition
In the MAR conditions, average MIPs for true effects were smaller than in the MCAR conditions. This downward bias was more pronounced with a higher proportion of missingness. For example, medium and large effects had average MIPs of .72 and .99, respectively, with 40% MAR missingness, compared with .50 and .79 with 65% MAR missingness. The MIPs of null effects increased with the proportion of missingness, to .18 and .28 on average for 40% and 65% missingness, respectively. As can be seen from Figure 2, listwise deletion in the MAR missingness conditions resulted in substantial negative bias in true effects. There was also some positive bias in true effects, .06 on average in the most extreme MAR condition. The variability in the estimates was generally larger for the MAR conditions, especially at higher proportions of missingness.
Results after imputation and pooling across imputations are shown in Table 3 and Figure 3. After imputation there was minimal bias in true or null effects across missingness types, with slightly higher variability in estimates in the MAR conditions. Higher missingness was associated with marginally higher MIPs, and this was slightly more pronounced for the predictors that had been MAR. All conditions showed relative bias in true effects below 10%.
Table 3.
SSVS results averaged across imputations and replications by missingness pattern, compared with complete data results
| Variable | True Value | Complete Data: Avg Est | Complete Data: SD | Complete Data: Est − True | Complete Data: Avg MIP | MCAR 40% MI: Avg Est | MCAR 40% MI: SD | MCAR 40% MI: Est − True | MCAR 40% MI: Avg MIP | MCAR 65% MI: Avg Est | MCAR 65% MI: SD | MCAR 65% MI: Est − True | MCAR 65% MI: Avg MIP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 1.00 | 1.00 | 0.13 | 0.00 | 1.00 | 1.00 | 0.13 | 0.00 | 1.00 | 1.00 | 0.14 | 0.00 | 1.00 |
| X2 | 2.00 | 2.00 | 0.17 | 0.00 | 1.00 | 1.99 | 0.18 | −0.01 | 1.00 | 1.99 | 0.18 | −0.01 | 1.00 |
| X3 | 0.00 | 0.00 | 0.04 | 0.00 | 0.07 | 0.00 | 0.04 | 0.00 | 0.09 | 0.00 | 0.05 | 0.00 | 0.12 |
| X4 | 0.00 | 0.00 | 0.03 | 0.00 | 0.07 | 0.00 | 0.04 | 0.00 | 0.09 | 0.00 | 0.06 | 0.00 | 0.13 |
| X5 | 0.00 | 0.00 | 0.03 | 0.00 | 0.07 | 0.00 | 0.05 | 0.00 | 0.10 | 0.00 | 0.06 | 0.00 | 0.12 |
| X6 | 1.00 | 1.00 | 0.13 | 0.00 | 1.00 | 0.99 | 0.15 | −0.01 | 1.00 | 0.99 | 0.16 | −0.01 | 1.00 |
| X7 | 2.00 | 1.99 | 0.17 | −0.01 | 1.00 | 1.97 | 0.18 | −0.03 | 1.00 | 1.98 | 0.19 | −0.02 | 1.00 |
| X8 | 0.00 | 0.00 | 0.02 | 0.00 | 0.06 | 0.00 | 0.03 | 0.00 | 0.08 | 0.00 | 0.05 | 0.00 | 0.11 |
| X9 | 0.00 | 0.00 | 0.03 | 0.00 | 0.07 | 0.00 | 0.04 | 0.00 | 0.09 | 0.00 | 0.06 | 0.00 | 0.12 |
| X10 | 0.00 | 0.00 | 0.03 | 0.00 | 0.07 | 0.00 | 0.04 | 0.00 | 0.09 | 0.00 | 0.06 | 0.00 | 0.13 |
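| Variable | True Value | Complete Data: Avg Est | Complete Data: SD | Complete Data: Est − True | Complete Data: Avg MIP | MAR 40% MI: Avg Est | MAR 40% MI: SD | MAR 40% MI: Est − True | MAR 40% MI: Avg MIP | MAR 65% MI: Avg Est | MAR 65% MI: SD | MAR 65% MI: Est − True | MAR 65% MI: Avg MIP |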
| X1 | 1.00 | 1.00 | 0.13 | 0.00 | 1.00 | 1.00 | 0.14 | 0.00 | 1.00 | 1.00 | 0.14 | 0.00 | 1.00 |
| X2 | 2.00 | 2.00 | 0.17 | 0.00 | 1.00 | 1.99 | 0.18 | −0.01 | 1.00 | 1.99 | 0.18 | −0.01 | 1.00 |
| X3 | 0.00 | 0.00 | 0.04 | 0.00 | 0.07 | 0.00 | 0.04 | 0.00 | 0.09 | 0.00 | 0.06 | 0.00 | 0.12 |
| X4 | 0.00 | 0.00 | 0.03 | 0.00 | 0.07 | 0.00 | 0.04 | 0.00 | 0.09 | 0.00 | 0.05 | 0.00 | 0.11 |
| X5 | 0.00 | 0.00 | 0.03 | 0.00 | 0.07 | 0.00 | 0.05 | 0.00 | 0.10 | 0.00 | 0.05 | 0.00 | 0.11 |
| X6 | 1.00 | 1.00 | 0.13 | 0.00 | 1.00 | 0.99 | 0.15 | −0.01 | 1.00 | 0.99 | 0.16 | −0.01 | 1.00 |
| X7 | 2.00 | 1.99 | 0.17 | −0.01 | 1.00 | 1.97 | 0.18 | −0.03 | 1.00 | 1.97 | 0.19 | −0.03 | 1.00 |
| X8 | 0.00 | 0.00 | 0.02 | 0.00 | 0.06 | 0.00 | 0.05 | 0.00 | 0.11 | 0.00 | 0.06 | 0.00 | 0.14 |
| X9 | 0.00 | 0.00 | 0.03 | 0.00 | 0.07 | 0.00 | 0.05 | 0.00 | 0.10 | 0.00 | 0.07 | 0.00 | 0.14 |
| X10 | 0.00 | 0.00 | 0.03 | 0.00 | 0.07 | 0.00 | 0.04 | 0.00 | 0.11 | 0.00 | 0.07 | 0.00 | 0.16 |
Note. MIP = Marginal Inclusion Probability, MCAR = Missingness completely at random, MAR = Missing at random. Complete data results are repeated from Table 2 for comparison.
Figure 3.

Mean estimates from SSVS for MI data with 95% intervals across replications for each predictor for each missingness condition
LASSO
The lasso estimates for complete data and using listwise deletion are shown in Figure 4. With complete data, medium and large true effects were negatively biased (average of 16% and 7% relative bias, respectively) and null effect estimates were small, .03 on average. Proportions of nonzero coefficients for all lasso conditions are shown in Figure 5. With complete data nearly 100% of true effects were nonzero; note, however, that 47% of null effects were also nonzero.
Figure 4.

Mean estimates from lasso with 95% intervals across replications for each coefficient with complete data and after listwise deletion for each condition
Figure 5.

Proportions of nonzero estimates from lasso across replications and missingness conditions using listwise deletion and each lasso imputation approach.
After listwise deletion, negative bias in true coefficients increased as the proportion of missingness increased. For medium effects, relative bias was −.21 with 40% missing and −.30 with 65% missing. There was some additional bias in true effects associated with MAR versus MCAR missingness, especially for large coefficients. Variability in the estimates for all predictors increased. For example, null effects had an average standard deviation of .29 with 40% missing and .41 with 65% missing. Across missingness conditions, the proportions of nonzero coefficients were between .81 and 1.0 for true effects and between .39 and .50 for null effects. The average coefficients for null effects were .04 with 40% missing and .06 with 65% missing.
Results after imputation using the separate method are shown in Figure 6. Using the separate lasso imputation method and an IF threshold of .50, the proportion of nonzero coefficients for null effects increased to between 59% and 75%. Relative bias was lower than with listwise deletion and did not differ across missingness conditions. The estimates for the MI-lasso (not pictured) were nearly identical to the estimates from the separate approach in Figure 6. However, as shown in Figure 5, the rates of nonzero null effect estimates were higher than for the separate approach, especially with the highest proportion of MAR missingness. Estimates using the stacked imputation approach were unbiased compared to the complete data results (supplementary Figure), but as shown in Figure 5, this decrease in bias was accompanied by large increases in the proportions of nonzero null effect estimates.
Figure 6.

Mean estimates using the separate imputation lasso approach with 95% intervals across replications for each variable and for each missingness condition
Comparison with Additional Regularization Methods
To contextualize the performance of SSVS and lasso relative to other contemporary regularization approaches, we additionally compared results for the elastic net (Zou & Hastie, 2005) and the SCAD penalty (Fan & Li, 2001). These methods are widely used alternatives that address some of lasso’s known limitations, such as elastic net’s handling of correlated predictors and SCAD’s reduction of bias in large coefficient estimates. Gunn et al. (2022) noted that the complications they identified when combining lasso with multiple imputation (in particular the integration of cross-validation with imputation) generalize to other regularization approaches that rely on cross-validation for tuning. The comparisons presented here serve to demonstrate empirically what Gunn et al. argued conceptually: that these challenges extend beyond lasso to other penalized regression methods. While SSVS has been compared favorably to several variations of lasso in previous work (S. A. Bainter et al., 2023), comparing performance across multiple penalized methods with missing data provides a more comprehensive evaluation and highlights the advantages of approaches that do not require cross-validation.
We fit elastic net and applied SCAD using the glmnet (Friedman et al., 2021) and ncvreg (Breheny et al., 2009) packages in R, respectively. For elastic net, the mixing parameter α (balancing L1 and L2 penalties) was tuned via cross-validation along with the regularization parameter λ. SCAD uses a penalty function that reduces bias for large coefficients while maintaining the variable selection properties of lasso. We applied both elastic net and SCAD regularization to the complete data and to all missing data conditions using the methods described previously for lasso: listwise deletion using complete cases, and the separate and stacked imputation approaches. Note that this comparison does not include a grouped selection procedure for SCAD and elastic net analogous to MI-LASSO.
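For reference, the two penalties can be written out explicitly. The forms below are the standard parameterizations (the elastic net as parameterized in glmnet, and the SCAD penalty of Fan and Li, 2001); the symbols α, λ, and a are notation assumed here, with a > 2 denoting SCAD's additional shape parameter.

```latex
\begin{aligned}
P_{\text{enet}}(\beta) &= \lambda \sum_{j=1}^{p}
  \left[ \alpha\,|\beta_j| + \tfrac{1-\alpha}{2}\,\beta_j^{2} \right],
  \qquad 0 \le \alpha \le 1, \\[4pt]
p_{\text{SCAD}}(\beta_j) &=
\begin{cases}
\lambda\,|\beta_j| & \text{if } |\beta_j| \le \lambda, \\[4pt]
\dfrac{2a\lambda\,|\beta_j| - \beta_j^{2} - \lambda^{2}}{2(a-1)} & \text{if } \lambda < |\beta_j| \le a\lambda, \\[8pt]
\dfrac{(a+1)\,\lambda^{2}}{2} & \text{if } |\beta_j| > a\lambda.
\end{cases}
\end{aligned}
```

With α = 1 the elastic net reduces to the lasso penalty. The SCAD penalty matches the lasso penalty for small coefficients but is constant beyond aλ, so large coefficients receive no additional shrinkage.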
Results for elastic net and SCAD with complete data and using listwise deletion are available in the supplemental materials. Compared to lasso, the SCAD penalty resulted in decreased negative bias for the true effects across all conditions. For large effects SCAD was nearly unbiased (less than 5% relative bias), except in the most extreme 65% missing conditions. With complete data, relative bias for medium effects was −11% using SCAD, compared with −15% for lasso. This general pattern of less negative bias with SCAD relative to lasso was also observed after listwise deletion across missingness conditions. Given the modest correlation among the predictors, results for elastic net aligned closely with the lasso results. Negative bias in true coefficients in these conditions was slightly attenuated when comparing elastic net with lasso, but these differences were negligible. The elastic net also resulted in slightly higher rates of nonzero null effect estimates, about 55%, compared with less than 50% for lasso in the complete data. To summarize, nearly 100% of true effects were selected using SCAD and elastic net, as with lasso, and the rates of nonzero null effect estimates were lower for SCAD (33%) and slightly higher for elastic net (55%) compared with lasso (45%). However, across complete data and listwise deletion conditions, SSVS resulted in less biased estimates compared with lasso, elastic net, and SCAD, and substantially lower rates of inclusion for null effects.
Results after multiple imputation using the separate and stacked methods with elastic net and SCAD closely mirror and reinforce the patterns observed for lasso. To illustrate this pattern, coefficient estimates are shown for the 40% MAR condition in Supplemental Figure 1. For listwise deletion, SCAD results in less biased estimates relative to elastic net and lasso, especially for the large coefficients. The separate imputation approach results in estimates similar to listwise deletion, and the stacked imputation approach, by artificially inflating sample size, results in almost identical results across methods.
Discussion & Conclusion
In this paper we have compared how lasso and SSVS approaches can be combined with multiple imputation to perform variable selection with incomplete data. With real data and simulated data we compare the procedures involved and properties of the solutions. This paper addresses an important need identified by (Gunn et al., 2022) to identify best practices for applying variable selection with missing data. As missing data are widespread in social science research, and variable selection approaches such as lasso are becoming increasingly prevalent, this contribution is timely.
In the real data example, SSVS provided a more conservative solution compared to all lasso approaches. However, in contrast to lasso approaches to combine lasso, MI, and cross-validation, the ITS approach for applying SSVS with missing data is straightforward. Our simulation results shed light on the overall performance and properties of each method. With complete data, there was negative bias for true effects using both lasso and SSVS. However, the bias was more pronounced for the lasso, and bias for SSVS was moderate for medium effects and unbiased for large effect estimates. With complete data, nearly half of null effects had nonzero estimates using lasso, compared to less than 5% of null effects selected using SSVS (i.e. with MIPs above a threshold = .5). Our simulation results showed that SSVS combined with MI effectively mitigated bias due to missing data in these conditions.
Among lasso approaches, our simulation results indicated that the separate imputation approach yielded results most similar to those from the complete data. However, all lasso imputation approaches yielded higher proportions of nonzero coefficients for null effects compared to the complete data or listwise deletion. In terms of estimates, the separate and MI-lasso approaches yielded almost identical results. The stacked imputation approach effectively inflates the sample size by the number of imputations, which served to counteract the shrinkage bias but resulted in the inclusion of virtually all predictors. Altogether, our results further highlight that the lasso solution is ill-suited for interpretation in terms of variable selection. The lasso solution is over-penalized compared to SSVS and includes a high proportion of nonzero coefficients for null effects. As shown in our secondary comparisons with elastic net and SCAD, SSVS also performed favorably in terms of coefficient bias and selection of true and null effects, even relative to these methods, which were developed to overcome identified limitations of lasso.
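A textbook special case, which is not part of our simulation design, may help explain why lasso both shrinks true effects and retains null effects. Assuming an orthonormal design (X'X = I) and the objective shown below, the lasso estimate is a soft-thresholded version of the OLS estimate. The symbols here are introduced only for this illustration.

```latex
% Assumed setup for illustration only: orthonormal design, X^\top X = I
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta}\; \tfrac{1}{2}\lVert y - X\beta \rVert_2^2
    + \lambda \lVert \beta \rVert_1
\quad\Longrightarrow\quad
\hat{\beta}_j^{\text{lasso}}
  = \operatorname{sign}\bigl(\hat{\beta}_j^{\text{OLS}}\bigr)
    \bigl(\lvert \hat{\beta}_j^{\text{OLS}} \rvert - \lambda\bigr)_{+}
```

In this special case every retained coefficient is pulled toward zero by the same amount, λ, so a single penalty must trade off bias in true effects against exclusion of null effects: a λ small enough to limit shrinkage also leaves nonzero any null-effect estimate whose OLS magnitude exceeds λ.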
The approaches we compared use available R code and packages. A fully Bayesian approach to SSVS with missing data could be implemented to simultaneously impute and select (SIAS) within the same Gibbs sampling process, but because SIAS is not currently available in software, we do not directly compare the ITS and SIAS approaches in this paper. In their paper introducing both approaches, Yang and colleagues (2005) found that ITS and SIAS provided consistent variable selection results, but that SIAS had slightly smaller Monte Carlo standard errors. However, they evaluated ITS with only 5 imputations, and the Monte Carlo standard errors would be expected to decrease with more imputations. Besides MI approaches, alternative approaches for variable selection with missing data have also been proposed (e.g., Garcia et al., 2010; Jiang et al., 2015; Sabbe et al., 2013), but these are also not currently implemented in software for general use.
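The expectation that Monte Carlo error shrinks with more imputations follows from the standard multiple imputation combining rules (Rubin, 2004). The sketch below uses generic notation: Q-hat is any quantity estimated in each of M imputed datasets, W-bar is the average within-imputation variance, and B is the between-imputation variance. Applying the same reasoning to pooled MIPs is our assumption rather than a result from Yang and colleagues (2005).

```latex
\bar{Q}_M = \frac{1}{M}\sum_{m=1}^{M} \hat{Q}_m, \qquad
B = \frac{1}{M-1}\sum_{m=1}^{M}\bigl(\hat{Q}_m - \bar{Q}_M\bigr)^2, \qquad
T = \bar{W} + \Bigl(1 + \frac{1}{M}\Bigr) B
% The B/M term reflects simulation error from using a finite number of
% imputations; the corresponding Monte Carlo standard error is
\mathrm{SE}_{\text{MC}}\bigl(\bar{Q}_M\bigr) \approx \sqrt{B / M}
```

Under this approximation, moving from 5 to 20 imputations (an illustrative choice) would halve the Monte Carlo standard error, since the square root of 5/20 is 0.5.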
A relative advantage of multiple imputation is the ability to make the MAR assumption more plausible by including relevant predictors in the imputation model. For variable selection, all candidate predictors should be included in both the imputation stage and the subsequent variable selection model. However, it is possible that bias is introduced through imputation, as imputation is itself model-based. Based on our limited simulation results, we believe combining SSVS with MI is justifiable to avoid listwise deletion.
There are additional extensions to these methods, in both the Bayesian and penalized likelihood frameworks, which we did not include in our comparison. As shown in our secondary comparisons section, our results generalize to other procedures based on cross-validation, in that an ITS approach is comparatively simple. Our simulation was also necessarily limited; for example, we used default prior specifications and did not consider alternative sample sizes, numbers of predictors, or degrees of sparsity in the true model. We can expect lasso to perform worse with a higher degree of correlation among predictors (Bainter et al., 2023), and all methods should improve with larger sample sizes. We also did not consider the much more challenging case of non-ignorable missingness (MNAR). Further work in this area is needed for extensions to more complicated models, including interaction or nonlinear effects, models for categorical outcomes, and models for nested data.
Supplementary Material
Footnotes
We have no known conflicts of interest to disclose.
Portions of the analyses and results in this article were presented at the International Meeting of the Psychometric Society in Bologna, Italy (July, 2022).
References
- Bainter S (2026). How to Apply Bayesian Variable Stochastic Search Variable Selection with Multiple Imputed Data. 10.17605/OSF.IO/ZAY2T
- Bainter SA, McCauley T, Fahmy M, & Attali D (2022, May). SSVS: Functions for Stochastic Search Variable Selection (SSVS). Retrieved December 20, 2022, from https://CRAN.R-project.org/package=SSVS
- Bainter SA, McCauley TG, Fahmy MM, Goodman ZT, Kupis LB, & Rao JS (2023). Comparing Bayesian Variable Selection to Lasso Approaches for Applications in Psychology. Psychometrika, 88(3), 1032–1055. 10.1007/s11336-023-09914-9
- Barbieri MM, & Berger JO (2004). Optimal predictive model selection. Annals of Statistics, 32(3), 870–897. 10.1214/009053604000000238
- Breheny P, Miller R, & Harris L (2009, November). Ncvreg: Regularization Paths for SCAD and MCP Penalized Regression Models. Comprehensive R Archive Network, version 3.16.0. 10.32614/CRAN.package.ncvreg
- Brooks SP, & Gelman A (1998). General Methods for Monitoring Convergence of Iterative Simulations. Journal of Computational and Graphical Statistics, 7(4), 434–455. 10.1080/10618600.1998.10474787
- Buuren S. v., & Groothuis-Oudshoorn K (2011). Mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45, 1–67. 10.18637/jss.v045.i03
- Carvalho CM, Polson NG, & Scott JG (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2), 465–480. 10.1093/biomet/asq017
- Chen EE, & Wojcik SP (2016). A practical guide to big data research in psychology. Psychological Methods, 21(4), 458–474. 10.1037/met0000111
- Chen Q, & Wang S (2013). Variable selection for multiply-imputed data with application to dioxin exposure study. Statistics in Medicine, 32(21), 3646–3659. 10.1002/sim.5783
- Dennerill RD (1964). Prediction of unilateral brain dysfunction using Wechsler test scores. Journal of Consulting Psychology, 28(3), 278–284. 10.1037/h0045831
- Enders CK (2006). A Primer on the Use of Modern Missing-Data Methods in Psychosomatic Medicine Research. Psychosomatic Medicine, 68(3), 427–436. 10.1097/01.psy.0000221275.75056.d8
- Fan J, & Li R (2001). Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. Journal of the American Statistical Association, 96(456), 1348–1360. 10.1198/016214501753382273
- Forte A, Garcia-Donato G, & Steel M (2018). Methods and Tools for Bayesian Variable Selection and Model Averaging in Normal Linear Regression. International Statistical Review, 86(2), 237–258. 10.1111/insr.12249
- Friedman J, Hastie T, Tibshirani R, Narasimhan B, Tay K, Simon N, & Qian J (2021, June). Glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models. Retrieved July 21, 2021, from https://CRAN.R-project.org/package=glmnet
- Garcia RI, Ibrahim JG, & Zhu H (2010). Variable Selection for Regression Models with Missing Data. Statistica Sinica, 20(1), 149–165. Retrieved March 4, 2024, from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2844735/
- Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, & Rubin DB (2013, November). Bayesian Data Analysis (3rd ed.). Chapman and Hall/CRC. 10.1201/b16018
- Gunn HJ, Hayati Rezvan P, Fernández MI, & Comulada WS (2022). How to apply variable selection machine learning algorithms with multiply imputed data: A missing discussion. Psychological Methods. 10.1037/met0000478
- Henderson DA, & Denison DR (1989). Stepwise Regression in Social and Psychological Research. Psychological Reports, 64(1), 251–257. 10.2466/pr0.1989.64.1.251
- Jiang J, Nguyen T, & Rao JS (2015). The E-MS Algorithm: Model Selection With Incomplete Data. Journal of the American Statistical Association, 110(511), 1136–1147. 10.1080/01621459.2014.948545
- Kroenke K, Spitzer RL, & Williams JBW (2001). The PHQ-9: Validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9), 606–613. 10.1046/j.1525-1497.2001.016009606.x
- Long Q, & Johnson BA (2015). Variable selection in the presence of missing data: Resampling and imputation. Biostatistics, 16(3), 596–610. 10.1093/biostatistics/kxv003
- McNeish DM (2015). Using Lasso for Predictor Selection and to Assuage Overfitting: A Method Long Overlooked in Behavioral Sciences. Multivariate Behavioral Research, 50(5), 471–484. 10.1080/00273171.2015.1036965
- Mitchell TJ, & Beauchamp JJ (1988). Bayesian Variable Selection in Linear Regression. Journal of the American Statistical Association, 83(404), 1023–1032. 10.1080/01621459.1988.10478694
- Morris TP, White IR, & Royston P (2014). Tuning multiple imputation by predictive mean matching and local residual draws. BMC Medical Research Methodology, 14(1), 75. 10.1186/1471-2288-14-75
- O’Hara RB, & Sillanpää MJ (2009). A review of Bayesian variable selection methods: What, how and which. Bayesian Analysis, 4(1), 85–117. 10.1214/09-BA403
- Park T, & Casella G (2008). The Bayesian Lasso. Journal of the American Statistical Association, 103(482), 681–686. 10.1198/016214508000000337
- Piironen J, & Vehtari A (2017). Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27(3), 711–735. 10.1007/s11222-016-9649-y
- Polson NG, & Scott JG (2011, October). Shrink Globally, Act Locally: Sparse Bayesian Regularization and Prediction. In Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM, & West M (Eds.), Bayesian Statistics 9 (pp. 501–538). Oxford University Press. 10.1093/acprof:oso/9780199694587.003.0017
- Porwal A, & Raftery AE (2022). Comparing methods for statistical inference with model uncertainty. Proceedings of the National Academy of Sciences, 119(16), e2120737119. 10.1073/pnas.2120737119
- Rubin DB (1976). Inference and Missing Data. Biometrika, 63(3), 581–593.
- Rubin DB (2004). Multiple imputation for nonresponse in surveys. Wiley-Interscience.
- Sabbe N, Thas O, & Ottoy J-P (2013). EMLasso: Logistic lasso with missing data. Statistics in Medicine, 32(18), 3143–3157. 10.1002/sim.5760
- Schafer JL, & Graham JW (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177.
- Sengstake CB (1965). Perception of deviations in repetitive patterns. Journal of Experimental Psychology, 70(2), 210. 10.1037/h0022203
- Swendeman D, Arnold EM, Harris D, Fournier J, Comulada WS, Reback C, Koussa M, Ocasio M, Lee S-J, Kozina L, Fernández MI, Rotheram MJ, & Adolescent Medicine Trials Network (ATN) CARES Team. (2019). Text-Messaging, Online Peer Support Group, and Coaching Strategies to Optimize the HIV Prevention Continuum for Youth: Protocol for a Randomized Controlled Trial. JMIR Research Protocols, 8(8), e11165. 10.2196/11165
- Takada M, Fujisawa H, & Nishikawa T (2019). HMLasso: Lasso with High Missing Rate. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 3541–3547. 10.24963/ijcai.2019/491
- Thao LTP, & Geskus R (2019). A comparison of model selection methods for prediction in the presence of multiply imputed data. Biometrical Journal. Biometrische Zeitschrift, 61(2), 343–356. 10.1002/bimj.201700232
- Tibshirani R (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 267–288. Retrieved December 24, 2018, from https://www.jstor.org/stable/2346178
- Tibshirani R (2011). Regression shrinkage and selection via the lasso: A retrospective. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 73(3), 273–282. Retrieved October 13, 2022, from http://www.jstor.org/stable/41262671
- Tibshirani R, Johnstone I, Hastie T, & Efron B (2004). Least angle regression. The Annals of Statistics, 32(2), 407–499. 10.1214/009053604000000067
- Van Buuren S, Brand JP, Groothuis-Oudshoorn CG, & Rubin DB (2006). Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation, 76(12), 1049–1064. 10.1080/10629360600810434
- van Erp S, Oberski DL, & Mulder J (2019). Shrinkage priors for Bayesian penalized regression. Journal of Mathematical Psychology, 89, 31–50. 10.1016/j.jmp.2018.12.004
- Von Hippel PT (2020). How Many Imputations Do You Need? A Two-stage Calculation Using a Quadratic Rule. Sociological Methods & Research, 49(3), 699–718. 10.1177/0049124117747303
- Yang X, Belin TR, & Boscardin WJ (2005). Imputation and Variable Selection in Linear Regression Models with Missing Covariates. Biometrics, 61(2), 498–506. Retrieved December 1, 2021, from http://www.jstor.org/stable/3695970
- Yarkoni T, & Westfall J (2017). Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning. Perspectives on Psychological Science, 12(6), 1100–1122. 10.1177/1745691617693393
- Yuan M, & Lin Y (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B-Statistical Methodology, 68, 49–67. 10.1111/j.1467-9868.2005.00532.x
- Zhang Y, & Kim S (2024). Variable selection for high-dimensional incomplete data using horseshoe estimation with data augmentation. Communications in Statistics - Theory and Methods, 53(12), 4235–4251. 10.1080/03610926.2023.2177107
- Zhao Y, & Long Q (2016). Multiple imputation in the presence of high-dimensional data. Statistical Methods in Medical Research, 25(5), 2021–2035. 10.1177/0962280213511027
- Zou H (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association, 101(476), 1418–1429. 10.1198/016214506000000735
- Zou H, & Hastie T (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320. 10.1111/j.1467-9868.2005.00503.x
- Zou H, Hastie T, & Tibshirani R (2007). On the “degrees of freedom” of the lasso. The Annals of Statistics, 35(5), 2173–2192. 10.1214/009053607000000127
