Analyzing evidence-based falls prevention data with significant missing information using variable selection after multiple imputation

Yujia Cheng; Yang Li; Matthew Lee Smith; Changwei Li; Ye Shen

doi:10.1080/02664763.2021.1985090

. 2021 Oct 7;50(3):724–743. doi: 10.1080/02664763.2021.1985090


PMCID: PMC9930815 PMID: 36819083

Abstract

Falls are the leading cause of fatal and non-fatal injuries among older adults. Evidence-based fall prevention programs are delivered nationwide, largely supported by funding from the Administration for Community Living (ACL), to mitigate fall-related risk. This study utilizes data from 39 ACL grantees in 22 states from 2014 to 2017. The large amount of missing values for falls efficacy in this national database may lead to potentially biased statistical results and make it challenging to implement reliable variable selection. Multiple imputation is used to deal with missing values. To obtain a consistent result of variable selection in multiply-imputed datasets, multiple imputation-stepwise regression (MI-stepwise) and multiple imputation-least absolute shrinkage and selection operator (MI-LASSO) methods are used. To compare the performances of MI-stepwise and MI-LASSO, simulation studies were conducted. In particular, we extended prior work by considering several circumstances not covered in previous studies, including an extensive investigation of data with different signal-to-noise ratios and various missing data patterns across predictors, as well as a data structure that allowed the missingness mechanism to be missing not at random (MNAR). In addition, we evaluated the performance of MI-LASSO method with varying tuning parameters to address the overselection issue in cross-validation (CV)-based LASSO.

Keywords: Multiple imputation, variable selection, stepwise regression, group LASSO penalty, Rubin's rules, data simulation, fall prevention, falls efficacy

1. Background

1.1. Purpose of study

The study is motivated by the practical need to assess missing data from a national dissemination of evidence-based fall prevention programs [30]. While the goal of this initiative was to collect evaluation data capable of identifying participants' benefits from attending various fall prevention programs, there is a large amount of missing data concentrated in post-intervention survey information. Substantial missing values in datasets are likely to affect analysis results. Meanwhile, variable selection is usually performed only on the complete cases, which can be inefficient when a substantial amount of data is missing and can give biased selection results unless the data are missing completely at random (MCAR) [28]. To address this missing value problem, the data were first imputed using multiple imputation (MI). Then, to obtain a consistent result from variable selection on multiply-imputed data, two different variable selection methods designed for use after MI, MI-stepwise [37] and MI-LASSO [4], were performed. These two methods were applied to predict the improvement of participants' falls efficacy, assuming MAR. The current study focuses on falls efficacy (i.e. the perceived ability to prevent and manage a fall [29]) because it has been identified as a significant predictor of physical activity among older adults who participated in evidence-based fall prevention programs [6,33]. In order to compare and evaluate the two variable selection methods, we conducted simulation studies to extend prior work under several circumstances not previously covered, including an extensive investigation of data with different signal-to-noise ratios and various missing data patterns across predictors, as well as a data structure that allowed the missingness mechanism to be missing not at random (MNAR). In addition, we evaluated the performance of the MI-LASSO method with varying tuning parameters to address the overselection issue in cross-validation (CV)-based LASSO.

1.2. Techniques for missing data

Currently, the default solutions for missing data within most statistical packages are listwise and pairwise deletion. Another option to treat missing values is to drop variables with a large proportion of missing data before analysis, which is only effective if the dropped variables are not relevant to the study purposes. Imputation is often a preferred choice over dropping variables. Multiple imputation [25] replaces each missing entry with two or more values based on the distribution of possibilities of the imputed variable, resulting in two or more complete datasets. It can provide unbiased statistical results given a correctly specified imputation model [13,39] and give parameter estimates and standard errors that consider the uncertainty due to the missing data values [28]. Multiple imputation works well with data under an ignorable missing mechanism, which includes missing completely at random (MCAR) and missing at random (MAR), but may give erroneous results under a missing not at random (MNAR) mechanism. While there are other methods to analyze data with missingness without imputing missing data, the Multiple Imputation method is more computationally straightforward in that it imputes several datasets and uses the same datasets for different types of analyses using synthetic approaches.

1.3. Variable selection

There are many ways to conduct variable selection when data are complete. The subset selection method tests all variable combinations and chooses the best model based on a specific criterion, such as the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), adjusted $R^{2}$ , Mallows's Cp, or mean squared error. Stepwise regression, a combination of forward selection and backward elimination proposed by Efroymson [10], is a more efficient approach. At each step, it adds or removes a predictor based on a pre-specified criterion, such as the significance level of the parameter estimates. These methods are more practical for datasets with a large sample size and a relatively small number of candidate predictors. When the number of candidate predictors is large, and especially when the number of observations is smaller than the number of candidate predictors, variable selection methods via penalized likelihood can be considered. Penalties are divided into categories; one is the K-Smallest Items (KSI) penalties family, which includes the least absolute shrinkage and selection operator (LASSO), the Self-adaptive penalty, and the Log-Exp-Sum penalty. The LASSO was introduced by Robert Tibshirani [32] in 1996 to improve regression models in terms of prediction accuracy and interpretability. It performs variable selection and regularization by minimizing the residual sum of squares subject to a constraint on the sum of the absolute values of the coefficients. Bayesian variable selection strategies are also popular in many applications [14].

For multiply-imputed datasets, suitable variable selection methods should be applied after multiple imputation. However, if variable selection methods are directly applied to each imputed dataset separately, the selection result may not be consistent across the multiple datasets generated by imputation, making it difficult to draw scientific conclusions about the model and parameter estimates. Several approaches for variable selection in multiply-imputed data have been proposed in the literature. According to Heymans et al. [15], one can conduct variable selection in each imputed dataset separately and pick common predictors for a single model based on selection results from each imputed dataset. Wood et al. [37] proposed a backward stepwise selection under a weighted regression applied on an integral dataset, which was obtained by stacking k multiply-imputed datasets. They also provided a stepwise variable selection method for multiply-imputed data by repeated use of Rubin's rules [9,25], the MI-stepwise method. In this paper, the MI-stepwise method is adopted, in which each selection step is based on the combined P-value obtained by Rubin's rules. The MI-LASSO method proposed by Chen and Wang [4] is also utilized in this study; it combines the estimated coefficients for each variable in the k imputed datasets into a group LASSO penalty and adds or removes the whole group simultaneously.

It is worth discussing the criteria or methods for model comparison. For LASSO in particular, the model selection problem is actually a search for the optimal tuning parameter. Cross-validation [32] or information criteria are commonly used to choose the tuning parameter λ in LASSO. AIC and BIC are based on minimizing the deviance. They differ only in the coefficient multiplying the number of parameters, and thus in how strongly they penalize large models. AIC and BIC are easy to calculate but are not applicable in high-dimensional data cases. Generally, models chosen by BIC will be more parsimonious than models chosen by AIC [17]. The MI-LASSO method used in this paper uses BIC to select the best model.
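To make the difference in penalty strength concrete, the short sketch below uses the common deviance-based forms, AIC = deviance + 2k and BIC = deviance + k log(n), where k is the number of estimated parameters. The function names and the two example models are hypothetical and are not taken from the paper.

```python
import math

def aic(deviance, k):
    """Deviance plus a penalty of 2 per estimated parameter."""
    return deviance + 2 * k

def bic(deviance, k, n):
    """Deviance plus a penalty of log(n) per estimated parameter."""
    return deviance + math.log(n) * k

# Two hypothetical models fitted to n = 200 observations:
# the larger one lowers the deviance by 10 at the cost of 4 extra parameters.
n = 200
small = {"deviance": 410.0, "k": 4}
large = {"deviance": 400.0, "k": 8}

aic_prefers_large = aic(**large) < aic(**small)
bic_prefers_large = bic(n=n, **large) < bic(n=n, **small)
```

In this made-up case the per-parameter penalty is 2 under AIC and log(200), roughly 5.3, under BIC, so AIC favors the larger model while BIC favors the smaller one.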

2. ACL-falls data overview

Falling is a leading cause of fatal and non-fatal injuries for older persons in the United States [27]. Statistics show that a quarter of Americans aged 65+ fall every year [2]. While falling is somewhat common, falls are largely preventable. Older adults who engage in regular exercise activities have increased physical functional capacity relative to those engaging in lower-intensity activities such as daily chores and normal walking. However, all forms of physical activity have benefits for lower extremity strength and flexibility, which promotes balance and reduces fall-related risk [3]. Regular physical activity also promotes positive perceptions of psychological well-being among older adults, which in turn helps people feel better and adhere to staying physically active [26].

Evidence-based fall prevention programs [21,31] were delivered nationwide as part of the Patient Protection & Affordable Care Act (ACA) to help older adults reduce falls and fall-related risks. The ongoing implementation and dissemination of these programs are supported by grants from ACL and technical assistance from the National Council on Aging (NCOA)'s National Falls Prevention Resource Center [12]. Eight evidence-based fall prevention programs are included in this initiative, which includes data from 39 grantees spanning 22 states from 2014 to 2017 [30]. Using a national data repository [19], participants' information was collected using surveys before (pre-intervention) and immediately after (post-intervention) the intervention. Participant attendance details were recorded by attendance log. Grantees reported data about the program and workshops delivered, which were uploaded in the repository. Generally, in terms of outcome measures, four aspects of participant data are collected: (a) physical improvement; (b) mental improvement; (c) program influence on daily activities and environment; and (d) comments of the program (collected post-intervention only).

The original dataset contained 126 variables and 45,812 observations from 2014 to 2017 when including all 8 interventions, with 9.4% of data elements missing and only 30.4% of all observations being complete across all variables. This study focuses on two of the eight interventions, Stepping On (SO) and Tai Ji Quan: Moving for Better Balance (TJQMBB), because they are two of the most prevalent programs (representing 13.9% and 11.6% of participants reached, respectively) and include substantial physical activity components. SO is a 7-week program that introduces a range of fall prevention strategies to build participants' self-confidence so they can change behaviors and make better decisions in situations where they are at risk of falling [7]. TJQMBB is a 24-week or 48-week program developed to improve strength, balance, mobility and daily functioning, which is especially beneficial to prevent falls among older adults and individuals with balance disorders [20]. Due to collinearity concerns, certain variables are combined to form a new variable. For example, the falls efficacy scale is created as a composite score of five items. As a result, a sub-dataset of 37 variables and 5864 observations from only SO and TJQMBB participants is used in this paper. The final working dataset has 8.8% of data elements missing, and 32.0% of observations have complete information across all 37 variables.

The overall analysis plan is to fit a linear regression model with the falls efficacy change variable as the outcome, and potential predictors including individual demographics, self-evaluated physical and mental health level, intervention (program), length of course, etc. as explanatory variables. Thirty of the 37 variables in the final working dataset (29 candidate covariates and the outcome) were used in this analysis, after excluding variables deemed irrelevant to the proposed analysis. A complete list of the included variables with their corresponding missingness percentages is provided in Table 1.

Table 1.

Variables included in the proposed analysis with missingness percentages.

Candidate covariate variables	Num observed	% Missingness
Host Organization Id	5864	0%
Number of Sessions Offered	4967	15%
Number of Participants in Workshop	5864	0%
Pre-workshop Offered	5864	0%
Health Care Referral	5285	10%
Age	5715	3%
Live Alone	5051	14%
Gender	5716	3%
Hispanic	5497	6%
Education	5015	14%
Arthritis Bone/Joint Disease	5864	0%
Breathing/Lung Disease	5864	0%
Depression	5864	0%
Diabetes	5864	0%
Glaucoma/Other Eye Problem	5864	0%
Heart Disease/Blood Circulation Problem	5864	0%
No Chronic Condition	5864	0%
Limited Activity	4542	23%
General Health (base)	5563	5%
Number of Falls in Past 3 Months	4730	19%
Fear of Fall (base)	4954	16%
Concern about Falling Interfered with Normal Social Activities	4933	16%
Falls Efficacy (base)	4647	21%
Number of Falls in Past 3 Months with Injury	4053	31%
Race	5498	6%
Length of Program	5864	0%
Days Between First Survey and Last Survey	5864	0%
Program	5864	0%
Participation Rate	4967	15%
Outcome Variable	Num Observed	% Missingness
Falls Efficacy Change	4279	27%


3. Methodology

3.1. Multiple imputation

The multiple imputation procedure adopted in this study is the sequential regression imputation (SRMI) [24] approach, also known as multivariate imputation by chained equations (MICE) and fully conditional specification (FCS) [35]. This approach is capable of handling multiple imputation for datasets with a relatively complex structure by imputing variables in a specified sequence. In this process, the variable with the least amount of missing values is imputed first by specifying an appropriate regression model given the other variables. Then, this imputed variable is used for the imputation of the next variable. Separate regression models, conditional on all other observed or imputed variables, are specified for each variable according to its type (continuous, binary, categorical, counts, and mixed). The whole process iterates until it converges. Van Buuren, Boshuizen, and Knook [34] recommend including three sets of covariates in the imputation models: (a) variables that will be in the analysis model; (b) variables correlated with the imputed variable; and (c) variables associated with the missingness of the imputed variable. They also recommend removing covariates that have a large number of missing entries in observations with missing values for the imputed variable.

The ACL fall prevention dataset used in this study contains various types of data. Based on the arbitrary missing data pattern, and assuming an ignorable missing mechanism (MAR or MCAR), the SRMI/MICE/FCS approach was applied for this study. To provide a practical balance between processing time and quality of results, five imputations were generated for each missing entry in our study.

After getting multiply-imputed data, Rubin's rules were then used to draw combined inferences from these datasets. Suppose we have D imputations, for each imputation the same analysis method is conducted to get D sets of point and variance (or standard error) estimates for a certain population parameter of interest Q. Let $\hat{Q_{i}}$ and $\hat{U_{i}}$ denote the point estimate of interest and estimated variance of $\hat{Q_{i}}$ from the ith dataset, $i = 1, \dots, D$ . Then, the point estimate for parameter Q from D imputations is the average of the D complete-data point estimates:

\bar{Q} = \frac{1}{D} \sum_{i = 1}^{D} {\hat{Q}}_{i} .

(1)

To obtain a valid standard error for Q, two components of variance are needed. One is within-imputation variation, the other is between-imputation variation, which reflects the variability due to imputation uncertainty. Let $\bar{U}$ and B denote the average within-imputation variance and the between-imputation variance, respectively. They are calculated by the following equations:

\begin{aligned} \bar{U} & = \frac{1}{D} \sum_{i = 1}^{D} {\hat{U}}_{i}, \end{aligned}

(2)

\begin{aligned} B & = \frac{1}{D - 1} \sum_{i = 1}^{D} {({\hat{Q}}_{i} - \bar{Q})}^{2} . \end{aligned}

(3)

Then, the combined variance of Q is given by

T = \bar{U} + (1 + \frac{1}{D}) B .

(4)

A Wald test can be used for the null hypothesis that a single combined point estimate equals a specific value, $H_{0} : Q = Q_{0}$ , by comparing the statistic $W = \frac{(Q_{0} - \bar{Q})^{2}}{T}$ against the $F_{1, γ}$ distribution, where

γ = (D - 1) \left\{ 1 + \frac{\bar{U}}{(1 + D^{- 1}) B} \right\}^{2} .

(5)
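The pooling in Equations (1) to (5) can be sketched in a few lines of Python for a single scalar parameter. This is a minimal illustration using only the standard library; the function name `pool_rubin` and the returned dictionary keys are hypothetical, and the P-value lookup against the $F_{1, γ}$ distribution is left out because it needs a statistics library.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances, q0=0.0):
    """Pool D complete-data estimates of one scalar parameter with Rubin's rules."""
    D = len(estimates)
    if D < 2 or len(variances) != D:
        raise ValueError("need D >= 2 estimates with matching variances")
    q_bar = mean(estimates)              # Eq. (1): pooled point estimate
    u_bar = mean(variances)              # Eq. (2): within-imputation variance
    b = variance(estimates)              # Eq. (3): between-imputation variance (divisor D - 1)
    total = u_bar + (1 + 1 / D) * b      # Eq. (4): total variance
    wald = (q0 - q_bar) ** 2 / total     # Wald statistic for H0: Q = q0
    # Eq. (5): denominator degrees of freedom; unbounded when the
    # imputations agree exactly (b == 0).
    if b > 0:
        df = (D - 1) * (1 + u_bar / ((1 + 1 / D) * b)) ** 2
    else:
        df = float("inf")
    return {"q_bar": q_bar, "total_var": total, "wald": wald, "df": df}

# Three hypothetical imputations of the same regression coefficient.
pooled = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

For these made-up inputs the between-imputation variance is 1, so the total variance is 0.5 + (4/3) and the degrees of freedom are 2 × 1.375², which shows how disagreement between imputations inflates the variance and lowers γ.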

It is worth mentioning, Rubin's rules cannot be applied to draw inference from multiply-imputed data if the covariates of regression models are different in each imputation. Applying the variable selection method on each imputation is highly likely to get different selection results, which makes it hard to draw combined inferences. Therefore, it is desirable to use methods like MI-stepwise and MI-LASSO to ensure that a single set of covariates is selected into the final model.

3.2. Variable selection after MI: MI-stepwise

MI-stepwise variable selection [37] is similar to standard stepwise selection. The only difference is that the action (add, remove, or stay in the model) in standard stepwise selection is based on the probability value (P-value) of the hypothesis test for the variable of interest, while MI-stepwise uses a combined P-value. Specifically, D point estimates and variances are obtained from identical analyses on the D multiply-imputed datasets. According to Rubin's rules, these multiple estimates are combined into an overall inference, which incorporates both within-imputation and between-imputation variability. A combined P-value can be obtained from the Wald test, using the combined point estimate and its total variance. This combined P-value determines the action in each step. Thus the selection procedures are jointly conducted across all imputed datasets, and the action for each variable will be the same in all imputed datasets.

The details of MI-stepwise selection procedures are shown below, as described by Wood et al. [37]:

Step 0
Let t = 0. Specify the initial model $M_{0}$ . Define $α_{1}$ , the significance level for candidate covariates to enter the model, and $α_{2}$ , the significance level for removing a covariate from the model.
Step 1
Let t = t + 1. For each covariate $X_{j}$ that is not included in the model $M_{t - 1}$ , fit D regressions with the model $\{M_{t - 1}, X_{j}\}$ on the D imputed datasets. Obtain the combined P-value, $P_{c, j}$ , for each added $X_{j}$ by using Rubin's rules. Sort the covariates by $P_{c, j}$ . The covariate with the smallest combined P-value, $P_{c, j}$ , is added to the model $M_{t - 1}$ if its $P_{c, j} \leq α_{1}$ . Renew the model $M_{t} = \{M_{t - 1}, X_{j}\}$ . If the smallest $P_{c, j} > α_{1}$ , the procedure terminates.
Step 2
Refit $M_{t}$ on the D imputed datasets. Calculate the combined P-value $P_{c, j}$ for each covariate in the model. Sort the covariates by $P_{c, j}$ . The covariate with the largest combined P-value, $P_{c, j}$ , is removed from the model $M_{t}$ if its $P_{c, j} > α_{2}$ . Renew the model $M_{t} = \{M_{t}, - X_{j}\}$ .
Step 3
Repeat Step 2 until no more covariates are removed from the model.
Step 4
Go back to Step 1 and iterate between Steps 1 and 2 until the procedure terminates.

To avoid endless iteration, the significance level $α_{2}$ should not be smaller than $α_{1}$ .
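The control flow of Steps 0 to 4 can be sketched as follows. This is a minimal sketch rather than the authors' implementation: `pooled_pvalues` is a hypothetical callback that is assumed to fit a given model on all D imputed datasets and return the combined P-value of every covariate in it (for example via Rubin's rules), and `max_steps` is an added safety guard that is not part of the described procedure.

```python
def mi_stepwise(candidates, pooled_pvalues, alpha_enter=0.05,
                alpha_remove=0.06, max_steps=100):
    """Stepwise selection driven by combined P-values.

    pooled_pvalues(model) -> {covariate: combined P-value} for every
    covariate in `model` (hypothetical callback supplied by the caller).
    """
    if alpha_remove < alpha_enter:
        raise ValueError("alpha_remove must not be smaller than alpha_enter")
    model = []                                    # Step 0: empty initial model
    for _ in range(max_steps):
        # Step 1: try each covariate that is not yet in the model.
        outside = [x for x in candidates if x not in model]
        if not outside:
            break
        trial = {x: pooled_pvalues(model + [x])[x] for x in outside}
        best = min(trial, key=trial.get)
        if trial[best] > alpha_enter:
            break                                 # nothing qualifies: terminate
        model.append(best)
        # Steps 2-3: drop the weakest covariate until all pass alpha_remove.
        while model:
            pvals = pooled_pvalues(model)
            worst = max(model, key=lambda v: pvals[v])
            if pvals[worst] <= alpha_remove:
                break
            model.remove(worst)
    return model
```

Passing the P-value computation in as a callback keeps the selection logic independent of the regression and pooling code, so the same loop works with any model family for which a combined P-value is available.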

3.3. Variable selection after MI: MI-LASSO

Instead of fitting the model on each dataset separately, MI-LASSO fits models on all imputed datasets jointly to yield a consistent variable selection across all imputed datasets. Let ${\hat{β}}_{1, j}, \dots, {\hat{β}}_{D, j}$ be the D estimated coefficients for covariate $X_{j}$ on the D imputed datasets. To obtain consistency, ${\hat{β}}_{1, j}, \dots, {\hat{β}}_{D, j}$ are treated as a group to form the group LASSO penalty [38]. The specific optimization problem is shown below:

\min_{β_{d, j}} \sum_{d = 1}^{D} \sum_{i = 1}^{n} {(y_{d, i} - β_{d, 0} - \sum_{j = 1}^{p} β_{d, j} x_{d, i j})}^{2} + λ \sum_{j = 1}^{p} \sqrt{\sum_{d = 1}^{D} β_{d, j}^{2}},

(6)

where $\sum_{j = 1}^{p} \sqrt{\sum_{d = 1}^{D} β_{d, j}^{2}}$ is called the group LASSO penalty and $λ > 0$ is called the tuning parameter. The solution to the optimization problem (6) can be obtained by iteration. The group LASSO penalty function is singular at the origin. To overcome this difficulty, Chen and Wang [4] adopted the local quadratic approximation method proposed by Fan and Li [11]. Let ${\hat{β}}_{d, j}^{(t)}, d = 1, \dots, D$ , denote the coefficient estimates at the $t^{\text{th}}$ iteration. If $\sqrt{({\hat{β}}_{1, j}^{(t)})^{2} + \dots + ({\hat{β}}_{D, j}^{(t)})^{2}} > 0$ , we can make the following approximation:

\sqrt{β_{1, j}^{2} + \dots + β_{D, j}^{2}} \approx \frac{β_{1, j}^{2} + \dots + β_{D, j}^{2}}{\sqrt{{({\hat{β}}_{1, j}^{(t)})}^{2} + \dots + {({\hat{β}}_{D, j}^{(t)})}^{2}}} .

(7)

Accordingly, the optimization equation can be transformed to

\begin{aligned} \min_{β_{d, j}} \sum_{d = 1}^{D} \left\{ \sum_{i = 1}^{n} {(y_{d, i} - β_{d, 0} - \sum_{j = 1}^{p} β_{d, j} x_{d, i j})}^{2} + λ \sum_{j = 1}^{p} c_{j}^{(t)} β_{d, j}^{2} \right\}, \end{aligned}

(8)

\begin{aligned} c_{j}^{(t)} = 1 / \sqrt{{({\hat{β}}_{1, j}^{(t)})}^{2} + \dots + {({\hat{β}}_{D, j}^{(t)})}^{2}} . \end{aligned}

(9)

To keep the condition $\sqrt{({\hat{β}}_{1, j}^{(t)})^{2} + \dots + ({\hat{β}}_{D, j}^{(t)})^{2}} > 0$ when a group of coefficients is shrunk toward zero, Chen and Wang [4] proposed to fix ${\hat{β}}_{1, j}^{(t)} = \dots = {\hat{β}}_{D, j}^{(t)}$ when $\sqrt{({\hat{β}}_{1, j}^{(t)})^{2} + \dots + ({\hat{β}}_{D, j}^{(t)})^{2}} < \sqrt{D} δ$ . The value of δ was chosen as $10^{- 10}$ . Equation (8) is the sum of D ridge regressions. Thus the group LASSO penalty problem can be solved by iteratively computing D separate ridge regressions until convergence. The largest allowed difference of the estimated coefficients between two iterations is set as $10^{- 6}$ . The tuning parameter λ controls the strength of shrinkage and consequently the number of selected variables. One hundred and one values of the tuning parameter λ, ranging from 0.125 to 125, are generated, and the value of λ associated with the smallest BIC is used. For the MI-LASSO method, Chen and Wang [4] defined the corresponding BIC and $\mathit{df}_{2}$ as

\begin{aligned} \mathrm{BIC} & = \log \left( \sum_{d = 1}^{D} \sum_{i = 1}^{n} {(y_{d, i} - {\hat{β}}_{d, 0} - \sum_{j = 1}^{p} {\hat{β}}_{d, j} x_{d, i j})}^{2} / (D n) \right) + \mathit{df}_{2} \cdot \frac{\log (D n)}{D n}, \end{aligned}

(10)

\begin{aligned} \mathit{df}_{2} & = \sum_{j = 1}^{p} I \left( \sqrt{\sum_{d = 1}^{D} {\hat{β}}_{d, j}^{2}} > 0 \right) + (D - 1) \sum_{j = 1}^{p} \frac{\sqrt{\sum_{d = 1}^{D} {\hat{β}}_{d, j}^{2}}}{\sqrt{\sum_{d = 1}^{D} {\tilde{β}}_{d, j}^{2}}}, \end{aligned}

(11)

where ${\hat{β}}_{d, j}$ is the MI-LASSO estimate and ${\tilde{β}}_{d, j}$ is the ordinary least squares estimate.
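As a worked illustration of Equations (10) and (11), the sketch below computes the degrees of freedom and the BIC from coefficient arrays. The function names and argument layout (one row of slope coefficients per imputed dataset, intercepts excluded) are assumptions made for this example.

```python
import math

def group_norm(beta, j):
    """Euclidean norm of the coefficients of covariate j across the D datasets."""
    return math.sqrt(sum(row[j] ** 2 for row in beta))

def mi_lasso_bic(rss, beta_hat, beta_ols, n):
    """Return (BIC, df_2) following Eqs. (10) and (11).

    rss      : residual sum of squares summed over all D imputed datasets
    beta_hat : D x p MI-LASSO slope estimates
    beta_ols : D x p ordinary least squares slope estimates
    n        : number of observations in each imputed dataset
    """
    D, p = len(beta_hat), len(beta_hat[0])
    df2 = 0.0
    for j in range(p):
        shrunk = group_norm(beta_hat, j)
        if shrunk > 0:
            # one unit for being selected, plus (D - 1) times the shrinkage ratio
            df2 += 1 + (D - 1) * shrunk / group_norm(beta_ols, j)
    bic = math.log(rss / (D * n)) + df2 * math.log(D * n) / (D * n)
    return bic, df2

# Hypothetical example: D = 2, p = 2; the second covariate is removed entirely
# and the first is shrunk to half of its least squares size.
bic_value, df2_value = mi_lasso_bic(
    rss=100.0,
    beta_hat=[[1.0, 0.0], [1.0, 0.0]],
    beta_ols=[[2.0, 1.0], [2.0, 1.0]],
    n=50,
)
```

Here the selected group contributes 1 + (2 − 1) × 0.5 = 1.5 degrees of freedom, so a group that is kept but heavily shrunk is charged less than a group estimated without shrinkage.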

Due to the nature of the group LASSO penalty, the estimated coefficients $({\hat{β}}_{1, j}, \dots, {\hat{β}}_{D, j})$ will be either all zero or all nonzero resulting in a desired consistent variable selection across all imputed datasets. After the selection, overall inference can be obtained by Rubin's rules.
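The iteration described for Equations (8) and (9) can be sketched in plain Python as repeated ridge regressions with group-specific weights. This is an illustrative sketch under stated assumptions, not the authors' code: the helper names are hypothetical, the least squares fit is used as the starting value, and the safeguard for vanishing groups is implemented by flooring the group norm at $\sqrt{D} δ$ , which is one reading of the rule above.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def ridge(X, y, weights):
    """Minimize sum_i (y_i - b0 - x_i'b)^2 + sum_j weights[j] * b_j^2 (intercept unpenalized)."""
    Z = [[1.0] + list(row) for row in X]
    q = len(Z[0])
    A = [[sum(z[a] * z[c] for z in Z) for c in range(q)] for a in range(q)]
    for j, w in enumerate(weights):
        A[j + 1][j + 1] += w
    rhs = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(q)]
    return solve(A, rhs)

def mi_lasso(Xs, ys, lam, delta=1e-10, tol=1e-6, max_iter=500):
    """Xs, ys: D imputed design matrices and responses. Returns D lists [b0, b1, ..., bp]."""
    D, p = len(Xs), len(Xs[0][0])
    betas = [ridge(X, y, [0.0] * p) for X, y in zip(Xs, ys)]    # least squares start
    for _ in range(max_iter):
        c = []
        for j in range(1, p + 1):
            norm = math.sqrt(sum(b[j] ** 2 for b in betas))
            c.append(1.0 / max(norm, math.sqrt(D) * delta))     # Eq. (9) with a floor
        new = [ridge(X, y, [lam * cj for cj in c]) for X, y in zip(Xs, ys)]
        change = max(abs(u - v) for bn, bo in zip(new, betas) for u, v in zip(bn, bo))
        betas = new
        if change < tol:
            break
    return betas
```

Because every imputed dataset shares the same weight $c_{j}^{(t)}$ for covariate j, a group whose norm collapses receives a very large penalty in all D ridge fits at once, which is what drives the all-zero or all-nonzero behavior. In practice, coefficients below a small threshold would be reported as zero.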

4. Application to the ACL fall prevention data

ACL Fall Prevention Data provided pre- and post-intervention information about participants of fall prevention programs to evaluate changes in health outcomes and benefits received. Among variables collected to measure the physical, mental, and other conditions related to falls, falls efficacy is a robust measure that somewhat incorporates these different aspects. Falls efficacy reflects participants' degree of confidence to avoid falling or protect themselves before and after a fall. Falls efficacy is associated with engaging in physical activity and has been found to significantly improve among older adults attending SO [22] and TJQMBB [23]. Falls efficacy was collected before and after the intervention using five items, with each item scored on a 4-point Likert scale. Responses are summed to create a composite score at each time point. The difference of the summed falls efficacy scores was calculated (positive values indicated improvement from pre- to post-intervention) and used as the dependent variable.

MI-stepwise and MI-LASSO were applied to the ACL Fall Prevention Data assuming MAR or MCAR, under which assumptions missing data handling through MI is deemed appropriate. A series of tuning parameter values were used in MI-LASSO, and the value with the smallest BIC was used. For MI-stepwise, we set $α_{1} = 0.05$ as the significance level for candidate variables to enter the model, and $α_{2} = 0.06$ as the significance level for predictors to be removed from the model. The collected dataset had 37 variables. After excluding variables deemed irrelevant to the change of falls efficacy, 29 variables were considered as candidate covariates. These potential predictors included individual demographics, self-evaluated physical and mental health level, intervention (program), length of course, etc. A complete list of the variables included in the data application is presented in Table 1. Based on the recommendation from our field investigators, the final analysis was conducted on all subjects with complete information on the outcome variable, i.e. the change of falls efficacy. Variable selection was performed to find factors significantly associated with the change of falls efficacy. The variable selection results from MI-LASSO and MI-stepwise are shown in Table 2. The top of the table lists the six variables (out of 29) selected by both methods. The bottom part shows the variables selected by MI-LASSO but not MI-stepwise. No variables were selected by MI-stepwise but not MI-LASSO. For each variable in the list, the estimates of regression coefficients and their P-values were obtained using Rubin's rules.

Table 2.

Results of complete-case (CC) analysis, MI-LASSO, and MI-stepwise methods in selecting important variables related to the change in participants' falls efficacy.

		CC Analysis		MI-LASSO		MI-stepwise
	Covariate	Estimate	Pr > |t|	Estimate	Pr > |t|	Estimate	Pr > |t|
0	Intercept	10.3065	<.0001	11.1023	<.0001	11.3618	<.0001
1	Age	−0.0697	<.0001	−0.0697	<.0001	−0.0716	<.0001
2	Education	0.1962	0.0001	0.1524	0.0002	0.1506	0.0002
3	Participation Rate	0.6873	0.0096	0.6846	0.0021	0.8572	<.0001
4	Fear of Fall (base)	−0.2200	0.0005	−0.2342	<.0001	−0.2597	<.0001
5	General Health (base)	0.3939	<.0001	0.3784	<.0001	0.4213	<.0001
6	Falls Efficacy (base)	−0.6179	<.0001	−0.6531	<.0001	−0.6498	<.0001
7	Live Alone (no)	0.1569	0.1081	0.2476	0.0023
8	Limited Activity (no)	0.3358	0.0015	0.2509	0.0096
9	Breathing/Lung Disease (no)	−0.2447	0.1004	−0.2280	0.0540
10	Number of Falls in Past 3 Months	−0.0220	0.5350	−0.0454	0.1869
11	Glaucoma/Other Eye Problem (no)	0.1922	0.1490	0.1526	0.1601
12	Length of Program	−0.0003	0.8605	−0.0008	0.6583
13	Program (Stepping On)	0.3784	0.0117	0.1559	0.2330
14	Heart Disease/Blood Circulation Problem (no)	0.1760	0.1285	0.1273	0.1780


The estimates of regression coefficients for the six variables selected by both methods were similar. Specifically, the improvement of falls efficacy is larger among those with a higher level of education, better general health status before they entered the program, and higher program workshop attendance (total number of workshop sessions attended/total number of workshop sessions offered). Falls efficacy improvements were smaller among those who were older, had more fear of falling before attending the program, and had higher falls efficacy at baseline. MI-LASSO additionally selected the following factors: (a) whether the participant lived alone; (b) whether the participant limited their activities because of health problems; (c) self-reported chronic conditions (e.g. arthritis, lung disease, glaucoma, heart disease); (d) the program type and the length of the program; and (e) number of falls in past 3 months. Among these factors, living alone and limiting activities because of health problems were each associated with significantly smaller falls efficacy improvements at the 0.05 significance level. Other factors were not significantly related to changes in falls efficacy. To further compare the predictions of different methods, we obtained the prediction MSEs from fivefold cross-validation for MI-LASSO and MI-stepwise. MI-LASSO had an MSE of 5.94, and MI-stepwise resulted in an MSE of 5.97. Thus, the two approaches appear to achieve similar prediction performance on the ACL-falls data.

For comparison purposes, we also include results from the complete-case analysis in Table 2. The six explanatory variables selected by both MI-LASSO and MI-stepwise had similar effect sizes and remained statistically significant in the complete-case analysis. There are discrepancies on two additional variables, ‘Live Alone’ and ‘Program’, between the complete-case analysis and MI-LASSO, suggesting possible selection bias from the complete-case analysis, which used only around one third of the total sample size. A cautious approach should be applied when interpreting the effects of these two variables on the change of falls efficacy.

Overall, MI-stepwise appears to be more conservative in selecting variables than MI-LASSO. It leaves out some variables selected by MI-LASSO that are significant at the 0.05 significance level. MI-LASSO selects more than twice as many variables as MI-stepwise (14 versus 6). However, some of those extra variables were not significantly associated with changes in the dependent variable. To provide further insights into these observations from our data application, we conducted additional simulation studies.

5. Simulation study

5.1. Design

Simulation studies are conducted to compare the performance of MI-stepwise and MI-LASSO as variable selection methods with multiply-imputed data. We design our simulation after the Chen and Wang [4] study and extend their work under several circumstances not previously covered, including further investigations of data with different signal-to-noise ratios and various missing data patterns across predictors, as well as a data structure that allowed the missingness mechanism to be missing not at random (MNAR). In addition, we also assess the performance of MI-LASSO method with varying tuning parameters to address the overselection issue in cross-validation (CV)-based LASSO.

A dataset with 20 variables and 200 observations is generated from a multivariate normal distribution with standard normal margins (mean = 0, variance = 1) and a compound symmetric correlation structure. The correlation coefficient ρ has a value of 0.3 for a moderate correlation. The 20 variables $X_{1}, \dots, X_{20}$ are all continuous, and the response variable Y is generated from the following linear regression model:

$$Y = X_{1} + X_{5} + X_{10} + X_{11} + X_{15} + X_{20} + ε, \tag{12}$$

with variables 1, 5, 10, 11, 15 and 20 being the important predictors and $ε \sim N (0, σ_{ε}^{2})$ . Three different values of $σ_{ε}^{2}$ are adopted to obtain signal-to-noise ratios (SNR) of 0.5, 1 and 3. Here

$$\mathrm{SNR} = \frac{β^{T} Σ β}{σ_{ε}^{2}}, \tag{13}$$

where $β = (1, 1, 1, 1, 1, 1)$ is the vector of true coefficient parameters and Σ is the population covariance matrix of X.
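The data-generating step and the link between the SNR and $σ_{ε}^{2}$ can be sketched in Python with NumPy. This is an illustrative sketch, not the authors' code: the function names are hypothetical, and β is assumed to be extended to length 20 with zeros for the unimportant variables, so that $β^{T} Σ β$ uses the full compound symmetric Σ. Under that assumption the signal variance is 6 + 30 × 0.3 = 15, which gives $σ_{ε}^{2}$ of 30, 15 and 5 for SNR = 0.5, 1 and 3.

```python
import numpy as np

IMPORTANT = [1, 5, 10, 11, 15, 20]  # 1-based indices taken from model (12)

def compound_symmetric(p, rho):
    """p x p correlation matrix: ones on the diagonal, rho elsewhere."""
    return (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))

def noise_variance(beta, sigma, snr):
    """Solve SNR = beta' Sigma beta / sigma_eps^2 for sigma_eps^2."""
    return float(beta @ sigma @ beta) / snr

def simulate_full_data(n=200, p=20, rho=0.3, snr=1.0, seed=0):
    """Draw X ~ N(0, Sigma) and Y from model (12). Hypothetical helper."""
    rng = np.random.default_rng(seed)
    sigma = compound_symmetric(p, rho)
    beta = np.zeros(p)
    beta[np.array(IMPORTANT) - 1] = 1.0  # assumption: zeros elsewhere
    X = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    sd_eps = np.sqrt(noise_variance(beta, sigma, snr))
    y = X @ beta + rng.normal(0.0, sd_eps, size=n)
    return X, y, beta, sigma

X, y, beta, sigma = simulate_full_data()
# Six unit variances plus 30 off-diagonal covariances of 0.3 each.
signal = float(beta @ sigma @ beta)
```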

Y is designed to be fully observed, and missing data are generated in the 20 simulated candidate covariates under several scenarios. Missing values are generated under both MAR and MNAR mechanisms. In addition, various missingness percentages and patterns are considered.

For MAR, the missing data indicator $R_{i j}$ is generated using the following logistic model, in which subscript j represents the variable and subscript i represents the observation. When $j \leq 10$ , the probability of missingness for $X_{i j}$ depends only on $X_{i (j + 10)}$ . When $10 < j \leq 20$ , the probability of missingness for $X_{i j}$ depends only on $X_{i (j - 10)}$ , so that

$$\begin{cases} \operatorname{logit}\{\Pr(R_{i j} = 0 \mid X_{i (j + 10)})\} = α + X_{i (j + 10)}, & j \leq 10 \\ \operatorname{logit}\{\Pr(R_{i j} = 0 \mid X_{i (j - 10)})\} = α + X_{i (j - 10)}, & 10 < j \leq 20 \end{cases}; \tag{14}$$

while for MNAR, the probability for $X_{i j}$ to be missing depends on itself and not any other variables:

$$\operatorname{logit}\{\Pr(R_{i j} = 0 \mid X_{i j})\} = α + X_{i j}, \tag{15}$$

where α can be chosen to control variable missingness percentages.
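One way to realise models (14) and (15) is sketched below in Python. The sketch assumes that $R_{i j} = 0$ marks a missing entry and that α is found by bisection so that the average missingness probability matches a target rate; the paper does not state how α was chosen, so `calibrate_alpha` and `make_missing` are hypothetical helpers.

```python
import numpy as np

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-z))

def calibrate_alpha(driver, target, lo=-20.0, hi=20.0, tol=1e-8):
    """Bisection on alpha so that mean(expit(alpha + driver)) hits target."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expit(mid + driver).mean() < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def make_missing(X, target, mechanism="MAR", seed=0):
    """Return a copy of X with NaN holes drawn from model (14) or (15)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    half = p // 2
    X_miss = X.copy()
    for j in range(p):
        if mechanism == "MAR":
            # Missingness of X_j is driven by its partner X_{j+10} or X_{j-10}.
            partner = j + half if j < half else j - half
            driver = X[:, partner]
        else:
            # MNAR: missingness of X_j is driven by X_j itself.
            driver = X[:, j]
        alpha = calibrate_alpha(driver, target)
        X_miss[rng.random(n) < expit(alpha + driver), j] = np.nan
    return X_miss

X = np.random.default_rng(1).standard_normal((200, 20))
X_mar = make_missing(X, target=0.10, mechanism="MAR")
X_mnar = make_missing(X, target=0.10, mechanism="MNAR")
```

Because the drivers are always read from the fully observed `X`, deleting values in one column never changes the missingness probabilities of another column.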

In subsequent simulation studies, the overall percentage of missingness in the simulated dataset, $p_{t}$ , is set to either 10% or 20%, with 10% close to the overall percentage of missingness in the ACL falls data. We also study the missingness pattern, that is, how missing values are scattered across the dataset for a fixed overall missingness percentage. Three patterns are considered: (a) each variable has $p_{t}$ of missing entries; (b) $X_{1} - X_{10}$ each has $0.5 p_{t}$ of missing entries, while $X_{11} - X_{20}$ each has $1.5 p_{t}$ of missing entries; and (c) $X_{1} - X_{10}$ are fully observed, while $X_{11} - X_{20}$ each has $2 p_{t}$ of missing entries.
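The three patterns translate into per-variable target rates that all average to $p_{t}$. A small Python sketch (the function name and the pattern labels as dictionary keys are illustrative choices, not from the paper):

```python
def pattern_rates(p_t, pattern, p=20):
    """Per-variable missingness rates for patterns (a), (b) and (c)."""
    multipliers = {"a": (1.0, 1.0), "b": (0.5, 1.5), "c": (0.0, 2.0)}
    first, second = multipliers[pattern]
    half = p // 2
    # First half is X_1..X_10, second half is X_11..X_20.
    return [first * p_t] * half + [second * p_t] * (p - half)

rates_b = pattern_rates(0.10, "b")
```

Each pattern keeps the overall share of missing entries fixed, so differences in selection performance can be attributed to how the missing values are distributed rather than to how many there are.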

After data with missingness are generated, five imputed datasets are created using the chained equations approach with linear regression models. For each missing entry, all other variables except Y are used as predictors in the imputation. Subsequent data analyses apply the MI-LASSO and MI-stepwise approaches to the multiply-imputed data. In certain scenarios, complete-case analyses are performed on the datasets without imputation for comparison.
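The imputation step can be illustrated with a deliberately simplified chained-equations loop in Python. This is a sketch under stated assumptions, not the procedure used in the paper: each incomplete covariate is regressed on the other covariates (Y is excluded, as described above) and missing entries are replaced by the fitted value plus Gaussian noise. Unlike established implementations such as the R package mice, the regression parameters here are not redrawn from their posterior, so the between-imputation variability is understated. The helper names and the toy data are hypothetical.

```python
import numpy as np

def impute_once(X_miss, n_cycles=10, seed=0):
    """One stochastic chained-equations pass over all incomplete columns."""
    rng = np.random.default_rng(seed)
    miss = np.isnan(X_miss)
    X = X_miss.copy()
    # Initialise every hole with the observed mean of its column.
    X[miss] = np.take(np.nanmean(X_miss, axis=0), np.nonzero(miss)[1])
    n, p = X.shape
    for _ in range(n_cycles):
        for j in range(p):
            m = miss[:, j]
            if not m.any():
                continue
            # Linear regression of X_j on all other covariates (no Y).
            A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            coef, *_ = np.linalg.lstsq(A[~m], X[~m, j], rcond=None)
            resid = X[~m, j] - A[~m] @ coef
            dof = max(int((~m).sum()) - A.shape[1], 1)
            sd = np.sqrt(resid @ resid / dof)
            X[m, j] = A[m] @ coef + rng.normal(0.0, sd, int(m.sum()))
    return X

def multiply_impute(X_miss, m=5, seed=0):
    """Return m completed datasets, one per random seed."""
    return [impute_once(X_miss, seed=seed + d) for d in range(m)]

rng = np.random.default_rng(1)
X_full = rng.standard_normal((200, 6))
X_miss = X_full.copy()
X_miss[rng.random(X_full.shape) < 0.10] = np.nan  # toy holes for the demo only
imputed = multiply_impute(X_miss, m=5)
```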

5.2. Measurements of performance

The simulation is repeated 200 times. Four criteria are used to compare the performance of the MI-LASSO and MI-stepwise methods across the different missing-value situations.

(a) Sensitivity:

$$\mathrm{SEN} = \frac{\text{\# of selected important variables}}{\text{\# of true important variables}}; \tag{16}$$

(b) Specificity:

$$\mathrm{SPE} = \frac{\text{\# of removed unimportant variables}}{\text{\# of true unimportant variables}}; \tag{17}$$

(c) Geometric mean of sensitivity and specificity:

$$G = \sqrt{\mathrm{SEN} \times \mathrm{SPE}}; \tag{18}$$

(d) Mean-squared error:

$$\mathrm{MSE} = (\hat{β} - β)^{T} Σ (\hat{β} - β). \tag{19}$$

Sensitivity measures the proportion of important variables correctly selected. Specificity measures the proportion of unimportant variables correctly removed. The Geometric Mean of Sensitivity and Specificity is used as an overall performance measure of model selection. As described by Kubat et al. [18], this measure has the distinctive property of being independent of the numbers of important and unimportant variables. The values of these measures range between 0 and 1. A value close to 1 is more desirable because it implies that most variables are selected correctly. Additionally, MSE serves as an assessment of the overall model accuracy [4,32].
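As a concrete reading of criteria (16) to (19), the following Python sketch computes them for a single replicate. The function names are hypothetical, variables are labelled 1 to p, and the example selection is made up for illustration.

```python
import math

def selection_metrics(selected, important, p):
    """Return (SEN, SPE, G) for one fitted model over variables 1..p."""
    selected, important = set(selected), set(important)
    unimportant = set(range(1, p + 1)) - important
    sen = len(selected & important) / len(important)
    spe = len(unimportant - selected) / len(unimportant)
    return sen, spe, math.sqrt(sen * spe)

def mse(beta_hat, beta, sigma):
    """Quadratic form (beta_hat - beta)' Sigma (beta_hat - beta)."""
    diff = [bh - b for bh, b in zip(beta_hat, beta)]
    k = len(diff)
    return sum(diff[i] * sigma[i][j] * diff[j]
               for i in range(k) for j in range(k))

# Illustrative replicate: five of six important variables found,
# plus two unimportant ones (3 and 7) wrongly kept.
sen, spe, g = selection_metrics({1, 5, 10, 11, 15, 3, 7},
                                {1, 5, 10, 11, 15, 20}, p=20)
```

In the simulation these quantities would be computed per replicate and then summarised over the 200 replicates (means for SEN, SPE and G, the median for MSE, as reported in the tables).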

5.3. Simulation results

Four simulation studies were conducted to compare the performances of stepwise and LASSO methods under various scenarios (full data, complete cases, and multiply-imputed data). In simulation 1, missing values were generated under different missingness mechanisms with different missingness percentages. The second simulation generated data with three signal-to-noise ratios. In the third simulation, missing values distributed with three patterns were considered. Finally, in the fourth simulation, we evaluated the performance of MI-LASSO with increased tuning parameters.

5.3.1. Simulation 1: Comparison of ignorable and non-ignorable missingness mechanisms with varying missingness percentages and SNRs

In this section, stepwise and LASSO methods are applied to data with missingness under an ignorable mechanism (i.e. missing at random (MAR)) and to data with missingness under a non-ignorable mechanism (i.e. missing not at random (MNAR)). The total missingness percentage is set to either 10% or 20%.

Figure 1 shows that MI-LASSO generally selects more variables than MI-stepwise, in agreement with the findings of Chen and Wang [4]. MI-LASSO under MNAR selects as many important variables as MI-LASSO under MAR. Comparing the selection between $X_{1} \sim X_{10}$ and $X_{11} \sim X_{20}$ , the increased percentage of missingness has a slight impact on the MI-LASSO method, leading to fewer variables being selected into the model. Meanwhile, the missingness percentage of variables affects the performance of MI-stepwise under both MAR and MNAR mechanisms; a higher missingness percentage makes the algorithm leave out more important variables and select fewer unimportant variables.

From Table 3, under all simulation scenarios, the sensitivities, specificities and geometric means of both MI-LASSO and MI-stepwise suggest that the two variable selection methods perform better on data missing under MAR than on data missing under MNAR. Specifically, when the total missingness percentage increases to 20%, the differences in the three measures between the two mechanisms (i.e. MAR and MNAR) are more pronounced. When the total missingness percentage is 10%, the performances of the MI-variable selection methods under the two missingness mechanisms are relatively similar. Judging by the changes in geometric mean, MI-LASSO appears to be more sensitive to the missingness mechanism than MI-stepwise. MSE becomes smaller when the SNR increases from 1 to 3.

Table 3.

Mean sensitivity (SEN), specificity (SPE), their geometric mean (G) and median mean-squared error (MSE) for stepwise and LASSO methods under MAR and MNAR in 200 replicates of simulations (N = 200): continuous covariates with compound symmetry covariance structure ( $ρ = 0.3$ , SNR = 1 or 3). The missingness percentage for each $X_{1} \sim X_{10}$ is 5% and for $X_{11} \sim X_{20}$ is 15% in group A. The missingness percentages correspondingly double in group B.

		Group A (Missing = 10%)				Group B (Missing = 20%)
SNR = 1		SEN	SPE	G	MSE	SEN	SPE	G	MSE
MAR
	MI-LASSO	98.8%	72.6%	84.7%	1.0	96.6%	68.3%	81.3%	1.0
	MI-stepwise	80.1%	95.7%	87.6%	1.4	68.3%	94.7%	80.4%	2.0
MNAR
	MI-LASSO	98.6%	71.3%	83.8%	1.1	95.8%	64.9%	78.9%	1.4
	MI-stepwise	78.8%	95.4%	86.7%	1.2	67.1%	93.8%	79.3%	2.5
		Group A (Missing = 10%)				Group B (Missing = 20%)
SNR=3		SEN	SPE	G	MSE	SEN	SPE	G	MSE
MAR
	MI-LASSO	100.0%	67.4%	82.1%	0.4	100.0%	58.8%	76.7%	0.6
	MI-stepwise	99.1%	97.2%	98.1%	0.2	92.0%	95.6%	93.8%	0.4
MNAR
	MI-LASSO	100.0%	63.7%	79.8%	0.4	99.8%	54.7%	73.9%	0.7
	MI-stepwise	99.3%	96.4%	97.9%	0.2	90.4%	94.8%	92.6%	0.4


Although multiple imputation is built upon the assumption of ignorable missingness mechanisms (i.e. MCAR and MAR), MI-variable selection methods under non-ignorable missingness mechanisms seem to provide satisfactory variable selection performance when the percentage of missingness in each variable is low. Possible reasons are: (a) by including sufficient data and correct auxiliary variables in the imputation model, we could obtain good imputed results using MI even under the MNAR mechanism [8,28]; and (b) although applying MI to data with an MNAR mechanism on the predictors may lead to biased inferences, this bias does not necessarily have a strong impact on the later variable selection process when the missingness percentage of each variable is relatively low and the signals of important variables are strong enough to be detected. Meanwhile, complete-case analysis can yield unbiased parameter estimates even if the missingness mechanism for predictors is MNAR, provided that missingness does not depend on the response given the predictors [36].

5.3.2. Simulation 2: Comparisons across varying sample sizes and SNRs

In this section, stepwise and LASSO methods are applied to full data, complete cases (CC) of data with missingness, and multiply-imputed data. The dependent variable is generated with three different signal-to-noise ratios (SNR) and two sample sizes.

Table 4 shows that the SNR influences the performance of both stepwise and LASSO methods. For stepwise methods (including stepwise for full data, CC-stepwise and MI-stepwise), sensitivity, specificity and geometric mean increase as the SNR increases. For LASSO methods (including LASSO for full data, CC-LASSO and MI-LASSO), sensitivity increases and specificity decreases as the SNR increases. The discrepancies in geometric means between LASSO and MI-LASSO are smaller than those between stepwise and MI-stepwise. Based on all four criteria listed in Table 4, MI-LASSO and MI-stepwise outperform CC-LASSO and CC-stepwise across the three signal-to-noise conditions.

Table 4.

		SNR = 0.5				SNR = 1				SNR=3
		SEN	SPE	G	MSE	SEN	SPE	G	MSE	SEN	SPE	G	MSE
		N = 80
Full Data
	LASSO	66.7%	76.2%	71.3%	4.9	83.9%	74.2%	78.9%	2.8	99.3%	71.6%	84.3%	0.9
	Stepwise	34.9%	93.4%	57.1%	7.7	53.7%	93.4%	70.8%	4.3	89.5%	94.4%	91.9%	1.2
MAR
	CC-LASSO	62.5%	41.8%	51.1%	35.9	63.3%	42.1%	51.6%	18.5	67.7%	43.9%	54.5%	7.0
	CC-stepwise	11.3%	93.2%	32.4%	10.2	14.7%	92.5%	36.8%	8.9	23.2%	91.7%	46.1%	5.4
	MI-LASSO	53.8%	82.0%	66.4%	4.5	78.2%	75.5%	76.8%	2.8	98.3%	68.0%	81.8%	1.0
	MI-stepwise	29.4%	93.3%	52.4%	7.4	43.8%	93.9%	64.2%	5.2	71.4%	95.0%	82.4%	1.5
		N = 200
Full Data
	LASSO	92.8%	67.5%	79.1%	2.2	99.3%	65.8%	80.8%	1.1	100.0%	65.7%	81.1%	0.4
	Stepwise	66.6%	94.3%	79.2%	2.7	88.1%	94.8%	91.4%	1.2	100.0%	94.6%	97.3%	0.3
MAR
	CC-LASSO	28.8%	87.8%	50.2%	6.9	45.3%	85.1%	62.1%	5.5	77.1%	81.4%	79.2%	2.0
	CC-stepwise	18.8%	93.8%	42.0%	8.4	27.6%	93.2%	50.7%	6.9	50.2%	93.0%	68.3%	3.7
	MI-LASSO	88.9%	75.7%	82.0%	1.8	98.3%	71.5%	83.8%	1.0	100.0%	66.4%	81.5%	0.4
	MI-stepwise	56.7%	94.3%	73.1%	3.4	79.3%	95.4%	87.0%	1.5	99.1%	96.8%	97.9%	0.2


Controlling for the SNR, the sensitivity, specificity and geometric mean of the stepwise methods decrease when the sample size decreases. The deterioration is worse when the SNR is low. For MI-LASSO and LASSO (full), controlling for the SNR, sensitivity decreases and specificity increases when the sample size decreases. MI-LASSO and MI-stepwise tend to retain the characteristics of LASSO and stepwise selection, respectively. As such, when the sample size increases (or when the ratio of sample size to number of covariates increases), LASSO is prone to select more variables into the model. The tendency of the CV-based LASSO to overselect variables has also been reported and discussed in several previous studies [1,5,40]. Meanwhile, the lower the SNR, the larger the differences in sensitivity and specificity between the two sample sizes. When the sample size equals 200, the geometric mean of MI-LASSO is less sensitive to the SNR. In general, MSE decreases when the sample size and SNR increase, suggesting that larger sample sizes and higher SNRs lead to more accurate models.

5.3.3. Simulation 3: Comparisons across different missing data patterns

In this section, stepwise and LASSO methods are applied to full data and to multiply-imputed data with a fixed total missingness percentage but different missing data patterns, to test whether the pattern of missingness affects the performance of MI-LASSO and MI-stepwise.

According to the previous results, the missingness percentage of a covariate could affect the probability of it being selected by MI-stepwise (Figure 1), and the overall missingness percentage affects the performance of MI-stepwise and MI-LASSO (Table 3). From Table 5, the mean sensitivity, specificity and geometric mean and the median MSE for MI-stepwise and MI-LASSO are similar under the different missing data patterns. When controlling for the overall percentage of missingness, the pattern of missingness does not have a substantial impact on the performance of MI-stepwise and MI-LASSO.

Table 5.

Mean sensitivity (SEN), specificity (SPE), their geometric mean (G) and median mean-squared error (MSE) for stepwise and LASSO methods under MAR in 200 replicates of simulations (N = 200): continuous covariates with compound symmetry covariance structure ( $ρ = 0.3$ , SNR = 1). In total, 10% of the values are set to missing under the MAR mechanism, resulting in about 20% complete cases. ‘0.1/0.1’: missingness percentage for each of $X_{1} \sim X_{20}$ is 10%. ‘0.05/0.15’: missingness percentage for each of $X_{1} \sim X_{10}$ is 5% and for each of $X_{11} \sim X_{20}$ is 15%. ‘0/0.2’: $X_{1} \sim X_{10}$ are fully observed and the missingness percentage for each of $X_{11} \sim X_{20}$ is 20%.

	SEN	SPE	G	MSE
	Full Data
LASSO	99.3%	65.8%	80.8%	1.1
Stepwise	88.1%	94.8%	91.4%	1.2
	0.1/0.1
MI-LASSO	98.3%	71.5%	83.8%	1.0
MI-stepwise	79.3%	95.4%	87.0%	1.5
	0.05/0.15
MI-LASSO	98.8%	72.6%	84.7%	1.0
MI-stepwise	80.1%	95.7%	87.6%	1.4
	0/0.2
MI-LASSO	98.5%	72.4%	84.4%	1.2
MI-stepwise	79.9%	94.9%	87.1%	1.3


Compared with LASSO and stepwise applied to the full data, MI-LASSO and MI-stepwise show minimal loss of efficiency under the different missing data patterns.

5.3.4. Simulation 4: MI-LASSO performances with varying tuning parameters

The previous simulation results show that MI-LASSO is prone to over-selecting variables into the model, which is not desirable. We therefore explore the performance of MI-LASSO with an enlarged tuning parameter. More variables are excluded in the following MI-LASSO selections by proportionally increasing the tuning parameter that was initially chosen by the smallest BIC.

By the nature of LASSO selection, the larger the tuning parameter, the fewer covariates are selected into the model. Our simulation results agree with this principle. When the tuning parameter increases, the specificities increase and the sensitivities decrease. The decrease in sensitivity with increasing lambda is negligible when the sample size equals 200, but it becomes more noticeable when the sample size drops to 80. The specificity increases markedly with lambda in both small and large datasets.

Generally, according to the geometric means, the performance of MI-LASSO selection improves with a certain level of increase in the BIC-based tuning parameter. However, an extremely large tuning parameter may result in worse performance of MI-LASSO because it will leave many important variables out of the model. From Table 6, when the sample size is large, a moderate proportional increase in the λ initially selected by BIC could improve specificity without affecting sensitivity much. Overall, it appears to be more desirable to increase the tuning parameter of MI-LASSO when the sample size is large than when it is relatively small. Additionally, the MSE results suggest that larger sample sizes lead to more accurate models, but varying the tuning parameter has minimal impact on the precision.
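The effect of scaling λ can be illustrated with a small group-LASSO solver in Python. This sketch assumes the group-penalty formulation in which the coefficients of one covariate across all imputed datasets form a group, so a covariate is either kept or dropped in every imputation at once; it uses a plain proximal-gradient loop rather than the authors' algorithm. The base value of λ is a hypothetical stand-in for the BIC-selected one, and the jittered copies of `X` only mimic imputed datasets for the demo.

```python
import numpy as np

def group_soft_threshold(B, t):
    """Shrink each column of B (one covariate across imputations) toward 0."""
    norms = np.linalg.norm(B, axis=0)
    return B * np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))

def mi_group_lasso(X_list, y, lam, n_iter=2000):
    """Proximal gradient for
    sum_d ||y - X_d b_d||^2 / (2n) + lam * sum_j ||(b_1j, ..., b_Dj)||_2."""
    n, p = X_list[0].shape
    B = np.zeros((len(X_list), p))  # row d holds the coefficients b_d
    step = 1.0 / max(np.linalg.norm(X, 2) ** 2 / n for X in X_list)
    for _ in range(n_iter):
        grad = np.stack([X.T @ (X @ b - y) / n for X, b in zip(X_list, B)])
        B = group_soft_threshold(B - step * grad, step * lam)
    return B

def lambda_max(X_list, y):
    """Smallest lam for which every group is shrunk to exactly zero."""
    G = np.stack([X.T @ y / len(y) for X in X_list])
    return float(np.linalg.norm(G, axis=0).max())

rng = np.random.default_rng(0)
n, p, D = 200, 8, 3
X = rng.standard_normal((n, p))
y = X[:, 0] + X[:, 1] + rng.standard_normal(n)
y = y - y.mean()
# Stand-in for D imputed datasets: small perturbations of X (demo only).
X_list = [X + 0.05 * rng.standard_normal((n, p)) for _ in range(D)]

base_lam = 0.05  # hypothetical; the paper starts from the BIC-selected value
counts = {f: int(np.any(mi_group_lasso(X_list, y, f * base_lam) != 0, axis=0).sum())
          for f in (1, 1.5, 2, 3)}
B_null = mi_group_lasso(X_list, y, 1.01 * lambda_max(X_list, y))
```

Here `counts` maps each multiplier (1x, 1.5x, 2x, 3x) to the number of covariates retained, mirroring the layout of Table 6, while `B_null` shows that a sufficiently large λ removes every covariate.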

Table 6.

Mean sensitivity (SEN), specificity (SPE), their geometric mean (G), and median mean-squared error (MSE) of MI-LASSO method with different tuning parameters, λ, in 200 replicates of simulations: lambda=‘Ax’ means the tuning parameter is A times the one selected by the smallest BIC. Data have 20 covariates and 200 or 80 observations. Missing values are evenly generated in all 20 candidate variables with about 20 complete cases under MAR mechanism.

	MI-LASSO (SNR = 3)
	Lambda	SEN	SPE	G	MSE
N = 80
	1x	98.3%	68.0%	81.8%	1.0
	1.5x	97.8%	69.7%	82.6%	1.0
	2x	96.3%	73.7%	84.2%	1.0
	3x	91.1%	79.8%	85.3%	0.9
N = 200
	1x	100.0%	66.4%	81.5%	0.4
	1.5x	100.0%	68.8%	82.9%	0.4
	2x	100.0%	73.5%	85.7%	0.4
	3x	99.9%	81.3%	90.1%	0.3


6. Discussion

According to our simulation studies, conducting variable selection on complete cases has low sensitivity, resulting in a failure to identify critically important variables. The studies show that both MI-LASSO and MI-stepwise have higher sensitivity and overall better geometric mean and model accuracy than applying LASSO and stepwise to complete cases. Under our simulation settings, MI-LASSO and MI-stepwise perform similarly to applying the LASSO and stepwise methods to the original full dataset, respectively. The same conclusions were drawn by Chen and Wang [4]. The simulations above examined the performance of both MI-LASSO and MI-stepwise on data with various missing data mechanisms, signal-to-noise ratios, missingness patterns, and sample sizes. In general, the two MI-variable selection methods are recommended for data with a substantial amount of missing values. They can provide selection results comparable to those obtained when the corresponding methods are applied to the original full dataset, assuming the imputations are well conducted.

Furthermore, in practice, it is difficult to distinguish whether the missingness mechanism is truly MAR or MNAR based only on observed values. If imputation is performed under an MAR assumption when data are actually MNAR, this could lead to biased inference. However, our simulation results showed that the performances of the two MI-variable selection methods under the simulated MAR and MNAR mechanisms are similar when the missingness percentage is not too large. Also, according to Chen and Wang [4], the MI-LASSO and MI-stepwise methods do not require an ignorable missingness mechanism to be useful for variable selection, as long as the multiple imputation is well conducted.

The MI-LASSO method is easy to implement and computationally efficient since it obtains the parameter estimates using matrix calculations. This method can be modified for generalized linear models. The performance of MI-LASSO is less sensitive than that of MI-stepwise to missingness percentages, missingness patterns and different signal-to-noise ratios. According to our simulation, it can be superior to the MI-stepwise method, especially for high-dimensional datasets with a large number of variables and a small sample size. Chen and Wang also demonstrated similar results in their study [4]. However, the MI-LASSO method may suffer from the common problem of LASSO: over-selection. A possible reason for over-selection is that LASSO selects variables based on the ‘hidden correlation’ [16], the sum of the correlation between covariate $x_{j}$ and the ordinary least squares (OLS) residual in the absence of $x_{j}$ (direct correlation), and the correlation between $x_{j}$ and the LASSO residual in the absence of $x_{j}$ (indirect correlation). LASSO would continue incorporating insignificant variables until all the significant variables are selected in some later steps. To make this method select fewer covariates, a reasonable increase in the tuning parameter is acceptable for data with large sample sizes, according to our simulation results.

The MI-stepwise method is easy to implement and can be easily extended to generalized linear models. It can achieve variable selection with both high sensitivity and high specificity for data with a high signal-to-noise ratio and for data with a larger ratio of sample size to the number of covariates. Stepwise regression is free from the indirect correlation [16], and is thus less likely to over-select variables. After refitting the model with all selected variables, the results showed that most of the selected covariates were significantly associated with the response at the significance level $α_{1}$ . Thus MI-stepwise is a good methodological choice when a conservative selection is expected; however, its performance deteriorates markedly as the proportion of missingness in candidate covariates increases. Further, it is negatively affected by a large number of candidate covariates, which makes it a time-consuming method. In general, it is not suited to selecting variables for data with a large number of variables and small sample sizes.

In the current study, the MI-LASSO method only works with ordinary linear regression. In subsequent work, we will consider extending it to generalized linear models, especially the logistic model, and apply it to binary responses using the ACL Fall Prevention Data. Additionally, although LASSO and stepwise are widely used tools for variable selection, which criterion to use for model selection is still under debate. The current study only uses BIC for MI-LASSO and the significance level of the coefficient estimates for MI-stepwise. In future studies, the performance of MI-LASSO using different criteria such as cross-validation and AIC will also be evaluated.

7. Conclusion

Our research conducted several simulation studies to evaluate and compare two variable selection methods, MI-stepwise and MI-LASSO, for multiply-imputed data and applied them to the ACL Fall Prevention Data. According to our results, their performances are better than directly applying the corresponding methods to complete cases. MI-LASSO and MI-stepwise each have their own traits and could outperform the other under different scenarios. Overall, MI-LASSO and MI-stepwise are two desirable methods for variable selection on data with a significant number of missing values.

Acknowledgments

The Administration for Community Living/Administration on Aging (ACL/AoA) is the primary funding source for this national study. The findings, conclusions and opinions expressed do not necessarily represent official Administration for Community Living/Administration on Aging policy. The National Council on Aging (NCOA) served as the Technical Assistance Resource Center for this initiative and collected data on program participation from grantees.

Funding Statement

This work was supported by National Natural Science Foundation of China [71771211]; NCOA, Lewin Subcontract Agreement [TLG14007-5176.20]; ACL, Prevention and Public Health Fund-NCOA, National Falls Prevention Resource Center Cooperative Agreement Grant [90FP0023]; National Bureau of Statistics of China Research Fund (2019LD07). Dr. Y Li is supported by Platform of Public Health & Disease Control and Prevention, Major Innovation & Planning Interdisciplinary Platform for the “Double-First Class” Initiative, Renmin University of China.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

1. Bühlmann P. and Van De Geer S., Statistics for high-dimensional data: Methods, theory and applications, Springer Science & Business Media, 2011.
2. Bergen G., Falls and fall injuries among adults aged ≥65 years-united states, 2014, MMWR. Morbidity and mortality weekly report 65, 2016.
3. Centers for Disease Control and Prevention. Take a Stand on Falls: What Can Older Adults Do to Prevent Falls? (2017). https://www.cdc.gov/features/older-adult-falls/index.html.
4. Chen Q. and Wang S., Variable selection for multiply-imputed data with application to dioxin exposure study, Stat. Med. 32 (2013), pp. 3646–3659.
5. Chetverikov D., Liao Z., and Chernozhukov V., On cross-validated lasso in high dimensions, Annal. Stat. 49 (2021), pp. 1300–1317.
6. Cho J., Smith M.L., Ahn S., Kim K., Appiah B., and Ory M.G., Effects of an evidence-based falls risk-reduction program on physical activity and falls efficacy among oldest-old adults, Front. Public. Health. 2 (2015), pp. 182.
7. Clemson L., Cumming R.G., Kendig H., Swann M., Heard R., and Taylor K., The effectiveness of a community-based program for reducing the incidence of falls in the elderly: a randomized trial, J. Am. Geriatr. Soc. 52 (2004), pp. 1487–1494.
8. Collins L.M., Schafer J.L., and Kam C.M., A comparison of inclusive and restrictive strategies in modern missing data procedures, Psychol. Methods 6 (2001), pp. 330.
9. Crookston N.L. and Finley A.O., Yaimpute: an r package for knn imputation, J. Stat. Softw. 23 (2008), pp. 1–16.
10. Efroymson M.A., Multiple regression analysis. in Mathematical Methods for Digital Computers, A. Ralston and H. S. Wilf, eds., John Wiley, New York, 1960.
11. Fan J. and Li R., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. 96 (2001), pp. 1348–1360.
12. Administration for Community Living, Evidence-based falls prevention programs financed solely by 2017 prevention and health funds (hhs-2017-acl-aoa-fpsg-0206), Tech. Rep., Administration for Community Living/Administration on Aging: Washington, DC, USA, 2017.
13. Friedman S.M., Munoz B., West S.K., Rubin G.S., and Fried L.P., Falls and fear of falling: which comes first? a longitudinal prediction model suggests strategies for primary and secondary prevention, J. Am. Geriatr. Soc. 50 (2002), pp. 1329–1335.
14. George E.I. and McCulloch R.E., Approaches for bayesian variable selection, Stat. Sin. 7 (1997), pp. 339–373.
15. Heymans M.W., van Buuren S., Knol D.L., van Mechelen W., and de Vet H.C., Variable selection under multiple imputation using the bootstrap in a prognostic study, BMC. Med. Res. Methodol. 7 (2007), pp. 33.
16. Ho H.Y., The lasso and its model selection criteria, Ph.D. diss., Hong Kong University of Science and Technology, 2014.
17. Kadane J.B. and Lazar N.A., Methods and criteria for model selection, J. Am. Stat. Assoc. 99 (2004), pp. 279–290.
18. Kubat M. and Matwin S., Addressing the curse of imbalanced training sets: one-sided selection, in ICML, Vol. 97. Nashville, USA, 1997, pp. 179–186.
19. Kulinski K.P., Boutaugh M., Smith M.L., Ory M.G., and Lorig K., Setting the stage: measure selection, coordination, and data collection for a national self-management initiative, Front. Public Health. 2 (2015), pp. 206.
20. Li F., Transforming traditional tai ji quan techniques into integrative movement therapy-tai ji quan: Moving for better balance, J. Sport. Health. Sci. 3 (2014), pp. 9–15.
21. National Council on Aging: Highest Tier Evidence-Based Health Promotion/Disease Prevention Programs. (2018). https://www.ncoa.org/wp-content/uploads/Title-IIID-Highest-Tier-EBPs-June28.2018.
22. Ory M.G., Smith M.L., Jiang L., Lee R., Chen S., Wilson A.D., Stevens J.A., and Parker E.M., Fall prevention in community settings: results from implementing stepping on in three states, Front. Public Health 2 (2015), pp. 232.
23. Ory M.G., Smith M.L., Parker E.M., Jiang L., Chen S., Wilson A.D., Stevens J.A., Ehrenreich H., and Lee R., Fall prevention in community settings: results from implementing tai chi: Moving for better balance in three states, Front. Public Health 2 (2015), pp. 258.
24. Raghunathan T.E., Lepkowski J.M., Van Hoewyk J., and Solenberger P., A multivariate technique for multiply imputing missing values using a sequence of regression models, Surv. Methodol. 27 (2001), pp. 85–96.
25. Rubin D.B., Multiple Imputation for Nonresponse in Surveys, Vol. 81, John Wiley & Sons, Hoboken, New Jersey, 2004.
26. Ruuskanen J. and Ruoppila I., Physical activity and psychological well-being among people aged 65 to 84 years, Age. Ageing 24 (1995), pp. 292–296.
27. Sattin R.W., Lambert Huber D.A., Devito C.A., Rodriguez J.G., Ros A., Bacchelli S., Stevens J.A., and Waxweiler R.J., The incidence of fall injury events among the elderly in a defined population, Am. J. Epidemiol. 131 (1990), pp. 1028–1037.
28. Sinharay S., Stern H.S., and Russell D., The use of multiple imputation for the analysis of missing data, Psychol. Methods 6 (2001), pp. 317.
29. Smith M.L., Jiang L., and Ory M.G., Falls efficacy among older adults enrolled in an evidence-based program to reduce fall-related risk: sustainability of individual benefits over time, Family Commun. Health 35 (2012), pp. 256–263.
30. Smith M., Towne S., Herrera-Venson A., Cameron K., Horel S., Ory M., Gilchrist C., Schneider E., DiCocco C., and Skowronski S., Delivery of fall prevention interventions for at-risk older adults in rural areas: findings from a national dissemination, Int. J. Environ. Res. Public. Health 15 (2018), pp. 2798.
31. Stevens J.A., A cdc compendium of effective fall interventions: What works for community-dwelling older adults (2010).
32. Tibshirani R., Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B 58 (1996), pp. 267–288.
33. Tinetti M.E. and Williams C.S., The effect of falls and fall injuries on functioning in community-dwelling older persons, J. Gerontol. Ser. A: Biol. Sci. Med. Sci. 53 (1998), pp. M112–M119.
34. Van Buuren S., Boshuizen H.C., and Knook D.L., Multiple imputation of missing blood pressure covariates in survival analysis, Stat. Med. 18 (1999), pp. 681–694.
35. Van Buuren S., Brand J.P., Groothuis-Oudshoorn C.G., and Rubin D.B., Fully conditional specification in multivariate imputation, J. Stat. Comput. Simul. 76 (2006), pp. 1049–1064.
36. White I.R. and Carlin J.B., Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values, Stat. Med. 29 (2010), pp. 2920–2931.
37. Wood A.M., White I.R., and Royston P., How should variable selection be performed with multiply imputed data?, Stat. Med. 27 (2008), pp. 3227–3246.
38. Yuan M. and Lin Y., Model selection and estimation in regression with grouped variables, J. R. Stat. Soc.: Ser. B 68 (2006), pp. 49–67.
39. Yuan Y.C., Multiple imputation for missing data: Concepts and new development (version 9.0), Vol. 49, Rockville, MD: SAS Institute Inc; 2010, pp. 1–11.
40. Zou H., The adaptive lasso and its oracle properties, J. Am. Stat. Assoc. 101 (2006), pp. 1418–1429.
