PLOS ONE. 2021 Jul 28;16(7):e0254720. doi: 10.1371/journal.pone.0254720

Evaluating the impact of multivariate imputation by MICE in feature selection

Maritza Mera-Gaona 1,*, Ursula Neumann 2, Rubiel Vargas-Canas 1, Diego M López 1
Editor: Zaher Mundher Yaseen
PMCID: PMC8318311  PMID: 34320016

Abstract

Handling missing values is a crucial step in preprocessing data in Machine Learning. Most of the algorithms available for feature selection and for classification or estimation analyze complete datasets. Consequently, in many cases, the strategy for dealing with missing values is to use only instances with full data or to replace missing values with a mean, mode, median, or constant value. Usually, discarding incomplete samples or replacing missing values with such basic techniques biases subsequent analyses of the datasets. Aim: To demonstrate the positive impact of multivariate imputation on the feature selection process in datasets with missing values. Results: We compared the outcome of the feature selection process on complete datasets, on incomplete datasets with missingness rates between 5% and 50%, and on datasets imputed by basic techniques and by multivariate imputation, using well-known feature selection algorithms. The datasets imputed by multivariate imputation obtained the best feature selection results compared with the datasets imputed by basic techniques and the non-imputed incomplete datasets. Conclusions: Considering the results obtained in the evaluation, applying multivariate imputation by MICE reduces bias in the feature selection process.

Introduction

Missing data is a common problem in real-world datasets. Even when researchers work hard to avoid them, missing values frequently occur for different reasons. Missingness can cause problems in analyzing the data because most statistical methods and packages exclude subjects with any missing value. The result is that analyses are performed only with complete cases, reducing precision and leading to biased results. Although removing incomplete data is a fast and straightforward technique, it is also risky, since applying it assumes that the discarded data do not influence the dataset. As a result of discarding cases with missing values, datasets can lose many instances of interest [1].

Considering the above, before deciding how to handle missing values in a dataset, researchers must determine what the missing values depend on. The choice of a correct strategy will ensure an appropriate dataset to support subsequent analyses such as feature selection and classification.

According to Rubin [2,3], there are three mechanisms of missing values: (i) Missing Completely At Random (MCAR), (ii) Missing At Random (MAR), and (iii) Missing Not At Random (MNAR). Missingness is MCAR if the probability of having missing data depends neither on the observed data nor on the missing values themselves. For example, when a sensor's battery runs out, the sensor stops sending data to the servers. Missing data is called MAR when the probability of missingness is related to other available information but not to the unobserved data itself, which means that missingness in some variables depends on others. An example is that women often avoid revealing their age in surveys (gender is related to missingness in the age variable). MNAR occurs if the probability of missingness depends on the values of the unobserved variables themselves; for example, people with high salaries avoid revealing their incomes in surveys. For some researchers, the MAR and MNAR mechanisms are similar and indistinguishable [4].

Many studies have explored mechanisms for handling missing values in different fields [5–13]. Although choosing a method may be difficult, most studies conclude that imputation is better than removing data, because deleting data can bias the datasets as well as subsequent analyses of them [14]. Consequently, data imputation is an important preprocessing task in Machine Learning.

An additional problem in recent years is the proliferation of datasets with hundreds or even tens of thousands of variables. Thus, feature selection (FS) has become an option for reducing high dimensionality, redundant features, and noise in datasets [15]. Nevertheless, in real scenarios it is necessary to deal with missing values in the datasets, and the most common FS techniques consider only datasets with complete data in the independent variables.

According to [16], missing values can also be present in the target variable in the classification context; for example, when a classification or estimation model is evaluated, the model predicts values for the target variable of the test data. However, when a dataset has missing values in the features, we must find a way to handle them and perform preprocessing tasks to obtain a dataset with complete data. Commonly, the missing data problem is solved by removing the instances or features with missing values or by replacing the missing values using basic mechanisms such as the mean or mode. Although these strategies are easy to implement, they change the distribution of the datasets and may bias subsequent Machine Learning analyses, for instance the feature selection or classification processes. On the one hand, the methods used to handle missing values can eliminate from the dataset (i) relevant features or (ii) instances that reveal the importance of the relevant features. On the other hand, the machine learning models may be trained using only a part of the original datapoints.

Some studies have proposed new techniques to carry out FS on datasets with missing values [17–19]. Although these studies showed promising results, the authors' experiments did not evaluate the effect of data imputation on the datasets to determine whether or not the imputation methods bias the FS process. Moreover, the experiments in [17] and [19] were carried out using only missingness rates of 10% or less.

In previous studies, we evaluated how feature selection improved the performance of the classification of epileptic events and normal brain activity in Electroencephalograms [20,21]. The experiments were carried out using datasets with high dimensionality in a scenario with the need of reducing the computational complexity. The results indicated that the best subset of relevant features was selected by an approach based on Ensemble Feature Selection (EFS).

We thus proposed a Framework of Ensemble Feature Selection to improve the selection of relevant features in datasets with high dimensionality [22]. Nonetheless, one of the weaknesses of the original proposal was the handling of datasets with missing values. In the real world, datasets have a high probability of containing incomplete data, which means that handling missing values is necessary before selecting relevant features. This renders the results of FS uncertain when the dataset has incomplete data.

This research aims to describe how data imputation can improve feature selection on datasets with missing data and avoid biasing the dataset. To this end, we show the impact of missing values on the FS process by implementing a data imputation algorithm and evaluating it on different datasets, comparing the FS process on datasets whose missing values were not handled versus imputed datasets. This paper is organized as follows: Section 2 presents the datasets used to evaluate our proposal and theoretical descriptions of basic mechanisms for handling missing values, Multivariate Imputation, Multiple Imputation, and Feature Selection. In Section 3, the evaluation and results are presented. Section 4 discusses the results. Finally, the main conclusions are laid out in Section 5.

Materials and methods

Systematic mapping studies in software engineering

To review works related to FS and data imputation, we carried out two systematic mappings focused on identifying studies related to imputation and to the assembly of feature selection algorithms, following the guidelines described by Petersen [23]. We used two search strings, one for each topic:

  • Imputation data: (imputation data) and (missing values or missingness rates or incomplete data or incomplete dataset)

  • Feature selection: ("framework" and "ensemble") and ("dimensionality reduction" or "feature selection") and ("EEG" and "automatic") and ("detector" or "reading" or "recognition" or "analysis").

The searches guided by these keywords were used to find relevant papers in the IEEE, PubMed, and Science Direct databases. The analysis of the papers followed review criteria based on the quality of their contributions, particularly the proposal of imputation and of the assembly of feature selection algorithms.

Datasets

This research uses four datasets [24–27], Breast-cancer, Letter-recognition, Statlog (Heart), and Spambase, from the UCI Machine Learning Repository [28] to evaluate our proposal. These collections include categorical and numerical features and contain data from different fields. Besides, the datasets are available to everyone and do not require privileges to be accessed.

The Breast-Cancer dataset contains data provided by the Oncology Institute [24]. Each instance is described by 9 attributes and represents information from a patient.

Letter-recognition is a dataset that represents the 26 capital letters of the English alphabet [25]. The dataset was built from black-and-white pixel representations of the letters in 20 different fonts. Each representation was randomly distorted to obtain 20,000 instances, and each instance was converted into 16 numerical features.

The Statlog–(Heart) dataset contains information about heart diseases. This dataset is a modified version of the Heart Disease dataset [26].

The Spambase dataset is a collection of spam and non-spam emails [27]. Each email is described by 57 attributes and is classified as spam or non-spam (work or personal emails).

Table 1 describes the number of categorical and numerical features and the number of instances in each dataset.

Table 1. Datasets.

Dataset Categorical Numerical Instances
Breast-Cancer 9 0 286
Letter-recognition 0 16 20000
Statlog—(Heart) 7 6 269
Spambase 0 57 4601

Removing data

The most basic method for handling missing values in datasets is removing data. However, this option can delete all instances of a class, remove relevant variables, unbalance the dataset, and generate biases in classification or prediction.

Listwise

Listwise deletion removes all data for a case with at least one missing value. If the dataset contains a small number of instances, this strategy can remove all samples of one or more classes. Moreover, removing cases frequently unbalances the dataset.

Dropping variables

Dropping variables is a good option when the variables with missing values are insignificant. Nonetheless, it is difficult to know the relevant features without making a feature selection analysis. Considering the above, imputation is usually better than dropping variables.
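As a minimal sketch of both removal strategies, the following pandas snippet (with a hypothetical toy table) contrasts listwise deletion and dropping variables:

import numpy as np
import pandas as pd

# Hypothetical toy dataset with missing values encoded as NaN.
df = pd.DataFrame({"age": [34, np.nan, 51, 29],
                   "income": [40.0, 52.0, np.nan, 31.0],
                   "target": [0, 1, 1, 0]})

listwise = df.dropna()        # listwise deletion: drop rows with any missing value
dropped = df.dropna(axis=1)   # dropping variables: drop columns with any missing value
print(listwise.shape, dropped.shape)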

Imputation

Imputation replaces missing values with substitute values. There is a wide variety of imputation methods, whose main differences lie in the process used to calculate the new values. It is worth mentioning that imputation does not necessarily give better results, because a suitable imputation method cannot always be found.

Mean, median and mode replacement

A primary imputation method is to replace missing values with the overall mean, median, or mode. Although it is a fast strategy, this method has clear disadvantages; for example, mean, median, or mode imputation reduces the variance of the dataset.
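A one-line version of this strategy, sketched with scikit-learn's SimpleImputer (the toy matrix is illustrative):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 4.0],
              [np.nan, 6.0]])

# strategy can be "mean", "median", or "most_frequent" (mode).
X_mean = SimpleImputer(strategy="mean").fit_transform(X)
print(X_mean)  # NaNs replaced by each column's mean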

Multivariate imputation by chained equations

Multivariate imputation by chained equations (MICE) is an imputation method based on Fully Conditional Specification, in which a different model imputes each incomplete attribute. Hence, MICE can impute missing values in datasets with continuous, binary, and categorical attributes by using a different model for each attribute. Each attribute is modeled according to its distribution; for example, binary or categorical variables are modeled using logistic regression and continuous variables using linear regression. In these regression models, the modeled attribute is the dependent variable, and the remaining attributes are the independent variables. The MICE algorithm assumes that missing values are MAR, which means that using it on a dataset where the missing values are not MAR could generate biased imputations.

The MICE algorithm is described below.

  1. Build a basic imputation for every missing value in the dataset.

  2. Set the values of one feature (Fx) back to missing.

  3. The observed values of Fx are used to train a prediction model in which Fx is the dependent variable and the other features are the independent variables.

  4. The missing values of Fx are replaced with the predictions calculated by the model built in step 3.

  5. Steps 2–4 are repeated for each feature with missing values. When all features with missing values have been imputed by a prediction model, one cycle or iteration is finished.

  6. Steps 2–5 are repeated for n iterations, and the imputations are updated at each cycle. The objective is to use the number of iterations to achieve a stable imputation. The imputed dataset is obtained in the last iteration.

The researcher determines the number of iterations n. More iterations can improve the imputation, but too many may promote overfitting. The number of iterations that yields a stable imputation depends on the data and the missing values and must be found by testing different values.

According to the MICE algorithm, we obtain one imputed dataset when the algorithm performs n iterations. Additionally, if the previous process is repeated m times, we get multiple imputed datasets.
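As a concrete illustration, the following is a minimal sketch using scikit-learn's IterativeImputer, which implements a MICE-style chained-equations procedure for numeric features (the toy matrix and parameter values are illustrative assumptions, not the setup used in this study):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates the experimental API)
from sklearn.impute import IterativeImputer

# Toy numeric dataset with missing entries encoded as NaN.
X = np.array([[7.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [10.0, 5.0, np.nan],
              [8.0, 8.0, 9.0]])

# max_iter corresponds to the number of cycles n in steps 2-5:
# each incomplete feature is regressed on the remaining features,
# and the imputations are updated at every cycle.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)

Note that IterativeImputer uses a single regressor type for all features; handling mixed categorical and numerical attributes with per-attribute models, as described above, requires a dedicated MICE implementation.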

Multiple imputation

Multiple imputation is a mechanism for creating multiple complete datasets, in which m predictions are calculated for each missing value [29]. The goal of multiple imputation is to predict or estimate the missing values while accounting for the uncertainty about the missing values and the imputation model. The approach is not meant only for generating new values, since a single replacement value could be calculated by more straightforward means [30].

Fig 1 shows the main steps of Multiple Imputation.

Fig 1. Main steps used in multiple imputation [31].


MICE is a technique used to produce multiple imputations and pool them into one imputed dataset [32]. The standard strategy in Multiple Imputation is building a large joint model to predict all attributes with missing values. However, this approach is challenging to implement when there are hundreds of variables of different types. In these cases, MICE is an excellent option for handling the types [33], since the algorithm establishes a series of regression models according to the distribution and type of each attribute.

The setting of multiple imputation by MICE

  • Number of Imputations

A critical task in Multiple Imputation is defining the number of datasets to impute. All imputed datasets contain the same observed data; initially, they differ only in the imputed values. The literature recommends between 5 and 10 imputed datasets [29].

  • Data to train the prediction models.

A relevant aspect of setting up MICE is selecting the variables or attributes included in the imputation process. Usually, all available variables are used, especially those used in subsequent analyses such as feature selection and classification/estimation. In [29], the authors consider three important points in selecting variables and their values: (i) the imputation model must be more general than the analysis model; thus, if possible, one should include "auxiliary" variables (in the imputation regression model of a variable) that will not be used in the analysis process but offer information to improve the imputations; (ii) defining whether the imputations are calculated at the item level or the summary level; for example, when there are variables constructed from other variables, it is necessary to decide whether it is better to impute the original variables or the resulting variables; and (iii) determining whether the imputations will be calculated to reflect raw scores or standardized scores.

In some cases, researchers have proposed using outcome-dependent variables in the imputation model to include all possible relationships in the imputation regression model [34]. This assumption is based on the fact that the outcome depends on variables to impute. If outcomes are excluded from the imputation process, imputations will be calculated assuming that these are independent of the outcome.

  • Pooling

The m imputed datasets generated by multiple imputation are pooled according to the types of the attributes with missing values in the dataset. For instance, binary or categorical attributes are usually pooled by taking the mode of the m predictions, and numerical attributes by taking the mean of the m predictions [31].
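A minimal sketch of this m-fold imputation and pooling, again using scikit-learn's IterativeImputer as a stand-in MICE engine (toy data; sample_posterior=True makes each run draw different plausible values):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[7.0, 2.0, 3.0],
              [4.0, np.nan, 6.0],
              [10.0, 5.0, np.nan],
              [8.0, 8.0, 9.0]])

m = 5  # number of imputed datasets; the literature recommends 5-10 [29]
imputations = [
    IterativeImputer(sample_posterior=True, max_iter=10,
                     random_state=seed).fit_transform(X)
    for seed in range(m)
]

# Pooling for numerical attributes: the mean of the m predictions per cell.
X_pooled = np.mean(imputations, axis=0)
# For categorical attributes one would instead take the mode per cell,
# e.g. scipy.stats.mode over the stacked imputations.
print(X_pooled)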

Feature selection

Select K Best

Select K Best (SKB) is an FS algorithm for selecting a set of features according to the k highest scores. Scores are calculated using a test between each feature and the target. Some of the most widely used tests are described below.

Chi-squared

Chi-squared is a statistical test to evaluate features and determine whether these are dependent or independent of the target. If a feature is independent, it is considered irrelevant to the classification. Eq 1 describes the Chi-squared test.

\chi^2 = \sum \frac{(\text{Observed frequency} - \text{Expected frequency})^2}{\text{Expected frequency}} \qquad (1)

where the observed frequency is the number of observations of a class and the expected frequency is the number of observations of the class that would be expected if there were no relationship between the feature and the target.

F-test and ANOVA F-test

These are statistical tests that evaluate the significance of each feature for improving a classification or regression model. The resulting scores are used to select the subset of the k most informative features.
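A brief sketch of SKB with these scoring functions in scikit-learn (the iris data and k = 2 are illustrative stand-ins):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, f_classif

X, y = load_iris(return_X_y=True)  # chi2 requires non-negative feature values

skb_chi2 = SelectKBest(chi2, k=2).fit(X, y)     # Chi-squared scores
skb_f = SelectKBest(f_classif, k=2).fit(X, y)   # ANOVA F-value scores

print(skb_chi2.get_support(indices=True))  # indices of the k best features
print(skb_f.get_support(indices=True))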

Recursive feature elimination

The RFE algorithm uses an external estimator to evaluate the importance of the features. Recursively, it removes features and evaluates the remaining subset by building a model with the current subset of features. The accuracy of the model is used to identify which features contribute to improving the prediction. The algorithm thus eliminates the worst-performing features on a model until the best subset is found.
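A minimal RFE sketch with a logistic-regression estimator (the estimator and k are illustrative choices, not necessarily those used in this study):

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# The external estimator supplies the feature ranking at each elimination step.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # rank 1 marks the selected features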

Feature importance measures for tree models

The importance of a feature is calculated using Decision Trees or the ensemble methods built upon them. One of the most common measures is Gini importance [35], based on the impurity reduction of splits: it accumulates every time a feature is used to split a node, weighted by the number of samples the split divides. When a tree model is trained using scikit-learn [36], a vector with the importance of each feature is calculated; its values sum to 1 and can be used as scores to select the k most important features, where the feature with the highest score is the most important.
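A short sketch of this score-based selection with a random forest (toy data; k = 2 is arbitrary):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(random_state=0).fit(X, y)
importances = forest.feature_importances_  # Gini importances, summing to 1

k = 2
top_k = np.argsort(importances)[::-1][:k]  # indices of the k most important features
print(top_k, importances[top_k])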

Metrics to evaluate imputation methods

To evaluate the imputation quality, we calculated the mean absolute error (MAE) and root mean square error (RMSE) between imputed and original values for numerical variables, and accuracy for categorical variables.

  • MAE and RMSE

The mean absolute error and the root mean square error are the standard statistical metrics used to evaluate models [37].

MAE and RMSE are described by Eqs 2 and 3,

\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |e_i| \qquad (2)

\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} e_i^2} \qquad (3)

where e_i (i = 1, 2, …, n) denotes the n model errors. To evaluate the quality of the imputations, we used Eqs 4 and 5, where \hat{Y}_i represents the value predicted by imputation and Y_i the real value.

\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |\hat{Y}_i - Y_i| \qquad (4)

\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2} \qquad (5)
  • Accuracy

Accuracy is a measure used to evaluate the performance of classification models; it estimates the overall probability of correctly classifying a test sample [38] and is the complement of the error rate described by Eq 6,

\text{error} = \frac{FN + FP}{N} \qquad (6)

where N is the total number of instances, FN the number of false negatives, and FP the number of false positives.
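These metrics can be computed directly, as in the following sketch (the values stand in for the original and imputed entries at the masked cells):

import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error

# Numerical variables: original vs. imputed values at the masked cells.
y_true = np.array([3.0, 5.0, 2.0, 8.0])
y_imputed = np.array([2.8, 5.4, 2.1, 7.5])
mae = mean_absolute_error(y_true, y_imputed)           # Eq 4
rmse = np.sqrt(mean_squared_error(y_true, y_imputed))  # Eq 5

# Categorical variables: fraction of imputed labels equal to the originals.
labels_true = ["a", "b", "a", "c"]
labels_imputed = ["a", "b", "b", "c"]
acc = accuracy_score(labels_true, labels_imputed)
print(mae, rmse, acc)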

Results

In this section, we present the evaluation results, analyzing the quality of the imputation and the behavior of the feature selection process on datasets imputed by MICE and by mean/mode replacement.

Evaluating the quality of imputation

The described datasets were used to create simulated, realistic datasets with missing values. Each original dataset was transformed considering 10 levels of missing data (missingness rates of 5%, 10%, 15%, …, 45%, and 50%), and for each level the transformation was repeated 100 times. Hereafter we refer to datasets with randomly removed values as simulated datasets. Each simulated dataset was then imputed using MICE and mean/mode replacement.
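A minimal sketch of this MCAR-style masking (the function name and seed are illustrative):

import numpy as np

def simulate_mcar(X, rate, seed=0):
    """Randomly mask a fraction `rate` of the cells of X with NaN (MCAR)."""
    rng = np.random.default_rng(seed)
    X = X.astype(float).copy()
    X[rng.random(X.shape) < rate] = np.nan
    return X

rates = [r / 100 for r in range(5, 55, 5)]  # 5%, 10%, ..., 50%
# e.g. 100 simulated datasets per rate: simulate_mcar(X, rate, seed=i) for i in range(100)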

Once the imputed datasets were generated and processed, we compared them with the original datasets to evaluate the quality of the imputations.

Outcomes

The MICE algorithm was evaluated by comparing the imputed values with the real values in the original dataset. We further compared the imputation calculated by MICE with the imputation calculated by mean/mode replacement, the most common and basic solution for imputing missing values. For this, the simulated datasets were imputed 100 times with the two methods for each missingness rate. To evaluate whether the imputed values were correct, we calculated MAE and RMSE for imputations in numerical variables and accuracy for categorical variables.

  • Evaluation: Breast-cancer

Fig 2 describes the overall accuracy of imputations calculated by the MICE algorithm and mode imputation.

Fig 2. Accuracy of imputations by MICE and mode.


Fig 3 describes the per-feature accuracy of the imputations calculated using the MICE algorithm.

Fig 3. Accuracy of imputations of MICE by feature.


Fig 4 describes the per-feature accuracy of the imputations calculated using mode replacement.

Fig 4. Accuracy of mode imputations by feature.


Table 2 describes the overall accuracy of imputations calculated using MICE and mode replacement. According to the results, the overall accuracy achieved by MICE was better than the overall accuracy achieved by mode replacement in 100% of the missingness rates.

Table 2. The overall accuracy of MICE and mode.
RATE MICE MODE
0.05 0.979 0.970
0.1 0.957 0.950
0.15 0.936 0.921
0.2 0.912 0.900
0.25 0.889 0.869
0.3 0.865 0.846
0.35 0.843 0.812
0.4 0.819 0.801
0.45 0.793 0.781
0.5 0.768 0.750

According to the results given in Tables 11 and 12 in Appendix A, the accuracy of the MICE imputation outperformed the accuracy of mode replacement in 97.59% of the missingness rates by feature. Mode replacement obtained the best performance only for missingness rates of 35% and 40% in feature F3.

  • Evaluation: Letter-recognition

Table 3 describes the overall MAE and RMSE of imputations calculated using MICE and mean replacement.

Table 3. The overall MAE and RMSE.
RATE MICE-MAE MICE-RMSE MEAN-MAE MEAN-RMSE
0.05 0.0773 0.1063 0.1171 0.1542
0.1 0.0798 0.1094 0.1171 0.1543
0.15 0.081 0.1109 0.1173 0.1544
0.2 0.0834 0.1138 0.1171 0.1542
0.25 0.0872 0.1187 0.1172 0.1544
0.3 0.0924 0.1256 0.1171 0.1541
0.35 0.0929 0.1263 0.1172 0.1542
0.4 0.0938 0.1271 0.1172 0.1542
0.45 0.0948 0.1279 0.1176 0.1544
0.5 0.0952 0.1283 0.1176 0.1544

According to the results, the overall MAE and RMSE achieved by MICE were better than the overall MAE and RMSE achieved by mean replacement in 100% of missingness rates.

The MAE and RMSE of the imputation calculated using MICE outperformed those of mean replacement in 99.62% and 96.87% of the missingness rates by feature, respectively. Given the number of variables in the letter-recognition dataset, we calculated these percentages but do not show the results for each feature and missingness rate.

  • Evaluation: Statlog (heart)

Considering that the statlog dataset has both categorical and numerical variables, we show MAE and RMSE for the numerical variables and accuracy for the categorical variables.

Table 4 describes the overall accuracy of imputations calculated using MICE and mode replacement. According to the results, the overall accuracy achieved by MICE was better than the overall accuracy achieved by mode replacement in 100% of missingness rates.

Table 4. The overall accuracy of MICE and mode.
RATE MICE MODE
0.05 0.984 0.982
0.1 0.966 0.962
0.15 0.949 0.943
0.2 0.931 0.923
0.25 0.914 0.904
0.3 0.895 0.885
0.35 0.877 0.866
0.4 0.858 0.848
0.45 0.838 0.83
0.5 0.819 0.812

According to the results given in Tables 13 and 14 in Appendix B, the accuracy of the MICE imputation outperformed the accuracy of mode replacement in 75% of the missingness rates by feature.

Table 5 describes the overall MAE and RMSE of imputations calculated using MICE and mean replacement.

Table 5. The overall MAE and RMSE.
RATE MICE-MAE MICE-RMSE MEAN-MAE MEAN-RMSE
0.05 0.141 0.189 0.173 0.217
0.1 0.142 0.191 0.175 0.22
0.15 0.145 0.195 0.174 0.22
0.2 0.146 0.198 0.174 0.22
0.25 0.152 0.205 0.174 0.22
0.3 0.156 0.212 0.174 0.22
0.35 0.162 0.22 0.174 0.22
0.4 0.167 0.226 0.175 0.221
0.45 0.168 0.226 0.174 0.221
0.5 0.168 0.225 0.174 0.221

According to the results, the overall MAE and RMSE achieved by MICE were better than MAE and RMSE achieved by mean replacement in 100% of the missingness rates.

In accordance with the results given in Tables 15 and 16 in Appendix B, the MAE of the MICE imputation outperformed the MAE of mean replacement in 81.42% of the missingness rates by feature. Also, Tables 17 and 18 in Appendix B show that the RMSE of the MICE imputation outperformed the RMSE of mean replacement in 68.85% of the missingness rates by feature.

  • Evaluation: Spambase

Table 6 describes the overall MAE and RMSE of imputations calculated using MICE and mean replacement.

Table 6. The overall MAE and RMSE.
RATE MICE-MAE MICE-RMSE MEAN-MAE MEAN-RMSE
0.05 0.0185 0.0508 0.0229 0.0568
0.1 0.0187 0.0509 0.023 0.0569
0.15 0.0189 0.0511 0.0231 0.0565
0.2 0.0195 0.0522 0.0234 0.057
0.25 0.02 0.0531 0.0234 0.0566
0.3 0.0215 0.0553 0.0237 0.0565
0.35 0.0234 0.0579 0.0241 0.0568
0.4 0.0233 0.0572 0.0241 0.0565
0.45 0.0239 0.0579 0.0247 0.0571
0.5 0.0241 0.0575 0.0249 0.0569

According to the results, the overall MAE and RMSE achieved by MICE outperformed the overall MAE and RMSE achieved by mean replacement in 100% and 60% of missingness rates, respectively.

The MAE and RMSE of the imputation calculated using MICE outperformed those of mean replacement in 77.36% and 70% of the missingness rates by feature, respectively. Given the number of variables in the spambase dataset, we calculated these percentages but do not show the results for each feature and missingness rate.

Densities

Figs 5–7 describe each variable's probability density function for the complete breast-cancer dataset and for the datasets imputed using MICE and mode replacement. According to the figures, the imputation calculated using MICE has densities similar to those of the complete dataset. However, most densities of the datasets imputed using mode replacement not only changed in shape but also increased the probabilities of some values compared to the complete dataset.

Fig 5. Distribution of complete breast-cancer dataset.

Fig 6. Distribution of breast-cancer dataset imputed by MICE.

Fig 7. Distribution of breast-cancer dataset imputed by mode.

Considering the number of variables of the statlog (heart), spambase, and letter-recognition datasets, the densities of their variables are not shown in this paper; however, they were plotted and analyzed. This analysis showed that the imputations calculated by MICE keep their densities close to those of the complete dataset, while the densities of the datasets imputed using mode/mean replacement changed in shape and probability.

Evaluating feature selection

To evaluate the impact of missing values on feature selection, we simulated realistic datasets using the datasets described in Table 1. For each dataset, we generated three datasets with three different missingness rates: 25%, 30%, and 35%. Five FS algorithms were then used to select relevant features on the complete dataset, on the dataset imputed using MICE, on the dataset imputed using basic methods (mean/mode replacement), on the dataset without the instances with missing values (listwise deletion), and on the dataset without the variables with missing values (dropping variables).
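The comparison can be sketched as follows: run a selector on the complete data and on the handled data, then intersect the selected feature sets (cf. Figs 8–11). The dataset, rate, and k below are illustrative assumptions:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer

def selected_set(X, y, k=2):
    """Return the indices of the k best features as a set."""
    skb = SelectKBest(f_classif, k=k).fit(X, y)
    return set(skb.get_support(indices=True))

X, y = load_iris(return_X_y=True)            # stand-in for a complete dataset
X_missing = X.copy()
rng = np.random.default_rng(0)
X_missing[rng.random(X.shape) < 0.25] = np.nan   # 25% MCAR missingness

X_mean = SimpleImputer(strategy="mean").fit_transform(X_missing)

overlap = selected_set(X, y) & selected_set(X_mean, y)
print(overlap)  # larger overlap -> less bias introduced by the handling method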

Letter-recognition

Table 7 describes the letter-recognition dataset’s relevant features that were selected using five algorithms of feature selection.

Table 7. Results of feature selection of the letter-recognition dataset.
Dataset Algorithm Selected features
Full Select K Best (Chi-squared) F11, F13, F15
Select K Best (F-value) F7, F11, F14
Select K Best (ANOVA F-value) F7, F11, F13
Feature Recursive Elimination F12, F13, F14
Feature Importance F9, F13, F15

The results of applying the five feature selection algorithms on the datasets generated from simulations of missing values in the letter-recognition dataset are described in Table 19 in Appendix C. In each simulated dataset, missing values were handled by MICE imputation, mean/mode replacement, listwise deletion, and dropping variables.

Fig 8 describes the intersection between the set of relevant features of the letter-recognition dataset and each simulated dataset’s relevant features.

Fig 8. Intersections of sets of relevant features of the letter-recognition dataset and its simulated datasets.


According to the results in Table 7 and Fig 8, the datasets imputed using MICE obtained the same set of relevant features as the complete dataset. The results also showed that the datasets imputed using basic methods, or processed by removing the instances or variables with missing values, were affected by these changes and produced different sets of relevant features.

Statlog (heart)

Table 8 describes the relevant features of the statlog dataset that were selected using five algorithms of feature selection.

Table 8. Results of feature selection of the statlog dataset.
Dataset Algorithm Selected features
Full Select K Best (Chi-squared) F3, F9, F12, F13
Select K Best (F-value) F3, F9, F12, F13
Select K Best (ANOVA F-value) F3, F9, F12, F13
Feature Recursive Elimination F8, F10, F12
Feature Importance F3, F9, F12, F13

The results of applying the five feature selection algorithms on the datasets generated from simulations of missing values in the statlog dataset are presented in Table 20 in Appendix D. In each simulated dataset, missing values were handled by MICE imputation, mean/mode replacement, listwise deletion, and dropping variables.

Fig 9 describes the intersection between the set of relevant features of the statlog dataset and each simulated dataset’s relevant features.

Fig 9. Intersections of sets of relevant features of the statlog dataset and its simulated datasets.


Spambase

Table 9 describes the relevant features of the spambase dataset that were selected using five algorithms of feature selection.

Table 9. Results of feature selection of the spambase dataset.
Dataset Algorithm Selected features
Full Select K Best (Chi-squared) F25, F27, F55, F56, F57
Select K Best (F-value) F7, F19, F21, F23, F53
Select K Best (ANOVA F-value) F7, F19, F21, F23, F53
Feature Recursive Elimination F7, F27, F53
Feature Importance F7, F16, F21, F52, F53

The results of applying the five feature selection algorithms on the datasets generated from simulations of missing values in the spambase dataset are shown in Table 21 in Appendix E. In each simulated dataset, missing values were handled by MICE imputation, mean/mode replacement, listwise deletion, and dropping variables.

Fig 10 describes the intersection between the set of relevant features of the spambase dataset and the relevant features of each simulated dataset.

Fig 10. Intersections of sets of relevant features of the spambase dataset and its simulated datasets.


Breast-cancer

Table 10 describes the relevant features of the breast-cancer dataset selected using five feature selection algorithms.

Table 10. Results of feature selection of the breast-cancer dataset.
Dataset Algorithm Selected features
Full Select K Best (Chi-squared) F3, F4, F5, F6
Select K Best (F-value) F4, F5, F6, F9
Select K Best (ANOVA F-value) F4, F5, F6, F9
Feature Recursive Elimination F5, F6, F7
Feature Importance F1, F3, F6, F8

The results of applying the five feature selection algorithms on the datasets generated from simulations of missing values in the breast-cancer dataset are given in Table 22 in Appendix F. In each simulated dataset, missing values were handled by MICE imputation, mean/mode replacement, listwise deletion, and dropping variables.

Fig 11 describes the intersection between the set of relevant features of the breast-cancer dataset and the set of relevant features of each simulated dataset.

Fig 11. Intersections of sets of relevant features of the breast-cancer dataset and its simulated datasets.


Discussion

In this work, we built an implementation of the MICE algorithm to evaluate the impact of multivariate and multiple imputation on datasets with categorical, numerical, and mixed categorical and numerical variables. The algorithm was assessed using datasets with different rates of missing values, which were generated randomly. The results were compared with those of simple methods for handling missing values. The evaluation measured the quality of the imputation, the distribution of the imputed variables, and the impact on feature selection in the imputed datasets.

To set up our MICE algorithm for each dataset, we took into account aspects discussed in previous studies. For instance, Graham [39] suggests increasing the number of imputations to as many as 40 to improve imputation power when datasets have a high percentage of missing values. In practice, Graham also notes that many imputations could be inappropriate given the dataset size, the models used to impute it, the amount of missingness in the data, and the available computing resources. The imputation of a single dataset can take minutes, hours, or days; thus, for datasets with hundreds or thousands of attributes and instances and a high missingness rate, it would be impractical to calculate 40 imputed datasets. Consequently, we used more imputations for small datasets and fewer imputations for datasets of larger dimensions.

The RMSE values reported in the previous section show a good performance of all imputations calculated using MICE at all missingness rates. According to [40], a good RMSE must be low (<0.3), and all RMSE results of the MICE algorithm are below 0.3, both overall and by feature.

The evaluation conducted in this paper was divided into two stages: reviewing the quality of the imputation and analyzing the results of FS on imputed datasets. For the breast-cancer dataset, the overall accuracy achieved by MICE was better than the overall accuracy of mode replacement at 100% of the missingness rates (Fig 2 and Table 2). The accuracy calculated by feature showed that some features obtained better accuracy than others (Figs 3 and 4). For feature F3, the accuracy achieved using mode replacement was better than that of the MICE imputations; when FS was carried out, feature F3 was not considered relevant, which suggests that this feature could represent noise. Analyzing the accuracies calculated for missingness rates by feature, the MICE imputation outperformed mode replacement in 97.54% of cases.

For the letter-recognition dataset, the overall RMSE and MAE achieved by MICE were better than those of mean replacement at 100% of the missingness rates (Table 3). In the per-feature analysis of missingness rates, the MAE and RMSE achieved by MICE were better than those of mean replacement in 99.62% and 96.87% of cases, respectively.

The statlog (heart) dataset contains mixed numerical and categorical variables. For this dataset, the overall accuracy achieved by MICE was better than the overall accuracy of mode replacement at 100% of the analyzed missingness rates (Table 4). The per-feature accuracies showed that MICE was better than mode replacement in 75% of cases. The overall RMSE and MAE achieved by MICE were better than those of mean replacement at 100% of the missingness rates (Table 5). Moreover, the per-feature RMSE and MAE of MICE outperformed those of mean replacement in 68.85% and 81.42% of cases, respectively. The features for which the imputation calculated by mode/mean replacement was better than the corresponding MICE imputation were F2, F4, F5, and F6; these were not selected as relevant features in the FS process carried out on the complete statlog (heart) dataset.

In the spambase dataset (Table 6), the overall RMSE and MAE achieved by MICE were better than those of mean replacement at 60% and 100% of the missingness rates, respectively. In the per-feature analysis, the RMSE and MAE of MICE were better than those of mean replacement in 70% and 77.36% of cases, respectively. In the per-feature results, the percentages for RMSE and MAE decreased because the spambase dataset has a high number of features, several of which are irrelevant and can be considered noise.

In addition, Figs 5–7 show how the distribution of the breast-cancer dataset changed when the imputation method was mode replacement, while the imputation performed by the MICE algorithm achieved a distribution similar to that of the original dataset. Likewise, the statlog (heart), spambase, and letter-recognition datasets showed changes in their distributions when the mode replacement method was employed.

To evaluate the impact of missing values on the FS process, three simulated datasets were built for each complete dataset (breast-cancer, letter-recognition, statlog, and spambase) using different missingness rates (25%, 30%, and 35%), and four techniques for handling missing values were applied to each simulated dataset. The results showed the differences among the sets of relevant features of the datasets processed with these techniques. For letter-recognition, the datasets imputed by MICE and the complete dataset obtained the same set of relevant features (see Table 7 and Fig 8), whereas the datasets imputed by basic replacement and the datasets with dropped data changed their sets of relevant features relative to that of the complete dataset. In the statlog dataset, the set of relevant features of the datasets imputed by MICE differed in two or three elements from that of the complete dataset (see Table 8 and Fig 9); the other sets of relevant features differed in 6, 15, and 17 elements. The FS results on the spambase dataset showed that the set most similar to the relevant features of the complete dataset was that of the dataset imputed by MICE (Table 9 and Fig 10). For breast-cancer, the sets of relevant features of the complete dataset and of the datasets imputed by MICE differ in very few elements; the sets of relevant features of the datasets imputed by basic replacement changed slightly (Table 10 and Fig 11), and the sets from listwise deletion and dropping variables have many different elements.

In general, FS results showed that the datasets imputed by using MICE obtained sets of relevant features similar to the sets of relevant features calculated using the complete datasets. Likewise, the biggest differences were found between the sets of relevant features of the complete datasets and the datasets imputed by listwise and dropping variables.

In previous work, researchers have compared data imputation methods to determine how to improve the quality of the imputation or to establish which method is best for a specific missing-value mechanism, type of variable, or dataset. Nonetheless, most studies did not evaluate the impact of imputing or removing data on the feature selection process. For instance, a comparison of imputation methods was carried out in [41]. The study used a complete dataset about smoking habits to simulate datasets with missingness rates of 5% and 15%. Although the authors showed imputation results for different missingness simulations, they considered only two missingness rates, and the dataset contained only categorical variables. Another work compared basic imputation and deletion methods; the results showed that pairwise deletion was the best technique for the dataset used in the evaluation [42]. That study evaluated missingness rates of 5%, 10%, 15%, 20%, and 30%, but considered neither imputation in numerical variables nor an analysis of feature selection. A comparison of six methods for missing data was carried out in [43]. For the evaluation, simulated datasets were built using different missingness rates (from 5% to 45%). Although the evaluation showed a detailed and reliable process for assessing the quality of the imputations calculated by the most popular methods, it did not show the impact of imputation on the feature selection process. The comparison of imputation methods in [44] also evaluated some of the most common techniques; however, the results only showed the limitations of the algorithms for imputing data in any dataset. In general, most studies evaluated imputation quality but did not present the impact of missing values on subsequent analyses. Some researchers have studied the influence of missing values on classification, but they did not review the effect caused by missing or imputed values on the FS process [45–48].

This study has several limitations, and the results on the quality of imputation for each method are limited to the datasets used. Hence, researchers should study their own datasets to decide which method to apply. In this sense, the main contribution of our research is not providing a universal solution for handling missing values or selecting relevant features; rather, it presents evidence about the need to consider the impact of missing values on the feature selection process.

As future work, we are considering improving the implementation of the MICE algorithm to use further regression models and other methods to predict or estimate missing values. Another enhancement to consider is evaluating whether the imputations improve when the target variable is included as an independent variable in predicting missing values. Besides, although we designed an experiment to evaluate the impact of missing values on the feature selection process, we did not simulate the three different mechanisms of missing values. For future work, we consider that the evaluation and results should be analyzed treating each mechanism of missing values separately.

Conclusions

In this paper, the implementation and evaluation of the MICE algorithm are described. MICE was developed to handle missing data, a commonly occurring problem in real datasets. Our implementation was evaluated by calculating imputed datasets from simulated datasets with different missingness rates. The evaluation compared the imputation quality of the MICE algorithm and basic methods, and the results of feature selection on complete datasets and imputed datasets (by MICE and basic methods).

According to the overall results of accuracy, MAE and RMSE shown in the evaluation, the MICE algorithm was better than the basic methods in all missingness rates used to simulate missing values in the breast-cancer, letter-recognition, and statlog (heart) datasets. For the spambase dataset, although the MICE algorithm achieved an overall MAE in all missingness rates better than the overall MAE of the basic imputations, the RMSE of the MICE algorithm only outperformed the RMSE of the basic method in 60% of all missingness rates.

The analysis of accuracy, MAE, and RMSE by feature showed that the basic method of imputation outperformed the imputation of the MICE algorithm for some features. According to the feature selection process applied to the complete datasets, these features were not relevant.

The evaluation results showed that for missingness rates greater than 5% and less than 50%, the complete datasets and the imputed datasets calculated using MICE obtained similar distributions of their variables and similar results in the feature selection analyses.

Moreover, the datasets imputed using basic methods showed better results in the feature selection process than the simulated datasets handled by dropping variables or cases with missing values. However, the distributions of the variables of these imputed datasets changed, meaning that the basic methods bias the datasets and, accordingly, that learning models could be biased.

Furthermore, selecting an appropriate method to handle missing values depends on the dataset, the mechanism of missing values, and the missingness rate. This paper showed evidence about the impact of missing values on common subsequent analyses, such as the feature selection process.

Finally, as with any study, this work has limitations, and we cannot conclude that the MICE algorithm is the best method for handling missing values in all situations. However, the evidence presented in this paper shows that, for datasets with missing values, imputation can avoid bias in subsequent analyses better than simply removing data.

Appendixes

Appendix A: Results of breast-cancer

Table 11. Accuracy of MICE by feature.

RATE F1 F2 F3 F4 F5 F6 F7 F8 F9
0.05 0.973 0.988 0.962 0.988 0.991 0.975 0.977 0.974 0.988
0.1 0.943 0.972 0.922 0.975 0.985 0.948 0.953 0.942 0.978
0.15 0.915 0.958 0.881 0.962 0.976 0.924 0.934 0.914 0.966
0.2 0.883 0.937 0.838 0.951 0.967 0.896 0.908 0.883 0.955
0.25 0.853 0.927 0.793 0.936 0.958 0.87 0.885 0.848 0.944
0.3 0.816 0.902 0.752 0.926 0.951 0.843 0.862 0.818 0.934
0.35 0.795 0.884 0.712 0.91 0.938 0.816 0.836 0.786 0.921
0.4 0.758 0.861 0.668 0.9 0.93 0.794 0.809 0.759 0.911
0.45 0.722 0.834 0.63 0.887 0.917 0.763 0.791 0.725 0.899

Table 12. Accuracy of Mode by feature.

RATE F1 F2 F3 F4 F5 F6 F7 F8 F9
0.05 0.962 0.978 0.96 0.988 0.99 0.973 0.977 0.969 0.988
0.1 0.929 0.954 0.919 0.973 0.981 0.945 0.953 0.936 0.978
0.15 0.891 0.93 0.875 0.961 0.971 0.92 0.929 0.904 0.966
0.2 0.859 0.906 0.835 0.949 0.96 0.89 0.903 0.874 0.953
0.25 0.821 0.882 0.791 0.933 0.949 0.86 0.875 0.839 0.942
0.3 0.791 0.859 0.752 0.924 0.941 0.831 0.85 0.808 0.931
0.35 0.758 0.837 0.714 0.909 0.932 0.803 0.828 0.779 0.919
0.4 0.722 0.815 0.669 0.899 0.922 0.776 0.8 0.748 0.906
0.45 0.692 0.794 0.626 0.888 0.911 0.744 0.775 0.711 0.893

Appendix B: Results of statlog (heart)

Table 13. Accuracy of MICE by feature.

RATE F2 F3 F6 F7 F9 F13
0.05 0.984 0.978 0.991 0.977 0.988 0.986
0.1 0.967 0.952 0.982 0.954 0.973 0.967
0.15 0.951 0.928 0.976 0.931 0.961 0.949
0.2 0.932 0.903 0.968 0.907 0.943 0.931
0.25 0.916 0.878 0.961 0.884 0.93 0.914
0.3 0.901 0.854 0.954 0.855 0.91 0.897
0.35 0.879 0.828 0.947 0.833 0.895 0.881
0.4 0.867 0.803 0.939 0.801 0.878 0.858
0.45 0.849 0.772 0.93 0.78 0.863 0.837
0.5 0.833 0.75 0.922 0.756 0.843 0.813

Table 14. Accuracy of mode replacement by feature.

RATE F2 F3 F6 F7 F9 F13
0.05 0.984 0.976 0.991 0.974 0.985 0.98
0.1 0.968 0.95 0.982 0.946 0.967 0.957
0.15 0.952 0.924 0.976 0.922 0.951 0.935
0.2 0.932 0.897 0.969 0.895 0.934 0.912
0.25 0.918 0.872 0.962 0.865 0.921 0.888
0.3 0.902 0.848 0.956 0.838 0.902 0.863
0.35 0.885 0.82 0.95 0.81 0.886 0.848
0.4 0.871 0.797 0.942 0.788 0.867 0.821
0.45 0.856 0.771 0.934 0.764 0.851 0.802
0.5 0.841 0.748 0.926 0.741 0.837 0.78

Table 15. MAE of MICE by feature.

RATE F1 F4 F5 F8 F10 F11 F12
0.05 0.133 0.123 0.089 0.112 0.12 0.199 0.212
0.1 0.134 0.125 0.09 0.119 0.112 0.195 0.22
0.15 0.133 0.128 0.088 0.12 0.119 0.203 0.223
0.2 0.136 0.129 0.09 0.12 0.118 0.202 0.225
0.25 0.144 0.132 0.093 0.125 0.119 0.214 0.236
0.3 0.145 0.14 0.094 0.13 0.125 0.215 0.242
0.35 0.154 0.147 0.097 0.14 0.129 0.225 0.242
0.4 0.157 0.151 0.102 0.143 0.132 0.233 0.251
0.45 0.157 0.15 0.105 0.141 0.133 0.236 0.253
0.5 0.156 0.148 0.1 0.141 0.134 0.241 0.253

Table 16. MAE of mode replacement by feature.

RATE F1 F4 F5 F8 F10 F11 F12
0.05 0.165 0.126 0.091 0.134 0.158 0.286 0.264
0.1 0.159 0.129 0.094 0.143 0.147 0.28 0.26
0.15 0.156 0.13 0.09 0.143 0.149 0.282 0.267
0.2 0.156 0.131 0.089 0.142 0.147 0.282 0.268
0.25 0.159 0.13 0.089 0.143 0.146 0.284 0.271
0.3 0.156 0.13 0.087 0.144 0.149 0.281 0.27
0.35 0.159 0.131 0.089 0.144 0.147 0.281 0.267
0.4 0.157 0.13 0.088 0.146 0.147 0.283 0.267
0.45 0.157 0.13 0.088 0.145 0.148 0.283 0.266
0.5 0.157 0.131 0.087 0.144 0.147 0.283 0.268

Table 17. RMSE of MICE by feature.

RATE F1 F4 F5 F8 F10 F11 F12
0.05 0.163 0.154 0.115 0.138 0.155 0.25 0.272
0.1 0.166 0.16 0.123 0.149 0.147 0.245 0.283
0.15 0.165 0.163 0.12 0.149 0.157 0.256 0.289
0.2 0.166 0.165 0.121 0.152 0.156 0.256 0.298
0.25 0.178 0.169 0.121 0.157 0.159 0.27 0.31
0.3 0.179 0.179 0.124 0.164 0.167 0.274 0.321
0.35 0.19 0.187 0.128 0.178 0.173 0.286 0.323
0.4 0.195 0.193 0.135 0.184 0.175 0.294 0.334
0.45 0.194 0.191 0.139 0.18 0.177 0.297 0.334
0.5 0.195 0.189 0.132 0.179 0.176 0.299 0.33

Table 18. RMSE of mode replacement by feature.

RATE F1 F4 F5 F8 F10 F11 F12
0.05 0.197 0.157 0.116 0.163 0.191 0.31 0.307
0.1 0.191 0.163 0.125 0.177 0.179 0.301 0.305
0.15 0.189 0.166 0.121 0.174 0.186 0.307 0.314
0.2 0.189 0.166 0.12 0.174 0.183 0.306 0.317
0.25 0.191 0.164 0.117 0.176 0.182 0.309 0.322
0.3 0.188 0.167 0.116 0.176 0.187 0.304 0.321
0.35 0.191 0.168 0.117 0.178 0.183 0.305 0.316
0.4 0.189 0.166 0.116 0.179 0.184 0.309 0.317
0.45 0.189 0.168 0.116 0.18 0.187 0.309 0.316
0.5 0.189 0.168 0.114 0.178 0.184 0.309 0.32

Appendix C: Results of feature selection on letter-recognition

Table 19. Results of feature selection of simulated datasets.

dataset Algorithm 25% 30% 35%
Imputed Select K Best (Chi-squared) F11, F13, F15 F11, F13, F15 F11, F13, F15
Select K Best (F-value) F7, F11, F14 F7, F11, F14 F7, F11, F14
Select K Best (ANOVA F-value) F7, F11, F13 F7, F11, F13 F7, F11, F13
Feature Recursive Elimination F12, F13, F14 F12, F13, F14 F12, F13, F14
Feature Importance F9, F13, F15 F9, F13, F15 F9, F13, F15
Basic Select K Best (Chi-squared) F8, F9, F13 F8, F9, F13 F9, F13, F15
Select K Best (F-value) F7, F11, F14 F7, F11, F14 F7, F11, F14
Select K Best (ANOVA F-value) F2, F13, F14 F2, F13, F14 F2, F13, F14
Feature Recursive Elimination F12, F13, F14 F12, F13, F14 F12, F13, F14
Feature Importance F8, F9, F13 F8, F9, F13 F9, F12, F14
Listwise Select K Best (Chi-squared) F8, F13, F15 F2, F13, F15 F5, F13, F15
Select K Best (F-value) F7, F11, F14 F9, F11, F14 F7, F11, F14
Select K Best (ANOVA F-value) F7, F11, F12 F7, F11, F13 F11, F12, F14
Feature Recursive Elimination F3, F5, F13 F12, F14, F15 F12, F13, F14
Feature Importance F9, F12, F13 F9, F13, F15 F12, F13, F16
Dropping variables Select K Best (Chi-squared) F6, F7, F10 F7, F8, F11 F6, F7, F9
Select K Best (F-value) F9, F11, F12 F7, F10, F12 F8, F10, F11
Select K Best (ANOVA F-value) F9, F10, F11 F10, F11, F12 F8, F9, F10
Feature Recursive Elimination F9, F10, F11 F10, F11, F12 F8, F9, F10
Feature Importance F6, F7, F10 F7, F8, F11 F6, F7, F9

Appendix D: Result of feature selection on statlog(heart)

Table 20. Results of feature selection of simulated datasets.

dataset Algorithm 25% 30% 35%
Imputed Select K Best (Chi-squared) F3, F9, F12, F13 F3, F9, F12, F13 F3, F9, F12, F13
Select K Best (F-value) F3, F10, F12, F13 F9, F10,F12, F13 F3, F9, F10, F13
Select K Best (ANOVA F-value) F3, F10, F12, F13 F9, F10,F12, F13 F3, F9, F10, F13
Feature Recursive Elimination F8, F10, F12 F8, F10, F12 F8, F10, F12
Feature Importance F3, F9, F12, F13 F3, F10, F12, F13 F3, F9, F12, F13
Basic Select K Best (Chi-squared) F3,F10,F12, F13 F3,F10,F12, F13 F3,F10,F12, F13
Select K Best (F-value) F8,F10,F12, F13 F3, F8,F10,F12 F8,F10,F12, F13
Select K Best (ANOVA F-value) F8,F10,F12, F13 F3, F8,F10,F12 F8,F10,F12, F13
Feature Recursive Elimination F8, F10, F12 F8, F10, F12 F8, F10, F12
Feature Importance F3, F10,F12, F13 F3, F8,F10,F12 F8,F10,F12, F13
listwise Select K Best (Chi-squared) F2, F3, F11, F12 F2, F3, F9, F13 F2, F7, F8, F9
Select K Best (F-value) F2, F5, F7, F12 F4, F5, F8, F12 F2, F6, F7, F9
Select K Best (ANOVA F-value) F2, F5, F7, F12 F3, F8, F9, F13 F2, F6, F7, F9
Feature Recursive Elimination F2, F3, F11 F3, F7, F13 F6, F7, F9
Feature Importance F2, F3, F5, F7 F3, F8, F9, F13 F2, F6, F7, F9
Dropping variables Select K Best (Chi-squared) F2, F8, F9, F10 F2, F8, F9, F10 F3, F7, F8, F9
Select K Best (F-value) F7, F8, F9, F10 F3, F7, F8, F10 F3, F6, F7, F9
Select K Best (ANOVA F-value) F7, F8, F9, F10 F3, F7, F8, F10 F3, F6, F7, F9
Feature Recursive Elimination F4, F7, F8 F7, F8, F10 F6, F7, F9
Feature Importance F3, F7, F8, F10 F3, F7, F8, F10 F3, F6, F7, F9

Appendix E: Result of feature selection on spambase

Table 21. Results of feature selection of simulated datasets.

dataset Algorithm 25% 30% 35%
Imputed SKB (Chi-squared) F25, F27, F55, F56, F57 F25, F27, F55, F56, F57 F25, F27, F55, F56, F57
SKB (F-value) F7, F19, F21, F23, F53 F7, F19, F21, F23, F53 F7, F17, F21, F23, F56
SKB (ANOVA F-value) F7, F19, F21, F23, F53 F7, F19, F21, F23, F53 F7, F17, F21, F23, F56
FRE F7, F23, F53 F7, F23, F53 F7, F41, F53
Feature Importance F7, F16, F21, F52, F53 F7, F16, F21, F52, F53 F7, F21, F52, F53, F56
Basic SKB (Chi-squared) F25, F27, F55, F56, F57 F16, F27, F55, F56, F57 F16, F27, F55, F56, F57
SKB (F-value) F7, F16, F21, F23, F53 F7, F16, F21, F23, F53 F7, F16, F17, F21, F53
SKB (ANOVA F-value) F7, F16, F21, F23, F53 F7, F16, F21, F23, F53 F7, F16, F17, F21, F53
FRE F7, F23, F53 F7, F23, F53 F7, F23, F53
Feature Importance F7, F16, F21, F52, F53 F16, F21, F23, F52, F53 F16, F21, F52, F56, F55
listwise SKB (Chi-squared) F16, F27, F55, F56, F57 F25, F27, F55, F56, F57 F22, F27, F55, F56, F57
SKB (F-value) F7, F16, F21, F23, F57 F16, F21, F23, F53, F56 F8, F17, F21, F52, F53
SKB (ANOVA F-value) F7, F16, F21, F23, F57 F16, F21, F23, F53, F56 F8, F17, F21, F52, F53
FRE F7, F24, F53 F16, F23, F53 F16, F21, F27
Feature Importance F7, F16, F21, F52, F53 F5, F16, F21, F52, F53 F16, F17, F21, F52, F56
Dropping variables SKB (Chi-squared) F18, F20, F21, F42, F43 F12, F16, F38, F39, F40 F9, F12, F34, F37, F38
SKB (F-value) F14, F16, F18, F19, F40 F6, F12, F13, F16, F18 F9, F10, F12, F14, F35
SKB (ANOVA F-value) F14, F16, F18, F19, F40 F6, F12, F13, F16, F18 F9, F10, F12, F14, F35
FRE F19, F21, F40 F16, F18, F25 F14, F19, F35
Feature Importance F13, F18, F39, F40, F42 F6, F12, F16, F18, F37 F9, F12, F34, F35, F37

Appendix F: Result of feature selection breast-cancer

Table 22. Results of feature selection of simulated datasets.

Dataset | Algorithm | 25% | 30% | 35%
Imputed | Select K Best (Chi-squared) | F3, F4, F5, F6 | F3, F4, F5, F6 | F3, F4, F5, F6
 | Select K Best (F-value) | F4, F5, F6, F9 | F4, F5, F6, F9 | F4, F5, F6, F9
 | Select K Best (ANOVA F-value) | F4, F5, F6, F9 | F4, F5, F6, F9 | F4, F5, F6, F9
 | Feature Recursive Elimination | F5, F6, F7 | F1, F5, F6 | F1, F5, F6
 | Feature Importance | F1, F3, F6, F8 | F1, F3, F6, F8 | F1, F3, F4, F8
Basic | Select K Best (Chi-squared) | F3, F4, F5, F6 | F3, F4, F5, F6 | F3, F4, F6, F9
 | Select K Best (F-value) | F4, F5, F6, F9 | F4, F5, F6, F9 | F4, F5, F6, F9
 | Select K Best (ANOVA F-value) | F4, F5, F6, F9 | F4, F5, F6, F9 | F4, F5, F6, F9
 | Feature Recursive Elimination | F1, F6, F7 | F1, F4, F6 | F5, F7, F9
 | Feature Importance | F1, F3, F6, F8 | F1, F3, F6, F8 | F1, F3, F4, F8
Listwise | Select K Best (Chi-squared) | F1, F2, F3, F4 | F3, F4, F5, F8 | F1, F2, F4, F7
 | Select K Best (F-value) | F3, F5, F7, F9 | F3, F4, F5, F6 | F1, F2, F4, F7
 | Select K Best (ANOVA F-value) | F3, F5, F7, F9 | F3, F4, F5, F6 | F1, F2, F4, F7
 | Feature Recursive Elimination | F5, F7, F9 | F2, F4, F7 | F2, F7, F8
 | Feature Importance | F1, F3, F4, F7 | F4, F5, F6, F7 | F2, F3, F7, F8
Pairwise | Select K Best (Chi-squared) | F2, F3, F4, F7 | F2, F3, F4, F7 | F1, F3, F4, F6
 | Select K Best (F-value) | F2, F3, F4, F7 | F2, F3, F4, F7 | F1, F3, F4, F6
 | Select K Best (ANOVA F-value) | F2, F3, F4, F7 | F2, F3, F4, F7 | F1, F3, F4, F6
 | Feature Recursive Elimination | F4, F5, F7 | F3, F4, F5 | F1, F3, F4
 | Feature Importance | F2, F3, F4, F6 | F1, F2, F4, F6 | F1, F3, F4, F5
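
As an illustration of how far the strategies in Table 22 drift apart, the following minimal Python sketch (not part of the original study) computes the Jaccard similarity between two subsets copied from the table: Select K Best (Chi-squared) at 25% missingness on the imputed versus the listwise-deleted dataset.

    # Illustrative sketch: agreement between two feature subsets from Table 22
    # (breast-cancer, Select K Best with Chi-squared, 25% missingness).
    imputed = {"F3", "F4", "F5", "F6"}    # selected on the MICE-imputed dataset
    listwise = {"F1", "F2", "F3", "F4"}   # selected after listwise deletion

    jaccard = len(imputed & listwise) / len(imputed | listwise)
    print(f"Jaccard similarity: {jaccard:.2f}")  # 0.33: the subsets disagree substantially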

Acknowledgments

The authors are grateful to Mr Colin McLachlan for suggestions related to the English text.

Data Availability

The Breast-Cancer dataset is available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). The Letter-recognition dataset is available at https://archive.ics.uci.edu/ml/datasets/Letter+Recognition. The Statlog – (Heart) dataset is available at https://archive.ics.uci.edu/ml/datasets/statlog+(heart). The Spambase dataset is available at https://archive.ics.uci.edu/ml/datasets/spambase. These are all third-party data, publicly available to everyone without access privileges.

Funding Statement

The work was funded by a grant from Colciencias, the Colombian Agency of Science, Technology, and Innovation, under Funding call 647-2015, project: “Mechanism of selection of relevant features for the automatic detection of epileptic seizures”; the funder provided support in the form of a scholarship for MMG but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. Additionally, the University of Cauca provided support in the form of salaries for DML and RVC, and the Fraunhofer Center for Applied Research on Supply Chain Services SCS provided support in the form of a salary for UN. However, the employers did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.

References

  • 1.Schafer J. L. and Graham J. W., “Missing data: Our view of the state of the art,” Psychol. Methods, vol. 7, no. 2, pp. 147–177, 2002. [PubMed] [Google Scholar]
  • 2.Rubin D. B., “Inference and Missing Data,” Biometrika, vol. 63, no. 3, pp. 581–592, 1976. https://www.jstor.org/stable/2335739 [Google Scholar]
  • 3.Rubin D. B., Multiple Imputation for Nonresponse in Surveys. New York: Wiley, 1990. [Google Scholar]
  • 4.Perkins N. J. et al., “Principled Approaches to Missing Data in Epidemiologic Studies,” Am. J. Epidemiol., vol. 187, no. 3, pp. 568–575, 2018. doi: 10.1093/aje/kwx348 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Quinteros M. E. et al., Use of data imputation tools to reconstruct incomplete air quality datasets: A case-study in Temuco, Chile. Elsevier Ltd, 2019. [Google Scholar]
  • 6.Cheliotis M., Gkerekos C., Lazakis I., and Theotokatos G., “A novel data condition and performance hybrid imputation method for energy efficient operations of marine systems,” Ocean Eng., vol. 188, no. June, p. 106220, 2019. [Google Scholar]
  • 7.Williams D. A., Nelsen B., Berrett C., Williams G. P., and Moon T. K., “A comparison of data imputation methods using Bayesian compressive sensing and Empirical Mode Decomposition for environmental temperature data,” Environ. Model. Softw., vol. 102, pp. 172–184, 2018. [Google Scholar]
  • 8.Lan Q., Xu X., Ma H., and Li G., “Multivariable Data Imputation for the Analysis of Incomplete Credit Data,” Expert Syst. Appl., vol. 141, p. 112926, 2019. [Google Scholar]
  • 9.Young-Saver D. F., Gornbein J., Starkman S., and Saver J. L., “Handling of Missing Outcome Data in Acute Stroke Trials: Advantages of Multiple Imputation Using Baseline and Postbaseline Variables,” J. Stroke Cerebrovasc. Dis., vol. 27, no. 12, pp. 3662–3669, 2018. doi: 10.1016/j.jstrokecerebrovasdis.2018.08.040 [DOI] [PubMed] [Google Scholar]
  • 10.Delaporte G., Cladière M., and Camel V., “Missing value imputation and data cleaning in untargeted food chemical safety assessment by LC-HRMS,” Chemom. Intell. Lab. Syst., vol. 188, no. February, pp. 54–62, 2019. [Google Scholar]
  • 11.Chittora P. et al., “Prediction of Chronic Kidney Disease -A Machine Learning perspective,” IEEE Access, vol. 9, 2021. [Google Scholar]
  • 12.Ali L. et al., “An Optimized Stacked Support Vector Machines Based Expert System for the Effective Prediction of Heart Failure,” IEEE Access, vol. 7, pp. 54007–54014, 2019. [Google Scholar]
  • 13.Raihan-Al-Masud M. and Rubaiyat Hossain Mondal M., “Data-driven diagnosis of spinal abnormalities using feature selection and machine learning algorithms,” PLoS One, vol. 15, no. 2, pp. 1–21, 2020. doi: 10.1371/journal.pone.0228422 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.McKnight P., McKnight K., Sidani S., and Figueredo A., Missing data: A gentle introduction. The Guildord Press, 2007. [Google Scholar]
  • 15.Guyon I., “An Introduction to Variable and Feature Selection,” J. Mach. Learn. Res., vol. 3, pp. 1157–1182, 2003. [Google Scholar]
  • 16.Seaman S., Galati J., Jackson D., and Carlin J., “What is meant by ‘missing at random’?,” Stat. Sci., vol. 28, no. 2, pp. 257–268, 2013. [Google Scholar]
  • 17.Aussem A. and Rodrigues de Morais S., “A conservative feature subset selection algorithm with missing data,” Neurocomputing, vol. 73, no. 4–6, pp. 585–590, 2010. [Google Scholar]
  • 18.Doquire G. and Verleysen M., “Feature selection with missing data using mutual information estimators,” Neurocomputing, vol. 90, pp. 3–11, 2012. [Google Scholar]
  • 19.Qian W. and Shu W., “Mutual information criterion for feature selection from incomplete data,” Neurocomputing, vol. 168, pp. 210–220, 2015. [Google Scholar]
  • 20.Mera-Gaona M., Vargas-Canas R., and Lopez D. M., “Towards a Selection Mechanism of Relevant Features for Automatic Epileptic Seizures Detection.,” Stud. Health Technol. Inform., vol. 228, no. 4, pp. 722–6, 2016. [PubMed] [Google Scholar]
  • 21.Mera M., Lopez D. M., and Vargas-Canas R., “Feature Selection in EEG Signals to Support Automatic Detection of Epileptiform Events,” University of Cauca, 2021. [Google Scholar]
  • 22.Mera-Gaona M., Lopez D. M., and Vargas-Canas R., “Framework for the Ensemble of Feature Selection Methods,” 2021. [Google Scholar]
  • 23.Petersen K., Vakkalanka S., and Kuzniarz L., “Guidelines for conducting systematic mapping studies in software engineering: An update,” in Information and Software Technology, 2015, vol. 64, pp. 1–18. [Google Scholar]
  • 24.Tan M. and Schlimmer J., “Breast cancer dataset.” Institute of Oncology, University Medical Center, Ljubljana, Yugoslavia, 1988. [Google Scholar]
  • 25.Frey P. W. and Slate D. J., “Letter Recognition Using Holland-Style Adaptive Classifiers,” Mach. Learn., vol. 6, no. 2, pp. 161–182, 1991. [Google Scholar]
  • 26.Janosi A., Steinbrunn W., Pfisterer M., and Detrano R., “Heart Disease Databases.” 1988. [Google Scholar]
  • 27.Hopkins M., Reeber E., Forman G., and Suermondt J., “SPAM E-mail Database.” Hewlett-Packard Labs, 1999. [Google Scholar]
  • 28.Dheeru D. and Karra Taniskidou E., “UCI Machine Learning Repository,” University of California, Irvine, School of Information and Computer Sciences. School of Information and Computer Science, Irvine, CA, 2017. [Google Scholar]
  • 29.Azur M. J., Stuart E. A., Frangakis C., and Leaf P. J., “Multiple Imputation by Chained Equations What is it and how does it work?,” Int J Methods Psychiatr Res, vol. 20, no. 1, pp. 40–49, 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Murray J. S., “Multiple Imputation: A Review of Practical and Theoretical Findings,” Stat. Sci., vol. 33, no. 2, pp. 142–159, 2018. [Google Scholar]
  • 31.van Buuren S. and Groothuis-Oudshoorn K., “mice: Multivariate Imputation by Chained Equations in R,” J. Stat. Softw., vol. 45, no. 3, 2011. [Google Scholar]
  • 32.Raghunathan T. E., Lepkowski J. M., Van Hoewyk J., and Solenberger P., “A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models,” Surv. Methodol., vol. 27, no. 1, pp. 85–95, 2001. [Google Scholar]
  • 33.He Y., Landrum M. B., Harrington D. P., and Catalano P., “Multiple imputation in a large-scale complex survey: a practical guide *,” Stat. Methods Med. Res., vol. 19, pp. 653–670, 2010. doi: 10.1177/0962280208101273 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Moons K. G. M., Donders R. A. R. T., Stijnen T., and Harrell F. E., “Using the outcome for imputation of missing predictor values was preferred,” J. Clin. Epidemiol., vol. 59, no. 10, pp. 1092–1101, Oct. 2006. doi: 10.1016/j.jclinepi.2006.01.009 [DOI] [PubMed] [Google Scholar]
  • 35.Menze B. H. et al., “A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data,” BMC Bioinformatics, vol. 10, no. August, 2009. doi: 10.1186/1471-2105-10-213 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Pedregosa F. et al., “Scikit-learn: Machine Learning in Python,” J. Mach. Learn. Res., vol. 12, no. 1, pp. 2825–2830, 2011. [Google Scholar]
  • 37.Chai T. and Draxler R. R., “Root mean square error (RMSE) or mean absolute error (MAE)? -Arguments against avoiding RMSE in the literature,” Geosci. Model Dev., vol. 7, no. 3, pp. 1247–1250, 2014. [Google Scholar]
  • 38.Landgrebe T. C. W., Paclik P., Duin R. P. W., and Bradley A. P., “Precision-Recall Operating Characteristic (P-ROC) curves in imprecise environments,” Proc.—Int. Conf. Pattern Recognit., vol. 4, no. July 2014, pp. 123–127, 2006. [Google Scholar]
  • 39.Graham J. W., Olchowski A. E., and Gilreath T. D., “How many imputations are really needed? Some practical clarifications of multiple imputation theory,” Prev. Sci., vol. 8, no. 3, pp. 206–213, Sep. 2007. doi: 10.1007/s11121-007-0070-9 [DOI] [PubMed] [Google Scholar]
  • 40.Veerasamy R., Rajak H., Jain A., Sivadasan S., Varghese C. P., and Agrawal R. K., “Validation of QSAR Models -Strategies and Importance,” Int. J. Drug Des. Discov., vol. 2, no. 3, pp. 511–519, 2011. [Google Scholar]
  • 41.Torres Munguía J. A., “Comparison of imputation methods for handling missing categorical data with univariate pattern,” Rev. Metod. Cuantitativos para la Econ. y la Empres., vol. 17, no. 1, pp. 101–120, 2014. [Google Scholar]
  • 42.Lotsi A., Asiedu L., and Katsekpor J., “Comparison of Imputation Methods for Missing Values in Longitudinal Data Under Missing Completely at Random (mcar) mechanism,” African J. Appl. Stat., vol. 4, no. 1, pp. 241–258, 2017. [Google Scholar]
  • 43.Schmitt P., Mandel J., and Guedj M., “A Comparison of Six Methods for Missing Data Imputation,” J. Biom. Biostat., vol. 06, no. 01, pp. 1–6, 2015. [Google Scholar]
  • 44.Musil C. M., Warner C. B., Yobas P. K., and Jones S. L., “A Comparison of Imputation Techniques for Handling Missing Data,” West. J. Nurs. Res., vol. 24, no. 7, p. 815, 2002. doi: 10.1177/019394502762477004 [DOI] [PubMed] [Google Scholar]
  • 45.Farhangfar A., Kurgan L., and Dy J., “Impact of imputation of missing values on classification error for discrete data,” Pattern Recognit., vol. 41, no. 12, pp. 3692–3705, 2008. [Google Scholar]
  • 46.Acuña E. and Rodriguez C., “The Treatment of Missing Values and its Effect on Classifier Accuracy,” Classif. Clust. Data Min. Appl., no. 1995, pp. 639–647, 2004. [Google Scholar]
  • 47.Orczyk T. and Porwik P., “Influence of missing data imputation method on the classification accuracy of the medical data,” J. Med. Informatics Technol., vol. 22, pp. 111–116, 2013. [Google Scholar]
  • 48.Mundfrom D. and Whitcomb A., “Imputing missing values: The effect on the accuracy of classification,” Mult. Linear Regres. Viewpoints, vol. 25, pp. 13–19, 1998. [Google Scholar]

Decision Letter 0

Zaher Mundher Yaseen

15 Mar 2021

PONE-D-20-39195

Evaluating the impact of Multivariate Imputation by MICE in Feature Selection

PLOS ONE

Dear Dr. Mera-Gaona,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Apr 29 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Zaher Mundher Yaseen

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following in the Competing Interests section:

"The authors have declared that no competing interests exist."

We note that one or more of the authors are employed by a commercial company: Fraunhofer Center for Applied Research on Supply Chain Services SCS, Nuremberg, Germany

(1) Please provide an amended Funding Statement declaring this commercial affiliation, as well as a statement regarding the Role of Funders in your study. If the funding organization did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript and only provided financial support in the form of authors' salaries and/or research materials, please review your statements relating to the author contributions, and ensure you have specifically and accurately indicated the role(s) that these authors had in your study. You can update author roles in the Author Contributions section of the online submission form.

Please also include the following statement within your amended Funding Statement.

“The funder provided support in the form of salaries for authors [insert relevant initials], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.”

If your commercial affiliation did play a role in your study, please state and explain this role within your updated Funding Statement.

(2) Please also provide an updated Competing Interests Statement declaring this commercial affiliation along with any other relevant declarations relating to employment, consultancy, patents, products in development, or marketed products, etc.  

Within your Competing Interests Statement, please confirm that this commercial affiliation does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: “This does not alter our adherence to PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests). If this adherence statement is not accurate and there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.

Please include both an updated Funding Statement and Competing Interests Statement in your cover letter. We will change the online submission form on your behalf.

Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This paper provides the experiments to demonstrate the positive impact of MICE in the feature selection methods to handle data sets with missing values. The experimental results show that feature selection with MICE can achieve better performance compared with traditional imputation methods.

Here are some concerns:

1. This paper should provide a more comprehensive survey to introduce some related work, including feature selection and imputation methods.

2. This paper only uses 4 data sets. It would be better to use more data sets. Moreover, since feature selection is often used to handle high-dimensional data sets, it would be better to use some high-dimensional data sets. Note that, the dimensions of used data are too low (i.e., they only contain several tens of features).

3. It would be better to conduct experiments with some state-of-the-art feature selection methods, including some filter, embedding and wrapper feature selection methods.

4. The paper needs a more careful proofreading.

Reviewer #2: The topic is important. The results can be useful. However, there are a number of issues that require attention. These are listed below. Note that some research papers are mentioned below which may be consulted and cited if the authors wish, or better-quality research papers may be used and cited instead:

(1) The literature review of this paper is not satisfactory. It does not clearly describe the relevant research work. A strong literature review should be done. Multivariate imputation by chained equations (MICE) is an existing method described in the literature. The references should be cited at the beginning. In the review, feature selection in a relevant context also needs to be discussed. A number of papers have described how the important features are extracted from a dataset; some of the works are: (a) Data-Driven Diagnosis of Spinal Abnormalities Using Feature Selection and Machine Learning Algorithms. PLOS One, 2020. (b) Prediction of Chronic Kidney Disease - A Machine Learning Perspective, IEEE Access, 2021. (c) An Optimized Stacked Support Vector Machines Based Expert System for the Effective Prediction of Heart Failure, IEEE Access, 2019.

(2) The motivation of this research must be clearly and elaborately described. Why is this research important? Currently, only a brief discussion is provided, which is not enough. Also, Section 4 (discussion) gives some indication, but that is at the end of the paper. So, some of that discussion can be placed in the literature review section.

(3) In Section 3.1.1.3 and in a number of places “100%” is used, such as in “MICE's overall accuracy was 100% better than the overall accuracy of mode replacement”. It is not clear why this is used. Please clarify how this wording makes sense.

(4) In Section 3 – the results section, the authors present a number of graphs and results. However, there is little explanation of the results.

(5) In Section 3 – the results section, a formatting issue is present in the form of “Error! Reference source not found”. This needs to be corrected.

(6) Overall, the presentation and writing of the paper need some editing. Currently, the results and findings are not easy to understand. Particularly, the concept “the impact of imputation in the feature selection process” is not clearly explained in the paper. Also, a clearer discussion is required for “the use of multiple datasets of different diseases and the difference in findings based on datasets”.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Jul 28;16(7):e0254720. doi: 10.1371/journal.pone.0254720.r002

Author response to Decision Letter 0


25 May 2021

Manuscript ID: PONE-D-20-39195

Title: Evaluating the impact of multivariate imputation by MICE in feature selection

Dear Reviewers,

We have uploaded a revised version of the paper. We appreciate the valuable comments, which have contributed to improving the paper’s quality.

A detailed, point-by-point description of the changes made in response to all issues raised in your comments, with an explanation of every change, is appended below.

Comments to the Author

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Response: We used 4 public datasets available in the UCI Machine Learning dataset repository. The description of the datasets used was included in the section “Materials and Methods”.

The 4 datasets are available in references [24], [25], [26], and [27].

Reviewer #1: Yes

Reviewer #2: No

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

Response: Thank you for the comments. We have used a local proofreading service following your recommendation. The proofreader is credited in the Acknowledgments section.

Response to Reviewer 1 Comments

Reviewer #1: This paper provides the experiments to demonstrate the positive impact of MICE in the feature selection methods to handle data sets with missing values. The experimental results show that feature selection with MICE can achieve better performance compared with traditional imputation methods.

Here are some concerns:

1. This paper should provide a more comprehensive survey to introduce some related work, including feature selection and imputation methods.

Response: We appreciate your suggestions and recommendations.

We have extended the introduction section to better introduce the related work. We acknowledge that, in the previous version, we did not describe the motivation of our research well, and this could cause a misunderstanding of the contributions of our work. We found that there are many imputation methods and FS algorithms in the literature; however, previous studies on feature selection and imputation methods did not show how imputation methods could bias the datasets and the feature selection process.

Besides, most feature selection algorithms are designed for complete datasets, and imputation methods are usually evaluated only on the quality of the imputation itself. Such evaluations do not show how the imputation impacts subsequent analyzes (feature selection or classification).

We added a description of the protocol followed to explore the state of the art in data imputation and feature selection: the results of these reviews allowed us to add references to related works in both fields. However, please consider that the objective of our manuscript is not to delve into the advances of feature selection or data imputation. Instead, we wanted to demonstrate in the introduction section that there are many studies in these areas, but they do not explore the effect of data imputation on the feature selection process. Subsequently, the comparison of our work with the related works is introduced in the Discussion section.

We added the following lines to the Introduction and Materials and Methods sections to show how we carried out the systematic mapping:

“To review works related to FS and data imputation, we carried out two systematic mappings focused on identifying studies related to imputation and the assembly of feature selection algorithms following the guidelines described by Petersen [5]. We used two search strings, one for each topic:

Imputation data: (imputation data) and (missing values or missingness rates or incomplete data or incomplete dataset)

Feature selection: ("framework" and "ensemble") and ("dimensionality reduction" or "feature selection") and ("EEG" and "automatic") and ("detector" or "reading" or "recognition" or “analysis”).

The searches guided by the previous keywords were used to find relevant papers in the IEEE, PubMed, and Science Direct databases. The analysis of the papers was conducted following review criteria based on the quality of their contributions, particularly the proposal of imputation and the assembly of feature selection algorithms”…

Also, the following sentences were added to clarify the manuscript’s contributions:

… “However, when a dataset has missing values in the features, we must find a way to handle the missing values and perform preprocessing tasks to get a dataset with complete data. Commonly, the missing data problem is solved by removing the instances or features with missing values or replacing the missing values using basic mechanisms such as mean, mode, etc. Although these strategies are easy to implement, they change the distribution of the datasets and may bias subsequent Machine Learning analyzes, for instance, the feature selection or classification processes. On one hand, the methods used to handle missing values could eliminate from the dataset: (i) relevant features or (ii) instances that reveal the importance of the relevant features. On the other hand, the machine learning models could be trained using only a part of the original data points.”…

…”

In previous studies, we evaluated how feature selection improved the performance of the classification of epileptic events and normal brain activity in Electroencephalograms [18][19]. The experiments were carried out using datasets with high dimensionality in a scenario with the need of reducing the computational complexity. The results indicated that the best subset of relevant features was selected by an approach based on Ensemble Feature Selection (EFS).

We thus proposed a Framework of Ensemble Feature Selection to improve the selection of relevant features in datasets with high dimensionality [20]. Nonetheless, one of the weaknesses of the original proposal was the handling of datasets with missing values. In the real world, datasets have a high probability of having incomplete data, which means that handling missing values is necessary before selecting relevant features. This renders the results of FS uncertain when the dataset has incomplete data.

”..
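
To make the contrast drawn in the first excerpt above (removal or basic replacement versus multivariate imputation) concrete, here is a minimal sketch of mean replacement versus chained-equations imputation. It assumes scikit-learn, with IterativeImputer standing in as a MICE-style imputer; it is an illustration, not the code used in the study.

    # Minimal sketch: mean replacement vs. chained-equations (MICE-style) imputation.
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import SimpleImputer, IterativeImputer

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    X[:, 1] = 0.8 * X[:, 0] + 0.2 * rng.normal(size=200)  # feature 1 correlates with feature 0

    X_miss = X.copy()
    X_miss[rng.random(200) < 0.20, 1] = np.nan  # 20% MCAR missingness in feature 1

    # Basic replacement: every hole gets the column mean, shrinking the
    # variance of the feature and distorting its distribution.
    X_mean = SimpleImputer(strategy="mean").fit_transform(X_miss)

    # Chained equations: each incomplete feature is regressed on the others,
    # so relationships between features are preserved.
    X_mice = IterativeImputer(random_state=0).fit_transform(X_miss)

    print("true variance:        ", round(X[:, 1].var(), 3))
    print("mean-imputed variance:", round(X_mean[:, 1].var(), 3))  # noticeably smaller
    print("MICE-imputed variance:", round(X_mice[:, 1].var(), 3))  # typically closer to true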

2. This paper only uses 4 data sets. It would be better to use more data sets. Moreover, since feature selection is often used to handle high-dimensional data sets, it would be better to use some high-dimensional data sets. Note that, the dimensions of used data are too low (i.e., they only contain several tens of features).

Response: We appreciate your suggestions and recommendations.

We selected these 4 well-known datasets not only because they have been used in previous FS studies, but also because our goal was to show how missing-data handling methods change the distribution of the datasets and how this affects the FS process. In this study, the low dimensionality allowed us to show the effect of the changes graphically. Even so, this task was more difficult for the spambase dataset, which has 57 features.

3. It would be better to conduct experiments with some state-of-the-art feature selection methods, including some filter, embedding, and wrapper feature selection methods.

Response: Thank you for your comment.

We described, in the evaluation section, the results of experiments with (a) three filter algorithms based on three statistical methods (Chi-squared, F-test, and ANOVA F-test), (b) a wrapper method (Feature Importance measures for tree models), and (c) an embedded method (Recursive Feature Elimination).

We kindly ask you to consider that we want to show how imputation methods can help to stabilize the results of the FS process in scenarios where we must deal with missing data. This is the main reason why we generated datasets with different rates of missing data, imputed the generated datasets, applied feature selection algorithms, and finally compared the relevant features selected on the imputed datasets with those selected on the original datasets (a sketch of this loop is given below).
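
For illustration, that evaluation loop can be sketched as follows. This is a simplified reconstruction under our own assumptions, not the exact experimental code: it uses synthetic data, scikit-learn’s IterativeImputer in place of MICE, the f_classif filter (the Chi-squared filter requires non-negative inputs), and Jaccard similarity to score agreement with the complete-data selection.

    # Sketch of the loop: inject MCAR missingness, impute, select features,
    # and compare the selection with the one obtained on complete data.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.feature_selection import SelectKBest, f_classif, RFE
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(1)
    X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                               random_state=1)

    def selected(X, y, k=4):
        """Feature subsets chosen by a filter, RFE, and tree importances."""
        filt = set(SelectKBest(f_classif, k=k).fit(X, y).get_support(indices=True))
        rfe = set(RFE(LogisticRegression(max_iter=1000),
                      n_features_to_select=k).fit(X, y).get_support(indices=True))
        imp = RandomForestClassifier(random_state=1).fit(X, y).feature_importances_
        return {"filter": filt, "rfe": rfe, "importance": set(np.argsort(imp)[-k:])}

    baseline = selected(X, y)  # selection on the complete dataset

    for rate in (0.25, 0.30, 0.35):  # missingness rates used in the appendices
        X_miss = X.copy()
        X_miss[rng.random(X.shape) < rate] = np.nan  # MCAR missingness
        X_imp = IterativeImputer(random_state=1).fit_transform(X_miss)
        for name, subset in selected(X_imp, y).items():
            jac = len(subset & baseline[name]) / len(subset | baseline[name])
            print(f"rate={rate:.2f} {name}: Jaccard vs. complete data = {jac:.2f}")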

4. The paper needs a more careful proofreading.

Response: Thank you for the comments. We have used a local proofreading service following your recommendation.

Response to Reviewer 2 Comments

Reviewer #2: The topic is important. The results can be useful. However, there are a number of issues that require attention. These are listed below. Note that some research papers are mentioned below which may be consulted and cited if the authors wish, or better-quality research papers may be used and cited instead:

(1) The literature review of this paper is not satisfactory. It does not clearly describe the relevant research work. A strong literature review should be done. Multivariate imputation by chained equations (MICE) is an existing method described in the literature. The references should be cited at the beginning. In the review, feature selection in a relevant context also needs to be discussed. Several papers have described how the important features are extracted from a dataset; some of the works are: (a) Data-Driven Diagnosis of Spinal Abnormalities Using Feature Selection and Machine Learning Algorithms. PLOS One, 2020. (b) Prediction of Chronic Kidney Disease - A Machine Learning Perspective, IEEE Access, 2021. (c) An Optimized Stacked Support Vector Machines Based Expert System for the Effective Prediction of Heart Failure, IEEE Access, 2019.

Response: We appreciate your suggestions.

In addition to the studies found in the systematic mappings carried out, we have included the references mentioned by the reviewer.

Furthermore, we have extended the introduction section to better introduce the related work. We acknowledge that we did not describe the motivation of our research well, and this could cause a misunderstanding of the contributions of our work. We found that there are many imputation methods and FS algorithms in the literature; however, previous studies on feature selection and imputation methods did not show how imputation methods could bias the datasets and the feature selection process.

Besides, most feature selection algorithms are designed for complete datasets, and imputation methods are usually evaluated only on the quality of the imputation itself. Such evaluations do not show how the imputation impacts subsequent analyzes (feature selection or classification).

We added a description of the protocol followed to explore the state of the art in data imputation and feature selection: the results of these reviews allowed us to add references to related works in both fields. However, please consider that the objective of our manuscript is not to delve into the advances of feature selection or data imputation. Instead, we wanted to demonstrate in the introduction section that there are many studies in these areas, but they do not explore the effect of data imputation on the feature selection process. Subsequently, the comparison of our work with the related works is introduced in the Discussion section.

Besides, we added the following lines to the Materials and Methods section to show how we carried out the systematic mapping:

“To review works related to FS and data imputation, we carried out two systematic mappings focused on identifying studies related to imputation and the assembly of feature selection algorithms following the guidelines described by Petersen [5]. We used two search strings, one for each topic:

Imputation data: (imputation data) and (missing values or missingness rates or incomplete data or incomplete dataset)

Feature selection: ("framework" and "ensemble") and ("dimensionality reduction" or "feature selection") and ("EEG" and "automatic") and ("detector" or "reading" or "recognition" or “analysis”).

The searches guided by the previous keywords were used to find relevant papers in the IEEE, PubMed, and Science Direct databases. The analysis of the papers was conducted following review criteria based on the quality of their contributions, particularly the proposal of imputation and the assembly of feature selection algorithms”…

Also, the following sentences were added to clarify the manuscript’s contributions:

… “However, when a dataset has missing values in the features, we must find a way to handle the missing values and perform preprocessing tasks to get a dataset with complete data. Commonly, the missing data problem is solved by removing the instances or features with missing values or replacing the missing values using basic mechanisms such as mean, mode, etc. Although these strategies are easy to implement, they change the distribution of the datasets and may bias subsequent Machine Learning analyzes, for instance, the feature selection or classification processes. On one hand, the methods used to handle missing values could eliminate from the dataset: (i) relevant features or (ii) instances that reveal the importance of the relevant features. On the other hand, the machine learning models could be trained using only a part of the original data points.”…

…”

In previous studies, we evaluated how feature selection improved the performance of the classification of epileptic events and normal brain activity in Electroencephalograms [18][19]. The experiments were carried out using datasets with high dimensionality in a scenario with the need of reducing the computational complexity. The results indicated that the best subset of relevant features was selected by an approach based on Ensemble Feature Selection (EFS).

We thus proposed a Framework of Ensemble Feature Selection to improve the selection of relevant features in datasets with high dimensionality [20]. Nonetheless, one of the weaknesses of the original proposal was the handling of datasets with missing values. In the real world, datasets have a high probability of having incomplete data, which means that handling missing values is necessary before selecting relevant features. This renders the results of FS uncertain when the dataset has incomplete data.”..

In addition, we mentioned in the discussion section “This study has several limitations, and the results of the quality of imputation for each method are limited to the used datasets. Hence, the researchers should study their datasets to decide which method to apply. In this sense, the main contribution of our research is not providing a universal solution to handle missing values or select relevant features. Instead, it is showing evidence about the need to consider the impact of missing values in the feature selection process”

(2) The motivation of this research must be clearly and elaborately described. Why is this research important? Currently, a brief discussion is provided, which is not enough. Also, Section 4 (discussion) gives some indication, but that is at the end of the paper. So, some of that discussion can be placed in the literature review section.

Response: Thank you for your comment.

We agree. We have changed part of the introduction section to describe our motivation better. We realized that we did not describe the motivation of our research well, and this could cause a misunderstanding. We have added more details in the Introduction section and revised the Discussion to align the content of both sections, as follows:

“According to (13), missing values could be present in the target variable in the context of classification. For example, when a classification or estimation model is evaluated, missing values are imputed in the test data's target variable and the model predicts values for the target variable. However, when a dataset has missing values in the features, we must find a way to handle the missing values and perform preprocessing tasks to get a dataset with complete data. Commonly, the missing data problem is solved by removing the instances or features with missing values or replacing the missing values by using basic mechanisms such as mean, mode, etc. Although these strategies are easy to implement, they change the distribution of the dataset and could bias the subsequent Machine Learning analyzes, for instance, the Feature Selection or Classification processes. On the one hand, the methods to handle missing values could eliminate from the dataset: (i) relevant features or (ii) instances that reveal the importance of the relevant features. On the other hand, the models could be trained by using only a part of the original data.”

….

“This research aims to describe how Imputation Data can improve the Feature Selection on datasets with missing data and avoid biasing the dataset. For this, we showed the impact of missing values in the FS process by implementing an algorithm of Imputation Data and evaluating it with different datasets to compare the FS process using datasets without handling missing values versus imputed datasets”

(3) In Section 3.1.1.3 and in several places “100%” is used, such as in “MICE's overall accuracy was 100% better than the overall accuracy of mode replacement”. It is not clear why this is used. Please clarify how this wording makes sense.

Response: Thank you for your comment. We have rewritten the explanation of the results.

For example, the overall accuracy achieved by MICE was better than the overall accuracy achieved by mode replacement at 100% of the missing rates, i.e., at every missing rate evaluated.

(4)In Section 3 – the results section, the authors present a number of graphs and results. However, there is little explanation of the results.

Response: We appreciate your comments. We have added references in the discussion section to tables and figures described in the results section. This is because the results section only shows the results of the evaluation, but the analyzes and comparisons with related works are performed in the discussion section.

(5)In Section 3 – the results section, a formatting issue is present in the form of “Error! Reference source not found”. This needs to be corrected.

Response: Thank you for your comment. We have corrected the reference.

(6) Overall, the presentation and writing of the paper need some editing. Currently, the results and findings are not easy to understand. Particularly, the concept “the impact of imputation in the feature selection process” is not clearly explained in the paper. Also, a clearer discussion is required for “the use of multiple datasets of different diseases and the difference in findings based on datasets”.

Response: Thank you for your comment.

We kindly ask you to consider that the motivation of this research is not to propose a new method to impute data or select features; it is to evaluate how data imputation could change the distribution of the dataset and bias the feature selection process. Considering this, we selected 4 well-known datasets available at the UCI Machine Learning repository. The selected datasets have been used in many previous studies of feature selection.

Additionally, we have added more details in the introduction section to make the motivation of our research clearer.

In addition, in the discussion section, we have added some references to tables and figures presented in the results section to relate the results of the evaluation to the final analysis.

Decision Letter 1

Zaher Mundher Yaseen

2 Jul 2021

Evaluating the impact of multivariate imputation by MICE in feature selection

PONE-D-20-39195R1

Dear Dr. Mera-Gaona,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Zaher Mundher Yaseen

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have addressed all my concerns, and I have no more comments. I recommend for acceptance.

Reviewer #2: After the revision, the manuscript has been improved. The authors may wish to elaborate the acronym "MICE" in the abstract and Introduction sections.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Acceptance letter

Zaher Mundher Yaseen

16 Jul 2021

PONE-D-20-39195R1

Evaluating the impact of multivariate imputation by MICE in feature selection

Dear Dr. Mera-Gaona:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Zaher Mundher Yaseen

Academic Editor

PLOS ONE
