2021 Oct 27;8(1):140. doi: 10.1186/s40537-021-00516-9

Table 1.

A summary of various missing data techniques in machine learning

| Refs. | Dataset | Performance objective | Mechanism | Summary | Limitations |
| --- | --- | --- | --- | --- | --- |
| [124] | Balance, Breast, Glass, Bupa, Cmc, Iris, Housing, Ionosphere, Wine | To study the influence of noise on missing value handling methods when noise and missing values are distributed throughout the dataset | MCAR, MAR, MNAR | The study showed that noise had a negative effect on imputation methods, particularly when the noise level was high | Division of qualitative values may have been a problem |
| [85] | German, Glass(g2), heart-statlog, ionosphere, kr-vs-kp, labor, Pima-indians, sonar, balance-scale, iris, waveform, lymphography, vehicle, anneal, glass, satimage, image, zoo, LED, vowel, letter | To experiment with methods for handling incomplete training and test data under various missing data proportions and mechanisms | MCAR, MAR | The study discussed the relative strengths and weaknesses of decision trees for missing value imputation | The approach did not consider correlations between features |
| [125] | Los Angeles ozone pollution and simulated data | To study classification and regression problems using a variety of missing data mechanisms in order to compare the approaches on high-dimensional problems | MCAR, MAR | The authors tested how the potential of the imputation techniques depends on the correlation structure of the data | Random choice of missing values may have weakened the consistency of the experiment |
| [38] | Breast Cancer | To evaluate the performance of statistical and machine learning imputation techniques used to predict recurrence in breast cancer patient data | | The machine learning techniques proved to be the best suited for imputation and led to a significant improvement in prognosis accuracy compared to statistical techniques | Only one type of data was used for the imputation model, so the results may not generalise to other datasets |
| [126] | Iris, Wine, Voting, Tic-Tac-Toe, Hepatitis | To propose a novel technique to impute missing values based on feature relevance | MCAR, MAR | The approach employed mutual information to measure feature relevance and was shown to reduce classification bias | Random choice of missing values may have weakened the consistency of the experiment |
| [127] | Liver, Diabetes, Breast Cancer, Heart, WDSC, Sonar | To experiment on missing data handling using Random Forests, specifically analysing the impact of feature correlation on the imputation results | MCAR, MAR, MNAR | The imputation approach was reported to be generally robust, with performance improving as correlation increased | Random choice of missing values in MNAR could have weakened the consistency of the experiment |
| [128] | Wine, Simulated | To create an improved imputation algorithm for handling missing values | MCAR, MAR, MNAR | Demonstrated the superiority of a new algorithm over existing imputation methods in the accuracy of imputing missing data | Features may have had different percentages of missing data; MAR and MNAR may also have been weakened |
| [129] | De novo simulation, Health surveys S1, S2 and S3 | To compare various techniques for combining internal validation with multiple imputation | MCAR, MAR | The approach was regarded as comprehensive in its use of simulated and real data with different data characteristics, validation strategies and performance measures | The approach was subject to potential bias arising from the relationship between effect strengths and missingness in covariates |
| [130] | Pima Indian Diabetes dataset | To experiment on a missing values approach that takes feature relevance into account | | The results showed that the hybrid algorithm was better than existing methods in terms of accuracy, RMSE and MAE | The missing values mechanism was not considered |
| [13] | Iris, Voting, Hepatitis | To propose an iterative KNN that takes the presence of class labels into account | MCAR, MAR | The technique considered class labels and performed well against other imputation methods | The approach has not been theoretically proven to converge, though convergence was shown empirically |
| [74] | Camel, Ant, Ivy, Arc, Pcs, Mwl, KC3, Mc2 | To develop a novel incomplete-instance-based imputation approach that uses cross-validation to improve the parameters for each missing value | MCAR, MAR | The study demonstrated that the approach was superior to other missing value approaches | |
| [131] | Blood, breast-cancer, ecoli, glass, ionosphere, iris, Magic, optdigits, pendigits, pima, segment, sonar, waveform, wine, yeast, balance-scale, Car, chess-c, chess-m, CNAE-9, lymphography, mushroom, nursery, promoters, SPECT, tic-tac-toe, abalone, acute, card, contraceptive, German, heart, liver, zoo | To develop a missing data handling approach with effective imputation results | MCAR | The method calculated the class center of every class and used the distances between it and the observed data to define a threshold for imputation. The method performed better and required less imputation time | Only one missing data mechanism was implemented |
| [132] | Groundwater | To develop a multiple imputation method that can handle missingness in a groundwater dataset with a high rate of missing values | MAR | The technique was chosen for its ability to consider the relationships between the variables of interest | There was no prior knowledge of the labels of the missing data, which may have made imputation difficult |
| [133] | Dukes' B colon cancer, Mice Protein Expression and Yeast | To develop a novel hybrid Fuzzy C-means Rough parameter missing value imputation method | | The technique handled the vagueness and coarseness in the dataset and produced better imputation results | The missing value mechanisms used in the experiment were not reported |
| [134] | Forest fire, Glass, Housing, Iris, MPG, MV, Stocks, Wine | To propose a variant of the forward stage-wise regression (FSR) algorithm for data imputation by modelling the missing values as random variables following a Gaussian mixture distribution | Categorical | The method proved effective compared to other approaches that combined standard missing data approaches with the original FSR algorithm | The missing value mechanisms used in the experiment were not reported |
| [135] | Weather dataset | To apply four missing data handling methods (listwise deletion, multiple imputation, KNN, MICE) to the training data before classification | | Of the imputation methods applied, the authors concluded that KNN was the most effective for photovoltaic forecasting | The missing value mechanisms used in the experiment were not reported |
| [136] | Air quality data | To make time series predictions for missing values using three machine learning algorithms and identify the best method | | The study concluded that deep learning performed better when data was plentiful and machine learning models produced better results when data was limited | The most effective method (deep learning) incurred heavy costs in training time and computational power |
| [137] | Traumatic Brain Injury and Diabetes | To demonstrate how performance varies with the missing value mechanism and the imputation method used, and further to demonstrate how MNAR is an important tool to give confidence that valid results are obtained using multiple imputation and complete case analysis | MCAR, MAR, MNAR | The study showed that both complete case analysis and multiple imputation can produce unbiased results under more conditions | The method was limited by the absence of nonlinear terms in the substantive models |
| [138] | Grades Dataset | To develop a new decision tree approach for missing data handling | MCAR, MAR, MNAR | The method achieved higher accuracy than other missing value handling techniques and produced a more interpretable classifier | The algorithm was weak when the gating variable had no predictive power |
| [139] | Air Pressure System data | To propose a sorted missing percentages approach for filtering attributes when building a machine learning classification model from sensor readings with missing data | | The technique proved effective for scenarios involving missing data in industrial sensor data analysis | The proposed approach could not meet the needs of automation |
| [139] | Abalone and Boston Housing | To test the reliability of missing value handling when data are not missing at random | MAR | The results indicated that the approach achieved satisfactory performance in solving the lower incomplete problem compared to six other methods | The approach did not consider any missingness rate, which may have affected the analysis |
| [140] | Cleveland Heart disease | To propose a systematic methodology for the identification of missing values using KNN, MICE, mean, and mode imputation with four classifiers: Naive Bayes, SVM, logistic regression, and random forest | | The study demonstrated that MICE imputation performed better than the other imputation methods used in the study | The approach compared state-of-the-art methods with simple mean and mode imputation, which are biased and yield unrealistic results |
| [141] | Iris, Wine, Ecoli and Sonar datasets | To retrieve missing data by considering attribute correlation in the imputation process, using a class center-based adaptive approach with the firefly algorithm | MCAR | The experiment demonstrated that the class center-based firefly algorithm was an efficient method for handling missing values | Imputation was tested on only one missing value mechanism |
| [15] | Abalone, Iris, Lymphography and Parkinsons | To propose a novel tuple-based region splitting imputation approach that uses a new metric, mean integrity rate, to measure the missing degree of a dataset in order to impute various types of missing data | | The region splitting imputation model outperformed competing imputation models | A random generator was used to impute missing values, and other missing value mechanisms were not considered |
| [142] | Artificial and real metabolomics data | To develop a new kernel weight function-based imputation approach that handles missing values and outliers | MAR | The proposed kernel weight-based approach proved superior to other data imputation techniques | The method was tested on one type of dataset and may not perform as reported on other types of data |
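
The Mechanism column distinguishes MCAR, MAR and MNAR, and several limitations above turn on which mechanism a study actually simulated. The following is a minimal illustrative sketch, not taken from any of the cited studies, of how the three mechanisms can be simulated on a toy two-column dataset and how they affect a naive estimate. All function names, the logistic missingness model and its coefficients are assumptions chosen only for illustration.

```python
import math
import random


def make_data(n, seed=0):
    """Toy data (assumed): x is always observed, y = x + noise may go missing."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, 0.5)
        rows.append((x, y))
    return rows


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mask_y(rows, mechanism, seed=1):
    """Return the y column with None where y is missing.

    MCAR: missingness is independent of the data (fixed probability).
    MAR:  missingness depends only on the observed column x.
    MNAR: missingness depends on the (unobserved) value of y itself.
    """
    rng = random.Random(seed)
    masked = []
    for x, y in rows:
        if mechanism == "MCAR":
            p_missing = 0.3
        elif mechanism == "MAR":
            p_missing = sigmoid(2.0 * x)
        elif mechanism == "MNAR":
            p_missing = sigmoid(2.0 * y)
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        masked.append(None if rng.random() < p_missing else y)
    return masked


def observed_mean(masked):
    """Complete-case estimate: mean of the y values that were observed."""
    observed = [v for v in masked if v is not None]
    return sum(observed) / len(observed)


def regression_imputed_mean(rows, masked):
    """Mean of y after filling gaps with a least-squares fit of y on x."""
    pairs = [(x, m) for (x, _), m in zip(rows, masked) if m is not None]
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    filled = [m if m is not None else intercept + slope * x
              for (x, _), m in zip(rows, masked)]
    return sum(filled) / len(filled)


if __name__ == "__main__":
    data = make_data(5000)
    for mech in ("MCAR", "MAR", "MNAR"):
        y_masked = mask_y(data, mech)
        print(mech,
              round(observed_mean(y_masked), 3),
              round(regression_imputed_mean(data, y_masked), 3))
```

In this sketch the true mean of y is close to zero. The complete-case mean should stay near zero under MCAR but drift downward under MAR and MNAR, because large values are removed more often. Imputing from the observed column x should largely repair the MAR case, while a bias is expected to remain under MNAR, since the missingness depends on information that x does not fully capture.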