Single imputation methods

Simple imputation
For a predictor (X) that is unrelated to all other X’s, simple substitution replaces all missing continuous values with the mean (or median) of the participants who have a valid value, or with the mode for categorical predictors [71].
Mean substitution is easily implemented with the ‘Hmisc’ package of R statistical software through the function ‘impute(x, fun = mean)’, where x is the predictor of interest [72].
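As a minimal sketch, assuming a small data frame ‘dat’ with a continuous predictor ‘age’ containing missing values (both names are illustrative, not taken from the text):

```r
# Mean substitution with Hmisc::impute (illustrative data)
library(Hmisc)

dat <- data.frame(age = c(34, 51, NA, 28, NA, 60))

# Replace every missing value of age with the mean of the observed values
dat$age <- impute(dat$age, fun = mean)
```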
|
Simple imputation assumes MCAR and, because it ignores the relationships between variables, it reduces variability and attenuates correlation estimates. Regression coefficients are biased towards zero because the outcome (Y) is not considered [1].
|
Conditional mean imputation
Regression imputation assumes strong relationships between the X to be imputed and the other X’s used as independent variables in the univariable or multivariable regression formula [1,66,73]. An imputation model is built to predict the missing values; when X is related to the other X’s, this method is far more efficient [74-76]. Conditional mean imputation leads to underestimation of the variance and overestimation of the model fit and correlation estimates. The outcome (Y) should not be included in the imputation model, to prevent exaggeration of the strength of the relationship between X and Y [1].
Conditional mean imputation can be implemented in R by fitting a regression model and then using the built-in ‘predict’ function.
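A minimal sketch of this approach is given below, assuming a hypothetical data frame ‘dat’ in which the predictor ‘bmi’ is partly missing and the predictors ‘age’ and ‘sbp’ are fully observed (all names are illustrative):

```r
# Conditional mean (regression) imputation with lm() and predict()
dat <- data.frame(
  age = c(34, 51, 45, 28, 62, 60),
  sbp = c(120, 140, 135, 118, 150, 145),
  bmi = c(24.1, NA, 27.5, 22.0, NA, 30.2)
)

# Fit the imputation model on the complete cases; the outcome Y is excluded
imp_model <- lm(bmi ~ age + sbp, data = dat)

# Replace each missing value with its conditional mean from the model
missing_bmi <- is.na(dat$bmi)
dat$bmi[missing_bmi] <- predict(imp_model, newdata = dat[missing_bmi, ])
```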
Stochastic regression imputation |
An alternative to conditional mean imputation, stochastic regression imputation adds a random element to the predicted values, reflecting the uncertainty of the imputed values [73]. A random draw is taken from the distribution of predicted values, which allows for the inclusion of the outcome in the prediction model.
This can be implemented with the ‘mice’ package for R via the command ‘mice.impute.norm.nob’ [77]. |
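A possible sketch with the same illustrative data frame as above; ‘norm.nob’ is passed as the imputation method to mice(), and m = 1 requests a single stochastically imputed data set:

```r
# Stochastic regression imputation with the 'mice' package
library(mice)

dat <- data.frame(
  age = c(34, 51, 45, 28, 62, 60),
  sbp = c(120, 140, 135, 118, 150, 145),
  bmi = c(24.1, NA, 27.5, 22.0, NA, 30.2)
)

# 'norm.nob' adds a random residual draw to the regression prediction
imp <- mice(dat, method = "norm.nob", m = 1, seed = 1, printFlag = FALSE)
dat_stochastic <- complete(imp)
```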
Hotdecking |
Hotdecking replaces the missing value of an individual with a random value drawn from a pool of individuals, the ‘deck’, who are matched to that individual on selected predictors [78,79]. These deck predictors may be researcher-determined, or a correlation matrix may be used to determine which predictors are most highly correlated. The standard error is better approximated by the hotdeck procedure than by simple imputation.
The command ‘hotdeck’ of the R package ‘VIM’ implements hotdecking [80].
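The sketch below assumes an illustrative data frame with a hypothetical grouping variable ‘sex’ used to define the deck of donors; the variable names are not from the text:

```r
# Hotdeck imputation with the 'VIM' package
library(VIM)

dat <- data.frame(
  sex = c("m", "f", "m", "f", "m", "f"),
  age = c(34, 51, 45, 28, 62, 60),
  bmi = c(24.1, NA, 27.5, 22.0, NA, 30.2)
)

# Each missing bmi value is replaced by the value of a donor from the same sex
dat_hotdeck <- hotdeck(dat, variable = "bmi", domain_var = "sex")
```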
Multiple imputation methods

Markov chain Monte Carlo (MCMC)
Multivariate normal imputation assumes a multivariate normal distribution, and the MCMC algorithm is used to obtain imputed values and to allow for uncertainty in the estimated model parameters [81]. MCMC describes a group of methods that use Markov chains to generate pseudorandom draws from probability distributions.
The command ‘mcmcNorm’ of the R package ‘MCMCglmm’ can implement the MCMC approach to multiple imputation [82].
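As one possible illustration (not the package cited above), the ‘norm’ package implements data augmentation, an MCMC method, under a multivariate normal model; the data frame and variable names below are hypothetical:

```r
# MCMC (data augmentation) imputation under a multivariate normal model
library(norm)

dat <- data.frame(
  age = c(34, 51, 45, 28, 62, 60),
  sbp = c(120, 140, NA, 118, 150, 145),
  bmi = c(24.1, NA, 27.5, 22.0, NA, 30.2)
)

x <- as.matrix(dat)       # numeric matrix with missing values
s <- prelim.norm(x)       # preliminary sorting and summaries
theta_hat <- em.norm(s)   # EM estimates used as starting values
rngseed(1234)             # seed required before the data augmentation draws

# Five imputed data sets, each based on a separate 100-step run of the chain
imputations <- lapply(1:5, function(i) {
  theta_draw <- da.norm(s, theta_hat, steps = 100)  # posterior parameter draw
  imp.norm(s, theta_draw, x)                        # impute under that draw
})
```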
Maximum likelihood |
The expectation-maximization (EM) algorithm, also called joint modeling, assumes a multivariate distribution. First, the set of parameter values that produces the maximum likelihood is identified from the conditional distribution; these are the values that would most likely have resulted in the observed data [77,83]. New parameter estimates are then randomly drawn from a Bayesian posterior distribution, the distribution of the unobserved values conditional on the observed data [84]. Bootstrap procedures are employed to obtain standard error estimates, correcting for bias associated with non-normality.
The R package ‘Amelia’ implements the EM algorithm with bootstrapping to produce the imputations [85].
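A minimal sketch with the ‘Amelia’ package, again using hypothetical variable names; m = 5 requests five bootstrapped EM imputations:

```r
# EM with bootstrapping via Amelia
library(Amelia)

dat <- data.frame(
  age = c(34, 51, 45, 28, 62, 60),
  sbp = c(120, 140, NA, 118, 150, 145),
  bmi = c(24.1, NA, 27.5, 22.0, NA, 30.2)
)

# Each of the m completed data sets is analysed separately and results pooled
a_out <- amelia(dat, m = 5)
first_completed <- a_out$imputations[[1]]
```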