Abstract
High-dimensional data have been regarded as one of the most important types of big data in practice, arising frequently in areas such as genetic, financial, and geographical studies. Missing data in high-dimensional data analysis should be handled properly to reduce nonresponse bias. We discuss some modern machine learning techniques, including penalized regression approaches, tree-based approaches, and deep learning (DL), for handling missing data with high dimensionality. Specifically, our proposed methods can be used for estimating general parameters of interest, including population means and percentiles, with imputation-based estimators, propensity score estimators, and doubly robust estimators. We compare those methods through limited simulation studies and a real application. Both the simulation studies and the real application show the benefits of the DL and XGBoost approaches compared with other methods in terms of balancing bias and variance.
Keywords: Deep learning, high-dimensional data, imputation, machine learning, missing data
1. Introduction
Missing data are a critical problem in practical research, including sample surveys, epidemiology, economics, and social science. Simply ignoring missing data in statistical analysis may lead to biased results; see Refs. [34,41]. There are two types of missingness in practice: item nonresponse and unit nonresponse. Item nonresponse is often handled by imputation approaches including hot-deck imputation [1,48], nearest neighbor imputation [8,9,63], predictive mean matching (PMM) imputation [28,40,51,64], multiple imputation (MI) [52–54], and fractional imputation [31,32,62], among others. For a comprehensive review of methods for dealing with item nonresponse, see Ref. [12]. Unit nonresponse is often handled by using inverse probability weighting techniques; see Refs. [27,29,33], among others. The validity of the above methods depends on the underlying outcome regression model and nonresponse model assumptions. Doubly robust approaches [2,30,50] and multiply robust approaches [10,11,25] have been proposed to improve the robustness to model misspecification.
Big data are becoming one of the hottest topics in current research in computer science, data mining, engineering, and applied mathematics. High-dimensional data have been regarded as one of the most important types of big data; see Ref. [23]. With the rapid development of information technology, high-dimensional data are common in modern research areas including genetic studies [18,56,57], financial studies [42,49], and health care studies [37,44]. Currently, there are many supervised learning approaches for handling high-dimensional data. They can be classified mainly into three categories: (i) penalized regression methods such as the least absolute shrinkage and selection operator (LASSO) [59], ridge regression (RR) [55], and the smoothly clipped absolute deviation (SCAD) penalty [20]; (ii) tree-based approaches including decision or regression trees (RT) [38,39], random forests (RFs) [46,61], and boosting methods [15,16,43]; and (iii) deep learning (DL) approaches [24,36]. Unsupervised learning methods, including clustering, have also been developed in the context of high-dimensional data; see Refs. [21,35], among others.
Even though there are numerous articles on missing data analysis with low-dimensional data and on machine learning approaches for high-dimensional data, research on machine learning-based approaches for handling high-dimensional data with missing values is relatively new. Penalized regression methods have been used for imputation; see Refs. [45,65,66]. Tree-based imputation approaches have been discussed in Refs. [7,58,67]. DL-based methods for missing data imputation have been shown to perform well for high-dimensional medical [3,17] and genetic data [47]. In the previous literature, a systematic discussion and an empirical comparison of approaches for handling missing data with high dimensionality by using machine learning are lacking. For example, there is a lack of comparison among statistical learning approaches, tree-based approaches, traditional imputation methods, and DL approaches. To fill this research gap, we compare these approaches through an extensive simulation study as well as a real application. In addition, we also consider inverse probability weighted estimators and doubly robust estimators, and we discuss DL-based approaches for handling missing data. For simplicity, we focus on estimating the population mean for the scenario where only the outcome variable is missing. Our proposed methods can be naturally extended to estimating other parameters such as population percentiles, regression coefficients, and distribution functions, and to settings where both the outcome variable and the covariates are prone to missing values.
The paper is organized as follows. Section 2 introduces the mathematical notation and the problem. Penalized approaches are discussed in Section 3. Tree-based approaches including decision tree, RF, and boosting methods are considered in Section 4. Section 5 contains DL-based approaches. A simulation study is presented in Section 6. A real application is given in Section 7. Section 8 contains some discussions.
2. Basic setups
Suppose we have $n$ independent and identically distributed (i.i.d.) copies of $(\mathbf{x}_i, y_i, \delta_i)$, where $\mathbf{x}_i$ is a $d$-dimensional covariate vector, $y_i$ is the study variable subject to missingness, and $\delta_i$ is the response indicator such that $\delta_i = 1$ if $y_i$ is observed and 0 otherwise. Denote by $S$ the sample, $S_R$ the set of respondents, and $S_M$ the set of non-respondents. The vector $\mathbf{x}_i$ is assumed to be fully observed, and the dimension $d$ is not fixed and increases as $n$ increases. We assume that the outcome regression model is

$y_i = m(\mathbf{x}_i; \boldsymbol{\beta}) + e_i, \qquad (1)$

where $m(\cdot; \boldsymbol{\beta})$ is assumed to be an unknown function with unknown parameter $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_d)^{\top}$, and the $e_i$'s are assumed to be independent errors with $E(e_i) = 0$ and $\mathrm{Var}(e_i) = \sigma^2$. Our proposed method in the next section can be naturally extended to the heteroskedasticity scenario. Suppose only $d_0$ of the parameters $\beta_1, \ldots, \beta_d$ are non-zero, indexed by a subset $A$ of $\{1, \ldots, d\}$ with $|A| = d_0$ being a fixed number. Such a condition is the so-called sparsity condition. The missing mechanism is assumed to be missing at random, and the response indicators are assumed to have a Bernoulli distribution, $\delta_i \sim \mathrm{Bernoulli}(p_i)$, with

$p_i = P(\delta_i = 1 \mid \mathbf{x}_i, y_i) = P(\delta_i = 1 \mid \mathbf{x}_i) = \eta\{g(\mathbf{x}_i; \boldsymbol{\alpha})\}, \qquad (2)$

where $\eta$ is the link function and $g(\cdot; \boldsymbol{\alpha})$ is an unknown function with unknown parameter $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_d)^{\top}$. Suppose only $d_1$ of the parameters $\alpha_1, \ldots, \alpha_d$ are non-zero, indexed by a subset $B$ of $\{1, \ldots, d\}$ with $|B| = d_1$ being a fixed number. For simplicity, suppose the parameter of interest $\theta_0$ can be defined as the solution of the estimating equation $E\{U(\theta; y)\} = 0$, where $U(\theta; y)$ is assumed to be known and can be either a smooth or non-smooth function of $\theta$. Without loss of generality, we assume the solution of the previous estimating equation is unique and that the function $U$ satisfies the regularity conditions described in Ref. [14]. Special cases of $U(\theta; y)$ include the population mean with $U(\theta; y) = y - \theta$, regression coefficients with $U(\theta; y) = (y - \mathbf{x}^{\top}\theta)\mathbf{x}$, and the population $\tau$-th percentile with $U(\theta; y) = I(y \le \theta) - \tau$. For more information on estimating parameters defined as the solution of an estimating equation, see Ref. [60].
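To make the estimating-equation formulation concrete, the short sketch below (our own illustration, not code from the paper) solves the sample version of $E\{U(\theta; y)\} = 0$ for the two special cases mentioned above, the mean and the $\tau$-th percentile, on fully observed data. The helper name solve_estimating_equation is hypothetical.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=500)

def solve_estimating_equation(U, y, lo, hi):
    """Solve sum_i U(theta; y_i) = 0 for a scalar theta by root bracketing."""
    return brentq(lambda theta: np.sum(U(theta, y)), lo, hi)

# Population mean: U(theta; y) = y - theta.
theta_mean = solve_estimating_equation(lambda t, y: y - t, y, y.min(), y.max())

# tau-th percentile: U(theta; y) = I(y <= theta) - tau (non-smooth in theta).
tau = 0.75
theta_pct = solve_estimating_equation(
    lambda t, y: (y <= t).astype(float) - tau, y, y.min(), y.max())

print(theta_mean, y.mean())              # the two should agree
print(theta_pct, np.quantile(y, tau))    # approximately equal
```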
3. Penalized approaches
Under the working model assumption that $m(\mathbf{x}_i; \boldsymbol{\beta}) = \mathbf{x}_i^{\top}\boldsymbol{\beta}$, where $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_d)^{\top}$, the estimator $\hat{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}$ can be obtained by minimizing the following penalized distance function:

$Q_{\lambda}(\boldsymbol{\beta}) = \sum_{i \in S_R} ( y_i - \mathbf{x}_i^{\top}\boldsymbol{\beta} )^2 + \sum_{j=1}^{d} p_{\lambda}(|\beta_j|), \qquad (3)$

where $p_{\lambda}(\cdot)$ is the penalty term with tuning parameter $\lambda$. Popular examples include:

RR: $p_{\lambda}(|\beta|) = \lambda |\beta|^2$.

Hard thresholding: $p_{\lambda}(|\beta|) = \lambda^2 - (|\beta| - \lambda)^2\, I(|\beta| < \lambda)$.

LASSO (soft thresholding): $p_{\lambda}(|\beta|) = \lambda |\beta|$.

SCAD: the penalty defined through its derivative $p_{\lambda}'(|\beta|) = \lambda\{ I(|\beta| \le \lambda) + \frac{(a\lambda - |\beta|)_{+}}{(a-1)\lambda}\, I(|\beta| > \lambda) \}$ for some $a > 2$.
Then, the penalized deterministic imputed estimator for the population mean can be written as

$\hat{\theta}_{I} = \frac{1}{n} \sum_{i \in S} \{ \delta_i y_i + (1 - \delta_i)\, \mathbf{x}_i^{\top}\hat{\boldsymbol{\beta}} \}. \qquad (4)$

To estimate general parameters defined as the solution of an estimating equation $E\{U(\theta; y)\} = 0$, we propose using the following stochastic imputation approach. One can first create a pool of residuals $\{\hat{e}_j : j \in S_R\}$, where $\hat{e}_j = y_j - \mathbf{x}_j^{\top}\hat{\boldsymbol{\beta}}$. Then, generate the imputed values $y_i^{*} = \mathbf{x}_i^{\top}\hat{\boldsymbol{\beta}} + \hat{e}_i^{*}$ for $i \in S_M$, with $\hat{e}_i^{*}$ selected from the residual pool randomly with equal selection probabilities. The estimator of $\theta_0$ can be obtained by solving the following estimating equation:

$\sum_{i \in S} \{ \delta_i\, U(\theta; y_i) + (1 - \delta_i)\, U(\theta; y_i^{*}) \} = 0. \qquad (5)$
The performance of the penalized imputation approach depends on the working model assumption $m(\mathbf{x}; \boldsymbol{\beta}) = \mathbf{x}^{\top}\boldsymbol{\beta}$. If this assumption is violated, the resulting estimators may be biased.
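As an illustration of the penalized imputation estimators (4) and (5), the following minimal Python sketch uses scikit-learn's cross-validated LASSO as the working outcome model. It is our own stand-in for the glmnet-based implementation used later in the paper; the simulated data and variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, d = 600, 400
X = rng.uniform(size=(n, d))
y = 2 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=n)
delta = rng.binomial(1, 0.6, size=n).astype(bool)   # response indicator

# Fit the penalized working model m(x; beta) = x' beta on respondents only.
lasso = LassoCV(cv=10).fit(X[delta], y[delta])
m_hat = lasso.predict(X)

# Deterministic imputed estimator (4) of the population mean.
theta_det = np.mean(np.where(delta, y, m_hat))

# Stochastic version (5): add residuals drawn from the respondent residual
# pool, which also supports non-smooth parameters such as percentiles.
resid_pool = y[delta] - m_hat[delta]
y_star = m_hat + rng.choice(resid_pool, size=n, replace=True)
theta_stoch = np.mean(np.where(delta, y, y_star))
print(theta_det, theta_stoch)
```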
Alternatively, one can use a penalized approach for estimating the propensity score model $p(\mathbf{x}; \boldsymbol{\alpha})$. Specifically, suppose that the working model for the propensity score is $p(\mathbf{x}_i; \boldsymbol{\alpha}) = \eta(\mathbf{x}_i^{\top}\boldsymbol{\alpha})$. Then, the estimator $\hat{\boldsymbol{\alpha}}$ can be obtained by minimizing the following penalized distance function:

$\tilde{Q}_{\lambda}(\boldsymbol{\alpha}) = -\sum_{i \in S} [ \delta_i \log \eta(\mathbf{x}_i^{\top}\boldsymbol{\alpha}) + (1 - \delta_i) \log\{1 - \eta(\mathbf{x}_i^{\top}\boldsymbol{\alpha})\} ] + \sum_{j=1}^{d} p_{\lambda}(|\alpha_j|), \qquad (6)$

where the first term is the negative log-likelihood of the response indicators and $p_{\lambda}(\cdot)$ is the penalty term. The options for penalty terms have been discussed above. Then the propensity score-based estimator can be obtained by solving the following estimating equation:

$\sum_{i \in S} \frac{\delta_i}{\eta(\mathbf{x}_i^{\top}\hat{\boldsymbol{\alpha}})}\, U(\theta; y_i) = 0. \qquad (7)$
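A corresponding sketch for the penalized propensity score estimator, assuming an L1-penalized logistic working model; this is an illustration of (6)–(7) specialized to the mean, not the authors' glmnet code, and the hyper-parameter and data-generating choices are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(2)
n, d = 600, 200
X = rng.uniform(size=(n, d))
y = 1 + 2 * X[:, 0] + rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(0.5 + 1.5 * X[:, 0] - 1.0 * X[:, 1])))
delta = rng.binomial(1, p_true).astype(bool)

# Penalized (L1) logistic working model for the response propensity.
ps_model = LogisticRegressionCV(penalty="l1", solver="saga", Cs=10, cv=5,
                                max_iter=5000).fit(X, delta)
p_hat = ps_model.predict_proba(X)[:, 1]

# Propensity-score (inverse probability weighted) estimator, equation (7)
# specialized to the mean: solve sum_i delta_i/p_hat_i * (y_i - theta) = 0.
w = delta / p_hat
theta_ps = np.sum(w * y) / np.sum(w)
print(theta_ps)
```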
The performance of the propensity score estimator depends on the assumption that the working propensity model $p(\mathbf{x}; \boldsymbol{\alpha}) = \eta(\mathbf{x}^{\top}\boldsymbol{\alpha})$ is correctly specified. To improve the robustness, we consider a random hot-deck imputation procedure within classes [4,13] that can be described as follows:
- Obtain the estimators $\hat{\boldsymbol{\beta}}$ and $\hat{\boldsymbol{\alpha}}$ by using the penalized methods described previously.
- The sample is partitioned into $C$ imputation classes using an equal quantile method that consists of ordering the predicted values $\mathbf{x}_i^{\top}\hat{\boldsymbol{\beta}}$ from the lowest to the largest and then forming $C$ classes of (approximately) equal size.
- A missing value in a given class is replaced by the value of a respondent belonging to the same class and chosen at random with probability proportional to $\{\eta(\mathbf{x}_j^{\top}\hat{\boldsymbol{\alpha}})\}^{-1} - 1$ for donor $j$. Denote the imputed value as $y_i^{*}$ for $i \in S_M$; then the resulting doubly robust estimator of $\theta_0$ can be obtained by solving the following estimating equation:

$\sum_{i \in S} \{ \delta_i\, U(\theta; y_i) + (1 - \delta_i)\, U(\theta; y_i^{*}) \} = 0. \qquad (8)$

The estimator is doubly robust in the sense that it is consistent if at least one of the two working models, the outcome model $m(\mathbf{x}; \boldsymbol{\beta}) = \mathbf{x}^{\top}\boldsymbol{\beta}$ or the response model $p(\mathbf{x}; \boldsymbol{\alpha}) = \eta(\mathbf{x}^{\top}\boldsymbol{\alpha})$, is correctly specified.
Remark 3.1
For estimating the population mean $\theta_0 = E(y)$, one can use the following deterministic doubly robust estimator as well:

$\hat{\theta}_{DR} = \frac{1}{n} \sum_{i \in S} \left\{ \frac{\delta_i}{\hat{p}_i}\, y_i + \left( 1 - \frac{\delta_i}{\hat{p}_i} \right) \mathbf{x}_i^{\top}\hat{\boldsymbol{\beta}} \right\}, \qquad (9)$

where $\hat{p}_i = \eta(\mathbf{x}_i^{\top}\hat{\boldsymbol{\alpha}})$.
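Combining the two penalized fits gives the deterministic doubly robust estimator of the mean in (9). The sketch below is a minimal illustration under the same assumptions as the previous two snippets; here the outcome is fully simulated, whereas in practice the terms with $\delta_i = 0$ do not involve the unobserved $y_i$ because $\delta_i/\hat{p}_i = 0$.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV

rng = np.random.default_rng(3)
n, d = 600, 100
X = rng.uniform(size=(n, d))
y = 1 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(0.3 + 1.2 * X[:, 0])))
delta = rng.binomial(1, p_true).astype(bool)

# Outcome working model fitted on respondents, propensity model on all units.
m_hat = LassoCV(cv=5).fit(X[delta], y[delta]).predict(X)
p_hat = LogisticRegressionCV(penalty="l1", solver="saga", cv=5,
                             max_iter=5000).fit(X, delta).predict_proba(X)[:, 1]

# Deterministic doubly robust estimator of the mean, equation (9):
# consistent if either the outcome model or the propensity model is correct.
theta_dr = np.mean(delta / p_hat * y + (1 - delta / p_hat) * m_hat)
print(theta_dr)
```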
4. Tree-based approaches
4.1. RT
For simplicity, we only consider the mid-point as the split point for tree construction in this section. Suppose that we have a partition of the support of $\mathbf{x}$ into $L$ regions, $R_1, \ldots, R_L$; then the RT-based imputed value can be written as:

$\hat{m}_{RT}(\mathbf{x}) = \sum_{l=1}^{L} \bar{y}_l\, I(\mathbf{x} \in R_l), \qquad (10)$

where $\bar{y}_l$ is the sample mean of $y_i$ for the respondents with $\mathbf{x}_i \in R_l$ and $\delta_i = 1$. Denote $D = \{I(\mathbf{x} \in R_1), \ldots, I(\mathbf{x} \in R_L)\}$ as the corresponding dictionary (set of basis functions). In order to find the best partition, one can apply the following algorithm:
-
(Step 1).
Let $D_T = \{B_1(\mathbf{x}), \ldots, B_T(\mathbf{x})\}$ be the current dictionary (set of basis functions), where $B_t(\mathbf{x})$ is the $t$th basis function and $\{R_1, \ldots, R_T\}$ is the current partition of the support of $\mathbf{x}$. Denote by $T$ the current number of basis functions.
-
(Step 2).
Expand the basis functions by multiplying each of them by an additional split indicator based on one of the $d$ covariates. Thus, we have $T \times d$ possible candidate dictionaries: the candidate formed from basis function $B_t$ and covariate $x_j$ replaces $B_t(\mathbf{x})$ by the pair $B_t(\mathbf{x})\, I(x_j \le c_j)$ and $B_t(\mathbf{x})\, I(x_j > c_j)$, obtained by applying an additional split on $B_t$ using the mid-point $c_j$ of $x_j$.
-
(Step 3).
Among the possible candidate dictionaries, choose the best one, denoted $D_{T+1}$, by finding the minimizer of the loss function $\sum_{i \in S_R} \{ y_i - \hat{m}(\mathbf{x}_i) \}^2$, where $\hat{m}(\cdot)$ is the fitted tree based on the candidate dictionary.
-
(Step 4).
If the Bayesian information criterion (BIC) by using $D_T$ is smaller than the BIC by using $D_{T+1}$, then stop splitting and the final model is based on $D_T$; otherwise, set $T \leftarrow T + 1$ and return to (Step 2).
After the above tree selection algorithm, the imputed estimator for $\theta_0$ can be obtained by solving (5) with $y_i^{*} = \hat{m}_{RT}(\mathbf{x}_i) + \hat{e}_i^{*}$ for $i \in S_M$, where $\hat{m}_{RT}(\cdot)$ denotes the final selected tree and $\hat{e}_i^{*}$ is selected randomly, with equal selection probabilities, from the pool of respondent residuals $\hat{e}_j = y_j - \hat{m}_{RT}(\mathbf{x}_j)$, $j \in S_R$. Similarly, the propensity score model can also be estimated by using the RT method. The propensity score-based estimator and the doubly robust estimator can be obtained by using techniques similar to those in Section 3.
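A minimal imputation sketch in the spirit of the RT approach follows; note that it uses scikit-learn's CART with cross-validated pruning as a stand-in for the BIC-based mid-point-split algorithm described above, so it illustrates the idea rather than reproducing the exact procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(4)
n, d = 600, 50
X = rng.uniform(size=(n, d))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.3, size=n)
delta = rng.binomial(1, 0.6, size=n).astype(bool)

# Stand-in for the tree-growing algorithm: CART with complexity parameters
# chosen by cross-validation instead of the BIC stopping rule described above.
grid = {"ccp_alpha": np.linspace(0.0, 0.02, 11), "max_depth": [2, 4, 6, 8]}
tree = GridSearchCV(DecisionTreeRegressor(random_state=0), grid, cv=5)
tree.fit(X[delta], y[delta])
m_hat = tree.predict(X)

# Stochastic imputation as in (5): tree prediction plus a donated residual.
resid_pool = y[delta] - m_hat[delta]
y_star = m_hat + rng.choice(resid_pool, size=n, replace=True)
theta_rt = np.mean(np.where(delta, y, y_star))
print(theta_rt)
```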
4.2. RF
To improve the prediction precision, Breiman [6] proposed using RF with multiple trees. The algorithm for imputation can be summarized as follows:
-
(Step 1).
Draw $B$ bootstrap samples $S_R^{(1)}, \ldots, S_R^{(B)}$ from the original respondents $S_R$ by using a simple random sampling design with replacement.
-
(Step 2).
For each bootstrap sample of respondents $S_R^{(b)}$, randomly select a subset of the $d$ predictors for modeling.
-
(Step 3).
For each bootstrap sample $S_R^{(b)}$, obtain the tree model $\hat{m}^{(b)}(\mathbf{x})$ by using the algorithm described in Section 4.1.
-
(Step 4).
Obtain the final estimator of $m(\mathbf{x})$ by using $\hat{m}_{RF}(\mathbf{x}) = B^{-1} \sum_{b=1}^{B} \hat{m}^{(b)}(\mathbf{x})$.
Similarly, the propensity score model can also be estimated by using the RF method. The corresponding imputed estimator, propensity score estimator, and doubly robust estimator of $\theta_0$ can be obtained by using techniques similar to those in Section 3.
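A short RF sketch for the outcome and propensity models; the number of trees, the "sqrt" choice for the number of candidate predictors per split, and the truncation of small estimated propensities at 0.05 are illustrative assumptions rather than settings from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

rng = np.random.default_rng(5)
n, d = 600, 100
X = rng.uniform(size=(n, d))
y = np.exp(X[:, 0]) + 2 * X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=n)
p_true = 1 / (1 + np.exp(-(0.4 + X[:, 0] - X[:, 1])))
delta = rng.binomial(1, p_true).astype(bool)

# Steps 1-4 above in library form: B bootstrap trees, random feature subsets.
rf_y = RandomForestRegressor(n_estimators=500, max_features="sqrt",
                             random_state=0).fit(X[delta], y[delta])
m_hat = rf_y.predict(X)

# The same machinery can estimate the propensity score with a classifier;
# small estimated propensities are truncated for numerical stability.
rf_p = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                              random_state=0).fit(X, delta)
p_hat = np.clip(rf_p.predict_proba(X)[:, 1], 0.05, 1.0)

# Imputation, IPW, and doubly robust estimators of the mean.
theta_imp = np.mean(np.where(delta, y, m_hat))
theta_ipw = np.sum(delta / p_hat * y) / np.sum(delta / p_hat)
theta_dr = np.mean(delta / p_hat * y + (1 - delta / p_hat) * m_hat)
print(theta_imp, theta_ipw, theta_dr)
```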
4.3. Boosting
Boosting [22] is another procedure that can be applied to any statistical learning approach to improve the precision of model predictions, and it has been used frequently with tree-based methods. Boosting is an iterative method that starts with a weak learner and improves the prediction at each sequential step by predicting the residuals of prior models and combining them to make the final prediction. The idea behind boosting is very similar to that of additive modeling, see Refs. [22,26]. Suppose that the prediction model has the following additive form:
$m(\mathbf{x}) = \sum_{q=1}^{Q} f_q(\mathbf{x}; \gamma_q), \qquad (11)$

where $f_q(\mathbf{x}; \gamma_q)$, $q = 1, \ldots, Q$, are $Q$ different tree models with parameters $\gamma_1, \ldots, \gamma_Q$. They will be determined iteratively by using a forward stagewise procedure. For the quadratic loss function, the algorithm is as follows:
-
(Step 1).
Initialize $\hat{m}_0(\mathbf{x})$ with an offset value and set $q = 0$. The default offset is $\hat{m}_0(\mathbf{x}) = \bar{y}_R$, where $\bar{y}_R$ is the sample mean of $y_i$ over the respondents.
-
(Step 2).
Increase $q$ by 1 and compute the residuals $r_i = y_i - \hat{m}_{q-1}(\mathbf{x}_i)$ for $i \in S_R$.
-
(Step 3).
Fit the residual vector $(r_i)_{i \in S_R}$ on $\mathbf{x}_i$ for $i \in S_R$ by using the RT method described in Section 4.1. Denote the fitted value as $\hat{f}_q(\mathbf{x})$.
-
(Step 4).
Obtain the updated fitted value $\hat{m}_q(\mathbf{x}) = \hat{m}_{q-1}(\mathbf{x}) + \nu\, \hat{f}_q(\mathbf{x})$, with $0 < \nu \le 1$ as a step-length factor.
-
(Step 5).
Iterate (Step 2) to (Step 4) until some stopping criterion has been reached. For instance, one can consider using cross-validation or an information criterion.
Similarly, the propensity score model can also be estimated through the boosting method. The corresponding imputed estimator, propensity score estimator, and doubly robust estimator of $\theta_0$ can be obtained by using techniques similar to those described in Section 3.
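The boosting approach is implemented in the simulations via XGBoost (see Table 1, fitted there through the mlr R package). The sketch below shows one way to wire XGBoost fits into the imputation and doubly robust estimators using the Python xgboost package; the hyper-parameter values are chosen for illustration rather than taken from the tuned grids.

```python
import numpy as np
from xgboost import XGBRegressor, XGBClassifier

rng = np.random.default_rng(6)
n, d = 600, 400
X = rng.uniform(size=(n, d))
y = 3 * np.sin(np.pi * X[:, 0]) + (X[:, 1] > 0.5) * X[:, 2] + rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(0.4 + X[:, 0] - X[:, 1])))
delta = rng.binomial(1, p_true).astype(bool)

# Gradient-boosted trees for the outcome model (forward stagewise fitting of
# residuals, with learning_rate playing the role of the step-length factor).
xgb_y = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.1,
                     reg_alpha=1.0, reg_lambda=1.0)
xgb_y.fit(X[delta], y[delta])
m_hat = xgb_y.predict(X)

# Boosted propensity model and the resulting estimators of the mean.
xgb_p = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
xgb_p.fit(X, delta.astype(int))
p_hat = np.clip(xgb_p.predict_proba(X)[:, 1], 0.05, 1.0)

theta_imp = np.mean(np.where(delta, y, m_hat))
theta_dr = np.mean(delta / p_hat * y + (1 - delta / p_hat) * m_hat)
print(theta_imp, theta_dr)
```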
5. DL approaches
DL has been regarded as an effective strategy for handling hierarchical non-linear data structures. Suppose we want to use the following deep neural network working model
$m(\mathbf{x}; \mathbf{W}, \mathbf{b}) = g_{L+1} \circ g_{L} \circ \cdots \circ g_{1}(\mathbf{x}), \qquad (12)$

where $\circ$ denotes the composition of two functions and $L$ is the number of hidden layers, which is usually called the depth of a neural network model. Define $\mathbf{z}_0 = \mathbf{x}$, and we consider the multilayer perceptron with the following choice of $g_l$ for $l = 1, \ldots, L+1$: $g_l(\mathbf{z}_{l-1}) = \sigma_l(\mathbf{W}_l \mathbf{z}_{l-1} + \mathbf{b}_l)$,
where $\mathbf{W}_l$ and $\mathbf{b}_l$ are the weight matrix and the bias/intercept, respectively, associated with the $l$th layer, and $\sigma_l$ is the so-called activation function for the $l$th layer. Popular choices of activation function include the rectified linear unit (ReLU) function $\sigma(z) = \max(z, 0)$ and the scaled exponential linear unit (SELU) function $\sigma(z) = \lambda_0 z$ if $z > 0$, or $\sigma(z) = \lambda_0 \alpha_0 (e^{z} - 1)$ if $z \le 0$, for fixed positive constants $\lambda_0$ and $\alpha_0$. Define $\mathbf{W} = (\mathbf{W}_1, \ldots, \mathbf{W}_{L+1})$ and $\mathbf{b} = (\mathbf{b}_1, \ldots, \mathbf{b}_{L+1})$. Then the estimators $\hat{\mathbf{W}}$ and $\hat{\mathbf{b}}$ of the parameters $\mathbf{W}$ and $\mathbf{b}$ can be obtained by minimizing the following loss function:
$\sum_{i \in S_R} \rho\{ y_i - m(\mathbf{x}_i; \mathbf{W}, \mathbf{b}) \}, \qquad (13)$

where $\rho(\cdot)$ is a given continuous function such as the $L_p$ norm for $p = 1, 2$. The optimization can generally be done numerically via stochastic gradient descent [5] and backpropagation [36], with care taken to obtain reliable performance on unseen data (generalization). Then the DL-based imputed estimator of the population mean can be written as
$\hat{\theta}_{DL} = \frac{1}{n} \sum_{i \in S} \{ \delta_i y_i + (1 - \delta_i)\, m(\mathbf{x}_i; \hat{\mathbf{W}}, \hat{\mathbf{b}}) \}. \qquad (14)$
By using the logistic (inverse logit) function as the final activation function in (12) and the cross-entropy (logistic regression) loss as the loss function in (13), one can obtain the propensity score-based estimator as follows:

$\sum_{i \in S} \frac{\delta_i}{p(\mathbf{x}_i; \hat{\mathbf{W}}_p, \hat{\mathbf{b}}_p)}\, U(\theta; y_i) = 0, \qquad (15)$

where $\hat{\mathbf{W}}_p$ and $\hat{\mathbf{b}}_p$ are the estimators of the weight matrices and biases/intercepts of the network used for predicting the response probability. The estimation of other parameters and the doubly robust estimator can be developed by using techniques similar to those described in Section 3.
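A minimal Keras sketch in the spirit of DL-1r (ReLU hidden layers with dropout, a linear output layer, and a 60/40 training-validation split) for the imputation model follows. The layer widths, dropout rate, and optimizer settings are illustrative assumptions and do not reproduce the exact architectures in Figures 1 and 2.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(7)
n, d = 600, 400
X = rng.uniform(size=(n, d)).astype("float32")
y = (2 * np.sin(np.pi * X[:, 0]) + X[:, 1] * X[:, 2] +
     rng.normal(scale=0.5, size=n)).astype("float32")
delta = rng.binomial(1, 0.6, size=n).astype(bool)

# Multilayer perceptron as in (12): ReLU hidden layers, linear output,
# dropout for regularization, trained by SGD-type optimization of (13).
model = keras.Sequential([
    keras.Input(shape=(d,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="linear"),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
model.fit(X[delta], y[delta], validation_split=0.4, epochs=100,
          batch_size=32, verbose=0)

# DL-based imputed estimator of the mean, equation (14).
m_hat = model.predict(X, verbose=0).ravel()
theta_dl = np.mean(np.where(delta, y, m_hat))
print(theta_dl)
```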
6. Simulation study
For each iteration, we generated a data set of $n = 600$ i.i.d. copies $(\mathbf{x}_i, y_i, \delta_i)$, where $\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})$ is the set of the $p$ predictor variables, $y_i$ is the outcome variable of interest, $\delta_i$ is the response indicator for the $i$th subject, and $x_{is}$ is the $s$th predictor value for the $i$th individual, generated from a uniform distribution on $(0, 1)$, $s = 1, \ldots, p$. We assumed two conditions for the correlation matrix $\Sigma = (\Sigma_{jk})$ of $\mathbf{x}_i$: in the first, $\Sigma_{jk} = 1$ if $j = k$ and 0 for all other $(j, k)$; in the second, the off-diagonal correlations decay geometrically in $|j - k|$ (with $\Sigma_{jk} = \Sigma_{kj}$ for $j > k$), giving an auto-regressive structure. The vector $\mathbf{x}_i$ with the specified correlation matrix was generated by using the R package 'MultiRNG' and the method discussed in Ref. [19]. The random variables are thus mutually independent under the first condition and have an auto-regressive correlation structure under the second. The outcome regression model and nonresponse model are as follows:
-
(S1).
and -
(S2).
and
where cdf(·) denotes the cumulative standard normal distribution function. We set the error scale parameter to 1.5 and 1 for S1 and S2, respectively, which resulted in the variance of Y explained by X being around 65%. We set the response-model intercept to 2.3 and 3.5 for S1 and S2, respectively, to have an overall average response rate of 60%.
In addition, we considered a scenario of mixed types of distribution of X and error term as follows:
-
(S3).
and
where , , , , and . Similarly, we controlled the variance of Y explained by X to be around 65% and the overall average response rate to be 60%.
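For readers who want to reproduce correlated uniform predictors without the R package 'MultiRNG', the sketch below uses a Gaussian copula to induce an AR(1)-type dependence. This is an assumption-laden illustration: the correlation value 0.5 is hypothetical (the value used in the paper is not stated here), and the method of Ref. [19] additionally adjusts the latent correlation so that the uniforms attain the prescribed correlation exactly, which this sketch does not do.

```python
import numpy as np
from scipy.stats import norm

def ar1_uniform(n, p, rho, rng):
    """Generate n draws of p Uniform(0,1) predictors via a Gaussian copula
    whose latent correlation matrix has the AR(1) form rho^{|j-k|}."""
    idx = np.arange(p)
    corr = rho ** np.abs(idx[:, None] - idx[None, :])
    z = rng.multivariate_normal(np.zeros(p), corr, size=n, method="cholesky")
    return norm.cdf(z)

rng = np.random.default_rng(8)
X_indep = rng.uniform(size=(600, 400))            # mutually independent setting
X_ar = ar1_uniform(600, 400, rho=0.5, rng=rng)    # auto-regressive-type setting
```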
We compare the following approaches in the simulation study:
-
(M1).
Parametric approach using a linear model (LM). The model is built by stepwise model selection based on BIC.
-
(M2).
MI approaches with five imputed values. We use unconditional mean imputation (Mean), predictive mean matching (PMM), and Bayesian linear regression (Norm) to impute the missing values. The three methods we selected represent commonly used naive (Mean), semi-parametric (PMM), and parametric (Norm) models in MI. The models were implemented using the R package 'mice'.
-
(M3).
Regularized approaches using LASSO, RR, and SCAD. The R package glmnet with its default settings was used. The tuning parameter was selected using 10-fold cross-validation. For LASSO and SCAD, we used a de-biased estimator obtained by refitting the LM based on the selected variables.
-
(M4).
RT, RF, and the XGBoost approach (XGB). The hyper-parameters were tuned through a grid search using the caret or mlr R packages, as listed in Table 1. Five-times-repeated five-fold cross-validation was used to evaluate model performance during the tuning process. RT and RF were tuned for each iteration. Considering the complexity of tuning XGB, we randomly picked 1–3 simulated datasets to tune the parameters and applied them to all datasets.
-
(M5).
DL imputation approaches. We consider four DL architectures in total, each with four or five fully connected layers: DL-1r and DL-2r use ReLU and linear activation functions (Figure 1), and DL-1s and DL-2s use SELU and linear activation functions (Figure 2); DL-1r and DL-1s have the number of nodes reduced in deeper layers, while DL-2r and DL-2s have the same number of nodes in all layers. In order to control overfitting, we drop out nodes in some layers with a rate of 0.25–0.4. Also, the training:validation split is 60%:40%. Because of the high computational demand of the DL simulation, we randomly picked 10 generated data sets to tune the learning rate over 0.1, 0.01, 0.001, and 0.0001. The selected learning rate was applied to all data sets.
Table 1.
List of tuning parameters involved in M4 approaches.
Approach | Tuning package | Parameter | Range |
---|---|---|---|
RT | mlr | maxdepth | |
mlr | complexity parameter (cp) | ||
RF | caret | mtry | |
XGB | mlr | max_depth | 2:10 |
mlr | nrounds | 20:200 | |
mlr | eta | 0:0.35 | |
mlr | alpha | 0:10 | |
mlr | lambda | 0:2 | |
mlr | lambda_bias | 0:2 | |
mlr | booster | (‘gblinear’, ‘gbtree’) |
Note: p is the number of predictors in the data.
Figure 1.
Structures for DL-1r and DL-2r.
Figure 2.
Structures for DL-1s and DL-2s.
We consider imputation, propensity score, and doubly robust approaches for LASSO and SCAD. Suppose we are interested in estimating the population mean $\theta_0 = E(y)$. We implement the DL approaches in Python and all other approaches in R.
We performed 1000 iterations to compare the approaches described previously in terms of relative bias, relative standard deviation (SD), and relative root mean squared error (RMSE). The results for scenarios (S1) to (S3) are presented in Tables 2–10. Under our simulation settings, the boosting and DL approaches outperformed the other approaches in terms of balancing the relative bias and relative SD in most of the scenarios. The boosting method had the smallest relative SD, and the boosting and DL methods had comparable performance in terms of RMSE. With proper structures, DL methods can produce the smallest relative bias and RMSE. MI methods have similar performance to LM, LASSO, SCAD, RR, RT, and RF when the dimension is low (p = 10). However, they perform worse when the dimension becomes high, and the estimates could not even be obtained when the dimension is extremely high (p = 1000). The LM, LASSO-DB, LASSO-DR, RR, and RF methods have similar performance in all scenarios. LASSO-PS, SCAD-PS, and RT show comparable or better performance than LM, LASSO-DB, LASSO-DR, RR, and RF. The correlation among predictors did not substantially affect the ranking of the approaches.
Table 2.
Relative bias (%) of different approaches in scenario S1.
p = 10 | p = 400 | p = 1000 | ||||
---|---|---|---|---|---|---|
Model | ||||||
LM | 46.49 | 45.96 | 47.08 | 46.87 | 46.65 | 47.02 |
MI-PMM | 46.36 | 45.84 | 41.23 | 41.32 | NA | NA |
MI-Norm | 46.42 | 45.90 | −88.45 | −85.09 | NA | NA |
MI-Mean | 39.68 | 40.17 | 39.99 | 39.76 | NA | NA |
LASSO-DB | 47.88 | 47.05 | 49.79 | 49.50 | 49.85 | 49.75 |
LASSO-PS | 37.99 | 38.15 | 37.54 | 37.27 | 36.90 | 36.81 |
LASSO-DR | 47.79 | 47.02 | 48.84 | 48.58 | 48.72 | 48.68 |
SCAD-DB | 46.47 | 45.94 | 47.54 | 47.26 | 47.43 | 47.53 |
SCAD-PS | 51.04 | 50.25 | 45.30 | 45.17 | 42.38 | 42.58 |
SCAD-DR | 49.24 | 48.37 | 49.09 | 48.86 | 48.61 | 48.72 |
RR | 45.87 | 45.27 | 40.88 | 40.88 | 40.35 | 40.34 |
RT | 49.93 | 49.05 | 50.48 | 50.20 | 50.08 | 49.98 |
RF | 47.87 | 46.88 | 41.56 | 41.51 | 40.58 | 40.47 |
XGB | −7.08 | −7.65 | −7.11 | −7.22 | −7.31 | −7.32 |
DL-1r | −5.86 | −6.03 | −1.75 | −1.91 | −2.04 | −2.25 |
DL-2r | 1.18 | 0.45 | 6.46 | 5.82 | 6.49 | 6.39 |
DL-1s | −7.77 | −8.52 | −9.32 | −10.01 | −1.73 | −2.46 |
DL-2s | −10.38 | −11.33 | −10.73 | −11.25 | 3.49 | 2.52 |
Table 10.
Relative RMSE (%) of different approaches in scenario S3.
Model | p = 10 | p = 400 | p = 1000 |
---|---|---|---|
LM | 36.45 | 36.85 | 39.28 |
MI-PMM | 53.84 | 258.41 | NA |
MI-Norm | 35.80 | 2507.57 | NA |
MI-Mean | 87.87 | 88.13 | NA |
LASSO-DB | 60.29 | 66.12 | 67.72 |
LASSO-PS | 56.01 | 56.74 | 57.31 |
LASSO-DR | 54.63 | 56.26 | 57.30 |
SCAD-DB | 35.22 | 37.44 | 39.53 |
SCAD-PS | 65.15 | 58.51 | 57.72 |
SCAD-DR | 75.82 | 58.90 | 57.16 |
RR | 39.53 | 82.43 | 84.67 |
RT | 57.15 | 62.22 | 63.02 |
RF | 48.28 | 87.45 | 89.82 |
XGB | 27.95 | 28.28 | 28.34 |
DL-1r | 24.53 | 26.41 | 27.64 |
DL-2r | 25.83 | 26.21 | 27.23 |
DL-1s | 23.12 | 29.23 | 29.04 |
DL-2s | 23.12 | 26.61 | 27.01 |
Table 3.
Relative SD (%) of different approaches in scenario S1.
p = 10 | p = 400 | p = 1000 | ||||
---|---|---|---|---|---|---|
Model | ||||||
LM | 6.13 | 6.10 | 6.11 | 6.00 | 8.01 | 7.35 |
MI-PMM | 7.24 | 7.10 | 11.39 | 11.17 | NA | NA |
MI-Norm | 6.93 | 6.97 | 3446.20 | 4016.47 | NA | NA |
MI-Mean | 5.80 | 5.65 | 5.54 | 5.60 | NA | NA |
LASSO-DB | 6.89 | 6.73 | 6.89 | 6.95 | 7.06 | 6.91 |
LASSO-PS | 5.81 | 5.62 | 5.48 | 5.48 | 5.77 | 5.51 |
LASSO-DR | 6.25 | 6.19 | 6.11 | 6.14 | 6.39 | 6.22 |
SCAD-DB | 6.10 | 6.09 | 6.25 | 6.12 | 6.60 | 6.32 |
SCAD-PS | 7.10 | 6.74 | 6.46 | 6.40 | 6.67 | 6.21 |
SCAD-DR | 6.03 | 6.03 | 5.94 | 5.82 | 6.36 | 6.05 |
RR | 6.00 | 5.95 | 5.51 | 5.55 | 5.78 | 5.54 |
RT | 6.44 | 6.72 | 6.16 | 6.06 | 6.48 | 6.21 |
RF | 6.05 | 6.13 | 5.53 | 5.54 | 5.75 | 5.51 |
XGB | 4.23 | 4.15 | 4.14 | 4.15 | 4.17 | 4.11 |
DL-1r | 6.60 | 6.61 | 9.51 | 9.19 | 9.53 | 9.63 |
DL-2r | 6.14 | 5.97 | 6.37 | 6.39 | 6.79 | 6.44 |
DL-1s | 5.43 | 5.50 | 4.79 | 4.79 | 6.25 | 6.12 |
DL-2s | 4.94 | 4.66 | 4.52 | 4.52 | 6.66 | 7.10 |
Table 5.
Relative bias (%) of different approaches in scenario S2.
p = 10 | p = 400 | p = 1000 | ||||
---|---|---|---|---|---|---|
Model | ||||||
LM | 19.98 | 19.40 | 19.73 | 19.72 | 19.51 | 19.74 |
MI-PMM | 20.51 | 19.20 | −0.73 | −0.56 | NA | NA |
MI-Norm | 20.35 | 19.64 | 7.92 | 12.87 | NA | NA |
MI-Mean | 16.20 | 11.44 | 16.27 | 16.29 | NA | NA |
LASSO-DB | 17.12 | 17.28 | 13.81 | 13.80 | 13.54 | 13.64 |
LASSO-PS | 9.24 | 7.71 | 9.78 | 9.72 | 10.01 | 10.33 |
LASSO-DR | 19.39 | 19.13 | 17.23 | 17.20 | 16.97 | 17.17 |
SCAD-DB | 20.05 | 19.58 | 18.15 | 18.16 | 17.39 | 17.74 |
SCAD-PS | 6.44 | 4.88 | 8.44 | 8.46 | 8.90 | 9.15 |
SCAD-DR | 21.34 | 21.02 | 20.26 | 20.27 | 19.54 | 19.90 |
RR | 20.39 | 19.89 | 17.23 | 17.25 | 17.11 | 17.29 |
RT | 12.49 | 12.89 | 9.41 | 10.03 | 8.47 | 8.70 |
RF | 21.02 | 20.04 | 15.58 | 15.56 | 15.45 | 15.61 |
XGB | −7.62 | −9.34 | −8.40 | −8.20 | −8.58 | −8.29 |
DL-1r | 11.01 | 13.02 | 5.16 | 3.68 | −0.86 | −2.14 |
DL-2r | 8.00 | 7.51 | 3.78 | 2.72 | 0.00 | 0.96 |
DL-1s | 5.84 | 6.07 | −2.32 | −12.54 | 0.41 | −20.24 |
DL-2s | 3.49 | 8.71 | −1.65 | −7.96 | −6.12 | −17.30 |
Table 6.
Relative SD (%) of different approaches in scenario S2.
p = 10 | p = 400 | p = 1000 | ||||
---|---|---|---|---|---|---|
Model | ||||||
LM | 9.10 | 8.95 | 9.29 | 9.40 | 11.50 | 10.47 |
MI-PMM | 9.59 | 9.65 | 16.09 | 16.67 | NA | NA |
MI-Norm | 10.06 | 9.79 | 1510.95 | 925.54 | NA | NA |
MI-Mean | 7.05 | 9.27 | 7.05 | 7.10 | NA | NA |
LASSO-DB | 11.56 | 10.80 | 12.14 | 12.00 | 11.75 | 11.90 |
LASSO-PS | 6.91 | 7.14 | 6.80 | 6.85 | 6.52 | 6.62 |
LASSO-DR | 10.29 | 9.74 | 10.52 | 10.42 | 10.17 | 10.24 |
SCAD-DB | 8.94 | 8.74 | 10.11 | 10.24 | 9.95 | 9.88 |
SCAD-PS | 8.42 | 8.58 | 8.62 | 8.51 | 8.30 | 8.56 |
SCAD-DR | 9.40 | 9.27 | 9.67 | 9.74 | 9.31 | 9.25 |
RR | 8.28 | 8.08 | 7.04 | 7.13 | 6.88 | 6.92 |
RT | 11.18 | 11.20 | 11.02 | 11.30 | 9.83 | 10.39 |
RF | 9.50 | 9.00 | 7.13 | 7.14 | 6.87 | 6.95 |
XGB | 5.89 | 6.23 | 5.77 | 5.79 | 5.49 | 5.65 |
DL-1r | 8.56 | 8.74 | 7.47 | 7.61 | 7.28 | 7.57 |
DL-2r | 7.73 | 7.91 | 6.71 | 6.56 | 6.14 | 6.22 |
DL-1s | 8.26 | 8.03 | 7.75 | 7.01 | 9.26 | 6.26 |
DL-2s | 7.72 | 7.93 | 6.77 | 7.34 | 7.57 | 7.05 |
Table 7.
Relative RMSE (%) of different approaches in scenario S2.
p = 10 | p = 400 | p = 1000 | ||||
---|---|---|---|---|---|---|
Model | ||||||
LM | 21.96 | 21.36 | 21.81 | 21.84 | 22.65 | 22.34 |
MI-PMM | 22.64 | 21.49 | 16.11 | 16.68 | NA | NA |
MI-Norm | 22.70 | 21.95 | 1510.97 | 925.63 | NA | NA |
MI-Mean | 17.67 | 14.73 | 17.73 | 17.77 | NA | NA |
LASSO-DB | 20.66 | 20.38 | 18.39 | 18.29 | 17.93 | 18.11 |
LASSO-PS | 11.53 | 10.51 | 11.91 | 11.90 | 11.95 | 12.27 |
LASSO-DR | 21.95 | 21.47 | 20.19 | 20.11 | 19.78 | 20.00 |
SCAD-DB | 21.95 | 21.44 | 20.78 | 20.85 | 20.04 | 20.31 |
SCAD-PS | 10.60 | 9.87 | 12.06 | 12.00 | 12.17 | 12.53 |
SCAD-DR | 23.32 | 22.97 | 22.45 | 22.49 | 21.64 | 21.95 |
RR | 22.01 | 21.47 | 18.61 | 18.67 | 18.44 | 18.62 |
RT | 16.76 | 17.07 | 14.49 | 15.11 | 12.98 | 13.55 |
RF | 23.07 | 21.97 | 17.13 | 17.12 | 16.91 | 17.09 |
XGB | 9.63 | 11.22 | 10.19 | 10.04 | 10.19 | 10.03 |
DL-1r | 13.95 | 15.68 | 9.08 | 8.45 | 7.33 | 7.87 |
DL-2r | 11.13 | 10.91 | 7.70 | 7.10 | 6.14 | 6.29 |
DL-1s | 10.12 | 10.06 | 8.09 | 14.36 | 9.27 | 21.18 |
DL-2s | 8.47 | 11.78 | 6.97 | 10.83 | 9.73 | 18.68 |
Table 8.
Relative bias (%) of different approaches in scenario S3.
Model | p = 10 | p = 400 | p = 1000 |
---|---|---|---|
LM | 33.29 | 33.77 | 33.95 |
MI-PMM | 51.15 | 228.38 | NA |
MI-Norm | 30.35 | 247.39 | NA |
MI-Mean | 86.17 | 86.28 | NA |
LASSO-DB | 54.69 | 61.18 | 62.78 |
LASSO-PS | 54.14 | 54.93 | 55.56 |
LASSO-DR | 52.27 | 53.81 | 54.72 |
SCAD-DB | 31.83 | 34.61 | 36.88 |
SCAD-PS | 60.78 | 55.62 | 54.46 |
SCAD-DR | 70.20 | 55.81 | 53.40 |
RR | 36.65 | 80.64 | 82.96 |
RT | 55.08 | 60.46 | 61.32 |
RF | 45.57 | 85.35 | 87.81 |
XGB | 25.65 | 25.96 | 26.05 |
DL-1r | 18.78 | 18.72 | 20.21 |
DL-2r | 21.26 | 21.69 | 22.37 |
DL-1s | 17.08 | 23.50 | 23.57 |
DL-2s | 17.84 | 21.89 | 22.21 |
Table 9.
Relative SD (%) of different approaches in scenario S3.
Model | p = 10 | p = 400 | p = 1000 |
---|---|---|---|
LM | 14.83 | 14.76 | 19.77 |
MI-PMM | 16.80 | 120.90 | NA |
MI-Norm | 19.00 | 2495.34 | NA |
MI-Mean | 17.21 | 17.99 | NA |
LASSO-DB | 25.37 | 25.06 | 25.37 |
LASSO-PS | 14.37 | 14.23 | 14.04 |
LASSO-DR | 15.89 | 16.41 | 17.00 |
SCAD-DB | 15.09 | 14.29 | 14.22 |
SCAD-PS | 23.45 | 18.15 | 19.11 |
SCAD-DR | 28.65 | 18.84 | 20.41 |
RR | 14.82 | 17.11 | 16.96 |
RT | 15.23 | 14.68 | 14.54 |
RF | 15.97 | 19.04 | 18.89 |
XGB | 11.12 | 11.22 | 11.17 |
DL-1r | 15.78 | 18.63 | 18.85 |
DL-2r | 14.66 | 14.70 | 15.52 |
DL-1s | 15.58 | 17.39 | 16.97 |
DL-2s | 14.70 | 15.12 | 15.37 |
7. Real data application
We conducted a real data analysis using a study data set from the Gene Expression Omnibus. The study found a total of 24 miRNAs to be significantly associated with metabolic syndrome traits in a population of 200 men. We downloaded the expression profile of 1918 miRNAs in 200 subjects and their body mass index (BMI) as our variable of interest. We removed 883 non-informative miRNAs having a variance of expression less than or equal to 0.1, which resulted in 1035 miRNAs in 200 subjects for the real data analysis. In the 200 subjects, BMI likely follows a normal distribution (Shapiro–Wilk normality test p-value 0.164), with mean ± SD of 26.62 ± 3.32 and a median of 26.40 (Figure 3). Simple linear regressions of BMI on each miRNA showed that 2 and 25 miRNAs were significant with and without Benjamini–Hochberg multiple testing correction, respectively, at a significance level of 0.05. The average pair-wise correlation between miRNAs was 0.26. We generated the response indicators of the subjects using a procedure similar to that in our simulation: we randomly picked three miRNAs, applied non-linear transformations to their expression values, and generated the response indicators from the resulting response model.
The model intercept was chosen to attain the targeted overall average response rate.
Figure 3.
Distribution of BMI in real data.
In this high-dimensional real data analysis, we tuned the parameters for the machine learning methods again following the procedures in the simulation. The results are presented in Table 11. DL and boosting were the top two methods with the smallest relative bias. The propensity score estimate using SCAD had the highest bias. The LM resulted in a small relative bias but the greatest RMSE for predicting individual BMI. DL ranked second after RF in RMSE.
Table 4.
Relative RMSE (%) of different approaches in scenario S1.
p = 10 | p = 400 | p = 1000 | ||||
---|---|---|---|---|---|---|
Model | ||||||
LM | 46.90 | 46.36 | 47.48 | 47.25 | 47.34 | 47.59 |
MI-PMM | 46.92 | 46.39 | 42.78 | 42.81 | NA | NA |
MI-Norm | 46.93 | 46.42 | 3447.34 | 4017.37 | NA | NA |
MI-Mean | 40.10 | 40.56 | 40.37 | 40.15 | NA | NA |
LASSO-DB | 48.37 | 47.53 | 50.26 | 49.99 | 50.35 | 50.23 |
LASSO-PS | 38.43 | 38.57 | 37.94 | 37.67 | 37.35 | 37.22 |
LASSO-DR | 48.20 | 47.43 | 49.22 | 48.97 | 49.14 | 49.07 |
SCAD-DB | 46.87 | 46.34 | 47.95 | 47.65 | 47.89 | 47.95 |
SCAD-PS | 51.53 | 50.70 | 45.75 | 45.62 | 42.90 | 43.03 |
SCAD-DR | 49.61 | 48.75 | 49.44 | 49.20 | 49.03 | 49.10 |
RR | 46.26 | 45.66 | 41.25 | 41.26 | 40.76 | 40.72 |
RT | 50.34 | 49.51 | 50.86 | 50.56 | 50.50 | 50.36 |
RF | 48.25 | 47.28 | 41.92 | 41.88 | 40.99 | 40.84 |
XGB | 8.24 | 8.70 | 8.23 | 8.33 | 8.42 | 8.40 |
DL-1r | 8.82 | 8.94 | 9.67 | 9.38 | 9.75 | 9.89 |
DL-2r | 6.25 | 5.99 | 9.08 | 8.65 | 9.40 | 9.07 |
DL-1s | 9.48 | 10.14 | 10.48 | 11.09 | 6.49 | 6.60 |
DL-2s | 11.49 | 12.25 | 11.64 | 12.13 | 7.51 | 7.53 |
Table 11.
Performance summary of different approaches in real data analysis.
Bias | Relative bias | RMSE | Relative RMSE | |
---|---|---|---|---|
LM | −0.191 | −0.72% | 5.17 | 19.43% |
LASSO_DB | −0.260 | −0.98% | 3.24 | 12.18% |
LASSO_PS | −0.487 | −1.83% | NA | NA |
LASSO_DR | −0.272 | −1.02% | NA | NA |
SCAD_DB | −0.395 | −1.48% | 3.30 | 12.39% |
SCAD_PS | −0.919 | −3.45% | NA | NA |
SCAD_DR | −0.358 | −1.34% | NA | NA |
Ridge | −0.275 | −1.03% | 3.23 | 12.12% |
RT | −0.442 | −1.66% | 4.17 | 15.65% |
RF | −0.255 | −0.96% | 3.15 | 11.82% |
XGB | −0.130 | −0.49% | 3.60 | 13.53% |
DL | 0.072 | 0.27% | 3.21 | 12.06% |
8. Discussion
In this paper, we presented imputation, propensity score, and doubly robust approaches based on several machine learning methods for handling high-dimensional data with missing values. Our proposed approaches are very general and can be used for estimating general parameters of interest including population means and quantiles. We compared several missing data approaches for handling high-dimensional data by using both a simulation study and a real application. The XGBoost and DL approaches outperform the other methods under high-dimensional non-linear model structures. A key consideration when using DL and other machine learning methods is to avoid overfitting. Statistical inference with machine learning-based approaches is a very challenging research problem, which we will pursue in future research.
Acknowledgments
Dr Sixia Chen is partly supported by the National Institute on Minority Health and Health Disparities at National Institutes of Health (1R21MD014658-01A1) and the Oklahoma Shared Clinical and Translational Resources (U54GM104938) with an Institutional Development Award (IDeA) from National Institute of General Medical Sciences. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Part of the computing for this project was performed at the OU Supercomputing Center for Education & Research (OSCER) at the University of Oklahoma (OU).
Funding Statement
Dr Sixia Chen is partly supported by the National Institute on Minority Health and Health Disparities at National Institutes of Health (1R21MD014658-01A1) and the Oklahoma Shared Clinical and Translational Resources (U54GM104938) with an Institutional Development Award (IDeA) from National Institute of General Medical Sciences.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
1. Andridge R.R. and Little R.J., A review of hot deck imputation for survey non-response, Int. Stat. Rev. 78 (2010), pp. 40–64.
2. Bang H. and Robins J.M., Doubly robust estimation in missing data and causal inference models, Biometrics 61 (2005), pp. 962–973.
3. Beaulieu-Jones B.K. and Moore J.H., Missing data imputation in the electronic health record using deeply learned autoencoders, Pacific Symposium on Biocomputing 2017, Big Island, Hawaii, World Scientific, 2017, pp. 207–218.
4. Boistard H., Chauvet G., and Haziza D., Doubly robust inference for the distribution function in the presence of missing survey data, Scand. J. Stat. 43 (2016), pp. 683–699.
5. Bottou L., Large-scale machine learning with stochastic gradient descent, Proceedings of COMPSTAT'2010, Paris, France, Springer, 2010, pp. 177–186.
6. Breiman L., Random forests, Mach. Learn. 45 (2001), pp. 5–32.
7. Burgette L.F. and Reiter J.P., Multiple imputation for missing data via sequential regression trees, Am. J. Epidemiol. 172 (2010), pp. 1070–1076.
8. Chen J. and Shao J., Nearest neighbor imputation for survey data, J. Off. Stat. 16 (2000), pp. 113–131.
9. Chen J. and Shao J., Jackknife variance estimation for nearest-neighbor imputation, J. Am. Stat. Assoc. 96 (2001), pp. 260–269.
10. Chen S. and Haziza D., Multiply robust imputation procedures for the treatment of item nonresponse in surveys, Biometrika 104 (2017), pp. 439–453.
11. Chen S. and Haziza D., Multiply robust nonparametric multiple imputation for the treatment of missing data, Stat. Sin. 29 (2019), pp. 2035–2053.
12. Chen S. and Haziza D., Recent developments in dealing with item non-response in surveys: a critical review, Int. Stat. Rev. 87 (2019), pp. S192–S218.
13. Chen S., Haziza D., Léger C., and Mashreghi Z., Pseudo-population bootstrap methods for imputed survey data, Biometrika 106 (2019), pp. 369–384.
14. Chen S. and Kim J.K., Semiparametric fractional imputation using empirical likelihood in survey sampling, Stat. Theor. Relat. Fields 1 (2017), pp. 69–81.
15. Chen T. and Guestrin C., Xgboost: a scalable tree boosting system, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, USA, 2016, pp. 785–794.
16. Chen T., He T., Benesty M., Khotilovich V., and Tang Y., Xgboost: Extreme Gradient Boosting, R Package Version 0.4-2, 2015, pp. 1–4.
17. Cheng C.-Y., Tseng W.-L., Chang C.-F., Chang C.-H., and Gau S.S.-F., A deep learning approach for missing data imputation of rating scales assessing attention-deficit hyperactivity disorder, Front. Psychiatry 11 (2020), p. 199. doi:10.3389/fpsyt.2020.00673
18. Chung N.C. and Storey J.D., Statistical significance of variables driving systematic variation in high-dimensional data, Bioinformatics 31 (2015), pp. 545–554.
19. Falk M., A simple approach to the generation of uniformly distributed random variables with prescribed correlations, Commun. Stat. Simul. Comput. 28 (1999), pp. 785–791.
20. Fan J. and Li R., Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. 96 (2001), pp. 1348–1360.
21. Fern X.Z. and Brodley C.E., Random projection for high dimensional data clustering: a cluster ensemble approach, Proceedings of the 20th International Conference on Machine Learning (ICML-03), Washington DC, 2003, pp. 186–193.
22. Friedman J.H., Greedy function approximation: a gradient boosting machine, Ann. Stat. 29 (2001), pp. 1189–1232.
23. Gandomi A. and Haider M., Beyond the hype: big data concepts, methods, and analytics, Int. J. Inf. Manage. 35 (2015), pp. 137–144.
24. Goodfellow I., Bengio Y., Courville A., and Bengio Y., Deep Learning, MIT Press, Cambridge, 2016.
25. Han P. and Wang L., Estimation with missing data: beyond double robustness, Biometrika 100 (2013), pp. 417–430.
26. Hastie T., Tibshirani R., Friedman J.H., and Friedman J.H., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York, 2009.
27. Haziza D. and Lesage É., A discussion of weighting procedures for unit nonresponse, J. Off. Stat. 32 (2016), pp. 129–145.
28. Heitjan D.F. and Little R.J., Multiple imputation for the fatal accident reporting system, Appl. Stat. 40 (1991), pp. 13–29.
29. Jones A.M., Koolman X., and Rice N., Health-related non-response in the British household panel survey and European community household panel: using inverse-probability-weighted estimators in non-linear models, J. R. Stat. Soc.: Ser. A (Stat. Soc.) 169 (2006), pp. 543–569.
30. Kang J.D. and Schafer J.L., Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data, Stat. Sci. 22 (2007), pp. 523–539.
31. Kim J.K., Parametric fractional imputation for missing data analysis, Biometrika 98 (2011), pp. 119–132.
32. Kim J.K. and Fuller W., Fractional hot deck imputation, Biometrika 91 (2004), pp. 559–578.
33. Kim J.K. and Riddles M.K., Some theory for propensity-score-adjustment estimators in survey sampling, Surv. Methodol. 38 (2012), p. 157.
34. Kim J.K. and Shao J., Statistical Methods for Handling Incomplete Data, CRC Press, Chapman and Hall, 2013.
35. Kriegel H.-P., Kröger P., and Zimek A., Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering, and correlation clustering, ACM Trans. Knowl. Discov. Data 3 (2009), pp. 1–58.
36. LeCun Y., Bengio Y., and Hinton G., Deep learning, Nature 521 (2015), pp. 436–444.
37. Li X. and Xu R., High-Dimensional Data Analysis in Cancer Research, Springer Science & Business Media, 2008.
38. Lin G., Shen C., Shi Q., Van den Hengel A., and Suter D., Fast supervised hashing with decision trees for high-dimensional data, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, 2014, pp. 1963–1970.
39. Linero A.R., Bayesian regression trees for high-dimensional prediction and variable selection, J. Am. Stat. Assoc. 113 (2018), pp. 626–636.
40. Little R.J., Missing-data adjustments in large surveys, J. Bus. Econ. Stat. 6 (1988), pp. 287–296.
41. Little R.J. and Rubin D.B., Statistical Analysis with Missing Data, Wiley, New York, 1987.
42. Ma X., Sha J., Wang D., Yu Y., Yang Q., and Niu X., Study on a prediction of p2p network loan default based on the machine learning lightgbm and xgboost algorithms according to different high dimensional data cleaning, Electron. Commer. Res. Appl. 31 (2018), pp. 24–39.
43. Mishina Y., Murata R., Yamauchi Y., Yamashita T., and Fujiyoshi H., Boosted random forest, IEICE Trans. Inf. Syst. 98 (2015), pp. 1630–1636.
44. Moon H., Ahn H., Kodell R.L., Baek S., Lin C.-J., and Chen J.J., Ensemble methods for classification of patients for personalized medicine with high-dimensional data, Artif. Intell. Med. 41 (2007), pp. 197–207.
45. Musoro J.Z., Zwinderman A.H., Puhan M.A., and Geskus R.B., Validation of prediction models based on LASSO regression with multiply imputed data, BMC Med. Res. Methodol. 14 (2014), p. 116.
46. Qi Y., Random Forest for Bioinformatics, Ensemble Machine Learning, Springer, New York, 2012, pp. 307–323.
47. Qiu Y.L., Zheng H., and Gevaert O., A deep learning framework for imputing missing values in genomic data, bioRxiv, 2018. Available at https://www.biorxiv.org/content/early/2018/09/03/406066
48. Rao J.N.K. and Shao J., Jackknife variance estimation with survey data under hot deck imputation, Biometrika 79 (1992), pp. 811–822.
49. Ribeiro M.H.D.M., Ensemble approach based on bagging, boosting and stacking for short-term prediction in agribusiness time series, Appl. Soft Comput. 86 (2020), p. 105837.
50. Robins J.M., Rotnitzky A., and Zhao L.P., Estimation of regression coefficients when some regressors are not always observed, J. Am. Stat. Assoc. 89 (1994), pp. 846–866.
51. Rubin D.B., Statistical matching using file concatenation with adjusted weights and multiple imputations, J. Bus. Econ. Stat. 4 (1986), pp. 87–94.
52. Rubin D.B., Multiple imputation after 18+ years, J. Am. Stat. Assoc. 91 (1996), pp. 473–489.
53. Rubin D.B., Multiple Imputation for Nonresponse in Surveys, Vol. 81, John Wiley & Sons, 2004.
54. Rubin D.B. and Schenker N., Multiple imputation for interval estimation from simple random samples with ignorable nonresponse, J. Am. Stat. Assoc. 81 (1986), pp. 366–374.
55. Saunders C., Gammerman A., and Vovk V., Ridge regression learning algorithm in dual variables, Proceedings of the 15th International Conference on Machine Learning, New York, ICML, 1998.
56. Schwarz D.F., König I.R., and Ziegler A., On safari to random jungle: a fast implementation of random forests for high-dimensional data, Bioinformatics 26 (2010), pp. 1752–1758.
57. Shabalin A.A., Weigman V.J., Perou C.M., and Nobel A.B., Finding large average submatrices in high dimensional data, Ann. Appl. Stat. 3 (2009), pp. 985–1012.
58. Shah A.D., Bartlett J.W., Carpenter J., Nicholas O., and Hemingway H., Comparison of random forest and parametric imputation models for imputing missing data using mice: a caliber study, Am. J. Epidemiol. 179 (2014), pp. 764–774.
59. Tibshirani R., Regression shrinkage and selection via the LASSO, J. R. Stat. Soc.: Ser. B (Methodol.) 58 (1996), pp. 267–288.
60. Van der Vaart A.W., Asymptotic Statistics, Vol. 3, Cambridge University Press, 2000.
61. Wright M.N. and Ziegler A., ranger: a fast implementation of random forests for high dimensional data in C++ and R, preprint, 2015. Available at arXiv:1508.04409
62. Yang S. and Kim J.K., Fractional imputation in survey sampling: a comparative review, Stat. Sci. 31 (2016), pp. 415–432.
63. Yang S. and Kim J.K., Nearest Neighbor Imputation for General Parameter Estimation in Survey Sampling, The Econometrics of Complex Survey Data, Emerald Publishing Limited, 2019.
64. Yang S. and Kim J.K., Asymptotic theory and inference of predictive mean matching imputation using a superpopulation model framework, Scand. J. Stat. 47 (2020), pp. 839–861.
65. Zahid F.M. and Heumann C., Multiple imputation with sequential penalized regression, Stat. Methods Med. Res. 28 (2019), pp. 1311–1327.
66. Zhao Y. and Long Q., Multiple imputation in the presence of high-dimensional data, Stat. Methods Med. Res. 25 (2016), pp. 2021–2035.
67. Zhu R. and Kosorok M.R., Recursively imputed survival trees, J. Am. Stat. Assoc. 107 (2012), pp. 331–340.