Table 3. Data processing and machine learning modeling.
| Study | Preprocessing: missing data management | Preprocessing: class imbalance | Model | Dominant model | Evaluation metrics | Analysis software and package | Findings |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Weber et al, 2018 [15] | MICEa | —b | Super learning approach using logistic regression, random forest, K-nearest neighbors, LRc (LASSOd, ridge, and an elastic net) | No difference between models | Sensitivity, specificity, PVPe, PVNf, and AUCg | RStudio (version 3.3.2), SuperLearner package | AUC=0.67, sensitivity=0.61, specificity=0.64 |
| Rawashdeh et al, 2020 [16] | Instances with missing values were removed manually | SMOTEh | Locally weighted learning, Gaussian process, K-star classifier, linear regression, K-nearest neighbor, decision tree, random forest, neural network | Random forest | Accuracy, sensitivity, specificity, AUC, and G-means | WEKAi (version 3.9) | Random forest: G-mean=0.96, sensitivity=1.00, specificity=0.94, accuracy=0.95, AUC=0.98 (oversampling ratio of 200%) |
| Gao et al, 2019 [17] | — | Control group was undersampled | RNNsj, long short-term memory network, logistic regression, SVMk, gradient boosting | RNN ensemble models on balanced data | Sensitivity, specificity, PVP, and AUC | — | AUC=0.827, sensitivity=0.965, specificity=0.698, PVP=0.033 |
| Lee and Ahn, 2019 [18] | — | — | ANNl, logistic regression, decision tree, naïve Bayes, random forest, SVM | No difference between models | Accuracy | Python (version 3.5.2) | No difference in accuracy between the ANN (0.9115) and logistic regression or random forest (0.9180 and 0.8918, respectively) |
| Woolery and Grzymala-Busse, 1994 [19] | — | — | LERSm | — | Accuracy | ID3n, LERS CONCLUS | Database 1: accuracy=88.8% for both low-risk and high-risk pregnancy. Database 2: accuracy=59.2% in high-risk pregnant women. Database 3: accuracy=53.4% |
| Grzymala-Busse and Woolery, 1994 [20] | — | — | LERS based on the bucket brigade algorithm of genetic algorithms and enhanced by partial matching | — | Accuracy | LERS | Accuracy=68% to 90% |
| Vovsha et al, 2014 [21] | — | Oversampling techniques (ADASYN) | SVMs with linear and nonlinear kernels, LR (forward selection, stepwise selection, L1 LASSO regression, and elastic net regression) | — | Sensitivity, specificity, and G-means | RStudio, glmnet package | SVM: sensitivity (0.404 to 0.594), specificity (0.621 to 0.84), G-mean (0.575 to 0.652); LR: sensitivity (0.502 to 0.591), specificity (0.587 to 0.731), G-mean (0.586 to 0.604) |
| Esty et al, 2018 [22] | Imputation with the missForest package in R | Not clear | Hybrid C5.0 decision tree–ANN classifier | — | Sensitivity, specificity, and ROCo | R software, missForest package, FANNp library | Sensitivity: 84.1% to 93.4%, specificity: 70.6% to 76.9%, AUC: 78.5% to 89.4% |
| Frize et al, 2011 [23] | Decision tree | — | Hybrid decision tree–ANN | — | Sensitivity, specificity, ROC for Pq and NPr cases | See5, MATLAB Neural Ware tool | Training (P: sensitivity=66%, specificity=83%, AUC=0.81; NP: sensitivity=62.8%, specificity=71.7%, AUC=0.72), test (P: sensitivity=66.3%, specificity=83.9%, AUC=0.80; NP: sensitivity=65%, specificity=71.3%, AUC=0.73), and verification (P: sensitivity=61.4%, specificity=83.3%, AUC=0.79; NP: sensitivity=65.5%, specificity=71.1%, AUC=0.73) |
| Goodwin and Maher, 2000 [24] | PVRuleMiner or FactMiner | — | Neural networks, LR, CARTs, and software programs called PVRuleMiner and FactMiner | No difference between models | ROC | Custom data mining software (Clinical Miner, PVRuleMiner, and FactMiner) | No significant difference between techniques. Neural network (AUC=0.68), stepwise LR (AUC=0.66), CART (AUC=0.65), FactMiner (demographic features only; AUC=0.725), FactMiner (demographic plus other indicator features; AUC=0.757) |
| Tran et al, 2016 [3] | — | Undersampling of the majority class | SSLRt, RGBu | — | Sensitivity, specificity, NPVv, PVP, F-measure, and AUC | — | SSLR: sensitivity=0.698 to 0.734, specificity=0.643 to 0.732, F-measure=0.70 to 0.73, AUC=0.764 to 0.791, NPV=0.96 to 0.719, PVP=0.679 to 0.731; RGB: sensitivity=0.621 to 0.720, specificity=0.74 to 0.841, F-measure=0.693 to 0.732, NPV=0.675 to 0.717, PVP=0.783 to 0.743, AUC=0.782 to 0.807 |
| Koivu and Sairanen, 2020 [9] | — | — | LR, ANN, LGBMw, deep neural network, SELUx network, average ensemble, and weighted average (WAy) ensemble | — | AUC | RStudio (version 3.5.1) and Python (version 3.6.9) | AUC for classifiers: LR=0.62 to 0.64; deep neural network: 0.63 to 0.66; SELU network: 0.64 to 0.67; LGBM: 0.64 to 0.67; average ensemble: 0.63 to 0.67; WA ensemble: 0.63 to 0.67 |
| Khatibi et al, 2019 [25] | Map phase module | — | Decision trees, SVMs, random forests, and ensemble classifiers | — | Accuracy and AUC | — | Accuracy=81% and AUC=68% |
aMICE: Multiple Imputation by Chained Equations.
bNot reported in the study.
cLR: linear regression.
dLASSO: least absolute shrinkage and selection operator.
ePVP: predictive value positive.
fPVN: predictive value negative.
gAUC: area under the ROC curve.
hSMOTE: Synthetic Minority Oversampling Technique.
iWEKA: Waikato Environment for Knowledge Analysis.
jRNN: recurrent neural network.
kSVM: support vector machine.
lANN: artificial neural network.
mLERS: learning from examples based on rough sets.
nID3: iterative dichotomiser 3.
oROC: receiver operating characteristic.
pFANN: Fast Artificial Neural Network.
qP: parous.
rNP: nulliparous.
sCART: classification and regression tree.
tSSLR: stabilized sparse logistic regression.
uRGB: randomized gradient boosting.
vNPV: negative predictive value.
wLGBM: Light Gradient Boosting Machine.
xSELU: scaled exponential linear unit.
yWA: weighted average.
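
Several of the studies above handle missing data with chained-equations or forest-based imputation (MICE in Weber et al [15]; missForest in Esty et al [22]). For reference only, the sketch below shows a MICE-like imputation in Python using scikit-learn's IterativeImputer; this is an approximation of those R packages, not the authors' code, and the data are synthetic.

```python
# Minimal sketch of chained-equations ("MICE-like") imputation.
# Assumption: scikit-learn stands in for the MICE/missForest R packages
# that the reviewed studies actually used; the data below are synthetic.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.1] = np.nan  # knock out ~10% of entries

# The default IterativeImputer regresses each feature on the others in turn
# (MICE-like); using a random forest estimator mimics the missForest idea.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X)
print("remaining NaNs:", int(np.isnan(X_imputed).sum()))
```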
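Class imbalance is the other recurring preprocessing step (SMOTE in Rawashdeh et al [16]; undersampling of the majority class in Gao et al [17] and Tran et al [3]), and several studies report the G-mean alongside sensitivity, specificity, and AUC. The following is a minimal Python sketch of that pattern, assuming scikit-learn, imbalanced-learn, and a synthetic data set; it is not the WEKA pipeline used in [16].

```python
# Minimal sketch: SMOTE oversampling + random forest, evaluated with
# sensitivity, specificity, G-mean, and AUC. Data set, split, and
# hyperparameters are illustrative assumptions, not values from the studies.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

def gmean_sens_spec(y_true, y_pred):
    # Sensitivity = TP/(TP+FN), specificity = TN/(TN+FP),
    # G-mean = sqrt(sensitivity * specificity).
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return np.sqrt(sens * spec), sens, spec

# Synthetic imbalanced outcome (about 10% positive class).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Oversample only the training fold so synthetic cases never reach the test set.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_res, y_res)
gmean, sens, spec = gmean_sens_spec(y_te, clf.predict(X_te))
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"G-mean={gmean:.2f} sensitivity={sens:.2f} specificity={spec:.2f} AUC={auc:.2f}")
```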
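Finally, Weber et al [15] combined base learners with a super learning approach using the R SuperLearner package. As a rough stand-in, the sketch below stacks logistic regression, random forest, and K-nearest neighbors with scikit-learn's StackingClassifier; the library, base learners, and hyperparameters are assumptions for illustration, not the authors' configuration.

```python
# Minimal sketch of a stacked ("super learning"-style) ensemble in Python.
# StackingClassifier fits base learners, builds out-of-fold predictions via
# cross-validation, and trains a meta-learner on them; this approximates,
# but is not identical to, the R SuperLearner procedure used in [15].
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,                          # out-of-fold predictions feed the meta-learner
    stack_method="predict_proba",  # stack class probabilities, not hard labels
)
stack.fit(X_tr, y_tr)
print("AUC:", round(roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1]), 3))
```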