2022 Apr 20;10(4):e33875. doi: 10.2196/33875

Table 3. Data processing and machine learning modeling.

| Study | Missing data management | Class imbalance | Model | Dominant model | Evaluation metrics | Analysis software and package | Findings |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Weber et al, 2018 [15] | MICE^a | —^b | Super learning approach using logistic regression, random forest, K-nearest neighbors, and LR^c (LASSO^d, ridge, and elastic net) | No difference between models | Sensitivity, specificity, PVP^e, PVN^f, and AUC^g | RStudio (version 3.3.2), SuperLearner package | AUC=0.67, sensitivity=0.61, specificity=0.64 |
| Rawashdeh et al, 2020 [16] | Instances with missing values were removed manually | SMOTE^h | Locally weighted learning, Gaussian process, K-star classifier, linear regression, K-nearest neighbor, decision tree, random forest, neural network | Random forest | Accuracy, sensitivity, specificity, AUC, and G-mean | WEKA^i (version 3.9) | Random forest: G-mean=0.96, sensitivity=1.00, specificity=0.94, accuracy=0.95, AUC=0.98 (oversampling ratio of 200%) |
| Gao et al, 2019 [17] | —^b | Control group was undersampled | RNNs^j, long short-term memory network, logistic regression, SVM^k, gradient boosting | RNN ensemble models on balanced data | Sensitivity, specificity, PVP, and AUC | —^b | AUC=0.827, sensitivity=0.965, specificity=0.698, PVP=0.033 |
| Lee and Ahn, 2019 [18] | —^b | —^b | ANN^l, logistic regression, decision tree, naïve Bayes, random forest, SVM | No difference between models | Accuracy | Python (version 3.5.2) | No difference in accuracy between the ANN (0.9115), logistic regression (0.9180), and random forest (0.8918) |
| Woolery and Grzymala-Busse, 1994 [19] | —^b | —^b | LERS^m | —^b | Accuracy | ID3^n, LERS CONCLUS | Database 1: 88.8% accuracy for both low-risk and high-risk pregnancies; database 2: 59.2% accuracy in high-risk pregnant women; database 3: 53.4% accuracy |
| Grzymala-Busse and Woolery, 1994 [20] | —^b | —^b | LERS based on the bucket brigade algorithm of genetic algorithms, enhanced by partial matching | —^b | Accuracy | LERS | Accuracy=68% to 90% |
| Vovsha et al, 2014 [21] | —^b | Oversampling techniques (ADASYN) | SVMs with linear and nonlinear kernels, LR (forward selection, stepwise selection, L1 LASSO regression, and elastic net regression) | —^b | Sensitivity, specificity, and G-mean | RStudio, glmnet package | SVM: sensitivity 0.404 to 0.594, specificity 0.621 to 0.84, G-mean 0.575 to 0.652; LR: sensitivity 0.502 to 0.591, specificity 0.587 to 0.731, G-mean 0.586 to 0.604 |
| Esty et al, 2018 [22] | Imputation with the missForest package in R | Not clear | Hybrid C5.0 decision tree–ANN classifier | —^b | Sensitivity, specificity, and ROC^o | R, missForest package, FANN^p library | Sensitivity: 84.1% to 93.4%; specificity: 70.6% to 76.9%; AUC: 78.5% to 89.4% |
| Frize et al, 2011 [23] | —^b | —^b | Decision tree, hybrid decision tree–ANN | —^b | Sensitivity, specificity, and ROC for P^q and NP^r cases | See5, MATLAB NeuralWare tool | Training (P: sensitivity=66%, specificity=83%, AUC=0.81; NP: sensitivity=62.8%, specificity=71.7%, AUC=0.72); test (P: sensitivity=66.3%, specificity=83.9%, AUC=0.80; NP: sensitivity=65%, specificity=71.3%, AUC=0.73); verification (P: sensitivity=61.4%, specificity=83.3%, AUC=0.79; NP: sensitivity=65.5%, specificity=71.1%, AUC=0.73) |
| Goodwin and Maher, 2000 [24] | PVRuleMiner or FactMiner | —^b | Neural networks, LR, CART^s, and software programs called PVRuleMiner and FactMiner | No difference between models | ROC | Custom data mining software (Clinical Miner, PVRuleMiner, FactMiner) | No significant difference between techniques: neural network (AUC=0.68), stepwise LR (AUC=0.66), CART (AUC=0.65), FactMiner (demographic features only: AUC=0.725), FactMiner (demographic plus other indicator features: AUC=0.757) |
| Tran et al, 2016 [3] | —^b | Undersampling of the majority class | SSLR^t, RGB^u | —^b | Sensitivity, specificity, NPV^v, PVP, F-measure, and AUC | —^b | SSLR: sensitivity=0.698 to 0.734, specificity=0.643 to 0.732, F-measure=0.70 to 0.73, AUC=0.764 to 0.791, NPV=0.96 to 0.719, PVP=0.679 to 0.731; RGB: sensitivity=0.621 to 0.720, specificity=0.74 to 0.841, F-measure=0.693 to 0.732, NPV=0.675 to 0.717, PVP=0.783 to 0.743, AUC=0.782 to 0.807 |
| Koivu and Sairanen, 2020 [9] | —^b | —^b | LR, ANN, LGBM^w, deep neural network, SELU^x network, average ensemble, and weighted average ensemble | WA^y ensemble | AUC | RStudio (version 3.5.1) and Python (version 3.6.9) | AUC per classifier: LR=0.62 to 0.64, deep neural network=0.63 to 0.66, SELU network=0.64 to 0.67, LGBM=0.64 to 0.67, average ensemble=0.63 to 0.67, WA ensemble=0.63 to 0.67 |
| Khatibi et al, 2019 [25] | Map phase module | —^b | Decision trees, SVMs, random forests, and ensemble classifiers | —^b | Accuracy and AUC | —^b | Accuracy=81% and AUC=68% |

^a MICE: Multiple Imputation by Chained Equations.

^b Not reported in the study.

^c LR: linear regression.

^d LASSO: least absolute shrinkage and selection operator.

^e PVP: predictive value positive.

^f PVN: predictive value negative.

^g AUC: area under the ROC curve.

^h SMOTE: Synthetic Minority Oversampling Technique.

^i WEKA: Waikato Environment for Knowledge Analysis.

^j RNN: recurrent neural network.

^k SVM: support vector machine.

^l ANN: artificial neural network.

^m LERS: learning from examples based on rough sets.

^n ID3: Iterative Dichotomiser 3.

^o ROC: receiver operating characteristic.

^p FANN: Fast Artificial Neural Network.

^q P: parous.

^r NP: nulliparous.

^s CART: classification and regression tree.

^t SSLR: stabilized sparse logistic regression.

^u RGB: randomized gradient boosting.

^v NPV: negative predictive value.

^w LGBM: Light Gradient Boosting Machine.

^x SELU: scaled exponential linear unit.

^y WA: weighted average.
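
Two of the preprocessing techniques tabulated above are concrete enough to sketch in code. The first is the chained-equation imputation (MICE^a) reported by Weber et al [15]. That study worked in the R SuperLearner stack; the minimal sketch below instead uses scikit-learn's MICE-inspired IterativeImputer on hypothetical data, so the array sizes, missingness rate, and number of imputations are illustrative assumptions, not values from the study.

```python
# Hypothetical data: the study's real predictors are not reproduced here.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))          # 200 records, 4 numeric features
X[rng.random(X.shape) < 0.1] = np.nan  # knock out ~10% of entries at random

# sample_posterior=True draws imputed values from each feature's
# conditional model, which is what allows multiple distinct completed
# datasets -- the "multiple" in Multiple Imputation by Chained Equations.
imputations = [
    IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X)
    for m in range(5)  # 5 imputed datasets, an arbitrary choice here
]
print(len(imputations), imputations[0].shape)  # 5 completed copies of X
```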
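
The second is the class-imbalance workflow of Rawashdeh et al [16]: SMOTE^h oversampling, a random forest, and the G-mean (the square root of sensitivity × specificity). The study ran in WEKA with a tuned 200% oversampling ratio; the sketch below reproduces the same idea with Python's imbalanced-learn and scikit-learn libraries on synthetic data, so every dataset and parameter choice is an assumption made for illustration.

```python
# Synthetic stand-in for the study's clinical data: ~10% positive class.
from imblearn.metrics import geometric_mean_score
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Oversample only the training fold so no synthetic points leak into
# the evaluation set; sampling_strategy=1.0 fully balances the classes
# (the study instead tuned the oversampling ratio, reporting 200%).
X_res, y_res = SMOTE(sampling_strategy=1.0, random_state=0).fit_resample(
    X_train, y_train)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_res, y_res)

# G-mean = sqrt(sensitivity * specificity), the metric reported in [16].
print("G-mean:", geometric_mean_score(y_test, clf.predict(X_test)))
```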