. 2022 Mar 23;63(3):250–261. doi: 10.1002/jmd2.12285

TABLE 1.

Summary of all reviewed studies on applied data imbalance, feature construction, feature selection and ML classification methods

| Author | Disease | Data imbalance | Feature construction | Feature selection | ML classification |
|---|---|---|---|---|---|
| Baumgartner et al. 13 | PKU | Random sampling | | Information gain | DT, LRA |
| Baumgartner et al. 14 | MCADD, PKU | Random sampling | | Information gain, relief‐based | LDA, DT, KNN, LRA, NN, SVM |
| Baumgartner et al. 15 | 3‐MCCD*, MCADD, PKU | Random sampling | Diagnostic flag | | DT, LRA |
| Baumgartner et al. 16 | 3‐MCCD*, PKU, GA1, MMA, PA, MCADD, LCHADD | Random sampling | | Discriminatory threshold | KNN, LRA, Naive Bayes, NN, SVM |
| Ho et al. 12 | MCADD | Informed sampling | Arithmetic ratio | χ² | Rule learner |
| Hsieh et al. 17 | MMA | | | Pearson coefficient | SVM |
| Hsieh et al. 18 | MMA | Random sampling | | Pearson coefficient | SVM |
| Van den Bulcke et al. 19 | MCADD | Oversampling | Arithmetic ratio | Variable set optimization | DT, LRA, Ridge‐LRA |
| Chen et al. 20 | PKU | | | Fisher score | SVM |
| Chen et al. 21 | 3‐MCCD*, PKU, MET | | Arithmetic ratio | Fisher score, variable set optimization | SVM |
| Lin et al. 7 | CIT1, CIT2, CPT1D, GA1, IBDD, IVA, MADD, MET, MMA, MSUD, PA, PKU, PTPSD, SCADD*, VLCADD | Random sampling, oversampling, informed sampling | | χ², ANOVA, mutual information, L1‐norm, tree‐based | Bagging, Boosting, DT, KNN, LDA, LRA, RF, SVM |
| Peng et al. 22 | MMA | Oversampling | | | RF |
| Wang et al. 23 | SCADD*, MCADD, VLCADD | | Arithmetic ratio | Discriminatory threshold | LRA |
| Zarin Mousavi et al. 24 | CH | | | χ², information gain, expert consultation | Bagging, Boosting, DT, NN, SVM |
| Peng et al. 25 | GA1, MMA, OTCD, VLCADD | | Second tier | | RF |
| Zhu et al. 26 | PKU | | Arithmetic ratio | Pearson coefficient, LVQ | LRA |
| Lasarev et al. 27 | CAH | Informed sampling | PCA | | DT |

Note: Conditions marked with * are biochemical variants that are now regarded as nondiseases.

Abbreviations: ANOVA, analysis of variance; CAH, congenital adrenal hyperplasia; CH, congenital hypothyroidism; CIT1, citrullinemia type I; CIT2, citrullinemia type II; CPT1D, carnitine palmitoyltransferase I deficiency; DT, decision tree; GA1, glutaric aciduria type I; IBDD, isobutyryl‐CoA dehydrogenase deficiency; IVA, isovaleric aciduria; KNN, K‐nearest neighbors; LCHADD, long‐chain 3‐hydroxyacyl‐CoA dehydrogenase deficiency; LDA, linear discriminant analysis; LRA, logistic regression analysis; LVQ, learning vector quantization; MADD, multiple acyl‐CoA dehydrogenase deficiency; MCADD, medium‐chain acyl‐CoA dehydrogenase deficiency; 3‐MCCD, 3‐methylcrotonyl‐CoA carboxylase deficiency; MET, hypermethioninemia; MMA, methylmalonic aciduria; MSUD, maple syrup urine disease; NN, neural network; OTCD, ornithine transcarbamylase deficiency; PA, propionic aciduria; PCA, principal component analysis; PKU, phenylketonuria; PTPSD, 6‐pyruvoyl‐tetrahydropterin synthase deficiency; RF, random forest; Ridge‐LRA, logistic ridge regression; SCADD, short‐chain acyl‐CoA dehydrogenase deficiency; SVM, support vector machine; VLCADD, very long‐chain acyl‐CoA dehydrogenase deficiency.
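
To make the four method categories of Table 1 concrete, the following minimal Python sketch chains one example of each: oversampling (data imbalance), an arithmetic ratio (feature construction), a Fisher-style score (feature selection) and a one-level decision tree (ML classification). It does not reproduce any reviewed study. The records, the analyte names C8 and C10, the simplified two-class score and all function names are assumptions chosen for illustration only, and the values are invented.

```python
import random
from statistics import mean, pvariance

# Hypothetical toy records: (C8, C10, label); label 1 = affected. Values are invented.
records = [
    (0.10, 0.12, 0), (0.08, 0.10, 0), (0.12, 0.15, 0), (0.09, 0.08, 0),
    (0.11, 0.13, 0), (0.07, 0.09, 0), (0.13, 0.12, 0), (0.10, 0.11, 0),
    (2.50, 0.20, 1), (1.80, 0.15, 1),
]

def add_ratio(record):
    """Feature construction: add the arithmetic ratio C8/C10 as a new feature."""
    c8, c10, label = record
    return {"C8": c8, "C10": c10, "C8/C10": c8 / c10, "label": label}

def oversample(rows, seed=0):
    """Data imbalance: duplicate random minority rows until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [r for r in rows if r["label"] == 1]
    neg = [r for r in rows if r["label"] == 0]
    minority, majority = sorted([pos, neg], key=len)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra

def fisher_score(rows, name):
    """Feature selection: simplified two-class score, squared mean gap over summed variances."""
    pos = [r[name] for r in rows if r["label"] == 1]
    neg = [r[name] for r in rows if r["label"] == 0]
    return (mean(pos) - mean(neg)) ** 2 / (pvariance(pos) + pvariance(neg))

def fit_stump(rows, name):
    """Classification: one-level decision tree that predicts 1 when the value exceeds a threshold."""
    values = sorted({r[name] for r in rows})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum((r[name] > threshold) != (r["label"] == 1) for r in rows)

    return min(candidates, key=errors)

rows = oversample([add_ratio(r) for r in records])
features = ["C8", "C10", "C8/C10"]
best = max(features, key=lambda f: fisher_score(rows, f))  # highest-scoring feature
threshold = fit_stump(rows, best)

def predict(value):
    """Apply the fitted stump to a single value of the selected feature."""
    return int(value > threshold)

print(best, threshold, [predict(r[best]) for r in rows])
```

On this toy data the constructed ratio separates the two classes more cleanly than either raw analyte, which is the motivation usually given for feature construction. In practice, resampling should be applied to the training split only, so that duplicated cases do not leak into the evaluation data.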