TABLE 1.
Author | Disease | Data imbalance | Feature construction | Feature selection | ML classification |
---|---|---|---|---|---|
Baumgartner et al. 13 | PKU | Random sampling | | Information gain | DT, LRA |
Baumgartner et al. 14 | MCADD, PKU | Random sampling | | Information gain, relief‐based | LDA, DT, KNN, LRA, NN, SVM |
Baumgartner et al. 15 | 3‐MCCD*, MCADD, PKU | Random sampling | Diagnostic flag | DT, LRA | |
Baumgartner et al. 16 | 3‐MCCD*, PKU, GA1, MMA, PA, MCADD, LCHADD | Random sampling | | Discriminatory threshold | KNN, LRA, Naive Bayes, NN, SVM |
Ho et al. 12 | MCADD | Informed sampling | Arithmetic ratio | χ² | Rule learner |
Hsieh et al. 17 | MMA | | | Pearson coefficient | SVM |
Hsieh et al. 18 | MMA | Random sampling | | Pearson coefficient | SVM |
Van den Bulcke et al. 19 | MCADD | Oversampling | Arithmetic ratio | Variable set optimization | DT, LRA, Ridge‐LRA |
Chen et al. 20 | PKU | | | Fisher score | SVM |
Chen et al. 21 | 3‐MCCD*, PKU, MET | | Arithmetic ratio | Fisher score, Variable set optimization | SVM |
Lin et al. 7 | CIT1, CIT2, CPT1D, GA1, IBDD, IVA, MADD, MET, MMA, MSUD, PA, PKU, PTPSD, SCADD*, VLCADD | Random sampling, oversampling, informed sampling | | χ², ANOVA, mutual information, L1‐norm, tree‐based | Bagging, Boosting, DT, KNN, LDA, LRA, RF, SVM |
Peng et al. 22 | MMA | Oversampling | | | RF |
Wang et al. 23 | SCADD*, MCADD, VLCADD | | Arithmetic ratio | Discriminatory threshold | LRA |
Zarin Mousavi et al. 24 | CH | | | χ², information gain, expert consultation | Bagging, Boosting, DT, NN, SVM |
Peng et al. 25 | GA1, MMA, OTCD, VLCADD | Second tier | RF | ||
Zhu et al. 26 | PKU | | Arithmetic ratio | Pearson coefficient, LVQ | LRA |
Lasarev et al. 27 | CAH | Informed sampling | PCA | DT |
Note: Conditions marked with * are biochemical variants now regarded as nondiseases.
Abbreviations: CAH, congenital adrenal hyperplasia; CH, congenital hypothyroidism; CIT1, citrullinemia type I; CIT2, citrullinemia type II; CPT1D, carnitine palmitoyltransferase I deficiency; DT, decision tree; GA1, glutaric aciduria type I; IBDD, isobutyryl‐CoA dehydrogenase deficiency; IVA, isovaleric aciduria; KNN, K‐nearest neighbors; LCHADD, long‐chain 3‐hydroxyacyl‐CoA dehydrogenase deficiency; LDA, linear discriminant analysis; LRA, logistic regression analysis; LVQ, learning vector quantization; MADD, multiple acyl‐CoA dehydrogenase deficiency; MCADD, medium‐chain acyl‐CoA dehydrogenase deficiency; 3‐MCCD, 3‐methylcrotonyl‐CoA carboxylase deficiency; MET, hypermethioninemia; MMA, methylmalonic aciduria; MSUD, maple syrup urine disease; NN, neural network; OTCD, ornithine transcarbamylase deficiency; PA, propionic aciduria; PCA, principal component analysis; PKU, phenylketonuria; PTPSD, 6‐pyruvoyl‐tetrahydrobiopterin synthetase deficiency; RF, random forest; Ridge‐LRA, logistic ridge regression; SCADD, short‐chain acyl‐CoA dehydrogenase deficiency; SVM, support vector machine; VLCADD, very long‐chain acyl‐CoA dehydrogenase deficiency.