Table 1.
# | Year | PMIDs | Diseases | Study objectives | Machine learning and statistical algorithms | Statistical tools, packages and environments | Raw data | Processed data | Secondary, statistical, and downstream data analysis | Subjects | Single/multi-omics | Databases | Ethics |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2017 | PMID: 28795970 | IBD | Classification and prioritization of genes to detect new genes connected to IBD | RF, SVM, Extreme gradient boosting (xgbTree) and Elastic net regularized generalized linear model (glmnet) | Benjamini–Hochberg, Bonferroni correction, one-sided Mann–Whitney U test, Fisher’s exact test | RNA-seq | Gene expression data | Gene Ontology (GO) Enrichment analysis (terms and pathway) | 513 (180 CD, 149 UC, 94 colorectal neoplasms, 90 control) | Single omics (transcriptomic) | GEO, GWAS, ClinVar database, MSigDB, KEGG, Pathway interaction database | N/A |
2 | 2019 | PMID: 31270349 | SLE | (1) Stratification of subjects into active and inactive SLE states using raw data, and testing the potential of the raw data to stratify, (2) identification of the best classifier or classifiers and (3) identification of the combinations of variables that best enable classification | GLM, k-nearest neighbors (K-NN), and RF | False discovery rate, Adjusted Rand Index, Rank–rank hypergeometric overlap | RNA-seq | Gene expression data | DEG analysis -> enrichment analysis (WGCNA), GO analysis and GSVA | N/A | Single omics (transcriptomic) | GEO | N/A |
3 | 2017 | PMID: 28269885 | CD | To estimate the predictive performance of three different techniques as classifiers to identify extra-intestinal manifestations in CD and compare them with the existing method | NB, BART and Bayesian networks (BN) implemented using a Greedy Thick Thinning algorithm (learning) and an EM algorithm (learning conditional probabilities) | Bayesian methods: NB, BART and Bayesian networks; Greedy Thick Thinning algorithm; EM | WGS | Variant data (SNP) | Statistical analysis | 152 (75 extra-intestinal manifestation, 77 control) | Single omics (genomic) | Self-generated, dbSNP | N/A |
4 | 2019 | PMID: 31564248 | CD | Disease prediction model by using previously unknown disease genes | AVA,Dx, SVM | VQSR, ANNOVAR | WGS | Variant data (Exonic) | Annotation using ANNOVAR, pathway enrichment analysis (ConsensusPath database) | 2855 (2793 CD, 62 control) | Single omics (genomic) | European Genome-Phenome Archive, Genotype-Tissue Expression Project, PopGen Biobank | Ethically approved |
5 | 2018 | PMID: 30204480 | Obesity | (1) Stratification of individuals into obese and non-obese and evaluation of obesity risk. (2) Comparison of predictive performance of various models | SVM, k-nearest neighbor (K-NN), and DT; feature selection algorithms: stepwise MLR, DT and genetic algorithms | ANOVA | WGS | Variant data (SNP) | Genome-wide SNP and statistical analysis | 129 (74 obese, 65 control) | Single omics (genomic) | dbSNP | Ethically approved |
6 | 2019 | PMID: 31200905 | Colon cancer | Comparison of various machine learning algorithms for: (1) identification of differential genes of high risk using statistical tests, (2) prediction of cancer genes by using a ML strategy | LDA, QDA, NB, GPC, SVM, ANN, LR, DT, AB and RF | WCSRS, t test, Kruskal–Wallis (KW) and F-test | RNA-seq | Gene expression data | DEG analysis -> statistical analysis [WCSRS, t test, Kruskal–Wallis (KW) and F-test] | 62 (40 cancer, 22 control) | Single omics (transcriptomic) | Kent Ridge Biomedical Data Repository | N/A (no ethics approval is required for this dataset) |
7 | 2016 | PMID: 27587275 | Breast cancer | Classification, characterization and prediction of breast cancer using mutation profiles | NMF Clustering, RF, NB, C4.5, SVM and K-NN | Wilcoxon rank-sum, Benjamini–Hochberg (FDR) | WES | Variant data (Somatic and non-synonymous SNV) | DEG analysis -> enrichment analysis (Gene Set Enrichment Analysis (GSEA)), pathway analysis (Ingenuity Pathway Analysis) | 358 | Single omics (genomic) | TCGA | N/A |
8 | 2021 | PMID: 34332931 | Malignant pleural mesothelioma | Evaluation of risk scores and classification of the patients into low-risk and high-risk groups | OncoCast-MPM machine learning risk prediction model (elastic-net penalized Cox proportional hazard models) | χ² test in R, lasso-penalized Cox regression, Kaplan–Meier survival curves with log-rank testing, concordance probability estimate | WGS | Variant data | Risk stratification and statistical analysis | 194 | Single omics (genomic) | MSK-IMPACT, TCGA | N/A |
9 | 2018 | PMID: 29298978 | Acute myeloid leukemia (AML) | Identification of molecular gene expression markers for precise treatment of acute myeloid leukemia | MERGE (mutation, expression hubs, known regulators, genomic CNV, and methylation) | Pearson’s, Spearman, ElasticNet | RNA-seq | Gene expression data | Covariate, association and prioritized subset analysis | 30 | Single omics (transcriptomic) | GEO | Ethically approved |
10 | 2020 | PMID: 32318621 | Alzheimer’s disease (AD) | Identification of genomic markers for precise therapy of AD (Phase 2a clinical study) | FCA (unsupervised) | Integrated in the Knowledge Extraction and Management (KEM) environment | WES, RNA-seq | Variant and Gene expression data | Association rules and linear mixed effect (LME) model analysis | 32 | Multi-omics (genomic & transcriptomic) | Self-generated | N/A |
11 | 2021 | PMID: 34220416 | Major depressive disorder (MDD) | Identification of potential peripheral blood transcriptome biomarkers and development of an MDD prediction model using peripheral blood transcriptomes | SVM, RF, K-Nearest Neighbors (K-NN) and NB | Positive predictive value, Matthews correlation coefficient | RNA-seq | Gene expression data | DEG analysis -> statistical analysis, KEGG pathway enrichment analysis | 302 (9 different datasets) | Single omics (transcriptomic) | GEO | N/A |
12 | 2020 | PMID: 33099536 | Ulcerative colitis | Identification of susceptibility genes and development of a disease prediction model for ulcerative colitis | RF and ANN | Benjamini–Hochberg | RNA-seq | Gene expression data | DEG analysis -> GO enrichment analysis and KEGG enrichment analysis (Pathway) | 2 datasets | Single omics (transcriptomic) | GEO, GO, KEGG | N/A |
13 | 2021 | PMID: 33645908 | Major depressive disorder (MDD) | Stratification of MDD patients and control samples, and understanding pathophysiology | XGBoost (eXtreme Gradient Boosting implementation) | N/A | RNA-seq | Gene expression data | DEG analysis -> GO enrichment analysis (GSA) | 390 (314 MDD, 76 control) | Single omics (transcriptomic) | dbGaP, GEO | Ethically approved |
14 | 2019 | PMID: 29704323 | Schizophrenia (SCZ) | Prediction of high-risk individuals (5090 exomes) | eXtreme Gradient Boosting implementation (XGBoost), L1-regularized (lasso) logistic regression, SVM and RF | ANNOVAR, Pearson correlation | WES | SNV (Variant data), DNM (Mutation-small insertions and deletions) | Annotated with ANNOVAR. KEGG enrichment analysis (Pathway) | 5090 (2545 SCZ, 2545 control) | Single omics (genomic) | dbGaP | Ethically approved |
15 | 2020 | PMID: 32111185 | Schizophrenia (SCZ) and Autism spectrum disorder (ASD) | Comparison of the architecture of the genomes of SCZ and ASD. (1) To identify if SCZ and ASD patients can be differentiated just based on supervised learning analysis from WES data. (2) Prioritization of genetic features by supervised learning algorithm and identification of central hub genes using unsupervised clustering | Regularized GBM (XGBoost implementation) and unsupervised hierarchical clustering | N/A | WES | Variant data | GO Enrichment analysis (Pathway) | 2392 | Single omics (genomic) | dbGaP, NDAR | N/A |
16 | 2021 | PMID: 34199109 | Ovarian failure (OF) | Identification of blood-based gene variant profiles for precise treatment | Clustering and RF | Shapiro–Wilk test, Wilcoxon and Fisher | WES | Variant data (non-synonymous rare variants) | Annotated with SnpEff software | 150 (118 OF, 32 controls) | Single omics (genomic) | IGSR, Self-generated, gnomAD, dbSNP, GeneCards, UniProt, Gene Ontology | Ethically approved |
17 | 2020 | PMID: 33109206 | Premature ovarian failure (POF) | Identification of novel and candidate variants associated with premature ovarian failure | Bioinformatics analysis, Variant Effect Scoring Tool and CADD | CADD, VEST | WES | Variant data (SNV, InDel variants) | Annotated with MAF (ExAC/gnomAD/1000G/KRGDB) and pathogenicity scores (CADD, VEST) | 44 (34 POF, 10 controls) | Single omics (genomic) | DisGeNET, Monarch, MalaCards, NCBI, UCSC Genome Browser, VarSome, NCBI SRA | Ethically approved |
18 | 2016 | PMID: 27980626 | Hypertension | Disease model for predicting disease risk from genotypes, utilizing gene expression and rare variant data | Radial and linear SVM and LR | N/A | WGS and Microarray | Gene expression data and Variant data (rare variants) | Supervised machine learning methods (MLMs) analysis | N/A | Multi-omics (genomic & transcriptomic) | Genetic Analysis Workshop 19 (GAW19) | N/A |
19 | 2019 | PMID: 30462833 | Risk of Illness | Evaluation of microbial risk assessment (MRA) | RF, SVM, Neural Networks (NN), Gradient boosting (GBM), and Logit Boost (LB) | N/A | WGS | Variant data | Microbial risk and pathway analysis | 245 strains | Single omics (genomic) | WGS (Leekitcharoenphon, Nielsen, Kaas, Lund, & Aarestrup, 2014; Pielaat et al., 2013) and phenotypic data (Pielaat et al., 2013) | N/A |
20 | 2021 | PMID: 33999966 | Sepsis | Classification of individuals and comparison of different algorithms | DT, RF, SVM and DNNs | Benjamini–Hochberg | Microarray | Gene expression data | DE, resilience and prediction analysis | 1786 (1354 sepsis, 86 SIRS, 346 control) | Single omics (transcriptomic) | NCBI, GEO and EMBL-EBI ArrayExpress | Ethically approved |
21 | 2021 | PMID: 33570011 | Prostate cancer | Classification and detection of prostate cancer in medical diagnosis (normal or tumor cases) | Random committee ensemble learning | CFS method | Microarray | Gene expression data | Statistical analysis | N/A | Single omics (transcriptomic) | The European Nucleotide Archive (ENA)-PRJEB19256 | N/A |
22 | 2021 | PMID: 34054599 | Autism | Identification of subgroups of patients | RF classification and SVM | Robust multi-array analysis, CPDB analysis | WGS | Gene expression data | DE analysis -> GO enrichment analysis and Pathway Analysis (KEGG, WikiPathways, BioCarta, and Reactome pathway database) and Transcriptome-Wide Association Analysis | 31 | Single omics (transcriptomic) | NCBI, GEO (U48705, M87338, X51757, X69699), KEGG, WikiPathways, BioCarta, and Reactome pathway database, GO Biological Process database, ConsensusPathDB | Ethically approved |
23 | 2021 | PMID: 33681364 | Cancer | Identification of tumor tissue of origin | GBDT, K-Nearest neighbor (K-NN), DT, AB and SVM | N/A | RNA-seq | Gene expression data | DEG analysis -> GO enrichment analysis (pathway) | 9 datasets | Single omics (transcriptomic) | ICGC Data Portal, GEO, TCGA | N/A |
24 | 2021 | PMID: 34343245 | Ovarian Cancer | Identification of combinational therapies for ovarian cancer | RF, Gradient boosting (GB) and XGBoost | Wilcoxon Test, PRISM, Spearman correlation, Pearson correlation, mean-squared error and mean absolute error | DSS, RNA-seq, WGS, scRNA-seq | Gene expression data | Predictive analysis | 4 datasets | Single omics (transcriptomic) | HERCULES project | N/A |
For each study, the table lists the study number, year of publication, PMID, disease, study objectives, machine learning algorithms applied, statistical tools and packages used, raw data used, processed data generated, secondary and downstream data analysis, subjects involved, single/multi-omics design, databases and ethics approval status.