Brief Bioinform. 2022 May 21;23(5):bbac191. doi: 10.1093/bib/bbac191

Table 1.

Comparative analysis of AI/ML approaches using gene expression and gene variant data

| # | Year | PMID | Disease | Study objectives | Machine learning and statistical algorithms | Statistical tools, packages and environments | Raw data | Processed data | Secondary, statistical and downstream data analysis | Subjects | Single/multi-omics | Databases | Ethics |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2017 | PMID: 28795970 | IBD | Classification and prioritization of genes to detect new genes connected to IBD | RF, SVM, extreme gradient boosting (xgbTree) and elastic net regularized generalized linear model (glmnet) | Benjamini–Hochberg, Bonferroni correction, one-sided Mann–Whitney U test, Fisher’s exact test | RNA-seq | Gene expression data | Gene Ontology (GO) enrichment analysis (terms and pathway) | 513 (180 CD, 149 UC, 94 colorectal neoplasms, 90 control) | Single omics (transcriptomic) | GEO, GWAS, ClinVar database, MSigDB, KEGG, Pathway Interaction Database | N/A |
| 2 | 2019 | PMID: 31270349 | SLE | (1) Stratification of subjects into active and inactive SLE states using raw data, and testing its potential to stratify; (2) identification of the best classifier(s); (3) identification of the combinations of variables that best enable classification | GLM, k-nearest neighbors (K-NN) and RF | False discovery rate, Adjusted Rand Index, rank–rank hypergeometric overlap | RNA-seq | Gene expression data | DEG analysis → enrichment analysis (WGCNA), GO analysis and GSVA | N/A | Single omics (transcriptomic) | GEO | N/A |
| 3 | 2017 | PMID: 28269885 | CD | To estimate the predictive performance of three different techniques as classifiers to identify extra-intestinal manifestations in CD and compare them with the existing method | NB, BART and Bayesian networks (BN) implemented using a Greedy Thick Thinning algorithm (learning) and an EM algorithm (learning conditional probabilities) | Bayesian methods: NB, BART and Bayesian networks; Greedy Thick Thinning algorithm; EM | WGS | Variant data (SNP) | Statistical analysis | 152 (75 extra-intestinal manifestation, 77 control) | Single omics (genomic) | Self-generated, dbSNP | N/A |
| 4 | 2019 | PMID: 31564248 | CD | Disease prediction model using previously unknown disease genes | AVA,Dx, SVM | VQSR, ANNOVAR | WGS | Variant data (exonic) | Annotation using ANNOVAR, pathway enrichment analysis (ConsensusPath database) | 2855 (2793 CD, 62 control) | Single omics (genomic) | European Genome-Phenome Archive, Genotype-Tissue Expression Project, PopGen Biobank | Ethically approved |
| 5 | 2018 | PMID: 30204480 | Obesity | (1) Stratification of individuals into obese and non-obese and evaluation of obesity risk; (2) comparison of the predictive performance of various models | SVM, k-nearest neighbors (K-NN) and DT | Feature selection algorithms (stepwise MLR, DT and genetic algorithms), ANOVA | WGS | Variant data (SNP) | Genome-wide SNP and statistical analysis | 129 (74 obese, 65 control) | Single omics (genomic) | dbSNP | Ethically approved |
| 6 | 2019 | PMID: 31200905 | Colon cancer | Comparison of various machine learning algorithms for (1) identification of high-risk differential genes using statistical tests and (2) prediction of cancer genes using an ML strategy | LDA, QDA, NB, GPC, SVM, ANN, LR, DT, AB and RF | WCSRS, t test, Kruskal–Wallis (KW) and F-test | RNA-seq | Gene expression data | DEG analysis → statistical tests [WCSRS, t test, Kruskal–Wallis (KW) and F-test] | 62 (40 cancer, 22 control) | Single omics (transcriptomic) | Kent Ridge Biomedical Data Repository | N/A (no ethics approval is required for this dataset) |
| 7 | 2016 | PMID: 27587275 | Breast cancer | Classification, characterization and prediction of breast cancer using mutation profiles | NMF clustering, RF, NB, C4.5, SVM and K-NN | Wilcoxon rank-sum, Benjamini–Hochberg (FDR) | WES | Variant data (somatic and non-synonymous SNV) | DEG analysis → enrichment analysis (Gene Set Enrichment Analysis, GSEA), pathway analysis (Ingenuity Pathway Analysis) | 358 | Single omics (genomic) | TCGA | N/A |
| 8 | 2021 | PMID: 34332931 | Malignant pleural mesothelioma | Evaluation of risk scores and classification of patients into low-risk and high-risk groups | OncoCast-MPM machine learning risk prediction model (elastic-net penalized Cox proportional hazard models) | χ² test in R, lasso-penalized Cox regression, Kaplan–Meier survival curves with log-rank testing, concordance probability estimate | WGS | Variant data | Risk stratification and statistical analysis | 194 | Single omics (genomic) | MSK-IMPACT, TCGA | N/A |
| 9 | 2018 | PMID: 29298978 | Acute myeloid leukemia (AML) | Identification of molecular gene expression markers for precise treatment of AML | MERGE (mutation, expression hubs, known regulators, genomic CNV and methylation) | Pearson’s, Spearman, ElasticNet | RNA-seq | Gene expression data | Covariate, association and prioritized subset analysis | 30 | Single omics (transcriptomic) | GEO | Ethically approved |
| 10 | 2020 | PMID: 32318621 | Alzheimer’s disease (AD) | Identification of genomic markers for precise therapy of AD (Phase 2a clinical study) | FCA (unsupervised) | Integrated in the Knowledge Extraction and Management (KEM) environment | WES, RNA-seq | Variant and gene expression data | Association rules and linear mixed effect (LME) model analysis | 32 | Multi-omics (genomic and transcriptomic) | Self-generated | N/A |
| 11 | 2021 | PMID: 34220416 | Major depressive disorder (MDD) | Identification of potential peripheral blood transcriptome biomarkers and development of an MDD prediction model using peripheral blood transcriptomes | SVM, RF, k-nearest neighbors (K-NN) and NB | Positive predictive value, Matthews correlation coefficient | RNA-seq | Gene expression data | DEG analysis → statistical analysis, KEGG pathway enrichment analysis | 302 (9 different datasets) | Single omics (transcriptomic) | GEO | N/A |
| 12 | 2020 | PMID: 33099536 | Ulcerative colitis | Identification of susceptibility genes and development of a disease prediction model for ulcerative colitis | RF and ANN | Benjamini–Hochberg | RNA-seq | Gene expression data | DEG analysis → GO enrichment analysis and KEGG enrichment analysis (pathway) | 2 datasets | Single omics (transcriptomic) | GEO, GO, KEGG | N/A |
| 13 | 2021 | PMID: 33645908 | Major depressive disorder (MDD) | Stratification of MDD patients and control samples, and understanding of pathophysiology | XGBoost (eXtreme Gradient Boosting implementation) | N/A | RNA-seq | Gene expression data | DEG analysis → GO enrichment analysis (GSA) | 390 (314 MDD, 76 control) | Single omics (transcriptomic) | dbGaP, GEO | Ethically approved |
| 14 | 2019 | PMID: 29704323 | Schizophrenia (SCZ) | Prediction of high-risk individuals (5090 exomes) | XGBoost (eXtreme Gradient Boosting implementation), L1 logistic regression (lasso-regularized), SVM and RF | ANNOVAR, Pearson correlation | WES | SNV (variant data), DNM (mutation: small insertions and deletions) | Annotation with ANNOVAR, KEGG enrichment analysis (pathway) | 5090 (2545 SCZ, 2545 control) | Single omics (genomic) | dbGaP | Ethically approved |
| 15 | 2020 | PMID: 32111185 | Schizophrenia (SCZ) and autism spectrum disorder (ASD) | Comparison of the genomic architecture of SCZ and ASD: (1) to determine whether SCZ and ASD patients can be differentiated by supervised learning on WES data alone; (2) prioritization of genetic features by a supervised learning algorithm and identification of central hub genes using unsupervised clustering | Regularized GBM (XGBoost implementation) and unsupervised hierarchical clustering | N/A | WES | Variant data | GO enrichment analysis (pathway) | 2392 | Single omics (genomic) | dbGaP, NDAR | N/A |
| 16 | 2021 | PMID: 34199109 | Ovarian failure (OF) | Identification of blood-based gene variant profiles for precise treatment | Clustering and RF | Shapiro–Wilk test, Wilcoxon and Fisher | WES | Variant data (non-synonymous rare variants) | Annotation with SnpEff software | 150 (118 OF, 32 control) | Single omics (genomic) | IGSR, self-generated, gnomAD, dbSNP, GeneCards, UniProt, Gene Ontology | Ethically approved |
| 17 | 2020 | PMID: 33109206 | Premature ovarian failure (POF) | Identification of novel and candidate variants associated with POF | Bioinformatics analysis, Variant Effect Scoring Tool (VEST) and CADD | CADD, VEST | WES | Variant data (SNV, InDel variants) | Annotation with MAF (ExAC/gnomAD/1000G/KRGDB) and pathogenicity scores (CADD, VEST) | 44 (34 POF, 10 control) | Single omics (genomic) | DisGeNET, Monarch, MalaCards, NCBI, UCSC Genome Browser, VarSome, NCBI SRA | Ethically approved |
| 18 | 2016 | PMID: 27980626 | Hypertension | Disease model for predicting disease risk from genotypes, using gene expression and rare variant data | Radial and linear SVM and LR | N/A | WGS and microarray | Gene expression data and variant data (rare variants) | Supervised machine learning methods (MLMs) analysis | N/A | Multi-omics (genomic and transcriptomic) | Genetic Analysis Workshop 19 (GAW19) | N/A |
| 19 | 2019 | PMID: 30462833 | Risk of illness | Evaluation of microbial risk assessment (MRA) | RF, SVM, neural networks (NN), gradient boosting (GBM) and LogitBoost (LB) | N/A | WGS | Variant data | Microbial risk and pathway analysis | 245 strains | Single omics (genomic) | WGS (Leekitcharoenphon, Nielsen, Kaas, Lund, & Aarestrup, 2014; Pielaat et al., 2013) and phenotypic data (Pielaat et al., 2013) | N/A |
| 20 | 2021 | PMID: 33999966 | Sepsis | Classification of individuals and comparison of different algorithms | DT, RF, SVM and DNNs | Benjamini–Hochberg | Microarray | Gene expression data | DE, resilience and prediction analysis | 1786 (1354 sepsis, 86 SIRS, 346 control) | Single omics (transcriptomic) | NCBI, GEO and EMBL-EBI ArrayExpress | Ethically approved |
| 21 | 2021 | PMID: 33570011 | Prostate cancer | Classification and detection of prostate cancer in medical diagnosis (normal or tumor cases) | Random committee ensemble learning | CFS method | Microarray | Gene expression data | Statistical analysis | N/A | Single omics (transcriptomic) | European Nucleotide Archive (ENA), PRJEB19256 | N/A |
| 22 | 2021 | PMID: 34054599 | Autism | Identification of subgroups of patients | RF classification and SVM | Robust multi-array analysis, CPDB analysis | WGS | Gene expression data | DE analysis → GO enrichment analysis and pathway analysis (KEGG, WikiPathways, BioCarta and Reactome pathway databases) and transcriptome-wide association analysis | 31 | Single omics (transcriptomic) | NCBI, GEO (U48705, M87338, X51757, X69699), KEGG, WikiPathways, BioCarta and Reactome pathway databases, GO Biological Process database, ConsensusPathDB | Ethically approved |
| 23 | 2021 | PMID: 33681364 | Cancer | Identification of tumor tissue of origin | GBDT, k-nearest neighbors (K-NN), DT, AB and SVM | N/A | RNA-seq | Gene expression data | DEG analysis → GO enrichment analysis (pathway) | 9 datasets | Single omics (transcriptomic) | ICGC Data Portal, GEO, TCGA | N/A |
| 24 | 2021 | PMID: 34343245 | Ovarian cancer | Identification of combinational therapies for ovarian cancer | RF, gradient boosting (GB) and XGBoost | Wilcoxon test, PRISM, Spearman and Pearson correlation, mean squared error and mean absolute error | DSS, RNA-seq, WGS, scRNA-seq | Gene expression data | Predictive analysis | 4 datasets | Single omics (transcriptomic) | HERCULES project | N/A |

For each study, the table lists the study number, year of publication, PMID, disease, study objectives, machine learning algorithms applied, statistical tools and packages used, raw data used, processed data generated, secondary and downstream data analysis, subjects involved, single/multi-omics, databases and ethics.
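
Several studies in Table 1 list Benjamini–Hochberg correction among their statistical tools. As a rough illustration of what that step does, the following is a minimal pure-Python sketch of the Benjamini–Hochberg adjustment. It is not the code used in any of the listed studies; the function names `benjamini_hochberg` and `significant` and the example p-values are assumptions made for illustration only.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Illustrative sketch only; names and interface are hypothetical.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values stay monotone and never exceed 1.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


def significant(p_values, alpha=0.05):
    """Flag tests whose adjusted p-value is at or below alpha."""
    return [q <= alpha for q in benjamini_hochberg(p_values)]


if __name__ == "__main__":
    # Made-up p-values, e.g. one per gene in a differential expression test.
    example = [0.01, 0.04, 0.03, 0.005]
    print(benjamini_hochberg(example))
    print(significant(example, alpha=0.03))
```

In practice an established routine would normally be used instead, for example `p.adjust(p, method = "BH")` in R or `multipletests(p, method="fdr_bh")` from `statsmodels.stats.multitest` in Python.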