2022 May 21;23(5):bbac191. doi: 10.1093/bib/bbac191

Table 1.

Comparative analysis of AI/ML approaches using gene expression and gene variant data

#	Year	PMIDs	Diseases	Study objectives	Machine learning and statistical algorithms	Statistical tools, packages and environments	Raw data	Processed data	Secondary, statistical, and downstream data analysis	Subjects	Single/multi-omics	Databases	Ethics
1	2017	PMID: 28795970	IBD	Classification and prioritization of genes to detect new genes connected to IBD	RF, SVM, Extreme gradient boosting (xgbTree) and Elastic net regularized generalized linear model (glmnet)	Benjamini–Hochberg, Bonferroni correction, one-sided Mann–Whitney U test, Fisher’s exact test	RNA-seq	Gene expression data	Gene Ontology (GO) Enrichment analysis (terms and pathway)	513 (180 CD, 149 UC, 94 colorectal neoplasms, 90 control)	Single omics (transcriptomic)	GEO, GWAS, ClinVar database, MSigDB, KEGG, Pathway interaction database	N/A
2	2019	PMID: 31270349	SLE	(1) Stratification of subjects into active and inactive SLE states using raw data, and testing of its potential to stratify, (2) identification of the best classifier/classifiers and (3) identification of the combinations of variables that best enable classification	GLM, k-nearest neighbors (K-NN), and RF	False discovery rate, Adjusted Rand Index, Rank–rank hypergeometric overlap	RNA-seq	Gene expression data	DEG analysis -> Enrichment analysis (WGCNA), GO analysis and GSVA	N/A	Single omics (transcriptomic)	GEO	N/A
3	2017	PMID: 28269885	CD	To estimate the predictive performance of three different techniques as classifiers to identify extra-intestinal manifestations in CD and compare them with the existing method	NB, BART and Bayesian networks (BN) implemented using a Greedy Thick Thinning algorithm- learning, EM algorithm- learning conditional probabilities	Bayesian methods: NB, BART and Bayesian Networks, Greedy Thick Thinning algorithm; EM	WGS	Variant data (SNP)	Statistical analysis	152 (75 Extra-intestinal manifestation, 77 control)	Single omics (genomic)	Self-generated, dbSNP	N/A
4	2019	PMID: 31564248	CD	Disease prediction model by using previously unknown disease genes	AVA,Dx, SVM	VQSR, ANNOVAR	WGS	Variant data (Exonic)	Annotation using ANNOVAR, pathway enrichment analysis (ConsensusPath database)	2855 (2793 CD, 62 control)	Single omics (genomic)	European Genome-Phenome Archive, Genotype-Tissue Expression Project, PopGen Biobank	Ethically approved
5	2018	PMID: 30204480	Obesity	(1) Stratification of individuals into obese and non-obese and evaluation of obesity risk. (2) Comparison of predictive performance of various models	SVM, k-nearest neighbor (K-NN), and DT. Feature selection algorithms: stepwise MLR, DT and genetic algorithms	ANOVA	WGS	Variant data (SNP)	Genome-wide SNP and statistical analysis	129 (74 obese, 65 control)	Single omics (genomic)	dbSNP	Ethically approved
6	2019	PMID: 31200905	Colon cancer	Comparison of various machine learning algorithms for: (1) identification of differential genes of high risk using statistical tests, (2) prediction of cancer genes by using a ML strategy	LDA, QDA, NB, GPC, SVM, ANN, LR, DT, AB and RF	WCSRS, t test, Kruskal–Wallis (KW) and F-test	RNA-seq	Gene expression data	DEG analysis -> Statistical [WCSRS, t test, Kruskal–Wallis (KW) and F-test]	62 (40 cancer, 22 control)	Single omics (transcriptomic)	Kent ridge biomedical data repository	N/A (no ethics approval is required for this dataset)
7	2016	PMID: 27587275	Breast cancer	Classification, characterization and prediction of breast cancer using mutation profiles	NMF Clustering, RF, NB, C4.5, SVM and K-NN	Wilcoxon rank-sum, Benjamini–Hochberg (FDR)	WES	Variant data (Somatic and non-synonymous SNV)	DEG analysis -> Enrichment analysis (Gene Set Enrichment Analysis (GSEA)), pathway analysis (Ingenuity Pathway Analysis)	358	Single omics (genomic)	TCGA	N/A
8	2021	PMID: 34332931	Malignant pleural mesothelioma	Evaluation of risk scores and classification of the patients into low-risk and high-risk groups	OncoCast-MPM machine learning risk prediction model (elastic-net penalized Cox proportional hazard models)	χ² test in R, lasso-penalized Cox regression, Kaplan–Meier survival curves with log-rank testing, concordance probability estimate	WGS	Variant data	Risk stratification and statistical analysis	194	Single omics (genomic)	MSK-IMPACT, TCGA	N/A
9	2018	PMID: 29298978	Acute myeloid leukemia (AML)	Identification of molecular gene expression markers for precise treatment of acute myeloid leukemia	MERGE (mutation, expression hubs, known regulators, genomic CNV, and methylation)	Pearson’s, Spearman, ElasticNet	RNA-seq	Gene expression data	Covariate, association and prioritized subset analysis	30	Single omics (transcriptomic)	GEO	Ethically approved
10	2020	PMID: 32318621	Alzheimer’s disease (AD)	Identification of genomic markers for precise therapy of AD (Phase 2a clinical study)	FCA - unsupervised	Integrated in the Knowledge Extraction and Management (KEM) environment	WES, RNA-seq	Variant and Gene expression data	Association rules and linear mixed effect (LME) model analysis	32	Multi-omics (genomic & transcriptomic)	Self-generated	N/A
11	2021	PMID: 34220416	Major depressive disorder (MDD)	Identification of potential peripheral blood transcriptome biomarkers and development of a MDD prediction model using peripheral blood transcriptomes	SVM, RF, K-Nearest Neighbors (K-NN) and NB	Positive predictive value, Matthews correlation coefficient	RNA-seq	Gene expression data	DEG analysis -> Statistical, KEGG pathway enrichment analysis	302 (9 different datasets)	Single omics (transcriptomic)	GEO	N/A
12	2020	PMID: 33099536	Ulcerative colitis	Identification of susceptibility genes and development of a disease prediction model for ulcerative colitis	RF and ANN	Benjamini–Hochberg	RNA-seq	Gene expression data	DEG analysis -> GO enrichment analysis and KEGG enrichment analysis (Pathway)	2 datasets	Single omics (transcriptomic)	GEO, GO, KEGG	N/A
13	2021	PMID: 33645908	Major depressive disorder (MDD)	Stratification of MDD patients and control samples, and understanding pathophysiology	XGBoost (eXtreme Gradient Boosting implementation)	N/A	RNA-seq	Gene expression data	DEG analysis -> GO enrichment analysis (GSA)	390 (314 MDD, 76 control)	Single omics (transcriptomic)	dbGaP, GEO	Ethically approved
14	2019	PMID: 29704323	Schizophrenia (SCZ)	Prediction of high-risk individuals (5090 exomes)	eXtreme Gradient Boosting implementation (XGBoost), L1 logistic regression (lasso regularized), SVM and RF	ANNOVAR, Pearson correlation	WES	SNV (Variant data), DNM (Mutation: small insertions and deletions)	Annotated with ANNOVAR. KEGG enrichment analysis (Pathway)	5090 (2545 SCZ, 2545 control)	Single omics (genomic)	dbGaP	Ethically approved
15	2020	PMID: 32111185	Schizophrenia (SCZ) and Autism spectrum disorder (ASD)	Comparison of the architecture of the genomes of SCZ and ASD. (1) To identify if SCZ and ASD patients can be differentiated just based on supervised learning analysis from WES data. (2) Prioritization of genetic features by supervised learning algorithm and identification of central hub genes using unsupervised clustering	Regularized GBM (XGBoost implementation) and unsupervised hierarchical clustering	N/A	WES	Variant data	GO Enrichment analysis (Pathway)	2392	Single omics (genomic)	dbGaP, NDAR	N/A
16	2021	PMID: 34199109	Ovarian failure (OF)	Identification of blood-based gene variant profiles for precise treatment	Clustering and RF	Shapiro–Wilk test, Wilcoxon and Fisher	WES	Variant data (non-synonymous rare variants)	Annotated with SnpEff software	150 (118 OF, 32 controls)	Single omics (genomic)	IGSR, Self-generated, gnomAD, dbSNP, GeneCards, UniProt, Gene Ontology	Ethically approved
17	2020	PMID: 33109206	Premature ovarian failure (POF)	Identification of novel and candidate variants associated with premature ovarian failure	Bioinformatics analysis, Variant Effect Scoring Tool and CADD	CADD, VEST	WES	Variant data (SNV, InDel variants)	Annotated with MAF ExAC/ gnomAD/ 1000G/ KRGDB and pathogenicity score: CADD, VEST	44 (34 POF, 10 controls)	Single omics (genomic)	DisGeNET, Monarch, MalaCards, NCBI, UCSC Genome Browser, VarSome, NCBI SRA	Ethically approved
18	2016	PMID: 27980626	Hypertension	Disease model for predicting disease risk from genotypes, utilizing gene expression and rare variant data	Radial and linear SVM and LR	N/A	WGS and Microarray	Gene expression data and Variant data (rare variants)	Supervised machine learning methods (MLMs) analysis	N/A	Multi-omics (genomic & transcriptomic)	Genetic Analysis Workshop 19 (GAW19)	N/A
19	2019	PMID: 30462833	Risk of Illness	Evaluation of microbial risk assessment (MRA)	RF, SVM, Neural Networks (NN), Gradient boosting (GBM), and Logit Boost (LB)	N/A	WGS	Variant data	Microbial risk and pathway analysis	245 strains	Single omics (genomic)	WGS (Leekitcharoenphon, Nielsen, Kaas, Lund, & Aarestrup, 2014; Pielaat et al., 2013) and phenotypic data (Pielaat et al., 2013)	N/A
20	2021	PMID: 33999966	Sepsis	Classification of individuals and comparison of different algorithms	DT, RF, SVM and DNNs	Benjamini–Hochberg	Microarray	Gene expression data	DE, resilience and prediction analysis	1786 (1354 sepsis, 86 SIRS, 346 control)	Single omics (transcriptomic)	NCBI, GEO and EMBL-EBI ArrayExpress	Ethically approved
21	2021	PMID: 33570011	Prostate cancer	Classification and detection of prostate cancer in medical diagnosis (normal or tumor cases)	Random committee ensemble learning	CFS method	Microarray	Gene expression data	Statistical analysis	N/A	Single omics (transcriptomic)	The European Nucleotide Archive (ENA)-PRJEB19256	N/A
22	2021	PMID: 34054599	Autism	Identification of subgroups of patients	RF classification and SVM	Robust multi-array analysis, CPDB analysis	WGS	Gene expression data	DE analysis -> GO enrichment analysis and Pathway Analysis (KEGG, WikiPathways, BioCarta, and Reactome pathway database) and Transcriptome-Wide Association Analysis	31	Single omics (transcriptomic)	NCBI, GEO (U48705, M87338, X51757, X69699), KEGG, WikiPathways, BioCarta, and Reactome pathway database. GO Biological Process database, ConsensusPathDB	Ethically approved
23	2021	PMID: 33681364	Cancer	Identification of tumor tissue of origin	GBDT, K-Nearest neighbor (K-NN), DT, AB and SVM	N/A	RNA-seq	Gene expression data	DEG analysis -> GO enrichment analysis (pathway)	9 datasets	Single omics (transcriptomic)	ICGC Data Portal, GEO, TCGA	N/A
24	2021	PMID: 34343245	Ovarian Cancer	Identification of combinational therapies for ovarian cancer	RF, Gradient boosting (GB) and XGBoost	Wilcoxon Test, PRISM, Spearman and Pearson correlation, mean-squared error and mean absolute error	DSS, RNA-seq, WGS, scRNA-seq	Gene expression data	Predictive analysis	4 datasets	Single omics (transcriptomic)	HERCULES project	N/A

For each study, the table lists the study number, year published, PMID, disease, study objectives, machine learning algorithms applied, statistical tools and packages used, raw data used, processed data generated, secondary and downstream data analysis, subjects involved, single/multi-omics, databases and ethics.