Table 1.
Representative problems in four bioinformatics areas, and the methods that address them by combining machine learning (ML) with bioinformatics tools.
Bioinformatics Area | Problem Category | Goal | ML Method | Bioinformatics Tools |
---|---|---|---|---|
Molecular evolution | Biological sequence clustering | Protein family prediction | CNN | Clusters of Orthologous Groups (COGs) and G protein-coupled receptor (GPCR) dataset [30] |
Molecular evolution | Biological sequence clustering | Protein function prediction | Deep RNN | BLAST and HMMER search [32] |
Molecular evolution | Biological sequence clustering | Anti-CRISPR protein identification | Random forest | MSA and PSI-BLAST [24] |
Molecular evolution | Biological sequence clustering | Anti-CRISPR protein identification | Extreme gradient boosting (XGBoost) | K-mer-based clustering (CD-HIT), BLAST [25] |
Molecular evolution | Biological sequence clustering | Viral pathogenicity feature identification | SVM | MSA, phylogenetic tree construction [20,21] |
Molecular evolution | Alignment-free biological sequence analysis | Identification of viral genomes | RNN | BLAST, sequence clustering, HHPRED [27] |
Molecular evolution | Alignment-free biological sequence analysis | Identification of viral genomes | CNN | BLAST [28] |
Protein structure analysis | Post-translational modifications | Phosphorylation site prediction | KNN | Local sequence similarity [53] |
Protein structure analysis | Post-translational modifications | Phosphorylation site prediction | CNN | K-mer-based clustering (CD-HIT), BLAST [55] |
Protein structure analysis | Post-translational modifications | Glycosylation site prediction | Ensemble SVM | Curated glycosylated protein database (O-GLYCBASE) [54] |
Protein structure analysis | Protein structure prediction | Protein contact prediction | CNN | MSA [72] |
Protein structure analysis | Protein structure prediction | Prediction of distances between pairs of residues | CNN | MSA, HHPRED, PSI-BLAST [77] |
Systems biology | Inference of biological networks | Gene regulatory network prediction | SVM | GeneNetWeaver, RegulonDB [81] |
Systems biology | Inference of biological networks | Protein-protein interaction network prediction | SVM | Domain affinity and frequency tables [90] |
Systems biology | Inference of biological networks | Protein-protein interaction network prediction | Elastic-net regression | Protein descriptors [91] |
Systems biology | Analysis of biological networks | Drug target prediction | K-means | Network analysis tools [98] |
Systems biology | Analysis of biological networks | Drug side effect prediction | SVM | Genome-scale metabolic modeling [112] |
Systems biology | Analysis of biological networks | Drug synergism prediction | Random forest ensemble | Chemical-genetic interaction matrix [117] |
Systems biology | Multi-omics integration | Cancer subtype prediction | Neighborhood-based clustering | Similarity-based integration [141] |
Systems biology | Multi-omics integration | Drug response prediction | Logistic regression | Cancer hallmarks datasets, pathway data [144] |
Biomarker analysis for disease research | Disease-associated gene investigation | Pulmonary sarcoidosis gene identification | Hierarchical clustering | Differential expression analysis [150] |
Biomarker analysis for disease research | Disease-associated gene investigation | Identification of miRNA-disease associations | NMF | Disease semantic information and miRNA functional information [151] |
Biomarker analysis for disease research | Disease-associated gene investigation | Disease-phenotype visualization | t-SNE | OMIM database and human disease networks [154] |
Biomarker analysis for disease research | Biomarker discovery | Cancer diagnosis | SVM | Reference gene selection [170] |
Biomarker analysis for disease research | Biomarker discovery | Biomarker signature identification | SVM | Network-based gene selection [167] |
Biomarker analysis for disease research | Biomarker discovery | Cancer outcome prediction | Random forest | Evolutionary conservation estimation [181] |