Table 4.
Supervised ML method | Description | Strengths | Weaknesses | Application examples |
---|---|---|---|---|
Linear Regression | Regression of data toward its mean value; best-fit line (mean pattern of the dataset); gradient descent; least-squares function; continuous output; normality assumption; linearity assumption | Computationally inexpensive; weighted-sum prediction; reduces a complex dataset to a single function; less prone to overfitting | Reduces a larger, complex dataset to a single function; the assumption of a linear relationship is seldom applicable; does not distinguish outliers, which might bias the regression | Prediction and evolutionary information analysis of protein structure [137]; prediction and evolutionary information analysis of protein solvent accessibility [138]; gene expression inference [139]; genotype prediction based on single nucleotide polymorphisms [140]; prediction of protein secondary structure [141] |
Logistic Regression | Extension of linear regression; logistic line fitting; probability modelling; non-linearity acceptance | Probability-based classification (rather than final classifications); fast training; extension to multiclass classification; less prone to overfitting | Complex multiplicative weighted function; complete separation of classes; unrepresentative for classes that overlap heavily | Cellular phenotype classification based on gene expression profiles [142]; gene selection [143]; disease classification from microarray data [144]; molecular classification of cancer [145] |
Support Vector Machine | Hyperplane-based data classification; class separation in higher dimensions; kernel transformation | Easy to implement for well-defined classification categories; effective in high-dimensional spaces; accepts non-linear input | Not suitable for overlapping classes; can be prone to overfitting when the number of features exceeds the number of samples; no probabilistic explanation for classification | Classification of gene functional annotations from a combination of protein sequence and structure data [146]; cancer classification from gene expression [147]; protein subcellular classification prediction [148]; structural classification of proteins [149] |
Naïve Bayes | Probabilistic classification; Bayes' theorem; conditional independence between variables; mostly used for classification | Reduced risk of overfitting on small datasets; probabilistic classification; fast training; computationally inexpensive; scales linearly | Does not incorporate feature interactions; performance is sensitive to skewed data; requires the assumption that variables are conditionally independent | MicroRNA target prediction [150]; prediction of protein interaction sites [151]; prediction of protein coupling specificity [152] |
Decision Tree Classifier & Random Forest | Classification or regression modelling; parameter-based data splitting on the variable with the highest information gain; data entropy; information gain theory; Gini coefficient | Easy interpretation and analysis; valuable on smaller datasets; applicable to multiclass classification; white-box model highlighting the classification pattern; easily assembled | Tendency to overfit; lack of linear smoothness | Prediction of microorganism growth temperatures and enzyme catalytic optima [153]; protein structure prediction from enzymatic turnovers [129]; microbial genome prediction [154]; MS cancer data classification [155]; gene selection for cancer identification [156]; human protein function prediction [157]; prediction of protein interaction [158] |
k-NN | Classification and regression modelling; clustering classification; instance-based learning, i.e. lazy learning; parameter selection on a kernel basis; higher dimensions for clustering | Simple to implement; learns non-linear boundaries; robust to noise in input data | Inefficient on larger training datasets; expensive computational cost; selection of the k value relies on mixed, unclear heuristics | Gene selection for sample classification based on gene expression data [159]; classification for cancer diagnosis [160]; prediction of metabolic pathway dynamics [161] |
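
Three of the mechanisms named in Table 4 can be illustrated with a minimal sketch in standard-library Python: the least-squares best-fit line (Linear Regression), entropy-based information gain for choosing a split (Decision Tree), and majority voting among nearest neighbours (k-NN). The function names and signatures below are hypothetical, chosen only for illustration; they are not taken from any of the cited works or from a particular library, and the code ignores practical concerns such as feature scaling, tie-breaking, and regularisation.

```python
import math
from collections import Counter


def least_squares_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, left, right):
    """Entropy reduction from splitting `labels` into `left` and `right`."""
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


def knn_predict(train, query, k=3):
    """Majority vote among the k training points nearest to `query`.

    `train` is a list of (feature_tuple, label) pairs (an assumed layout).
    Distances are Euclidean; no model is fitted ahead of time, which is
    why k-NN is described as lazy, instance-based learning.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

As a toy check, fitting the points (0, 1), (1, 3), (2, 5), (3, 7) with `least_squares_line` recovers a slope of 2 and an intercept of 1, and splitting the labels `["a", "a", "b", "b"]` into two pure halves gives an information gain of 1 bit. Note that `knn_predict` sorts the whole training set for every query, which reflects the computational cost listed as a weakness of k-NN in the table.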