2020 Oct 16;18:3287–3300. doi: 10.1016/j.csbj.2020.10.011

Table 4.

Supervised machine learning methods and examples of their application to bioinformatics.

Columns: Supervised ML method; Description; Strengths; Weaknesses; Application examples
Linear Regression
Description: Regression of data to its mean value; best-fit line (mean pattern of the dataset); gradient descent; least-squares function; continuous output; normality assumption; linearity assumption
Strengths: Computationally inexpensive; weighted-sum prediction; reduces a complex dataset to a single function; less prone to overfitting
Weaknesses: Reduces a larger complex dataset to a single function; the assumption of a linear relationship is seldom applicable; does not distinguish outliers, which might bias the regression
Application examples: Prediction and evolutionary information analysis of protein structure [137]; prediction and evolutionary information analysis of protein solvent accessibility [138]; gene expression inference [139]; genotype prediction based on single nucleotide polymorphisms [140]; prediction of protein secondary structure [141]

Logistic Regression
Description: Extension of linear regression; logistic curve fitting; probability modelling; acceptance of non-linearity
Strengths: Probability-based classification (class probabilities rather than only final classifications); fast training; extends to multiclass classification; less prone to overfitting
Weaknesses: Complex multiplicative weighted function; affected by complete separation of classes; unrepresentative for classes that overlap heavily
Application examples: Cellular phenotype classification based on gene expression profiles [142]; gene selection [143]; disease classification from microarray data [144]; molecular classification of cancer [145]

Support Vector Machine
Description: Hyperplane-based data classification; class separation in higher dimensions; kernel transformation
Strengths: Easy to implement for well-defined class categories; effective in high-dimensional spaces; accepts non-linear input
Weaknesses: Not suitable for overlapping classes; can be prone to overfitting when the number of features exceeds the number of samples; no probabilistic explanation for classification
Application examples: Classification of gene functional annotations from a combination of protein sequence and structure data [146]; cancer classification from gene expression [147]; protein subcellular classification prediction [148]; structural classification of proteins [149]

Naïve Bayes
Description: Probabilistic classification; Bayes' theorem; conditional independence between variables; mostly used for classification
Strengths: Reduced risk of overfitting on small datasets; probabilistic classification; fast training; computationally inexpensive; scales linearly
Weaknesses: Does not incorporate feature interactions; performance is sensitive to skewed data; requires the assumption that variables are conditionally independent
Application examples: MicroRNA target prediction [150]; prediction of protein interaction sites [151]; prediction of protein coupling specificity [152]

Decision Tree Classifier & Random Forest
Description: Classification or regression modelling; parameter-based data splitting on the variable with the highest information gain; data entropy; information gain theory; Gini index (impurity)
Strengths: Easy interpretation and analysis; useful on smaller datasets; applicable to multiclass classification; white-box model highlighting the classification pattern; easily assembled
Weaknesses: Tendency to overfit; lack of linear smoothness
Application examples: Prediction of microorganism growth temperatures and enzyme catalytic optima [153]; protein structure prediction from enzymatic turnovers [129]; microbial genome prediction [154]; MS cancer data classification [155]; gene selection for cancer identification [156]; human protein function prediction [157]; prediction of protein interaction [158]

k-NN
Description: Classification and regression modelling; clustering classification; instance-based learning, i.e. lazy learning; parameter selection on a kernel basis; higher dimensions for clustering
Strengths: Simple to implement; learns non-linear boundaries; robust to noise in input data
Weaknesses: Inefficient on larger training datasets; computationally expensive; K value chosen by mixed heuristics; Unclear
Application examples: Gene selection for sample classification based on gene expression data [159]; classification for cancer diagnosis [160]; prediction of metabolic pathway dynamics [161]
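
The Decision Tree entry in the table names data entropy, information gain, and the Gini criterion as the basis for choosing a split, but does not show how they are computed. The following is a minimal sketch in plain Python, assuming categorical features and a tiny toy dataset. The feature names `gene_a` and `gene_b`, the class labels, and the helper names `entropy`, `gini`, `information_gain`, and `best_split` are all hypothetical illustrations and are not taken from the cited studies.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(rows, labels, feature, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the child groups
    produced by splitting on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    n = len(labels)
    children = sum(len(g) / n * impurity(g) for g in groups.values())
    return impurity(labels) - children


def best_split(rows, labels, impurity=entropy):
    """Return the feature whose split gives the largest impurity reduction."""
    return max(rows[0], key=lambda f: information_gain(rows, labels, f, impurity))


# Hypothetical toy data: two discretised expression features per sample.
rows = [
    {"gene_a": "high", "gene_b": "high"},
    {"gene_a": "high", "gene_b": "low"},
    {"gene_a": "low", "gene_b": "high"},
    {"gene_a": "low", "gene_b": "low"},
]
labels = ["tumour", "tumour", "normal", "normal"]

for name in rows[0]:
    print(name, information_gain(rows, labels, name))
print("best split:", best_split(rows, labels))
```

In this toy data `gene_a` separates the two classes completely while `gene_b` carries no information, so a tree would split on `gene_a` first. Passing `impurity=gini` swaps the criterion without changing the rest of the procedure. Real tree learners also handle continuous features, stopping rules, and pruning, none of which is sketched here.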