2017 May 19;12(7):505–514. doi: 10.1080/15592294.2017.1329068

Table 1.

Machine learning approaches for biological data sets, along with their function, advantages, disadvantages, and recent examples.

| Machine Learning Approach | Function | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Supervised Learning (e.g., support vector machine [84], random forest [85]) | Learn a model discriminating one class of biological phenomena from one or more other classes. | Precise model with predictive and interpretative properties. | Requires an equally large number of examples from each class. |
| Unsupervised Learning (e.g., K-means [86], hierarchical clustering [87]) | Learn a model descriptive of the biological phenomena in the data. | Does not require class labels on data. | Sensitive to similarity measure; results difficult to interpret. |
| Semi-supervised Learning (e.g., transduction [88]) | Learn a model from a mixture of labeled and unlabeled data. | Utilizes all available data; typically outperforms using only labeled data. | Sensitive to errors in propagating class labels from labeled to unlabeled data. |
| Feature Selection (e.g., PCA [89], LDA [90], wrapper [91]) | Reduce a large number of features to fewer, more informative features. | Improves efficiency and accuracy of learning. | Sensitive to feature evaluation metric; may discard informative features. |
| Active Learning (e.g., uncertainty sampling [92], most informative instance [93]) | Identify the most informative instances to label for accurate model learning. | Reduces the number of examples needed to learn a model; reduces burden on human expert and experiment cost. | May focus learner on outliers rather than prominent classes. |
| Imbalanced Class Learning (e.g., minority over-sampling [94], boosting [95]) | Learn in the presence of a large skew in the number of examples of each class. | Learns with relatively few examples of the biological phenomenon of interest. | May underfit or overfit data depending on bias toward minority class. |
| Deep Learning (e.g., DeepBind [14], DeepMotif [15]) | Learns complex representations of concepts in the data. | General purpose and high accuracy. | Sensitive to parameter choices; long training times. |
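
Two of the table's entries, minority over-sampling and uncertainty sampling, are simple enough to sketch in a few lines. The Python below is a minimal illustration under stated assumptions, not the method used in the cited references: the function names `oversample_minority` and `uncertainty_sampling` are hypothetical, over-sampling is done by plain random duplication (synthetic over-sampling schemes generate new points instead), and uncertainty is measured for a binary classifier as the distance of the predicted probability from 0.5.

```python
import random
from collections import Counter


def oversample_minority(examples, labels, seed=0):
    """Randomly duplicate minority-class examples until every class
    has as many examples as the largest class (illustrative only)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(examples), list(labels)
    for cls, n in counts.items():
        # All original examples belonging to this class.
        pool = [x for x, y in zip(examples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


def uncertainty_sampling(probabilities, k=1):
    """Return indices of the k unlabeled instances whose predicted
    positive-class probability is closest to 0.5 (binary case)."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]


# Hypothetical toy data: three examples of class 0, one of class 1.
x, y = oversample_minority(["a", "b", "c", "d"], [0, 0, 0, 1])
print(sorted(Counter(y).items()))  # [(0, 3), (1, 3)]

# Hypothetical classifier outputs for four unlabeled instances.
print(uncertainty_sampling([0.95, 0.52, 0.10, 0.40], k=2))  # [1, 3]
```

The toy run also hints at the disadvantages listed in the table: duplicating the single minority example makes it easy to overfit to it, and the instances closest to the decision boundary may be outliers rather than representative members of a class.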