Supervised learning
Regression analysis
A widely used form of machine learning that generates a model defining the relationship between one or more predictor variables and an outcome of interest. Stepwise models iteratively add or remove predictors according to the strength of their association with the outcome in order to find the best-performing model.
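As an illustration only, the sketch below uses scikit-learn's SequentialFeatureSelector as a stand-in for a stepwise procedure; the synthetic data and parameter choices are assumptions, not taken from the source.

```python
# Minimal sketch: linear regression with forward feature selection
# standing in for a stepwise procedure.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 candidate predictors, only 3 truly informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Forward selection: add predictors one at a time while they improve the CV score.
selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=3,
                                     direction="forward", cv=5)
selector.fit(X, y)
print("Selected predictors:", selector.get_support(indices=True))

# Refit the regression on the selected predictors only.
model = LinearRegression().fit(selector.transform(X), y)
print("Coefficients:", model.coef_)
```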
Support vector machines
This algorithm defines linear or higher-dimensional “hyperplanes” that separate the training data into classes in order to predict outcomes. When presented with new data, the model predicts the class to which the new data belong.
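A minimal sketch, assuming scikit-learn and synthetic two-class data: a linear hyperplane is fitted and then used to classify previously unseen points.

```python
# Minimal sketch: a linear support vector machine separating two classes.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters of training points.
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# Fit a hyperplane that separates the classes (kernel="rbf" would allow
# a nonlinear, higher-dimensional separation instead).
clf = SVC(kernel="linear").fit(X, y)

# Predict the class of previously unseen points.
print(clf.predict([[0.0, 2.0], [4.0, 4.0]]))
```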
Random forest
Consists of a multitude of individual decision trees that operate as an “ensemble”; an item is classified according to the most common output from all of the decision trees.
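An illustrative sketch, assuming scikit-learn and synthetic data; the number of trees is an arbitrary choice.

```python
# Minimal sketch: a random forest as an ensemble of decision trees,
# classifying by the most common output (majority vote) across the trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Each fitted tree votes; predict() returns the majority class.
print("Number of trees:", len(forest.estimators_))
print("Predicted class:", forest.predict(X[:1]))
```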
Artificial neural networks (ANN)
Modeled on human neural processing, these nodal networks are composed of interconnected layers (an input layer, multiple hidden layers, and an output layer) that analyze and classify the input data. A mathematical model passes the results of each layer on to successive layers. ANNs can learn which connections are most useful for classifying data and weight them accordingly.
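A minimal sketch of a small feed-forward network, assuming scikit-learn's MLPClassifier on synthetic data; the layer sizes are illustrative only.

```python
# Minimal sketch: a feed-forward neural network with two hidden layers;
# the connection weights are learned during fit().
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Input layer (10 features) -> hidden layers of 32 and 16 nodes -> output layer.
net = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000,
                    random_state=0).fit(X, y)

print("Layer-to-layer weight matrices:", [w.shape for w in net.coefs_])
print("Predicted class:", net.predict(X[:1]))
```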
Unsupervised learning
Principal component analysis (PCA)
The data are summarized by combining existing features to create a new set of uncorrelated variables, or principal components. The process of reducing the data to only a few such variables is called dimensionality reduction. PCA thereby allows the components that account for the most variation in the dataset to be identified.
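A brief sketch, assuming scikit-learn and synthetic data, reducing ten features to two principal components.

```python
# Minimal sketch: PCA reducing 10 features to 2 principal components.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # dimensionality reduction: 10 -> 2

# Proportion of the total variation captured by each component.
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_reduced.shape)
```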
Clustering techniques
The algorithm identifies inherent groupings, or clusters, in the data and uses these clusters to classify new data.
Hierarchical clustering
A hierarchical decomposition of the data is created either by sequentially merging similar clusters (an agglomerative approach) or by beginning with a single cluster that is split into smaller clusters at each step (a divisive approach).
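An illustrative agglomerative (bottom-up) example, assuming scikit-learn and synthetic data; the linkage method and cluster count are assumptions.

```python
# Minimal sketch: agglomerative hierarchical clustering,
# sequentially merging the most similar clusters.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

hier = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
print("Cluster labels for the first ten points:", hier.labels_[:10])
```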
K-means clustering
A distance-based algorithm in which similarity is defined by proximity to the “centroid” of the cluster (the mean in K-means proper; related variants use the median or a medoid). The “K” in K-means denotes the number of clusters.
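A minimal K-means sketch with K = 3, assuming scikit-learn and synthetic blob data.

```python
# Minimal sketch: K-means with K = 3; each point is assigned to the nearest
# cluster centroid (the mean of the cluster's members).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Centroids:\n", km.cluster_centers_)
print("Cluster of a new point:", km.predict([[0.0, 4.0]]))
```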
Model-based clustering
Based on the assumption that the data originate from a mixture of two or more underlying distributions (usually normal, or Gaussian), each corresponding to a cluster, and that each data point has a probability of belonging to each cluster.
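A short sketch using a Gaussian mixture model (scikit-learn's GaussianMixture) to obtain each point's probability of belonging to each cluster; the data and component count are illustrative.

```python
# Minimal sketch: model-based clustering with a mixture of Gaussians;
# predict_proba() gives each point's membership probability per cluster.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print("Hard assignments:", gmm.predict(X[:5]))
print("Membership probabilities:\n", gmm.predict_proba(X[:5]).round(3))
```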
Density-based spatial clustering
The algorithm focuses on the proximity and density of the data to find arbitrarily shaped clusters and outliers (noise). For each point in a cluster, the “neighborhood” must contain at least a minimum number of points.
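An illustrative DBSCAN run on crescent-shaped synthetic data, assuming scikit-learn; the neighborhood radius (eps) and minimum point count are arbitrary choices.

```python
# Minimal sketch: density-based clustering (DBSCAN); points whose neighborhood
# of radius eps contains fewer than min_samples points are labelled noise (-1).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped (arbitrarily shaped) clusters plus a little noise.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print("Clusters found:", set(db.labels_) - {-1})
print("Noise points:", int((db.labels_ == -1).sum()))
```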
Reinforcement learning
The machine learns how to interact with its environment through trial and error in order to maximize the total reward.
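A minimal tabular Q-learning sketch on a hypothetical five-state corridor, written from scratch with NumPy; the environment, reward, and hyperparameters are invented purely for illustration.

```python
# Minimal sketch: tabular Q-learning. The agent moves left/right by trial and
# error, is rewarded only for reaching the rightmost state, and learns action
# values that maximize the total (discounted) reward.
import numpy as np

n_states, n_actions = 5, 2             # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))    # learned action values
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration
rng = np.random.default_rng(0)

for _ in range(2000):                  # episodes of trial and error
    state = 0
    while state != n_states - 1:
        # Epsilon-greedy: mostly exploit, occasionally explore.
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(Q[state].argmax())
        next_state = max(state - 1, 0) if action == 0 else state + 1
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Update toward the reward plus the discounted best future value.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max()
                                     - Q[state, action])
        state = next_state

print("Learned policy (0 = left, 1 = right):", Q.argmax(axis=1))
```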
Deep learning (DL)
A class of ANN algorithms in which more hidden layers are used than in traditional neural networks; “deep” simply describes this multilayer architecture. Most DL models are convolutional neural networks, in which the hidden layers contain one or more convolutional layers that apply a filter to the input to create a feature map. As with machine learning in general, DL can be classified as supervised, unsupervised, or reinforcement learning.
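A sketch of the convolution operation itself, in plain NumPy: a hand-written vertical-edge filter is slid over a toy image to produce a feature map (in a real convolutional network the filter weights are learned, not fixed).

```python
# Minimal sketch: the core operation of a convolutional layer -- sliding a
# small filter (kernel) over an input "image" to produce a feature map.
import numpy as np

image = np.zeros((8, 8))
image[:, 4:] = 1.0                     # toy input: dark left half, bright right half

kernel = np.array([[1.0, 0.0, -1.0],   # simple vertical-edge filter
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

h, w = kernel.shape
feature_map = np.zeros((image.shape[0] - h + 1, image.shape[1] - w + 1))
for i in range(feature_map.shape[0]):
    for j in range(feature_map.shape[1]):
        feature_map[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)

# The boundary between the two halves shows up as strong responses in the map;
# a CNN stacks many such layers (plus nonlinearities) and learns the filters.
print(feature_map)
```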