TABLE 1.
Supervised and Unsupervised Approaches to Machine Learning
Supervised (Ref. #) | |
Regression analysis (13) | Uncomplicated form of supervised machine learning that generates an algorithm to describe a relationship between multiple variables and an outcome of interest. Stepwise models “automatically” add or remove variables based on the strength of their association with the outcome variable, until a significant model is developed or “learned.” |
Support vector machines (14,15) | Whereas regression analysis may identify linear associations, support vector machines provide nonlinear models by defining “hyperplanes” in a higher-dimensional space that best separate the features into groups that predict certain outcomes. |
Random forests (14,15) | Ensembles of decision trees that identify the best cut-point values across features, splitting related data into groups that predict a particular outcome. |
Neural networks | Features are fed through a nodal network of decision points, meant to mimic human neural processing. |
Convolutional neural networks | A multilayered network, often applied to image processing, that simulates some of the properties of the human visual cortex. A mathematical operation (convolution) passes results on to successive layers. |
Deep learning (DL) (12,16) | DL is defined as a class of artificial neural network algorithms that use more internal layers than traditional neural network approaches (“deep” merely describes the multilayered architecture). Convolutional neural networks are a common example. |
Unsupervised (Ref. #) | |
Principal component analysis (11) | Simple form of unsupervised learning in which the features that account for the most variation in a dataset can be identified. |
Hierarchical clustering (e.g., agglomerative hierarchical clustering, divisive hierarchical clustering) | Creates a hierarchical decomposition of the data based on similarity between clusters, either by merging them (an agglomerative, bottom-up approach) or by splitting them as it moves down the hierarchy (a divisive, top-down approach). Strengths—easy to comprehend and illustrate using dendrograms; insensitive to outliers. Difficulties—metric and linkage criteria are arbitrary, does not work with missing data, the dendrogram may be misinterpreted, and the optimal solution is difficult to find. |
Partitioning algorithms (e.g., K-means clustering) | Form of cluster analysis that identifies the degree of separation of different features within a dataset and tries to find groupings in which features are most differentiated. Similarity is defined on the basis of proximity to the centroid (mean, median, or medoid) of the cluster, and the algorithm builds the clusters by iteratively reassigning points to the nearest centroid and recomputing each centroid. Strengths—simple to implement, easy to interpret, fast, and efficient; has remained relatively underutilized in cardiology despite this simplicity. Difficulties—assumes roughly uniform cluster sizes, may work poorly with clusters of different densities, the number of clusters k must be chosen in advance, and it is sensitive to outliers. |
Model-based clustering (e.g., expectation-maximization algorithm) | Assumes that the data in each cluster are generated from a probabilistic (typically Gaussian) model. Strengths—provides a distributional description of each component, makes it possible to assess clusters within clusters, and supports inference about the number of clusters. Difficulties—computationally intensive and can be slow to converge. |
Grid-based algorithms (e.g., Statistical Information Grid-Based Clustering [STING], OptiGrid, WaveCluster, GRIDCLUS, GDILC) | Partition the data space into a finite number of cells to form a grid structure and then cluster over the grid. Strengths—can work on large multidimensional spaces with reduced computational complexity. Difficulties—difficult to handle irregular data distributions; limited by predefined cell sizes, borders, and density thresholds; difficult to cluster high-dimensional data. |
Density-based clustering (e.g., Density-Based Spatial Clustering of Applications with Noise [DBSCAN]) | Groups points that are closely packed together and marks isolated points in low-density regions as noise. Strengths—does not require the number of clusters to be specified, can find clusters of arbitrary shape, and is robust to outliers. Difficulties—not deterministic, quality depends on the chosen distance measure, and it cannot handle large differences in density. |
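The supervised/unsupervised contrast in the table can be made concrete with a short NumPy sketch: ordinary least-squares regression learns weights from labeled outcomes, while a from-scratch K-means partitions unlabeled points by iterating between nearest-centroid assignment and centroid recomputation. The synthetic data and the `kmeans` helper below are illustrative assumptions, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Supervised example: ordinary least-squares regression ---
# Two features mapped to a known linear outcome with small noise;
# the model "learns" weights close to the true ones.
X = rng.normal(size=(100, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# --- Unsupervised example: K-means partitioning (illustrative sketch) ---
def kmeans(X, k, n_iter=50, seed=0):
    """Assign each point to the nearest centroid, then move each centroid
    to the mean of its assigned points; repeat for n_iter iterations."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance of every point to every centroid (n_points x k matrix).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated synthetic blobs; K-means should recover the grouping
# without ever seeing labels.
pts = np.vstack([rng.normal(loc=0.0, size=(50, 2)),
                 rng.normal(loc=5.0, size=(50, 2))])
labels, centroids = kmeans(pts, k=2)
```

Note the contrast in what each step consumes: the regression fit uses the outcome vector `y`, whereas `kmeans` sees only the coordinates of `pts` and must define its own notion of group membership via proximity to centroids.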