Source: Circ Res. 2021 Feb 18;128(4):544–566. doi: 10.1161/CIRCRESAHA.120.317872

Table 1.

Common ML algorithms and their advantages and disadvantages

| Algorithm | Acronyms/Variations | Description | Advantages | Disadvantages | References |
|---|---|---|---|---|---|
| Regressions | LASSO, Ridge, Elastic Net | Fit a simple function's parameters to minimize the sum of squared distances to the observed data | Easily interpretable; uncovers causal relationships | Computationally impractical for large, high-dimensional datasets | Tibshirani^14, Kennedy^13, Zou & Hastie^15 |
| Support Vector Machines | SVM, SVR | Use support vectors to identify decision boundaries in the data | Memory efficient; robust, grounded in Vapnik–Chervonenkis theory | Not suitable for large datasets; no probabilistic interpretation of classifiers | Cortes & Vapnik^16 |
| k-nearest neighbor | k-NN | Classify new data based on the labels of surrounding neighbors | Strong theoretical underpinnings; easily interpretable | Not suitable for large datasets; large memory footprint | Cover & Hart^17 |
| k-means clustering | kMeans | Assign each point in the dataset to one of k clusters so as to minimize within-cluster variance relative to the cluster's centroid | Good convergence | Requires specifying k a priori; not suitable for high-dimensional data; sensitive to outliers | Lloyd^18 |
| Principal Component Analysis | PCA, POD, SVD, EVD, KLT | Change coordinates to an orthonormal basis that maximizes the variance of the data along the new coordinates | Can be used for feature extraction; computationally efficient algorithms available; extensively studied, leading to many generalizations | Principal components can depend on all input variables; learned subspace is linear | Bishop^8, Wold et al.^157 |
| Decision Tree and Random Forest | RF, AdaBoost, XGBoost | Build flowchart-like decision trees whose questions are learned from the data, then ensemble multiple trees to form a random forest | Very easy to interpret, as intermediary decisions can be read directly; performs well on large datasets | Requires manually crafted features; not suitable for perceptual data (e.g., images) | Breiman^20 |
| Artificial Neural Networks | (A)NN, CNN, RNN, DNN, GAN | Directed, weighted acyclic graph of neurons arranged in layers, using a propagation function to transmit information | Automatic feature learning; very good performance on imaging data; applicable to a wide range of problems; easy to continue training on additional data | May require large amounts of data; prone to overfitting on small datasets; hard to interpret | Chollet^7, Bishop^8 |
| Naïve Bayes | N/A | Classification using Bayes' theorem, assuming independence between features to model the class-conditional probability | Requires only a small number of training samples; easy to interpret | Assumes features are independent; not suitable for high-dimensional data | Bishop^8 |
| Linear discriminant analysis | LDA, NDA | Find a linear combination of features that separates the input data into classes | Strong performance when its assumptions are met | Independent variables are assumed to be normally distributed; sensitive to outliers | Duda et al.^158 |
| Gaussian Mixture Model | GMM | Assume the data follows a linear combination of Gaussian distributions whose parameters are estimated from the data | Fastest algorithm for learning mixture models; simple likelihood-based optimization | Covariance matrix estimation can be difficult; number of components must be specified a priori | Bishop^8 |
| Spectral Clustering | N/A | Use eigenvalue decomposition to cluster based on the similarity matrix whose entries A_ij express the degree of similarity between points i and j | Simple to implement; can be solved efficiently with linear algebra methods | Number of clusters needs to be specified in advance | Ng et al.^159 |
| Mean Shift | N/A | A centroid-based clustering method that iteratively searches through neighborhoods of points to locate the modes of density functions | No need to specify the number of clusters; the bandwidth parameter has physical meaning | Not scalable, as it requires many nearest-neighbor searches | Comaniciu & Meer^160 |
| Isomap | N/A | Non-linear dimensionality reduction using an isometric mapping (a distance-preserving transformation between metric spaces) | High computational efficiency; nonlinear; globally optimal | Sensitive to the parameter governing the connectivity of each point | Tenenbaum et al.^161 |
| Local Linear Embedding | LLE, HLLE, MLLE | Non-linear dimensionality reduction that reconstructs the data from linear combinations of projected neighborhood points | Faster than Isomap; can take advantage of sparse matrix algorithms | Sensitive to sampling density (i.e., performs poorly on non-uniform densities) | Roweis & Saul^162 |
| Diffusion Maps | N/A | Nonlinear feature extraction and dimensionality reduction in which distances between points are defined in terms of probabilities of diffusion | Nonlinear; computation is insensitive to the distribution of the points | Scaling parameter ε requires tuning | Coifman et al.^163 |
| t-distributed stochastic neighbor embedding | t-SNE | Data visualization tool that defines the similarity between two points as the conditional probability that one would pick the other as its neighbor if neighbors were picked according to Student-t probabilities centered at the first point | Constructs 2- or 3-dimensional representations of the data for easy visualization; nonlinear | Highly computationally expensive; sensitive to initial conditions due to its stochastic nature | Roweis & Hinton^164 |

Minimal, illustrative usage sketches for several of the algorithm families in this table are shown below.
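
The sketch below illustrates the supervised methods in the table (penalized regressions, SVM, k-NN, random forests, naïve Bayes, and LDA). It is a minimal example, assuming scikit-learn and synthetic data; the datasets and all hyperparameter values are illustrative choices of ours, not settings taken from the article.

```python
# Supervised-learning sketch for several Table 1 methods, using scikit-learn
# on synthetic data (all settings here are illustrative assumptions).
from sklearn.datasets import make_regression, make_classification
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

seed = 0  # fixed seed so the sketch is reproducible

# Penalized regressions (LASSO, Ridge, Elastic Net): fit a linear model by
# minimizing squared error plus an L1 and/or L2 penalty.
X_reg, y_reg = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=seed)
for model in (Lasso(alpha=0.1), Ridge(alpha=1.0), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X_reg, y_reg)
    print(type(model).__name__, "R^2:", round(model.score(X_reg, y_reg), 3))

# Classifiers compared with 5-fold cross-validation on a synthetic binary problem.
X_clf, y_clf = make_classification(n_samples=300, n_features=10, random_state=seed)
classifiers = {
    "SVM": SVC(kernel="rbf", C=1.0),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Random forest": RandomForestClassifier(n_estimators=200, random_state=seed),
    "Naive Bayes": GaussianNB(),
    "LDA": LinearDiscriminantAnalysis(),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X_clf, y_clf, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```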
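A similar sketch for the clustering methods in the table (k-means, Gaussian mixtures, spectral clustering, and mean shift) follows, again assuming scikit-learn and synthetic blobs; the cluster counts and bandwidth quantile are illustrative assumptions.

```python
# Clustering sketch for the unsupervised methods in Table 1, using scikit-learn
# on synthetic blobs with known ground-truth labels for scoring.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, SpectralClustering, MeanShift, estimate_bandwidth
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

X, y_true = make_blobs(n_samples=500, centers=3, cluster_std=1.0, random_state=0)

# k-means and spectral clustering need the number of clusters up front.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
spectral_labels = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                                     random_state=0).fit_predict(X)

# A Gaussian mixture also needs its number of components specified a priori.
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

# Mean shift infers the number of clusters but needs a kernel bandwidth.
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms_labels = MeanShift(bandwidth=bandwidth).fit_predict(X)

for name, labels in [("k-means", kmeans_labels), ("spectral", spectral_labels),
                     ("GMM", gmm_labels), ("mean shift", ms_labels)]:
    print(f"{name}: adjusted Rand index = {adjusted_rand_score(y_true, labels):.3f}")
```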
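Finally, a sketch of the dimensionality-reduction methods (PCA, Isomap, locally linear embedding, and t-SNE), assuming scikit-learn and the classic S-curve toy dataset; neighborhood sizes and perplexity are illustrative assumptions, and diffusion maps are omitted because they are not part of scikit-learn.

```python
# Dimensionality-reduction sketch for PCA, Isomap, LLE, and t-SNE from Table 1,
# applied to the S-curve toy dataset with scikit-learn.
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding, TSNE

X, color = make_s_curve(n_samples=1000, random_state=0)

# PCA: linear projection onto the directions of maximal variance.
X_pca = PCA(n_components=2).fit_transform(X)

# Isomap and LLE: nonlinear embeddings built from each point's nearest neighbors;
# results are sensitive to n_neighbors, as noted in the table.
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
X_lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2,
                               random_state=0).fit_transform(X)

# t-SNE: stochastic 2-D embedding for visualization; rerunning with a different
# seed can change the layout.
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

for name, emb in [("PCA", X_pca), ("Isomap", X_iso), ("LLE", X_lle), ("t-SNE", X_tsne)]:
    print(name, "embedding shape:", emb.shape)
```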