2023 Dec 19;41(2):534–552. doi: 10.1007/s12325-023-02743-3

Table 1.

Commonly used machine learning algorithms (with Python and R codes)

Algorithms | Objectives | References
Linear regression

Used to estimate real (continuous) values, such as predicted FEV1 (the dependent variable), from one or more predictors, such as height and age (the independent variables) [46]

It is called simple with one independent variable and multiple with more than one

Frank E. Harrell Jr

Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis

Springer International; 2015
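The table's code column did not survive extraction; as an illustrative stand-in, a minimal Python sketch of a linear regression fit with scikit-learn, using synthetic placeholder values (the heights, ages, and FEV1 numbers below are invented, not clinical data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical predictors (height in cm, age in years) and outcome (FEV1 in L)
X = np.array([[160, 30], [170, 40], [175, 25], [180, 50], [165, 35]])
y = np.array([3.1, 3.4, 3.8, 3.5, 3.2])

model = LinearRegression().fit(X, y)  # ordinary least squares fit
pred = model.predict([[172, 33]])     # estimate FEV1 for a new subject
```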

Logistic regressiona

A classification algorithm used to estimate discrete values (yes/no, true/false, etc.) [46]

Since it predicts the probability of developing a certain condition, the values are between 0 and 1

Frank E. Harrell Jr

Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis

Springer International; 2015
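A minimal scikit-learn sketch of the idea above, with a synthetic binary outcome (0 = no condition, 1 = condition); note the predicted probability always lies between 0 and 1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic one-feature dataset with a binary label
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba([[2.5]])[0, 1]  # probability of class 1, in [0, 1]
```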

Support vector machine (SVM)

A classification algorithm in which each data point is plotted in an n-dimensional space, with the value of each feature as a coordinate; the points closest to the separating boundary are known as support vectors [47]

Data are split by a line or hyperplane (the classifier) positioned equidistant from the closest points of each group

Ingo Steinwart, Andreas Christmann

Support vector machines

Springer New York; 2008
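A minimal scikit-learn sketch of a linear SVM on two synthetic, well-separated groups; after fitting, `support_vectors_` holds the points closest to the separating hyperplane:

```python
import numpy as np
from sklearn.svm import SVC

# Two synthetic groups, clearly separated in feature space
X = np.array([[0, 0], [1, 1], [0, 1], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear").fit(X, y)
label = svm.predict([[5.5, 5.5]])[0]  # classified by which side of the hyperplane it falls on
```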

Naïve Bayes

A classification algorithm that assumes independence between predictors [48]

It is simple, useful for large datasets, and can outperform more sophisticated classification methods

Thomas Mitchell

Machine learning

McGraw-Hill Education; 1997
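A minimal scikit-learn sketch using Gaussian naïve Bayes, which encodes the independence assumption described above (synthetic data):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
              [5.0, 6.0], [5.2, 5.8], [4.9, 6.1]])
y = np.array([0, 0, 0, 1, 1, 1])

nb = GaussianNB().fit(X, y)      # treats features as independent given the class
label = nb.predict([[1.1, 1.9]])[0]
```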

Decision tree

A supervised learning algorithm mostly used for classification problems; it works with both categorical and continuous dependent variables [49]

It splits the population into increasingly homogeneous sets based on the most significant attributes (independent variables)

Clinton Sheppard

Tree-based machine learning algorithms: decision trees, random forests, and boosting

CreateSpace; 2017
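A minimal scikit-learn sketch of a decision tree; the first feature below is categorical-like (0/1) and the second continuous, and the tree splits on whichever attribute is most informative (synthetic data):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Mixed-scale synthetic features: a 0/1 indicator and a continuous measurement
X = np.array([[0, 1.5], [0, 1.7], [1, 6.0], [1, 6.2], [0, 1.6], [1, 5.9]])
y = np.array([0, 0, 1, 1, 0, 1])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
label = tree.predict([[1, 6.1]])[0]
```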

Random foresta

An ensemble of decision trees

It classifies a new object on the basis of its attributes: each tree “votes” for a class, and the forest chooses the classification with the most votes

Combining trees improves the accuracy of the model [50]

Chris Smith, Mark Koning

Decision trees and random forests: a visual introduction for beginners

Blue Windmill Media; 2017
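A minimal scikit-learn sketch of the voting ensemble described above, with 10 trees on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[0, 0], [1, 1], [0, 1], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
label = rf.predict([[5.5, 5.5]])[0]  # majority vote across the 10 trees
```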

Gradient boosting machines (GBM)a

Combines many weak learners (typically shallow decision trees), fitting each new learner to the errors of the ensemble so far; it has high predictive power and is used on large datasets

XGBoost has high predictive power and accuracy; regularized boosting helps reduce overfitting [51]

LightGBM is a faster gradient boosting implementation, also built on tree-based learners [52]

Corey Wade, Kevin Glynn

Hands-on gradient boosting with XGBoost and Scikit-learn: perform accessible machine learning and extreme gradient boosting with Python

Packt; 2020

Ke G, Meng Q, Finley T, et al.

Lightgbm: a highly efficient gradient boosting decision tree

Adv Neural Inf Process Syst 2017;30:3149–3157
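XGBoost and LightGBM are separate packages; as a self-contained stand-in for the same technique, a minimal sketch with scikit-learn's `GradientBoostingClassifier` on synthetic data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

X = np.array([[0, 0], [1, 1], [0, 1], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# Each successive tree is fitted to the residual errors of the ensemble so far
gb = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)
label = gb.predict([[6.0, 6.0]])[0]
```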

Recurrent neural networks (RNN)a

Designed to process sequential data such as time series

Have been used to predict hospital readmissions [53]

Ralf C. Staudemeyer, Eric Rothstein Morris

Understanding LSTM – a tutorial into long short-term memory recurrent neural networks

arXiv 1909.09586 [preprint]. 2019
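To illustrate the recurrence only, a minimal NumPy sketch of a vanilla RNN forward pass; the weights are random and untrained, and the dimensions are arbitrary choices:

```python
import numpy as np

# Randomly initialized, untrained weights: this shows the recurrence, not a trained model
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 3))      # input-to-hidden weights
W_hh = rng.normal(size=(4, 4))      # hidden-to-hidden (recurrent) weights
h = np.zeros(4)                     # hidden state

sequence = rng.normal(size=(5, 3))  # 5 time steps, 3 features each
for x_t in sequence:
    # The hidden state carries information from earlier steps forward in time
    h = np.tanh(W_xh @ x_t + W_hh @ h)
```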

Long short-term memory (LSTM)a

A type of RNN designed to handle long-term dependencies in sequential data [53]

Has been used to predict disease progression

Ralf C. Staudemeyer, Eric Rothstein Morris

Understanding LSTM – a tutorial into long short-term memory recurrent neural networks

arXiv 1909.09586 [preprint]. 2019
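A minimal NumPy sketch of a single LSTM cell stepped over a sequence, showing how the gates maintain a long-term cell state; weights are random and untrained, and the sizes are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
# One weight matrix per gate: forget (f), input (i), candidate (g), output (o)
W = {k: rng.normal(size=(n_hid, n_in + n_hid)) for k in "figo"}
h = np.zeros(n_hid)  # hidden state (short-term)
c = np.zeros(n_hid)  # cell state (long-term memory)

for x_t in rng.normal(size=(5, n_in)):
    z = np.concatenate([x_t, h])
    f = sigmoid(W["f"] @ z)  # forget gate: how much old memory to keep
    i = sigmoid(W["i"] @ z)  # input gate: how much new information to store
    g = np.tanh(W["g"] @ z)  # candidate memory content
    o = sigmoid(W["o"] @ z)  # output gate
    c = f * c + i * g        # update the long-term cell state
    h = o * np.tanh(c)       # expose gated memory as the new hidden state
```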

K-nearest neighbors (KNN)

Can be used for classification or regression problems

In classification, it stores all cases and classifies new cases by a majority vote of k-neighbors measured by a distance function [54]

The value of k (the number of neighbors considered) is typically chosen empirically, for example by cross-validation

Antonio Mucherino, Petraq J. Papajorgji, Panos M. Pardalos

Data mining in agriculture

Springer Science & Business Media; 2009
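A minimal scikit-learn sketch of KNN classification with k = 3 on synthetic data; a new case takes the majority class of its three nearest stored cases:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
              [5.0, 6.0], [5.2, 5.8], [4.9, 6.1]])
y = np.array([0, 0, 0, 1, 1, 1])

# All cases are stored; prediction is a majority vote of the 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
label = knn.predict([[1.1, 1.9]])[0]
```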

K-means clustering

An unsupervised algorithm that solves clustering problems: k centroids are chosen, and each data point joins the cluster of its closest centroid [55]

Swati Patel

K-means clustering algorithm: implementation and critical analysis

Scholars; 2019
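A minimal scikit-learn sketch of k-means with k = 2 on synthetic, unlabeled data; each point is assigned to its closest centroid:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled synthetic data forming two natural groups
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
              [5.0, 6.0], [5.2, 5.8], [4.9, 6.1]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_  # cluster assignment for each point
```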

Dimensionality reduction

Identifies the most significant variables in vast datasets where the data are unstructured or highly detailed [56]

Benyamin Ghojogh, Mark Crowley, Fakhri Karray, Ali Ghodsi

Elements of dimensionality reduction and manifold learning

Springer Nature; 2023
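As one common dimensionality reduction technique, a minimal scikit-learn sketch of principal component analysis (PCA) on synthetic data, reducing 6 variables to the 2 most informative directions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))      # 20 samples, 6 original variables

pca = PCA(n_components=2).fit(X)  # keep the 2 directions of greatest variance
X_reduced = pca.transform(X)
```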

a Most commonly used to predict medical events