BMC Med Res Methodol. 2024 May 17;24:114. doi: 10.1186/s12874-024-02231-4

Table 2.

Description of the machine learning algorithms used and the stacking-based ensemble method (SE): K-Nearest Neighbors (KNN), Decision Tree (DT), Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Logistic Regression (LR). Illustrative code sketches for each model follow the table.

Model Description
KNN [31] K-Nearest Neighbors (KNN) is a non-parametric classifier. To classify an unknown instance, it examines the class labels of the k closest training points in the feature space and assigns the instance to the class that occurs most frequently among those k nearest neighbors.
DT [32] The Decision Tree (DT) algorithm is a recursive, greedy approach that uses a tree data structure in which internal nodes represent tests on features, branches represent the outcomes of those tests, and leaf nodes represent class labels. The initial node is the root, from which the other nodes branch out; the path from the root to a leaf determines the predicted class for the target. The DT algorithm first grows the tree to its maximum depth so that each leaf node is pure, then prunes upward, trading off classification error against the number of leaf nodes in the tree.
RF [33] Random Forest (RF) is a commonly used bagging ensemble algorithm in health-related research. Essentially, RF is a group of classifiers made up of decision trees generated from two different sources of randomization. Firstly, each decision tree is trained on a bootstrap sample: a random sample drawn with replacement from the original data, of the same size as the supplied training set. This bootstrapping process produces samples containing approximately 37% duplicate instances (equivalently, about 37% of the original instances are left out of each sample). Secondly, at each node split, only a random subset of the features is considered.
XGBoost [34] XGBoost is an ensemble method that employs decision trees and utilizes the gradient boosting framework. It is renowned for its versatility, portability, and efficiency. Unlike traditional gradient boosting, XGBoost approximates the optimization of the objective function using a second-order Taylor expansion. It offers a variety of hyperparameters that give practitioners fine-tuned control over model training. Initially, the algorithm makes a naive prediction for the target variable. To improve accuracy, XGBoost then iteratively constructs new trees that focus on the residuals, or errors, of the preceding trees. After each tree is trained, its contribution to the final prediction is moderated by a learning rate, preventing overfitting and ensuring more robust generalization.
LR [35] Logistic Regression (LR) is a statistical method for modeling the probability of a binary outcome based on one or more predictor variables. The logistic function transforms any linear combination of the predictors into a value between 0 and 1 suitable for estimating probabilities, which can then be translated into class predictions. Owing to its interpretability and simplicity, Logistic Regression is widely employed in many fields, including medicine, where it is used to relate patient characteristics to outcomes.
SE [36] The stacking method is a popular form of heterogeneous ensemble learning in which a meta-model combines several base classifiers to generate more accurate predictions. Its primary advantage is that it can leverage the complementary strengths of multiple effective models. Unlike homogeneous ensemble techniques such as Random Forest, the base classifiers are trained on the complete training set, and a meta-estimator then learns how to combine their outputs. The stacking method can thus assess the error of each base classifier separately during the base-learning stage and subsequently reduce the residual errors in the meta-learning step.
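To make these descriptions concrete, the short Python sketches below illustrate each model in turn. They are minimal sketches using scikit-learn and the xgboost package; the synthetic dataset, the train/test split, and every hyperparameter value are illustrative assumptions, not settings taken from the study. First, KNN: each test point receives the majority class among its k nearest training points.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Assign each test point the most frequent class among its k=5
# nearest neighbors in feature space.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("KNN accuracy:", knn.score(X_test, y_test))
```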
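A decision tree sketch, reusing the split above. The ccp_alpha value is an assumed cost-complexity pruning setting: growing the tree fully and then pruning upward trades classification error against the number of leaf nodes, echoing the pruning step described in the table.

```python
from sklearn.tree import DecisionTreeClassifier

# ccp_alpha > 0 enables cost-complexity pruning after the tree is grown,
# penalizing trees with many leaves (the value here is arbitrary).
dt = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
dt.fit(X_train, y_train)
print("DT accuracy:", dt.score(X_test, y_test))
```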
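A random forest sketch showing both sources of randomization: each tree is fit on a bootstrap sample drawn with replacement (the same size as the training set), and each split considers only a random subset of the features. The out-of-bag score is computed on the roughly 37% of instances left out of each tree's bootstrap sample.

```python
from sklearn.ensemble import RandomForestClassifier

# bootstrap=True: each tree sees a with-replacement resample of the data;
# max_features="sqrt": each split considers a random feature subset.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            bootstrap=True, oob_score=True, random_state=0)
rf.fit(X_train, y_train)
# Evaluated on the ~37% of rows each tree never saw during training.
print("RF out-of-bag score:", rf.oob_score_)
```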
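An XGBoost sketch. Each new tree is fit using the gradient information (first- and second-order derivatives of the loss) at the current predictions, and the learning rate shrinks each tree's contribution; the hyperparameter values shown are assumptions.

```python
from xgboost import XGBClassifier

# Trees are added iteratively to correct the residual errors of the
# current ensemble; learning_rate moderates each tree's contribution.
xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3,
                    eval_metric="logloss", random_state=0)
xgb.fit(X_train, y_train)
print("XGBoost accuracy:", xgb.score(X_test, y_test))
```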
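A logistic regression sketch. predict_proba applies the logistic function 1 / (1 + exp(-z)) to the linear combination z of the predictors, giving probabilities in (0, 1) that can be thresholded into class predictions.

```python
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
# Probability estimates from the logistic transform of the linear predictor.
print("LR class probabilities:", lr.predict_proba(X_test[:3]))
```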
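Finally, a stacking sketch combining the base classifiers above. scikit-learn's StackingClassifier fits the base models on the training set and trains a meta-estimator (here logistic regression, an assumed choice) on their cross-validated predictions, so the meta-learning step can reduce the base classifiers' residual errors.

```python
from sklearn.ensemble import StackingClassifier

# The meta-estimator is trained on out-of-fold predictions of the base
# classifiers (cv=5), learning how to combine their outputs.
stack = StackingClassifier(
    estimators=[("knn", knn), ("dt", dt), ("rf", rf), ("xgb", xgb)],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
print("Stacking accuracy:", stack.score(X_test, y_test))
```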