PLOS One. 2024 Apr 18;19(4):e0301541. doi: 10.1371/journal.pone.0301541

Confirming the statistically significant superiority of tree-based machine learning algorithms over their counterparts for tabular data

Shahadat Uddin 1,*, Haohui Lu 1
Editor: Nagarajan Raju2
PMCID: PMC11025817  PMID: 38635591

Abstract

Many individual studies in the literature observed the superiority of tree-based machine learning (ML) algorithms. However, the current body of literature lacks statistical validation of this superiority. This study addresses this gap by employing five ML algorithms on 200 open-access datasets from a wide range of research contexts to statistically confirm the superiority of tree-based ML algorithms over their counterparts. Specifically, it examines two tree-based ML (Decision tree and Random forest) and three non-tree-based ML (Support vector machine, Logistic regression and k-nearest neighbour) algorithms. Results from paired-sample t-tests show that both tree-based ML algorithms reveal better performance than each non-tree-based ML algorithm for the four ML performance measures (accuracy, precision, recall and F1 score) considered in this study, each at p<0.001 significance level. This performance superiority is consistent across both the model development and test phases. This study also used paired-sample t-tests for the subsets of the research datasets from disease prediction (66) and university-ranking (50) research contexts for further validation. The observed superiority of the tree-based ML algorithms remains valid for these subsets. Tree-based ML algorithms significantly outperformed non-tree-based algorithms for these two research contexts for all four performance measures. We discuss the research implications of these findings in detail in this article.

1. Introduction

Machine learning (ML) algorithms harness various statistical, probabilistic, and optimisation techniques to extrapolate historical data and discern valuable insights from vast, intricate, unstructured datasets [1]. ML has shown significant promise across various studies, including advancements in semantic embeddings [2], unsupervised learning techniques [3–5], disease prediction [6–8] and visual recognition optimisations [9–11]. Among the ML models, tree-based algorithms have gained prominence for their effectiveness, particularly in dealing with tabular data. These algorithms, including Decision Trees (DT) and Random Forests (RF), operate through a hierarchical structure that enables transparent, criteria-based decision-making, adeptly managing both categorical and continuous inputs [12]. Such an inherent structure allows these models to compartmentalise the predictor space into simple regions, proving especially advantageous when addressing data embodying complex, non-linear relationships.

Tree-based algorithms partition the feature space into distinct and mutually exclusive regions. They forecast outcomes using a series of test conditions arranged hierarchically. Each node in this structure evaluates a feature value, directing an input from the root to a terminal leaf [13]. The ultimate prediction is typically derived from the dominant class or the average value of the samples within that leaf. This foundational approach in tree-based methods allows for capturing subtle data patterns while balancing model robustness and explanatory clarity. On the other hand, non-tree-based algorithms such as Logistic regression (LR) and Support vector machine (SVM) adopt distinct methodologies for classification tasks. For instance, LR, a popular supervised learning classification algorithm developed in the 1940s, differs from linear regression by using a binomial output and employing the natural logarithm of the odds for its response variable to produce continuous criteria [14]. Further, SVM can address non-linear relationships using kernel methods. However, such models might not consistently offer the level of interpretability found in tree-based models [15].

The choice between tree-based and non-tree-based algorithms depends on various considerations, such as the type of data, the importance of model clarity, and the need for generalisation. Given the wide-ranging usage of ML algorithms across different fields, it is essential to grasp the merits and drawbacks of each method. In this study, we conduct an in-depth comparison of tree-based algorithms with non-tree-based methods across various tabular datasets, aiming to highlight where tree-based algorithms excel.

Although studies in the current literature empirically showed the superiority of tree-based ML algorithms, no study shows such superiority through a statistical significance test. Following a classical comparative statistical significance test (paired-sample t-test), this study will show the performance superiority of tree-based ML algorithms over non-tree-based ML algorithms. We will use four ML performance measures (accuracy, precision, recall and F1 score) for this.

2. Literature review

Tabular data has long been the cornerstone of data analytics. Before the tree-based algorithms, k-nearest neighbour (KNN), LR and SVM were the standard choices for processing tabular data. While these methods excel in certain situations, they frequently struggle with non-linear data patterns and complex feature interplay [12].

Tree-based models like DT and RF have since filled this gap. They adeptly capture non-linear relationships and intricate data patterns by hierarchically partitioning the feature space [13]. DT is an interpretable model that can manage both numerical and categorical data. Further, RF is an ensemble method that boosts prediction accuracy and combats overfitting by averaging multiple decision tree outcomes. The advantages of tree-based models are numerous. For example, they naturally handle feature interactions, eliminating the need for tedious feature engineering [13]. Also, their interpretability has been enhanced by techniques, such as SHAP values, ensuring that complex models remain transparent [16].

In the literature, many individual benchmarking studies provide evidence of the superiority of tree-based models. Perlich et al. [17] compared tree-based methods and logistic regression, evaluating their classification accuracy and the quality of rankings based on class membership probabilities. Their findings highlighted the superiority of the tree-based model. Later, Caruana and Niculescu-Mizil [18] compared different ML methods on 30 datasets. Tree-based models consistently outperformed non-tree-based algorithms regarding eight performance metrics. Although they provided empirical evidence of the superiority of tree-based ML algorithms, they did not report any statistical significance, either at the p≤0.05 or p≤0.1 level, for their findings.

Like Caruana and Niculescu-Mizil [18], Fernández-Delgado et al. [19] reported similar empirical evidence. They evaluated 179 classifiers from 17 families across various platforms, using 121 datasets primarily from the UCI database and some proprietary real-world problems. The results indicate that RF classifiers are the most likely to be top performers. Recently, deep learning has gained popularity. Yet, its dominance over tabular data remains debatable, even though it has shown success with text and image datasets. Uddin et al. [20] analysed the findings from 48 studies. They found that SVM is the most used ML algorithm, and RF is the one that most often shows the best accuracy. Grinsztajn et al. [21] compared tree-based models such as XGBoost and RF across 45 diverse datasets. The results revealed that tree-based models outperformed deep learning in general, especially on medium-sized datasets. Tree-based models excel on tabular data due to their inductive biases, while deep learning struggles with irregular target patterns and rotation invariance, especially when dealing with extraneous features in tabular datasets.

Many studies in the current literature empirically demonstrated the superiority of tree-based ML algorithms. They primarily used one or more datasets for descriptive statistical comparisons [e.g., 22]. Yet, employing statistical significance comparisons like t-tests to demonstrate such supremacy is not widespread. This study analyses the performance of five ML algorithms (two tree-based and three non-tree-based) over 200 datasets to demonstrate the superiority of tree-based algorithms over their counterparts at a statistical significance level.

3. Materials and methods

3.1 Data source

This study uses 200 open-access tabular datasets from the UCI Machine Learning Repository (53) and Kaggle (147). These two repositories host various research datasets that researchers can access for their research studies without any ethical obligation [23, 24]. The latter hosts more datasets than the former. The 200 datasets that this study considered are from 132 unique sources (S2 Table). Some of these sources contain more than one dataset. For example, we evaluated 50 datasets on university-ranking data from the Ultimate University Ranking source [25], containing ranking data from eight ranking-producing organisations, including QS and Times Higher Education, for several years. These 200 datasets are from various research contexts (Fig 1), including disease prediction (66), university-ranking (50), sports (23), finance (15) and academia (14). Acknowledging the potential for selection bias, we have carefully considered the diversity and representativeness of the datasets concerning the research questions addressed in this study. We selected a diverse range of datasets from various domains while ensuring a balance in dataset sizes and types to mitigate the impact of selection bias on our research findings. For preprocessing, we opted for mean-based imputation to address missing data in numerical attributes within unskewed datasets [26]. For categorical data, we followed mode-based imputation. This study used the synthetic minority oversampling technique [27] to balance unbalanced datasets.
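
The two imputation rules described above can be sketched as follows. This is only an illustration, assuming a column is a plain Python list with `None` marking a missing value; the helper names `impute_mean` and `impute_mode` are hypothetical and are not taken from the study's own code.

```python
from statistics import mean, mode

def impute_mean(values):
    """Replace missing entries (None) in a numerical column with the column mean."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_mode(values):
    """Replace missing entries (None) in a categorical column with the most frequent value."""
    observed = [v for v in values if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in values]

# Toy columns (made up for illustration only).
ages = impute_mean([20, None, 40])
colours = impute_mode(["red", None, "red", "blue"])
```

The mean is computed from the observed values only, so the imputed entry does not shift the column mean.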

Fig 1. Frequency and percentage of different dataset contexts.

The percentage figure within the bracket indicates the corresponding percentage value of the left number.

3.2 Machine learning algorithms

This study considers five ML algorithms balancing tree-based and non-tree-based approaches to harness their complementary strengths in addressing our research question.

Tree-based

Tree-based ML algorithms can map non-linear relationships well, making them more adaptable in solving classification and regression problems. They use several if-then conditional rules to develop prediction models. This study used two tree-based ML algorithms: Decision tree (DT) and Random forest (RF). A DT, resembling a natural tree, is a hierarchical model consisting of multiple levels. At each level, different conditions are tested and, based on the test outcomes, the algorithm either reaches a final decision or moves to a test condition at the next level [28]. A DT primarily consists of decision nodes and leaves. A sub-node that is further divided into multiple sub-nodes is called a decision node. A sub-node is called a leaf node when it does not split further into additional sub-nodes. Leaf nodes contain the final decision outcome. RF is a commonly used ML algorithm that combines outputs from multiple DTs to reach a single result. Just as a forest has many trees, an RF consists of several DTs [13]. How the final result is determined depends on the underlying problem. For a classification task, majority voting yields the final predicted categorical outcome. For a regression task, outcomes from individual DTs are averaged. These algorithms allow us to model complex interactions effectively, providing a solid basis for comparison against linear approaches.
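
The two aggregation rules for RF described above can be illustrated with a short sketch. It assumes the individual tree outputs are already available as a list; the function names `forest_classify` and `forest_regress` are hypothetical and chosen only for this example.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_predictions):
    """Majority vote over the class labels predicted by the individual trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_regress(tree_predictions):
    """Average of the numeric outputs predicted by the individual trees."""
    return mean(tree_predictions)

# Toy outputs from three trees (made up for illustration only).
label = forest_classify(["A", "B", "A"])
value = forest_regress([2.0, 4.0, 6.0])
```

Note that the Scikit-learn random forest classifier combines trees by averaging their predicted class probabilities rather than by counting hard votes, so this sketch mirrors the textbook description rather than that library's internals.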

Non-tree-based

The three non-tree-based ML algorithms considered in this study are Support vector machine (SVM), Logistic regression (LR) and K-nearest neighbours (KNN). SVM is among the most widely used supervised ML algorithms. An SVM constructs a decision boundary, known as the hyperplane, for classification [29]. Data points on either side of this boundary belong to different classes. The data points closest to the hyperplane on both sides are called support vectors. These points define the orientation and positioning of the hyperplane. SVM often uses kernel functions to handle the non-linearity of the decision surface for classification. LR estimates the probability of an event occurring on a scale between 0 and 1 [30]. Therefore, we must set a threshold value to use LR for a binary classification task. For example, a value ≤0.5 for a data instance will classify it as ‘class A’; otherwise, ‘class B’. LR can also be used for problems with more than two classes through its generalised version, multinomial LR. KNN seeks votes from its k nearest neighbours to determine the class of a new data instance [31]. The class suggested by most of these votes will be the class of that new instance. There are many measures to quantify the nearest neighbours, including Euclidean distance and cosine similarity. The selection of these algorithms allows us to explore a range of decision boundaries from linear to non-linear, assessing their performance and interpretability in the context of our specific research question.
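
The LR thresholding rule and the KNN voting rule described above can be sketched in a few lines. This is a simplified illustration, assuming Euclidean distance and a toy two-dimensional dataset; the function names `lr_classify` and `knn_classify` are hypothetical.

```python
import math
from collections import Counter

def lr_classify(probability, threshold=0.5):
    """Map an LR probability to a label: values <= threshold give 'class A', otherwise 'class B'."""
    return "class A" if probability <= threshold else "class B"

def knn_classify(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k smallest distances.
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

# Toy training data (made up for illustration only).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
labels = ["A", "A", "A", "B", "B"]
prediction = knn_classify(points, labels, query=(0.5, 0.5), k=3)
```

Swapping `math.dist` for another distance or similarity measure changes which neighbours are selected, which is why the choice of measure matters for KNN.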

By considering both tree-based and non-tree-based algorithms, this study aims to comprehensively evaluate ML strategies, ensuring a robust and versatile analysis that can navigate the varied landscape of our research challenges. This deliberate choice of algorithms facilitates a nuanced understanding of different ML approaches, their strengths and limitations, enabling us to offer more grounded and practical insights into their applicability and performance.

3.3 Confusion matrix and performance measure

A confusion matrix, also known as an error matrix, is a tabular tool to demonstrate the performance of a classification algorithm [32]. Values in a confusion matrix can be of four types (Fig 2). True-positive (TP) is when a model correctly predicts a positive class. Similarly, True-negative (TN) is when a model correctly predicts a negative class. False-positive (FP) and False-negative (FN) occur when a model incorrectly predicts a positive and a negative class, respectively.

Fig 2. Confusion matrix.

Following the approaches of previous studies [e.g., 33, 34], this study uses four performance measures based on these four confusion matrix values: accuracy, precision, recall and F1 score. Their equations are as follows.

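$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$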
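
As a worked illustration of these four measures, the sketch below computes them from the four confusion matrix counts. The function name `classification_metrics` and the example counts are assumptions made for this illustration; the sketch does not guard against zero denominators.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy counts (made up for illustration only): 100 instances in total.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

With these counts, accuracy is 85/100, precision is 40/45 and recall is 40/50, which shows how the measures can diverge on the same predictions.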

3.4 Paired-sample t-test

The paired sample t-test, also known as the dependent sample t-test, is a statistical method used to check whether the mean difference between two observed groups is statistically significant [35]. Below is the formula for the paired sample t-test.

$$t = \frac{\bar{d}}{s / \sqrt{n}}$$

where $\bar{d}$ and $s$ are the mean and standard deviation of all pairwise difference values, and $n$ is the total number of pairs in the dataset.
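
The statistic can be computed directly from two paired lists of scores, as in the sketch below. It assumes that s is the sample standard deviation (n - 1 denominator), as is usual for this test; the function name `paired_t_statistic` and the accuracy values are hypothetical and are not results from this study.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Paired-sample t statistic: mean difference divided by its standard error."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    d_bar = mean(diffs)
    s = stdev(diffs)  # sample standard deviation (assumption: n - 1 denominator)
    return d_bar / (s / math.sqrt(n))

# Toy accuracies of two algorithms on the same four datasets (made up for illustration).
rf_accuracy = [0.95, 0.90, 0.99, 0.97]
svm_accuracy = [0.90, 0.88, 0.93, 0.95]
t_value = paired_t_statistic(rf_accuracy, svm_accuracy)
```

The study itself ran this test in IBM SPSS Statistics; in Python, `scipy.stats.ttest_rel` offers a comparable test that also returns a p-value.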

3.5 Experimental setup

This study used the Scikit-learn library [36] to implement the five ML algorithms with each research dataset considered in this study. Each dataset underwent an 80:20 split for the training and test data separation, and the training model development followed a five-fold cross-validation. To promote reproducibility, we adhered to Scikit-learn default parameters for all algorithms, ensuring a transparent and standardised experimental framework. While we applied the preprocessing steps uniformly across datasets, these were limited to essential procedures like normalisation, handled intrinsically by Scikit-learn [36]. For the paired sample t-test, we used IBM SPSS Statistics, version 28.0.0.0 [35], with its default setup for this test.
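
One way such a setup could look in Scikit-learn is sketched below for a single dataset. It is not the authors' script: the synthetic dataset from `make_classification`, the fixed `random_state` values and the `results` dictionary are assumptions made so that the example is self-contained and repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one tabular dataset (assumption: the study used real datasets).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 80:20 split into training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Default hyperparameters; random_state is set only to make the sketch repeatable.
models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "LR": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    # Five-fold cross-validation on the training portion.
    cv_accuracy = cross_val_score(
        model, X_train, y_train, cv=5, scoring="accuracy"
    ).mean()
    # Refit on the full training portion and score on the held-out 20%.
    test_accuracy = model.fit(X_train, y_train).score(X_test, y_test)
    results[name] = {"cv_accuracy": cv_accuracy, "test_accuracy": test_accuracy}
```

Repeating this loop over every dataset would give, for each algorithm, one score per dataset, which is the paired structure that the t-test in Section 3.4 requires.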

4. Results

Table 1 presents the paired-sample t-test results between the trained tree-based and non-tree-based ML algorithms for the four performance measures we used in this study. As illustrated in the mean columns of this table, tree-based DT and RF show much higher values than the non-tree-based SVM, LR and KNN for all four performance measures. The differences are significant at the p<0.001 level for each performance measure. Notably, the mean difference between the two tree-based ML algorithms is minimal for each performance measure, leading to identical t-values in many cases.

Table 1. Paired-sample t-test results for the four performance measures between trained tree-based and non-tree-based supervised machine learning algorithms for the 200 datasets considered in this study.

Test | Tree-based | Non-tree-based | Mean 1 (tree-based) | Mean 2 (non-tree-based) | Std 1 (tree-based) | Std 2 (non-tree-based) | N | t | Sig.
(a) Accuracy
1 | Random forest | Support vector machine | 0.99838 | 0.92889 | 0.89796 | 8.57100 | 200 | 11.67 | <0.001
2 | Random forest | Logistic regression | 0.99838 | 0.90277 | 0.89796 | 10.64105 | 200 | 12.86 | <0.001
3 | Random forest | K-nearest neighbour | 0.99838 | 0.91919 | 0.89796 | 8.28808 | 200 | 13.77 | <0.001
4 | Decision tree | Support vector machine | 0.99839 | 0.92889 | 0.89801 | 8.57100 | 200 | 11.67 | <0.001
5 | Decision tree | Logistic regression | 0.99839 | 0.90277 | 0.89801 | 10.64105 | 200 | 12.86 | <0.001
6 | Decision tree | K-nearest neighbour | 0.99839 | 0.91919 | 0.89801 | 8.28808 | 200 | 13.77 | <0.001
(b) Precision
1 | Random forest | Support vector machine | 0.99843 | 0.92166 | 0.86829 | 10.28040 | 200 | 10.74 | <0.001
2 | Random forest | Logistic regression | 0.99843 | 0.89307 | 0.86829 | 12.27288 | 200 | 12.29 | <0.001
3 | Random forest | K-nearest neighbour | 0.99843 | 0.91798 | 0.86829 | 8.39003 | 200 | 13.82 | <0.001
4 | Decision tree | Support vector machine | 0.99845 | 0.92166 | 0.85806 | 10.28040 | 200 | 10.74 | <0.001
5 | Decision tree | Logistic regression | 0.99845 | 0.89307 | 0.85806 | 12.27288 | 200 | 12.30 | <0.001
6 | Decision tree | K-nearest neighbour | 0.99845 | 0.91798 | 0.85806 | 8.39003 | 200 | 13.82 | <0.001
(c) Recall
1 | Random forest | Support vector machine | 0.99839 | 0.92889 | 0.89796 | 8.57087 | 200 | 11.67 | <0.001
2 | Random forest | Logistic regression | 0.99839 | 0.90277 | 0.89796 | 10.64105 | 200 | 12.86 | <0.001
3 | Random forest | K-nearest neighbour | 0.99839 | 0.91910 | 0.89796 | 8.28808 | 200 | 13.77 | <0.001
4 | Decision tree | Support vector machine | 0.99840 | 0.92889 | 0.89801 | 8.57087 | 200 | 11.67 | <0.001
5 | Decision tree | Logistic regression | 0.99840 | 0.90277 | 0.89801 | 10.64105 | 200 | 12.86 | <0.001
6 | Decision tree | K-nearest neighbour | 0.99840 | 0.91910 | 0.89801 | 8.28808 | 200 | 13.77 | <0.001
(d) F1 score
1 | Random forest | Support vector machine | 0.99838 | 0.91907 | 0.90220 | 10.21913 | 200 | 11.16 | <0.001
2 | Random forest | Logistic regression | 0.99838 | 0.89171 | 0.90220 | 12.15084 | 200 | 12.57 | <0.001
3 | Random forest | K-nearest neighbour | 0.99838 | 0.91569 | 0.90220 | 8.62228 | 200 | 13.82 | <0.001
4 | Decision tree | Support vector machine | 0.99836 | 0.91907 | 0.92129 | 10.21913 | 200 | 11.17 | <0.001
5 | Decision tree | Logistic regression | 0.99836 | 0.89171 | 0.92129 | 12.15084 | 200 | 12.58 | <0.001
6 | Decision tree | K-nearest neighbour | 0.99836 | 0.91569 | 0.92129 | 8.62228 | 200 | 13.83 | <0.001

Table 2 presents the number of times (out of 200 datasets) each ML algorithm performed best during the training phase for each of the four performance metrics. Tree-based RF and DT outperformed the three non-tree-based ones by a large margin. Interestingly, each of the non-tree-based SVM, LR and KNN has the same count across all four performance measures. The sum of each row in this table is higher than 200 since, in many cases, multiple ML algorithms performed best for the same measure on the same dataset. For example, DT showed the best accuracy score for all 200 datasets. RF matched this best accuracy score for 194 datasets. For SVM, the count is only 17.

Table 2. Frequency statistics of the best performance of different trained machine learning algorithms against four performance metrics.

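Measure | Random forest (tree-based) | Decision tree (tree-based) | Support vector machine (non-tree-based) | Logistic regression (non-tree-based) | K-nearest neighbours (non-tree-based)
Accuracy | 194 | 200 | 17 | 13 | 9
Precision | 188 | 198 | 17 | 13 | 9
Recall | 194 | 200 | 17 | 13 | 9
F1 score | 193 | 193 | 17 | 13 | 9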
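
The reason a row in Table 2 can sum to more than 200 is that every algorithm tied for the top score on a dataset is counted as a best performer. The sketch below illustrates this counting rule on two toy datasets; the function name `count_best`, the scores and the use of exact equality to detect ties are all assumptions made for this example.

```python
def count_best(scores_per_dataset):
    """Count, per algorithm, the datasets on which it attains the top score (ties all count)."""
    counts = {}
    for scores in scores_per_dataset:
        best = max(scores.values())
        for name, score in scores.items():
            counts.setdefault(name, 0)
            if score == best:  # assumption: ties are detected by exact equality
                counts[name] += 1
    return counts

# Toy accuracy scores on two datasets (made up for illustration only).
toy_scores = [
    {"RF": 1.00, "DT": 1.00, "SVM": 0.90},
    {"RF": 0.98, "DT": 1.00, "SVM": 1.00},
]
counts = count_best(toy_scores)
```

Here the counts add up to four even though there are only two datasets, because each dataset has two tied winners.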

Of the 200 datasets this study considered, 66 are from the disease prediction context. The second largest is the university-ranking context, which has 50 datasets. We employed the paired-sample t-tests on the datasets of these two contexts for further in-depth investigation and validation. Table 3 details the corresponding findings only for the accuracy measure. S1(A)–S1(F) Table outlines the results for the precision, recall and F1 score measures. DT and RF have the same mean accuracy and recall values for the disease prediction datasets. For the university-ranking context, they have the same mean value for all four performance measures. These two tables echo the findings from Tables 1 and 2; tree-based ML algorithms revealed superior performance compared to non-tree-based ones at the p<0.001 significance level for the datasets of these two research contexts. It is worth noting that the tree-based ML algorithms attained the highest possible score on all four performance measures for the university-ranking datasets.

Table 3. Paired-sample t-test results for the accuracy measure between tree-based and non-tree-based supervised machine learning algorithms for the datasets from disease prediction and university-ranking contexts.

Test | Tree-based | Non-tree-based | Mean 1 (tree-based) | Mean 2 (non-tree-based) | Std 1 (tree-based) | Std 2 (non-tree-based) | N | t | Sig.
(a) Disease prediction context (66 datasets)
1 | Random forest | Support vector machine | 0.99575 | 0.89766 | 1.47404 | 10.02376 | 66 | 8.093 | <0.001
2 | Random forest | Logistic regression | 0.99575 | 0.87265 | 1.47404 | 11.62626 | 66 | 8.703 | <0.001
3 | Random forest | K-nearest neighbour | 0.99575 | 0.89919 | 1.47404 | 9.30741 | 66 | 8.630 | <0.001
4 | Decision tree | Support vector machine | 0.99575 | 0.89766 | 1.47404 | 10.02376 | 66 | 8.093 | <0.001
5 | Decision tree | Logistic regression | 0.99575 | 0.87265 | 1.47404 | 11.62626 | 66 | 8.703 | <0.001
6 | Decision tree | K-nearest neighbour | 0.99575 | 0.89919 | 1.47404 | 9.30741 | 66 | 8.630 | <0.001
(b) University-ranking context (50 datasets)
1 | Random forest | Support vector machine | 1.00000 | 0.99189 | 0.00000 | 1.10923 | 50 | 5.171 | <0.001
2 | Random forest | Logistic regression | 1.00000 | 0.98652 | 0.00000 | 1.56146 | 50 | 6.105 | <0.001
3 | Random forest | K-nearest neighbour | 1.00000 | 0.97987 | 0.00000 | 2.07197 | 50 | 6.871 | <0.001
4 | Decision tree | Support vector machine | 1.00000 | 0.99189 | 0.00000 | 1.10923 | 50 | 5.171 | <0.001
5 | Decision tree | Logistic regression | 1.00000 | 0.98652 | 0.00000 | 1.56146 | 50 | 6.105 | <0.001
6 | Decision tree | K-nearest neighbour | 1.00000 | 0.97987 | 0.00000 | 2.07197 | 50 | 6.871 | <0.001

To ensure validation, we conducted t-tests between tree-based and non-tree-based ML algorithms for their performance obtained during the test phase. The outcomes of these t-tests resembled those that resulted when comparing their performance during the training phase (Table 1). Table 4 presents an instance of such results for the accuracy measure. This study observed similar findings from the test phase for the other three performance measures (precision, recall and F1 score).

Table 4. Paired-sample t-test results for the accuracy measure between tree-based and non-tree-based supervised machine learning algorithms during the test phase for the 200 datasets considered in this study.

Test | Tree-based | Non-tree-based | Mean 1 (tree-based) | Mean 2 (non-tree-based) | Std 1 (tree-based) | Std 2 (non-tree-based) | N | t | Sig.
1 | Random forest | Support vector machine | 0.92455 | 0.89635 | 0.10654 | 0.13134 | 200 | 6.150 | <0.001
2 | Random forest | Logistic regression | 0.92455 | 0.89065 | 0.10654 | 0.13665 | 200 | 4.491 | <0.001
3 | Random forest | K-nearest neighbour | 0.92455 | 0.88230 | 0.10654 | 0.12493 | 200 | 5.914 | <0.001
4 | Decision tree | Support vector machine | 0.90969 | 0.89635 | 0.12295 | 0.13134 | 200 | 2.637 | 0.005
5 | Decision tree | Logistic regression | 0.90969 | 0.89065 | 0.12295 | 0.13665 | 200 | 2.354 | 0.100
6 | Decision tree | K-nearest neighbour | 0.90969 | 0.88230 | 0.12295 | 0.12493 | 200 | 3.445 | <0.001

5. Discussion

This study uses statistical tests to compare the performance between tree-based ML and non-tree-based algorithms. The results consistently indicate that tree-based ML algorithms outperform their non-tree-based counterparts across all four performance measures, with the differences being statistically significant at the p < 0.001 level. This superiority remains consistent when delving deeper into specific datasets, especially those related to disease prediction and university ranking. Tree-based algorithms attained an impeccable score for the university-ranking datasets, hinting at their effectiveness in this domain. The exceptional accuracy scores achieved in our models prompt considerations of overfitting. However, through stringent cross-validation, we have ensured the robustness and generalisability of our results. Furthermore, the evident distinctions among classification groups identified in our analysis greatly enhance the discernibility, affirming the prediction accuracy of the models considered in the study within the specific attributes of the underlying dataset.

There could be several reasons for the superior performance of the tree-based ML algorithms compared with their counterparts. The most notable one is that, unlike linear models, they can map non-linear relations very well, empowering them with excellent prediction accuracy and greater stability [37]. Moreover, they can better accommodate categorical and numerical data than others [38]. Tree-based ML algorithms can be described as sets of if-else statements, enabling them to assimilate non-linear and categorical data during the model learning process. This results in enhanced predictive accuracy.

Datasets from Kaggle and the UCI Machine Learning Repository often exhibit high prediction accuracy. The 50 university-ranking datasets show 100% prediction accuracy, as detailed in Table 3(b). There are several reasons behind such high accuracy. Datasets are carefully preprocessed and cleaned, with inconsistencies, missing values and outliers removed, before being made available on these two open-access platforms. They may also have well-engineered features that simplify the modelling process and enhance prediction performance. For these reasons, models developed and tested on these open-access datasets often reveal superior accuracy. However, caution should be taken in real-world applications, which often encounter diverse, unclean and complex data.

The implications of our findings are manifold. Confirming the superiority of the tree-based ML algorithms will help future researchers make appropriate data analysis plans for their studies. For classification, some studies in the literature [e.g., 39] employed only the non-tree-based ML algorithms on tabular data. Uddin et al. [20] pointed out that SVM is the most used ML algorithm in the disease prediction literature. Our findings suggest that, along with other ML algorithms, researchers should consider at least one tree-based one for addressing a classification task using a tabular dataset. This suggestion offers a clear indication towards the choice of algorithm, especially when tackling tabular datasets.

One of the limitations of this study is that it considered only classical tree-based and non-tree-based supervised ML algorithms for comparison. It did not evaluate other ML algorithms, such as deep learning ones. Although RF is one of the classical ML algorithms, it is an ensemble approach. Our study did not consider other practical ensemble approaches. For example, this study did not consider tree-based AdaBoost and XGBoost ensemble ML algorithms like other studies [e.g., 40] in the literature. Similarly, we did not consider unsupervised ML algorithms such as k-means clustering for performance comparison. Another notable limitation of this study is that although it statistically demonstrated the superiority of tree-based ML algorithms, it did not explore the underlying reasoning behind such dominance. It is evident in the literature that tree-based ML algorithms can handle non-linear classification datasets better [13], which could be a possible reason for their performance superiority. Uddin and Lu [41] noticed that dataset meta-level and statistical attributes do not impact the performance of tree-based MLs. However, they have a statistically significant impact on non-tree-based ML algorithms. Further, we recognise that our focus on standard performance metrics like accuracy, precision, recall, and F1 score may not fully capture the complexity of model evaluation, omitting metrics such as Area under the receiver operating characteristic curve and specificity. Future studies will aim to incorporate a broader array of metrics, ensuring a more comprehensive assessment of classification model performance in alignment with research objectives.

These limitations could help define new future research scopes and opportunities. Exploring the performance of ensemble tree-based algorithms with the results of this study can offer more comprehensive insights. While our findings suggest tree-based algorithms outperform non-tree-based ones across multiple datasets, we recognise the importance of considering dataset-specific characteristics, such as feature distribution and complexity, that could influence algorithm performance. Uddin and Lu [41] discovered that ML algorithms exhibit varying performances when applied to datasets with distinct meta-level and statistical attributes. Moreover, an explanatory approach, combined with domain expertise, could unearth the factors contributing to the superiority of tree-based algorithms. Such insights can refine algorithmic choices and expand our understanding of the underlying mechanics of these algorithms. To further enrich the field, future studies should investigate the efficacy of deep learning (e.g., convolutional neural networks), unsupervised learning (e.g., k-means clustering) and ensemble tree-based algorithms (e.g., AdaBoost and XGBoost), and dissect the factors influencing algorithmic performance disparities. Detailed comparative analyses and examining feature importance across varied datasets will be crucial in these endeavours, offering a more straightforward path toward optimising ML applications in various domains.

6. Conclusion

Non-tree-based ML algorithms performed worse than tree-based ones at a statistically significant level. Many individual studies in the current literature also pointed out this kind of superiority of tree-based algorithms over non-tree-based ones. To our knowledge, this study is the first to confirm such supremacy statistically by employing two tree-based and three non-tree-based classical ML algorithms on 200 datasets from various research contexts. Future studies can consider other tree-based and non-tree-based ML algorithms (e.g., ensemble ones) to explore such dominance of the former using research datasets from different contexts. Until then, our findings can help future researchers select appropriate ML algorithms when designing their research analyses and experimental setups.

Supporting information

S1 Table. Paired-sample t-test results for the precision, recall and F1 score measures between tree-based and non-tree-based trained supervised machine learning algorithms for the datasets from disease prediction (66) and university-ranking contexts (50).

(DOCX)

S2 Table. Dataset source information.

(DOCX)

Acknowledgments

We acknowledge the University of Sydney’s Vacation Research Internship recipient, Palak Mahajan, for her contribution to dataset extraction and preprocessing.

Data Availability

The 200 datasets used in this study are publicly available from open-source repositories. S2 Table contains the web address of each dataset.

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1. Jordan M.I. and Mitchell T.M., Machine learning: Trends, perspectives, and prospects. Science, 2015. 349(6245): p. 255–260.
  • 2. Rehman S.U., Tu S., Huang Y., and Rehman O.U., A benchmark dataset and learning high-level semantic embeddings of multimedia for cross-media retrieval. IEEE Access, 2018. 6: p. 67176–67188.
  • 3. Rehman S.U., Tu S., Huang Y., and Yang Z. Face recognition: A novel un-supervised convolutional neural network method. in 2016 IEEE International Conference of Online Analysis and Computing Science (ICOACS). 2016. IEEE.
  • 4. Li N., Shepperd M., and Guo Y., A systematic review of unsupervised learning techniques for software defect prediction. Information and Software Technology, 2020. 122: p. 106287.
  • 5. Lu H. and Uddin S., Unsupervised machine learning for disease prediction: a comparative performance analysis using multiple datasets. Health and Technology, 2024. 14(1): p. 141–154.
  • 6. Uddin S., Wang S., Lu H., Khan A., Hajati F., and Khushi M., Comorbidity and multimorbidity prediction of major chronic diseases using machine learning and network analytics. Expert Systems with Applications, 2022. 205: p. 117761.
  • 7. Hossain M.E., Khan A., and Uddin S. Understanding the progression of congestive heart failure of type 2 diabetes patient using disease network and hospital claim data. in Complex Networks and Their Applications VIII: Volume 2 Proceedings of the Eighth International Conference on Complex Networks and Their Applications COMPLEX NETWORKS 2019 8. 2020. Springer.
  • 8. Hossain M.E., Khan A., and Uddin S. Understanding the comorbidity of multiple chronic diseases using a network approach. in Proceedings of the Australasian Computer Science Week Multiconference. 2019.
  • 9. Rehman S.U., Tu S., Rehman O.U., Huang Y., Magurawalage C.M.S., and Chang C.-C., Optimization of CNN through novel training strategy for visual classification problems. Entropy, 2018. 20(4): p. 290.
  • 10. Tu S., Huang Y., and Liu G., CSFL: A novel unsupervised convolution neural network approach for visual pattern classification. AI Communications, 2017. 30(5): p. 311–324.
  • 11. Tu S., Rehman S.U., Waqas M., Rehman O.U., Shah Z., Yang Z., et al., ModPSO-CNN: an evolutionary convolution neural network with application to visual recognition. Soft Computing, 2021. 25: p. 2165–2176.
  • 12. James G., Witten D., Hastie T., and Tibshirani R., An introduction to statistical learning. Vol. 112. 2013: Springer.
  • 13. Breiman L., Random Forests. Machine Learning, 2001. 45(1): p. 5–32.
  • 14. Kleinbaum D.G., Dietz K., Gail M., Klein M., and Klein M., Logistic regression. 2002: Springer.
  • 15. Cortes C. and Vapnik V., Support-vector networks. Machine Learning, 1995. 20(3): p. 273–297.
  • 16. Lundberg S.M. and Lee S.-I., A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems, 2017. 30.
  • 17. Perlich C., Provost F., and Simonoff J., Tree induction vs. logistic regression: A learning-curve analysis. 2003.
  • 18. Caruana R. and Niculescu-Mizil A. An empirical comparison of supervised learning algorithms. in Proceedings of the 23rd international conference on Machine learning. 2006.
  • 19. Fernández-Delgado M., Cernadas E., Barro S., and Amorim D., Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 2014. 15(1): p. 3133–3181.
  • 20. Uddin S., Khan A., Hossain M.E., and Moni M.A., Comparing different supervised machine learning algorithms for disease prediction. BMC Medical Informatics and Decision Making, 2019. 19(1): p. 1–16.
  • 21. Grinsztajn L., Oyallon E., and Varoquaux G., Why do tree-based models still outperform deep learning on typical tabular data? Advances in Neural Information Processing Systems, 2022. 35: p. 507–520.
  • 22. Farias F.M., Salomão R.C., Santos E., Caires A.S., Sampaio G.S.A., Rosa A.A.M., et al., Sex-related difference in the retinal structure of young adults: a machine learning approach. Frontiers in Medicine, 2023. 10: p. 1275308.
  • 23. Frank A., UCI machine learning repository. http://archive.ics.uci.edu/ml, 2010.
  • 24. Kaggle. 2023; Available from: https://www.kaggle.com/
  • 25. Ultimate University Ranking. [cited 2023]; Available from: https://www.kaggle.com/datasets/erfansobhaei/ultimate-university-ranking/data
  • 26. Wei R., Wang J., Su M., Jia E., Chen S., Chen T., et al., Missing value imputation approach for mass spectrometry-based metabolomics data. Scientific Reports, 2018. 8(1): p. 663.
  • 27. Ishaq A., Sadiq S., Umer M., Ullah S., Mirjalili S., Rupapara V., et al., Improving the prediction of heart failure patients’ survival using SMOTE and effective data mining techniques. IEEE Access, 2021. 9: p. 39707–39716.
  • 28. Quinlan J.R., Induction of decision trees. Machine Learning, 1986. 1(1): p. 81–106.
  • 29. Noble W.S., What is a support vector machine? Nature Biotechnology, 2006. 24(12): p. 1565–1567.
  • 30. Stoltzfus J.C., Logistic regression: a brief primer. Academic Emergency Medicine, 2011. 18(10): p. 1099–1104.
  • 31. Peterson L.E., K-nearest neighbor. Scholarpedia, 2009. 4(2): p. 1883.
  • 32. Ting K.M., Confusion matrix. Encyclopedia of Machine Learning and Data Mining, 2017: p. 260–260.
  • 33. Yuan Q., Chen K., Yu Y., Le N.Q.K., and Chua M.C.H., Prediction of anticancer peptides based on an ensemble model of deep learning and machine learning using ordinal positional encoding. Briefings in Bioinformatics, 2023. 24(1): p. bbac630.
  • 34. Le N.-Q.-K. and Ou Y.-Y., Incorporating efficient radial basis function networks and significant amino acid pairs for predicting GTP binding sites in transport proteins. BMC Bioinformatics, 2016. 17(19): p. 183–192.
  • 35. Field A., Discovering statistics using SPSS. 2013, London: Sage Publications Ltd.
  • 36. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., et al., Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 2011. 12(Oct): p. 2825–2830.
  • 37. Dumitrescu E., Hué S., Hurlin C., and Tokpavi S., Machine learning for credit scoring: Improving logistic regression with non-linear decision-tree effects. European Journal of Operational Research, 2022. 297(3): p. 1178–1192.
  • 38. Song Y.-Y. and Ying L., Decision tree methods: applications for classification and prediction. Shanghai Archives of Psychiatry, 2015. 27(2): p. 130.
  • 39. Farran B., Channanath A.M., Behbehani K., and Thanaraj T.A., Predictive models to assess risk of type 2 diabetes, hypertension and comorbidity: machine-learning algorithms and validation using national health data from Kuwait—a cohort study. BMJ Open, 2013. 3(5): p. e002457.
  • 40. Mahajan P., Uddin S., Hajati F., and Moni M.A., Ensemble Learning for Disease Prediction: A Review. Healthcare, 2023. 11(12): p. 1808.
  • 41. Uddin S. and Lu H., Dataset meta-level and statistical features affect machine learning performance. Scientific Reports, 2024. 14(1): p. 1670.

Decision Letter 0

Nagarajan Raju

22 Feb 2024

PONE-D-24-03825

Confirming the statistically significant superiority of tree-based machine learning algorithms over their counterparts for tabular data

PLOS ONE

Dear Dr. Uddin,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Apr 07 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Nagarajan Raju

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

Additional Editor Comments (if provided):

I suggest the authors go through the reviewers' comments and address them properly in the revised manuscript.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: No

Reviewer #4: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: Yes

Reviewer #3: No

Reviewer #4: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: No

Reviewer #4: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: No

Reviewer #4: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: 1. The study relies on tabular datasets from the UCI Machine Learning Repository and Kaggle, which are popular repositories for machine learning datasets. However, there's a risk of selection bias as these datasets may not be representative of real-world data or may have biases inherent in their collection process.

2. The study chooses five machine learning algorithms, but the rationale for selecting these specific algorithms is not thoroughly justified. While tree-based and non-tree-based algorithms are commonly used, there should be a discussion on why these particular algorithms were chosen over others and how they complement each other in addressing the research question.

3. While the study uses common performance metrics such as accuracy, precision, recall, and F1 score, there's limited discussion on why these metrics were chosen and how they align with the research objectives. Additionally, there's no mention of other important metrics such as area under the receiver operating characteristic curve (AUC-ROC) or specificity, which are crucial for evaluating classification models.

4. The study mentions using the Scikit-learn library for implementing machine learning algorithms and IBM SPSS Statistics for conducting paired-sample t-tests. While these are widely used tools, the lack of detailed information on specific parameter settings and preprocessing techniques could hinder reproducibility. Providing a clear and detailed description of the experimental setup would enhance the study's transparency and reproducibility.

5. The study splits the data into training and test sets using an 80:20 ratio and performs five-fold cross-validation during model development. While cross-validation helps assess the model's performance, there's no external validation using independent datasets.

6. The study focuses solely on classical tree-based and non-tree-based supervised ML algorithms, neglecting other important techniques such as deep learning algorithms or unsupervised learning algorithms.

7. Measurement metrics (i.e., accuracy, recall, etc.) are well-known and have been used in previous biomedical studies such as PMID: 36642410, PMID: 28155651. Therefore, the authors are suggested to refer to more works in this description to attract a broader readership.

8. The study does not consider ensemble approaches beyond Random Forest (RF), such as AdaBoost or XGBoost, which have shown significant performance improvements in various classification tasks.

9. While the study demonstrates the superiority of tree-based ML algorithms, it fails to explore the underlying reasons behind this dominance.

10. The study suggests that tree-based algorithms consistently outperform non-tree-based algorithms across all datasets, without considering potential dataset-specific factors that may influence algorithm performance.

11. While the study briefly mentions future research opportunities, such as exploring ensemble tree-based algorithms and investigating the underlying reasons for algorithmic performance, it lacks depth in discussing specific research avenues and methodologies.

Reviewer #2: My comments are as follows:

1) The study focuses on classical algorithms and may not include recent advancements in machine learning, such as deep learning techniques that have shown promise in handling tabular data. It is highly recommended to include these for more contemporary perspective.

2) Abstract does not highlight novelty of the proposed work. It’s better to add more specific details of your work.

3) Introduction is not focused and literature can be reorganised to strengthen literature review following contributions and discuss few relevant works i.e.,

a) A Benchmark Dataset and Learning High-level Semantic Embeddings of Multimedia for Cross-media Retrieval

b) Unsupervised pre-trained filter learning approach for efficient convolution neural network

c) CSFL: A novel unsupervised convolution neural network approach for visual pattern classification

d) Optimization of CNN through novel training strategy for visual classification problems

e) Face recognition: A novel un-supervised convolutional neural network method

f) ModPSO-CNN: an evolutionary convolution neural network with application to visual recognition

g) Two-stage domain adaptation for infrared ship target segmentation

4) The work does not delve deeply into the impact of feature engineering and data preprocessing steps, which are crucial for the performance of machine learning algorithms. Add a detailed discussion of it.

5) While the proposed work effectively compares tree-based algorithms with non-tree-based counterparts, it might lack a deeper analysis of why certain algorithms perform better than others. A more thorough investigation into the intrinsic properties of the datasets that favour tree-based methods is needed.

Reviewer #3: The paper is not scientifically sound to be published in this form.

Reviewer #4: The study aims to investigate the statistical significance of the performance of decision tree-based algorithms over other classical machine learning algorithms. Some points need modification in a final version. The manuscript's idea is interesting, since it seems inappropriate for articles on machine learning algorithms not to conduct statistical comparisons between the accuracies obtained by these algorithms in classification tasks.

Abstract and Introduction

-"no study has shown such supremacy through a statistical significance test." and "However, none shows such supremacy by employing any statistical significance comparison, such as a t-test." It's not true; below I can indicate an example that used statistics to compare the accuracy of machine learning algorithms, and it is possible that others have proceeded similarly. I suggest the authors rewrite the sentence and indicate that it is not usual to find statistical comparisons between the classification performance of machine learning algorithms.

Farias, F. M., Salomão, R. C., Rocha Santos, E. G., Sousa Caires, A., Sampaio, G. S. A., Rosa, A. A. M., Costa, M. F., & Silva Souza, G. (2023). Sex-related difference in the retinal structure of young adults: a machine learning approach. Frontiers in medicine, 10, 1275308. https://doi.org/10.3389/fmed.2023.1275308.

Methods

-Figure 1: Use a dot instead of a comma for decimal numbers. Include the label name for the X-axis.

-It would be important to provide more information about the type of data used. Time series for subsequent feature extraction? Was feature extraction performed? If yes, how many and which ones were extracted? Were they the same for all comparisons? How many groups were used in different datasets?

-Why was the t-test chosen over an analysis of variance? I think it would be more appropriate to use an analysis of variance or Kruskal-Wallis or perform a Bonferroni correction for the t-test results.

-I suggest performing at least a 10-fold cross-validation.

-Was there data preprocessing? Any normalization? I think it would be important.

-Does it make sense to compare the performance of random forest and decision tree?

Results

- Indicate the standard deviation of the mean values in Table 1 and Table 3.

- Table 3 shows accuracy of 1. Does it imply overfitting? Or do the groups exhibit very large differences, leading to easier classification? This debate could be done in Discussion section

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Reviewer #4: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Apr 18;19(4):e0301541. doi: 10.1371/journal.pone.0301541.r002

Author response to Decision Letter 0


8 Mar 2024

Reviewer response

Confirming the statistically significant superiority of tree-based machine learning algorithms over their counterparts for tabular data

We sincerely thank the reviewers and editor for their insightful suggestions and comments. Here is our response to the corrections suggested by each of them. Changes are marked in red colour in the revised main manuscript file.

Suggestions from the Editor

Comment 1:

Please ensure that your manuscript meets PLOS ONE’s style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

Our response: Thank you very much for this suggestion. We paid particular attention to meeting the PLOS ONE style requirements while revising this manuscript.

Comment 2:

I suggest authors go through the reviewers' comments and address them adequately in the revised manuscript.

Our response: Thank you for this suggestion. We gave our best effort to address all comments by the reviewers adequately.

Reviewer: 01

Comment 1: The study relies on tabular datasets from the UCI Machine Learning Repository and Kaggle, which are popular repositories for machine learning datasets. However, there is a risk of selection bias as these datasets may not represent real-world data or may have inherent biases in their collection process.

Our response: Thank you for pointing out this selection bias issue. We have taken steps to avoid this kind of bias as much as possible. The selected datasets considered in our study are from a wide range of contexts, as outlined in Figure 1 (page 6). In addition, we followed different statistical approaches, such as mean-based imputation [1] for handling the missing data problem and the synthetic minority oversampling technique [2] to make an unbalanced dataset a balanced one. Please see lines 140-147 on page 5 for further information.

Comment 2: The study chooses five machine learning algorithms, but the rationale for selecting these specific algorithms is not thoroughly justified. While tree-based and non-tree-based algorithms are commonly used, there should be a discussion on why these particular algorithms were chosen over others and how they complement each other in addressing the research question.

Our response: We appreciate this comment. Accordingly, we have revised Section 3.2 to clarify our rationale for selecting specific tree-based and non-tree-based algorithms, highlighting their complementary strengths in addressing our research question. This revision ensures a balanced exploration of machine learning strategies, enhancing the methodological rigour of our study and the relevance of our findings. Please see lines 161-162 (page 6), 177 –179 (page 6), and 194 – 201 (page 7) for further details.

Comment 3: While the study uses standard performance metrics such as accuracy, precision, recall, and F1 score, there is limited discussion on why these metrics were chosen and how they align with the research objectives. Additionally, there is no mention of other important metrics, such as area under the receiver operating characteristic curve (AUC-ROC) or specificity, which are crucial for evaluating classification models.

Our response: Thank you for highlighting the importance of diverse evaluation metrics in addition to the commonly used four we considered (i.e., accuracy, precision, recall and F1 score). In response, we have acknowledged this limitation in our manuscript and emphasised our intention to explore additional metrics such as AUC-ROC and specificity in future research to provide a more holistic evaluation of model performance. We added these to the limitations and future study scope of our study. Please see lines 368 – 372 on page 14.
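To make the relationship between the four measures used in the study and the specificity measure raised in this comment concrete, the following is a minimal illustrative sketch, not code from the study. The function name `classification_metrics` and the confusion-matrix counts are hypothetical and chosen only for illustration. AUC-ROC is left out because it needs predicted scores rather than counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute count-based binary-classification measures.

    tp, fp, fn, tn are confusion-matrix counts (true/false positives/negatives).
    A ratio with a zero denominator is reported as 0.0 (an assumption of this sketch).
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also called sensitivity
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),
    }


# Hypothetical counts for illustration only; not results from the study.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Specificity is the only measure here that depends on the true-negative rate alone, which is why it can add information beyond precision, recall and F1 score.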

Comment 4: The study mentions using the Scikit-learn library to implement machine learning algorithms and IBM SPSS Statistics to conduct paired-sample t-tests. While these are widely used, the lack of detailed information on specific parameter settings and preprocessing techniques hinders reproducibility. A clear and precise description of the experimental setup would enhance transparency and reproducibility.

Our response: Thank you for emphasising the importance of detailing our experimental setup. We have clarified in the manuscript that we used default Scikit-learn and IBM SPSS settings for all algorithms and statistical tests to ensure straightforward reproducibility. For the t-test, we used the SPSS default parameter setting for this test. Our approach aims for maximum transparency and ease of replication. Please check lines 238 – 243 on page 8 for further details.
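As a hedged illustration of what a paired-sample t-test on per-dataset scores looks like, the sketch below uses SciPy's `scipy.stats.ttest_rel` in place of IBM SPSS, which is the tool the study actually used. The accuracy values and the helper name `paired_t_test` are hypothetical; they are invented for illustration and are not results from the study.

```python
from scipy import stats

# Hypothetical accuracies of two algorithms on the same eight datasets.
# Each position refers to one dataset, so the samples are paired.
acc_tree = [0.91, 0.88, 0.95, 0.79, 0.85, 0.90, 0.93, 0.82]
acc_other = [0.86, 0.84, 0.93, 0.75, 0.80, 0.88, 0.89, 0.80]


def paired_t_test(scores_a, scores_b):
    """Run a two-sided paired-sample t-test on matched scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Paired samples must have the same length.")
    result = stats.ttest_rel(scores_a, scores_b)
    return float(result.statistic), float(result.pvalue)


t_stat, p_value = paired_t_test(acc_tree, acc_other)
print(f"t = {t_stat:.3f}, p = {p_value:.5f}")
```

A positive t statistic here means the first algorithm scored higher on average across the matched datasets. Swapping the argument order flips the sign of t and leaves the two-sided p-value unchanged.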

Comment 5: The study splits the data into training and test sets using an 80:20 ratio and performs five-fold cross-validation during model development. While cross-validation helps assess the model performance, there is no external validation using independent datasets.

Our response: We acknowledge your feedback. In this revised edition, we have incorporated external validation. To achieve this, we applied the five ML algorithms considered in our study to the test dataset, which was not previously encountered during the training phase. The results we obtained were consistent with those from the training phase (Table 1). Additionally, we have introduced a new table (Table 4 on page 12) illustrating specific outcomes from the new t-tests, focusing solely on the accuracy measure. The other three measures (precision, recall and F1 score) showed similar superiority for the tree-based ML algorithms. For further details, please refer to lines 306-310 on page 12.
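The workflow described in this response (an 80:20 split, five-fold cross-validation on the training portion, then evaluation on the held-out test portion) could be sketched with Scikit-learn roughly as follows. This is an illustrative sketch under stated assumptions, not the study's code: the data are synthetic, and the fixed `random_state` values, the stratified split and `max_iter=1000` for logistic regression are choices made here for reproducibility and convergence rather than settings reported in the manuscript.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one tabular dataset (not one of the 200 study datasets).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 80:20 split; the 20% test portion is not used during model development.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The five algorithm families compared in the study, mostly with default settings.
models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "LR": LogisticRegression(max_iter=1000),  # assumption: raised from the default
    "KNN": KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    # Five-fold cross-validated accuracy on the training portion only.
    cv_acc = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()
    # Refit on the full training portion, then score once on the held-out test set.
    test_acc = model.fit(X_train, y_train).score(X_test, y_test)
    results[name] = (cv_acc, test_acc)
    print(f"{name}: CV accuracy = {cv_acc:.3f}, test accuracy = {test_acc:.3f}")
```

Repeating this loop over many datasets would yield one cross-validation score and one test score per algorithm per dataset, which is the paired structure that a paired-sample t-test requires.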

Comment 6: The study focuses solely on classical tree-based and non-tree-based supervised ML algorithms, neglecting other essential techniques such as deep learning algorithms or unsupervised learning algorithms.

Our response: Thank you for highlighting the exclusive focus of our study on classical tree-based and non-tree-based supervised machine learning algorithms. The decision to concentrate on these algorithms was deliberate, rooted in our research's specific scope and objectives, which aimed to investigate and compare the effectiveness of traditional ML approaches in our domain. While we acknowledge the potential of deep learning, ensemble ML algorithms and unsupervised ML algorithms in advancing the field, we intentionally leave them as a potential future scope. Please see lines 382-387 on pages 14-15 for more information.

Comment 7:

Measurement metrics (i.e., accuracy, recall, etc.) are well-known and have been used in previous biomedical studies, such as in PMID: 36642410 and PMID: 28155651. Therefore, the authors are suggested to refer to more works in this description to attract a broader readership.

Our response: Thank you for suggesting additional seminal works on measurement metrics. We included references to critical studies to underscore the relevance of our chosen metrics in biomedical research and broaden the manuscript's appeal. Please see line 209 on page 7 for details.

Comment 8: The study does not consider ensemble approaches beyond Random Forest (RF), such as AdaBoost or XGBoost, which have significantly improved performance in various classification tasks.

Our response: We appreciate this comment. The second reviewer also made a similar comment (R2C1). In the revised manuscript, we have discussed this. We also mentioned that deep learning could have a chance of showing such superior performance. We leave them as potential future research scopes in alignment with our study. Please see lines 359-360 on page 13 for details. We also added that further studies could delve into these methods. Please see lines 382 – 387 for details.

Comment 9: While the study demonstrates the superiority of tree-based ML algorithms, it fails to explore the underlying reasons behind this dominance.

Our response: We have discussed the possible underlying reasons behind the superiority of tree-based ML algorithms compared to their counterparts. The superior performance of tree-based ML algorithms compared to their counterparts can be attributed to various factors. A prominent reason is their capability to effectively map non-linear relations, providing excellent prediction accuracy and greater stability, especially compared to linear models [3]. Additionally, these algorithms exhibit better categorical and numerical data accommodation than other models [4]. Described as a set of if-else statements, tree-based ML algorithms excel at incorporating non-linear and categorical data during the learning process, contributing to enhanced predictive accuracy. Please see lines 326-331 on page 13 for our detailed discussion.

Comment 10: The study suggests that tree-based algorithms consistently outperform non-tree-based algorithms across all datasets without considering potential dataset-specific factors that may influence algorithm performance.

Our response: Thank you for pointing out the importance of considering dataset-specific factors in evaluating algorithm performance. Acknowledging this, we have updated our discussion to more carefully examine how each dataset's unique attributes could influence the effectiveness of tree-based versus non-tree-based algorithms. Please refer to lines 375-379 in the revised manuscript for an expanded analysis, ensuring a more nuanced and balanced comparison.

Comment 11: While the study briefly mentions future research opportunities, such as exploring ensemble tree-based algorithms and investigating the underlying reasons for algorithmic performance, it lacks depth in discussing specific research avenues and methodologies.

Our response: We acknowledge the reviewer’s feedback on the need for a more detailed exploration of future research directions within our study. Future work will explore ensemble tree-based algorithms and the reasons behind varying algorithmic performances, employing comparative analyses and feature importance studies for deeper insights. We also added further studies on these topics. For further details, please see lines 382 – 387 on pages 14-15.

Reviewer: 02

Comment 1: The study focuses on classical algorithms and may not include recent advancements in machine learning, such as deep learning techniques that have shown promise in handling tabular data. It is highly recommended to include these for a more contemporary perspective.

Our response: Thank you for this suggestion. The first reviewer also made a similar comment (R1C8). In the revised manuscript, we have discussed this. We also mentioned that ensemble approaches could show such superior performance. We leave both as potential directions for future research in alignment with our study. Please see lines 359-360 on page 13 for details. We also added that further studies could delve into these methods. Please see lines 382 – 387 for more information.

Comment 2: The abstract does not highlight the novelty of the proposed work. It is better to add more specific details to your work.

Our response: We have revised the abstract considering our study aims and objectives. Please see page 2 for details.

Comment 3:

The introduction is not focused, and the literature can be reorganised to strengthen the literature review following contributions and discuss a few relevant works, i.e.,

(a) A Benchmark Dataset and Learning High-level Semantic Embeddings of Multimedia for Cross-media Retrieval

(b) Unsupervised pre-trained filter learning approach for efficient convolution neural network

(c) CSFL: A novel unsupervised convolution neural network approach for visual pattern classification

(d) Optimisation of CNN through novel training strategy for visual classification problems

(e) Face recognition: A novel un-supervised convolutional neural network method

(f) ModPSO-CNN: an evolutionary convolution neural network with application to visual recognition

(g) Two-stage domain adaptation for infrared ship target segmentation

Our response: Thank you for your constructive feedback. We have focused our introduction, reorganised the literature review to highlight significant contributions in CNN and machine learning advancements, and meticulously incorporated the suggested references, enhancing the clarity and depth of our study. Please see lines 54 – 56 on page 3.

Comment 4: The work does not delve deeply into the impact of feature engineering and data preprocessing steps, which are crucial for the performance of machine learning algorithms. Add a detailed discussion on it.

Our response: Thank you for raising these issues related to feature engineering and data preprocessing. In this revised version, we briefly outlined the preprocessing steps followed for data analysis in this study. Please see lines 238-243 on page 8 for more information. To promote reproducibility, we adhered to Scikit-learn default parameters for all algorithms, ensuring a transparent and standardised experimental framework.

While our findings suggest tree-based algorithms outperform non-tree-based ones across multiple datasets, we recognise the importance of considering dataset-specific characteristics, such as feature distribution and complexity, that could influence algorithm performance. Uddin and Lu [5] discovered that ML algorithms exhibit varying performances when applied to datasets with distinct meta-level and statistical attributes.

Comment 5: While the proposed work effectively compares tree-based algorithms with non-tree-based counterparts, it might lack a deeper analysis of why certain algorithms perform better than others. A more thorough investigation into the intrinsic properties of the datasets that favour tree-based methods is needed.

Our response: We appreciate your comment, which echoed the tenth comment from the first reviewer (R1C10), highlighting the necessity of acknowledging dataset-specific factors when assessing algorithm performance. In response, we have refined our discussion to more thoroughly explore the impact of each dataset's distinct characteristics on the performance of tree-based versus non-tree-based algorithms. For a detailed expansion of this analysis, which aims to offer a more nuanced and balanced comparison, please see lines 368-372 in the revised manuscript.

Reviewer: 03

Comment 1:

The paper is not scientifically sound to be published in this form.

Our response: We believe that the comments from the other three reviewers and our corresponding responses have significantly improved the scientific merit of this study. Incorporating these changes would make the revised manuscript scientifically rigorous for publication. Please see the revised manuscript for our responses concerning the comments made by the first, second and fourth reviewers.

Reviewer: 04

Comment 1:

Abstract and Introduction

-"no study has shown such supremacy through a statistical significance test." and "However, none shows such supremacy by employing any statistical significance comparison, such as a t-test." It's not true; below I can indicate an example that used statistics to compare the accuracy of machine learning algorithms, and it is possible that others have proceeded similarly. I suggest the authors rewrite the sentence and indicate that it is not usual to find statistical comparisons between the classification performance of machine learning algorithms.

Farias, F. M., Salomão, R. C., Rocha Santos, E. G., Sousa Caires, A., Sampaio, G. S. A., Rosa, A. A. M., Costa, M. F., & Silva Souza, G. (2023). Sex-related difference in the retinal structure of young adults: a machine learning approach. Frontiers in medicine, 10, 1275308. https://doi.org/10.3389/fmed.2023.1275308.

Our response: We appreciate you pointing out this issue. We have reviewed the mentioned article, which primarily relied on descriptive statistics for ML performance comparison. We also found that many studies in the current literature empirically demonstrated the superiority of tree-based ML algorithms. They primarily used one or more datasets for descriptive statistical comparisons [e.g., 18]. Yet, employing statistical significance comparisons like t-tests to demonstrate such supremacy is not widespread. Please see lines 123-125 on page 5 for more information.
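For readers unfamiliar with the test, a paired-sample t-test (the test named in the article's abstract) treats each dataset as one pair of scores, one per algorithm, and asks whether the mean per-dataset difference is zero. A minimal sketch follows, using only the Python standard library. The accuracy values are invented for illustration and are not results from this study, and `paired_t_statistic` is a hypothetical helper name.

```python
import math
import statistics

# Hypothetical accuracies of two algorithms on the same six datasets.
# These numbers are made up for illustration; they are not study results.
rf_acc = [0.91, 0.84, 0.78, 0.95, 0.88, 0.81]
lr_acc = [0.87, 0.80, 0.79, 0.90, 0.85, 0.76]


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for a paired-sample t-test of a vs b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


t_stat, dof = paired_t_statistic(rf_acc, lr_acc)
print(f"t = {t_stat:.3f}, df = {dof}")
```

Converting `t` into a p-value requires the Student t distribution with `df` degrees of freedom; SciPy's `scipy.stats.ttest_rel` performs both steps in one call.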

Comment 2:

Methods

-Figure 1: Use a dot instead of a comma for decimal numbers. Include the label name for the X-axis.

-It would be important to provide more information about the type of data used. Time series for subsequent feature extraction? Was feature extraction performed? If yes, how many and which ones were extracted? Were they the same for all comparisons? How many groups were used in different datasets?

-Why was the t-test chosen over an analysis of variance? I think it would be more appropriate to use an analysis of variance or Kruskal-Wallis or perform a Bonferroni correction for the t-test results.

-I suggest performing at least a 10-fold cross-validation.

-Was there data preprocessing? Any normalisation? I think it would be important.

-Does it make sense to compare the performance of random forest and decision tree?

Our response: Please see below for our responses to each point.

- Figure 1: We used a comma to separate each raw value from its corresponding percentage. We have updated this figure in this revised submission and now use brackets instead of a comma. We have also updated the figure caption accordingly. Please see page 5 for more details.

- Our datasets are from a wide range of contexts. They have attributes ranging from 2 to 2,548. All these datasets have two groups for the target variable.

- Since we have only two groups for all datasets, we considered the independent sample t-test. ANOVA or Kruskal-Wallis is more suitable for datasets with more than two groups [6].

- We explored the size distribution of all 200 datasets before settling on 5-fold cross-validation. Some datasets are small, so 10-fold cross-validation could produce unreliable results.

- The second reviewer also raised this point (R2C4). In this revised version, we briefly outlined the preprocessing steps followed for data analysis in this study. Please see lines 238-243 on page 8 for more information. To promote reproducibility, we adhered to Scikit-learn default parameters for all algorithms, ensuring a transparent and standardised experimental framework. While our findings suggest tree-based algorithms outperform non-tree-based ones across multiple datasets, we recognise the importance of considering dataset-specific characteristics, such as feature distribution and complexity, that could influence algorithm performance. Uddin and Lu [5] discovered that ML algorithms exhibit varying performances when applied to datasets with distinct meta-level and statistical attributes.

- We considered all classic supervised ML algorithms. Although RF is an ensemble approach based on DT, we considered it in our study in alignment with numerous studies in the literature.
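The fold-size trade-off behind the cross-validation choice discussed above can be illustrated with a short sketch. It assumes the common convention of spreading any remainder over the first folds, and the dataset size of 48 is a hypothetical example, not one of the 200 datasets.

```python
def fold_sizes(n_samples, k):
    """Sizes of k test folds when n_samples are split as evenly as possible."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    # The first `extra` folds each take one additional sample.
    return [base + 1 if i < extra else base for i in range(k)]


n = 48  # hypothetical small dataset
five = fold_sizes(n, 5)
ten = fold_sizes(n, 10)
print("5-fold test sizes: ", five)
print("10-fold test sizes:", ten)
```

With 10 folds, each test fold here holds only four or five samples, so a single misclassification shifts that fold's accuracy by 20 to 25 percentage points; with 5 folds the shift is roughly 10 to 11 points.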

Comment 3:

Results

- Indicate the standard deviation of the mean values in Table 1 and Table 3.

- Table 3 shows an accuracy of 1. Does it imply overfitting? Or do the groups exhibit very large differences, leading to easier classification? This debate could be discussed in the Discussion section

Our response: All relevant tables, including Tables 1 and 3, have been updated with the standard deviation values. We have also updated Supplementary Table 1 accordingly. Please see the respective tables for details. We have discussed the presence of such high accuracy in lines 339-347 on page 14.

Reference

1. Wei, R., Wang, J., Su, M., Jia, E., Chen, S., Chen, T., and Ni, Y., Missing value imputation approach for mass spectrometry-based metabolomics data. Scientific reports, 2018. 8(1): p. 663.

2. Ishaq, A., Sadiq, S., Umer, M., Ullah, S., Mirjalili, S., Rupapara, V., and Nappi, M., Improving the prediction of heart failure patients’ survival using SMOTE and effective data mining techniques. IEEE access, 2021. 9: p. 39707-39716.

3. Dumitrescu, E., Hué, S., Hurlin, C., and Tokpavi, S., Machine learning for credit scoring: Improving logistic regression with non-linear decision-tree effects. European Journal of Operational Research, 2022. 297(3): p. 1178-1192.

4. Song, Y.-Y. and Ying, L., Decision tree methods: applications for classification and prediction. Shanghai archives of psychiatry, 2015. 27(2): p. 130.

5. Uddin, S. and Lu, H., Dataset meta-level and statistical features affect machine learning performance. Scientific Reports, 2024. 14(1): p. 1670.

6. Field, A., Discovering statistics using SPSS. 2013, London: Sage Publications Ltd.

Attachment

Submitted filename: Reviewer response letter (Confirmation PONE) v02.docx

pone.0301541.s003.docx (50KB, docx)

Decision Letter 1

Nagarajan Raju

18 Mar 2024

Confirming the statistically significant superiority of tree-based machine learning algorithms over their counterparts for tabular data

PONE-D-24-03825R1

Dear Dr. Uddin,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager at http://www.editorialmanager.com/pone/ and clicking the ‘Update My Information’ link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Nagarajan Raju

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

Reviewer #4: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #4: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: Yes

Reviewer #4: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #4: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: My previous comments have been addressed, therefore, the manuscript can be accepted in this current form.

Reviewer #2: All my comments have been successfully answered. Please take a good look at the grammar and typos when submitting the final version of the manuscript.

Reviewer #4: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #4: No

**********

Acceptance letter

Nagarajan Raju

3 Apr 2024

PONE-D-24-03825R1

PLOS ONE

Dear Dr. Uddin,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Nagarajan Raju

Academic Editor

PLOS ONE

Associated Data


    Supplementary Materials

    S1 Table. Paired-sample t-test results for the precision, recall and F1 score measures between tree-based and non-tree-based trained supervised machine learning algorithms for the datasets from disease prediction (66) and university-ranking contexts (50).

    (DOCX)

    pone.0301541.s001.docx (34.6KB, docx)
    S2 Table. Dataset source information.

    (DOCX)

    pone.0301541.s002.docx (47.2KB, docx)
    Attachment

    Submitted filename: Reviewer response letter (Confirmation PONE) v02.docx

    pone.0301541.s003.docx (50KB, docx)

    Data Availability Statement

    The 200 datasets used in this study are publicly available from open-source repositories. S2 Table contains the web address of each dataset.


    Articles from PLOS ONE are provided here courtesy of PLOS
