Scientific Reports. 2025 Oct 1;15:34218. doi: 10.1038/s41598-025-15971-0

Comparative analysis of algorithmic approaches in ensemble learning: bagging vs. boosting

Hongke Zhao 1,2, Wenhui Liu 1,2, Yaxian Wang 1,2, Likang Wu 1,2
PMCID: PMC12488980  PMID: 41034276

Abstract

Ensemble learning is widely applied in various real-world settings, with Bagging and Boosting being two core algorithms. Although these techniques have been extensively investigated through experimental comparisons of their performance in various scenarios, few studies have analyzed and quantified their benefits, costs, and complexities to support algorithm-aware decision making. In this study, we develop a theoretical model to compare Bagging and Boosting in terms of performance, computational costs, and ensemble complexity, and validate it through experiments on four datasets (MNIST, CIFAR-10, CIFAR-100, IMDB) with varying data complexity and computational environments. The results show that, for MNIST, as ensemble complexity increases (e.g., from 20 to 200), Bagging’s performance improves from 0.932 to 0.933 before plateauing, while Boosting improves from 0.930 to 0.961 before showing signs of overfitting. At the same ensemble complexity, such as 200 base learners, Boosting requires approximately 14 times more computational time than Bagging, indicating substantially higher computational costs. Similar patterns are observed across the other three datasets, confirming the generality of our findings and revealing consistent trade-offs between performance and computational costs. Taken together, these results confirm the robustness of our theoretical predictions and provide a foundation for practical guidance. Specifically, decision-makers prioritizing cost-efficiency may prefer Bagging, whereas those focusing on maximizing performance might find Boosting more beneficial. For simpler datasets on average-performing devices, Boosting can be effective, whereas Bagging is more suitable for complex datasets on high-performing devices. Overall, this study contributes by integrating analytical modeling with empirical validation across multiple datasets to provide theoretical insights and practical guidance. It systematically compares Bagging and Boosting in terms of performance, computational costs, and ensemble complexity, thereby enabling practitioners to choose the most appropriate method under varying data complexities, performance needs, and resource constraints.

Subject terms: Computer science, Information technology

Introduction

In the rapidly evolving field of data-driven machine learning, ensemble learning has become a key methodology for improving predictive accuracy and model robustness1,2. Its applicability spans diverse fields, including authorship identification3, healthcare4,5, engineering tasks6, and solutions for class imbalance problems7,8. Its effectiveness is further highlighted by its dominance in major competitions such as the KDD Cup9. Among ensemble techniques, Bagging and Boosting are two foundational approaches. Bagging reduces variance and overfitting by training diverse models on bootstrapped subsets of data and aggregating predictions, typically via majority voting10,11. It performs well on high-dimensional datasets but can be computationally intensive and less interpretable. In contrast, Boosting iteratively corrects errors by assigning higher weights to misclassified instances, thereby reducing bias and offering more interpretable models12,13. However, it is prone to overfitting and also computationally demanding due to its sequential nature.
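To ground these two mechanisms, the following toy sketch (our own illustration, not code from the paper) builds both ensembles by hand: bootstrap resampling with majority voting for Bagging, and AdaBoost-style sample reweighting for Boosting.

```python
# Hand-rolled sketch (ours, not the paper's) of the two mechanisms: Bagging
# trains independent trees on bootstrap resamples and majority-votes; Boosting
# (AdaBoost-style) reweights misclassified points before fitting the next tree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)
m = 25  # ensemble complexity: number of base learners

# Bagging: independent learners, aggregate by majority vote.
votes = np.zeros((m, len(y)))
for i in range(m):
    idx = rng.integers(0, len(y), len(y))          # bootstrap resample
    tree = DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx])
    votes[i] = tree.predict(X)
bag_pred = (votes.mean(axis=0) > 0.5).astype(int)  # majority vote

# Boosting: sequential learners; errors receive higher sample weights.
w = np.full(len(y), 1 / len(y))
score = np.zeros(len(y))
for i in range(m):
    tree = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = tree.predict(X)
    err = w[pred != y].sum()
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # learner weight
    w *= np.exp(alpha * (pred != y) - alpha * (pred == y))
    w /= w.sum()                                    # renormalize weights
    score += alpha * (2 * pred - 1)                 # weighted vote in {-1, +1}
boost_pred = (score > 0).astype(int)

print("train acc:", (bag_pred == y).mean(), (boost_pred == y).mean())
```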

While prior research has compared the performance of Bagging and Boosting across diverse datasets and applications14–16, there remains a significant gap in studies examining the trade-off between algorithmic effectiveness and implementation cost, especially from a managerial decision-making perspective. Much of the existing literature focuses on enhancing predictive accuracy or refining algorithmic designs15,17,18, or on theoretical insights such as variance reduction and overfitting control19. Some works have also proposed fine-tuning strategies for learners with low variance or bias20. However, limited attention has been given to how these methods perform under real-world resource constraints or within operational decision-making frameworks.

In many real-world applications, such as data science competitions (e.g., KDD Cup, Kaggle), efficiency in time, computational resources, and memory is crucial. Ensemble learning is widely adopted for its strong predictive performance, yet choosing between Bagging and Boosting remains a key decision that requires balancing model accuracy, implementation cost, and ensemble complexity. From a decision-making perspective, this choice is rarely straightforward. Under Herbert A. Simon’s theory of bounded rationality21, algorithm users cannot fully explore all possible alternatives or accurately predict outcomes, often leading to suboptimal decisions. While Bagging generally incurs lower computational costs than Boosting under similar conditions, Boosting typically achieves higher accuracy. Yet, the criteria guiding this selection remain poorly understood in practice. Currently, algorithm selection is often made without systematic guidance, especially given the growing number of available AI models. Traditional experimental approaches are costly and lack generalizability. In contrast, this study proposes a novel approach by comparing the performance and cost of Bagging and Boosting through theoretical modeling and analysis, offering both practical insights and strategic value for algorithm deployment in resource-constrained environments.

Specifically, this study explores the choice between the Bagging and Boosting algorithms from the perspective of algorithmic profit. In particular, we investigate the following research questions: (1) When time and computational resources are limited, should algorithm decision-makers choose Bagging or Boosting? (2) After selecting an algorithm, how should they determine the number of base learners? (3) How do the ensemble complexity, the preferences of the algorithm decision-makers, and the performance of datasets and equipment impact Bagging and Boosting? To answer these questions, we conduct a systematic comparison of Bagging and Boosting through both theoretical modeling and empirical validation. First, we develop a theoretical framework that models the performance, time cost, and computational cost of both algorithms. This enables us to analyze how ensemble complexity affects overall efficiency. Based on this model, we define algorithmic profit, a measure that incorporates decision-maker preferences, and derive the optimal profit and corresponding ensemble complexity for each algorithm, thereby identifying their most suitable application scenarios. Second, we validate our theoretical findings through experiments on publicly available datasets. The results largely align with our analytical conclusions, supporting the validity of the decision rules derived from our model. Furthermore, this consistency demonstrates the feasibility of using theoretical modeling to analyze algorithm utility, an innovative methodological approach in the field of ensemble learning.

Our analysis shows that the choice between Bagging and Boosting should depend on time and computational resource constraints. When these costs are relatively balanced, Boosting is preferred for its higher accuracy. However, when time efficiency is critical and computational resources are limited or costly, Bagging is the better option. Boosting generally requires more base learners than Bagging, especially with complex datasets or low-performance hardware. Ensemble complexity, decision-maker preferences, dataset characteristics, and computing resources all influence the performance and cost trade-offs of both methods. As ensemble complexity increases, so do performance, time cost, and computational demand. Once performance plateaus, Bagging outperforms Boosting on complex datasets, while Boosting performs better on simpler ones. Notably, Boosting’s time cost rises sharply with complexity, whereas Bagging’s remains nearly constant. Computational resource consumption grows quadratically for Boosting but only linearly for Bagging.

Our paper makes three main contributions. Firstly, it compares the Bagging and Boosting algorithms to reveal their theoretical foundations. Previous research has mainly focused on comparing the accuracy of these algorithms with different types of base learners in various scenarios; this study instead compares their performance and costs across different ensemble complexities. Secondly, it provides theoretical guidance for practitioners who need to choose the appropriate ensemble learning algorithm for specific machine-learning problems. Although ensemble learning algorithms are widely used, few works provide detailed theoretical guidance for practitioners from the perspective of balancing algorithm performance and cost. Thirdly, it offers a new research paradigm for comparing and selecting between different algorithms. This study defines the performance and costs of the Bagging and Boosting algorithms, compares the two from the perspective of algorithmic profit through modeling, and validates the theoretical conclusions through experiments on public datasets. This provides a new method for comparing other machine learning algorithms, which can reduce the experimental costs traditionally incurred in algorithm selection. It is also relevant to the structuring and modeling of software development and maintenance operations, and to the value of information in operational decision-making.

Model setting

An introduction to ensemble learning and the Related Work section can be found in the Supplementary Material. In this section, we present the model setting: we describe the problem and state the hypotheses of the model. Figure 1 provides an overview of the modeling framework and validation process, summarizing the key steps from problem formulation to hypothesis testing and result interpretation.

Fig. 1. Overview of the modeling framework and experimental validation process.

Problem description

We explore the practical challenges of decision-making in machine learning, particularly in the context of ensemble learning. Machine learning models are often used to inform decisions by predicting outcomes as a function of the choices made, with the advantage of capturing complex, nonlinear relationships present in real-world problems. While this can lead to improved predictive accuracy, the resulting complexity also increases the difficulty of selecting optimal decisions22. In this setting, decision-making involves optimizing an objective function shaped by model predictions, while also balancing performance, cost, and algorithmic complexity. This creates a multi-dimensional trade-off that decision-makers must carefully navigate.

In the context of ensemble learning, decision-makers face the challenge of balancing performance and cost in order to maximize profits. They must not only consider the predictive accuracy of the algorithm but also its complexity and associated costs. For example, an e-commerce platform improved profitability by adjusting recommendation outputs to prioritize high-margin products, even without gains in predictive accuracy23. Similarly, Booking.com applies uplift modeling to guide promotion decisions based on the estimated net effect, taking into account both potential gains and associated costs24. These cases highlight the practical need to optimize algorithmic profit, the net value generated after accounting for both performance and cost25.

In this paper, we define ensemble complexity as the number of base learners used in techniques such as Bagging and Boosting. We assume that decision-makers are rational and risk-neutral, and that they have access to a training set D, from which multiple subsets are generated using a specific sampling method, each corresponding to a base learner. The number of base learners, m, represents the complexity of the algorithm (hereafter referred to as complexity). The decision-maker aims to maximize algorithmic profit, defined as performance minus cost. This linear form is commonly used in decision analysis for its clarity and interpretability25. While non-linear utility functions could be considered in more complex settings26, we leave them for future research. Importantly, the relationships between complexity, performance, and cost may themselves be non-linear, which we explore in the following.

Relationship between algorithm performance and complexity

We hypothesize that the relationship between Bagging's performance and the complexity $m$ is a concave, monotonically increasing function $P_{\text{bag}}(m)$, reflecting stable but diminishing returns with increased ensemble size. For Boosting, we hypothesize an inverted-U relationship $P_{\text{boost}}(m)$, which increases rapidly at small $m$ ($P'_{\text{boost}}(m) > 0$) and is strictly concave ($P''_{\text{boost}}(m) < 0$), capturing rapid early gains and a performance decline due to overfitting at higher complexity. These assumptions are supported by prior theoretical and empirical findings. Foundational work has shown that Bagging reduces variance via bootstrapped resampling, leading to steady, diminishing performance gains as more base learners are added27. In contrast, Boosting is more sensitive to iteration count, achieving rapid early accuracy gains but experiencing degradation when overfitting occurs28. A comprehensive review confirms that Bagging generally shows monotonically increasing accuracy, while Boosting often follows an inverted-U performance curve as ensemble size grows19. These findings support the functional forms adopted in our model and provide a theoretical basis for analyzing the trade-off between complexity and performance.

Figure 2 illustrates the hypothesized relationships between algorithm performance (P) and ensemble complexity (m) for Bagging and Boosting algorithms.

Fig. 2. Hypothesis of algorithm performance vs. complexity.

Hypothesis 1

As the value of m increases, Bagging shows a relatively slow increase in accuracy, while the accuracy of Boosting increases rapidly but is prone to overfitting.
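To make Hypothesis 1 concrete, the two shapes can be visualized with illustrative functional forms; the logarithmic and inverted-U expressions below are our own assumptions chosen for plotting, since the model requires only the qualitative behavior stated above.

```python
# Illustrative curves for Hypothesis 1. The specific forms (log for Bagging,
# inverted-U quadratic for Boosting) are assumptions for visualization only.
import matplotlib.pyplot as plt
import numpy as np

m = np.arange(1, 201)                          # ensemble complexity
p_bag = 0.85 + 0.015 * np.log(m)               # concave, diminishing returns
p_boost = 0.85 + 0.0018 * m - 0.000006 * m**2  # rapid rise, peak near m=150, then decline

plt.plot(m, p_bag, label="Bagging: slow, stable gains")
plt.plot(m, p_boost, label="Boosting: fast gains, then overfitting")
plt.xlabel("ensemble complexity m")
plt.ylabel("performance P(m)")
plt.legend()
plt.show()
```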

Relationship between algorithm cost and complexity

We assume that the total algorithmic cost consists of two components: time cost and computational resource cost. For Boosting, which operates sequentially, the time cost increases linearly with the number of base learners, modeled as $T_{\text{boost}}(m) = c_1 m$. For Bagging, which supports parallel execution, we assume the time cost remains constant regardless of ensemble size, modeled as $T_{\text{bag}}(m) = c_1$. Regarding computational resources, Boosting involves iterative reweighting, leading to a quadratic cost increase $R_{\text{boost}}(m) = c_2 m^2$, while Bagging, due to its independent learners, incurs only a linear computational cost $R_{\text{bag}}(m) = c_2 m$, where $c_1$ and $c_2$ are positive constants.

These assumptions are grounded in prior theoretical and empirical findings. Research shows that Bagging benefits from parallelization, resulting in stable training time regardless of ensemble size, whereas Boosting’s sequential learning process leads to a linear increase in time cost as the number of base learners grows10. Furthermore, Boosting’s iterative reweighting process introduces nonlinear computational complexity, while Bagging maintains relatively lower and more predictable computational demands19,29. We further validate these assumptions through controlled experiments under various datasets, parameter settings, and model configurations. The results consistently confirm the robustness of our cost assumptions across different environments.

Hypothesis 2a

With the increase of m, Boosting’s time cost increases linearly, while Bagging’s time cost remains relatively constant.

Hypothesis 2b

With an increase in m, Boosting incurs a quadratic increase in computational resource cost, while Bagging exhibits a linear increase.

Model analysis

In this section, we first define the profit of Bagging and derive its optimal profit and optimal complexity. We then do the same for Boosting. Finally, we conduct a parameter sensitivity analysis and compare the optimal profit and optimal complexity of Bagging and Boosting.

Bagging

Consider a scenario where the decision-makers opt for the Bagging algorithm to achieve the goal of maximizing profits. It is assumed that the weight $\lambda \in [0, 1]$ determines how much unit performance affects profits, while $1 - \lambda$ reflects the influence of unit cost on profits. Based on these hypotheses, the optimization problem for decision-makers can be defined as follows:

$$\max_{m}\ \pi_{\text{bag}}(m) = \lambda\, P_{\text{bag}}(m) - (1-\lambda)\,\big[T_{\text{bag}}(m) + R_{\text{bag}}(m)\big] \qquad (1)$$

Based on Hypotheses 1 and 2, the objective can be further developed as:

$$\max_{m}\ \pi_{\text{bag}}(m) = \lambda\, P_{\text{bag}}(m) - (1-\lambda)\,(c_1 + c_2 m) \qquad (2)$$

Proposition 1

In the context of the Bagging algorithm, it can be established that an optimal solution exists; the optimal complexity $m^{*}_{\text{bag}}$ is characterized by the first-order condition

$$\lambda\, P'_{\text{bag}}(m^{*}_{\text{bag}}) = (1-\lambda)\, c_2 \qquad (3)$$

The optimal profit is then:

$$\pi^{*}_{\text{bag}} = \lambda\, P_{\text{bag}}(m^{*}_{\text{bag}}) - (1-\lambda)\,(c_1 + c_2\, m^{*}_{\text{bag}}) \qquad (4)$$

The proof process is as follows. The first derivative of the profit function is $\pi'_{\text{bag}}(m) = \lambda P'_{\text{bag}}(m) - (1-\lambda) c_2$. The second derivative is $\pi''_{\text{bag}}(m) = \lambda P''_{\text{bag}}(m)$. Since $P_{\text{bag}}$ is concave, $\pi''_{\text{bag}}(m) < 0$, so the profit function is concave, indicating a maximum. Setting the first derivative to zero yields the optimal complexity $m^{*}_{\text{bag}}$ of Eq. (3); substituting $m^{*}_{\text{bag}}$ into $\pi_{\text{bag}}(m)$ gives the optimal profit of Eq. (4).
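As an illustrative special case (the concrete functional form is our assumption, not one stated in the paper), take $P_{\text{bag}}(m) = a \ln m$ with $a > 0$. Then

$$\pi_{\text{bag}}(m) = \lambda a \ln m - (1-\lambda)(c_1 + c_2 m), \qquad \pi'_{\text{bag}}(m) = \frac{\lambda a}{m} - (1-\lambda) c_2 = 0 \;\Rightarrow\; m^{*}_{\text{bag}} = \frac{\lambda a}{(1-\lambda)\, c_2},$$

and $\pi''_{\text{bag}}(m) = -\lambda a / m^{2} < 0$ confirms concavity. In this special case the optimal complexity rises with $\lambda$ and falls with $c_2$, previewing Corollaries 1 and 4.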

Boosting

Imagine a situation where the decision-makers choose the Boosting algorithm to maximize profits. As before, $\lambda$ represents the extent to which unit performance affects profits, while $1 - \lambda$ represents the degree of impact of unit cost on profits. At this point, the optimization problem for decision-makers can be formulated as follows:

$$\max_{m}\ \pi_{\text{boost}}(m) = \lambda\, P_{\text{boost}}(m) - (1-\lambda)\,\big[T_{\text{boost}}(m) + R_{\text{boost}}(m)\big] \qquad (5)$$

Substituting Hypotheses 1 and 2, we obtain

$$\max_{m}\ \pi_{\text{boost}}(m) = \lambda\, P_{\text{boost}}(m) - (1-\lambda)\,(c_1 m + c_2 m^2) \qquad (6)$$

Proposition 2

Under the Boosting algorithm, it can be shown that an optimal solution exists; the optimal complexity $m^{*}_{\text{boost}}$ is characterized by the first-order condition

$$\lambda\, P'_{\text{boost}}(m^{*}_{\text{boost}}) = (1-\lambda)\,(c_1 + 2 c_2\, m^{*}_{\text{boost}}) \qquad (7)$$

The optimal profit is:

$$\pi^{*}_{\text{boost}} = \lambda\, P_{\text{boost}}(m^{*}_{\text{boost}}) - (1-\lambda)\,\big(c_1\, m^{*}_{\text{boost}} + c_2\, (m^{*}_{\text{boost}})^{2}\big) \qquad (8)$$

The proof process is as follows. The first derivative of the profit function is $\pi'_{\text{boost}}(m) = \lambda P'_{\text{boost}}(m) - (1-\lambda)(c_1 + 2 c_2 m)$. The second derivative is $\pi''_{\text{boost}}(m) = \lambda P''_{\text{boost}}(m) - 2(1-\lambda) c_2$. Since $P''_{\text{boost}}(m) < 0$, we have $\pi''_{\text{boost}}(m) < 0$, so the profit function is concave. Setting the first derivative to zero yields the optimal complexity $m^{*}_{\text{boost}}$ of Eq. (7); substituting $m^{*}_{\text{boost}}$ into $\pi_{\text{boost}}(m)$ gives the optimal profit of Eq. (8).
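As a matching illustrative special case (again our assumption), take $P_{\text{boost}}(m) = a m - b m^{2}$ with $a, b > 0$. Then

$$\pi_{\text{boost}}(m) = \lambda (a m - b m^{2}) - (1-\lambda)(c_1 m + c_2 m^{2}) \;\Rightarrow\; m^{*}_{\text{boost}} = \frac{\lambda a - (1-\lambda) c_1}{2\,\big(\lambda b + (1-\lambda) c_2\big)},$$

which is positive only when $\lambda a > (1-\lambda) c_1$, i.e., when the time cost coefficient $c_1$ is relatively small; this previews the feasibility condition discussed in Corollary 6.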

Model analysis and comparison

Based on the propositions, we conduct a parameter sensitivity analysis and compare the optimal profit and optimal complexity of Bagging and Boosting. We derive six corollaries and divide them into two categories: those related to the performance-weight parameter $\lambda$, and those related to the cost parameters $c_1$ and $c_2$. The detailed analysis and corollaries are as follows.

Corollaries of $\lambda$

Corollary 1

The impact of $\lambda$ on the optimal profit and optimal complexity under the two algorithms:

  1. Under the Bagging algorithm, the optimal complexity increases as $\lambda$ increases, and the optimal profit increases with an increase in $\lambda$.

  2. Under the Boosting algorithm, the optimal complexity increases as $\lambda$ increases, while the optimal profit initially decreases and then increases with an increase in $\lambda$.

In the Bagging algorithm, increasing the value of the parameter $\lambda$ leads to higher complexity and profit. This is because Bagging uses bootstrapping and aggregation to improve performance, which pays off more as $\lambda$ increases. In Boosting, however, the relationship between $\lambda$ and profit is more complex. Initially, increasing $\lambda$ may cause a drop in profit due to overfitting on challenging instances; but as $\lambda$ continues to grow, Boosting promotes a stronger combination of weak learners, leading to increased profit. This highlights the importance of finding the right balance between performance and cost in ensemble methods, with $\lambda$ playing a crucial role in determining the optimal trade-offs. Figure 3(a),(b) shows the sensitivity analysis of $\lambda$ under the Boosting algorithm.

Fig. 3. Sensitivity analysis under the Boosting algorithm with the key parameter $\lambda$ (a,b) and parameters $c_1$ and $c_2$ (c,d).

Corollary 2

The impact of $\lambda$ on the optimal profit comparison under the two algorithms:

  1. The optimal profit of the Bagging algorithm can take both positive and negative values; when $\lambda$ is relatively small, the profit is negative. The optimal profit of Boosting consistently remains positive.

  2. Regarding algorithm selection, when the value of $\lambda$ is relatively small, Boosting is the superior choice. However, as the value of $\lambda$ increases, Bagging becomes the more advantageous option.

Bagging's optimal profit is negative when $\lambda$ is relatively small, while Boosting's optimal profit consistently remains positive. Remarkably, the negative values of Bagging's optimal profit do not affect the final decision-making process. Consequently, when $\lambda$ is relatively small, Boosting is the more optimal choice; conversely, as $\lambda$ increases in magnitude, Bagging proves to be the superior selection. In conclusion, Boosting is more broadly applicable, as it ensures a positive optimal profit across a larger range of $\lambda$. This finding tells us that Boosting is a more favorable choice particularly in scenarios with relatively small $\lambda$, while Bagging becomes advantageous in situations with relatively large $\lambda$. Figure 4(a–c) illustrates the profit comparison between Bagging and Boosting under variation in $\lambda$.

Fig. 4. Profit comparison (a–c) and complexity comparison (d–f) with the key parameter $\lambda$.

Corollary 3

The impact of $\lambda$ on the optimal complexity comparison under the two algorithms:

  1. When $\lambda$ is relatively large, the optimal complexity of both Bagging and Boosting is positive. When $\lambda$ is relatively small, the optimal complexity is not meaningful.

  2. Regarding algorithm selection, Boosting has a higher optimal complexity when $\lambda$ is relatively small. Conversely, as $\lambda$ increases in magnitude, Bagging has a higher optimal complexity.

This corollary illustrates the nuanced impact of $\lambda$ on the optimal complexity under the Bagging and Boosting algorithms. The parameter $\lambda$ can be interpreted as the performance preference of the decision-makers. When $\lambda$ is relatively small, Boosting tends to exhibit more complex models due to its iterative error-correction mechanism, even when performance is not the primary focus. Conversely, as $\lambda$ increases, Bagging's optimal complexity surpasses that of Boosting, as it aggregates more models to enhance performance. This delineates how $\lambda$ affects the complexity adaptation of these algorithms: Boosting is inherently more complex at lower performance thresholds, and Bagging becomes more complex when there is more emphasis on performance. Figure 4(d–f) shows a comparison of optimal complexity between Bagging and Boosting with varying values of $\lambda$.

Corollaries of $c_1$ and $c_2$

Corollary 4

The impact of $c_1$ and $c_2$ on the optimal profit and optimal complexity under the two algorithms:

  1. Under the Bagging algorithm, as the cost coefficient $c_2$ increases, the optimal complexity decreases. The optimal profit shows a quadratic variation (first decreasing, then increasing) with an increase in $c_2$, while it decreases with an increase in $c_1$.

  2. Under the Boosting algorithm, the optimal complexity and the optimal profit both decrease as the cost coefficients $c_1$ and $c_2$ increase.

This corollary elucidates how the time cost coefficient $c_1$ and the computing resource cost coefficient $c_2$ impact the optimal complexity and optimal profit of the Bagging and Boosting algorithms. In Bagging, a higher value of $c_2$ leads to reduced complexity, balancing resource expenditure. Interestingly, Bagging's optimal profit first declines with rising $c_2$ and then increases, suggesting an adaptive response to cost pressures, whereas an increase in $c_1$ reduces the optimal profit due to longer training durations. In Boosting, both the optimal complexity and the optimal profit diminish with greater $c_1$ and $c_2$, reflecting Boosting's vulnerability to both time and resource costs. This highlights the strategic interplay between cost and ensemble complexity in ensemble learning. Figure 3(c),(d) displays the parameter sensitivity analysis of $c_1$ and $c_2$ under the Boosting algorithm.

Corollary 5

The impact of $c_1$ and $c_2$ on the optimal profit comparison under the two algorithms:

  1. The optimal profit under Bagging may be either positive or negative. The optimal profit under Boosting is always positive.

  2. In the context of algorithm selection, choosing the Boosting algorithm is better when there is a relatively small difference between $c_1$ and $c_2$. The preference for the Bagging algorithm emerges predominantly in scenarios where $c_1$ is relatively small and $c_2$ is significantly high.

When comparing the two algorithms with respect to the parameters $c_1$ and $c_2$, we found some interesting insights. The Bagging algorithm generates variable profits that can be positive or negative depending on the specific conditions or parameter values. The Boosting algorithm, on the other hand, consistently yields positive profits, indicating a more stable and reliable outcome across the varying parameters. We also observed that the Boosting algorithm is generally preferred for its wider applicability and benefits, whereas the Bagging algorithm remains useful in cases where the time cost coefficient $c_1$ is relatively low and $c_2$ is relatively high, highlighting the importance of considering these key parameters in algorithm selection. Figure 5(a–c) shows the optimal profit comparison between the Bagging and Boosting algorithms with varying $c_1$ and $c_2$ values.

Fig. 5. Profit comparison (a–c) and complexity comparison (d–f) with the key parameters $c_1$ and $c_2$.

Corollary 6

The impact of $c_1$ and $c_2$ on the optimal complexity comparison under the two algorithms:

  1. Bagging's optimal complexity is meaningful only when $c_2$ is relatively small; Boosting's optimal complexity is meaningful only when $c_1$ is relatively small.

  2. When $c_2$ is relatively high and $c_1$ is relatively small, the optimal complexity of Boosting is greater than that of Bagging. When $c_2$ is relatively small, the optimal complexity of Bagging is greater than that of Boosting.

Corollary 6 explores how the two critical parameters, namely the time cost coefficient $c_1$ and the computing resource cost coefficient $c_2$, impact the optimal number of base learners in Bagging and Boosting. The optimal complexity, influenced by these parameters, can fluctuate under both algorithms; however, Bagging's optimal complexity is meaningful only when $c_2$ is relatively small, and Boosting's only when $c_1$ is relatively small. Specifically, when $c_2$ is relatively small, Bagging tends to adopt a greater number of base learners irrespective of variations in $c_1$. This is likely due to Bagging's ability to parallelize training, which allows the number of base learners to grow without significantly impacting the overall time cost. In situations where $c_2$ is relatively high and $c_1$ is low, Boosting tends to employ a larger number of base learners. This preference may stem from Boosting's sequential approach to model improvement, which necessitates more judicious use of each learner. Figure 5(d–f) shows the optimal complexity comparison between Bagging and Boosting with variation in $c_1$ and $c_2$.

Experimental validation

We present the validation of hypotheses and then demonstrate the validation of corollaries. In the experiments, Bagging employed bootstrap sampling, while Boosting adopted weighted sampling. The source code for this paper can be found at: https://github.com/252820/Bagging-vs-Boosting.

Hypothesis validation

We validate the hypotheses on MNIST30, CIFAR-10, CIFAR-10031, and IMDB32. MNIST is used as a baseline due to its relatively simple structure, focused on handwritten digit recognition. In contrast, CIFAR-10 and CIFAR-100 contain more complex image data, offering a more challenging environment for evaluating model robustness and adaptability. The IMDB dataset, which consists of textual reviews labeled for sentiment polarity, allows us to assess the generalizability of ensemble methods beyond vision tasks. For both the Bagging and Boosting algorithms, we employ decision trees as base learners, owing to their versatility and interpretability across various learning tasks. The Bagging algorithm is implemented using the BaggingClassifier from the scikit-learn library, a widely adopted tool known for its efficiency and ease of use. For Boosting, we use the AdaBoostClassifier, which is renowned for its ability to enhance the performance of weak learners. To evaluate the ensemble methods, we focus on three key metrics: test set accuracy, training time, and the size of the generated pickle files. These metrics provide insight into model accuracy, computational efficiency, and scalability, essential factors for practical deployment.
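A minimal sketch of this measurement loop follows, under our assumptions about data loading and parameter values; the authoritative configuration is in the Supplementary and the linked repository.

```python
# Sketch of the measurement loop: decision-tree base learners, scikit-learn
# ensembles, and the three metrics (accuracy, training time, pickle size).
# Dataset loading and the depth/complexity grid here are assumptions.
import os
import pickle
import time

from sklearn.datasets import fetch_openml
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

for n in [20, 50, 100, 200]:                           # ensemble complexity m
    for name, Ensemble in [("bagging", BaggingClassifier),
                           ("boosting", AdaBoostClassifier)]:
        model = Ensemble(DecisionTreeClassifier(max_depth=10), n_estimators=n)
        t0 = time.time()
        model.fit(X_tr, y_tr)
        train_time = time.time() - t0                  # time cost (Hypothesis 2a)
        path = f"{name}_{n}.pkl"
        with open(path, "wb") as f:
            pickle.dump(model, f)                      # model file for the size metric
        size = os.path.getsize(path)
        # time * pickle size is the computational-resource proxy (Hypothesis 2b)
        print(name, n, model.score(X_te, y_te), train_time, train_time * size)
```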

We conducted a series of experiments to rigorously evaluate the hypotheses of this study. To ensure robust and generalizable conclusions, we adopted a controlled experimental setup in which one variable was adjusted at a time. This included varying the complexity of the base learners and the random seed settings. These modifications ensured that our findings were not dependent on specific initial conditions or partitioning methods. Moreover, we tested our hypotheses across diverse datasets with different characteristics and levels of complexity, further validating the broad applicability of our results. Detailed descriptions of the experimental setup, including specific parameter configurations such as decision tree depths and random seed values used to ensure reproducibility, are provided in Table 1 of the Supplementary. To assess the statistical significance of performance differences across algorithms and datasets, paired t-tests were conducted on both accuracy metrics and computational costs. The results of these t-tests, reported in Table 2 of the Supplementary for each comparison, show that all p-values were below 0.05, confirming the statistical significance of the observed differences. For greater transparency, we also report the variance and standard deviation of both accuracy and computational time in Table 3 of the Supplementary.
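As a pointer for replication, such a paired comparison of matched runs can be carried out with SciPy; the accuracy arrays below are illustrative placeholders, not the measurements behind Table 2 of the Supplementary.

```python
# Hypothetical illustration of the paired t-tests reported above; the paired
# accuracy arrays are placeholders, not the paper's measurements.
from scipy import stats

acc_bagging  = [0.932, 0.933, 0.933, 0.934, 0.933]   # e.g., matched runs/seeds
acc_boosting = [0.930, 0.945, 0.953, 0.958, 0.961]

t_stat, p_value = stats.ttest_rel(acc_bagging, acc_boosting)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")        # difference significant if p < 0.05
```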

Hypothesis 1

Accuracy on the test set is the primary metric for evaluating Hypothesis 1, reflecting the algorithm’s generalization and predictive performance. It indicates how well the model performs on unseen data and is crucial for assessing overall algorithm effectiveness. In Fig. 6, we present the results for Hypothesis 1. Figure 6(a) shows the experimental outcomes for MNIST across three different depths. Figure 6(b) illustrates the overfitting behavior on MNIST with a decision tree depth of 10. Figure 6(c) displays the results for MNIST under three different random seeds. Figure 6(d–f) show the experimental results for CIFAR-10, CIFAR-100, and IMDB across varying depths, respectively.

Fig. 6. Validation of Hypothesis 1.

As shown in Fig. 6, Bagging's performance increases gradually with more base learners and eventually stabilizes, indicating a relatively slow gain in accuracy. In contrast, Boosting improves rapidly at first but is prone to overfitting, consistent with Hypothesis 1. At the same number of base learners, Boosting outperforms Bagging on MNIST. However, on CIFAR-10 and CIFAR-100, Bagging achieves higher accuracy than Boosting under the same conditions. Both algorithms perform less effectively on these datasets, likely due to their larger size and the limited capacity of the base learners.

Hypothesis 2a

For Hypothesis 2a, we analyze training time, a key consideration in real-world applications, as it reflects the computational efficiency and time investment required for effective model training. This metric provides insight into the algorithm's time cost and processing efficiency. In Fig. 7, we present the results for Hypothesis 2a. Figure 7(a–f) shows the experimental outcomes across the MNIST, CIFAR-10, CIFAR-100, and IMDB datasets under varying depths and random seeds.

Fig. 7. Validation of Hypothesis 2a (a–f) and Hypothesis 2b (g–l).

From the analysis of Fig. 7(a–f), we observe that as the number of base learners increases, Bagging’s training time remains nearly constant, approximating a horizontal line. In contrast, Boosting’s training time increases linearly, supporting our Hypothesis 2a.

Hypothesis 2b

Regarding Hypothesis 2b, we examine the product of training time and model file size (i.e., the size of the generated pickle files). This metric captures the relationship between training efficiency and model storage requirements, which is essential for deploying models in environments with limited computational resources. Figure 7 presents the results for Hypothesis 2b. Figure 7(g–l) show the experimental outcomes on MNIST, CIFAR-10, CIFAR-100, and IMDB under various settings, consistent with the previous subsection.

As shown in Fig. 7(g–l), when the number of base learners increases, Bagging’s computational cost rises linearly, whereas Boosting’s increases quadratically. This observation supports Hypothesis 2b. To assess the robustness of Hypothesis 2a and Hypothesis 2b across various model implementations, base learners, and hardware configurations, we conducted a series of experiments. The results, presented in Fig. 1 (Supplementary), confirm their resilience to algorithmic and environmental variations, with the visual trends in the curves further supporting this robustness.

Corollary validation

In the corollary validation section, we conducted experiments using three computing devices with different CPUs: 6240, 6320, and 2687. The datasets used were MNIST, CIFAR-10, CIFAR-100, and IMDB. The detailed validation is described as follows.

Validation of $\lambda$

Using Eqs. (1) and (5), we calculated the profits associated with different values of $\lambda$ for both the Bagging and Boosting algorithms. In this context, $P$ represents the standardized performance on the test set for various numbers of base learners, while $C$ denotes the combined metric of the standardized training duration and the product of training time and the size of the generated pickle file. To ensure comparability across different datasets and algorithms, the data were standardized using min-max normalization33. We considered values of $\lambda$ ranging from 0 to 1 and determined the maximum profit for each of them. These maximum profits were designated as the optimal profit for a given value of $\lambda$. Subsequently, we identified the optimal complexity associated with these optimal profits.
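A minimal sketch of this sweep follows, assuming placeholder measurement curves and our own choice of averaging the two standardized cost terms (the text states only that they are combined).

```python
# Sketch of the lambda sweep described above. The measured accuracy/time/size
# curves are placeholders, and combining the two standardized cost terms by
# averaging is our assumption; only the sweep logic mirrors the text.
import numpy as np

def minmax(x):
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

m_grid = np.arange(20, 201, 20)                        # ensemble complexities tried
accuracy = np.array([0.930, 0.940, 0.946, 0.950, 0.953,
                     0.956, 0.958, 0.959, 0.960, 0.961])   # placeholder
train_time = np.linspace(5.0, 70.0, 10)                # placeholder seconds
time_x_size = train_time * np.linspace(1.0, 10.0, 10)  # placeholder time * size

P = minmax(accuracy)                                   # standardized performance
C = 0.5 * (minmax(train_time) + minmax(time_x_size))   # combined standardized cost

for lam in np.linspace(0.0, 1.0, 11):
    profit = lam * P - (1.0 - lam) * C                 # empirical form of Eqs. (1)/(5)
    best = int(np.argmax(profit))
    print(f"lambda={lam:.1f}: optimal profit={profit[best]:.3f}, optimal m={m_grid[best]}")
```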

Figure 8 displays the corollary validation results for optimal profit and optimal complexity with respect to $\lambda$. Figure 8(a),(g) shows the average experimental results for the MNIST dataset on device 6240, based on three different random seeds. Similarly, Fig. 8(b),(h) presents the results for CIFAR-10 on device 6320, and Fig. 8(c),(i) displays the results for IMDB on the same device. Figure 8(d),(j) shows MNIST results on device 2687; the CIFAR-10 results on device 2687 are shown in Fig. 8(e),(k). The results for CIFAR-100 on device 2687 are presented in Fig. 2(b),(c) in the Supplementary. Figure 8(f) illustrates the difference in optimal profit between Boosting and Bagging, while Fig. 8(l) shows the corresponding difference in optimal complexity. The numbers in the legend represent the depth of the decision tree; e.g., Boosting_10 denotes the results of the Boosting algorithm with a max depth of 10 in the base learner. Avg3 in the legend represents the average of 3 experimental runs.

Fig. 8. Corollary validation of optimal profit regarding $\lambda$ (a–f) and optimal complexity regarding $\lambda$ (g–l).

As shown in Fig. 8(a–e), both Bagging and Boosting exhibit an increase in optimal profit as $\lambda$ increases. Similarly, Fig. 8(g–k) reveals a rising trend in optimal complexity for both algorithms with higher $\lambda$, which aligns with Corollary 1.

Figure 8(f) reveals that when $\lambda$ is relatively small, the difference is greater than 0, whereas when $\lambda$ is relatively large, the difference is less than 0. This implies that, for a given dataset, computing device, and decision tree depth, the Boosting algorithm is more optimal when $\lambda$ is relatively small; conversely, the Bagging algorithm is more optimal when $\lambda$ is relatively large. This observation aligns with Corollary 2.

Figure 8(l) exhibits a similar trend: when $\lambda$ is smaller, the difference is greater than 0, and when $\lambda$ is larger, the difference is less than 0. This indicates that if the impact of unit performance on profit is relatively low, Boosting's optimal complexity exceeds Bagging's for a given dataset, computing device, and decision tree depth. Conversely, if the unit performance impact is relatively high, Bagging's optimal complexity exceeds Boosting's, which is consistent with Corollary 3.

Validation of $c_1$ and $c_2$

To verify the corollaries concerning $c_1$ and $c_2$, it is necessary to establish device coefficients and data coefficients, both with a value range of 0 to 10. The device coefficient $d$ is determined by the computing device running the algorithm: a slower running speed and poorer overall performance yield a lower device coefficient, while a faster running speed and better overall performance yield a higher one. We utilized three devices, labeled 2687 (Intel Xeon E5-2687W v3), 6240 (Intel Xeon Gold 6240 2.6GHz/18C), and 6320 (Intel Xeon Gold 6320 2.2GHz/26C), and assigned device coefficients of 3 for the 2687 device, 5 for the 6240 device, and 8 for the 6320 device. The data coefficient $s$ is determined by the size of the dataset and the number of features: a larger dataset with more features yields a higher data coefficient, while a smaller dataset with fewer features yields a lower one. In the experiments, we used two datasets and, based on their volume and feature count, defined the data coefficient as 4 for MNIST and 6 for CIFAR-10.

The parameter $c_1$ represents the coefficient of time cost, which is influenced by both the dataset and the computing device; we therefore define $c_1$ as the product of the data coefficient and the device coefficient, obtaining $c_1 = s \cdot d$. The parameter $c_2$ denotes the coefficient of computational resource cost, which is influenced by both the algorithm's runtime and the size of the algorithm's parameter files; we therefore define $c_2$ as the product of the square of the data coefficient and the device coefficient, obtaining $c_2 = s^2 \cdot d$ and encapsulating the dual influence of data and device factors. Table 1 presents the optimal profit and optimal complexity for Bagging and Boosting at different max depths for a fixed $\lambda$, along with the device coefficients, data coefficients, and the resulting $c_1$ and $c_2$.
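The coefficient grid implied by these definitions can be reproduced directly; the rule $c_1 = s \cdot d$, $c_2 = s^2 \cdot d$ and the coefficient values are from the text, while the code itself is ours.

```python
# Reconstructing the cost coefficients used in Table 1 from the stated rules
# c1 = s*d and c2 = s^2*d, with the device and data coefficients given above.
device_coef = {"2687": 3, "6240": 5, "6320": 8}
data_coef = {"MNIST": 4, "CIFAR-10": 6}

for ds, s in data_coef.items():
    for dev, d in device_coef.items():
        print(f"{ds} on {dev}: c1 = {s * d}, c2 = {s**2 * d}")
# MNIST on 2687 -> c1 = 12, c2 = 48, matching the first row of Table 1.
```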

From Table 1, we can see that for the same max depth, the optimal complexity of Bagging tends to decrease as $c_2$ increases, while Bagging's optimal profit shows both increases and decreases as $c_1$ and $c_2$ increase. Both the optimal complexity and the optimal profit of Boosting decrease as $c_1$ and $c_2$ increase. This is in line with Corollary 4.

Table 1. $c_1$ and $c_2$ validation (fixed $\lambda$).

m*_Bag   π*_Bag   s   d   m*_Boost   π*_Boost   c1   c2    Δπ*      Δm*   max depth
40       0.446    4   3   120        0.462      12   48    -0.016   -80   10
120      0.524    6   3   170        0.335      18   108   0.199    -50   10
100      0.502    6   8   140        0.361      48   288   0.141    -40   10
140      0.441    4   5   140        0.476      20   80    -0.036   0     15
100      0.491    6   8   140        0.387      48   288   0.104    -40   15

Here s and d are the data and device coefficients, c1 = s·d, c2 = s²·d, Δπ* = π*_Bag − π*_Boost, and Δm* = m*_Bag − m*_Boost.

When $c_1$ and $c_2$ are not significantly different, the optimal profit difference is negative, indicating that Boosting's optimal profit is greater than Bagging's. However, when $c_1$ is relatively low and $c_2$ is relatively high, the optimal profit difference is positive, indicating that Bagging's optimal profit surpasses Boosting's. This aligns with Corollary 5.

Similarly, when $c_2$ is relatively high and $c_1$ is relatively low, the optimal complexity difference is less than 0, meaning that Boosting's optimal complexity exceeds Bagging's. When $c_2$ is relatively small, the optimal complexity difference is greater than 0, indicating that Bagging's optimal complexity is greater than Boosting's. This aligns with Corollary 6.

Conclusions

The results reveal the following: (1) The optimal profit and complexity of Bagging and Boosting are influenced not only by algorithm preferences, computing devices, and datasets, but also by base learner characteristics, such as decision tree depth and random seed settings. (2) As ensemble complexity increases, both algorithms show increases in performance, runtime, and computational cost. However, once performance stabilizes, Bagging achieves higher accuracy on complex datasets at equal complexity levels, while Boosting performs better on simpler ones. Boosting's runtime increases with complexity, whereas Bagging's remains nearly constant. Moreover, Boosting exhibits a quadratic increase in computational cost, in contrast to Bagging's linear growth. (3) Given a fixed dataset, device, and tree depth, both algorithms show increasing optimal profit with greater per-unit performance impact. When the impact is relatively low, Boosting yields higher optimal profit but greater optimal complexity; when it is relatively high, Bagging performs better with similarly increased complexity. (4) When time and computational costs are comparable, Boosting achieves higher optimal profit. However, under relatively low time cost and relatively high computational cost, Bagging yields higher profit while Boosting shows greater complexity. Conversely, when computational cost is relatively low, Bagging exhibits greater optimal complexity than Boosting.

This study provides theoretical guidance for selecting between Bagging and Boosting in ensemble learning. When faced with a dataset requiring ensemble analysis, decision-makers can first estimate algorithm costs based on their preferences, dataset characteristics, and computing resources. They can then determine the sign and magnitude of the difference in optimal profits between the two algorithms to guide their selection. After choosing an algorithm, decision-makers can also decide on the appropriate number of base learners. Specifically, if cost sensitivity is a priority, Bagging is preferred; if performance is the main concern, Boosting is superior. For relatively simple datasets and average device performance, Boosting is recommended. For complex datasets and high-performance devices, Bagging is more suitable. In general, the number of base learners for Boosting should be set higher than for Bagging, especially when the dataset is complex or the device performance is limited.

This study has several limitations that suggest directions for future research. First, our analysis focuses specifically on Bagging and Boosting within ensemble learning, and the proposed framework has not yet been extended to other algorithm families. Applying this model to a broader range of machine learning or optimization algorithms, especially those commonly used in operations research, could enhance its generalizability. Second, the model assumes rational and risk-neutral decision-makers who aim to maximize algorithmic profit. While this assumption facilitates tractable analysis, it may not fully capture real-world decision behavior, where bounded rationality, heuristics, or risk aversion may influence choices. Finally, although we adopt a linear formulation of profit as performance minus cost for the sake of interpretability, alternative utility structures such as non-linear or risk-sensitive functions may better represent decision-maker preferences and offer promising directions for future work.


Acknowledgements

This study was partially funded by the Natural Science Foundation of Tianjin (No. 24JCQNJC01560), the National Natural Science Foundation of China (72471165, 72101176), and the Emerging Frontiers Cultivation Program of Tianjin University Interdisciplinary Center.

Author contributions

Hongke Zhao conceived the method and experiment(s), Wenhui Liu and Yaxian Wang conducted the experiment(s), Likang Wu analysed the results. All authors reviewed the manuscript.

Data availability

All data generated or analysed during this study are included in the published articles: MNIST30, CIFAR-10 and CIFAR-10031, and IMDB32.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-025-15971-0.

References

  • 1. Kunapuli, G. Ensemble Methods for Machine Learning (Simon and Schuster, 2023).
  • 2. Altman, N. & Krzywinski, M. Ensemble methods: Bagging and random forests. Nat. Methods 14, 933–935 (2017).
  • 3. Abbasi, A. et al. Authorship identification using ensemble learning. Sci. Rep. 12, 9537 (2022).
  • 4. Wang, J. et al. Comparative performance of multiple ensemble learning models for preoperative prediction of tumor deposits in rectal cancer based on MR imaging. Sci. Rep. 15, 4848 (2025).
  • 5. Rahmatinejad, Z. et al. A comparative study of explainable ensemble learning and logistic regression for predicting in-hospital mortality in the emergency department. Sci. Rep. 14, 3406 (2024).
  • 6. Vlasenko, T. et al. Ensemble learning based sustainable approach to rebuilding metal structures prediction. Sci. Rep. 15, 1210 (2025).
  • 7. Salehi, A. & Khedmati, M. Hybrid clustering strategies for effective oversampling and undersampling in multiclass classification. Sci. Rep. 15, 3460 (2025).
  • 8. Salehi, A. R. & Khedmati, M. A cluster-based SMOTE both-sampling (CSBBoost) ensemble algorithm for classifying imbalanced data. Sci. Rep. 14, 5152 (2024).
  • 9. Zhou, Z.-H. Ensemble Methods: Foundations and Algorithms (CRC Press, 2012).
  • 10. Zhang, T. et al. Bagging-based machine learning algorithms for landslide susceptibility modeling. Nat. Hazards 110, 823–846 (2022).
  • 11. Bühlmann, P. & Yu, B. Analyzing bagging. Ann. Stat. 30, 927–961 (2002).
  • 12. Freund, Y., Schapire, R. E. et al. Experiments with a new boosting algorithm. In ICML, Vol. 96, 148–156 (1996).
  • 13. Guo, J. et al. Boost: A robust ten-fold expansion method on hour-scale. Nat. Commun. 16, 2107 (2025).
  • 14. Odegua, R. An empirical study of ensemble techniques (bagging, boosting and stacking). In Proceedings of the Conference on Deep Learning, IndabaX (2019).
  • 15. Rekha, G., Tyagi, A. K. & Krishna Reddy, V. Solving class imbalance problem using bagging, boosting techniques, with and without using noise filtering method. Int. J. Hybrid Intell. Syst. 15, 67–76 (2019).
  • 16. Colakovic, I. & Karakatič, S. FairBoost: Boosting supervised learning for learning on multiple sensitive features. Knowl.-Based Syst. 280, 110999 (2023).
  • 17. Sun, J., Li, J. & Fujita, H. Multi-class imbalanced enterprise credit evaluation based on asymmetric bagging combined with light gradient boosting machine. Appl. Soft Comput. 130, 109637 (2022).
  • 18. González, S., García, S., Del Ser, J., Rokach, L. & Herrera, F. A practical tutorial on bagging and boosting based ensembles for machine learning: Algorithms, software tools, performance study, practical perspectives and opportunities. Inf. Fusion 64, 205–237 (2020).
  • 19. Ghojogh, B. & Crowley, M. The theory behind overfitting, cross validation, regularization, bagging, and boosting: Tutorial. arXiv preprint arXiv:1905.12787 (2019).
  • 20. Zhao, C., Peng, R. & Wu, D. Bagging and boosting fine-tuning for ensemble learning. IEEE Trans. Artif. Intell. 5, 1728–1742 (2024).
  • 21. Simon, H. A. Rational decision making in business organizations. Am. Econ. Rev. 69, 493–513 (1979).
  • 22. Biggs, M., Hariss, R. & Perakis, G. Constrained optimization of objective functions determined from random forests. Prod. Oper. Manag. 32, 397–415 (2023).
  • 23. Kompan, M., Gaspar, P., Macina, J., Cimerman, M. & Bielikova, M. Exploring customer price preference and product profit role in recommender systems. IEEE Intell. Syst. 37, 89–98 (2021).
  • 24. Teinemaa, I., Albert, J. & Goldenberg, D. Uplift modeling: From causal inference to personalization. In Companion Proceedings of the Web Conference (2021).
  • 25. Bertsimas, D. & Kallus, N. From predictive to prescriptive analytics. Manag. Sci. 66, 1025–1044 (2020).
  • 26. Abbas, A. E. Constructing multiattribute utility functions for decision analysis. In Risk and Optimization in an Uncertain World, 62–98 (2010).
  • 27. Breiman, L. Bagging predictors. Mach. Learn. 24, 123–140 (1996).
  • 28. Schapire, R. E. & Freund, Y. Boosting: Foundations and algorithms. Kybernetes 42, 164–166 (2013).
  • 29. Dietterich, T. G. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Mach. Learn. 40, 139–157 (2000).
  • 30. Deng, L. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Process. Mag. 29, 141–142 (2012).
  • 31. Krizhevsky, A. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto (2009).
  • 32. Pandey, A. et al. Sentiment analysis of IMDB movie reviews. In 2024 First International Conference on Software, Systems and Information Technology (SSITCON), 1–6 (IEEE, 2024).
  • 33. Cao-Van, K. et al. Prediction of heart failure using voting ensemble learning models and novel data normalization techniques. Eng. Appl. Artif. Intell. 154, 110888 (2025).
