Bioinformatics Advances
. 2025 Feb 22;5(1):vbaf002. doi: 10.1093/bioadv/vbaf002

A new ensemble learning method stratified sampling blending optimizes conventional blending and improves prediction performance

Na Miao 1,2, Mengke Yang 3,4, Pingping Han 5,6, Jiakun Qiao 7,8, Zhaoxuan Che 9,10, Fangjun Xu 11,12, Xiangyu Dai 13,14, Mengjin Zhu 15,16,17,18,
Editor: Shanfeng Zhu
PMCID: PMC11908643  PMID: 40092529

Abstract

Motivation

Ensemble learning, as a powerful machine learning method, improves overall prediction performance by combining the prediction results of multiple base models. Blending, a popular ensemble learning method, trains multiple base models and feeds their predictions into a meta model to obtain the final predictions. However, conventional blending divides the training set by simple random sampling, which introduces bias and large variance, affecting the stability and accuracy of prediction. In this study, we propose a new algorithm, stratified sampling blending (ssBlending), which addresses the instability of conventional blending caused by the random partition of the training set and further improves prediction accuracy.

Results

We used multiple genotype datasets from different species including animal (pig), plant (loblolly pine), and microorganism (yeast) to test the prediction performance of ssBlending. The across-species multi-dataset verification results reveal that ssBlending is superior to conventional blending in terms of prediction accuracy and stability. In addition, we optimized the training set sampling rate (BestH) to facilitate the practical application of the ssBlending algorithm. In summary, this study proposes a completely new algorithm combining a stratification strategy with conventional blending, which provides more options for ensemble learning in various fields.

Availability and implementation

https://figshare.com/s/23122a18dc8a35f12ff6.

1 Introduction

In the information age, the growth rate of data volume far exceeds human processing capabilities, and the emergence of machine learning is expected to address this problem (Samuel 1959). Machine learning refers to computing programs that automatically mine the hidden feature information in data. Due to its excellent performance, machine learning has received increasing attention. As one of the core areas of artificial intelligence (Michalski et al. 2013, Helm et al. 2020), machine learning can accomplish autonomous prediction, classification (Lau and Wu 2003, Li and Zhou 2007, Huang et al. 2008, Ben-Haim and Tom-Tov 2010), and other challenging tasks by training on and analyzing big data (Papadimitriou and Sun 2008, Havens et al. 2012, Zhang et al. 2012). Machine learning algorithms can be divided into supervised learning, unsupervised learning, and reinforcement learning. Common algorithms include support vector machines, decision trees, linear regression, naive Bayes, random forests, and elastic net regression (Mahesh 2020), and have been widely used in various fields (Jordan and Mitchell 2015, Sarker 2021). However, the prediction accuracy of a single machine learning method is often far from satisfactory. High performance relies on appropriate parameter settings, but the complexity of parameter tuning greatly hinders application in practical projects. With the increasing scale and complexity of data, it is impractical to employ a single machine learning model in all scenarios (Töscher et al. 2009).

Ensemble learning combining multiple learners can be categorized into homogeneous ensemble learning and heterogeneous ensemble learning, based on the composition of individual learners (Zhou 2012). The ensemble learning in which individual learners are trained by the same algorithm is known as homogeneous ensemble learning, while that containing multiple individual learners which are trained by different algorithms is known as heterogeneous ensemble learning. Ensemble learning has been widely applied in various fields such as financial field for stock price prediction (Khan et al. 2022, Hao et al. 2023), medical field for risk prediction of prevalent diseases (Huang et al. 2020, Nilashi et al. 2022, Rezaei et al. 2023), and so on (Dong et al. 2020, Dai et al. 2023, Xu et al. 2023, Xu et al. 2024a, 2024b). Common ensemble learning methods include bagging (Breiman 1996), boosting (Freund and Schapire 1996), stacking (Wolpert 1992), and blending (Töscher et al. 2009).

Blending implements a direct random split of the training set into a holdout set and a validation set, based on a predetermined holdout rate. Blending requires training the base learners and the meta learner only once, and thus significantly reduces computation time in practical applications compared to stacking. However, the prediction accuracy and stability of blending are compromised by the randomness of its training set partitioning. To overcome this limitation, this study proposes a new algorithm, stratified sampling blending (ssBlending), which combines a stratified sampling strategy with blending. Compared to conventional blending, ssBlending improves prediction accuracy while maintaining running speed and stability, thus addressing the computational challenges associated with high-dimensional data.

2 Methods

2.1 Description of datasets

The data used in this study include genotype data for genomic prediction and the corresponding phenotypic data. Genotype quality control was conducted using the PLINK software (Purcell et al. 2007), and dimensionality reduction of the genotype data was performed using the spls package in R (Chung et al. 2012). Individuals with missing phenotypes were deleted for each trait, and the numbers of individuals for six traits of different species across the datasets are summarized in Table 1.

Table 1.

Number of individuals across different datasets for six traits of pig, loblolly pine, and yeast.

Pig             Loblolly pine      Yeast
Trait    N_ID   Trait    N_ID      Trait      N_ID
ADG      4046   RT       925       CuSO4      4276
BF       4046   ABA      861       Diamide    4309
BH       3892   THLC     861       BER        4310
OTW      4052   CWAB     861       Formamide  4268
LEA      3866   TSH      861       IAA        4286
LP       3866   C5C6     910       Lactose    3816

2.1.1 Real datasets

The real datasets used in the study contain data of pig, loblolly pine, and yeast.

The pig dataset used in the study contains the data of average daily weight gain (ADG), backfat thickness (BF), body height (BH), off test weight (OTW), loin eye area (LEA), and lean percentage (LP). This dataset has 4555 samples, with 47 258 SNPs.

The loblolly pine dataset contains the data of root number (RT), average branch angle (ABA), total height to the base of the live crown (THLC), crown width across the planting beds (CWAB), total stem height (TSH), and 5- and 6-carbon sugar content (C5C6). This dataset has 926 samples, with 2788 SNPs (Resende Jr et al. 2012).

The yeast genotype data contain 4390 samples, with 28 220 SNPs per sample. Six traits are investigated, namely growth in the presence of copper sulfate (CuSO4), diamide, berbamine (BER), formamide, indole acetic acid (IAA), and lactose (Bloom et al. 2015).

2.1.2 Simulated datasets

Genotype information was randomly extracted from the real pig genotype data to obtain the simulated genotype dataset. Phenotypes corresponding to heritability values of 0.2, 0.5, and 0.8 were generated using a phenotype simulation method.

In the data simulation, we set three gradient quantitative trait nucleotide (QTN) numbers (200, 2000, and 5000) for different heritabilities (h2 = 0.2, 0.5, and 0.8) to represent different numbers of causal loci affecting the phenotypic values of the simulated traits. The QTN effects follow a normal distribution. The dataset required by the study is then simulated based on the chosen heritability value. The phenotypic simulation involves an additive effect value and a residual effect value, the sum of which is defined as the final simulated phenotypic value of the individual (Tang and Liu 2019). The simulated phenotype value is calculated according to the following formula:

y = Xβ + e (1)

where y represents an n × 1 phenotype vector; X is an n × m genotype matrix (n, number of simulated individuals; m, number of extracted QTNs); β is an m × 1 QTN effect vector; and e is an n × 1 random residual vector.

Finally, we obtained nine simulated datasets with different genetic architectures with each dataset containing 3000 samples and 20 000 SNPs.
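The simulation described above can be sketched as follows. This Python sketch is illustrative only (the dimensions, seed, and variance-scaling approach are assumptions, not the paper's exact procedure): QTN effects are drawn from a normal distribution, and the residual variance is scaled so that the realized heritability matches the target h2.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Illustrative dimensions, not the paper's 3000 samples x 20 000 SNPs.
n_ind, n_snp, n_qtn, h2 = 500, 2000, 200, 0.5

# Simulated genotypes coded 0/1/2 (allele counts).
G = rng.integers(0, 3, size=(n_ind, n_snp)).astype(float)

# Randomly pick QTNs and draw their effects from a normal distribution.
qtn_idx = rng.choice(n_snp, size=n_qtn, replace=False)
beta = rng.normal(0.0, 1.0, size=n_qtn)

# Additive genetic values g = X beta, with X the n x m QTN genotype matrix.
X = G[:, qtn_idx]
g = X @ beta

# Scale the residual variance so that heritability h2 = var(g) / var(y).
var_g = g.var()
var_e = var_g * (1.0 - h2) / h2
e = rng.normal(0.0, np.sqrt(var_e), size=n_ind)

y = g + e  # final simulated phenotype, as in Equation (1)
```

The realized heritability of `y` is then close to the target value by construction, up to sampling noise in the residual draw.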

2.2 Description of methods

2.2.1 Blending

The blending algorithm consists of two layers of learners. The first layer is composed of three base learners: a support vector regression model (Chang and Lin 2011) implemented by the svm function in the e1071 R package, a regression tree model (Grubinger et al. 2014) implemented by the evtree function in the evtree R package, and an elastic net regression model (Durbin and Willshaw 1987) implemented by the glmnet function in the glmnet R package. In the second layer, a linear model implemented by the lm function in R is used as the meta learner.
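The two-layer blending scheme can be sketched in Python. The paper's implementation is in R (e1071, evtree, glmnet, lm); here scikit-learn models are assumed stand-ins (SVR, a decision tree regressor in place of evtree, ElasticNet, and LinearRegression as the meta learner), and accuracy is measured as the Pearson correlation used throughout the paper.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.model_selection import train_test_split
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 50))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=400)

# Split off a test set, then split the training data into a holdout part
# (to fit the base learners) and a validation part (to fit the meta learner).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
X_hold, X_val, y_hold, y_val = train_test_split(X_tr, y_tr, test_size=0.3, random_state=0)

# First layer: three base learners (illustrative stand-ins for the R models).
bases = [SVR(), DecisionTreeRegressor(max_depth=5, random_state=0), ElasticNet(alpha=0.1)]
for m in bases:
    m.fit(X_hold, y_hold)

# Base-learner predictions on the validation set become the meta features.
meta_train = np.column_stack([m.predict(X_val) for m in bases])
meta = LinearRegression().fit(meta_train, y_val)

# Final prediction: base predictions on the test set fed to the meta learner.
meta_test = np.column_stack([m.predict(X_te) for m in bases])
y_pred = meta.predict(meta_test)
r, _ = pearsonr(y_te, y_pred)  # accuracy as Pearson correlation
```

Note that base and meta learners are each trained only once, which is the speed advantage of blending over stacking mentioned in the Introduction.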

2.2.2 ssBlending

The ssBlending algorithm is an evolution of the blending algorithm: it also follows a two-layer framework and employs a meta learner to generate the final predicted values. The conventional blending algorithm randomly selects the holdout set, and this randomness tends to result in strong instability. To address this issue, we apply a stratified sampling strategy to the holdout set partition in blending.

In our study, we combine blending with the cluster analysis stratification strategy (CA) and the phenotype probability stratification strategy (PP), respectively. Figure 1 presents the two approaches. In the CA strategy, we adopt the K-means clustering algorithm, implemented by the kmeans function in R, for clustering.
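A minimal sketch of the CA strategy, assuming K-means with 8 clusters (the stratum number used in Section 3.1) and proportional allocation within clusters; the function name and details are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def ca_stratified_holdout(genotypes, holdout_rate, n_strata=8, seed=0):
    """Cluster individuals on genotype data, then sample the holdout
    set proportionally from each cluster (CA strategy sketch)."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_strata, n_init=10, random_state=seed).fit_predict(genotypes)
    holdout_idx = []
    for k in range(n_strata):
        members = np.flatnonzero(labels == k)
        # Proportional allocation: each cluster contributes ~holdout_rate of itself.
        n_take = max(1, round(holdout_rate * members.size))
        holdout_idx.extend(rng.choice(members, size=n_take, replace=False))
    return np.sort(np.array(holdout_idx))

G = np.random.default_rng(1).normal(size=(300, 40))
idx = ca_stratified_holdout(G, holdout_rate=0.3)
```

Because every cluster is represented in the holdout set, the base learners see the full range of genotype structure rather than a potentially unlucky random draw.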

Figure 1.

Figure 1.

Schematic diagram of two stratification strategies. (a) Cluster analysis (CA) strategy of ssBlending. The genotype information of the population is divided into N categories, which are then sampled. (b) Phenotype probability (PP) stratification strategy of ssBlending. The probability density curve of each phenotype is plotted, the phenotype is divided into several intervals of equal width, the probability of each interval is calculated using a Riemann integral, and stratified sampling is finally performed according to the resulting probabilities. Both stratification strategies reduce the risk of selecting unrepresentative samples caused by the randomness of conventional Blending in dividing the training set, so that the sampled individuals better represent the original population.

We combine the two stratification strategies (CA and PP) with blending for partitioning data, and define the resulting methods as CAssBlending and PPssBlending, respectively. Phenotype values are then predicted on the test set using Blending, CAssBlending, and PPssBlending, and the prediction accuracy of the three methods is evaluated by the Pearson correlation coefficient between predicted and observed values. Figure 2 shows the flowchart of ssBlending.
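The PP strategy can be sketched under the same assumptions as the CA sketch: the phenotype density is estimated (here with a Gaussian KDE), integrated over equal-width intervals via a Riemann sum, and individuals are drawn from each interval in proportion to its probability mass. All names and numeric choices are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

def pp_stratified_holdout(phenotype, holdout_rate, n_strata=8, seed=0):
    """PP strategy sketch: equal-width phenotype intervals, interval
    probabilities by Riemann integration of the estimated density,
    then proportional sampling from each interval."""
    rng = np.random.default_rng(seed)
    kde = gaussian_kde(phenotype)
    edges = np.linspace(phenotype.min(), phenotype.max(), n_strata + 1)
    # Riemann-sum probability of each interval on a fine grid.
    grid = np.linspace(edges[0], edges[-1], 2000)
    dens, step = kde(grid), (edges[-1] - edges[0]) / 1999
    probs = np.array([
        dens[(grid >= lo) & (grid < hi)].sum() * step
        for lo, hi in zip(edges[:-1], edges[1:])
    ])
    probs /= probs.sum()
    # Assign each individual to its phenotype interval.
    strata = np.clip(np.digitize(phenotype, edges[1:-1]), 0, n_strata - 1)
    n_total = round(holdout_rate * phenotype.size)
    holdout_idx = []
    for k in range(n_strata):
        members = np.flatnonzero(strata == k)
        n_take = min(members.size, round(n_total * probs[k]))
        if n_take > 0:
            holdout_idx.extend(rng.choice(members, size=n_take, replace=False))
    return np.sort(np.array(holdout_idx))

y = np.random.default_rng(2).normal(size=500)
idx = pp_stratified_holdout(y, holdout_rate=0.3)
```

The returned indices would then define the holdout set on which the base learners are trained, exactly where conventional blending uses a simple random draw.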

Figure 2.

Figure 2.

The flowchart of our study. The main process is to determine BestH on the training set (Train_Blending) and then to apply BestH to Blending and ssBlending on the original data.

3 Results

3.1 Optimization of sampling rate of holdout set

To improve the prediction accuracy of the model, we optimized parameters during model training, since prediction accuracy is greatly affected by parameter settings. Using the same computer, server, and computing resources, we set different sampling rates (H = 0.1, 0.15, …, 0.6) in the Blending and ssBlending implementations, with the number of strata set to 8 for both stratification strategies (CA and PP). We then calculated the prediction accuracy on the training_test set for each gradient H value; the H value yielding the highest prediction accuracy is automatically set as BestH. This optimization of the H value is conducive to improving prediction accuracy.
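The BestH search can be sketched as a simple grid search. Here a single ridge regression stands in for the full blending pipeline, purely to keep the example small; in the actual workflow, the accuracy function would run one Blending/ssBlending round at each H and score on the training_test set.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 30))
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=300)

# Reserve an outer test set; BestH is chosen on the training data only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=3)

def accuracy(h):
    # Split the training data at rate h and score by Pearson correlation.
    X_hold, X_val, y_hold, y_val = train_test_split(
        X_tr, y_tr, test_size=h, random_state=3)
    model = Ridge().fit(X_hold, y_hold)
    r, _ = pearsonr(y_val, model.predict(X_val))
    return r

# Grid of candidate holdout rates H = 0.1, 0.15, ..., 0.6.
h_grid = np.arange(0.10, 0.61, 0.05)
scores = [accuracy(h) for h in h_grid]
best_h = h_grid[int(np.argmax(scores))]
```

The chosen `best_h` is then fixed and applied when predicting unknown phenotypes on the outer test set, as described in the Discussion.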

3.2 Prediction performance of different methods on simulations

Table 2 shows the prediction performance of the three methods on the simulated sets at different heritability values. We found that in most cases, at least one ssBlending strategy (CA and/or PP) outperformed Blending, with an increase in prediction accuracy of 0.1439%–0.8866% (Fig. 3). In the h0.5_q200 dataset, the prediction accuracy of Blending was only 0.5669, whereas that of PPssBlending and CAssBlending reached 0.5758 and 0.5683, respectively, both of which were higher than that of Blending.

Table 2.

Prediction accuracy of three methods on different simulated sets.

Trait Blending CAssBlending PPssBlending
h0.2_q200 0.1733 0.1705 0.1641
h0.2_q2000 0.2351 0.2258 0.2409
h0.2_q5000 0.3663 0.3698 0.3624
h0.5_q200 0.5669 0.5683 0.5758
h0.5_q2000 0.4369 0.4329 0.4406
h0.5_q5000 0.4512 0.4518 0.4575
h0.8_q200 0.7870 0.7868 0.7892
h0.8_q2000 0.7865 0.7863 0.7859
h0.8_q5000 0.7566 0.7568 0.7568

Notes: h0.2_q200 represents a simulated dataset with the heritability of 0.2 and QTN number of 200; other representations are similar.

Figure 3.

Figure 3.

Increase in prediction accuracy of ssBlending, compared with Blending on pig simulated set.

3.3 Prediction performance of different methods on real datasets

In the pig data, the prediction accuracy of Blending for six traits (ADG, BF, LEA, OTW, BH, and LP) was 0.5280, 0.2026, 0.1308, 0.3164, 0.3557, and 0.2767, respectively; that of PPssBlending was 0.5373, 0.2064, 0.1359, 0.3080, 0.3589, and 0.2883; and that of CAssBlending was 0.5334, 0.2288, 0.0900, 0.3225, 0.3510, and 0.2752 (Table 3). Compared with Blending, our ssBlending method exhibited an increase of 0.3242%–2.6236% in prediction accuracy for the six traits (Fig. 4).

Table 3.

Prediction accuracy of three methods for six different traits on real pig data.

Trait Blending CAssBlending PPssBlending
ADG 0.5280 0.5334 0.5373
BF 0.2026 0.2288 0.2064
BH 0.3557 0.3510 0.3589
LEA 0.1308 0.0900 0.1359
LP 0.2767 0.2752 0.2883
OTW 0.3164 0.3225 0.3080

Notes: ADG, average daily weight gain; BF, backfat thickness; BH, body height; OTW, off test weight; LEA, loin eye area; LP, lean percentage.

Figure 4.

Figure 4.

Increase in prediction accuracy of ssBlending, compared with Blending, for six pig traits.

The genetic structure of plants differs significantly from that of animals; therefore, we assessed this approach on six traits (RT, ABA, THLC, CWAB, TSH, and C5C6) of loblolly pine. Compared with the blending, the ssBlending exhibited a maximum increase in prediction accuracy of 3.1324% (Fig. 5). The average prediction accuracy of the blending for the six traits was approximately 0.2991, ranging from 0.1448 to 0.4306, while that of CAssBlending and PPssBlending was 0.3002 and 0.3131, respectively (Table 4), further confirming a higher stability and accuracy of our proposed new method, relative to conventional blending.

Figure 5.

Figure 5.

Increase in prediction accuracy of ssBlending, compared with Blending, for six loblolly pine traits.

Table 4.

Prediction accuracy of three methods for different traits on loblolly pine data.

Trait Blending CAssBlending PPssBlending
ABA 0.4306 0.4619 0.4467
C5C6 0.1495 0.1435 0.1508
CWAB 0.4244 0.4407 0.4285
TSH 0.3188 0.3398 0.3140
THLC 0.3265 0.3373 0.3350
RN 0.1448 0.1554 0.1261

Notes: RN, root number; ABA, average branch angle; THLC, total height to the base of the live crown; CWAB, crown width across the planting beds; TSH, total stem height; C5C6, 5- and 6-carbon sugar content.

Further, we performed validation on microbial data using six yeast traits. For each trait, at least one ssBlending strategy surpassed Blending in prediction accuracy (Table 5). For the formamide, IAA, and lactose traits, ssBlending exhibited an increase in prediction accuracy of 0.1056%–0.4146% compared with Blending (Fig. 6).

Table 5.

Prediction accuracy of three methods for six different traits on yeast data.

Trait Blending CAssBlending PPssBlending
BER 0.7608 0.7615 0.7593
CuSO4 0.7369 0.7410 0.7332
Diamide 0.7553 0.7557 0.7539
Formamide 0.6053 0.6063 0.6072
IAA 0.6298 0.6340 0.6325
Lactose 0.5979 0.6002 0.6003

Notes: CuSO4, copper sulfate; BER, berbamine; IAA, indol acetic acid.

Figure 6.

Figure 6.

Increase in prediction accuracy of ssBlending, compared with blending, for six yeast traits.

In addition, we found in the real data that when the distribution of phenotypic values is closer to the normal distribution, the PP strategy usually outperforms the CA strategy; otherwise, the CA strategy is preferable. The LP skewness in pigs is 0.0732, which is small (Supplementary Fig. S1), so the PP strategy outperforms the CA strategy. In loblolly pine, TSH and RN have larger skewnesses than the other traits, and for these traits the PP strategy is not as effective as the original Blending method; the skewness of C5C6 is smaller and its distribution is closer to normal, so the PP strategy outperforms the CA strategy, whereas the larger skewnesses of traits such as ABA, CWAB, and THLC make the CA strategy the more reasonable choice (Supplementary Fig. S2). In the yeast data, the phenotypes of the five traits CuSO4, diamide, BER, formamide, and IAA deviate markedly from normality, and in this case the PP strategy has no obvious advantage over the CA strategy (Supplementary Fig. S3).

Taken together, for different traits across different species (animal, plant, and microbe), our method shows clear advantages across datasets with diverse genetic architectures.

4 Discussion

Stratified sampling is an effective method to accurately reflect the overall characteristics of a population (Singh et al. 1996). When stratified sampling is applied to sample selection, the population is divided into smaller sub-groups, known as strata, based on shared attributes, which ensures the representativeness of the sample. Studies across various learning tasks and many datasets have shown the superiority of stratified sampling over random sampling (Cervellera and Macciò 2018). The choice of stratification method warrants careful consideration. Kim et al. (2013) utilized K-means clustering and decision tree induction as hierarchical rules generated through data mining to segment and classify data, significantly enhancing the percentage of variance explained. Jing et al. (2015) proposed clustering high-dimensional data into several feature layers and employing stratified sampling to generate component datasets through random sampling within the feature layers; their experimental results demonstrate that the clustering outcomes of stratified sampling outperform those of random sampling. Following these inspirations, we introduce a new variant of the blending algorithm, ssBlending, which incorporates a stratified sampling strategy. We stratify the samples either by clustering analysis of genotypic data or by the probability distribution of phenotypes, and integrate these stratification strategies into the blending framework as clustering analysis ssBlending and phenotypic probability ssBlending, respectively.

To reveal the role of stratified sampling in enhancing the prediction performance of the Blending algorithm, we conducted a comparative analysis between the ssBlending and Blending methods. Our results consistently revealed that ssBlending exhibited superior predictive performance compared with Blending. Unlike blending, which relies on simple random sampling, ssBlending utilizes stratified sampling to ensure comprehensive coverage of features in the training set, thus increasing prediction accuracy. When the training set adequately represents the entire dataset, the prediction results will be reliable. Uçar et al. (2020) explored the influence of four sampling methods in machine learning, namely random sampling of training and test sets, systematic sampling, stratified random sampling, and stratified systematic sampling, and demonstrated that the two stratified sampling methods outperform the others. This underscores the importance of a well-structured and diversified training dataset for attaining desirable prediction performance in blending algorithms. Overall, our study highlights the profound impact of stratified sampling on the predictive performance of the ssBlending algorithm.

The superior prediction performance of ssBlending compared to conventional blending primarily stems from its stratified sampling strategy. Stratified sampling is advantageous in selecting sample sets more representative of the original population, compared to random sampling. By ensuring proper representativeness for each group within the population, stratified sampling enhances the accuracy of estimating model parameters, particularly when the population is divided into homogeneous subgroups. In general, stratified sampling is superior to simple random sampling when a robust machine learning model is trained on the training set of the population, especially when the population is heterogeneous or when the sample size is limited. However, simple random sampling is more advantageous when the population is homogeneous or lacks clear subgroup division.

The optimization of the data structure of the training set by stratified sampling can enhance the prediction accuracy of the ssBlending algorithm. Maximizing the inter-stratum phenotypic variance captured by the training set is critical for improving the prediction accuracy of the models. The comparison of ssBlending and Blending revealed that the prediction performance of ssBlending is better than that of conventional Blending. Moreover, in machine learning, the selected parameters affect overall model performance; even for the same learner, the prediction results vary with the parameter settings (Petrak 2000, Thornton et al. 2013). Hence, we optimized parameter settings through the training sets and applied the BestH obtained in training to the test set for the prediction of unknown phenotypes. The objective of this methodology is to efficiently identify the parameter values that optimally align with the characteristics of the specified dataset, thereby enhancing the accuracy of ensemble learning. Our study confirmed that this optimization process provides more effective prediction performance for ssBlending.

5 Conclusion

This study addresses a drawback of the blending algorithm, namely, the inaccuracy in prediction performance caused by its random sampling of training sets. We propose a stratified sampling approach for training sets, including a clustering analysis strategy and a phenotypic probability strategy. By replacing simple random sampling with these two stratified sampling strategies, we introduce the ssBlending algorithm, which combines the advantages of heterogeneous ensemble learning and stratified sampling. Our data across different species reveal that ssBlending exhibits superior prediction performance compared with the conventional blending algorithm, highlighting the potential of ssBlending in genomic selection applications. By reducing the instability and inaccuracy of the blending algorithm via stratified sampling, ssBlending provides an improved approach with high prediction accuracy for genomic selection tasks. Notably, the proposed ssBlending method is not limited to genomic prediction, and its applications in other fields remain to be further investigated.

Supplementary data

Supplementary data are available at Bioinformatics Advances online.

Supplementary Material

vbaf002_Supplementary_Data

Contributor Information

Na Miao, Key Lab of Agricultural Animal Genetics, Breeding, and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China.

Mengke Yang, Key Lab of Agricultural Animal Genetics, Breeding, and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China.

Pingping Han, Key Lab of Agricultural Animal Genetics, Breeding, and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China.

Jiakun Qiao, Key Lab of Agricultural Animal Genetics, Breeding, and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China.

Zhaoxuan Che, Key Lab of Agricultural Animal Genetics, Breeding, and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China.

Fangjun Xu, Key Lab of Agricultural Animal Genetics, Breeding, and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China.

Xiangyu Dai, Key Lab of Agricultural Animal Genetics, Breeding, and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China.

Mengjin Zhu, Key Lab of Agricultural Animal Genetics, Breeding, and Reproduction of Ministry of Education, Huazhong Agricultural University, Wuhan 430070, China; The Cooperative Innovation Center for Sustainable Pig Production, Huazhong Agricultural University, Wuhan 430070, China.

Conflict of interest

None declared.

Funding

This work was supported by the National Key Research and Development Program of China [2021YFD1301201]; the Natural Science Foundation of China [31961143020]; and the Earmarked Fund for China Agriculture Research System [CARS-35].

Data availability

The data underlying this article are available in the article and in its online supplementary material (https://figshare.com/s/23122a18dc8a35f12ff6).

References

  1. Ben-Haim Y, Tom-Tov E.  A streaming parallel decision tree algorithm. J Mach Learn Res  2010;11:849–72. [Google Scholar]
  2. Bloom JS, Kotenko I, Sadhu MJ  et al.  Genetic interactions contribute less than additive effects to quantitative trait variation in yeast. Nat Commun  2015;6:8712. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Breiman L.  Bagging predictors. Mach Learn  1996;24:123–40. [Google Scholar]
  4. Cervellera C, Macciò D.  Distribution-preserving stratified sampling for learning problems. IEEE Trans Neural Netw Learn Syst  2018;29:2886–95. [DOI] [PubMed] [Google Scholar]
  5. Chang CC, Lin CJ.  LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol  2011;2:1–27. [Google Scholar]
  6. Chung D, Chun H, Keles S.. An Introduction to the ‘spls’ Package, Version 1.0. Madison, WI: University of Wisconsin, 2012. [Google Scholar]
  7. Dai D  et al.  Multi-objective cluster ensemble based on filter refinement scheme. IEEE Trans Knowl Data Eng  2023;35:8257–69. [Google Scholar]
  8. Dong X, Yu Z, Cao W  et al.  A survey on ensemble learning. Front Comput Sci  2020;14:241–58. [Google Scholar]
  9. Durbin R, Willshaw D.  An analogue approach to the travelling salesman problem using an elastic net method. Nature  1987;326:689–91. [DOI] [PubMed] [Google Scholar]
  10. Freund Y, Schapire RE. Experiments with a new boosting algorithm. In: ICML'96: Proceedings of the Thirteenth International Conference on International Conference on Machine Learning. Citeseer, 1996:148–56.
  11. Grubinger T, Zeileis A, Pfeiffer K-P.  Evtree: evolutionary learning of globally optimal classification and regression trees in R. J Stat Soft  2014;61:1–29. [Google Scholar]
  12. Hao J, He F, Ma F  et al.  Machine learning vs deep learning in stock market investment: an international evidence. Ann Oper Res  2023:1–23. [Google Scholar]
  13. Havens TC, Bezdek JC, Leckie C  et al.  Fuzzy c means algorithms for very large data. IEEE Trans Fuzzy Syst  2012;20:1130–46. [Google Scholar]
  14. Helm JM, Swiergosz AM, Haeberle HS  et al.  Machine learning and artificial intelligence: definitions, applications, and future directions. Curr Rev Musculoskelet Med  2020;13:69–76. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Huang KZ  et al.  Maxi-min margin machine: learning large margin classifiers locally and globally. IEEE Trans Neural Netw  2008;19:260–72. [DOI] [PubMed] [Google Scholar]
  16. Huang GC, Zhou L, Shi JW  et al. Application and comparison of machine learning algorithms in disease risk prediction. Healthy Development and Policy Research  2020;23:432–36. [Google Scholar]
  17. Jing LP, Tian K, Huang JZ.  Stratified feature sampling method for ensemble clustering of high dimensional data. Pattern Recogn  2015;48:3688–702. [Google Scholar]
  18. Jordan MI, Mitchell TM.  Machine learning: trends, perspectives, and prospects. Science  2015;349:255–60. [DOI] [PubMed] [Google Scholar]
  19. Khan W, Ghazanfar MA, Azam MA  et al.  Stock market prediction using machine learning classifiers and social media, news. J Ambient Intell Human Comput  2022:1–24. [Google Scholar]
  20. Kim YJ, Oh Y, Park S  et al.  Stratified sampling design based on data mining. Healthc Inform Res  2013;19:186–95. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Lau KW, Wu QH.  Online training of support vector classifier. Pattern Recogn  2003;36:1913–20. [Google Scholar]
  22. Li M, Zhou ZH.  Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples. IEEE Trans Syst, Man, Cybern A  2007;37:1088–98. [Google Scholar]
  23. Mahesh B.  Machine learning algorithms-a review. IJSR  2020;9:381–6. [Google Scholar]
  24. Michalski RS, Carbonell JG, Mitchell TM.  Machine Learning: An Artificial Intelligence Approach. Springer Science & Business Media, 2013:572. [Google Scholar]
  25. Nilashi M, Abumalloh RA, Minaei-Bidgoli B  et al.  Predicting parkinson's disease progression: evaluation of ensemble methods in machine learning. J Healthc Eng  2022;2022:2793361. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Papadimitriou S, Sun JD. Distributed co-clustering with map-reduce: A case study towards petabyte-scale end-to-end mining. In: 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008:512–21.
  27. Petrak J. Fast subsampling performance estimates for classification algorithm selection. In: Proceedings of the ECML-00 Workshop on Meta-Learning: Building Automatic Advice Strategies for Model Selection and Method Combination, p. 3–14. 2000.
  28. Purcell S, Neale B, Todd-Brown K  et al.  PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet  2007;81:559–75. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Resende MFR, Muñoz P, Resende MDV  et al.  Accuracy of genomic selection methods in a standard data set of loblolly pine (Pinus taeda L.). Genetics  2012;190:1503–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Rezaei HB, Amjadian A, Sebt MV  et al.  An ensemble method of the machine learning to prognosticate the gastric cancer. Ann Oper Res  2023;328:151–92. [Google Scholar]
  31. Samuel AL.  Some studies in machine learning using the game of checkers. IBM J Res Dev  1959;3:211. [Google Scholar]
  32. Sarker IH.  Machine learning: algorithms, real-world applications and research directions. SN Comput Sci  2021;2:160. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Singh R, Mangat NS, Singh R  et al. Stratified sampling. In: Elements of Survey Sampling, 1996, pp. 102–44.
  34. Tang Y, Liu XL.  G2P: a Genome-Wide-Association-Study simulation tool for genotype simulation, phenotype simulation and power evaluation. Bioinformatics  2019;35:3852–4. [DOI] [PubMed] [Google Scholar]
  35. Thornton C, Hutter F, Hoos HH  et al. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 847–55. 2013.
  36. Töscher A, Jahrer M, Bell RM.  The bigchaos solution to the netflix grand prize. Netflix prize documentation, 2009, 1–52. [Google Scholar]
  37. Uçar MK, Nour M, Sindi H  et al.  The effect of training and testing process on machine learning in biomedical datasets. Math Probl Eng  2020;2020:2836236. 10.1155/2020/2836236 [DOI] [Google Scholar]
  38. Wolpert DH.  Stacked generalization. Neural Networks  1992;5:241–59. [Google Scholar]
  39. Xu Y, Yu Z, Cao W  et al.  A novel classifier ensemble method based on subspace enhancement for High-Dimensional data classification. IEEE Trans Knowl Data Eng  2023;35:16–30. [Google Scholar]
  40. Xu YH, Yu ZW, Chen CLP.  Classifier ensemble based on multiview optimization for High-Dimensional imbalanced data classification. IEEE Trans Neural Netw Learning Syst  2024a;35:870–83. [DOI] [PubMed] [Google Scholar]
  42. Xu YH, Yu ZW, Chen CLP.  Improved Contraction-Expansion subspace ensemble for High-Dimensional imbalanced data classification. IEEE Trans Knowl Data Eng  2024b;36:5194–205. [Google Scholar]
  43. Zhang C, Li F, Jestes J. Efficient parallel kNN joins for large data in MapReduce. In: Proceedings of the 15th International Conference on Extending Database Technology. 2012:38–49.
  44. Zhou Z-H.  Ensemble Methods: Foundations and Algorithms. CRC Press, 2012:236. [Google Scholar]



Articles from Bioinformatics Advances are provided here courtesy of Oxford University Press
