Abstract
Cardiovascular disease is a common disease that threatens human health. To predict it more accurately, this paper proposes a cardiovascular disease prediction model that combines multiple feature selection, an improved particle swarm optimization algorithm, and an extreme gradient boosting tree. First, the dataset is preprocessed, and an XGBoost cardiovascular disease prediction model is constructed, trained, and compared with other algorithms. Then, multiple feature selection is performed by combining two-factor Pearson correlation analysis with feature importance ranking, and the optimal feature subset is used as the feature input. Finally, the improved particle swarm optimization algorithm is used to tune the hyperparameters of the extreme gradient boosting tree, and the optimal hyperparameter combination is selected to construct the MFS-DLPSO-XGBoost model. The recall, precision, accuracy, F1 score, and area under the ROC curve (AUC) of the MFS-DLPSO-XGBoost model reached 71.4%, 76.3%, 74.7%, 73.6%, and 80.8%, respectively, improvements of 3.6%, 3.2%, 2.7%, 3.2%, and 2.3% over XGBoost. The results indicate that the proposed model has good classification performance and can assist doctors and patients in predicting and preventing heart disease.
Keywords: Cardiovascular disease, Machine learning, XGBoost algorithm, Multi feature selection, Particle swarm optimization algorithm, Model prediction
Subject terms: Predictive medicine, Cardiovascular diseases
Introduction
Chronic diseases increasingly pose a serious threat to human health and safety. Cardiovascular diseases (hypertension, coronary heart disease, stroke, etc.) and malignant tumors are among the most common chronic diseases, and the main causes of CVD include individual and environmental factors. Because chronic diseases are insidious, long-lasting, and difficult to cure, patients often struggle with high treatment costs1. In recent years, the incidence rate of CVD has remained high and presents complex epidemiological characteristics. With the accelerating aging of the global population, CVD incidence among older adults has shown a steady upward trajectory. At the same time, the prevalence of metabolic diseases such as obesity and diabetes further increases cardiovascular risk. Key modifiable risk factors, including the widespread adoption of energy-dense diets (characterized by excessive sugar, salt, and saturated fat intake) and sedentary behavior, have become increasingly prevalent drivers of CVD pathogenesis2. About one-third of the world's population dies from cardiovascular disease, with China having the highest number of cardiovascular deaths3.
In recent years, machine learning has shown broad application prospects in cardiovascular disease prediction4. Machine learning is a technique that can automatically establish and handle complex relationships in data, and it is currently one of the most important methods in medical research. It learns from existing medical test data or survey results and builds a model5 that is often used to predict disease risk. According to the literature, the most commonly used methods are currently Support Vector Machine (SVM), Random Forest (RF), AdaBoost, K-Nearest Neighbor (KNN), and so on6. V. V. Ramalingam et al.7 demonstrated the effectiveness and feasibility of machine learning algorithms in predicting cardiovascular diseases. Subsequently, many improved methods have emerged, including a hybrid random forest with a linear model (HRFLM)8, an enhanced deep learning assisted convolutional neural network (EDCNN)9, a hybrid decision support system10, a machine learning based diagnostic framework for cardiovascular diseases (MaLCaDD)11, a new hybrid deep learning model12, and a new convolutional neural network (CNN)13.
In the aforementioned studies, while both HRFLM and EDCNN improved prediction accuracy, they still face limitations in handling feature redundancy and hyperparameter tuning. MaLCaDD similarly exhibits limitations in feature selection. The hybrid decision support system and the new convolutional neural network are primarily constrained in model generalization capability and interpretability. The new hybrid deep learning model, in turn, suffers from parameter redundancy and computational complexity. Our study addresses these gaps by integrating PSO for enhanced optimization: we establish a cardiovascular disease prediction model based on the Extreme Gradient Boosting (XGBoost) tree, use multiple feature selection (MFS) to remove redundant features, and propose an improved particle swarm optimization algorithm (DLPSO) combining a dynamic inertia weight with local search to optimize the model, thereby enhancing its stability and accuracy.
The main contributions of this article to cardiovascular disease prediction are:
Improving the particle swarm algorithm by dynamically adjusting its inertia weight and performing a local search around the global optimal position, thereby strengthening its parameter-tuning ability and enhancing the robustness and adaptability of the model in cardiovascular disease prediction.
Tuning the hyperparameters of the XGBoost algorithm with the improved particle swarm optimization algorithm to obtain the optimal hyperparameter combination and improve the accuracy of cardiovascular disease prediction.
Combining two-factor Pearson correlation analysis with XGBoost feature importance ranking for feature selection, obtaining the optimal feature subset to reduce feature redundancy and improve prediction accuracy.
Results
To verify the feasibility and reliability of the algorithm proposed in the article, and to demonstrate its advantages over other similar algorithms, relevant experiments were conducted, and the experimental results were analyzed and discussed.
Experimental environment
The experiment used PyCharm and Python 3.8 (64-bit). The computer configuration is shown in Table 1. The experimental dataset adopts the cardiovascular disease public dataset provided by the Kaggle platform.
Table 1.
Computer configuration.
| Item | Configuration description |
|---|---|
| Operating system | Windows 10 64-bit |
| GPU | Nvidia GEFORCE RTX4090 |
| Memory | 32GB |
| CPU | Intel i9-13900KF |
| CPU clock speed | 5.80 GHz |
| Number of CPU cores | 24 cores |
Data overview and data preprocessing
The dataset in this article comes from the public cardiovascular disease dataset provided by the Kaggle platform, which includes the basic physical condition and examination information of people undergoing physical examinations in recent years. Because the original dataset is large, we used stratified sampling to select data, preserving the class balance and inter-feature correlations; this makes the optimization effects of the research more pronounced. The selected dataset consists of 11,389 samples, including 5,715 healthy individuals and 5,674 cardiovascular disease patients. Each record includes 12 features, such as height, weight, blood glucose, and blood pressure, and 1 label indicating whether the person has cardiovascular disease; predicting this label is a binary classification problem. This paper preprocesses the data with missing value handling, outlier handling, and label encoding. For the convenience of subsequent analysis and research, the features are explained in Table 2.
Table 2.
Explanation of characteristics of various variables.
| Number | Feature | Characterization | Feature type |
|---|---|---|---|
| 1 | Age | Calculated by days | Continuous type |
| 2 | Gender | 1: Male 0: Female | Discrete type |
| 3 | Height | Height | Continuous type |
| 4 | Weight | Weight | Continuous type |
| 5 | ap_hi | Systolic pressure | Continuous type |
| 6 | ap_lo | Diastolic pressure | Continuous type |
| 7 | Cholesterol | 1: Normal 2: Above normal 3: Far higher than normal | Discrete type |
| 8 | Gluc | 1: Normal 2: Above normal 3: Far higher than normal | Discrete type |
| 9 | Smoke | 1: Smoking 0: Non-smoking | Discrete type |
| 10 | Alco | 1: Drinking 0: Non-drinking | Discrete type |
| 11 | Active | 1: Regular exercise 0: Not exercising regularly | Discrete type |
| 12 | Bmi | Body mass index | Continuous type |
| 13 | Cardio | 1: Sick 0: Not sick | Discrete type |
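The stratified sampling step described above can be sketched with scikit-learn's `train_test_split` and its `stratify` option, which draws a subsample while preserving the label proportions. The column names follow the Kaggle dataset; the toy data and sample sizes here are purely illustrative.

```python
# Sketch: stratified subsampling that preserves the cardio class balance.
import pandas as pd
from sklearn.model_selection import train_test_split

# toy stand-in for the full dataset
df = pd.DataFrame({"age": range(1000), "cardio": [0, 1] * 500})

# draw 100 rows while keeping the 50/50 label split
sub, _ = train_test_split(df, train_size=100, stratify=df["cardio"],
                          random_state=42)
print(sub["cardio"].value_counts().to_dict())  # {0: 50, 1: 50}
```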
Missing value handling
Missing values (constituting < 5% of data with MCAR pattern) were addressed through direct deletion rather than imputation to avoid unnecessary computational complexity while maintaining data integrity. For example, in the dataset examined, there are a small number of missing values under the “active” feature. The .dropna() function is used to remove the missing values and their corresponding tuples. The remaining missing values are processed using the same method. After processing all missing values, there are still 10,900 data points left, and a total of 399 sample data points have been deleted.
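The deletion step above amounts to a single `dropna()` call; a minimal sketch with illustrative toy data:

```python
# Sketch of the missing-value step: drop every row (tuple) that contains
# any missing value, as described above.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "active": [1, 0, np.nan, 1],
    "ap_hi":  [120, 140, 130, np.nan],
})
before = len(df)
df = df.dropna()           # remove tuples containing missing values
print(before - len(df))    # number of deleted samples -> 2
```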
Outlier handling
After missing value processing, outlier processing was performed on the continuous variables age, height, weight, ap_hi, and ap_lo. This article uses box plots to detect abnormal data and treats values outside the upper and lower limits of the interquartile range as outliers. The results are shown in Fig. 1, from which it can be observed that every continuous clinical indicator in the dataset contains outliers. To improve model accuracy, all values outside the interquartile-range limits were removed. After removing the outliers, the final dataset contains 10,587 data points; 313 samples were deleted.
Fig. 1.

Box plots of the continuous variables.
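The box-plot rule above is the standard IQR criterion: values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] are dropped. A small sketch with illustrative data:

```python
# Sketch of the IQR outlier filter applied to continuous columns.
import pandas as pd

def drop_iqr_outliers(df, cols):
    """Keep only rows whose values lie within the box-plot whiskers."""
    mask = pd.Series(True, index=df.index)
    for c in cols:
        q1, q3 = df[c].quantile(0.25), df[c].quantile(0.75)
        iqr = q3 - q1
        mask &= df[c].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[mask]

df = pd.DataFrame({"ap_hi": [110, 120, 125, 130, 400]})  # 400 is implausible
clean = drop_iqr_outliers(df, ["ap_hi"])
print(len(clean))  # 4 -> the outlier 400 was removed
```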
Tag encoding
This article employs label encoding to preprocess categorical variables. The dataset contains several categorical features, such as "gender", "smoke", and "alco". To convert these categorical variables into numerical form for model training, a numerical value is assigned to each category of each feature. For example, the feature "gender" has two categories, "Male" and "Female"; "Male" is assigned the value 1 and "Female" the value 0. The remaining categorical features are processed in the same way.
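The encoding step can be expressed as a simple mapping; the values mirror Table 2, and the toy data are illustrative:

```python
# Sketch of the label-encoding step: map each category to an integer.
import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})
df["gender"] = df["gender"].map({"Male": 1, "Female": 0})
print(df["gender"].tolist())  # [1, 0, 0, 1]
```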
Evaluating indicator
In the medical field, commonly used machine learning classification metrics include accuracy, precision, recall, F1 score, and AUC.
This article uses the above indicators to evaluate the model. Accuracy is the proportion of correctly predicted samples out of the total sample size and measures overall performance. Precision is the proportion of samples predicted as positive that are actually positive; it focuses on the correctness of the model's positive predictions. Recall is the proportion of actually positive samples that the model successfully predicts as positive; it measures how well the model covers the positive class. The F1 score considers both precision and recall and is a balanced evaluation metric. AUC is the area under the ROC curve and measures model performance across different thresholds; the closer the AUC is to 1, the better the model. The Receiver Operating Characteristic (ROC) curve shows the trade-off between the probability of correctly predicting the positive class (TPR) and the probability of incorrectly predicting the positive class (FPR). Accuracy, precision, recall, and F1 can all be expressed in terms of the confusion matrix.
The confusion matrix, also known as an error matrix, is a standard form of accuracy evaluation. For K-class classification it is a K × K table that records the classifier's prediction results. The confusion matrix for binary classification is shown in Table 3.
Table 3.
Binary confusion matrix.
| Predicted value | True value: Positive | True value: Negative |
|---|---|---|
| Positive | TP | FP |
| Negative | FN | TN |
The formulas for these indicators are given in Eqs. (1)–(4):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
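Eqs. (1)–(4) can be computed directly from the four confusion-matrix counts; the counts below are illustrative:

```python
# The four metrics of Eqs. (1)-(4) from confusion-matrix counts.
def metrics(tp, fp, fn, tn):
    accuracy  = (tp + tn) / (tp + fp + fn + tn)   # Eq. (1)
    precision = tp / (tp + fp)                    # Eq. (2)
    recall    = tp / (tp + fn)                    # Eq. (3)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (4)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics(tp=40, fp=10, fn=20, tn=30)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
```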
Experimental results and analysis
Benchmark experiment of XGBoost in cardiovascular disease prediction
First, the processed data was divided into training and test sets at an 8:2 ratio. The training set data was then input into the XGBoost model for training to establish a cardiovascular disease prediction model based on XGBoost, which was evaluated on the test set. To ensure the robustness of the experimental results, we adopted a five-fold cross-validation method to obtain the average values of various performance metrics. Under the same experimental environment parameters, the results of each algorithm compared with other classical algorithms are shown in Table 4.
Table 4.
Comparison of model performance indicators.
| Model | Accuracy | Precision | Recall | F1-score | AUC |
|---|---|---|---|---|---|
| KNN | 0.654 | 0.658 | 0.614 | 0.636 | 0.695 |
| LR | 0.720 | 0.743 | 0.657 | 0.697 | 0.781 |
| RF | 0.702 | 0.708 | 0.670 | 0.688 | 0.760 |
| SVM | 0.712 | 0.710 | 0.663 | 0.695 | 0.782 |
| XGBoost | 0.720 | 0.731 | 0.678 | 0.704 | 0.785 |
We found that the XGBoost method performs well on all metrics, with SVM second only to XGBoost. The LR and RF methods differ little from each other, while KNN performs relatively poorly.
Cardiovascular disease prediction experiment based on MFS-DLPSO-XGBoost model
As can be seen from the previous text, the performance of the cardiovascular disease prediction model based on XGBoost is superior to other models. Therefore, in order to further improve the prediction accuracy and model performance, this paper chooses to improve on the basis of XGBoost model.
First, a two-factor Pearson correlation analysis was conducted. As shown in Fig. 2A, the feature pair with the highest correlation coefficient is BMI and weight, with a coefficient of 0.83, showing a clear positive correlation. Subsequently, considering that XGBoost performed relatively well in the prediction task compared with the other models, XGBoost was used to calculate and rank feature importance, making the subsequent feature selection better suited to its predictions. The feature importance ranking is shown in Fig. 2B: the features with the highest and lowest importance are ap_hi and gender, with importance values of 0.22 and 0.02, respectively. Considering both the feature heatmap and the importance ranking, the feature BMI, which has the highest correlation coefficient and low importance, is removed first; the remaining features are then removed one by one in order of increasing importance, and after each removal the remaining features are used as a feature subset and fed into XGBoost for prediction. The accuracy of the prediction results is shown in Fig. 2C: when two features are removed, that is, when the two least important features, gender and height, are removed as redundant, the prediction accuracy reaches its highest value of 75.16%.
Fig. 2.
Experimental results of cardiovascular disease prediction based on model in this article. (A) Heatmap of feature correlation. (B) Ranking of Feature Importance. (C) Prediction results of different feature subsets.
In summary, after multiple feature selection in this article, the features BMI, gender, and height are removed as redundant features, and the remaining features are input into the model prediction as the optimal feature subset, thereby simplifying the model structure and improving prediction efficiency and accuracy.
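The MFS procedure above can be sketched as a two-stage filter: drop the less important member of each highly correlated pair, then drop the least important remaining features. The toy data, importance values, and thresholds below are illustrative; in the paper the importances come from a fitted XGBoost model.

```python
# Sketch of multiple feature selection (MFS): pairwise Pearson correlation
# plus importance-based pruning.
import pandas as pd

def mfs(df, importances, corr_threshold=0.8, n_drop=2):
    """Return the retained feature subset."""
    corr = df.corr(method="pearson").abs()
    drop = set()
    # 1) for each highly correlated pair, drop the less important feature
    for a in corr.columns:
        for b in corr.columns:
            if a < b and corr.loc[a, b] > corr_threshold:
                drop.add(a if importances[a] < importances[b] else b)
    # 2) then drop the n_drop least important remaining features
    rest = sorted((f for f in df.columns if f not in drop),
                  key=lambda f: importances[f])
    drop.update(rest[:n_drop])
    return [f for f in df.columns if f not in drop]

df = pd.DataFrame({"weight": [60, 70, 80, 90], "bmi": [21, 24, 27, 30],
                   "ap_hi": [120, 110, 140, 130], "gender": [0, 1, 1, 0],
                   "height": [160, 170, 162, 168]})
imp = {"weight": 0.15, "bmi": 0.05, "ap_hi": 0.22, "gender": 0.02, "height": 0.03}
print(mfs(df, imp))  # bmi, gender, and height removed -> ['weight', 'ap_hi']
```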
Next, we will optimize the hyperparameters of the XGBoost algorithm using the improved PSO algorithm. The optimization range of XGBoost hyperparameters, default hyperparameters, and hyperparameters optimized by the improved PSO algorithm are shown in Table 5.
Table 5.
Algorithm hyperparameters table.
| Parameter | Default value | Change range | Optimized parameter values |
|---|---|---|---|
| learning_rate | 0.3 | 0.1–0.5 | 0.10 |
| n_estimators | 50 | 40–60 | 58 |
| max_depth | 6 | 1–8 | 6 |
| colsample_bytree | 1 | 0.3–1.0 | 0.31 |
| subsample | 1 | 0.5–1.0 | 0.98 |
| min_child_weight | 1 | 1.0–5.0 | 1.00 |
| reg_alpha | 0 | 0–10 | 0.80 |
| reg_lambda | 1 | 0–10 | 8.61 |
The optimized hyperparameters were then applied to XGBoost for cardiovascular disease prediction again, and the XGBoost models before and after multiple feature selection were compared with the traditional PSO-XGBoost model under the same experimental conditions and parameters. The indicators are shown in Table 6.
Table 6.
Comparison of model performance indicators.
| Model | Accuracy | Precision | Recall | F1-score | AUC |
|---|---|---|---|---|---|
| XGBoost | 0.720 | 0.731 | 0.678 | 0.704 | 0.785 |
| MFS-XGBoost | 0.730 | 0.752 | 0.672 | 0.710 | 0.793 |
| PSO-XGBoost | 0.728 | 0.740 | 0.686 | 0.712 | 0.786 |
| MFS-PSO-XGBoost | 0.740 | 0.758 | 0.691 | 0.723 | 0.793 |
| DLPSO-XGBoost | 0.737 | 0.742 | 0.712 | 0.727 | 0.793 |
| MFS-DLPSO-XGBoost | 0.747 | 0.763 | 0.714 | 0.736 | 0.808 |
To compare the performance of each model more intuitively, a comparison chart of the performance indicators of the six prediction models is provided in Fig. 3A. As shown in Fig. 3A, the cardiovascular disease prediction model based on the MFS-DLPSO-XGBoost algorithm scores higher on all indicators than the other models. To present the AUC values of the methods more clearly and compare their performance, the ROC curves of all prediction methods are plotted on the same graph in Fig. 3B. According to Fig. 3B, the MFS-DLPSO-XGBoost method has the largest area under the ROC curve, with an AUC of 0.808, while the traditional XGBoost method has the smallest area, with an AUC of 0.785.
Fig. 3.
Experimental results of some indicator properties. (A) Bar chart comparing model performance indicators. (B) Comparison of ROC curves of models.
The larger the values of the above indicators, the more correctly predicted samples and the fewer incorrectly predicted samples. Therefore, the larger the values of the indicators, the stronger the predictive ability of the model. As can be seen from the above experiment, the performance of the MFS-DLPSO-XGBoost model proposed in this paper is superior to other models.
To validate the statistical significance of the model improvements, we performed a paired t-test on the AUC values obtained from five-fold cross-validation. The results (p < 0.05) demonstrate that the improved model significantly outperforms the original model, confirming the statistical significance of the metric enhancements.
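The significance check described above is a paired t-test over per-fold AUC values. A sketch with SciPy; the fold AUCs below are illustrative, not the paper's actual values:

```python
# Sketch: paired t-test on five-fold AUC values of the two models.
from scipy.stats import ttest_rel

auc_xgb   = [0.781, 0.787, 0.783, 0.786, 0.788]   # baseline folds (illustrative)
auc_dlpso = [0.804, 0.809, 0.806, 0.811, 0.810]   # improved-model folds
t_stat, p_value = ttest_rel(auc_dlpso, auc_xgb)
print(p_value < 0.05)  # True -> difference is statistically significant
```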
Subsequently, we compared the above experimental results with other literature benchmarks14–17, and the comparison results are shown in Table 7. The methods in the compared literature are respectively referred to as Method 1 to Method 4.
Table 7.
Comparison of experimental results and different literature benchmarks.
| Methods | Accuracy | Precision | Recall | F1-score | AUC |
|---|---|---|---|---|---|
| Proposed method | 0.747 | 0.763 | 0.714 | 0.736 | 0.808 |
| Method1 | 0.739 | 0.720 | 0.780 | 0.750 | 0.803 |
| Method2 | 0.757 | 0.704 | 0.730 | 0.807 | |
| Method3 | 0.718 | 0.719 | 0.718 | 0.717 | |
| Method4 | 0.710 | 0.711 | | | |
The above comparison shows that the method proposed in this paper outperforms other methods in most indicators, proving the effectiveness of our method.
From the comparison results of the above model experiments, it can be seen that the cardiovascular disease prediction model based on MFS-DLPSO-XGBoost proposed in this paper performs better than other models in various prediction performance indicators. This experiment fully verifies the feasibility of our model and reflects its advantages compared to other models.
Limitations
Since this study conducted predictive experiments solely on the Kaggle dataset and did not involve other datasets or real-world clinical environments, the model may have limitations in its generalization capability. Potential limitations include:
Single data source may lead to over-reliance on this specific dataset (including its composition, categories, and data distribution);
Discrepancies between labeling standards and real clinical practice;
Absence of specific patient data (e.g., rare cases or patients with multiple comorbidities).
These factors may impact model performance in actual medical scenarios.
Future works
To address the potential limitations of this study, we plan to undertake the following subsequent work:
Collaborate with local hospitals to obtain multi-center clinical data for external validation;
Conduct domain adaptation studies to enhance model robustness against data discrepancies;
Develop extended datasets including patients with comorbidities;
Adopt deep learning methods based on multimodal data fusion to enhance model performance.
Conclusion
This article constructs a cardiovascular disease prediction model based on the ensemble learning algorithm XGBoost. After multiple feature selection, an improved particle swarm algorithm is used to tune the XGBoost hyperparameters and improve the performance of the prediction model. First, after handling missing values and outliers and applying label encoding to the dataset, a cardiovascular disease prediction model was established with XGBoost. Comparative experiments against four mainstream machine learning algorithms from previous studies, namely logistic regression, support vector machine, random forest, and K-nearest neighbor, demonstrated the superior performance of the XGBoost model in predicting cardiovascular disease. Subsequently, multiple feature selection was performed by combining the two-factor Pearson correlation coefficient with feature importance ranking, and the particle swarm algorithm was introduced to tune the XGBoost hyperparameters. To further improve predictive performance, the particle swarm algorithm was improved and combined with XGBoost to form the DLPSO-XGBoost cardiovascular disease prediction model. A series of comparative experiments showed that the proposed MFS-DLPSO-XGBoost model performs better than the other models, confirming the feasibility of the method. In clinical applications, this model can be integrated into the computing devices of relevant hospital departments; by entering a patient's indicators, the patient's cardiovascular disease risk can be predicted conveniently and quickly. In future clinical practice, the model can, on the one hand, be packaged as software to facilitate use by doctors; on the other hand, it needs to be integrated with patients' historical case data to further improve its predictive performance.
Methods
XGBoost algorithm
XGBoost has good classification performance and running speed18. By using parallel processing for feature selection, the algorithm runs faster and its results are more interpretable19,20. Assume a dataset $D = \{(x_i, y_i)\}$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$, composed of $m$ features and $n$ samples, where $\mathbb{R}^m$ and $\mathbb{R}$ are the $m$-dimensional real vector space and the set of real numbers, respectively. The model prediction is

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \tag{5}$$

In Eq. (5), $f_k$ is a regression tree, $K$ is the total number of regression trees, and $F$ is the space of regression trees.

The objective function $Obj$ is:

$$Obj = \sum_{i=1}^{n} l(\hat{y}_i, y_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{6}$$

In Eq. (6), $l(\hat{y}_i, y_i)$ is the loss function that measures the error between the classification prediction $\hat{y}_i$ and the true value $y_i$, and $\Omega(f_k)$ is the regularization term. XGBoost uses gradient boosting iterations: after each iteration a new regression tree is added, so the result of the $t$-th iteration is:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) \tag{7}$$

Substituting Eq. (7) into Eq. (6) gives the objective function of the $t$-th iteration, $Obj^{(t)}$:

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i,\, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + C \tag{8}$$

In the equation, $C$ is a constant term.

Expanding Eq. (8) as a second-order Taylor series and adding a regularization term to prevent overfitting gives:

$$Obj^{(t)} \simeq \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \tag{9}$$

In Eq. (9), $g_i$ and $h_i$ are the first- and second-order derivatives of the loss function, $\gamma$ is the penalty coefficient of the subtree, $T$ is the number of leaf nodes in the tree, $w_j$ is the leaf weight, and $\lambda$ is the weight penalty coefficient used for regularization. The trained XGBoost model can use the total gain of each feature over all split nodes to compute that feature's importance score, in order to evaluate and rank feature importance.
Particle swarm optimization
Particle swarm optimization is a swarm intelligence algorithm that mimics the foraging behavior of bird flocks21,22. Compared with intelligent optimization algorithms such as ant colony, genetic, and simulated annealing algorithms, it addresses the slow convergence of the ant colony algorithm and the tendency of genetic and simulated annealing algorithms to get stuck in local optima23. PSO has been widely used to optimize support vector machines24, BP neural networks25, extreme gradient boosting26, and other algorithms thanks to its few parameters, simple structure, high efficiency, easy implementation, and ability to handle non-convex problems, and its optimization effect is significant. In this study, PSO was selected for its demonstrated efficacy in navigating large hyperparameter spaces without exhaustive grid search; compared with Bayesian optimization, PSO offers lower computational cost while maintaining robust search capability. Assume a $D$-dimensional search space with $m$ particles in total. The position of the $i$-th particle is the vector $X_i = [X_{i1}, X_{i2}, \ldots, X_{iD}]^T$, its velocity is $V_i = [V_{i1}, V_{i2}, \ldots, V_{iD}]^T$, the best position found by the particle itself is $P_i = [P_{i1}, P_{i2}, \ldots, P_{iD}]^T$, and the best position found by the whole population is $P_g = [P_{g1}, P_{g2}, \ldots, P_{gD}]^T$, where $g$ is the index of the best particle, $g \in \{1, 2, \ldots, m\}$. After initializing the particle swarm, PSO calculates each particle's fitness value and searches for the optimal solution through continuous updates and iterations. After each iteration, particle $X_i$ updates its position and velocity through the individual best $P_i$ and the population best $P_g$. The update formulas are:

$$V_{id}^{k+1} = \omega V_{id}^{k} + c_1 r_1 \left(P_{id} - X_{id}^{k}\right) + c_2 r_2 \left(P_{gd} - X_{id}^{k}\right) \tag{10}$$

$$X_{id}^{k+1} = X_{id}^{k} + V_{id}^{k+1} \tag{11}$$

In Eqs. (10) and (11): $k$ is the iteration number; $\omega$ is the inertia coefficient used to control the convergence and search ability of the algorithm; $r_1$ and $r_2$ are random numbers in $[0, 1]$; $c_1$ and $c_2$ are acceleration factors, i.e., the weights of the acceleration terms that push particles toward the individual best $P_i$ and the population best $P_g$.
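The update rules of Eqs. (10) and (11) can be sketched as a minimal PSO loop, demonstrated here on the sphere function. The swarm size, iteration count, and coefficients $w$, $c_1$, $c_2$ are illustrative, not the paper's settings.

```python
# Minimal PSO sketch implementing Eqs. (10) and (11).
import numpy as np

rng = np.random.default_rng(0)

def pso(fitness, dim=2, m=20, iters=100, w=0.7, c1=1.5, c2=1.5,
        lo=-5.0, hi=5.0):
    X = rng.uniform(lo, hi, (m, dim))            # particle positions
    V = np.zeros((m, dim))                       # particle velocities
    P = X.copy()                                 # individual best positions
    p_fit = np.array([fitness(x) for x in X])
    g = P[p_fit.argmin()].copy()                 # population best position
    for _ in range(iters):
        r1, r2 = rng.random((m, dim)), rng.random((m, dim))
        V = w * V + c1 * r1 * (P - X) + c2 * r2 * (g - X)   # Eq. (10)
        X = X + V                                            # Eq. (11)
        f = np.array([fitness(x) for x in X])
        better = f < p_fit
        P[better], p_fit[better] = X[better], f[better]
        g = P[p_fit.argmin()].copy()
    return g, float(p_fit.min())

best, best_val = pso(lambda x: float(np.sum(x ** 2)))   # minimize ||x||^2
```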
MFS-DLPSO-XGBoost prediction model
In the model construction stage, this article employs two-factor Pearson correlation analysis combined with XGBoost feature importance analysis for multiple feature selection, removing redundant features and improving prediction efficiency and accuracy to a certain extent. The improved particle swarm algorithm is then used to optimize the XGBoost hyperparameters, reducing the randomness of parameter selection and improving the model's classification performance. Figure 4 illustrates the cardiovascular disease prediction model based on MFS-DLPSO-XGBoost and the implementation steps of the proposed prediction method.
Fig. 4.
Overall structure of the MFS-DLPSO-XGBoost prediction model.
Multiple feature selection (MFS)
Considering that a large number of features in a dataset can lead to feature redundancy, the two-factor Pearson correlation coefficient is used to calculate the correlation between features. The feature correlation heatmap visualizes the degree of correlation intuitively: the darker the block where two features intersect, the larger their Pearson correlation coefficient and hence the stronger their correlation. The XGBoost algorithm is then used to generate a feature importance ranking. Feature correlation and importance are considered together for multiple feature selection, redundant features are removed, and the optimal feature subset is used as the input.
Application of PSO-XGBoost model
The disadvantage of XGBoost is that it has numerous parameters and is sensitive to them, making its application more complex. PSO is relatively simple to implement, does not involve complex neural network models, and converges quickly. A particle swarm optimization algorithm with fast convergence, few adjustable parameters, and strong optimization ability is therefore used to optimize the parameters of XGBoost.
The PSO-XGBoost algorithm flow is shown in Fig. 5.
According to Fig. 5, the specific process of the PSO-XGBoost algorithm is as follows:
Define the parameters of PSO, including particle swarm size n, iteration times t, inertia weight ω, determine the hyperparameters that need to be optimized, and set the adjustment range of each parameter.
Randomly initialize the particle swarm; evaluate each particle's fitness value to compare and update the individual and global optimal positions, continuously updating velocity and position; end the particle swarm algorithm when the maximum number of iterations is reached to obtain the optimal parameters.
Divide the medical indicator data used in the experiment into a training set and a testing set, and train and test the optimized XGBoost model to achieve cardiovascular disease prediction.
Fig. 5.

Flow chart of PSO-XGBoost algorithm.
The specific construction method and steps of the PSO-XGBoost model are as follows:
Firstly, define an evaluation function that takes the parameters of the XGBoost classifier as input and returns the classifier's performance metrics under those parameters. This study parses the parameters of the XGBoost classifier into a dictionary and creates a classifier from that dictionary. The classifier is then fitted on the training set, and its accuracy and AUC on the test set are evaluated to assess the optimization effect of PSO on the XGBoost classifier. The selected hyperparameters and their ranges of variation are shown in Table 8.
Table 8.
Hyperparameter selection.
| Parameter | Change range | Default value |
|---|---|---|
| learning_rate | 0.1–0.3 | 0.3 |
| n_estimators | 40–60 | 50 |
| max_depth | 1–8 | 6 |
| colsample_bytree | 0.3–1.0 | 1 |
| subsample | 0.5–1.0 | 1 |
| min_child_weight | 1.0–5.0 | 1 |
| reg_alpha | 0–10 | 0 |
| reg_lambda | 0–10 | 1 |
Secondly, specify the lower and upper limits of each parameter in the model, the initial solution, as well as the particle swarm size and maximum number of iterations. Finally, in the PSO model, the initial particle population is set to 20, the maximum number of iterations is set to 50, and both individual and global learning factors are set to 2.0.
The performance of the established prediction model and the particle swarm optimization algorithm optimized model is evaluated using accuracy, precision, recall, F1 value, and AUC indicators to obtain the optimal model parameter combination.
Improved PSO-XGBoost (DLPSO-XGBoost)
The PSO-XGBoost algorithm proposed above already improves prediction accuracy significantly, but there is still room for improvement: PSO is prone to premature convergence and lacks adaptive control, so the parameters may fall into local optima. Therefore, based on the PSO-XGBoost algorithm, this paper dynamically adjusts the inertia weight of the particle swarm algorithm and performs a local search around the global optimal position to optimize PSO. The specific improvements are as follows:
(1) Dynamically adjust the inertia weight
Traditional PSO may converge prematurely to local optima, especially when the search space of the problem is complex. Dynamically adjusting the inertia weight helps keep the algorithm from getting stuck in local optima in the early stages of the search, thereby increasing exploration of the global optimum.
When setting the dynamic inertia weight, this article first defines the maximum value ω_max and minimum value ω_min of the inertia weight, which represent the trade-off between global search and local search. The inertia weight is initialized to ω_max so that it can be adjusted dynamically as the particles are iteratively optimized. After the historical optimal fitness value and position of each particle are updated, the inertia weight is adjusted.
The dynamic inertia weight is computed by linear interpolation between global search and local search according to the current iteration number:

ω(t) = ω_max − (ω_max − ω_min) × t / n  (12)

In Eq. (12), t represents the current iteration and n the total number of iterations.
This means that in the early iterations the inertia weight is large, prompting particles to search more globally; in the later iterations it is small, encouraging particles to search more locally.
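The schedule above is a one-line function (w_max = 0.9 and w_min = 0.4 follow Table 9; the function name is ours):

```python
def inertia_weight(t, n, w_max=0.9, w_min=0.4):
    """Eq. (12): linearly interpolate the inertia weight from w_max down to
    w_min as iteration t advances toward the total iteration count n."""
    return w_max - (w_max - w_min) * t / n
```

For a 50-iteration run this starts at 0.9, reaches 0.65 at the midpoint, and ends at 0.4.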
(2) Perform local search around the global optimal position
Performing local search around the global optimal position can further refine the search in known good areas and improve the efficiency of local search. This strategy helps to better explore potential local optimal solutions in the search space.
Before performing the local search around the global position, this article initializes the local search parameter s in the parameter definition stage. After the global optimal position is obtained, a random perturbation drawn uniformly from (−s, s) is added to it in order to further explore nearby solutions in the search space.
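A sketch of this perturbation step, under our assumptions (the paper does not state an acceptance rule, so here the perturbed point replaces the global best only when it improves the fitness, and positions are clipped to the search bounds):

```python
import random

def local_search(gbest, fitness, bounds, s=0.05):
    """Perturb the global best by uniform noise in (-s, s) per dimension,
    clip to the search bounds, and keep the candidate only if it improves
    the (to-be-maximised) fitness."""
    candidate = [
        min(max(x + random.uniform(-s, s), lo), hi)
        for x, (lo, hi) in zip(gbest, bounds)
    ]
    return candidate if fitness(candidate) > fitness(gbest) else gbest
```

The greedy acceptance rule guarantees the global best never gets worse while still probing its neighbourhood.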
The algorithm parameter settings are shown in Table 9:
Table 9.
Algorithm parameter settings.
| Parameter | Value |
|---|---|
| Population | 20 |
| Iterations | 50 |
| ω_min | 0.4 |
| ω_max | 0.9 |
| c1 | 2.0 |
| c2 | 2.0 |
| s | 0.05 |
The above improvements enhance the parameter optimization ability of PSO. For the cardiovascular disease prediction problem, the new method adapts better to the large and complex feature space of cardiovascular disease datasets, yielding better XGBoost hyperparameter combinations and more accurate predictions.
Combining the improved PSO with XGBoost refines the PSO search procedure, improves optimization accuracy, and produces better XGBoost hyperparameter combinations.
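Putting the two improvements together, the DLPSO loop can be sketched as follows. This is a toy illustration on a sphere objective, not the authors' implementation: in the paper the objective would be the XGBoost validation performance and each dimension an entry of Table 8, while the defaults below follow Table 9.

```python
import random

def dlpso(objective, bounds, pop=20, iters=50, c1=2.0, c2=2.0,
          w_max=0.9, w_min=0.4, s=0.05):
    """Minimise `objective` with PSO using a linearly decreasing inertia
    weight (Eq. 12) and a local search around the global best position."""
    dim = len(bounds)
    pos = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop)]
    vel = [[0.0] * dim for _ in range(pop)]
    pbest = [p[:] for p in pos]
    pbest_f = [objective(p) for p in pos]
    g = min(range(pop), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]

    for t in range(iters):
        w = w_max - (w_max - w_min) * t / iters   # dynamic inertia weight
        for i in range(pop):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            f = objective(pos[i])
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i][:], f
                if f < gbest_f:
                    gbest, gbest_f = pos[i][:], f
        # local search: uniform perturbation in (-s, s) around the global best
        cand = [min(max(x + random.uniform(-s, s), lo), hi)
                for x, (lo, hi) in zip(gbest, bounds)]
        f = objective(cand)
        if f < gbest_f:
            gbest, gbest_f = cand, f
    return gbest, gbest_f
```

For example, `dlpso(lambda p: sum(x * x for x in p), [(-5.0, 5.0)] * 4)` drives a 4-dimensional sphere objective close to its minimum at the origin.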
Author contributions
K.C., C.L., S.Y., Y.Z., L.L., H.J. and S.Z. designed and supervised the study. S.Y. and Y.Z. wrote the manuscript. K.C. and H.J. performed the experiments. S.Z., L.L. and C.L. analyzed the data. All authors reviewed the manuscript.
Funding
This work was supported by the Institute of Information & Communications Technology Planning & Evaluation(IITP)-Innovative Human Resource Development for Local Intellectualization program grant funded by the Korea government(MSIT)(IITP-2025-RS-2022-00156334,contribution rate:70%). This work was supported by the Basic Research Projects Fund of Liaoning Provincial Department of Education in 2023 (JYTMS20231518).
Data availability
The datasets generated and/or analysed during the current study are available in the Kaggle repository, https://www.kaggle.com/datasets/colewelkins/cardiovascular-disease.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Hoekyung Jung, Email: hkjung@pcu.ac.kr.
Shuo Zhang, Email: syzs1210@163.com.
References
- 1. Bloom, D. E. et al. The economic burden of chronic diseases: Estimates and projections for China, Japan, and South Korea. J. Econ. Ageing (2018).
- 2. Vaduganathan, M., Mensah, G. A., Turco, J. V., Fuster, V. & Roth, G. A. The global burden of cardiovascular diseases and risk: A compass for future health. J. Am. Coll. Cardiol. 80(25), 2361–2371. 10.1016/j.jacc.2022.11.005 (2022).
- 3. Roth, G. A., Mensah, G. A. & Fuster, V. The global burden of cardiovascular diseases and risks: A compass for global action. J. Am. Coll. Cardiol. 76(25), 2980–2981 (2020).
- 4. Bhardwaj, R., Nambiar, A. R. & Dutta, D. A study of machine learning in healthcare. In 2017 IEEE 41st Annual Computer Software and Applications Conference (COMPSAC), Turin, Italy 236–241 (2017).
- 5. Beaulieu-Jones, B. K. et al. Machine learning for patient risk stratification: Standing on, or looking over, the shoulders of clinicians?. npj Digit. Med. 4, 62 (2021).
- 6. Lohachab, A. & Kumar, K. A comparative study of machine learning algorithms for predicting cardiovascular disease. In The Future of Artificial Intelligence and Robotics. ICDLAIR 2023. Lecture Notes in Networks and Systems Vol. 100 (eds Pastor-Escuredo, D. et al.) (Springer, Cham, 2024).
- 7. Ramalingam, V. V., Dandapath, A. & Raja, M. K. Heart disease prediction using machine learning techniques: A survey. Int. J. Eng. Technol. 7(2.8), 684–687 (2018).
- 8. Mohan, S., Thirumalai, C. & Srivastava, G. Effective heart disease prediction using hybrid machine learning techniques. IEEE Access 7, 81542–81554. 10.1109/ACCESS.2019.2923707 (2019).
- 9. Pan, Y., Fu, M., Cheng, B., Tao, X. & Guo, J. Enhanced deep learning assisted convolutional neural network for heart disease prediction on the internet of medical things platform. IEEE Access 8, 189503–189512. 10.1109/ACCESS.2020.3026214 (2020).
- 10. Rani, P. et al. A decision support system for heart disease prediction based upon machine learning. J. Reliable Intell. Environ. 7, 263–275 (2021).
- 11. Rahim, A. et al. An integrated machine learning framework for effective prediction of cardiovascular diseases. IEEE Access 9, 106575–106588. 10.1109/ACCESS.2021.3098688 (2021).
- 12. Krishnan, S., Magalingam, P. & Ibrahim, R. Hybrid deep learning model using recurrent neural network and gated recurrent unit for heart disease prediction. Int. J. Electr. Comput. Eng. 10.11591/IJECE.V11I6.PP5467-5476 (2021).
- 13. Abubaker, M. B. & Babayiğit, B. Detection of cardiovascular diseases in ECG images using machine learning and deep learning methods. IEEE Trans. Artif. Intell. 4(2), 373–382. 10.1109/TAI.2022.3159505 (2023).
- 14. Kırboğa, K. K. & Küçüksille, E. U. Identifying cardiovascular disease risk factors in adults with explainable artificial intelligence. Anatol. J. Cardiol. 27(11), 657–663. 10.14744/AnatolJCardiol.2023.3214 (2023).
- 15. Peng, M. et al. Prediction of cardiovascular disease risk based on major contributing features. Sci. Rep. 13, 4778. 10.1038/s41598-023-31870-8 (2023).
- 16. Zhou, Y. Prediction and analysis of cardiovascular diseases based on XGBoost. In 2024 IEEE 2nd International Conference on Image Processing and Computer Applications (ICIPCA), Shenyang, China 1877–1881. 10.1109/ICIPCA61593.2024.10709218 (2024).
- 17. Athanasiou, M., Sfrintzeri, K., Zarkogianni, K., Thanopoulou, A. C. & Nikita, K. S. An explainable XGBoost-based approach towards assessing the risk of cardiovascular disease in patients with Type 2 Diabetes mellitus. In 2020 IEEE 20th International Conference on Bioinformatics and Bioengineering (BIBE), Cincinnati, OH, USA 859–864. 10.1109/BIBE50027.2020.00146 (2020).
- 18. Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794. 10.1145/2939672.2939785 (2016).
- 19. Zhang, Y. & Haghani, A. A gradient boosting method to improve travel time prediction. Transp. Res. Part C Emerg. Technol. 10.1016/j.trc.2015.02.019 (2015).
- 20. Shwartz-Ziv, R. & Armon, A. Tabular data: Deep learning is not all you need. Inf. Fusion 81, 84 (2022).
- 21. Poli, R., Kennedy, J. & Blackwell, T. Particle swarm optimization: An overview. Swarm Intell. 1, 33–57. 10.1007/s11721-007-0002-0 (2007).
- 22. Eberhart, R. & Kennedy, J. A new optimizer using particle swarm theory. In MHS'95. Proceedings of the Sixth International Symposium on Micro Machine and Human Science, Nagoya, Japan 39–43. 10.1109/MHS.1995.494215 (1995).
- 23. Zhang, Y., Wang, S. & Ji, G. A comprehensive survey on particle swarm optimization algorithm and its applications. Math. Probl. Eng. 2015, 931256. 10.1155/2015/931256 (2015).
- 24. Wang, Y., Wang, D. & Tang, Y. Clustered hybrid wind power prediction model based on ARMA, PSO-SVM, and clustering methods. IEEE Access 8, 17071–17079. 10.1109/ACCESS.2020.2968390 (2020).
- 25. Li, G. et al. A particle swarm optimization improved BP neural network intelligent model for electrocardiogram classification. BMC Med. Inform. Decis. Mak. 21(Suppl 2), 99. 10.1186/s12911-021-01453-6 (2021).
- 26. Jiang, H., He, Z., Ye, G. & Zhang, H. Network intrusion detection based on PSO-Xgboost model. IEEE Access 8, 58392–58401. 10.1109/ACCESS.2020.2982418 (2020).