Data augmentation alters feature importance in XGBoost for CVD prediction

Shuai Chang; Xiangyu Wang; Yu Luo; Lei Jia

doi:10.1038/s41598-025-26228-1

2025 Nov 25;15:41754. doi: 10.1038/s41598-025-26228-1


PMCID: PMC12647713 PMID: 41290908

Abstract

Machine learning models are powerful tools for cardiovascular disease (CVD) prediction, but their performance is often limited by dataset size and class imbalance. While data augmentation techniques can address these issues, their impact on model interpretability and the relative importance of clinical predictors remains poorly understood. This study investigates how different data augmentation strategies affect the performance and feature importance hierarchy of an Extreme Gradient Boosting (XGBoost) model for CVD prediction. An ablation study was conducted using a public CVD dataset. Three XGBoost models were developed and compared: a baseline model trained on the original data, a model trained with data augmented by the Synthetic Minority Over-sampling Technique (SMOTE), and a model trained with data augmented by a Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP). Model performance was evaluated using accuracy, F1-score, and AUC. Feature importance was quantified and compared across models using the Gain metric. All models demonstrated high predictive performance on the independent test set, with the SMOTE-augmented model achieving an accuracy and AUC of 1.0. Data augmentation fundamentally altered the model’s feature importance hierarchy. In the baseline model, ‘oldpeak’ (Gain: 8.25) and ‘slope’ (Gain: 7.01) were the top predictors. In contrast, ‘slope’ became the single most dominant feature in both the SMOTE (Gain: 27.49) and WGAN-GP (Gain: 36.68) augmented models. Data augmentation can therefore substantially reshape the predictive strategy of a high-performance machine learning model. For high-quality datasets, the primary effect of augmentation may be the re-prioritization of predictive features rather than a direct improvement in classification accuracy. These findings underscore the critical need to evaluate the impact of synthetic data on model interpretability before clinical application.

Keywords: Cardiovascular disease, Data augmentation, XGBoost, Generative adversarial network, Feature importance

Subject terms: Health care, Health services, Public health, Quality of life

Introduction

Accurate prediction of cardiovascular disease (CVD) is essential for timely clinical intervention and patient risk stratification^1–4. Machine learning algorithms provide a powerful framework for developing such predictive models from complex clinical data^5–10. However, the performance and generalizability of these models are fundamentally dependent on the quality and quantity of the training data. Datasets that are limited in size or suffer from class imbalance can severely constrain a model’s ability to learn robust decision boundaries^11,12. Therefore, the development of advanced data augmentation techniques to synthetically expand and balance clinical datasets is of critical importance^13–15.

To address class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) is a widely adopted method in clinical predictive modeling^16–18. This algorithm operates by generating synthetic instances through linear interpolation between existing minority class samples and their neighbors¹⁹. While effective in many applications, more advanced generative models have emerged. Generative Adversarial Networks (GANs), for instance, offer a more sophisticated approach²⁰. Instead of simple interpolation, GANs employ a competitive training process between a generator and a discriminator to learn the underlying probability distribution of the data²¹. This allows them to create highly realistic and diverse synthetic samples. GANs have shown considerable success in medical data synthesis, particularly for generating complex, high-dimensional tabular data such as electronic health records^21–23.

Despite their utility, both approaches have inherent limitations. The linear mechanism of SMOTE can introduce noise and generate suboptimal samples that do not respect the natural data manifold, potentially blurring class boundaries²⁴. While GANs offer a more sophisticated solution, their application to heterogeneous clinical data containing mixed data types remains a non-trivial challenge^22,25. More critically, the majority of existing research has focused primarily on whether data augmentation improves aggregate performance metrics like accuracy or AUC^20,26. However, the impact of data augmentation on the internal decision-making processes of models remains largely unexplored. Recent studies have begun to investigate the intersection of data augmentation and model interpretability, questioning how synthetic data influences what a model learns^27–29. A crucial unanswered question is whether these techniques alter the relative importance of clinical predictors, thereby changing the model’s underlying predictive strategy²⁸. This knowledge gap is significant, as understanding a model’s feature reliance is essential for its clinical translation and trustworthiness.

This study aimed to systematically evaluate the impact of different data augmentation strategies on both the performance and the feature importance of an Extreme Gradient Boosting (XGBoost) model for CVD prediction. An ablation study was designed to compare a baseline model against models trained with data augmented by SMOTE and a Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP). This framework allows for a direct comparison of how traditional oversampling and advanced generative modeling influence the final predictive system. The investigation was guided by the hypothesis that these augmentation techniques would significantly alter the model’s feature importance hierarchy compared to the baseline. Such an analysis provides critical insights into how synthetic data reshapes a model’s predictive strategy, informing the development of more robust and interpretable clinical decision support tools.

Materials and methods

Database

Dataset description

This study utilizes the “Cardiovascular Disease Dataset” from Mendeley Data³⁰, which is also publicly available on the Kaggle platform³¹. The dataset is well documented and has been widely used by the data science community³¹. It is available under the CC BY 4.0 license³², which allows users to share, adapt, and build upon the material for any purpose, including commercial use, provided that appropriate credit is given and the license terms are followed. It comprises 13 indicators across 1000 subjects, including continuous and categorical features relevant to cardiovascular health. These 13 pre-defined attributes were used directly as the input features for the predictive models. No additional statistical feature extraction or engineering was performed in this study.

Data preprocessing and partitioning

The raw dataset (n = 1000) was preprocessed to address invalid entries and missing values. The “Slope of the peak exercise ST segment” variable, defined with three valid categories in Table 1, contained 180 entries with a non-conforming “0” value. These entries were removed. Subsequently, an additional 54 entries with missing values for the “Serum cholesterol” variable were also excluded. After these cleaning steps, a final dataset of 766 valid entries was obtained for subsequent analysis. The detailed summary of this processed dataset is presented in Fig. 1.
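The two exclusion rules can be sketched as a simple filter. This is a minimal illustration, assuming each record is a dictionary keyed by lower-cased versions of the assigned codes in Table 1 and that a missing cholesterol value is stored as `None`. The toy records and the helper name `clean_records` are hypothetical, and the way the authors' scripts encode missing values is not stated in the text.

```python
VALID_SLOPES = {1, 2, 3}  # valid categories for "Slope" listed in Table 1


def clean_records(records):
    """Drop rows with an invalid slope code, then rows with missing cholesterol."""
    valid_slope = [r for r in records if r["slope"] in VALID_SLOPES]
    complete = [r for r in valid_slope if r["serumcholestrol"] is not None]
    return complete


# Hypothetical toy records, not taken from the dataset.
toy = [
    {"slope": 2, "serumcholestrol": 240.0, "target": 1},
    {"slope": 0, "serumcholestrol": 180.0, "target": 0},  # invalid slope code
    {"slope": 3, "serumcholestrol": None, "target": 1},   # missing cholesterol
    {"slope": 1, "serumcholestrol": 199.0, "target": 0},
]
cleaned = clean_records(toy)
print(len(toy), "->", len(cleaned))
```

Applying the slope filter first mirrors the order described above, so the count of rows removed for missing cholesterol refers only to rows that already have a valid slope code.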

Table 1.

Description of the attributes in the “Cardiovascular Disease Dataset”³⁰.

Attribute	Assigned code	Unit	Type of the data
Age	Age	In Years	Numeric
Gender	Gender	1, 0 (0 = female, 1 = male)	Binary
Chest pain type	Chestpain	0, 1, 2, 3 (Value0: typical angina Value1: atypical angina Value2: non-anginal pain Value3: asymptomatic)	Nominal
Resting blood pressure	RestingBP	94–200 (in mm Hg)	Numeric
Serum cholesterol	Serumcholestrol	126–564 (in mg/dl)	Numeric
Fasting blood sugar	Fastingbloodsugar	0, 1 (fasting blood sugar > 120 mg/dl: 0 = false, 1 = true)	Binary
Resting electrocardiogram results	Restingrelectro	0, 1, 2 (Value0: normal, Value1: having ST-T Wave abnormality (T Wave inversions and/or ST elevation or depression of > 0.05 mV), Value2: showing probable or definite left ventricular Hypertrophy by Estes’ criteria)	Nominal
Maximum heart rate achieved	Maxheartrate	71–202	Numeric
Exercise induced angina	Exerciseangia	0, 1 (0 = no, 1 = yes)	Binary
Oldpeak = ST	Oldpeak	0–6.2	Numeric
Slope of the peak exercise ST segment	Slope	1, 2, 3 (1 - upsloping, 2 - flat, 3 - downsloping)	Nominal
Number of major vessels	Noofmajorvessels	0, 1, 2, 3	Numeric
Classification	Target	0, 1 (0 = Absence of Heart Disease, 1 = Presence of Heart Disease)	Binary


To prevent information leakage, the entire dataset was partitioned before any model development. The 767 samples were split into a development set and a final hold-out test set. The partition ratio was 79.92% for the development set (n = 613) and 20.08% for the test set (n = 154). Stratified random sampling was applied based on the binary target variable to preserve the original class distribution in both sets. A fixed random seed (random_state = 42) was used to ensure the reproducibility of this split. The resulting positive case distribution was 68.68% (421 of 613) in the development set and 68.83% (106 of 154) in the test set. All subsequent procedures, including data augmentation and model training, were performed exclusively on the development set. The hold-out test set was used only once for the final, unbiased evaluation of the model. This distribution, with a majority-to-minority class ratio of approximately 2.2:1, represents a moderate class imbalance that can bias machine learning models. Effectively addressing this imbalance is a central objective of this study.
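The stratified split can be illustrated with a short standard-library sketch. It assumes the class totals implied by the counts above (421 + 106 positives, and by subtraction 192 + 48 negatives). The helper `stratified_split` is hypothetical and is not the routine used in the authors' scripts, so the individual samples it selects will differ from theirs even with the same seed.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=42):
    """Return (dev_idx, test_idx), sampling test_fraction of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    dev_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        dev_idx.extend(members[n_test:])
    return sorted(dev_idx), sorted(test_idx)


# Class totals derived from the reported counts: 527 positives, 240 negatives.
labels = [1] * 527 + [0] * 240
dev_idx, test_idx = stratified_split(labels, test_fraction=154 / 767)

dev_pos = sum(labels[i] for i in dev_idx)
test_pos = sum(labels[i] for i in test_idx)
print(len(dev_idx), len(test_idx))                        # split sizes
print(dev_pos / len(dev_idx), test_pos / len(test_idx))   # positive rates
print(dev_pos / (len(dev_idx) - dev_pos))                 # majority-to-minority ratio
```

Because the test fraction is applied within each class, both partitions keep close to the same positive rate, which is the property the stratified sampling is meant to guarantee.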

Predictive modeling framework

This study employed a unified framework for model development and evaluation across all experimental conditions to ensure a fair comparison (Fig. 2). The core classification model was the XGBoost classifier, implemented using the XGBClassifier class of the xgboost library.

Fig. 2 — Detailed Predictive Model Development and Validation Pipeline.

A two-stage validation strategy was designed to find optimal hyperparameters and provide an unbiased estimate of model performance. First, a hyperparameter optimization (HPO) process, using RandomizedSearchCV with 5-fold cross-validation, was conducted on the development set (n = 613) to find the optimal parameter combination. The search space for key XGBoost hyperparameters is detailed in the supplementary materials. The Area Under the Receiver Operating Characteristic Curve (AUC) was used as the scoring metric to identify the best parameter combination.

Second, the performance of the model with these optimized hyperparameters was evaluated using a 5-fold cross-validation protocol on the development set. To prevent data leakage, a dynamic data augmentation strategy was integrated within this process. For each of the 5 folds, the augmentation algorithm was trained exclusively on the training partition of that specific fold. The resulting synthetic data was then combined with the original training partition to form the final training set for the XGBoost model. The model was subsequently evaluated on the untouched validation partition of that fold. This dynamic, nested approach ensures that the model is always validated on data it has never seen in any form, either original or synthetic.
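The fold-wise procedure can be sketched as follows. This is a simplified illustration under stated assumptions: folds are contiguous and unshuffled, and `augment` is a hypothetical stand-in that only tags copies of the training rows, where a real SMOTE or WGAN-GP step would be fitted on those rows and generate new samples.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, val_idx) for contiguous, near-equal folds."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def augment(train_rows):
    """Hypothetical stand-in for SMOTE or WGAN-GP.

    A real augmenter would be fitted on train_rows only and would generate
    new samples; tagged copies are used here to keep the sketch runnable.
    """
    return [{"source_id": row["id"], "synthetic": True} for row in train_rows]


rows = [{"id": i, "synthetic": False} for i in range(613)]
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(len(rows))):
    train_rows = [rows[i] for i in train_idx]
    synthetic_rows = augment(train_rows)        # sees this fold's training rows only
    final_train = train_rows + synthetic_rows   # original + synthetic
    val_ids = {rows[i]["id"] for i in val_idx}
    # No validation sample contributes to training, directly or via synthesis.
    used_ids = {r.get("id", r.get("source_id")) for r in final_train}
    assert val_ids.isdisjoint(used_ids)
    print(fold, len(train_rows), len(synthetic_rows), len(val_idx))
```

The key design point is that the augmenter is called inside the loop, after the validation partition has been set aside, so synthetic samples can never be derived from validation data.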

The final model for each experimental condition was trained on the entire development set, augmented with synthetic data generated from the development set itself. This final, trained model was then evaluated a single time on the independent hold-out test set (n = 154) to report the final, unbiased performance metrics. Model performance was quantified using accuracy, F1-score, and AUC.
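As a reminder of what the AUC measures, the sketch below computes it from its rank-based definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc` and the toy scores are hypothetical, and the pairwise loop is written for clarity rather than speed.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores: one positive (0.4) is ranked below one negative (0.5).
y_true = [1, 1, 1, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.5, 0.1]
auc = roc_auc(y_true, y_score)
print(auc)  # 5 of 6 pairs are ordered correctly
```

Under this definition an AUC of 1.0 means every positive case was scored above every negative case, which is a statement about ranking and does not depend on the classification threshold.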

Ablation study: data augmentation strategies

To systematically evaluate the contribution of the data augmentation technique, two distinct methods were implemented and compared against a baseline model trained on the original data. Both augmentation methods operated on a pre-processed version of the data, where continuous features were standardized using StandardScaler and categorical features were transformed via OneHotEncoder. The specific sample sizes and augmentation ratios applied at each stage of the modeling process for the WGAN-GP framework are summarized in Table 2.
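The two transformations can be illustrated with a small standard-library sketch. It assumes z-score scaling with the population standard deviation, which is the convention StandardScaler follows, and a fixed category order for the one-hot encoding. The helper names and toy values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores plus the fitted mean and standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (ddof = 0)
    if sigma == 0:
        return [0.0 for _ in values], mu, sigma
    return [(v - mu) / sigma for v in values], mu, sigma


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over the given categories."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


# Hypothetical ages and a slope code using the categories from Table 1.
z_scores, mu, sigma = standardize([40, 50, 60])
slope_vector = one_hot(2, [1, 2, 3])
print(z_scores, mu, sigma)
print(slope_vector)  # [0, 1, 0]
```

Keeping the fitted `mu` and `sigma` matters because the same values are needed later to map synthetic samples back to the original scale.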

Table 2.

Summary of the WGAN-GP data augmentation strategy and sample sizes at different modeling stages.

Modeling stage	Input data (original samples)	Generated synthetic samples	Augmentation ratio (synthetic : original)	Total training samples
HPO	613 (full development set)	613	1:1	1226
5-fold CV (per fold)	~490 (4/5 of development set)	~490	1:1	~980
Final model training	613 (full development set)	1226	2:1	1839


WGAN-GP-based augmentation

The primary data augmentation approach utilized a WGAN-GP. This generative model consists of two neural networks: a Generator (G) and a Critic (D), trained adversarially. The Generator network was a multi-layer perceptron (MLP) with three hidden layers (128, 256, 512 neurons) using ReLU activation functions. It takes a 100-dimensional latent vector as input and outputs a synthetic data sample. The Critic network was also an MLP with three hidden layers (512, 256, 128 neurons) using LeakyReLU activation functions, which outputs a scalar value representing the “realness” of the input data.

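The generator and critic networks were trained adversarially to optimize the Wasserstein-1 distance. The Critic’s objective is to maximize the difference between its output for real samples ($x$) and generated samples ($\tilde{x} = G(z)$). A gradient penalty term was added to enforce the Lipschitz constraint, ensuring stable training. In the standard WGAN-GP formulation, the Critic’s loss function is defined as:

$$L_D = \mathbb{E}_{\tilde{x} \sim P_g}\left[D(\tilde{x})\right] - \mathbb{E}_{x \sim P_r}\left[D(x)\right] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]$$

where $P_r$ and $P_g$ denote the real and generated data distributions, $\hat{x}$ is sampled uniformly along straight lines between pairs of real and generated samples, and $\lambda$ is the gradient penalty coefficient.

The Generator’s objective is to produce samples that the Critic evaluates as increasingly real. Its loss function is:

$$L_G = -\mathbb{E}_{z \sim p(z)}\left[D(G(z))\right]$$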

During training, the Critic was updated 5 times for every single update of the Generator, using the RMSprop optimizer with a learning rate of 0.00005. The gradient penalty coefficient (λ) was set to 10. After training, the Generator was used to produce synthetic samples. These generated samples, which exist in the standardized and one-hot encoded space, were then subjected to a post-processing pipeline. This pipeline applied inverse transformations to revert the data to its original scale and format. Continuous values were clipped to fall within plausible physiological ranges, and categorical features were mapped to the nearest valid discrete category to ensure data integrity.
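The post-processing steps can be sketched as below. This is an illustration under stated assumptions: the clipping bounds are taken from the RestingBP range in Table 1, the scaler statistics are hypothetical, and "nearest valid category" is implemented as the position of the largest value in a feature's one-hot block, which is one common choice and may not match the authors' implementation.

```python
def inverse_standardize(z, mu, sigma):
    """Map a z-score back to the original scale."""
    return z * sigma + mu


def clip(value, low, high):
    """Constrain a value to the closed interval [low, high]."""
    return max(low, min(high, value))


def decode_one_hot(block, categories):
    """Pick the category whose one-hot position has the largest value."""
    best = max(range(len(block)), key=lambda i: block[i])
    return categories[best]


RESTING_BP_RANGE = (94.0, 200.0)   # range listed for RestingBP in Table 1
SLOPE_CATEGORIES = [1, 2, 3]       # valid slope codes from Table 1

# Hypothetical generator output in the standardized / one-hot space.
generated = {"restingbp_z": 4.0, "slope_block": [0.1, 0.7, 0.2]}
bp_mu, bp_sigma = 130.0, 20.0      # hypothetical scaler statistics

resting_bp = clip(
    inverse_standardize(generated["restingbp_z"], bp_mu, bp_sigma),
    *RESTING_BP_RANGE,
)
slope = decode_one_hot(generated["slope_block"], SLOPE_CATEGORIES)
print(resting_bp, slope)  # 200.0 2
```

In this toy case the raw inverse-scaled value is 210, which exceeds the table's upper bound and is clipped to 200, while the soft one-hot block resolves to the "flat" slope code.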

SMOTE-based augmentation

For the ablation study, SMOTE was implemented as a comparative method. SMOTE is a well-established algorithm that addresses class imbalance by creating synthetic instances of the minority class. It operates by selecting a minority class instance and identifying its k-nearest minority class neighbors. A new synthetic instance is then created by interpolating between the selected instance and one of its neighbors.
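The interpolation step can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and k = 5 neighbors (a common default; the value used in the study is not stated in the text). The function name `smote_sample` and the toy points are hypothetical, and a full implementation would repeat this step until the classes are balanced.

```python
import math
import random


def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point between a minority sample and a neighbor."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    # k nearest minority-class neighbors of the selected sample
    neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # interpolation weight in [0, 1)
    synthetic = [b + gap * (n - b) for b, n in zip(base, neighbor)]
    return base, neighbor, synthetic


# Hypothetical minority-class points in a two-feature standardized space.
minority = [[0.0, 0.0], [1.0, 0.5], [0.2, 0.9], [0.8, 0.1], [0.5, 0.5], [0.1, 0.3]]
base, neighbor, synthetic = smote_sample(minority, k=3, rng=random.Random(42))
print(base, neighbor, synthetic)
```

Every synthetic point lies on the line segment joining two existing minority samples, which is why SMOTE cannot produce values outside the convex hull of the minority class.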

In this study’s framework, SMOTE was applied within the same dynamic cross-validation structure as WGAN-GP. For each training fold, SMOTE was used to oversample the minority class until it was perfectly balanced with the majority class. The algorithm was applied to the same standardized and one-hot encoded data space used for the WGAN-GP. The synthetic samples generated by SMOTE were then subjected to the exact same post-processing pipeline (inverse scaling, clipping, and inverse encoding) to ensure a direct and fair comparison of the two augmentation techniques’ effects on the final XGBoost model performance.

Baseline model

To establish a performance benchmark, a baseline model was developed. This model strictly followed the predictive modeling framework described in the “Predictive modeling framework” section. The crucial distinction for the baseline model is the complete omission of any data augmentation. The HPO, the 5-fold cross-validation, and the final model training were all conducted exclusively on the original, unaltered development set data. This approach isolates the impact of data augmentation by ensuring it is the sole variable differentiating this model from the WGAN-GP and SMOTE-enhanced models. To mitigate the impact of class imbalance at the algorithmic level, the XGBoost scale_pos_weight hyperparameter was also optimized for the Baseline model. This parameter rescales the weight of positive-class samples relative to negative-class samples during training, providing an algorithm-level strategy for handling imbalanced data.

Feature importance analysis

To interpret the model’s decision-making process, a feature importance analysis was conducted. This analysis was performed on the final trained model from each of the three experimental conditions: the Baseline, SMOTE-augmented, and WGAN-GP-augmented models.

The importance of each feature was quantified using the “Gain” metric native to the XGBoost algorithm. Gain represents the average improvement in model performance contributed by a feature across all splits where that feature was used. A higher Gain value indicates a greater contribution to the model’s predictive accuracy. This comparative analysis across the three models was designed to determine whether data augmentation techniques altered the relative importance of clinical predictors, thereby offering insights into changes in the model’s underlying predictive strategy.
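The averaging behind the Gain metric can be shown with a toy calculation. The split records below are hypothetical and do not come from the study's models. In the xgboost Python package, per-feature averages of this kind are exposed through `Booster.get_score(importance_type="gain")`; whether the authors' scripts use that call is an assumption.

```python
from collections import defaultdict


def average_gain(splits):
    """Average the loss reduction of all splits that use each feature."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for feature, gain in splits:
        totals[feature] += gain
        counts[feature] += 1
    return {feature: totals[feature] / counts[feature] for feature in totals}


# Hypothetical (feature, gain) pairs collected over all trees of a model.
toy_splits = [
    ("slope", 30.0),
    ("slope", 20.0),
    ("oldpeak", 8.0),
    ("chestpain", 4.0),
    ("oldpeak", 6.0),
]
gain = average_gain(toy_splits)
ranking = sorted(gain, key=gain.get, reverse=True)
print(gain)
print(ranking)  # ['slope', 'oldpeak', 'chestpain']
```

Because Gain is an average per split, a feature used in a few very informative splits can outrank one used in many weak splits, which is worth keeping in mind when comparing rankings across models.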

For full reproducibility, the Python scripts and data used for all modeling and analysis described in this study are hosted in a public GitHub repository at the following URL: https://github.com/Michael1006-dev/CVD_Prediction_with_Generative_Models.

Results

HPO and model performance on cross-validation

The HPO process yielded distinct optimal parameter sets for each of the three models, as detailed in Table 3. Notably, the optimal scale_pos_weight parameter, which addresses class imbalance, was highest for the Baseline model (20), lower for the SMOTE model (10), and lowest for the WGAN-GP model (1), reflecting the different class distributions of their respective training data.

Table 3.

Optimal hyperparameter sets.

Hyperparameter	Baseline model	SMOTE model	WGAN-GP model
subsample	0.7	0.8	0.9
scale_pos_weight	20	10	1
reg_lambda	1.5	1.5	0.5
reg_alpha	0.01	0	0.01
n_estimators	300	200	200
max_depth	7	5	7
learning_rate	0.05	0.1	0.1
gamma	0.2	0	0.1
colsample_bytree	0.9	0.8	0.9


Note: The table lists the final hyperparameter values identified through the RandomizedSearchCV process for the Baseline, SMOTE, and WGAN-GP models.

The subsequent 5-fold cross-validation on the development set provided a robust estimate of each model’s generalization capability. The average performance metrics from this validation are visualized in Fig. 3. The SMOTE-augmented model achieved the highest average validation AUC (0.993), F1-score (0.977), and accuracy (0.969). The Baseline model yielded a validation AUC of 0.991, while the WGAN-GP model produced a validation AUC of 0.990. All models demonstrated high performance on the training data, with the WGAN-GP model showing the largest performance gap between the training set (0.99998 AUC) and the validation set (0.990 AUC).

Fig. 3 — Comparative Analysis of Model Performance in Cross-Validation. Radar charts illustrating the average performance metrics (Accuracy, F1 Score, AUC) for the Baseline, SMOTE-augmented, and WGAN-GP-augmented models. Each chart compares the performance on the training folds (CV Train Avg.) with the validation folds (CV Validation Avg.).

Performance on the independent hold-out test set

The final, unbiased evaluation was conducted on the independent hold-out test set. A comprehensive summary comparing the cross-validation and final test set performance for all three models is presented in Table 4.

Table 4.

Comprehensive model performance comparison.

Model type	CV validation accuracy	CV validation F1 score	CV validation AUC	Final test accuracy	Final test precision	Final test recall	Final test F1 score	Final test AUC
Baseline	0.953	0.966	0.991	0.981	0.973	1	0.986	1
SMOTE	0.969	0.977	0.993	1	1	1	1	1
WGAN-GP	0.959	0.971	0.990	0.987	0.981	1	0.991	0.9996


Note: The table summarizes the key performance metrics for all three models, comparing the average results from the 5-fold cross-validation with the final results on the independent hold-out test set.

On the independent test set, both the Baseline and SMOTE-augmented models achieved a perfect AUC of 1.000. The SMOTE model further attained perfect scores across all other metrics, including accuracy, precision, recall, and F1-score. The Baseline model achieved an accuracy of 0.981 and an F1-score of 0.986. The WGAN-GP model also demonstrated high performance with an AUC of 0.9996, an accuracy of 0.987, and an F1-score of 0.991. All three models achieved a perfect recall of 1.000, correctly identifying all positive cases in the test set. The specific classification outcomes for each model are detailed in the confusion matrices shown in Fig. 4.
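The reported test metrics can be cross-checked with a short worked calculation. The confusion-matrix counts below are not read from Fig. 4; they are inferred from the test set composition (106 positives and, by subtraction, 48 negatives), the perfect recall, and the reported accuracies, so they should be treated as an assumption. Under that assumption the computed metrics agree with Table 4 to within about 0.001.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Inferred counts (assumption): recall of 1 implies no false negatives,
# and the accuracy fixes the number of false positives out of 154 samples.
inferred = {
    "Baseline": dict(tp=106, fp=3, fn=0, tn=45),
    "SMOTE": dict(tp=106, fp=0, fn=0, tn=48),
    "WGAN-GP": dict(tp=106, fp=2, fn=0, tn=46),
}
results = {name: metrics(**counts) for name, counts in inferred.items()}
for name, values in results.items():
    print(name, {key: round(value, 4) for key, value in values.items()})
```

Read this way, the differences between the three models on the test set amount to at most three false-positive predictions, which helps put the small metric gaps into perspective.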

Fig. 4 — Confusion Matrices on the Final Hold-Out Test Set. Heatmaps showing the classification results for the Baseline, SMOTE-augmented, and WGAN-GP-augmented models on the independent test set.

Comparative feature importance analysis

The feature importance, quantified by the Gain metric, was extracted from each of the three final models to understand the impact of data augmentation on predictive strategy. The results of this comparative analysis are illustrated in Fig. 5.

Fig. 5 — Comparative Feature Importance. Bar charts displaying the feature importance scores (Gain) for the top predictors in the Baseline, SMOTE-augmented, and WGAN-GP-augmented models.

Across all models, slope, oldpeak, and chestpain were consistently identified as highly influential predictors. However, the data augmentation techniques significantly altered the relative importance of these features. In the Baseline model, oldpeak (Gain: 8.25) and slope (Gain: 7.01) were the two most important features. In contrast, for both the SMOTE and WGAN-GP models, slope became the overwhelmingly dominant feature, with its Gain value increasing to 27.49 and 36.68, respectively. Concurrently, the importance of oldpeak was substantially reduced in the augmented models. The feature chestpain also saw its importance increase in the SMOTE-augmented model. The importance of other features remained relatively low and stable across the three modeling conditions.

Discussion

This study demonstrates that data augmentation techniques can significantly influence high-performance predictive models for CVD. The results confirm that the selected clinical dataset contains highly separable classes, enabling all models, including the baseline, to achieve exceptional classification accuracy. Both algorithmic oversampling with SMOTE and generative modeling with WGAN-GP proved to be effective augmentation strategies. Critically, this investigation reveals that data augmentation can fundamentally alter a model’s predictive mechanism. The introduction of synthetic data prompted a notable shift in feature importance, redirecting the model’s focus from a balanced reliance on multiple predictors to a dominant dependence on the ‘slope’ feature. This suggests that the primary impact of augmentation in this high-quality dataset was not merely to improve raw performance but to reshape the model’s learned feature hierarchy.

A key finding of this study is the exceptionally high performance achieved by all models on the independent test set. The baseline model, without any data augmentation, demonstrated near-perfect classification, a result that underscores the high quality of the selected dataset. The quality of datasets is widely recognized as a vital factor for the performance of machine learning models^33,34. In contexts such as water quality forecasting and oncology, data quality has been shown to be a primary determinant of model reliability and generalizability^33,35. When a dataset possesses highly separable classes and minimal noise, as is the case in this study, machine learning algorithms can establish effective decision boundaries with relative ease. This condition, however, presents a challenge for evaluating data augmentation techniques, as the potential for performance improvement is inherently limited^36,37. Consequently, the marginal gains in accuracy observed after applying SMOTE and WGAN-GP do not diminish their potential value. Instead, this finding highlights that for high-quality datasets, the primary contribution of augmentation may shift from enhancing raw predictive accuracy to reshaping the model’s internal predictive strategy, as evidenced by the observed changes in feature importance.

A detailed comparison of the data augmentation strategies reveals nuanced differences in their impact. During cross-validation, the SMOTE-augmented model demonstrated slightly superior average performance compared to the WGAN-GP model. SMOTE, an algorithm based on linear interpolation, is effective at generating synthetic samples along the decision boundary between classes³⁸. Its success in this context suggests that the feature space of the dataset may be relatively linear. However, SMOTE and its variants are also known to be sensitive to noise and can sometimes generate suboptimal or overly generalized samples^39,40. In contrast, WGAN-GP is a more sophisticated generative model designed to learn and replicate the underlying probability distribution of the training data^41,42. The HPO results provide critical insight into its mechanism; the WGAN-GP model required a scale_pos_weight of 1, indicating that the synthetically generated data had already effectively balanced the class distribution. This suggests WGAN-GP produced high-fidelity samples that closely mirrored the true data distribution^42,43. While its full potential may not be realized on a dataset with such clear class separation, its ability to create a balanced and representative feature space without manual class weighting points to its significant potential for more complex and severely imbalanced real-world medical datasets^44,45.

The data augmentation process induced a significant shift in the model’s feature importance hierarchy. In the baseline model, oldpeak and slope were the two most influential predictors, with comparable importance scores. However, in both the SMOTE and WGAN-GP augmented models, slope became the overwhelmingly dominant feature. This phenomenon suggests that the introduction of synthetic data can amplify the signals of certain highly predictive features^46,47. Generative models, by learning and replicating the underlying data distribution, may inadvertently reinforce the statistical relationship between the most salient features and the target variable, leading the classifier to develop a strong reliance on them⁴⁸.

This alteration in feature importance requires careful clinical interpretation. The XGBoost Gain metric, while useful, can sometimes be biased and may not fully represent the true causal relationships within the data^49,50. The diminished importance of features like oldpeak and the consistently low importance of exercise-induced angina in the model do not negate their established clinical significance. ST-segment depression (oldpeak) during exercise is a well-documented marker for myocardial ischemia⁵¹. Similarly, exercise-induced angina is a cardinal symptom of coronary artery disease⁵². The J-wave pattern (slope), while a powerful predictor, is also known to be a dynamic phenomenon influenced by physical training and heart rate, and its interpretation is often contextual⁵³.

Therefore, the model’s heavy reliance on slope after augmentation should be interpreted as a shift in its predictive strategy, not as a redefinition of clinical risk factors. The low importance scores for other variables might indicate complex, non-linear interactions that are not fully captured by the feature importance metric^50,54. A model’s internal logic, especially in a “black-box” algorithm like XGBoost, does not always directly map to established clinical pathophysiology⁴⁹. The findings underscore the importance of using model interpretability tools as a guide for understanding model behavior, rather than as a definitive measure of clinical relevance.

The primary limitation of this study is its reliance on a single, high-quality public dataset. While this allowed for a controlled evaluation of the augmentation techniques, it restricts the generalizability of the findings. Real-world clinical data, particularly from multi-center studies, often exhibit significant heterogeneity, noise, and data-sharing constraints, which pose substantial challenges to model performance and validation^55–57. For instance, related fields of cardiac research have successfully employed advanced deep learning and feature processing techniques on signal-based data, such as electrocardiogram and photoplethysmography (PPG)^58,59, highlighting potential avenues for future work beyond tabular datasets. The conclusions drawn from this clean dataset may not be directly transferable to more complex and less standardized clinical environments. Future research should therefore focus on validating the WGAN-GP framework on larger, more diverse clinical datasets. Applying these methods to multi-center data is essential to assess their robustness and true potential for improving risk prediction in heterogeneous patient populations^19,60. Furthermore, exploring more advanced generative models could yield significant benefits. For instance, conditional GANs could be utilized to generate synthetic data tailored to specific patient subgroups or underrepresented classes, offering more precise control over the data augmentation process^61,62. Such investigations will be critical for advancing the development of reliable and generalizable AI tools for clinical decision support.

Conclusion

This study systematically compared data augmentation techniques for CVD prediction. An XGBoost model was trained on an original high-quality dataset, a dataset augmented by SMOTE, and another augmented by a WGAN-GP. The study evaluated not only the classification performance of each model but also the resulting changes in the clinical feature importance hierarchy.

All models demonstrated exceptional predictive accuracy on the hold-out test set. The primary finding was that both SMOTE and WGAN-GP fundamentally altered the model’s predictive strategy by significantly increasing its reliance on the ‘slope’ feature. This suggests that for high-quality datasets, the main impact of augmentation is the reshaping of a model’s feature hierarchy rather than a simple improvement in performance. Future research should validate these augmentation frameworks on larger, more heterogeneous clinical datasets to assess their real-world generalizability.

Author contributions

SC and XW conceived the study and drafted the manuscript. YL was responsible for the formal analysis. LJ supervised the project and was responsible for the critical revision of the manuscript. All authors have read and approved the final version of the manuscript.

Funding

This research was funded by the Jiangxi Provincial Colleges and Universities Humanities and Social Sciences Research Project (Grant No.: TY24207) and the Jiangxi Provincial Graduate Education and Teaching Reform Research Project (Grant No.: JXYJG-2024-121).

Data and code availability

The “Cardiovascular Disease Dataset” utilized in this study is publicly available from Mendeley Data (https://data.mendeley.com/datasets/dzz48mvjht/1). The Python code developed for all modeling and analysis presented in this paper is available in a public GitHub repository (https://github.com/Michael1006-dev/CVD_Prediction_with_Generative_Models) to ensure full reproducibility.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Ostrominski, J. W. & Powell-Wiley, T. M. Risk stratification and treatment of obesity for primary and secondary prevention of cardiovascular disease. Curr. Atheroscler. Rep. 26, 11–23. 10.1007/s11883-023-01182-3 (2024).
2. Thomas, P. E., Vedel-Krogh, S. & Nordestgaard, B. G. Measuring lipoprotein(a) for cardiovascular disease prevention - in whom and when? Curr. Opin. Cardiol. 39, 39–48. 10.1097/hco.0000000000001104 (2024).
3. Rodriguez, J. B. C., Mohammad, K. O. & Alkhateeb, H. Contemporary review of risk scores in prediction of coronary and cardiovascular deaths. Curr. Cardiol. Rep. 24, 7–15. 10.1007/s11886-021-01620-1 (2022).
4. Gourdy, P. et al. Atherosclerotic cardiovascular disease risk stratification and management in type 2 diabetes: review of recent evidence-based guidelines. Front. Cardiovasc. Med. 10. 10.3389/fcvm.2023.1227769 (2023).
5. Kasartzian, D. I. & Tsiampalis, T. Transforming cardiovascular risk prediction: A review of machine learning and artificial intelligence innovations. Life-Basel 15, 20. 10.3390/life15010094 (2025).
6. Chowdhury, M. A. et al. The heart of transformation: exploring artificial intelligence in cardiovascular disease. Biomedicines 13, 28. 10.3390/biomedicines13020427 (2025).
7. Kagiyama, N., Shrestha, S., Farjo, P. D. & Sengupta, P. P. Artificial intelligence: practical primer for clinical research in cardiovascular disease. J. Am. Heart Assoc. 8, 12. 10.1161/jaha.119.012788 (2019).
8. Kumar, A. S. & Rekha, R. A dense network approach with Gaussian optimizer for cardiovascular disease prediction. New Gener. Comput. 41, 859–878. 10.1007/s00354-023-00234-1 (2023).
9. Kumar, A. S. & Rekha, R. An improved Hawks optimizer based learning algorithms for cardiovascular disease prediction. Biomed. Signal Process. Control 81, 104442. 10.1016/j.bspc.2022.104442 (2023).
10. Arunachalam, S. K. & Rekha, R. A novel approach for cardiovascular disease prediction using machine learning algorithms. Concurrency and Computation: Practice and Experience 34, e7027. 10.1002/cpe.7027 (2022).
11. Aubaidan, B. H. et al. A review of intelligent data analysis: machine learning approaches for addressing class imbalance in healthcare - challenges and perspectives. Intell. Data Anal. 29, 699–719. 10.1177/1088467x241305509 (2025).
12. Tasci, E., Zhuge, Y., Camphausen, K. & Krauze, A. V. Bias and class imbalance in oncologic data - towards inclusive and transferrable AI in large scale oncology data sets. Cancers 14. 10.3390/cancers14122897 (2022).
13. Restrepo, J. P., Rivera, J. C., Laniado, H., Osorio, P. & Becerra, O. A. Nonparametric generation of synthetic data using copulas. Electronics 12, 17. 10.3390/electronics12071601 (2023).
14. García-Vicente, C. et al. Evaluation of synthetic categorical data generation techniques for predicting cardiovascular diseases and post-hoc interpretability of the risk factors. Appl. Sci.-Basel 13, 23. 10.3390/app13074119 (2023).
15. Innab, N. et al. AI-driven predictive modeling for lung cancer detection and management using synthetic data augmentation and random forest classifier. Int. J. Comput. Intell. Syst. 18, 20. 10.1007/s44196-025-00879-4 (2025).
16. Martínez-Velasco, A., Martínez-Villaseñor, L. & Miralles-Pechuán, L. Addressing class imbalance in healthcare data: machine learning solutions for age-related macular degeneration and preeclampsia. IEEE Latin Am. Trans. 22, 806–820. 10.1109/tla.2024.10705995 (2024).
17. Alturki, N. et al. Improving prediction of chronic kidney disease using KNN imputed SMOTE features and TrioNet model. CMES-Comput. Model. Eng. Sci. 139, 3513–3534. 10.32604/cmes.2023.045868 (2024).
18. Guo, Y., Kou, Y., Yi, L. Z. & Fu, G. H. HiBBKA: A hybrid method with resampling and heuristic feature selection for class-imbalanced data in chemometrics. J. Chemometr. 39, 21. 10.1002/cem.70029 (2025).
19. Alharbi, F., Ouarbya, L. & Ward, J. A. Comparing sampling strategies for tackling imbalanced data in human activity recognition. Sensors 22, 20. 10.3390/s22041373 (2022).
20. Zhang, Z. Y., Li, Y. X., Liu, T. M. & Liu, C. A. Data augmentation-empowered diabetic retinopathy detection based on collaborative discrimination-enabled generative adversarial network. Ann. Oper. Res. 23. 10.1007/s10479-024-06147-6 (2024).
21. Baowaly, M. K., Lin, C. C., Liu, C. L. & Chen, K. T. Synthesizing electronic health records using improved generative adversarial networks. J. Am. Med. Inform. Assoc. 26, 228–241. 10.1093/jamia/ocy142 (2019).
22. Weng, X. T. et al. A joint learning method for incomplete and imbalanced data in electronic health record based on generative adversarial networks. Comput. Biol. Med. 168, 12. 10.1016/j.compbiomed.2023.107687 (2024).
23. Li, R. Z. et al. Improving an electronic health record-based clinical prediction model under label deficiency: network-based generative adversarial semisupervised approach. JMIR Med. Inf. 11, 14. 10.2196/47862 (2023).
24. Bounab, R., Zarour, K., Guelib, B. & Khlifa, N. Enhancing medicare fraud detection through machine learning: addressing class imbalance with SMOTE-ENN. IEEE Access 12, 54382–54396. 10.1109/access.2024.3385781 (2024).
25. Yan, C. et al. A multifaceted benchmarking of synthetic electronic health record generation models. Nat. Commun. 13, 18. 10.1038/s41467-022-35295-1 (2022).
26. Ince, V., Bader-El-Den, M., Esmeli, R., Maurya, L. & Sari, O. F. Predicting spoilage intensity level in sausage products using explainable machine learning and GAN-based data augmentation. Food Bioprocess Technol. 10.1007/s11947-025-03971-x (2025).
27. Hasan, T. & Tasnim, S. Real-time explainable IoT security with machine learning and CTGAN-enhanced detection for resource-constrained devices. Ad Hoc Netw. 178, 21. 10.1016/j.adhoc.2025.103937 (2025).
28. Won, S., Bae, S. H. & Kim, S. T. Effects of mixed sample data augmentation on interpretability of neural networks. Neural Netw. 190, 12. 10.1016/j.neunet.2025.107611 (2025).
29. Wang, L. Y. et al. Integrative modeling of heterogeneous soil salinity using sparse ground samples and remote sensing images. Geoderma 430, 16. 10.1016/j.geoderma.2022.116321 (2023).
30. Doppala, B. P. & Bhattacharyya, D. Cardiovascular_Disease_Dataset. Mendeley Data (2021). https://data.mendeley.com/datasets/dzz48mvjht/1
31. Doppala, B. P. & Bhattacharyya, D. Cardiovascular_Disease_Dataset. Kaggle (2021). https://www.kaggle.com/datasets/jocelyndumlao/cardiovascular-disease-dataset/data
32. Creative Commons. Attribution 4.0 International (CC BY 4.0).
33. Adeoye, J., Hui, L. L. & Su, Y. X. Data-centric artificial intelligence in oncology: a systematic review assessing data quality in machine learning models for head and neck cancer. J. Big Data 10, 25. 10.1186/s40537-023-00703-w (2023).
34. Guo, W. J. et al. Review of machine learning and deep learning models for toxicity prediction. Exp. Biol. Med. 248, 1952–1973. 10.1177/15353702231209421 (2023).
35. Lokman, A., Ismail, W. Z. W. & Aziz, N. A. A. A review of water quality forecasting and classification using machine learning models and statistical analysis. Water 17, 31. 10.3390/w17152243 (2025).
36. Du, Q. X., Wang, H. C., Jiang, B. B. & Wang, X. W. Advancing genetic engineering with active learning: theory, implementations and potential opportunities. Brief. Bioinform. 26, 24. 10.1093/bib/bbaf286 (2025).
37. Song, X. Y., Xiong, J. H., Wang, M., Mei, Q. S. & Lin, X. D. Combined data augmentation on EANN to identify indoor anomalous sound event. Appl. Sci.-Basel 14, 16. 10.3390/app14041327 (2024).
38. Alataiqeh, M., Shi, H., Qu, Q. Q., Mei, X. S. & Wang, H. T. Thermal error modeling of slant bed CNC lathe spindle based on BiLSTM with data augmentation and grey wolf optimizer algorithm. Case Stud. Therm. Eng. 70, 18. 10.1016/j.csite.2025.106090 (2025).
39. Chen, J., Xia, M. & Wang, Z. J. Radial-based oversampling based on differential evolution for imbalanced data. Appl. Intell. 55, 20. 10.1007/s10489-025-06460-y (2025).
40. Sun, P., Wang, Z., Jia, L. & Xu, Z. SMOTE-kTLNN: A hybrid re-sampling method based on SMOTE and a two-layer nearest neighbor classifier. Expert Syst. Appl. 238. 10.1016/j.eswa.2023.121848 (2024).
41. Jiménez-Gaona, Y., Carrión-Figueroa, D., Lakshminarayanan, V. & Rodríguez-Alvarez, M. J. GAN-based data augmentation to improve breast ultrasound and mammography mass classification. Biomed. Signal Process. Control 94, 18. 10.1016/j.bspc.2024.106255 (2024).
42. Zhang, Y. et al. GAN-based one dimensional medical data augmentation. Soft Comput. 27, 10481–10491. 10.1007/s00500-023-08345-z (2023).
43. Zhao, X. Y., Mansor, Z., Razali, R., Nazri, M. Z. A. & Xiong, X. Advancing agile software cost estimation through data synthesis: a comparative analysis of five generation techniques. IEEE Access 13, 63219–63236. 10.1109/access.2025.3558715 (2025).
44. Bhattarai, S., Chong, K. T. & Tayara, H. GAN-ML: advancing anticancer peptide prediction through innovative deep convolution generative adversarial network data augmentation technique. Chemometrics Intell. Lab. Syst. 262, 14. 10.1016/j.chemolab.2025.105390 (2025).
45. Isasa, I. et al. Comparative assessment of synthetic time series generation approaches in healthcare: leveraging patient metadata for accurate data synthesis. BMC Med. Inf. Decis. Mak. 24, 14. 10.1186/s12911-024-02427-0 (2024).
46. Kazemi, P., Entezami, A. & Ghisi, A. Machine learning techniques for diagrid building design: architectural-structural correlations with feature selection and data augmentation. J. Build. Eng. 86, 27. 10.1016/j.jobe.2024.108766 (2024).
47. Sghaireen, M. G. et al. Machine learning approach for metabolic syndrome diagnosis using explainable data-augmentation-based classification. Diagnostics 12, 21. 10.3390/diagnostics12123117 (2022).
48. Qu, H. L., Hu, T. Y., Qu, J. T. & Taffese, W. Z. A machine learning model based on GAN-ANN data augmentation for predicting the bond strength of FRP-reinforced concrete under high-temperature conditions. Compos. Struct. 369, 21. 10.1016/j.compstruct.2025.119321 (2025).
49. Takefuji, Y. Beyond XGBoost: unveiling true feature importance. J. Hazard. Mater. 488, 4. 10.1016/j.jhazmat.2025.137382 (2025).
50. Ekanayake, I. U., Meddage, D. P. P. & Rathnayake, U. A novel approach to explain the black-box nature of machine learning in compressive strength predictions of concrete using Shapley additive explanations (SHAP). Case Stud. Constr. Mater. 16, 20. 10.1016/j.cscm.2022.e01059 (2022).
51. Pigozzi, F. et al. Role of exercise stress test in master athletes. Br. J. Sports Med. 39, 527–531. 10.1136/bjsm.2004.014340 (2005).
52. Song, C. X. et al. The CAMI-score: A novel tool derived from CAMI registry to predict in-hospital death among acute myocardial infarction patients. Sci. Rep. 8, 8. 10.1038/s41598-018-26861-z (2018).
53. Pelliccia, A. & Quattrini, F. M. Clinical significance of J-wave in elite athletes. J. Electrocardiol. 48, 385–389. 10.1016/j.jelectrocard.2015.03.012 (2015).
54. Chan, M. C. et al. Explainable machine learning to predict long-term mortality in critically ill ventilated patients: a retrospective study in central Taiwan. BMC Med. Inf. Decis. Mak. 22, 11. 10.1186/s12911-022-01817-6 (2022).
55. Gu, T., Lee, P. H. & Duan, R. COMMUTE: Communication-efficient transfer learning for multi-site risk prediction. J. Biomed. Inform. 137, 11. 10.1016/j.jbi.2022.104243 (2023).
56. Pirmani, A. et al. Personalized federated learning for predicting disability progression in multiple sclerosis using real-world routine clinical data. npj Digit. Med. 8, 15. 10.1038/s41746-025-01788-8 (2025).
57. Zhou, Y. et al. Artificial intelligence for predicting delivery modes: A systematic review of applications, challenges, and future directions. Clin. Exp. Obstet. Gynecol. 52, 11. 10.31083/ceog37807 (2025).
58. Saranya, K., Karthikeyan, U., Kumar, A. S., Salau, A. O. & Tin Tin, T. DenseNet-ABiLSTM: revolutionizing multiclass arrhythmia detection and classification using hybrid deep learning approach leveraging PPG signals. Int. J. Comput. Intell. Syst. 18, 33. 10.1007/s44196-025-00765-z (2025).
59. Rajagopal, R. & Ranganathan, V. Evaluation of effect of unsupervised dimensionality reduction techniques on automated arrhythmia classification. Biomed. Signal Process. Control 34, 1–8. 10.1016/j.bspc.2016.12.017 (2017).
60. Said, Y., Saidani, T., Atri, M., Alsheikhy, A. A. & Shawly, T. Computational intelligence for emotion recognition in autism spectrum disorder: a systematic review of signal-based modeling, simulation, and clinical potential. Biomed. Signal Process. Control 111, 18. 10.1016/j.bspc.2025.108367 (2026).
61. Eom, G. & Byeon, H. Searching for optimal oversampling to process imbalanced data: generative adversarial networks and synthetic minority over-sampling technique. Mathematics 11, 14. 10.3390/math11163605 (2023).
62. Joseph, A. J. et al. Prior-guided generative adversarial network for mammogram synthesis. Biomed. Signal Process. Control 87, 11. 10.1016/j.bspc.2023.105456 (2024).

