Endocrinology, Diabetes & Metabolism. 2025 Sep 18;8(5):e70111. doi: 10.1002/edm2.70111

Diagnostic Performance of Machine Learning Algorithms for Predicting Heart Failure in Diabetic Patients: A Systematic Review and Meta‐Analysis

Pooya Eini 1, Peyman Eini 2, Homa Serpoush 3, Mohammad Rezayee 4
PMCID: PMC12445121  PMID: 40965312

ABSTRACT

Background

Heart failure is a significant complication in diabetic patients, and machine learning algorithms offer potential for early prediction. This systematic review and meta‐analysis evaluated the diagnostic performance of ML models in predicting HF among diabetic patients.

Methods

We searched PubMed, Web of Science, Embase, ProQuest, and Scopus, identifying 2830 articles. After deduplication and screening, 16 studies were included, with 7 providing data for meta‐analysis. Study quality was assessed using PROBAST+AI. A bivariate random‐effects model (Stata; midas and metadta packages) pooled sensitivity, specificity, likelihood ratios, and diagnostic odds ratio (DOR) for best‐performing algorithms, with subgroup analyses. Heterogeneity (I²) and publication bias were assessed.

Results

This meta‐analysis of seven studies evaluating machine learning models for heart failure detection demonstrated a pooled sensitivity of 84% (95% CI: 75%–90%), specificity of 86% (95% CI: 56%–97%), and an area under the ROC curve of 0.90 (95% CI: 0.87–0.93). The pooled positive likelihood ratio was 6.6 (95% CI: 1.2–35.9), and the negative likelihood ratio was 0.17 (95% CI: 0.08–0.36), with a diagnostic odds ratio of 39 (95% CI: 4–423). Significant heterogeneity was observed, primarily related to differences in study populations, machine learning algorithms, dataset sizes, and validation methods. No significant publication bias was detected.

Conclusion

Machine learning models demonstrate promising diagnostic accuracy for heart failure detection and have the potential to support early diagnosis and risk assessment in clinical practice. However, considerable heterogeneity across studies and limited external validation highlight the need for standardised development, prospective validation, and improved interpretability of ML models to ensure their effective integration into healthcare systems.

Keywords: diabetes, diagnostic accuracy, heart failure, machine learning, predictive modelling


Machine learning models demonstrate high diagnostic accuracy for predicting heart failure in patients with type 2 diabetes, outperforming traditional risk models. Their ability to integrate diverse clinical data highlights their potential for early detection and risk stratification in clinical practice.


1. Introduction

Heart failure (HF) is a serious cardiovascular complication in patients with type 2 diabetes mellitus (T2DM), often driven by poor glycemic control, coronary artery disease, and ageing [1]. In diabetic populations, markers such as HbA1c and fasting glucose are important predictors of HF [2], reflecting underlying mechanisms such as insulin resistance and diabetic cardiomyopathy [3]. Among patients with T2DM, about 3.5% were hospitalised for HF over 360,258 person‐years of follow‐up, corresponding to 5.2 events per 1000 person‐years [4]. The cumulative 5‐year event rate reaches 2.0%, highlighting the urgent need for early risk identification [3]. Risk scores that include factors such as systolic blood pressure, BMI, and eGFR have shown good discrimination for predicting HF at 5 and 10 years [5]. High‐risk groups may face up to a 15‐fold increase in HF risk compared with individuals with normal glucose levels [2]. Yet, despite reasonable calibration, these traditional models perform modestly in prediabetic populations and often fail to account for dynamic risk factors such as fluctuations in glycemic control or medication use [6].

Machine learning (ML) algorithms offer a promising alternative, leveraging advanced pattern recognition to address the limitations of traditional statistical approaches [7]. Unlike basic risk scores, ML models can integrate high‐dimensional data such as HbA1c variability, renal biomarkers, and prior cardiovascular events to enhance predictive accuracy [8]. Given the wide variation in HF risk across age groups and risk categories, there is a clear need for better tools to support early detection and risk stratification in diabetes care.

This systematic review and meta‐analysis advances the field of HF prediction in diabetic patients by providing a comprehensive synthesis of ML‐based diagnostic performance. While individual studies have demonstrated ML's potential, the lack of a unified evaluation has hindered clinical adoption. By pooling data across diverse studies, our work quantifies ML's diagnostic accuracy and compares it to traditional methods, addressing gaps in understanding heterogeneity and generalisability.

2. Methods

2.1. Search Strategy

The study protocol is registered on the Open Science Framework (osf.io/5bcfd/), and we adhered to the Preferred Reporting Items for Systematic Reviews and Meta‐Analyses (PRISMA) guidelines [9]. We systematically searched five electronic databases, PubMed, Web of Science (WoS), Embase, ProQuest, and Scopus, for relevant studies published up to June 13, 2025. The search strategy combined terms related to machine learning, heart failure, and diagnostic accuracy, with no language restrictions applied. A total of 2830 articles were identified; after removing duplicates, 1527 unique articles were screened. The full search syntax for each database is available in Table S1.

2.2. Study Selection

The screening process involved two stages. First, titles and abstracts of 1527 articles were independently reviewed by two investigators to assess eligibility based on predefined inclusion criteria: studies reporting diagnostic performance metrics (e.g., sensitivity, specificity, accuracy) of machine learning algorithms for HF prediction in adults using a test set with defined cases (HF) and controls (no HF). Discrepancies were resolved by consensus or consultation with a third reviewer. This initial screening yielded 20 articles selected for full‐text review. In the second stage, full texts were evaluated for data availability, methodological quality, and relevance to the meta‐analysis objectives. Studies lacking sufficient data (e.g., contingency tables or derivable metrics) or not meeting quality standards were excluded. Ultimately, 16 articles were included in the study, of which 7 provided adequate data for quantitative synthesis in the meta‐analysis (Figure 1).

FIGURE 1. PRISMA flowchart for study screening and selection.

2.3. Data Extraction

Data extraction was performed independently by two reviewers, capturing study characteristics (algorithm type, sample size, HF prevalence), performance metrics (sensitivity, specificity, accuracy), and test set details (cases and controls).

2.4. Quality Assessment

Quality assessment of the included studies was conducted using the PROBAST + AI tool, tailored for prediction model studies incorporating artificial intelligence [10]. This tool evaluated the risk of bias and applicability across domains such as participant selection, predictors, outcomes, and analysis, ensuring the robustness of the meta‐analysis. Studies were assessed independently by two reviewers, with disagreements resolved through discussion.

2.5. Statistical Analysis

Statistical analysis was performed using Stata version 18 with the midas and metadta packages. We employed a bivariate random‐effects model to pool sensitivity, specificity, positive likelihood ratio (LR+), negative likelihood ratio (LR−), and diagnostic odds ratio (DOR) across studies, accounting for the correlation between sensitivity and specificity. The summary receiver operating characteristic (SROC) curve and area under the curve (AUROC) were estimated to assess overall diagnostic accuracy. Heterogeneity was evaluated using the I² statistic and χ² test (Q statistic), with I² values interpreted as low (< 25%), moderate (25%–50%), or substantial (> 50%). Publication bias was assessed using a regression‐based test for small‐study effects (Deeks' method). A random‐effects meta‐regression was conducted to explore the impact of machine learning model type on diagnostic performance, using the restricted maximum likelihood (REML) method with within‐study standard errors specified for the log DOR estimates.
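To make the pooled quantities concrete, the sketch below illustrates, in Python, how per‐study sensitivity, specificity, likelihood ratios, and DOR are derived from a 2 × 2 table of true/false positives and negatives, followed by a simple DerSimonian–Laird random‐effects pooling of the log DOR. This is an illustrative simplification only: the actual analysis used the bivariate model implemented in Stata's midas and metadta commands, and the study counts shown here are hypothetical placeholders rather than extracted data.

```python
import math

# Illustrative per-study 2x2 counts (tp, fp, fn, tn); values are hypothetical,
# not the counts extracted from the included studies.
studies = [
    {"id": "A", "tp": 90, "fp": 20, "fn": 10, "tn": 180},
    {"id": "B", "tp": 45, "fp": 15, "fn": 12, "tn": 130},
    {"id": "C", "tp": 300, "fp": 90, "fn": 60, "tn": 700},
]

def per_study_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, LR+, LR-, and DOR for one study."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    lr_pos = sens / (1 - spec)
    lr_neg = (1 - sens) / spec
    dor = lr_pos / lr_neg          # equivalently (tp * tn) / (fp * fn)
    return sens, spec, lr_pos, lr_neg, dor

# Simple DerSimonian-Laird random-effects pooling of log(DOR); a simplification
# of the bivariate model, which jointly models logit-sensitivity and
# logit-specificity with their between-study correlation.
log_dors, variances = [], []
for s in studies:
    tp, fp, fn, tn = s["tp"], s["fp"], s["fn"], s["tn"]
    log_dors.append(math.log((tp * tn) / (fp * fn)))
    variances.append(1/tp + 1/fp + 1/fn + 1/tn)   # standard variance of log DOR

weights = [1 / v for v in variances]
fixed = sum(w * y for w, y in zip(weights, log_dors)) / sum(weights)
q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, log_dors))
c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
tau2 = max(0.0, (q - (len(studies) - 1)) / c)      # between-study variance
re_weights = [1 / (v + tau2) for v in variances]
pooled_log_dor = sum(w * y for w, y in zip(re_weights, log_dors)) / sum(re_weights)

print("Pooled DOR (random effects):", round(math.exp(pooled_log_dor), 1))
for s in studies:
    print(s["id"], [round(x, 2) for x in per_study_metrics(s["tp"], s["fp"], s["fn"], s["tn"])])
```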

2.6. Generative AI

In the preparation of this article, the authors utilised the Grammarly application to enhance linguistic accuracy and clarity. The manuscript underwent meticulous double‐checking to ensure precision, and the authors assume full responsibility for the integrity and originality of the content presented herein.

3. Results

3.1. Study Characteristics

This systematic review included 16 studies published between 2019 and 2025. The majority were observational studies (75%), with the remainder comprising clinical trials (12.5%), cohort studies (6.25%), and cross‐sectional studies (6.25%). The studies were conducted across a diverse range of countries, including the USA, Canada, Italy, Japan, South Korea, Australia, Uzbekistan, Poland, Spain, Hong Kong, and Sweden.

Validation methods varied among studies. Most used internal validation, with 81.25% employing k‐fold cross‐validation (predominantly 5‐ or 10‐fold), while a smaller proportion (18.75%) used split‐sample validation or external cohorts. Only a few studies incorporated robust external validation, which may influence the generalisability of the results.
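As an illustration of the internal‐validation approach most commonly reported (stratified k‐fold cross‐validation), the following Python sketch evaluates a classifier by 5‐fold cross‐validated AUROC using scikit‐learn. The dataset, features, and model settings are synthetic assumptions for demonstration only and do not correspond to any included study.

```python
# Hedged sketch of stratified 5-fold cross-validation, the internal-validation
# scheme most studies reported; the data are a synthetic placeholder.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a tabular diabetic cohort with a binary HF label.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# AUROC averaged over the five held-out folds, mirroring how most studies
# summarised internal-validation performance.
auroc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("Fold AUROCs:", np.round(auroc, 3), "mean:", round(auroc.mean(), 3))
```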

The included models were trained and validated on a combined total of 845,816 participants in training sets, 237,747 in validation sets, and 122,118 in test sets. The dataset sizes ranged from fewer than 100 participants to large‐scale populations exceeding 600,000 individuals. Approximately 31% of studies analysed datasets larger than 100,000 participants, whereas 38% used datasets with fewer than 10,000 participants.

Predictors most commonly included demographics (age, sex), clinical variables (comorbidities, vital signs), laboratory data (biomarkers, electrolytes, blood indices), medication histories, and diagnostic codes. Several studies also used advanced predictors, such as composite risk indices, electrocardiographic parameters, and disease‐specific biomarkers, reflecting a broad spectrum of clinical data inputs.

The median follow‐up duration across studies was 47.4 months, though 37.5% of studies did not report follow‐up times explicitly. This variation in follow‐up likely contributed to differences in outcome definitions and model performance across studies.

Patient characteristics were inconsistently reported. However, the calculated mean age of participants was 50.8 years, and the average proportion of male participants was 53.1%, which aligns with typical heart failure demographics but highlights variability among study populations.

A wide variety of machine learning algorithms were tested, with each study evaluating multiple models. The most frequently selected best‐performing algorithms were Random Forest (18.8%) and Gradient Boosting (XGBoost) (18.8%), followed by Random Survival Forest (12.5%) and other models including Neural Networks, Deep Learning with GRU, Proportional Hazards Neural Networks, Stacked Models, Multinomial Logistic Regression, Explainable Boosting Machine, Light Gradient Boosting Machine, and Stochastic Gradient Boosting (each selected in 6.2% of studies). These results reflect the diverse algorithmic landscape in ML‐based heart failure prediction research. Detailed study characteristics are available in Table 1.

TABLE 1. Characteristics of the included studies.

First author/year Type of study Country Validation Input characteristics Selected features Number of patients in training/test/validation group Follow‐up (months) Mean age/male % Machine learning algorithms Best predictor algorithm
Segar et al. (2019) [6] Clinical trial USA, Canada Internal validation used a 50% development and 50% validation split, with external validation in the ALLHAT cohort Clinical features (demographics, clinical variables, laboratory data, electrocardiographic parameters, baseline antihyperglycemic therapies, treatment randomization) 10 4378/7378/10,819 ACCORD: median 58.8 months (4.9 years); ALLHAT: median 57.6 months (4.8 years) 61.7/61.5% Random Survival Forest (RSF) RSF
Angraal et al. (2020) [11] RCT United States, Canada, Argentina, Brazil 5‐fold cross‐validation Haemoglobin level, blood urea nitrogen (BUN), KCCQ variables, time since previous HF hospitalisation, glomerular filtration rate, and blood glucose levels. 68 1237/353/177 36 Mean age not reported; 50.1% male Logistic Regression (LR) with forward selection, LR with lasso regularisation, Random Forest (RF), Gradient Descent Boosting (XGBoost), Support Vector Machine (SVM) Random Forest (RF)
Lee et al. (2021) [12] Observational study Hong Kong 5‐fold cross‐validation HbA1c mean and variability, HDL‐C mean and SD, baseline NLR, hypoglycemic frequency, lipid parameters: total cholesterol, LDL‐C, HDL‐C, triglycerides 22 17,630/5037/2519 132 63/50.4% Random Survival Forests Random Survival Forests
Longato et al. (2021) [13] Observational study Italy 5‐fold cross‐validation Clinical features (demographics, clinical variables, laboratory data, electrocardiographic parameters) 22 150,273/42,935/21,468 60 69/55% Deep Learning model with GRU Deep Learning model with GRU
Kanda et al. (2022) [8] Observational study Japan 80% training, 20% internal validation split; external validation using a separate dataset Clinical features (patient demographics, ATC diagnosis codes, laboratory values) 60 173,643/43,410/16,822 60 Not Reported Random Forest (RF), Logistic Regression (LR), Gradient Boosting (XGBoost), Deep Learning (Multilayer Perceptron), Cox Proportional Hazards XGBoost
Abegaz et al. (2023) [14] Observational study USA 5‐fold cross‐validation Clinical features (demographics, electrolytes, blood indices, disease‐specific biomarkers, physical measurements) 30 7247/1812/Not Reported Not Reported age > 18/42.6% Random Forest (RF), Logistic Regression (LR), Extreme Gradient Boosting (XGBoost), Weighted Ensemble Model (WEM) XGBoost
Gandin et al. (2023) [15] Cohort Study Italy 10‐fold cross‐validation Clinical features (medical information, diagnostic codes, laboratory tests, procedures, cardiovascular drug prescriptions, comorbidities) 20 7430/1592/1592 65 72/42% Proportional Hazards Neural Network (PHNN) PHNN
Wang et al. (2023) [16] Observational study USA Bootstrap method used for internal validation Clinical features (demographics, self‐reported medical history, medication status, smoking/alcohol consumption, physical measurements, laboratory data) 6 2469/1058/Not Reported Not Reported 59.62/57.5% Logistic Regression (LR), Random Forest (RF), Classification and Regression Tree (CART), Gradient Boosting Machine (GBM), Support Vector Machine (SVM) Random Forest (RF)
Alimova et al. (2023) [17] Observational study Uzbekistan 5‐fold cross‐validation Clinical features (demographics, medical histories, medication history, clinical parameters, blood tests) 19 91/26/13 24 Not Reported Random Forest (RF), Logistic Regression (LR), Generalised Linear Models (GLM), Extra Trees, Neural Networks (NN) Neural Networks (NN)
Mora et al. (2023) [18] Observational study Spain 5‐fold cross‐validation Clinical features (demographic data, medical histories, vital signs) 9 427,013/122,004/61,002 36 69.56/54.33% Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Extreme Gradient Boosting (XGB), Stacked Model (based on LR combining LR, DT, RF, XGB) Stacked Model
Nabrdalik et al. (2023) [19] Observational study Poland 5‐fold cross‐validation Clinical features (demographic data, medical histories, vital signs) 10 1400/400/200 Not Reported 58.85/52% Multinomial Logistic Regression (MLR) Multinomial Logistic Regression (MLR)
Sang et al. (2024) [20] Observational study South Korea Bootstrapping with 10,000 iterations Clinical features (demographics, medical histories, medication history, clinical parameters, blood tests) 15 12,809/Not Reported/2019 36 62.5 ± 12.1/51.0% Random Forest (RF), XGBoost (XGB), LightGBM (LGM), AdaBoost (ADB), Logistic Regression (LR), Support Vector Machine (SVM) with linear kernel Random Forest (RF)
Soh et al. (2024) [21] Observational study Australia 5‐fold cross‐validation Clinical features (demographic data, medical histories, vital signs, laboratory test results) 22 192/55/28 Not Reported Not Reported Explainable Boosting Machine (EBM) Explainable Boosting Machine (EBM)
Kwiendacz et al. (2025) [22] Observational study Poland 5‐fold cross‐validation Clinical features (demographic data, medical histories, vital signs, laboratory test results, medications) 10 781/335/Not Reported 37.2 67/57% Logistic Regression (LR), Random Forest (RF), Support Vector Classification (SVC), Light Gradient Boosting Machine (LGBM), eXtreme Gradient Boosting (XGBM) Light Gradient Boosting Machine (LGBM)
Bai et al. (2025) [5] Cross‐sectional Study USA 5‐fold cross‐validation Clinical features (novel composite indices derived from anthropometric data, laboratory tests, and patient history) 7 1011/434/Not Reported Not Reported 62.62/48.24% Random Forest (RF), Logistic Regression (LR), XGBoost, Support Vector Machine (SVM), k‐Nearest Neighbours (kNN), Gradient Boosting, AdaBoost, Neural Network, Naive Bayes XGBoost
Wändell et al. (2025) [23] Observational study Sweden 10‐fold cross‐validation Clinical features (demographics, electrolytes, blood indices, disease‐specific biomarkers) 25 38,212/10,918/5459 36 ≥ 30 years/60.4% Stochastic Gradient Boosting (SGB) Stochastic Gradient Boosting (SGB)

3.2. Pooled Analysis of Best‐Performing Models

In this analysis, we used the best‐performing machine learning model from each study, as reported by the original authors. Detailed performance metrics for the different models are shown in Table 2. The pooled analysis demonstrated a sensitivity of 0.84 (95% CI: 0.75–0.90) and a specificity of 0.86 (95% CI: 0.56–0.97), based on a random‐effects model. These findings indicate a generally high level of diagnostic performance across the included studies. However, significant heterogeneity was observed, with I² values of 83.17% for sensitivity and 87.77% for specificity, reflecting considerable differences in study outcomes (Figure 2).

Additional pooled diagnostic performance metrics further underscored the strong potential of ML models in detecting heart failure. The pooled area under the receiver operating characteristic curve (AUROC) was 0.90 (95% CI: 0.87–0.93), suggesting excellent discriminatory power of these models in distinguishing patients with heart failure from those without the condition (Figure 3). The positive likelihood ratio (PLR) was 6.6 (95% CI: 1.2–35.9), indicating that individuals with heart failure were approximately 6.6 times more likely to have a positive test result from the ML models than individuals without heart failure. This level of PLR suggests that the models may have clinical utility in confirming a diagnosis of heart failure when test results are positive. The negative likelihood ratio (NLR) was 0.17 (95% CI: 0.08–0.36), a low value that implies a strong potential for ruling out heart failure when the ML test result is negative. Furthermore, the diagnostic odds ratio (DOR) was 39 (95% CI: 4–423), reflecting high overall diagnostic effectiveness, although the wide confidence interval suggests variability across the studies.

These findings demonstrate that ML models hold strong potential for clinical application in heart failure detection. The high sensitivity and low negative likelihood ratio support their use as screening tools, aiding in the early detection of heart failure and potentially reducing the risk of missed diagnoses. The high specificity and positive likelihood ratio suggest they may also be effective for confirming heart failure when used in conjunction with clinical assessment. The AUROC of 0.90 reflects excellent overall discriminative performance. Despite these promising results, the significant heterogeneity highlights the need for further standardisation and validation of ML models across diverse clinical populations and healthcare settings. Before widespread adoption into clinical practice, rigorous external validation and careful integration into diagnostic workflows will be essential to ensure reliable performance in real‐world scenarios.
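As a rough consistency check of these summary estimates, the snippet below recomputes the likelihood ratios and DOR implied by the pooled sensitivity and specificity point estimates. Because the bivariate model pools LR+, LR−, and DOR directly rather than deriving them from the pooled sensitivity and specificity, the recomputed values only approximately reproduce the reported 6.6, 0.17, and 39.

```python
# Back-of-the-envelope check of how the pooled summary metrics relate.
sens, spec = 0.84, 0.86        # pooled point estimates from this meta-analysis

lr_pos = sens / (1 - spec)     # ~6.0 vs. the reported 6.6
lr_neg = (1 - sens) / spec     # ~0.19 vs. the reported 0.17
dor = lr_pos / lr_neg          # ~32 vs. the reported 39

print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}, DOR = {dor:.0f}")
```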

TABLE 2. Performance metrics for the different ML models. The first row for each study indicates the best‐performing algorithm.

First author/year Predictor algorithm Accuracy Sensitivity (recall) Specificity Precision F1 score AUROC AUROC (upper CI) AUROC (lower CI)
Segar et al. (2019) [6] RSF Not Reported Not Reported Not Reported Not Reported Not Reported 0.74 0.76 0.72
Angraal et al. (2020) [11] RF Not Reported Not Reported Not Reported Not Reported Not Reported 0.76 0.81 0.71
LR with a forward selection Not Reported Not Reported Not Reported Not Reported Not Reported 0.73 0.8 0.66
LR with a lasso regularisation Not Reported Not Reported Not Reported Not Reported Not Reported 0.73 0.79 0.67
XGBoost Not Reported Not Reported Not Reported Not Reported Not Reported 0.73 0.77 0.69
SVM Not Reported Not Reported Not Reported Not Reported Not Reported 0.72 0.81 0.63
Lee et al. (2021) [12] RSF Not Reported 0.9175 Not Reported 0.8963 Not Reported 0.8947 Not Reported Not Reported
Longato et al. (2021) [13] Deep Learning model with GRU Not Reported Not Reported Not Reported Not Reported Not Reported 0.84 0.865 0.815
Kanda et al. (2022) [8] XGBoost Not Reported Not Reported Not Reported Not Reported Not Reported 0.752 Not Reported Not Reported
Abegaz et al. (2023) [14] XGBoost 0.8 0.94 0.76 0.84 0.89 0.8 0.83 0.79
RF 0.81 0.91 0.78 0.82 0.86 0.8 0.82 0.78
LR 0.66 0.63 0.67 0.89 0.74 0.66 0.68 0.64
WEM 0.76 0.92 0.71 0.8 0.86 0.76 0.78 0.74
Gandin et al. (2023) [15] PHNN Not Reported Not Reported Not Reported Not Reported Not Reported 0.771 0.818 0.723
Wang et al. (2023) [16] RF Not Reported Not Reported Not Reported Not Reported Not Reported 0.978 Not Reported Not Reported
GBM Not Reported Not Reported Not Reported Not Reported Not Reported 0.873 Not Reported Not Reported
LR Not Reported Not Reported Not Reported Not Reported Not Reported 0.87 Not Reported Not Reported
SVM Not Reported Not Reported Not Reported Not Reported Not Reported 0.837 Not Reported Not Reported
CART Not Reported Not Reported Not Reported Not Reported Not Reported 0.822 Not Reported Not Reported
Alimova et al. (2023) [17] NN Not Reported Not Reported Not Reported Not Reported Not Reported 0.827 Not Reported Not Reported
LR Not Reported Not Reported Not Reported Not Reported Not Reported 0.72 Not Reported Not Reported
GLM Not Reported Not Reported Not Reported Not Reported Not Reported 0.702 Not Reported Not Reported
RF Not Reported Not Reported Not Reported Not Reported Not Reported 0.768 Not Reported Not Reported
Extra Trees Not Reported Not Reported Not Reported Not Reported Not Reported 0.726 Not Reported Not Reported
Mora et al. (2023) [18] Stacked Model 0.63 Not Reported Not Reported 0.63 Not Reported 0.69 Not Reported Not Reported
Nabrdalik et al. (2023) [19] MLR 0.7245 0.7959 0.7083 0.3824 0.517 0.83 0.88 0.77
Sang et al. (2024) [20] RF 0.664 0.664 0.664 Not Reported Not Reported 0.722 0.783 0.66
XGBoost 0.647 0.646 0.647 Not Reported Not Reported 0.71 0.77 0.649
LGBM 0.649 0.649 0.649 Not Reported Not Reported 0.717 0.778 0.653
ADB 0.653 0.653 0.653 Not Reported Not Reported 0.716 0.779 0.654
LR 0.654 0.654 0.654 Not Reported Not Reported 0.717 0.791 0.636
SVM 0.508 0.508 0.508 Not Reported Not Reported 0.526 0.604 0.449
Soh et al. (2024) [21] EBM 0.7938 0.89 0.62 0.82 0.853 0.81 0.794 0.787
Kwiendacz et al. (2025) [22] LGBM 0.723 0.624 0.739 0.923 0.82 0.74 0.743 0.738
RF 0.714 0.54 0.54 0.909 0.816 0.707 0.709 0.704
SVC 0.686 0.698 0.614 0.918 0.792 0.707 0.71 0.704
LR 0.59 0.596 0.589 0.899 0.711 0.71 0.713 0.707
XGBoost 0.695 0.631 0.572 0.86 0.755 0.621 0.623 0.618
Bai et al. (2025) [5] XGBoost 0.91 0.91 0.91 0.91 0.91 0.96 0.975 0.953
RF 0.9 0.9 0.9 0.9 0.9 0.97 0.979 0.959
LR 0.72 0.72 0.72 0.72 0.72 0.76 0.785 0.733
SVM 0.81 0.81 0.81 0.81 0.8 0.86 0.883 0.841
KNN 0.85 0.85 0.85 0.87 0.84 0.92 0.933 0.899
LGBM 0.84 0.84 0.84 0.84 0.84 0.91 0.929 0.895
ADB 0.76 0.76 0.76 0.76 0.76 0.83 0.855 0.811
Neural Network 0.86 0.86 0.86 0.86 0.86 0.91 0.929 0.895
Naive Bayes 0.69 0.69 0.69 0.69 0.69 0.73 0.759 0.705
Wändell et al. (2025) [23] SGB 0.776 0.802 0.742 0.346 0.483 0.849 0.872 0.828

FIGURE 2. Forest plot of pooled sensitivity and specificity for the best‐performing models.

FIGURE 3. SROC curve for the best‐performing models.

3.3. Heterogeneity Analysis

To examine the robustness of these pooled results, a leave‐one‐out sensitivity analysis was performed. The pooled sensitivity remained relatively stable, ranging between 0.80 and 0.86, regardless of which study was excluded. This indicates that no single study had an undue influence on the overall findings. However, heterogeneity assessments revealed that exclusion of the study by Sang et al. (2024) resulted in a substantial reduction in the Q statistic to 76.16, suggesting that this study contributed significantly to the observed heterogeneity. Conversely, removing the study by Wändell et al. (2025) slightly increased the between‐study variance, likely because the large sample size of this study helped to stabilise the overall meta‐analytic estimates.

In an effort to explore potential sources of heterogeneity, meta‐regression analyses were conducted using algorithm type and country of origin as moderators. The meta‐regression examining algorithm type did not yield statistically significant results (p = 0.080), although the model explained a large portion of the variance (R² = 0.96). Similarly, meta‐regression using country of study as a moderator showed no significant association with sensitivity (p = 0.074), with the model accounting for 87% of the variance (R² = 0.87). These findings suggest that neither the type of machine learning algorithm nor the country in which the study was conducted sufficiently explained the heterogeneity observed among the included studies.

Publication bias was assessed using Egger's regression test applied to the sensitivity estimates. The test yielded a p‐value of 0.59, indicating no statistically significant evidence of publication bias (Figure 4). However, given the limited number of included studies, the power to detect bias was inherently low, and the possibility of unrecognised bias cannot be entirely excluded.

FIGURE 4. Funnel plot for the best‐performing models.

3.4. Quality Assessment

The quality of the included studies was evaluated using the PROBAST+AI tool, focusing on three key domains: development, evaluation, and application. In the development domain, 75% of studies were assessed as having a low risk of bias, while 25% showed a moderate risk. The evaluation domain showed a greater proportion of concern, with 56% of studies rated as having a moderate risk of bias and 44% rated as low risk. The application domain reflected strong methodological adherence, with 94% of studies rated as low risk and only 6% as moderate risk. Overall, 56% of the studies were considered at high risk of bias when all domains were combined, while 44% were judged to be at low overall risk (Figure 5). These findings highlight methodological variability among the included studies, particularly in the evaluation phase, underscoring the need for more rigorous validation practices in future research.

FIGURE 5. Quality assessment based on the PROBAST+AI tool.

3.5. GRADE Assessment

The certainty of evidence for the diagnostic performance of machine learning models was assessed using the GRADE framework. The pooled sensitivity and specificity estimates were high; however, the overall certainty of evidence was rated as low due to several factors. The primary reason for downgrading was the high heterogeneity observed across studies, which raises concerns about the consistency and generalisability of the findings. Additionally, although Egger's test detected no significant publication bias, the small number of studies limits confidence in this assessment, warranting a further downgrade for potential reporting bias. The indirectness of evidence was considered low, since most studies directly addressed the clinical question; however, variability in study design, model types, and external validation approaches contributed to concerns about applicability. Imprecision was also noted, particularly for the specificity and diagnostic odds ratio estimates, as reflected by their wide confidence intervals. Taken together, these factors led to an overall low certainty of evidence, suggesting that while the findings are promising, further high‐quality, standardised studies are needed to strengthen confidence in the clinical applicability of ML models for heart failure detection.

4. Discussion

Analysis of the best‐performing ML models for heart failure detection showed a pooled sensitivity of 0.84 (95% CI: 0.75–0.90) and specificity of 0.86 (95% CI: 0.56–0.97), with an AUROC of 0.90 (95% CI: 0.87–0.93). These results suggest that the models can correctly identify 84% of patients with heart failure and accurately exclude 86% without it. The positive likelihood ratio of 6.6 (95% CI: 1.2–35.9) indicates a substantial increase in disease probability after a positive test, supporting their role in screening and diagnostic confirmation. Conversely, the negative likelihood ratio of 0.17 (95% CI: 0.08–0.36) means a negative result considerably lowers the likelihood of heart failure, aiding in its exclusion. The diagnostic odds ratio of 39 (95% CI: 4–423) underscores the strong overall discriminatory power of these ML models, highlighting their potential utility in clinical decision‐making for heart failure management.
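To illustrate what these likelihood ratios imply at the point of care, the short sketch below converts an assumed pre‐test probability of heart failure into post‐test probabilities after a positive or negative ML result, using the odds form of Bayes' theorem. The 10% pre‐test probability is an arbitrary illustrative value, not one drawn from the included studies.

```python
# Worked example of the probability shift implied by the pooled likelihood
# ratios. The 10% pre-test probability is an illustrative assumption.
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Convert pre-test probability to post-test probability via odds."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

pre_test = 0.10                # assumed 10% pre-test probability of HF
lr_pos, lr_neg = 6.6, 0.17     # pooled likelihood ratios from this meta-analysis

print("Post-test probability after a positive ML result:",
      round(post_test_probability(pre_test, lr_pos), 2))    # ~0.42
print("Post-test probability after a negative ML result:",
      round(post_test_probability(pre_test, lr_neg), 3))    # ~0.019
```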

The considerable heterogeneity observed among the included studies likely stems from differences in study populations, machine learning algorithms used, dataset sizes, and validation methods. Variability in patient populations—including age, comorbidities, disease prevalence, and diagnostic criteria—can influence both model performance and outcome measures, making direct comparison difficult. The choice of ML algorithms also contributes, as different models such as random forests, boosting methods, or support vector machines differ in learning approaches, risk of overfitting, and handling of data complexity, all of which affect diagnostic accuracy. Additionally, the size of datasets impacts both model training and evaluation, with smaller datasets increasing the risk of overfitting and producing less reliable results, while larger datasets generally enhance model stability and generalisability. Finally, the validation methods used across studies vary widely, ranging from internal validation to external testing on independent cohorts, directly influencing performance estimates. These methodological and clinical differences collectively contribute to the high heterogeneity observed and highlight the importance of standardised protocols, consistent validation strategies, and diverse patient inclusion in future ML research for heart failure detection.

Currently, despite the growing body of research on machine learning models for heart failure prediction, their translation into routine clinical practice remains limited. Most studies have focused on algorithm development and internal validation, with little emphasis on how these models could be embedded into everyday healthcare workflows. Segar et al. reported that a Random Survival Forest (RSF) model outperformed traditional methods, showing stronger predictive accuracy for heart failure with reduced ejection fraction (HFrEF) compared to heart failure with preserved ejection fraction (HFpEF), suggesting these models may have subtype‐specific utility [6]. In contrast, Kanda et al. found that the Gradient Boosting model (XGBoost) performed well on Japanese claims data, with AUROC values between 0.690 and 0.898, driven by variables like age and hospitalisation frequency, and maintained consistent performance in external validation [8]. Abegaz et al. demonstrated high sensitivity (0.94) for heart failure prediction using XGBoost on the All of Us dataset, particularly among type 2 diabetes patients on SGLT2 inhibitors, with key predictors including HbA1c and troponin [14]. Other studies reflected similar findings. Gandin et al. showed that a Proportional Hazards Neural Network (PHNN) performed well, achieving time‐dependent AUCs up to 0.780, especially when incorporating cardiac parameters and diuretic use [15]. Wang et al. reported an AUROC of 0.834 for a Random Forest model applied to NHANES data, with poverty‐income ratio and myocardial infarction as influential predictors [16]. Sang et al. validated a Random Forest model with an AUROC of 0.722, emphasising creatinine and HbA1c as significant features [20]. Kwiendacz et al. identified Light Gradient Boosting Machine (LGBM) as the top model (AUROC 0.740, sensitivity 0.62), particularly in chronic kidney disease patients with diabetes, driven by variables like eGFR and the TyG index [22]. Bai et al. reported high accuracy (0.91) and AUROC (0.96) for XGBoost in elderly diabetic patients, with the prognostic nutritional index (PNI) and fatty liver index (FLI) emerging as key factors [5].

4.1. Comparison With Traditional Risk Prediction Approaches

In contrast to the adaptability and predictive strength of machine learning models, traditional approaches rely on static biomarkers, fixed risk scores, or linear regression methods, making them less capable of capturing the complex and dynamic nature of heart failure risk. For example, a pooled cohort analysis that applied clinical risk scores (such as WATCH‐DM) and biomarkers such as NT‐proBNP and hs‐cTn in more than 6000 diabetic patients demonstrated that a two‐step screening strategy could reduce the number needed to screen and lower costs. However, this method relies on fixed thresholds and predefined variables, lacking the adaptability of machine learning models. In contrast, the ML models reviewed here achieved stronger diagnostic shifts, as reflected by higher positive and lower negative likelihood ratios [24]. Echouffo‐Tcheugui et al. examined GDF‐15 over a 23‐year period and found a higher risk of heart failure in patients with elevated levels. Although the study showed robust longitudinal associations, its reliance on Cox regression limited its ability to account for complex interactions, such as sex‐specific effects, which machine learning models are better equipped to handle [25]. Berezin et al. assessed serum irisin for distinguishing heart failure phenotypes in diabetic patients, finding that while certain biomarkers improved phenotype‐specific predictions, the static nature of ROC‐based cutoffs limited broader applicability [26]. Said et al. developed a multivariable model in the ALTITUDE cohort that identified eight key predictors and achieved good discrimination (sensitivity 0.78 and specificity 0.72). However, even with added biomarkers, such models cannot match machine learning approaches, which can incorporate a larger number of variables and learn complex, non‐linear interactions among them. Furthermore, ML models can be retrained or fine‐tuned on different patient populations, allowing them to adapt to varying clinical contexts and demographic profiles [27]. These examples emphasise that while traditional approaches have contributed significantly to risk stratification, they often fall short in dynamic prediction and adaptability.

Together, these findings suggest that Random Forest and XGBoost are consistently strong performers, yet their success depends on the target population and chosen predictors. This variability reinforces the earlier point that, despite promising results, most models remain within academic or experimental settings. The inconsistency in performance and model sensitivity across studies highlights the need for context‐specific optimisation and better‐defined implementation pathways before these tools can be seamlessly adopted in clinical practice. Integration into electronic health records (EHRs) or clinical triage systems is seldom addressed in the literature, and few examples exist where ML tools have been successfully deployed at the point of care. As a result, the clinical utility of these models often stops at proof‐of‐concept or retrospective validation stages. Moreover, issues such as interoperability with EHR systems, seamless integration into clinical decision‐making processes, and usability by healthcare providers are rarely explored.

5. Study Limitations

This study has several limitations that should be considered when interpreting the findings. First, the number of included studies was relatively small, which may have limited the statistical power of subgroup analyses and meta‐regression, as well as the ability to detect publication bias. Second, substantial heterogeneity was present across studies, driven by differences in patient populations, machine learning algorithms, dataset sizes, and validation strategies, which may affect the consistency and generalisability of the pooled estimates. Third, we relied on the best‐performing model reported by each study without access to individual patient data, preventing a more detailed analysis of model development processes, calibration, and performance across different subgroups. Finally, most studies lacked external validation and used retrospective data, raising concerns about potential overfitting and real‐world applicability. These limitations highlight the need for cautious interpretation of the results and emphasise the importance of high‐quality, prospective studies with standardised methodologies in future research.

6. Conclusion

This meta‐analysis demonstrates that machine learning models achieve strong diagnostic performance for heart failure detection, with a pooled sensitivity of 84%, specificity of 86%, and an AUROC of 0.90 among the best‐performing models identified in each study. These findings underscore the potential of ML models to support early diagnosis and risk stratification, offering valuable assistance in clinical decision‐making. However, the presence of significant heterogeneity—largely due to differences in study populations, algorithms, dataset sizes, and validation methods—highlights the need for careful interpretation of pooled estimates and limits the immediate generalisability of these results. While the performance of ML models appears promising, especially in comparison with traditional risk prediction tools, their clinical impact remains dependent on addressing methodological and practical barriers.

Future research should aim to standardise machine learning development pipelines, including transparent reporting of model architecture, training processes, and evaluation criteria. Prospective validation studies across diverse patient populations and healthcare settings are essential to ensure external validity and to establish real‐world effectiveness. Efforts should also focus on improving model interpretability, enabling clinicians to understand and trust algorithmic decisions, which is critical for acceptance in routine care. Furthermore, the integration of ML models into clinical workflows must consider factors such as data quality, interoperability with electronic health records, and cost‐effectiveness analyses. Addressing these areas will be key to translating the promising diagnostic accuracy of ML models into meaningful improvements in patient outcomes and healthcare delivery.

Author Contributions

Pooya Eini was a main contributor to the design, implementation, and writing of the manuscript. Mohammad Rezayee, Homa Serpoush, and Peyman Eini independently assessed articles and extracted data. Mohammad Rezayee performed the statistical analysis. All authors read and approved the final manuscript.

Ethics Statement

The authors have nothing to report.

Consent

The authors have nothing to report.

Conflicts of Interest

The authors declare no conflicts of interest.

Supporting information

Table S1: Search syntax for different databases.

EDM2-8-e70111-s001.docx (14.2KB, docx)

Eini P., Eini P., Serpoush H., and Rezayee M., “Diagnostic Performance of Machine Learning Algorithms for Predicting Heart Failure in Diabetic Patients: A Systematic Review and Meta‐Analysis,” Endocrinology, Diabetes & Metabolism 8, no. 5 (2025): e70111, 10.1002/edm2.70111.

Funding: The authors received no specific funding for this work.

Data Availability Statement

Data sharing not applicable to this article as no datasets were generated or analysed during the current study.

References

1. Pandey A., Khan M. S., Patel K. V., Bhatt D. L., and Verma S., "Predicting and Preventing Heart Failure in Type 2 Diabetes," Lancet Diabetes and Endocrinology 11, no. 8 (2023): 607–624.
2. Rawshani A., Rawshani A., Sattar N., et al., "Relative Prognostic Importance and Optimal Levels of Risk Factors for Mortality and Cardiovascular Outcomes in Type 1 Diabetes Mellitus," Circulation 139, no. 16 (2019): 1900–1912.
3. Razaghizad A., Oulousian E., Randhawa V. K., et al., "Clinical Prediction Models for Heart Failure Hospitalization in Type 2 Diabetes: A Systematic Review and Meta‐Analysis," Journal of the American Heart Association 11, no. 10 (2022): e024833.
4. Fitchett D., Inzucchi S. E., Cannon C. P., et al., "Empagliflozin Reduced Mortality and Hospitalization for Heart Failure Across the Spectrum of Cardiovascular Risk in the EMPA‐REG OUTCOME Trial," Circulation 139, no. 11 (2019): 1384–1395.
5. Bai Q., Chen H., Gao Z., et al., "Advanced Prediction of Heart Failure Risk in Elderly Diabetic and Hypertensive Patients Using Nine Machine Learning Models and Novel Composite Indices: Insights From NHANES 2003‐2016," European Journal of Preventive Cardiology (2025), 10.1093/eurjpc/zwaf081.
6. Segar M. W., Vaduganathan M., Patel K. V., et al., "Machine Learning to Predict the Risk of Incident Heart Failure Hospitalization Among Patients With Diabetes: The WATCH‐DM Risk Score," Diabetes Care 42, no. 12 (2019): 2298–2306.
7. Sajjadi S. M., Mohebbi A., Ehsani A., et al., "Identifying Abdominal Aortic Aneurysm Size and Presence Using Natural Language Processing of Radiology Reports: A Systematic Review and Meta‐Analysis," Abdominal Radiology 50 (2025): 3885–3899.
8. Kanda E., Suzuki A., Makino M., et al., "Machine Learning Models for Prediction of HF and CKD Development in Early‐Stage Type 2 Diabetes Patients," Scientific Reports 12, no. 1 (2022): 20012.
9. Page M. J., McKenzie J. E., Bossuyt P. M., et al., "The PRISMA 2020 Statement: An Updated Guideline for Reporting Systematic Reviews," BMJ (Clinical Research Ed.) 372 (2021): n71.
10. Moons K. G. M., Damen J. A. A., Kaul T., et al., "PROBAST+AI: An Updated Quality, Risk of Bias, and Applicability Assessment Tool for Prediction Models Using Regression or Artificial Intelligence Methods," BMJ (Clinical Research Ed.) 388 (2025): e082505.
11. Angraal S., Mortazavi B. J., Gupta A., et al., "Machine Learning Prediction of Mortality and Hospitalization in Heart Failure With Preserved Ejection Fraction," JACC Heart Failure 8, no. 1 (2020): 12–21.
12. Lee S., Zhou J., Wong W. T., et al., "Glycemic and Lipid Variability for Predicting Complications and Mortality in Diabetes Mellitus Using Machine Learning," BMC Endocrine Disorders 21, no. 1 (2021): 94.
13. Longato E., Fadini G. P., Sparacino G., Avogaro A., Tramontan L., and Di Camillo B., "A Deep Learning Approach to Predict Diabetes' Cardiovascular Complications From Administrative Claims," IEEE Journal of Biomedical and Health Informatics 25, no. 9 (2021): 3608–3617.
14. Abegaz T. M., Baljoon A., Kilanko O., Sherbeny F., and Askal A. A., "Machine Learning Algorithms to Predict Major Adverse Cardiovascular Events in Patients With Diabetes," Computers in Biology and Medicine 164 (2023): 164.
15. Gandin I., Saccani S., Coser A., et al., "Deep‐Learning‐Based Prognostic Modeling for Incident Heart Failure in Patients With Diabetes Using Electronic Health Records: A Retrospective Cohort Study," PLoS One 18, no. 2 (2023): e0281878.
16. Wang Y., Hou R., Ni B., Jiang Y., and Zhang Y., "Development and Validation of a Prediction Model Based on Machine Learning Algorithms for Predicting the Risk of Heart Failure in Middle‐Aged and Older US People With Prediabetes or Diabetes," Clinical Cardiology 46, no. 10 (2023): 1234–1243.
17. Alimova D., Ikramov A., Trigulova R., Abdullaeva S., and Mukhtarova S., "Prediction of Diastolic Dysfunction in Patients With Cardiovascular Diseases and Type 2 Diabetes With Respect to COVID‐19 in Anamnesis Using Artificial Intelligence," in Proceedings of the 2023 7th International Conference on Medical and Health Informatics (Kyoto, Japan: Association for Computing Machinery, 2023), 61–65.
18. Mora T., Roche D., and Rodríguez‐Sánchez B., "Predicting the Onset of Diabetes‐Related Complications After a Diabetes Diagnosis With Machine Learning Algorithms," Diabetes Research and Clinical Practice 204 (2023): 110910.
19. Nabrdalik K., Kwiendacz H., Irlik K., et al., "Machine Learning Identification of Risk Factors for Heart Failure in Patients With Diabetes Mellitus With Metabolic Dysfunction Associated Steatotic Liver Disease (MASLD): The Silesia Diabetes‐Heart Project," Cardiovascular Diabetology 22, no. 1 (2023): 318.
20. Sang H., Lee H., Lee M., et al., "Prediction Model for Cardiovascular Disease in Patients With Diabetes Using Machine Learning Derived and Validated in Two Independent Korean Cohorts," Scientific Reports 14, no. 1 (2024): 14966.
21. Soh C. H., de Sá A. G. C., Potter E., Halabi A., Ascher D. B., and Marwick T. H., "Use of the Energy Waveform Electrocardiogram to Detect Subclinical Left Ventricular Dysfunction in Patients With Type 2 Diabetes Mellitus," Cardiovascular Diabetology 23, no. 1 (2024): 91.
22. Kwiendacz H., Huang B., Chen Y., et al., "Predicting Major Adverse Cardiac Events in Diabetes and Chronic Kidney Disease: A Machine Learning Study From the Silesia Diabetes‐Heart Project," Cardiovascular Diabetology 24, no. 1 (2025): 76.
23. Wändell P., Carlsson A. C., Eriksson J., Wachtler C., and Ruge T., "A Machine Learning Tool for Identifying Newly Diagnosed Heart Failure in Individuals With Known Diabetes in Primary Care," ESC Heart Failure 12, no. 1 (2025): 613–621.
24. Patel K. V., Segar M. W., Klonoff D. C., et al., "Optimal Screening for Predicting and Preventing the Risk of Heart Failure Among Adults With Diabetes Without Atherosclerotic Cardiovascular Disease: A Pooled Cohort Analysis," Circulation 149, no. 4 (2024): 293–304.
25. Echouffo‐Tcheugui J. B., Daya N., Ndumele C. E., et al., "Diabetes, GDF‐15 and Incident Heart Failure: The Atherosclerosis Risk in Communities Study," Diabetologia 65, no. 6 (2022): 955–963.
26. Berezin A. A., Lichtenauer M., Boxhammer E., Stöhr E., and Berezin A. E., "Discriminative Value of Serum Irisin in Prediction of Heart Failure With Different Phenotypes Among Patients With Type 2 Diabetes Mellitus," Cells 11, no. 18 (2022): 2794.
27. Said F., Arnott C., Voors A. A., Heerspink H. J. L., and ter Maaten J. M., "Prediction of New‐Onset Heart Failure in Patients With Type 2 Diabetes Derived From ALTITUDE and CANVAS," Diabetes, Obesity & Metabolism 26, no. 7 (2024): 2741–2751.
