Abstract
Background
Pulmonary hypertension (PH) is characterized by elevated pulmonary pressures and right ventricular strain. Pulmonary arterial hypertension (PAH), a subtype, has a poor prognosis, especially when diagnosis is delayed. Artificial intelligence (AI) methods, including machine learning (ML) and deep learning (DL), offer potential for non-invasive prediction and risk stratification.
Objective
This systematic review assesses ML and DL applications for non-invasive diagnosis, classification, and prognostication in PH and PAH, with emphasis on methodological quality and clinical applicability.
Methods
A PRISMA-guided search identified studies using ML or DL on non-invasive clinical, imaging, or biomarker data, including omics and laboratory parameters. Study characteristics and heterogeneity were synthesized using the SWiM framework. Risk of bias was assessed using PROBAST+AI across participant selection, predictors, outcomes, and analysis.
Results
Fifty-three studies were included. Most used clinical, echocardiographic, imaging, or molecular data. AUC values ranged from 0.71 to 1.00. DL approaches, especially convolutional neural networks, were increasingly applied but seldom externally validated. Nine studies were multicenter, four were prospective, one combined retrospective and prospective cohorts, and none were randomized controlled trials; the rest were retrospective single-center studies. In 15 studies, right heart catheterization was either not performed or not clearly reported. SWiM analysis showed substantial heterogeneity in study design and outcome definitions. According to PROBAST+AI, 44 studies (83%) had low risk of bias, though applicability concerns were common.
Conclusion
ML and DL models show promise for PH and PAH diagnosis and prognosis, but limitations in subclass differentiation, methodological transparency, and validation must be addressed in future research.
Supplementary Information
The online version contains supplementary material available at 10.1186/s40001-025-03557-5.
Keywords: Pulmonary hypertension, Pulmonary arterial hypertension, Artificial intelligence, Machine learning, Deep learning, Diagnostic and prognostic prediction models
Introduction
Pulmonary hypertension (PH) is a progressive, life-limiting condition defined by a mean pulmonary arterial pressure (mPAP) above 20 mmHg at rest, as confirmed by right heart catheterization (RHC) [1]. PH encompasses a spectrum of entities with distinct etiologies, pathophysiologies, and therapeutic implications. The international classification system endorsed by the World Symposium on PH and reaffirmed in the 2022 ESC/ERS guidelines subdivides PH into five groups, including pulmonary arterial hypertension (PAH), PH due to left heart or lung disease, chronic thromboembolic PH (CTEPH), and multifactorial forms [1]. Among these, PAH is a rare but severe vascular disorder characterized by progressive remodeling of the pulmonary arteries and increased pulmonary vascular resistance, often leading to right heart failure and systemic complications [1, 2].
Despite advances in targeted pharmacologic therapy and structured follow-up, survival remains limited, and many patients present at an advanced stage due to significant diagnostic delays [3–6]. These delays, largely attributable to non-specific symptoms such as dyspnea, fatigue, and reduced exercise tolerance [1, 5], average approximately 2.5 years and are associated with increased mortality and higher healthcare utilization [6].
Echocardiography remains the first-line screening tool, though its interpretation is prone to interobserver variability [1, 7], whereas RHC provides definitive diagnosis but is invasive and less feasible for large-scale screening [1, 8].
In this context, artificial intelligence (AI) has emerged as a promising tool to enhance early detection, improve risk stratification, and support clinical decision-making in PH [9–11]. Machine learning (ML) and deep learning (DL) algorithms can identify complex, nonlinear relationships within high-dimensional datasets, enabling earlier and more accurate recognition of disease patterns [9, 12, 13]. Recent studies have applied AI to detect PH using chest radiographs, electrocardiograms, and multimodal combinations of clinical and diagnostic data, with performance in some cases comparable to or exceeding that of physicians [14–17]. ML models trained on real-world electronic health records (EHRs) have shown potential to identify at-risk patients before clinical diagnosis, possibly helping to reduce diagnostic delay [18].
Beyond detection, AI has been used to predict disease severity, treatment response, and clinical outcomes based on diverse data inputs, including clinical parameters, imaging results, laboratory values, and omics data, often relying on multimodal architectures that approximate complex clinical reasoning [12, 16, 19].
However, existing studies vary considerably in methodological quality and clinical applicability [9, 12], and it remains uncertain whether models trained on mixed PH populations can generalize across subtypes [12, 20]. This systematic review synthesizes the current evidence on ML and DL applications in PH, focusing on modeled PH subtypes, data modalities, algorithmic techniques, reported outcomes, and methodological rigor, including validation strategies and overfitting control. Given that nuanced diagnostic distinctions between PH subtypes carry significant therapeutic implications, precise attribution is essential in AI research on PH to ensure clinically meaningful translation [1]. This review aims to provide an evidence-based overview and to guide future research at the intersection of PH phenotyping and advanced AI methodologies.
Methods
This systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines [21] and was prospectively registered in the International Prospective Register of Systematic Reviews (PROSPERO; registration number: CRD420251074202) [22]. The review focused on ML and DL applications in PH, emphasizing diagnostic, phenotypic, and prognostic use cases. Methodological aspects such as data types, algorithmic approaches, validation strategies, and subtype attribution were systematically assessed.
A comprehensive literature search was carried out in two sources: MEDLINE via PubMed and Google Scholar, the latter used to identify additional relevant studies. The complete search strategy, including specific Medical Subject Headings (MeSH) terms, free-text keywords, Boolean operators, and field specifications, is provided in Supplementary Table S1. The strategy was designed to identify peer-reviewed original research articles applying AI methods in PH. MeSH terms and free-text keywords such as “pulmonary hypertension”, “pulmonary arterial hypertension”, “machine learning”, “deep learning”, “artificial intelligence”, “diagnosis”, “phenotyping”, “prognosis”, “prediction”, “non-invasive”, “risk stratification”, “survival”, “mortality”, “electrocardiography”, “echocardiography”, “chest X-ray”, “computed tomography”, “magnetic resonance imaging”, and “electronic health records” were used. Boolean operators (“AND”, “OR”) were applied to combine related terms, and searches were conducted within titles, abstracts, and MeSH terms. The same search strategy was applied across both databases, with minor syntax adjustments for Google Scholar, which also indexes full-text information. The search was restricted to English-language publications published since 2016 to reflect the emergence of modern ML and DL approaches in PH research. Only studies involving human subjects were included. In addition to the database searches, reference lists of the electronically identified articles were screened manually, yielding six additional studies. The strategy was refined through iterative testing, adjusting keyword groupings and Boolean logic until all previously known eligible publications were consistently retrieved by the final search.
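The Boolean block structure described above can be illustrated with a short Python sketch. This is purely illustrative: the term lists are abridged from those given in the text, and the complete, database-specific strategy is the one provided in Supplementary Table S1.

```python
# Illustrative reconstruction of the Boolean block logic (abridged term
# lists; the complete strategy appears in Supplementary Table S1).
condition_terms = ['"pulmonary hypertension"', '"pulmonary arterial hypertension"']
method_terms = ['"machine learning"', '"deep learning"', '"artificial intelligence"']
task_terms = ['"diagnosis"', '"prognosis"', '"prediction"', '"risk stratification"']

def or_block(terms):
    # Synonyms within one concept block are OR-combined ...
    return "(" + " OR ".join(terms) + ")"

# ... and the concept blocks are then AND-combined into the final query.
query = " AND ".join(or_block(t) for t in (condition_terms, method_terms, task_terms))
print(query)
```

Grouping synonyms with OR before intersecting concepts with AND is what keeps recall high (any synonym suffices) while restricting results to the intersection of disease, method, and task.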
Studies were included if they met the predefined eligibility criteria. Specifically, studies were eligible if they (i) applied ML or DL techniques to predict or classify PH; (ii) used non-invasive data as input features, such as clinical parameters, laboratory results, electrocardiographic measurements, echocardiography, chest imaging, or other routinely collected non-invasive modalities; (iii) reported quantitative predictive performance metrics for diagnostic, classification, or prognostic tasks; and (iv) provided sufficient methodological detail to allow an assessment of model development, validation, and reproducibility.
Studies were excluded if they (i) were reviews, editorials, conference abstracts, or case reports; (ii) exclusively relied on invasive input features, such as hemodynamic parameters from RHC; or (iii) did not report relevant outcome metrics for model performance.
Two reviewers independently screened titles and abstracts for eligibility. Full-text articles of potentially eligible studies were then reviewed in detail. Disagreements were resolved by discussion and, if necessary, by consulting a third reviewer. The same two reviewers independently extracted data from each included study using a pre-defined data extraction form. Extracted information comprised authorship, year of publication, study location, population characteristics, PH subgroup investigated, data type and source, model type and structure, clinical objective (diagnostic classification, phenotypic differentiation, or prognostic prediction), model performance metrics including area under the receiver operating characteristic curve (AUC), and methods used for validation.
To systematically assess the risk of bias, methodological quality, and applicability of model development and evaluation, the updated Prediction Model Risk of Bias Assessment Tool for Artificial Intelligence (PROBAST+AI) was applied to all included studies [23]. The tool is tailored to AI-based prediction models and evaluates the risk of bias across four domains: participants, predictors, outcome, and analysis. It includes additional signaling questions tailored to ML workflows, addressing aspects such as model calibration, resampling methods, data leakage, and model explainability. To address domain-specific concerns in PH, the assessment was extended to evaluate whether studies applied guideline-based hemodynamic criteria for cohort labeling and adhered to consistent subtype classification according to current recommendations [1]. This approach ensured a robust appraisal of AI model quality in relation to the underlying ground truth and the respective PH populations.
All steps of study selection, data extraction, and quality appraisal were conducted in accordance with established best practices for systematic reviews of prediction model studies. Due to substantial methodological heterogeneity in model types, input features, and clinical endpoints, no meta-analysis was performed. Instead, findings were synthesized systematically. Studies were grouped according to clinical objective, algorithmic approach, outcome parameters, performance metrics, validation strategy, and PH subtype, wherever classification was possible based on the reported data and in accordance with the current ESC/ERS guidelines and Nice classification [1]. This synthesis approach followed the Synthesis Without Meta-analysis (SWiM) guideline to ensure transparent and structured reporting in the absence of a meta-analysis [24]. Study characteristics were extracted and systematically tabulated, incorporating relevant clinical and modeling features.
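For readers less familiar with the performance metrics extracted here, the following minimal, self-contained Python sketch (illustrative only, not drawn from any included study) shows how the two elements most frequently reported by the included models, the AUC and stratified k-fold cross-validation splits, can be computed from first principles:

```python
import random

def roc_auc(y_true, y_score):
    """AUC as the Mann-Whitney probability that a randomly chosen
    positive case receives a higher score than a random negative one."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_kfold(y, k, seed=0):
    """Yield (train, test) index lists, preserving class balance per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in set(y):
        idx = [i for i, y_i in enumerate(y) if y_i == label]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    for j in range(k):
        test = folds[j]
        train = [i for f in folds[:j] + folds[j + 1:] for i in f]
        yield train, test

# A perfectly separating score yields AUC = 1.0; chance level is ~0.5.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # prints 0.75
```

Framing the AUC as a rank statistic makes explicit why it is threshold-independent, and stratification explains why per-fold estimates remain stable even in the highly imbalanced cohorts (e.g., rare-disease screening) reviewed here.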
Results
Study selection and characteristics
A total of 53 studies met the predefined eligibility criteria and were included in this systematic review. The study selection process is illustrated in the PRISMA flow chart (Fig. 1). The corresponding PRISMA 2020 Checklist detailing adherence to reporting standards is provided in Supplementary Table S2.
Fig. 1.
PRISMA flowchart for study selection. This flowchart illustrates the study selection process in accordance with the PRISMA 2020 guidelines [21]. A total of 472 records were identified through database searches (MEDLINE via PubMed and Google Scholar), and an additional six studies were identified through manual reference searching of the previously selected studies. After removing 142 duplicates, 336 records remained for title and abstract screening. Of these, 263 were excluded based on predefined eligibility criteria. Seventy-two full-text articles were assessed for eligibility, of which 19 articles were excluded. Ultimately, 53 studies were included in the final qualitative synthesis. Reasons for exclusion at each stage are detailed in the flowchart
Results were synthesized narratively and organized by clinical objective, algorithmic approach, and outcome parameters in line with SWiM guidance. Study-level characteristics such as study design, sample size, PH classification, input data types, clinical objectives, validation strategies, model types, and performance metrics are systematically presented in Table 1. The table also summarizes outcome definitions, use of prognostic modeling, and key strengths and limitations for each study. Risk of bias and applicability concerns were assessed using PROBAST+AI and are detailed in Supplementary Table S3, structured by predefined domains.
Table 1.
Study characteristics and performance metrics
| Author (year) | PH group | Study group | Study design | Sample size | Key findings | Outcome measures | Prognostic model used | Model type | Diagnosis method | Validation method | Strengths and limitations |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabed et al. (2022) [25] | PAH | MPCA-based ML for Mortality Prediction in PAH | Retrospective cohort | 723 patients | MPCA-based features from CMR significantly improved 1-year mortality prediction (c-index: 0.83 vs. 0.71 with REVEAL) | c-index, ROC-AUC, Kaplan–Meier survival | MPCA, CMR features, REVEAL score | ML | RHC | Tenfold Cross-validation, Internal validation | Strengths: Transparent, clinically interpretable; Limitations: Retrospective, single center, no external validation |
| Anand et al. (2024) [26] | PH | ML for PH Diagnosis Using Echo | Retrospective cohort | 7853 patients | XGBoost model for PH detection achieved high AUC (0.83) and sensitivity (88%) with specificity (54%) | AUC, Accuracy, Sensitivity, Specificity, PPV, NPV | XGBoost | ML | RHC | Fivefold Cross-validation, Internal validation | Strengths: Large cohort, no need for TR jet velocity; Limitations: Retrospective, High PH prevalence in cohort, model performance drop in testing data |
| Aras et al. (2023) [27] | PAH, PH | DL ECG Detection of PH | Retrospective cohort | 24,470 patients | CNN model achieved high AUC for detecting PH (AUC: 0.89), sensitivity (0.79), and specificity (0.84). For pre-capillary PH, the model performed excellently (AUC: 0.91). For PAH, AUC was 0.88 | AUC, Sensitivity, Specificity, PPV, NPV, F1-Score | CNN | DL | RHC or Echo, RHC (subgroup) | Internal validation | Strengths: Large cohort, early detection capability (up to 2 years before diagnosis), potential for widespread clinical use with remote monitoring; Limitations: Retrospective design, misclassification potential for some PH subtypes due to broad inclusion criterion (TR-Velocity > 3.4 m/s), dependent on quality of ECG data |
| Argiento et al. (2024) [28] | PAH | ML for PAH Prediction | Retrospective cohort | 226 patients | Developed an ML algorithm for identifying PAH from anamnesis and non-invasive data. AUC of 83%, accuracy of 74% | AUC, Sensitivity, Specificity | Elastic-Net Regularized Generalized Linear Model | ML | RHC | Threefold Cross-validation, Internal validation | Strengths: Focus on high-risk populations, robust ML model. Limitations: Retrospective, Single center, unbalanced sample, smaller dataset |
| Bauer et al. (2021) [29] | PAH | ML for PAH Prediction Using Proteomics Data | Retrospective cohort | 157 | ML models using proteomics data showed significant potential in predicting PAH with high AUC, sensitivity, and specificity, outperforming traditional biomarkers | AUC, Sensitivity, Specificity | Random Forest | ML | RHC | Tenfold Cross-validation, Internal validation | Strengths: Use of proteomics for early detection, potential for personalized treatment; Limitations: Retrospective, sample size, and lack of external validation |
| Bordag et al. (2023) [30] | PAH, PH (left heart disease), PH (lung disease), CTEPH | ML for PH Prediction Using Lipidomics | Retrospective cohort | 233 patients | ML models using lipidomics identified diagnostic and prognostic biomarkers, with predictive potential (AUC 0.82–0.90) for PH | AUC, Sensitivity, Specificity | Random Forest, XGBoost | ML | RHC | Sevenfold cross-validation, External Validation | Strengths: Novel lipidomics approach, high diagnostic accuracy; Limitations: Small sample size, single center, potential bias in prognostic scores with mixed PH groups |
| Chettrit et al. (2019) [31] | COPD-related PH | DL System for PH Risk Stratification using chest CT | Retrospective cohort | 1285 chest CT studies | The DL model automated the measurement of pulmonary artery (PA) and aorta (Ao) diameters to assess the PA-to-Ao ratio, showing significant potential for PH risk stratification with high Pearson correlation (93% for Ao, 92% for PA) | Pearson correlation, Sensitivity, Specificity, PPV | CNN | DL | Diagnosis based on clinical criteria (RHC, Echo, and other diagnostic tests) | Cross-validation, Internal validation | Strengths: Fully automated, accurate measurements, high specificity for screening; Limitations: Retrospective design, reliance on contrast-enhanced CT scans, potential bias with mixed PH groups |
| Dawes et al. (2017) [66] | PH | ML of 3D Right Ventricular Motion for survival prediction in PH | Prospective cohort | 256 patients | Survival prediction improved with 3D right ventricular motion data. Model provided better prediction than conventional clinical measures (AUC: 0.73 vs. 0.60, P < 0.001) | AUC, Sensitivity, Survival Time | Principal Component Analysis, Supervised Learning | ML | RHC | Eightfold Cross-validation | Strengths: Incorporates 3D motion, better survival prediction, prospective design; Limitations: Retrospective data analysis |
| Diller et al. (2022) [32] | PAH | DL Framework for Detection and Prognostication of PAH | Retrospective cohort | 450 patients | DL model achieved 97.6% accuracy and 100% sensitivity in detecting PAH. It also provided prognostic insights with non-inferior performance compared to expert echocardiography | Sensitivity, Specificity, AUC, Cox-proportional hazard models | CNN-based segmentation and feature extraction; prognostic modelling via multivariable Cox regression | DL | RHC | Internal validation | Strengths: High accuracy, expert-level prediction, provides prognostic data; Limitations: Limited to expert center data, small number of normal controls, retrospective design |
| DuBrock et al. (2024) [33] | PAH | DL algorithm for early detection of PH based on 12-lead ECG | Retrospective cohort | 39,823 PH-likely patients, 219,404 controls | DL model achieved high accuracy for detecting PH, with an AUC of 0.92 at Mayo Clinic, and 0.88 at VUMC. The model was capable of predicting PH up to 5 years prior to diagnosis | AUC, Sensitivity, Specificity, PPV, NPV | CNN | DL | RHC or Echo | Internal validation, External validation (VUMC) | Strengths: High performance, early detection capability; Limitations: Retrospective design, reliance on RHC and TRV measurements, potential bias due to using echo and RHC for cohort definition |
| Duo et al. (2022) [34] | PAH | Gene expression-based diagnostic signature for PAH | Retrospective cohort | 73 PAH samples, 36 normal samples | A diagnostic signature (PDS) for PAH was constructed from key genes identified via WGCNA and LASSO. ROC analysis showed AUCs of 0.948 and 0.945 in two independent cohorts | Sensitivity, Specificity, AUC | LASSO | ML | RHC | External validation (GSE113439) | Strengths: High accuracy, identification of key biomarkers and immune landscape; Limitations: Limited sample size, experimental validation needed |
| Dwivedi et al. (2024) [35] | PAH and PH-LD | AI model for lung fibrosis quantification and survival prediction | Retrospective cohort | 521 patients | AI-quantified lung fibrosis on CT pulmonary angiograms was associated with increased mortality risk (C-index: 0.76). Combining AI with radiologic scoring improved survival prediction | C-index, Mortality | AI-based DL Model | DL | RHC | External validation | Strengths: AI model for accurate fibrosis quantification; Limitations: Retrospective study, reliance on external validation, potential bias from image acquisition variability |
| Errington et al. (2023) [36] | PAH | miRNA expression-based ML model for PAH diagnosis | Retrospective cohort | 107 patients | ML model based on miRNA expression showed high diagnostic accuracy (AUC: 0.85) for PAH and provided potential biomarkers for prognosis | AUC, Sensitivity, Specificity, PPV, NPV | SVM, Random Forest, LASSO, XGBoost, Ensemble, Rpart | ML | RHC | Tenfold Cross-validation, External validation | Strengths: High diagnostic accuracy, identification of miRNA biomarkers for PAH; Limitations: Limited external validation, retrospective design |
| Fortmeier et al. (2022) [37] | PH | XGBoost model for mPAP prediction | Retrospective cohort | 116 patients | XGBoost model based on echocardiographic parameters was able to predict mPAP and associated with 2-year all-cause mortality (HR 2.4) | Pearson correlation, survival | XGBoost | ML | RHC | Internal and external validation | Strengths: Cohort with both RHC and echo data; Limitations: Small sample size, retrospective design |
| Gawlitza et al. (2024) [38] | CTEPH | ML-based feature identification for hemodynamic endpoint prediction using CT | Retrospective cohort | 127 patients | The random forest model achieved AUC of 0.82 for mPAP prediction and 0.74 for PA SaO2 prediction, using quantitative and qualitative CT features | AUC, Sensitivity, Specificity, PPV, NPV | Random Forest | ML | RHC | Cross-validation, internal validation | Strengths: Non-invasive risk stratification using CT features; Limitations: Small cohort, Retrospective design |
| Imai et al. (2024) [14] | PAH | DL algorithm for PAH detection using CXR | Retrospective cohort | 145 PAH patients, 260 controls | The DL model (ResNet50) achieved AUC of 0.988 for PAH detection using CXR images, outperforming experienced doctors (AUC 0.945) | AUC, Sensitivity, Specificity | ResNet50 | DL | RHC | Fourfold cross-validation, Internal validation | Strengths: High diagnostic accuracy, non-invasive and cost-effective; Limitations: Small sample size, single center study, potential image quality variability, retrospective design |
| Kanwar et al. (2020) [41] | PAH | Bayesian network (PHORA) for PAH risk stratification | Retrospective cohort | 3515 patients from the REVEAL registry | The PHORA Bayesian network model achieved AUC of 0.80 for 1-year survival, outperforming the REVEAL 2.0 model (AUC 0.76). It was validated externally in two registries with an AUC of 0.74 and 0.80 | AUC, Sensitivity, Specificity, NPV, PPV | Bayesian network (Tree-augmented Naïve Bayes—TAN) | ML | RHC | Internal validation (REVEAL registry), External validation (COMPERA, PHSANZ) | Strengths: Improved discriminatory ability, can handle missing data, validated in multiple cohorts; Limitations: Survival bias, missing data in registries |
| Kheyfets et al. (2023) [68] | PAH | Random forest model for PAH survival prediction using clinical and biomarker data | Prospective cohort | 167 PAH patients | The random forest model predicted 4-year survival risk with AUC 0.94 (internal validation) and AUC 0.81 (external validation). It identified novel biomarkers such as IL-2, IL-9, and 6MWD as significant predictors of risk | AUC, Sensitivity, Specificity | Random Forest | ML | RHC | Internal validation (Stanford cohort), External validation (Sheffield cohort) | Strengths: Novel approach combining clinical and biomarker data for personalized PAH prognostication, prospective design; Limitations: Relatively small cohort, biomarker data from a single center |
| Kiely et al. (2019) [42] | iPAH | Predictive model based on HCRU to identify patients at risk for iPAH | Retrospective cohort | 709 iPAH patients and 2,812,458 non-iPAH patients | The Gradient Boosting Trees model achieved 99.99% specificity and 14.10% sensitivity, identifying 100 iPAH cases among 969 flagged patients | Sensitivity, Specificity, PPV, NPV | Gradient Boosting Trees (XGBoost) | ML | RHC | Fivefold cross-validation, internal validation | Strengths: Cost-effective, real-world data-based model for rare disease screening; Limitations: Low sensitivity, narrow scope of healthcare data used |
| Kogan et al. (2023) [18] | PAH, CTEPH, and other PH types | XGBoost model for early PH detection using EHR data | Retrospective cohort | 115,822 patients, 11,279,478 controls | The XGBoost model achieved AUC 0.92 for PH prediction. The model also predicted PH subgroups (PAH: 0.79–0.90 AUC, CTEPH: 0.87–0.96 AUC) | AUC, Sensitivity, PPV | XGBoost | ML | Echo or RHC | Threefold cross-validation, internal validation | Strengths: Large cohort, uses real-world EHR data; Limitations: PH diagnosis not uniformly confirmed by RHC, potential bias from coding algorithms, retrospective design |
| Kusunose et al. (2020) [45] | PH | DL model for PH detection using CXR | Retrospective cohort | 900 patients | CNN achieved AUC of 0.71 for PH detection using CXR images, improving significantly compared to human observers | AUC, NPV | CNN | DL | RHC | Tenfold cross-validation, internal and external validation | Strengths: AI-driven approach with CXR, non-invasive screening; Limitations: Moderate accuracy, single-center |
| Leha et al. (2019) [47] | PAH, PH due to left heart disease, PH due to lung disease and hypoxia, PH with unclear and multi-factorial | ML for PH prediction using echo | Retrospective cohort | 90 patients (68 with confirmed PH, 22 without PH) | AUC for SVM: 0.83, Random Forest Regression: 0.87 for predicting PH from echo | AUC, Sensitivity, Specificity, PPV, NPV | SVM, Random Forest, Lasso, boosted classification trees | ML | RHC | Threefold cross-validation, Internal validation | Strengths: High AUC, broad echocardiographic data; Limitations: Small cohort, retrospective design |
| Liao et al. (2023) [69] | PH due to left heart disease, PAH, CTED | ML for PH detection using echo | Retrospective cohort | 346 patients | The ML model achieved AUC 0.945 in internal validation and AUC 0.950 in external validation for predicting PH from echocardiographic images | AUC | Linear regression, LightGBM, CatBoost | ML | RHC | Cross-validation (50%), Internal and external validation | Strengths: High AUC, robust model for PH detection from echocardiographic images, external validation; Limitations: relatively small sample size, no data on ethnic diversity, Possible bias in image quality, retrospective design |
| Lungu et al. (2016) [49] | PAH, PH due to left heart disease, PH due to lung disease and hypoxia, CTEPH, PH with unclear and multi-factorial | MRI-based ML model for PH detection | Retrospective cohort | 72 patients | The ML model achieved 92% accuracy in diagnosing PH using MRI-derived parameters and decision tree analysis | AUC, Sensitivity, Specificity, PPV, NPV | Random Forest Classification | ML | RHC | Leave-one-out cross-validation | Strengths: Non-invasive diagnostic tool, high accuracy with MRI; Limitations: Small sample size, lack of internal or external validation dataset, retrospective design, single center |
| Matsunaga et al. (2024) [50] | CTEPH | ML models for predicting mPAP in CTEPH | Retrospective cohort | 136 patients | The linear regression model achieved the highest R2 value of 0.388. Models including age, BNP, TRPG, CXR performed better than traditional methods using TRPG alone | R2, RMSE, MAE | Linear regression, Decision Tree, SVR, KNN, Random Forest, XGBoost | ML | RHC | Internal validation | Strengths: Multiple model types tested, multivariable model increases prediction accuracy; Limitations: Small sample size, no external validation |
| Murayama et al. (2024) [51] | PAH, CTEPH | DL model for RVEF estimation using 2D echo | Retrospective cohort | 93 patients | The DL model (3D-ResNet50) predicted RVEF with a mean absolute error of 7.67% and showed AUC 0.84 for detecting severe RV dysfunction | AUC, Mean absolute error | 3D-ResNet50 CNN model | DL | RHC | Fivefold cross-validation, Internal validation | Strengths: Automated tool for RVEF prediction, high diagnostic accuracy, Echo-based; Limitations: Small sample size, retrospective design, proportional error observed |
| Nemati et al. (2024) [20] | PH | ML model for PH detection using orthogonal voltage gradient (OVG) and photoplethysmographic (PPG) signals | Retrospective cohort | 488 patients | AUC of 0.93, sensitivity of 87%, specificity of 83% for PH detection using non-invasive sensors (OVG & PPG signals) | AUC, Sensitivity, Specificity | Elastic Net, Random Forest | ML | RHC | Out-of-fold cross-validation, Internal validation | Strengths: Non-invasive, point-of-care, high sensitivity and specificity, generalizable; Limitations: Relies on specific features, limited to point-of-care application, retrospective design, still requires further validation |
| Ong et al. (2020) [52] | PH, PAH | Claims-based ML model for PH detection using EHR and Medicare claims | Retrospective cohort | 550 patients | ML models outperformed rule-based algorithms for identifying PH in administrative claims, achieving an AUC of 0.88 | AUC, Sensitivity, Specificity, PPV, NPV | Lasso, Random Forest, Gradient boosting machine | ML | RHC | Tenfold cross-validation, Internal validation, Bootstrap | Strengths: High performance, real-world healthcare data, multicenter design; Limitations: Relies on administrative claims, no full external validation, retrospective design |
| Priya et al. (2021) [53] | PAH, PH due to left heart disease, PH due to lung disease and hypoxia, CTEPH, PH with unclear and multi-factorial | Cardiac MRI-based radiomics for PH detection using texture features | Retrospective cohort | 72 patients (42 PH, 30 controls) | The radiomics-based model achieved AUC 0.862 for PH detection and AUC 0.918 for PH patients with preserved LVEF in subgroup analysis | AUC, Sensitivity, Specificity, Accuracy | MLP, Random Forest, SVM, Elastic Net, Ridge | ML | RHC | Fivefold cross-validation | Strengths: Non-invasive, good diagnostic performance; Limitations: Small sample size, no external validation, single institution retrospective design |
| Priya et al. (2021) [54] | PAH, PH due to left heart disease, PH due to lung disease and hypoxia, CTEPH, PH with unclear and multi-factorial | Cardiac MRI-derived radiomics with DAFIT for PH detection | Retrospective cohort | 82 patients (42 PH, 40 controls) | DAFIT model with combined LV and RV masks performed with AUC 0.958, outperforming other models and showing superior predictive performance in PH detection | AUC | Linear, logistic, ridge, elastic net, and LASSO regression, Neural network, SVM, MLP, Random Forest, generalized boosted regression model | ML | RHC | Fivefold cross-validation | Strengths: High AUC, non-invasive, Data augmentation approach improves reproducibility; Limitations: Small sample size, Lack of external validation, limited PH subgroup, retrospective design |
| Schuler et al. (2022) [56] | PAH | ML model using ICD-9/10 codes, RHC, and PAH medication for PAH prediction | Retrospective cohort | 194 PAH patients and 786 controls | ML algorithm achieved sensitivity 0.88, specificity 0.93, PPV 0.89, NPV 0.92 in identifying PAH using administrative claims data | Sensitivity, Specificity, PPV, NPV, AUC | Random Forest, XGBoost, Elastic Net | ML | RHC | Tenfold cross-validation, Internal and external validation | Strengths: High sensitivity and specificity, External validation, non-invasive administrative data use; Limitations: Relies on administrative data, retrospective design |
| Shikhare et al. (2022) [57] | CTEPH | ML-based algorithm for right-to-left ventricle ratio (dRV/dLV) prediction from CTPA | Retrospective cohort | 125 patients | ML-based algorithm performed well with a strong correlation of r = 0.96 between predicted and manual dRV/dLV, associated with long ICU length of stay | AUC, Sensitivity, ICU length of stay | Neural Networks, CNN | ML | RHC | – | Strengths: High correlation with manual measurements, predictive of ICU length of stay, non-invasive; Limitations: 20% algorithm failure, small cohort, no validation on test set, single center, retrospective design |
| Suvon et al. (2023) [58] | PAH | Multimodal learning for mortality prediction using EHR, echo, and MRI data | Retrospective cohort | 2563 patients | The multimodal model combined numerical imaging features, categorical features, and textual features from EHR data, achieving AUC 0.89 for one-year mortality prediction | AUC, Sensitivity, Specificity, PPV, NPV | Bidirectional Encoder Representations from Transformers (BERT), MLP | ML | RHC | Tenfold cross-validation, Internal validation | Strengths: Multimodal approach, high AUC, utilizes real-world data; Limitations: Missing data, small class imbalance, retrospective design |
| Sweatt et al. (2019) [70] | PAH | Immune phenotypes classification using proteomic profiles | Prospective obser-vational | 385 patients (discovery: 281, validation: 104) | Identified 4 immune clusters with distinct cytokine profiles using unsupervised ML, which correlated with clinical outcomes and 5-year survival | Survival Rate, Kaplan–Meier estimates, Cytokine levels | Consensus Clustering, Partial Correlation Networks | ML | RHC | External validation | Strengths: Unsupervised phenotyping, identifies immune phenotypes, links to prognosis, prospective multicenter design, external validation; Limitations: One-time point sampling, no dynamic monitoring |
| Swift et al. (2020) [59] | PAH | Tensor-based ML for CMR feature extraction to predict PAH | Retro-spective cohort | 220 patients (150 with PAH, 70 with no PH) | Tensor-based ML approach showed AUC = 0.92 for PAH diagnosis using CMR data, identifying new diagnostic features | AUC, Sensitivity, Specificity, PPV, NPV | Tensor-based ML, Multilinear Subspace Learning (MPCA) | ML | RHC | Tenfold cross-validation | Strengths: High diagnostic accuracy, innovative approach using CMR data; Limitations: Small sample size, requires CMR, single center, no external validation retrospective design |
| Swinnen et al. (2023) [60] | PAH vs. PH due to left heart disease (PH-LHD) | Differentiation of PAH from PH-LHD using ML on noninvasive data | Retro-spective cohort | 344 patients | Random Forest-based model showed sensitivity of 64% and 100% specificity for PH-LHD detection; outperforming the Jacobs score | AUC, Sensitivity, Specificity, PPV, NPV | Random Forest, Logistic Regression | ML | RHC | Tenfold cross-validation, Internal validation | Strengths: Highly specific model, non-invasive approach to differentiate PAH vs PH-LHD; Limitations: Retrospective design, single center |
| Zhang et al. (2023) [63] | PAH, PH due to left heart disease, PH due to lung disease and hypoxia, CTEPH, PH with unclear and multi-factorial | ML-based PAP prediction from CTPA | Retro-spective cohort | 55 patients | Developed ML model using CTPA for the automatic evaluation of PAP. Achieved good consistency between predicted and manual measurements for mPAP, sPAP, dPAP | Intraclass correlation coefficient (ICC), AUC, mPAP, sPAP, dPAP, TPR | XGBoost, SVM, CatBoost | ML, DL | RHC | Tenfold cross-validation | Strengths: Accurate PAP prediction and segmentation via CTPA; Limitations: Small sample size, retrospective design |
| Zhao et al. (2025) [71] | PH (pre- and postcapillary) | Multimodal DL for PH detection from EHR, echo, and CXR | Prospective and retro-spective design | 2451 patients | Developed MMF-PH model integrating CXR, ECG, echo, and clinical data; outperformed Echo in PH screening with higher specificity and NPV across datasets | Accuracy, Precision, Sensitivity, Specificity, NPV, F1, AUROC, AUPRC | Multimodal DL (MMF-PH) | DL | RHC | Internal and external validation | Strengths: Robust diagnostic accuracy, non-invasive PH screening, multicenter design, partially prospective design; Limitations: Small external validation group, PH subtypes were not comprehensively classified, overfitting potential |
This table summarizes the characteristics of the studies included in the systematic review. The “Author (year)” column lists the lead author and the publication year of each study. The “PH group” column identifies the PH subtype studied. The “Study group” column provides a brief description of the study’s focus. “Study design” describes the methodology used in each study, whether retrospective or prospective. “Sample size” lists the number of participants included in each study. The “Key findings” column highlights the primary outcomes or findings. “Outcome measures” refers to the specific performance metrics used in each study. “Prognostic model used” details the type of model applied for prediction or prognosis. The “Model type” column specifies whether the model was ML or DL. “Diagnosis method” outlines the diagnostic methods used for PH in the study. “Validation method” describes the validation strategy used. Finally, the “Strengths and limitations” column provides insights into the strengths and weaknesses of each study
The majority of studies were retrospective (48 studies, 90.6%) [14–18, 20, 25–65] and single-center (44 studies, 83.0%) [14–16, 18, 20, 25–40, 42–45, 47, 49–51, 53–63, 66–69] in design. Nine studies (17.0%) were conducted across multiple centers [17, 41, 46, 48, 52, 64, 65, 70, 71], four (7.5%) were designed prospectively [66–68, 70], and one (1.9%) included both retrospective and prospective cohorts [71]. All included studies were published between 2016 and 2025. No randomized controlled trials were identified. Study populations included patients with either PAH or broader PH, with varying degrees of diagnostic certainty and subtype attribution (see Table 1 for full study-level details).
The primary clinical objectives varied across studies: 47 studies (88.7%) addressed diagnostic classification [14–18, 20, 26–56, 59, 61–65, 67, 69–71], nine studies (17.0%) aimed at prognostic prediction [25, 31, 32, 35, 41, 57, 58, 66, 68], and one study (1.9%) focused on phenotypic subgroup differentiation [60]. There was some overlap, as several studies pursued more than one objective. Data sources were heterogeneous and included clinical variables, echocardiographic data, electrocardiograms (ECG), chest imaging [chest X-ray (CXR), computed tomography (CT), magnetic resonance imaging (MRI)], laboratory parameters, and omics-based inputs. The latter were employed in seven studies (13.2%), including four based on proteomic or transcriptomic data [29, 34, 36, 70], two on radiomic features [53, 54], and one on lipidomics [30] (see Table 1).
Algorithmic approaches and input modalities
Among the 53 included studies, 32 (60.4%) employed ML models such as random forests, support vector machines or gradient boosting. DL models, particularly convolutional neural networks (CNNs), were applied in 18 studies (34.0%), mainly for image-based classification tasks involving CXR, CT scans, echocardiographic images, MRI, or ECG data. Three studies (5.7%) combined ML and DL methods [17, 63, 64]. An increasing number of studies adopted multimodal frameworks that integrated structured clinical data with unstructured sources such as imaging or free-text reports. Input features differed considerably between studies. While most studies relied on clinical, imaging, and echocardiographic data, seven studies (13.2%) incorporated ECG-derived parameters [16, 17, 27, 33, 46, 48, 71]. Eight studies (15.1%) used biomarker data [29, 34, 36, 39, 61, 62, 68, 70]. Data preprocessing strategies, feature selection methods, and hyperparameter tuning procedures were reported inconsistently (Tables 1, 2).
Table 2.
Study characteristics and performance metrics—studies without explicit RHC confirmation for diagnosis
| Author (year) | PH group | Study group | Study design | Sample size | Key findings | Outcome measures | Prognostic model used | Model type | Diagnosis method | Validation method | Strengths and limitations |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Guo et al. (2025) [67] | PH | DL model based on phonocardiograms for PH screening | Prospective cohort study | 985 patients | The model achieved an AUC of 0.79 for detecting elevated PASP ≥ 40 mm Hg, with sensitivity of 0.73 and specificity of 0.74. Performance was better when using a per-patient approach (AUC 0.82) | AUC, Sensitivity, Specificity, PPV, NPV | CNN | DL | Echo | Fivefold cross-validation, internal validation | Strengths: Non-invasive, low-cost screening tool for PH using a digital stethoscope, prospective design; Limitations: Echocardiographic PASP used as ground truth instead of RHC |
| Han et al. (2024) [15] | PAH-CHD (pulmonary arterial hypertension in congenital heart disease) | AI model based on chest radiographs (CXR) for PAH-CHD diagnosis | Retrospective study | 3255 radiographs | AI model achieved AUC 0.948 for CHD diagnosis and AUC 0.778 for PAH-CHD detection. With AI assistance, radiologists’ performance improved significantly for both diagnoses | AUC, Sensitivity, Specificity, Accuracy, F1 Score | ResNet18 (DL) | DL | Echo (CHD diagnosis), clinical reports | Fivefold cross-validation, internal validation cohort | Strengths: Non-invasive, easy-to-perform CXR with AI assistance; Limitations: Single center, PAH diagnosed by echo, small sample for specific CHD types, retrospective design |
| Hu et al. (2023) [39] | PAH | ML-based biomarker identification for PAH | Retrospective cohort | 3 lung tissue samples from PAH patients | Identification of gene biomarkers that reliably distinguished PAH from controls | AUC | Gradient boosting decision tree | ML | Bioinformatics analysis, gene expression data from public datasets | Fivefold cross-validation, external dataset validation (GSE53408) | Strengths: Comprehensive bioinformatics approach and experimental validation; Limitations: Small sample size, potential overfitting due to limited data |
| Hyde et al. (2023) [40] | PAH | Claims-based ML algorithm for PAH identification | Retrospective cohort | 1339 PAH and 4222 non-PAH patients | The random forest model distinguished PAH from non-PAH patients with AUC of 0.84 for 6 months prior to diagnosis, showing promising early identification capability | AUC, Recall, Precision, Accuracy | Random Forest | ML | Claims data-based (ICD-10 codes for PAH or PH, outpatient claims) | Fivefold cross-validation, internal validation | Strengths: Claims data-based approach for early PAH (or PH) identification, real-world evidence; Limitations: Potential biases in claims data (PAH vs PH), missing data |
| Kishikawa et al. (2025) [17] | PH | Ensemble learning model for PH detection using ECG, CXR, and BNP | Retrospective cohort | 71,826 ECG data points, 4718 CXR data points, 4718 BNP data points | AUC 0.872 for ensemble model; improves cardiologists’ detection accuracy for PH from 65 to 74% using ECG, CXR, and BNP data | AUC, Sensitivity, Specificity, Accuracy, PPV, NPV | Ensemble learning model | ML, DL | Echo | Internal validation | Strengths: Multimodal model, multicenter design, improves accuracy in detecting PH; Limitations: Only cardiologists tested; small cohort; unspecific patient population; only echocardiographic diagnosis without subtype classification, potentially limiting treatment decisions, retrospective design |
| Kivrak et al. (2023) [43] | PAH, PH due to left heart disease, PH due to lung disease and hypoxia, PH with unclear and multi-factorial mechanism, and non-PH | AI-based classification of PH using chest X-ray images | Retrospective cohort | 6642 X-ray images from 2005 patients | The DL model (EfficientNetb0) achieved accuracy of 86.14% and AUC of 0.945 for PH detection | Accuracy, Recall, Precision, F1 Score, AUC | EfficientNetb0, SVM | DL | CXR and clinical findings | Internal validation | Strengths: High performance with CXR for PH classification; Limitations: Unbalanced dataset, retrospective design, black-box AI, no reliable PH diagnosis (no RHC) |
| Kusunose et al. (2022) [44] | Exercise-induced PH | DL model for PH detection using CXR | Retrospective cohort | 142 patients | The DL model achieved an AUC of 0.71, adding predictive value over clinical and echocardiographic parameters at rest and improving AUC from 0.65 to 0.74 | AUC | DL (Capsule Network with residual blocks) | DL | AI model | Tenfold cross-validation | Strengths: Non-invasive detection of exercise-induced PH using CXR and AI; Limitations: Small cohort, no RHC for diagnosis, black-box nature of the model, retrospective design |
| Kwon et al. (2020) [46] | PH | DL model for PH prediction using ECG | Retrospective cohort | 38,241 patients (including 4096 PH patients) | The AI algorithm achieved AUC of 0.859 (internal validation) and 0.902 (external validation) | AUC, Sensitivity, NPV, PPV | DL (ensemble neural network, CNN) | DL | Echo | Internal and external validation | Strengths: High accuracy using ECG data for PH detection, multicenter cohort, external validation; Limitations: No RHC for PH confirmation, potential bias from data imbalances |
| Liu et al. (2025) [48] | PH | DL model combining ECG and CXR for elevated PAP detection | Retrospective cohort | 85,193 patients from Hospital A, 16,736 patients from Hospital B | The DL model achieved AUC 0.8644 in internal validation and AUC 0.8734 in external validation for detecting elevated PAP using a combination of ECG and CXR. It also predicted future left ventricular dysfunction and cardiovascular mortality | AUC, Sensitivity, Specificity, PPV, NPV, Hazard Ratio | CNN, XGBoost | DL | Echo | Internal and external validation | Strengths: High diagnostic accuracy and NPV, integrates ECG and CXR for early PH detection, external validation, multicenter design; Limitations: No RHC confirmation, retrospective design |
| Liu et al. (2022) [16] | PH | AI model using ECG and Echo for PH detection | Retrospective cohort | 41,097 patients | The AI model achieved AUC 0.88 for elevated PAP detection and predicted cardiovascular mortality. It outperformed conventional ECG diagnosis by cardiologists | AUC, Sensitivity, Specificity, Accuracy, Hazard Ratio | Neural network | DL | Echo | Tenfold cross-validation, internal and external validation | Strengths: Good diagnostic accuracy (AUC 0.88), robust prediction for cardiovascular mortality, validated externally, large sample size; Limitations: No RHC confirmation, retrospective design, possible cohort-related biases |
| Ragnarsdottir et al. (2024) [55] | PH in newborns | Echo-based multi-view DL for predicting and classifying PH | Retrospective cohort | 270 newborns | Explainable multi-view DL model for predicting and classifying PH severity with F1-score of 0.84 for severity and 0.92 for binary detection. Results demonstrated that multi-view and spatio-temporal analysis significantly improved prediction | F1-score, AUROC, Accuracy, Recall, Precision | CNN | DL | Echo | Tenfold cross-validation, internal validation | Strengths: First automated PH severity prediction in newborns using echo, explainable model, high performance metrics; Limitations: Data imbalance, limited to newborns, retrospective design |
| Yang et al. (2024) [61] | PAH and PH (unclear) | Gene expression data from 65 samples (41 PAH, 24 controls) from GEO datasets GSE113439 and GSE15197 were used for PAH prediction | Retrospective study | 274 (unclear) | Lasso combined with linear discriminant analysis achieved the best feature selection performance (AUC = 0.741); the resulting diagnostic model based on selected hub genes reached an AUC of 0.87 | AUC | 113 ML algorithms | ML | Unclear | Cross-validation | Strengths: High AUC, well-selected biomarkers; Limitations: Small sample size, unclear use of RHC for diagnosis in dataset, lack of diverse validation datasets, retrospective design, several methodological limitations |
| Zeng et al. (2021) [62] | PAH | Identification of biomarkers and immune infiltration analysis in IPAH using bioinformatics | Retrospective cohort | 74 patients | Identified HBB, RNASE2, S100A9, and IL1R2 as biomarkers with high diagnostic value (AUC = 1) for IPAH detection. Immune infiltration differences noted between IPAH and controls | AUC, Sensitivity, Specificity, ROC curve | SVM-recursive feature elimination, Lasso | ML | Unclear | Tenfold cross-validation, external validation | Strengths: Accurate biomarkers, immune infiltration analysis; Limitations: Small dataset, relies on bioinformatics datasets, no real-time monitoring, unclear use of RHC for diagnosis in dataset, retrospective design |
| Zhao et al. (2024) [64] | CTEPH | Automated CTEPH detection using non-contrast CT scans | Retrospective cohort | 300 patients | Developed a cascaded network with multiple instance learning using non-contrast CT scans, achieving an AUC of 0.807 and sensitivity of 0.795 in detecting CTEPH | AUC, Sensitivity, Specificity, Accuracy | ResNet-18 CNN | ML, DL | CTEPH diagnosis based on MSKCC Q-SPECT/CT and Modified PIOPED II criteria | Fivefold cross-validation, external validation | Strengths: Non-invasive approach with no additional annotations required, high diagnostic accuracy, multicenter design; Limitations: External validation is limited, as the second cohort included only healthy subjects |
| Zou et al. (2020) [65] | PH | DL-based PH detection and PASP prediction from CXR | Retrospective cohort | 762 patients | DL approach using frontal CXR to screen for PH with high AUC (0.970) on internal test, 0.967 on external test | AUC, Sensitivity, Specificity, PPV, NPV, MAE | InceptionV3, Xception, ResNet50 | DL | Echo | Eightfold cross-validation, internal and external validation | Strengths: High diagnostic accuracy, multicenter design, external validation; Limitations: Small sample size for external validation, overfitting potential, PH diagnosis based on Echo without RHC confirmation |
This table summarizes the characteristics and performance metrics of the 15 studies that either did not perform RHC, did not explicitly report its use, or replaced it with echocardiography alone for diagnostic confirmation. These studies are presented separately because the absence of invasive confirmation represents a methodological limitation that may affect the diagnostic ground truth. Columns include “Author (year)” (lead author and publication year), “PH group” (pulmonary hypertension subtype studied), “Study group” (study focus), “Study design” (retrospective or prospective), “Sample size,” “Key findings,” “Outcome measures,” “Prognostic model used,” “Model type” (ML or DL), “Diagnosis method,” “Validation method,” and “Strengths and limitations,” summarizing key methodological aspects and performance outcomes
Diagnostic and classification models
Model performance and validation
Reported model performance varied according to study objective, input modality and algorithmic approach. For example, Imai et al. [14] developed a DL model based on CXR images, achieving an AUC of 0.988 with a sensitivity of 0.93 and specificity of 0.98, outperforming experienced physicians in detecting PAH [14]. Similarly, DuBrock et al. [33] demonstrated that an ECG-based CNN could predict PH up to five years before clinical diagnosis (AUC 0.92 at diagnosis, remaining ≥0.80 up to 18 months pre-diagnosis) across two independent cohorts, highlighting the potential of AI for early, non-invasive screening and disease detection [72]. AUC values ranged from 0.71 to 1.00 across both ML and DL models [32, 44, 45, 62]. Diagnostic model performance varied substantially across input domains. CXR and CT-based models generally achieved moderate AUCs (for example, CXR: 0.71 in Kusunose et al. 2020/2022 [44, 45]; CT for CTEPH detection: 0.81 in Zhao et al. 2024 [64]), whereas the best-performing CXR algorithms reached very high accuracy (Imai 2024 0.988 [14]; Zou 2020 0.970/0.967 internal/external [65]). ECG-based models consistently performed in the high range, typically 0.86–0.92 (Kwon 2020 0.859/0.902 [46]; DuBrock 2024 0.92/0.88 [33], with predictive ability up to five years before diagnosis). Echocardiography-based ML also showed strong discrimination (Liao 2023 0.945/0.950 internal/external [69]). Claims/EHR-based approaches yielded high to very high AUCs (Ong 2020 0.88 [52]; Kogan 2023 0.92 [18]). Biomarker and omics studies reported exceptionally high AUCs in smaller, homogeneous cohorts (Duo 2022 0.948/0.945 [34]; Zeng 2021 AUC = 1.00 [62]), although their generalizability remains limited. Finally, multimodal models that integrated imaging and clinical data (for example, Zhao et al. 2025 [71]) achieved consistently high performance, likely reflecting the richer feature space (see Tables 1, 2).
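As a brief technical aside, every AUC quoted above has the same probabilistic meaning: the probability that the model scores a randomly chosen PH case higher than a randomly chosen control, which is equivalent to the normalized Mann–Whitney U statistic. A minimal, dependency-free sketch with invented scores (not data from any included study) illustrates the computation:

```python
def auc(scores_pos, scores_neg):
    """Empirical AUC: P(score_pos > score_neg), with ties counted as 1/2."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical model outputs for 4 PH cases and 4 controls
ph = [0.9, 0.8, 0.7, 0.4]
controls = [0.6, 0.3, 0.2, 0.1]
print(auc(ph, controls))  # 0.9375: one control outranks one case
```

An AUC of 1.00, as reported in some small biomarker cohorts, simply means every case outranked every control in that sample, which is far easier to achieve in small, homogeneous datasets than in broad screening populations.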
External validation was performed in 20 studies (37.7%), in some cases combined with internal validation on a held-out test set. Cross-validation was the most commonly applied strategy, frequently supplemented by a separate internal test split.
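To illustrate the dominant strategy, k-fold cross-validation partitions a cohort into k folds so that every patient serves exactly once as test data while the model is retrained k times on the remainder; it estimates internal performance but cannot substitute for validation on an external cohort. A schematic, dependency-free sketch (toy indices, not tied to any included study):

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k roughly equal, disjoint folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i not in set(test)]
        yield train, test
        start += size

# 10 patients, fivefold CV: each patient appears in exactly one test fold
tested = []
for train, test in kfold_indices(10, 5):
    assert len(train) == 8 and len(test) == 2
    tested.extend(test)
assert sorted(tested) == list(range(10))
```

In practice, preprocessing and feature selection must be refit inside each training fold; performing them once on the full cohort, a pattern not always excludable from the included reports, leaks test information and inflates cross-validated AUCs.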
Calibration metrics and decision curve analyses were rarely reported across studies.
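For illustration, the calibration reporting that is largely absent amounts to comparing predicted risks with observed event frequencies within probability bins: a model can discriminate well (high AUC) yet systematically over- or underestimate risk. A dependency-free sketch of such a reliability table, using invented predictions:

```python
def calibration_bins(y_prob, y_true, n_bins=4):
    """Group predictions into equal-width probability bins and compare the
    mean predicted probability with the observed event rate in each bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(y_prob, y_true):
        i = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[i].append((p, y))
    table = []
    for i, members in enumerate(bins):
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        obs_rate = sum(y for _, y in members) / len(members)
        table.append((i, round(mean_pred, 3), round(obs_rate, 3), len(members)))
    return table

# Invented predicted risks and outcomes (1 = event)
probs = [0.05, 0.10, 0.30, 0.40, 0.60, 0.70, 0.85, 0.95]
outcomes = [0, 0, 0, 1, 1, 0, 1, 1]
for row in calibration_bins(probs, outcomes):
    print(row)  # (bin, mean predicted, observed rate, n)
```

A well-calibrated model shows mean predicted probability close to the observed rate in every bin; large gaps, even alongside a high AUC, argue against using the raw model output for clinical risk communication.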
Risk of bias and applicability
Risk of bias and applicability were assessed using the PROBAST+AI tool. For each domain, both methodological quality and clinical applicability were independently rated as low, moderate, or high. Based on this assessment, 44 studies (83.0%) were classified as having a low overall risk of bias. However, moderate applicability concerns were frequently identified. These were mainly related to non-representative patient populations, insufficient detail on predictor definitions and measurement, non-guideline-conforming diagnostic criteria for PH (including inconsistent use of RHC), and limited generalizability of imaging-based models to broader clinical settings.
In 15 studies (28.3%) [15–17, 39, 40, 43, 44, 46, 48, 55, 61, 62, 64, 65, 67], RHC, the diagnostic gold standard for PH, was either not performed, not explicitly reported, or replaced by echocardiography alone for diagnostic confirmation. This raised applicability concerns regarding the validity and consistency of case definitions across these studies. Subtype attribution in accordance with ESC/ERS guidelines and the Nice classification [1] was clearly reported in 38 studies (71.7%), while the remaining studies used heterogeneous PH definitions, partly without clear diagnostic specification or consistent delineation according to guideline-based criteria [1]. As RHC is not routinely performed in all patients with imaging signs of right heart strain in clinical practice, these studies nonetheless provide valuable insights into clinical and echocardiography-based AI applications in suspected PH. For completeness of the review’s evidence base, these studies were retained but are presented separately in Table 2 to maintain a clear distinction between reference standards in the main analysis. A detailed study-level assessment of bias and applicability is provided in Supplementary Table S3.
Prognostic and predictive models
Among the studies reviewed, nine focused on prognostic modeling, specifically aimed at predicting outcomes such as mortality [25, 31, 32, 35, 41, 57, 58, 66, 68]. While these studies leveraged imaging data and clinical endpoints, they were rarely externally validated or prospectively tested.
Alabed et al. (2022) applied a cardiac MRI-based multilinear principal component analysis (MPCA) approach to identify prognostic features across the cardiac cycle, improving 1-year mortality prediction in PAH compared with the REVEAL score (c-index 0.76 vs. 0.71) while maintaining interpretability through visualization of high-risk myocardial regions [25]. Kheyfets et al. (2023) developed a random forest model in PAH, integrating clinical, hemodynamic, and biomarker data, and achieved excellent internal discrimination (AUC 0.94) and robust external validation (AUC 0.81) for 4-year survival prediction, illustrating the potential of explainable, individualized AI-based risk assessment [68].
Prognostic models demonstrated moderate to high discriminatory ability, depending on input modality and outcome definition. Early CMR-based motion models performed at the lower bound (Dawes 2017, AUC 0.73 [66]), while registry-based Bayesian networks showed intermediate accuracy (Kanwar 2020, AUC 0.80; external 0.74–0.80 [41]). Imaging-rich or multimodal approaches achieved higher performance, with the CMR-based MPCA model by Alabed 2022 improving 1-year mortality prediction in PAH (c-index 0.83 vs. 0.71 REVEAL) [25], and the random-forest model by Kheyfets 2023 reaching AUCs of 0.94 (internal) and 0.81 (external) [68]. Further prognostic applications, such as AI-quantified fibrosis in CT (Dwivedi 2024, c-index 0.76 [35]) or multimodal EHR-based survival prediction (Suvon 2023, AUC 0.89 [58]), also demonstrated strong predictive accuracy. These results collectively highlight that greater data richness and more precise labeling enhance prognostic power (see Table 1).
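For readers less familiar with the c-index used by several of these prognostic studies: it extends the AUC to censored time-to-event data, counting the fraction of comparable patient pairs in which the model assigns the higher risk to the patient who fails earlier. A brute-force sketch of Harrell's concordance index with invented follow-up data:

```python
def concordance_index(time, event, risk):
    """Harrell's C: among comparable pairs (the earlier time has an observed
    event), the fraction where the earlier-failing patient has higher risk."""
    conc = comp = 0.0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # Pair is comparable only if i has an observed event before time[j]
            if event[i] and time[i] < time[j]:
                comp += 1
                if risk[i] > risk[j]:
                    conc += 1
                elif risk[i] == risk[j]:
                    conc += 0.5
    return conc / comp

# Toy cohort: follow-up years, event indicator (1 = death), model risk score
t = [1.0, 2.0, 3.0, 4.0]
e = [1, 1, 0, 1]
r = [0.9, 0.7, 0.2, 0.4]
print(concordance_index(t, e, r))  # 1.0: risk ordering matches outcomes
```

Because censored patients (here, patient 3) only contribute to pairs where the comparator's event precedes their censoring time, the c-index remains interpretable under incomplete follow-up, which is why survival studies report it instead of a plain AUC.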
Discussion
Despite growing enthusiasm for AI in PH research, the clinical translation of ML and DL models remains limited. The observed variation in reported AUCs reflects the heterogeneity of data sources and study objectives. CXR- and CT-based models generally achieved moderate accuracy, ECG-based algorithms showed consistently higher performance, and multimodal or MRI-based prognostic models achieved the highest results, albeit often in smaller, more homogeneous cohorts. These differences likely stem from variations in data richness, label quality (RHC-confirmed vs. surrogate definitions), and cohort heterogeneity, underscoring the need for standardized endpoints, transparent reporting, and robust external validation in future studies (see Tables 1, 2).
To our knowledge, this is the first systematic review to provide a structured quality assessment of 53 studies addressing non-invasive diagnosis, phenotypic classification, and prognostication in PH. By synthesizing heterogeneous approaches using the SWiM framework and evaluating methodological rigor via the PROBAST+AI tool, we identified key limitations that currently hinder clinical implementation.
Study cohorts and diagnosis methodology
A key limitation identified across many studies is the lack of clear differentiation between PH subgroups, particularly with respect to the current ESC/ERS classification [1]. While studies such as Swinnen et al. (2023) explicitly aimed to distinguish PAH from post-capillary PH due to left heart disease (PH-LHD) [60], this distinction was not rigorously addressed in other studies, despite its clinical importance. Differentiating between PH Groups 1 to 5 is essential, as these entities differ markedly in pathophysiology, therapeutic implications, and clinical outcomes [1]. The omission of this distinction in a considerable number of studies underscores a persistent gap between clinical priorities and prevailing practices in AI research, thereby limiting the utility of ML and DL models that do not account for the multifaceted nature of PH.
Regarding diagnostic methodology, most studies employed RHC [1]. When performed in experienced centers, RHC offers high diagnostic specificity and precision with an acceptable risk profile [8, 73]. Across 15 studies, RHC was not performed, not clearly reported, or substituted by echocardiographic assessment [15–17, 39, 40, 43, 44, 46, 48, 55, 61, 62, 64, 65, 67]. Despite its wider availability, echocardiography is insufficient for definitive diagnosis according to guideline-based algorithms [1]. To account for this methodological heterogeneity, studies that did not explicitly report RHC confirmation or used echocardiography alone for diagnosis were analyzed separately (Table 2). Although the absence of invasive confirmation limits diagnostic ground truth, these studies remain relevant as they reflect real-world clinical practice, where RHC is not routinely performed in all patients with imaging signs of right heart strain. Moreover, they provide complementary insights into the development and validation of non-invasive AI models for screening or triage applications. Nevertheless, the use of inconsistent diagnostic standards introduces potential bias, limits comparability, and hinders both model performance and clinical translation. In addition, the decision to perform RHC may itself introduce selection bias, as patient selection for invasive confirmation often differs across PH subgroups and disease stages.
Model performance: machine learning vs. deep learning
The studies included in this review predominantly employed supervised ML models such as support vector machines (SVM), random forests, and gradient boosting machines (GBM) for structured clinical data. These models demonstrated promising discriminatory performance, with AUC values ranging from 0.73 to 1.00 [62, 66]. However, most were developed for binary classification tasks, such as distinguishing PH from healthy controls or identifying at-risk individuals. While such classifiers may support initial screening, they fall short of addressing dynamic clinical needs, including the prediction of long-term outcomes, therapeutic response, or continuous hemodynamic parameters. The limited adoption of regression-based ML approaches restricts the clinical applicability of current models for longitudinal monitoring and individualized prognostic assessment.
In contrast, DL approaches, primarily based on CNNs, were mainly applied to imaging data. A considerable number of studies focused specifically on CXR analysis [14, 15, 43–45, 48, 65]. These studies demonstrated the feasibility of using DL for non-invasive detection of PH and PAH, with reported AUC values ranging from 0.71 to 1.00 [32, 44, 45]. The use of widely available CXR data underscores the potential of DL models for scalable, non-invasive screening in PH. However, several limitations remain. The opaque nature of CNN-based models challenges clinical acceptance [74, 75]. None of the included studies appear to have employed explainable AI (XAI) techniques such as saliency maps, Gradient-weighted Class Activation Mapping (Grad-CAM), or layer-wise relevance propagation.
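As one concrete illustration of the simplest such technique, occlusion-based saliency masks image regions one patch at a time and records how much the model's output drops, yielding a heat map of the regions the network relies on. A framework-free sketch with a toy stand-in for a CNN (purely illustrative; no model from the included studies is reproduced here):

```python
def occlusion_map(predict, image, patch=2):
    """Occlusion sensitivity: output drop when each patch is zeroed out."""
    h, w = len(image), len(image[0])
    base = predict(image)
    heat = [[0.0] * w for _ in range(h)]
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            occluded = [row[:] for row in image]
            for rr in range(r, min(r + patch, h)):
                for cc in range(c, min(c + patch, w)):
                    occluded[rr][cc] = 0.0
            drop = base - predict(occluded)
            for rr in range(r, min(r + patch, h)):
                for cc in range(c, min(c + patch, w)):
                    heat[rr][cc] = drop
    return heat

# Toy "model": responds only to the top-left quadrant of a 4x4 image
def toy_predict(img):
    return sum(img[r][c] for r in range(2) for c in range(2))

img = [[1.0] * 4 for _ in range(4)]
heat = occlusion_map(toy_predict, img)
print(heat[0][0], heat[3][3])  # 4.0 0.0: only the top-left patch matters
```

Applied to a CXR classifier, such a map would reveal whether the network attends to plausibly relevant structures (e.g., the central pulmonary arteries or cardiac silhouette) or to confounders such as support devices or laterality markers.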
Prognostic models and long-term outcomes
Although several studies have explored AI-based prognostic modeling in pulmonary hypertension, most remain limited by small sample sizes, lack of external validation, and insufficient reporting of model interpretability. Many models did not adequately quantify the relevance of input features, and clinical transparency was often insufficiently addressed. This lack of interpretability hampers clinical applicability, as clinicians require explainable and actionable outputs to inform patient management and therapeutic decisions [76].
Despite these challenges, AI-driven prognostic tools hold considerable promise for advancing PH management by enabling earlier risk stratification and personalized treatment strategies [11].
Data heterogeneity and multi-modal integration
A major limitation across the studies was the heterogeneity of data sources. The studies included in this review utilized various combinations of data types. This diversity complicates direct comparisons between studies and the identification of reproducible predictors. Notably, only a small number of studies integrated multiple data modalities, such as imaging and clinical data, to potentially enhance predictive performance [17, 58, 71]. Only seven studies incorporated advanced data types such as proteomics, transcriptomics, lipidomics, or radiomics [29, 30, 34, 36, 53, 54, 70].
Validation methods and generalizability
A critical limitation across the studies was the insufficient attention to model validation. While most studies relied solely on internal cross-validation, only a limited number employed independent hold-out sets or external datasets. Notably, Zhao et al. (2025) and Bordag et al. (2023) highlighted the value of external validation, demonstrating robust model performance across distinct cohorts and datasets [30, 64]. However, the lack of consistent external validation across the studies included in this review raises concerns about the generalizability of the proposed ML models. Small sample sizes, particularly from single-center cohorts, increase the risk of overfitting and limit broader applicability. Moreover, none of the studies fully met the methodological and reporting standards assessed using the PROBAST+AI tool [23], reflecting persistent gaps in model transparency, reproducibility, and bias control. While several approaches demonstrate considerable innovation, many remain at an early proof-of-concept stage. Robust external validation and prospective multicenter testing are essential to address these concerns [77, 78].
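The optimism of apparent (resubstitution) performance, and hence the need for untouched hold-out or external data, can be made concrete with a deliberately overfitting model. Everything below is a stdlib-only toy (a 1-nearest-neighbour memorizer on a weakly informative simulated biomarker), not a reanalysis of any included study.

```python
import random

def auc(pos, neg):
    """Mann-Whitney estimate of the area under the ROC curve."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def nn_score(x, train):
    """1-nearest-neighbour 'risk score': label of the closest training point.
    It memorises the training set, so it looks perfect on its own data."""
    return min(train, key=lambda t: abs(t[0] - x))[1]

random.seed(1)
# Toy cohort: a single weakly informative biomarker (cases shifted slightly).
data = [(random.gauss(0.3 * y, 1.0), y) for y in (0, 1) for _ in range(150)]
random.shuffle(data)
train, test = data[:150], data[150:]

def eval_auc(fit_sample, sample):
    pos = [nn_score(x, fit_sample) for x, y in sample if y == 1]
    neg = [nn_score(x, fit_sample) for x, y in sample if y == 0]
    return auc(pos, neg)

apparent = eval_auc(train, train)   # resubstitution: wildly optimistic
holdout  = eval_auc(train, test)    # untouched split: closer to the truth
print(apparent, holdout)
```

The memorizer scores an apparent AUC of 1.0 but far less on the held-out split; external validation on a distinct cohort probes the same failure mode one level further, across sites and case mix rather than across random splits.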
Model interpretability and ethical considerations
A key barrier to the clinical adoption of AI models is their limited interpretability and transparency. While metrics such as AUC and accuracy are important indicators of model performance, ML and DL applications in healthcare must also be comprehensible to clinicians and transparent in their decision-making logic. Techniques from the field of XAI, including SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME), are critical for fostering trust and enabling the meaningful integration of model outputs into clinical workflows [79–81].
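SHAP approximates the game-theoretic Shapley value of each input feature. For a small model this quantity can be computed exactly, which makes the underlying idea concrete. The risk function and feature names below (TRV, BNP, age as binary flags) are hypothetical, chosen only to show how an interaction term is split between features and how attributions sum to the prediction.

```python
from itertools import combinations
from math import factorial

# Hypothetical risk model over three binary inputs (illustrative only):
# elevated TRV and BNP interact; age adds a small independent amount.
def risk(trv, bnp, age):
    return 0.4 * trv + 0.3 * bnp + 0.2 * trv * bnp + 0.1 * age

FEATURES = ("trv", "bnp", "age")
BASELINE = {"trv": 0, "bnp": 0, "age": 0}   # reference patient

def value(subset, patient):
    """Model output with features in `subset` at the patient's values
    and the remaining features at the baseline."""
    args = {f: (patient[f] if f in subset else BASELINE[f]) for f in FEATURES}
    return risk(**args)

def shapley(patient):
    """Exact Shapley values: weighted marginal contributions over all
    feature subsets (feasible only for small feature counts)."""
    n = len(FEATURES)
    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for r in range(len(others) + 1):
            for s in combinations(others, r):
                w = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                total += w * (value(set(s) | {f}, patient) - value(set(s), patient))
        phi[f] = total
    return phi

patient = {"trv": 1, "bnp": 1, "age": 1}
phi = shapley(patient)
# Efficiency property: attributions sum to risk(patient) - risk(baseline).
print(phi, sum(phi.values()))
```

The interaction term (0.2) is split equally between TRV and BNP, giving attributions of 0.5, 0.4, and 0.1 that sum to the prediction difference of 1.0; SHAP libraries estimate the same quantity efficiently for models with many features.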
In addition to technical concerns, ethical challenges such as data privacy, algorithmic fairness, and bias against underrepresented populations are often insufficiently addressed. Such bias may arise from imbalanced datasets, unrepresentative training populations, or opaque model development processes. These challenges require collaborative strategies, including representative data selection and transparent model auditing [82]. A comprehensive taxonomy of bias sources and fairness strategies highlights the persistent risk of discriminatory outcomes if fairness is not explicitly addressed throughout the AI development process [83]. Addressing these issues is essential for the responsible and equitable implementation of AI in clinical care.
Trust in AI is not established by validation metrics alone but rather emerges throughout the development process. Winter and Carusi (2022) demonstrated that validation and trust are co-constructed iteratively through continuous interaction between algorithm developers and clinical users. Their study on AI-assisted early diagnosis of PH emphasized how crucial steps such as data curation, label refinement, and the choice of benchmarks are shaped collaboratively, often through tacit, practice-oriented input that is not captured in formal reporting [84]. In this sense, validation is not a static technical endpoint but an evolving process embedded in clinical workflows. Acknowledging and integrating these collaborative dynamics may be essential for developing AI systems that are robust, interpretable, and clinically acceptable.
Summary
This systematic review analyzed 53 studies applying ML and DL to PH, focusing on non-invasive models for diagnosis, classification, and prognostication. Key aspects included model inputs, algorithm types, validation strategies, and subgroup differentiation. While ML- and DL-based approaches demonstrated promising accuracy, limited external validation, methodological heterogeneity, and the common failure to address subgroup-specific analyses continue to constrain clinical applicability.
Strengths and limitations
This systematic review provides a comprehensive and methodologically rigorous synthesis of current ML and DL applications in PH research. One key strength lies in the structured evaluation of studies based on clinical intent, model design, and phenotypic focus, which allows for a differentiated assessment of algorithmic potential across the PH spectrum. Additional strengths include the prospective registration in PROSPERO (CRD420251074202) [22], the consistent application of transparent inclusion criteria, and adherence to PRISMA methodology [21]. Study quality and reporting were critically appraised using the recently updated PROBAST+AI tool [23], specifically developed for assessing ML-based prediction models. The synthesis and reporting were also guided by the SWiM guideline, which supports transparent evidence presentation in the absence of formal meta-analysis due to methodological heterogeneity [24]. Another strength is the review’s focus on PH subgroup differentiation, which addresses a clinically important but often overlooked aspect in PH research.
Several limitations should be acknowledged. First, the number of eligible studies remains limited, reflecting the early stage of AI application in this field. Second, substantial heterogeneity in input features, outcome definitions, and performance metrics limited comparability. Third, the lack of access to source code, model parameters, or detailed preprocessing steps in most studies hindered transparency and reproducibility, which impacts the robustness of the review’s findings.
Future directions
To advance the clinical utility of AI in PH, future research should prioritize phenotypically precise model development across all subgroups defined by the current classification [1]. Substantial differences in pathophysiology, therapeutic response, and prognosis between PAH and other forms of PH necessitate subgroup-specific algorithms trained and validated on clearly stratified patient populations. Rigorous model validation must become standard practice, including not only internal cross-validation and independent hold-out testing, but also external validation to ensure broader applicability. Given the relative rarity of PH and its subtypes, collaborative multicenter registries and federated learning approaches may help overcome current limitations in sample size and data diversity. To increase model transparency and foster clinician trust, explainability techniques such as SHAP values, attention mechanisms, or class activation mapping (CAM) should be routinely implemented and clearly reported. By mitigating the black-box nature of AI models, these tools enhance clinical interpretability and help identify biologically plausible predictors, similar to the feature selection process in methods like Least Absolute Shrinkage and Selection Operator (LASSO). Furthermore, integrating structured clinical and imaging data with unstructured modalities such as free-text reports or waveform data holds promise for improving model performance and robustness. Clinical implementation of AI models in PH should complement rather than replace established clinical workflows, with particular attention to interoperability with EHRs and prospective validation. In addition, future research should aim to systematically consider health economic implications, for example by evaluating whether AI-based tools can contribute to more efficient diagnostic pathways or resource allocation. 
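As a concrete counterpart to the LASSO analogy above, the sketch below implements coordinate-descent LASSO with soft-thresholding on simulated data, showing how uninformative predictors are shrunk to zero. The data, penalty, and coefficients are illustrative only and do not reflect any included study.

```python
import random

def lasso_cd(X, y, lam, iters=200):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate
    descent with soft-thresholding; uninformative coefficients go to zero."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # Correlation of feature j with the partial residual (excluding j)
            rho = sum(X[i][j] * (y[i] - sum(X[i][k] * w[k]
                      for k in range(p) if k != j)) for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            # Soft-thresholding update
            if rho > lam:
                w[j] = (rho - lam) / z
            elif rho < -lam:
                w[j] = (rho + lam) / z
            else:
                w[j] = 0.0
    return w

random.seed(2)
# Simulated data: 5 candidate predictors, only the first two truly matter.
X = [[random.gauss(0, 1) for _ in range(5)] for _ in range(100)]
y = [2.0 * row[0] - 1.5 * row[1] + random.gauss(0, 0.1) for row in X]

w = lasso_cd(X, y, lam=0.2)
# Coefficients for the three noise predictors are shrunk to (near) zero.
print([round(c, 2) for c in w])
```

The same sparsity that makes LASSO useful for selecting biologically plausible predictors is what post hoc attribution methods such as SHAP try to recover for models that lack built-in feature selection.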
Close collaboration between clinicians, data scientists, health economists, software engineers, and regulatory bodies is essential to ensure that future AI applications meet the standards of safety, transparency, and clinical and health economic relevance required for real-world adoption.
Conclusions
AI holds considerable promise to support earlier diagnosis, individualized risk assessment, and data-informed therapeutic decision-making in PH. Current ML and DL models show encouraging performance in diagnostic and prognostic applications based on non-invasive clinical and imaging data. However, progress toward clinical translation remains limited by small sample sizes, single-center designs, methodological heterogeneity, and the lack of external validation and standardized subgroup phenotyping aligned with current ESC/ERS guidelines. Future research should prioritize harmonized development and reporting practices, transparent diagnostic labeling, and robust multicenter validation to enable safe and effective integration of AI tools into clinical care.
Questions for future research
How can AI models in PH be trained on data strictly aligned with current clinical and hemodynamic definitions?
What methodological approaches can enhance the generalizability of AI tools across PH subgroups and clinical environments?
Can explainable AI increase transparency and foster clinical acceptance in PH applications?
How can AI support early risk stratification and individualized therapy guidance, especially in PAH?
How can AI tools in PH be prospectively validated through real-world, multicenter study designs?
Acknowledgements
We gratefully acknowledge Linda Stein for her valuable administrative assistance throughout the preparation of this systematic review.
Abbreviations
- AI
Artificial intelligence
- Ao
Aorta
- AUC
Area under the curve
- AUROC
Area under the receiver operating characteristic curve
- AUPRC
Area under the precision recall curve
- BERT
Bidirectional encoder representations from transformers
- BNP
Brain natriuretic peptide
- CatBoost
Categorical boosting
- CMR
Cardiovascular magnetic resonance
- CNN
Convolutional neural network
- CT
Computed tomography
- CTEPH
Chronic thromboembolic pulmonary hypertension
- CTPA
CT pulmonary angiography
- CXR
Chest X-ray
- DAFIT
Data-augmented feature integration technique
- DL
Deep learning
- Echo
Echocardiography
- ECG
Electrocardiogram
- EHR
Electronic health record
- F1-Score
Harmonic mean of precision and recall
- GSE
Gene expression omnibus series
- HR
Hazard ratio
- ICD-9/10
International classification of diseases, ninth/tenth revision
- IL-2
Interleukin-2
- IL-9
Interleukin-9
- LASSO
Least absolute shrinkage and selection operator
- LV
Left ventricle
- MAE
Mean absolute error
- ML
Machine learning
- MLP
Multilayer perceptron
- MPCA
Multilinear principal component analysis
- MRI
Magnetic resonance imaging
- mPAP
Mean pulmonary arterial pressure
- MSKCC
Memorial Sloan Kettering Cancer Center
- NPV
Negative predictive value
- PA
Pulmonary artery
- PAH
Pulmonary arterial hypertension
- PAP
Pulmonary arterial pressure
- PASP
Pulmonary arterial systolic pressure
- PH
Pulmonary hypertension
- PH-LD
Pulmonary hypertension due to lung disease
- PH-LHD
Pulmonary hypertension due to left heart disease
- PIOPED
Prospective investigation of pulmonary embolism diagnosis
- PPG
Photoplethysmography
- PPV
Positive predictive value
- PRISMA
Preferred reporting items for systematic reviews and meta-analyses
- PROBAST+AI
Prediction model risk of bias assessment tool for artificial intelligence
- R2
Coefficient of determination
- RHC
Right heart catheterization
- ROC
Receiver operating characteristic
- RV
Right ventricle
- RVEF
Right ventricular ejection fraction
- SVM
Support vector machine
- sPAP
Systolic pulmonary arterial pressure
- SVR
Support vector regression
- TAN
Tree-augmented naïve Bayes
- TPR
True positive rate
- TRPG
Tricuspid regurgitant pressure gradient
- TRV
Tricuspid regurgitation velocity
- VUMC
Vanderbilt University Medical Center
- WGCNA
Weighted gene co-expression network analysis
- XGBoost
Extreme gradient boosting
Author contributions
Author contributions TK conceived and designed the review, performed the literature search and data extraction, conducted the quality assessment, prepared all figures and tables, and drafted the manuscript. MK assisted with data organization and supported manuscript preparation. CH contributed to the economic framing and critically revised the manuscript. SS provided conceptual guidance and acted as academic supervisor of the project. All authors read and approved the final version of the manuscript.
Funding
Open Access funding enabled and organized by Projekt DEAL. This research was conducted without any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. No external funding was received for the conduct, analysis, or reporting of this study.
Data availability
This systematic review is based on publicly available data from previously published studies. As no original data were collected or generated, no new datasets are available. All relevant data from the included studies are cited in the manuscript and summarized in the main text and supplementary tables. The corresponding review protocol was prospectively registered and is publicly available in PROSPERO (registration number: CRD420251074202). No analytic code was generated, as data synthesis was conducted narratively following the SWiM (Synthesis Without Meta-analysis) approach. A pre-defined data extraction form was used but is not publicly available; it can be obtained upon reasonable request from the corresponding author. Further inquiries regarding specific studies or data can also be directed to the corresponding author.
Declarations
Ethics approval and consent to participate
Not applicable. This article is a systematic review of previously published studies and does not involve any new studies with human participants or animals performed by any of the authors. Therefore, ethical approval and informed consent were not required.
Competing interests
TK has received speaker honoraria for lectures from Janssen unrelated to the present work. MK, CH, SS: Nothing to disclose.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1.Humbert M, et al. 2022 ESC/ERS Guidelines for the diagnosis and treatment of pulmonary hypertension. Eur Heart J. 2022;43:3618–731. 10.1093/eurheartj/ehac237. [DOI] [PubMed] [Google Scholar]
- 2.Rosenkranz S, Howard LS, Gomberg-Maitland M, Hoeper MM. Systemic consequences of pulmonary hypertension and right-sided heart failure. Circulation. 2020;141:678–93. 10.1161/circulationaha.116.022362. [DOI] [PubMed] [Google Scholar]
- 3.Small M, Perchenet L, Bennett A, Linder J. The diagnostic journey of pulmonary arterial hypertension patients: results from a multinational real-world survey. Ther Adv Respir Dis. 2024;18:17534666231218886. 10.1177/17534666231218886. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Deshwal H, Weinstein T, Sulica R. Advances in the management of pulmonary arterial hypertension. J Investig Med. 2021;69:1270–80. 10.1136/jim-2021-002027. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Frost A, et al. Diagnosis of pulmonary hypertension. Eur Respir J. 2019. 10.1183/13993003.01904-2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Weatherald J, Humbert M. The ‘great wait’ for diagnosis in pulmonary arterial hypertension. Respirology. 2020;25:790–2. 10.1111/resp.13814. [DOI] [PubMed] [Google Scholar]
- 7.Patton DM, Enzevaie A, Day A, Sanfilippo A, Johri AM. A quality control exercise in the echo laboratory: reduction in inter-observer variability in the interpretation of pulmonary hypertension. Echocardiography. 2017;34:1882–7. 10.1111/echo.13712. [DOI] [PubMed] [Google Scholar]
- 8.Rosenkranz S, Preston IR. Right heart catheterisation: best practice and pitfalls in pulmonary hypertension. Eur Respir Rev. 2015;24:642–52. 10.1183/16000617.0062-2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Fadilah A, Putri VYS, Puling I, Willyanto SE. Assessing the precision of machine learning for diagnosing pulmonary arterial hypertension: a systematic review and meta-analysis of diagnostic accuracy studies. Front Cardiovasc Med. 2024;11:1422327. 10.3389/fcvm.2024.1422327. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Hardacre CJ, et al. Diagnostic test accuracy of artificial intelligence analysis of cross-sectional imaging in pulmonary hypertension: a systematic literature review. Br J Radiol. 2021. 10.1259/bjr.20210332. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Attaripour Esfahani S, et al. A comprehensive review of Artificial Intelligence (AI) applications in pulmonary hypertension (PH). Medicina Kaunas. 2025. 10.3390/medicina61010085. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Rhodes CJ, Sweatt AJ, Maron BA. Harnessing big data to advance treatment and understanding of pulmonary hypertension. Circ Res. 2022;130:1423–44. 10.1161/circresaha.121.319969. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Tchuente Foguem G, Teguede Keleko A. Artificial intelligence applied in pulmonary hypertension: a bibliometric analysis. AI Ethics. 2023;3:1063–93. 10.1007/s43681-023-00267-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Imai S, et al. Artificial intelligence-based model for predicting pulmonary arterial hypertension on chest x-ray images. BMC Pulm Med. 2024;24:101. 10.1186/s12890-024-02891-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Han PL, et al. Artificial intelligence-assisted diagnosis of congenital heart disease and associated pulmonary arterial hypertension from chest radiographs: a multi-reader multi-case study. Eur J Radiol. 2024;171:111277. 10.1016/j.ejrad.2023.111277. [DOI] [PubMed] [Google Scholar]
- 16.Liu CM, et al. Artificial Intelligence-enabled electrocardiogram improves the diagnosis and prediction of mortality in patients with pulmonary hypertension. JACC Asia. 2022;2:258–70. 10.1016/j.jacasi.2022.02.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Kishikawa R, et al. An ensemble learning model for detection of pulmonary hypertension using electrocardiogram, chest X-ray, and brain natriuretic peptide. Eur Heart J Digit Health. 2025;6:209–17. 10.1093/ehjdh/ztae097. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Kogan E, et al. A machine learning approach to identifying patients with pulmonary hypertension using real-world electronic health records. Int J Cardiol. 2023;374:95–9. 10.1016/j.ijcard.2022.12.016. [DOI] [PubMed] [Google Scholar]
- 19.Sharkey MJ, Checkley EW, Swift AJ. Applications of artificial intelligence in computed tomography imaging for phenotyping pulmonary hypertension. Curr Opin Pulm Med. 2024;30:464–72. 10.1097/mcp.0000000000001103. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Nemati N, et al. Pulmonary hypertension detection non-invasively at point-of-care using a machine-learned algorithm. Diagnostics. 2024;14:897. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Page MJ, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. 10.1136/bmj.n71. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Schiavo JH. PROSPERO: an international register of systematic review protocols. Med Ref Serv Q. 2019;38:171–80. 10.1080/02763869.2019.1588072. [DOI] [PubMed] [Google Scholar]
- 23.Moons KGM, et al. PROBAST+AI: an updated quality, risk of bias, and applicability assessment tool for prediction models using regression or artificial intelligence methods. BMJ. 2025;388:e082505. 10.1136/bmj-2024-082505. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Campbell M, et al. Synthesis without meta-analysis (SWiM) in systematic reviews: reporting guideline. BMJ. 2020;368:l6890. 10.1136/bmj.l6890. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Alabed S, et al. Machine learning cardiac-MRI features predict mortality in newly diagnosed pulmonary arterial hypertension. Eur Heart J Digit Health. 2022;3:265–75. 10.1093/ehjdh/ztac022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Anand V, et al. Machine learning for diagnosis of pulmonary hypertension by echocardiography. Mayo Clin Proc. 2024;99:260–70. 10.1016/j.mayocp.2023.05.006. [DOI] [PubMed] [Google Scholar]
- 27.Aras MA, et al. Electrocardiogram detection of pulmonary hypertension using deep learning. J Card Fail. 2023;29:1017–28. 10.1016/j.cardfail.2022.12.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Argiento P, et al. A pulmonary hypertension targeted algorithm to improve referral to right heart catheterization: a machine learning approach. Comput Struct Biotechnol J. 2024;24:746–53. 10.1016/j.csbj.2024.11.031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Bauer Y, et al. Identifying early pulmonary arterial hypertension biomarkers in systemic sclerosis: machine learning on proteomics from the DETECT cohort. Eur Respir J. 2021;57:2002591. 10.1183/13993003.02591-2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Bordag N, et al. Lipidomics for diagnosis and prognosis of pulmonary hypertension. medRxiv. 2023. 10.1101/2023.05.17.23289772. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Chettrit D, Bregman Amitai O, Tamir I, Bar A, Elnekave E. PHT-bot: a deep learning based system for automatic risk stratification of COPD patients based upon signs of pulmonary hypertension. In: Proc SPIE Medical Imaging, vol. 10950. SPIE; 2019. arXiv:1905.11773
- 32.Diller G-P, et al. A framework of deep learning networks provides expert-level accuracy for the detection and prognostication of pulmonary arterial hypertension. Eur Heart J Cardiovasc Imaging. 2022;23:1447–56. 10.1093/ehjci/jeac147. [DOI] [PubMed] [Google Scholar]
- 33.DuBrock HM, et al. An electrocardiogram-based AI algorithm for early detection of pulmonary hypertension. Eur Respir J. 2024;64:2400192. 10.1183/13993003.00192-2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Duo M, et al. Construction of a diagnostic signature and immune landscape of pulmonary arterial hypertension. Front Cardiovasc Med. 2022. 10.3389/fcvm.2022.940894. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Dwivedi K, et al. Improving prognostication in pulmonary hypertension using AI-quantified fibrosis and radiologic severity scoring at baseline CT. Radiology. 2024;310:e231718. 10.1148/radiol.231718. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Errington N, et al. A diagnostic miRNA signature for pulmonary arterial hypertension using a consensus machine learning approach. EBioMedicine. 2021. 10.1016/j.ebiom.2021.103444. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Fortmeier V, et al. Solving the pulmonary hypertension paradox in patients with severe tricuspid regurgitation by employing artificial intelligence. JACC Cardiovasc Interv. 2022;15:381–94. 10.1016/j.jcin.2021.12.043. [DOI] [PubMed] [Google Scholar]
- 38.Gawlitza J, et al. Machine learning assisted feature identification and prediction of hemodynamic endpoints using computed tomography in patients with CTEPH. Int J Cardiovasc Imaging. 2024;40:569–77. 10.1007/s10554-023-03026-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Hu H, et al. Identification of potential biomarkers for group I pulmonary hypertension based on machine learning and bioinformatics analysis. Int J Mol Sci. 2023. 10.3390/ijms24098050. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Hyde B, et al. A claims-based, machine-learning algorithm to identify patients with pulmonary arterial hypertension. Pulm Circ. 2023;13:e12237. 10.1002/pul2.12237. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Kanwar MK, et al. Risk stratification in pulmonary arterial hypertension using Bayesian analysis. Eur Respir J. 2020;56:2000008. 10.1183/13993003.00008-2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Kiely DG, et al. Utilising artificial intelligence to determine patients at risk of a rare disease: idiopathic pulmonary arterial hypertension. Pulm Circ. 2019;9:2045894019890549. 10.1177/2045894019890549. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Kıvrak T, et al. Pulmonary hypertension classification using artificial intelligence and chest X-ray: ATA AI STUDY-1. medRxiv. 2023. 10.1101/2023.04.14.23288561
- 44.Kusunose K, et al. Deep learning for detection of exercise-induced pulmonary hypertension using chest x-ray images. Front Cardiovasc Med. 2022. 10.3389/fcvm.2022.891703. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Kusunose K, Hirata Y, Tsuji T, Kotoku Ji, Sata M. Deep learning to predict elevated pulmonary artery pressure in patients with suspected pulmonary hypertension using standard chest x ray. Sci Rep. 2020;10:19311. 10.1038/s41598-020-76359-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Kwon J-M, et al. Artificial intelligence for early prediction of pulmonary hypertension using electrocardiography. J Heart Lung Transplant. 2020;39:805–14. 10.1016/j.healun.2020.04.009. [DOI] [PubMed] [Google Scholar]
- 47.Leha A, et al. A machine learning approach for the prediction of pulmonary hypertension. PLoS ONE. 2019;14:e0224453. 10.1371/journal.pone.0224453. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Liu P-Y, et al. A deep-learning-enabled electrocardiogram and chest X-ray for detecting pulmonary arterial hypertension. J Imaging Inf Med. 2025;38:747–56. 10.1007/s10278-024-01225-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Lungu A, et al. Diagnosis of pulmonary hypertension from magnetic resonance imaging-based computational models and decision tree analysis. Pulm Circ. 2016;6:181–90. 10.1086/686020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Matsunaga T, et al. Development and web deployment of prediction model for pulmonary arterial pressure in chronic thromboembolic pulmonary hypertension using machine learning. PLoS ONE. 2024;19:e0300716. 10.1371/journal.pone.0300716. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Murayama M, et al. Deep learning to assess right ventricular ejection fraction from two-dimensional echocardiograms in precapillary pulmonary hypertension. Echocardiography. 2024;41:e15812. 10.1111/echo.15812. [DOI] [PubMed] [Google Scholar]
- 52.Ong MS, et al. Claims‐based algorithms for identifying patients with pulmonary hypertension: a comparison of decision rules and machine‐learning approaches. J Am Heart Assoc. 2020;9:e016648. 10.1161/JAHA.120.016648. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Priya S, et al. Radiomics detection of pulmonary hypertension via texture-based assessments of cardiac MRI: a machine-learning model comparison—cardiac MRI radiomics in pulmonary hypertension. J Clin Med. 2021;10:1921. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Priya S, et al. Radiomics side experiments and DAFIT approach in identifying pulmonary hypertension using Cardiac MRI derived radiomics based machine learning models. Sci Rep. 2021;11:12686. 10.1038/s41598-021-92155-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Ragnarsdottir H, et al. Deep learning based prediction of pulmonary hypertension in newborns using echocardiograms. Int J Comput Vis. 2024;132:2567–84. 10.1007/s11263-024-01996-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Schuler KP, et al. An algorithm to identify cases of pulmonary arterial hypertension from the electronic medical record. Respir Res. 2022;23:138. 10.1186/s12931-022-02055-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Shikhare S, et al. Right-to-left ventricle ratio determined by machine learning algorithms on CT pulmonary angiography images predicts prolonged ICU length of stay in operated chronic thromboembolic pulmonary hypertension. Br J Radiol. 2022;95:20210722. 10.1259/bjr.20210722. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Suvon M, Tripathi P, Alabed S, Swift A, Lu H. Multimodal learning for predicting mortality in patients with pulmonary arterial hypertension. 2022.
- 59.Swift AJ, et al. A machine learning cardiac magnetic resonance approach to extract disease features and automate pulmonary arterial hypertension diagnosis. Eur Heart J Cardiovasc Imaging. 2020;22:236–45. 10.1093/ehjci/jeaa001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Swinnen K, et al. Machine learning to differentiate pulmonary hypertension due to left heart disease from pulmonary arterial hypertension. ERJ Open Res. 2023. 10.1183/23120541.00229-2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Yang J, Chen S, Chen K, Wu J, Yuan H. Exploring IRGs as a biomarker of pulmonary hypertension using multiple machine learning algorithms. Diagnostics. 2024. 10.3390/diagnostics14212398. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Zeng H, Liu X, Zhang Y. Identification of potential biomarkers and immune infiltration characteristics in idiopathic pulmonary arterial hypertension using bioinformatics analysis. Front Cardiovasc Med. 2021. 10.3389/fcvm.2021.624714. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Zhang N, et al. Machine learning based on computed tomography pulmonary angiography in evaluating pulmonary artery pressure in patients with pulmonary hypertension. J Clin Med. 2023;12:1297. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Zhao M, et al. Non-contrasted computed tomography (NCCT) based chronic thromboembolic pulmonary hypertension (CTEPH) automatic diagnosis using cascaded network with multiple instance learning. Phys Med Biol. 2024;69:185011. 10.1088/1361-6560/ad7455. [DOI] [PubMed] [Google Scholar]
- 65.Zou X-L, et al. A promising approach for screening pulmonary hypertension based on frontal chest radiographs using deep learning: a retrospective study. PLoS ONE. 2020;15:e0236378. 10.1371/journal.pone.0236378. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Dawes TJW, et al. Machine learning of three-dimensional right ventricular motion enables outcome prediction in pulmonary hypertension: a cardiac MR imaging study. Radiology. 2017;283:381–90. 10.1148/radiol.2016161315. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Guo L, et al. Development and evaluation of a deep learning-based pulmonary hypertension screening algorithm using a digital stethoscope. J Am Heart Assoc. 2025;14:e036882. 10.1161/jaha.124.036882. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Kheyfets VO, et al. Computational platform for doctor-artificial intelligence cooperation in pulmonary arterial hypertension prognostication: a pilot study. ERJ Open Res. 2023. 10.1183/23120541.00484-2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Liao Z, et al. Automatic echocardiographic evaluation of the probability of pulmonary hypertension using machine learning. Pulm Circ. 2023;13:e12272. 10.1002/pul2.12272. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Sweatt AJ, et al. Discovery of distinct immune phenotypes using machine learning in pulmonary arterial hypertension. Circ Res. 2019;124:904–19. 10.1161/circresaha.118.313911. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Zhao W, et al. Development and validation of multimodal deep learning algorithms for detecting pulmonary hypertension. NPJ Digit Med. 2025;8:198. 10.1038/s41746-025-01593-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Dubrock HM, et al. Use of machine-learning models to identify clinical features in patients with pulmonary arterial hypertension associated with a future clinical worsening event. Chest. 2023;164:A5931–2. 10.1016/j.chest.2023.07.3821. [Google Scholar]
- 73.Hoeper MM, et al. Complications of right heart catheterization procedures in patients with pulmonary hypertension in experienced centers. J Am Coll Cardiol. 2006;48:2546–52. 10.1016/j.jacc.2006.07.061. [DOI] [PubMed] [Google Scholar]
- 74.Salih A, et al. Explainable artificial intelligence and cardiac imaging: toward more interpretable models. Circ Cardiovasc Imaging. 2023;16:e014519. 10.1161/circimaging.122.014519. [DOI] [PubMed] [Google Scholar]
- 75.Marey A, et al. Explainability, transparency and black box challenges of AI in radiology: impact on patient care in cardiovascular radiology. Egypt J Radiol Nucl Med. 2024;55:183. 10.1186/s43055-024-01356-2. [Google Scholar]
- 76.Tonekaboni S, Joshi S, McCradden MD, Goldenberg A. What clinicians want: contextualizing explainable machine learning for clinical end use. In: Proceedings of the 4th Machine Learning for Healthcare Conference. PMLR; 2019. vol. 106, p. 359–80.
- 77.Goto S, Ozawa H. The importance of external validation for neural network models. JACC Adv. 2023;2:100610. 10.1016/j.jacadv.2023.100610. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78.Cabitza F, et al. The importance of being external. Methodological insights for the external validation of machine learning models in medicine. Comput Methods Programs Biomed. 2021;208:106288. 10.1016/j.cmpb.2021.106288. [DOI] [PubMed] [Google Scholar]
- 79.Hassija V, et al. Interpreting black-box models: a review on explainable artificial intelligence. Cogn Comput. 2024;16:45–74. 10.1007/s12559-023-10179-8. [Google Scholar]
- 80.Ribeiro MT, Singh S, Guestrin C. "Why should I trust you?": explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, CA: Association for Computing Machinery; 2016. p. 1135–1144.
- 81.Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, CA: Curran Associates Inc.; 2017. p. 4768–4777.
- 82.Ueda D, et al. Fairness of artificial intelligence in healthcare: review and recommendations. Jpn J Radiol. 2024;42:3–15. 10.1007/s11604-023-01474-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 83.Mehrabi N, Morstatter F, Saxena N, Lerman K, Galstyan A. A survey on bias and fairness in machine learning. ACM Comput Surv. 2021;54:115. 10.1145/3457607. [Google Scholar]
- 84.Winter P, Carusi A. 'If you're going to trust the machine, then that trust has got to be based on something': validation and the co-constitution of trust in developing artificial intelligence (AI) for the early diagnosis of pulmonary hypertension (PH). Sci Technol Stud. 2022;35:58–77. 10.23987/sts.102198. [Google Scholar]
Associated Data
Data Availability Statement
This systematic review is based on publicly available data from previously published studies. As no original data were collected or generated, no new datasets are available. All relevant data from the included studies are cited in the manuscript and summarized in the main text and supplementary tables. The corresponding review protocol was prospectively registered and is publicly available in PROSPERO (registration number: CRD420251074202). No analytic code was generated, as data synthesis was conducted narratively following the SWiM (Synthesis Without Meta-analysis) approach. A pre-defined data extraction form was used but is not publicly available; it can be obtained upon reasonable request from the corresponding author. Further inquiries regarding specific studies or data can also be directed to the corresponding author.