European Heart Journal. Digital Health
Letter. 2026 Apr 15;7(4):ztag057. doi: 10.1093/ehjdh/ztag057

Open science requires trust and rigour: a framework for responsible evaluation of shared AI-ECG tools

Lovedeep S Dhingra, Philip M Croon, Evangelos K Oikonomou, Rohan Khera ✉
PMCID: PMC13080934  PMID: 41994368

We read with concern the study by Babur Guler and colleagues, which evaluated three AI-enhanced electrocardiography (AI-ECG) tools in 681 patients with confirmed hypertrophic cardiomyopathy (HCM).1 The tools are a model for detecting underrecognized HCM; PRESENT-SHD, a model for screening for structural heart disease (SHD); and ECGDx, a multilabel rhythm and conduction classifier. All three were developed by our group and are freely available for research use on our lab’s website.2–5

We are committed to open and transparent research and welcome independent external evaluations. We agree with the authors that independent assessment of AI tools in disease-specific populations can inform understanding of a model’s behaviour in specific clinical scenarios, which is essential for the field. However, responsible external validation carries a reciprocal responsibility: models must be represented accurately with respect to their intended use case, validated operating thresholds, and prior evidence base, and they must be evaluated using an appropriate study design.

On several of these fronts, the current study falls short in ways that undermine its conclusions. First, and most fundamentally, the study includes no control or reference group. All three tools are discriminative classifiers, developed and validated in mixed case-control populations. Applying them to a cohort composed entirely of confirmed HCM patients makes it mathematically impossible to assess diagnostic performance. The authors observe that ‘when applied to an exclusively HCM population, the tool appears to lose discriminative power, possibly due to the absence of the contrast provided by non-HCM cases in the training data’. This interpretation is fundamentally flawed. The apparent loss of discriminative power is not an empirical finding but a mathematical inevitability of applying a discriminative classifier to a case-only cohort. While the paper repeatedly frames these distributions as ‘modest performance’, sensitivity, specificity, and discrimination cannot be estimated from a case-only sample, and the resulting probability distributions cannot be benchmarked against any meaningful clinical reference.
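
The point about case-only cohorts can be made concrete with a minimal sketch. It assumes hypothetical probabilities and labels (none are taken from the study), and the helper names `auroc` and `specificity` are illustrative rather than part of any of the tools discussed. AUROC is computed here as the Mann-Whitney proportion of case-control pairs in which the case scores higher; with no controls there are no such pairs, so the quantity is undefined rather than low.

```python
def auroc(scores, labels):
    """Mann-Whitney estimate of AUROC: share of case-control pairs in which
    the case has the higher score (ties count as half)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        return None  # no case-control pairs exist, so AUROC is undefined
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


def specificity(scores, labels, threshold):
    """Share of controls scoring below the threshold."""
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not controls:
        return None  # denominator is zero in a case-only cohort
    return sum(s < threshold for s in controls) / len(controls)


# Hypothetical model probabilities, for illustration only.
case_only_scores = [0.12, 0.35, 0.48, 0.81]
case_only_labels = [1, 1, 1, 1]  # every patient is a confirmed case

# The same cases plus three hypothetical controls.
mixed_scores = case_only_scores + [0.05, 0.10, 0.40]
mixed_labels = case_only_labels + [0, 0, 0]

print("case-only AUROC:", auroc(case_only_scores, case_only_labels))
print("case-only specificity:", specificity(case_only_scores, case_only_labels, 0.15))
print("mixed AUROC:", auroc(mixed_scores, mixed_labels))
print("mixed specificity:", specificity(mixed_scores, mixed_labels, 0.15))
```

For the case-only input both functions return `None`, because neither quantity has a denominator; only once controls are added do they take numerical values.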

Second, the study does not use the validated operating thresholds from the primary publications. The validated thresholds are 15% for the HCM model and 20% for PRESENT-SHD, each selected to achieve approximately 90% sensitivity in internal validation.3–5 Instead, it emphasizes proportions of patients with HCM or SHD probabilities above 50% and 75%, which represent arbitrary cut-offs.1 As a result, the conclusion that relatively few patients ‘receive a high score’ is driven by threshold choice, which conflates a study design decision with model failure.
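
The dependence of this conclusion on threshold choice can be sketched as follows. The scores are hypothetical and chosen only for illustration; the 15% threshold is the validated HCM threshold cited above, and 50% and 75% are the cut-offs emphasized in the study. Treating a score equal to the threshold as positive is an assumption of this sketch, and `fraction_flagged` is an illustrative helper name.

```python
def fraction_flagged(scores, threshold):
    """Share of patients whose score is at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Hypothetical HCM model probabilities for ten confirmed cases.
hypothetical_hcm_scores = [0.08, 0.17, 0.22, 0.31, 0.44, 0.52, 0.63, 0.79, 0.86, 0.91]

thresholds = [
    ("validated HCM threshold", 0.15),
    ("arbitrary cut-off", 0.50),
    ("arbitrary cut-off", 0.75),
]

for label, threshold in thresholds:
    share = fraction_flagged(hypothetical_hcm_scores, threshold)
    print(f"{label} ({threshold:.0%}): {share:.0%} of patients flagged")
```

The same set of scores yields very different proportions of flagged patients depending solely on where the cut-off is placed, which is why results at the validated thresholds should be reported alongside any alternatives.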

Third, applying screening tools developed in treatment-naive populations to a mixed post-intervention cohort without stratification introduces avoidable confounding. This cohort includes patients who had undergone ventricular myectomy (2.5%), alcohol septal ablation (5.5%), and implantable cardioverter-defibrillator implantation (11%), interventions known from previous work to substantially alter the electrocardiographic HCM signature. Our multicentre evaluation demonstrated that AI-ECG HCM scores do not decrease, and may paradoxically rise, after myectomy, reflecting potential scarring from the procedure.5 In clinical practice, an HCM screening tool has limited utility in a post-intervention population with an established diagnosis. Although evaluating model outputs in such individuals may still be informative for understanding model behaviour or exploring whether these scores capture features of disease severity, such analyses should not be used to draw conclusions about discriminative performance without appropriate stratification and interpretation.
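
A stratified summary of the kind described here might look like the sketch below. The records are hypothetical, the stratum labels simply mirror the interventions named above, and assigning each patient to a single stratum is a simplifying assumption (in practice a patient may have had more than one intervention). `summarize_by_stratum` is an illustrative helper name.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (stratum, model probability) pairs, for illustration only.
records = [
    ("treatment-naive", 0.20),
    ("treatment-naive", 0.40),
    ("treatment-naive", 0.60),
    ("post-myectomy", 0.50),
    ("post-myectomy", 0.70),
    ("post-alcohol septal ablation", 0.30),
    ("ICD implanted", 0.55),
    ("ICD implanted", 0.65),
    ("ICD implanted", 0.75),
]


def summarize_by_stratum(records):
    """Group scores by intervention stratum and report size and median score."""
    groups = defaultdict(list)
    for stratum, score in records:
        groups[stratum].append(score)
    return {
        stratum: {"n": len(scores), "median": median(scores)}
        for stratum, scores in groups.items()
    }


for stratum, summary in summarize_by_stratum(records).items():
    print(f"{stratum}: n={summary['n']}, median score={summary['median']:.2f}")
```

Reporting scores per stratum, rather than pooled, lets readers see whether post-intervention patients behave differently from the treatment-naive group the tools were developed in.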

Fourth, deployment details are insufficiently documented. No ECG images are shown to confirm that pre-processing, including cropping and de-identification, aligned with validated input specifications. Finally, no information is provided on when ECGs were acquired relative to diagnosis or treatment, making pooled results difficult to interpret across what may be vastly different stages of the disease trajectory.

Importantly, the results cited as evidence of underperformance could equally be interpreted as evidence that these models retain meaningful biological signal in a highly complex, disease-enriched cohort. Figure 1 of the article by Babur Guler and colleagues shows that, at the prespecified operating thresholds, most individuals in this cohort would in fact have been classified as having HCM. Moreover, even in a population outside the model’s intended screening setting, HCM model outputs correlated significantly with maximal wall thickness (r = 0.30), NT-proBNP (r = 0.41), late gadolinium enhancement, T-wave axis (r = 0.48), and Sokolow-Lyon index (r = 0.39). Apical HCM, a phenotype characterized by marked repolarization abnormalities, yielded the highest model probabilities. Likewise, PRESENT-SHD probabilities correlated inversely with LVEF measured by echocardiography (r = −0.22) and CMR (r = −0.16), as well as with TAPSE (r = −0.24), aligning with the functional parameters a structural heart disease screening model would be expected to capture in a population with established cardiomyopathy. Thus, the models continued to track clinically meaningful gradients of disease expression, even though this setting was not designed to assess diagnostic discrimination. These findings support preservation of biologically relevant signal rather than model failure.

The misinterpretation of publicly available AI-ECG tools, as illustrated in the present study, has implications that extend well beyond any individual model. Open science in cardiovascular AI remains uncommon, and its continued progress depends on a culture of collaborative, methodologically rigorous validation.6–8 When openly shared research tools are applied outside their intended use cases, evaluated without appropriate comparators, and then framed as underperforming despite preserving meaningful biological signal, the consequences extend beyond local misinterpretation: such practices risk discouraging future model sharing. Yet this problem is readily preventable. Just as commercial AI diagnostics typically come with implementation guidance and technical consultation to promote appropriate deployment, openly shared research tools can also be evaluated responsibly.9 In many cases, brief communication with corresponding authors would be sufficient to clarify intended use cases, validated thresholds, and key study design considerations. We raise these points not to restrict access or diminish the investigators’ considerable effort, but to emphasize that this exceptional cohort with evident scientific depth deserved a study design equal to its quality. To support future evaluations and provide a more generalizable framework for the field, we propose a checklist for responsible external validation of AI-ECG tools (Table 1). We welcome independent validation of our tools and would be glad to collaborate on a rigorous analysis of this registry, because fair and methodologically sound evaluation is essential not only for trust in individual models but also for preserving the culture of openness on which progress in cardiovascular AI depends.10

Table 1. Checklist for responsible external validation of artificial intelligence-enhanced electrocardiography tools

| Domain | No. | Checklist item | Rationale |
| --- | --- | --- | --- |
| Study Design | 1a | Match the study population, including appropriate controls or comparators, to the model’s intended use case and clinical setting | Models should be evaluated in light of their clinical applicability and intended use; discriminative classifiers require cases and controls to estimate sensitivity, specificity, and AUROC |
| | 1b | Define whether the evaluation targets screening, diagnosis, severity stratification, or monitoring, and match the study design accordingly | Case-only cohorts can assess calibration and severity gradients, but not diagnostic discrimination |
| Technical Deployment | 2a | Confirm that input data format, resolution, pre-processing, and de-identification match the model’s validated specifications; provide representative input examples | AI models can be sensitive to format, resolution, and pre-processing differences; examples enable readers to assess deployment fidelity |
| | 2b | Use validated operating thresholds from the primary publication; if alternative thresholds are explored, justify them and report performance at the original thresholds | Thresholds are calibrated to specific sensitivity/specificity trade-offs; arbitrary cut-offs misrepresent performance |
| | 2c | Report the hardware, software environment, and library versions used for model inference | Reproducibility requires transparency about the computational environment; version differences can alter outputs |
| Interpretation & Reporting | 3a | Distinguish between expected model behaviour given the study design and actual model failure, particularly in non-standard settings such as case-only or post-intervention cohorts | Probability distributions in non-standard cohorts reflect population composition, not necessarily model deficiency |
| | 3b | Report clinically meaningful associations (e.g. correlation with disease severity markers, treatment response) alongside classification metrics | Biological signal preservation demonstrates model validity even outside standard case-control designs |
| | 3c | Ensure that conclusions about model performance are benchmarked against validated reference standards and operating thresholds | Conclusions drawn without validated benchmarks may misrepresent model capabilities and discourage adoption of effective tools |
| Communication with Developers | 4a | Contact model developers to clarify intended use cases, validated thresholds, known limitations, and deployment specifications | A low-cost step that prevents avoidable methodological errors, especially for openly shared research tools |
| | 4b | Report any discrepancies between the evaluation setting and the model’s documented intended use | Transparency about deviations from intended use enables appropriate interpretation of results |
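
Checklist item 2c can be addressed with very little effort. The sketch below, which uses only the Python standard library, collects the interpreter, platform, and installed package versions into a record that could accompany a report. The package names passed in are placeholders, since the libraries actually required by a given model are not specified here, and `describe_environment` is an illustrative helper name.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Collect interpreter, platform, and package versions for reporting."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


# Placeholder package names; replace with the libraries used for inference.
report = describe_environment(["numpy", "torch"])
print(json.dumps(report, indent=2))
```

The resulting JSON can be included in supplementary material so that readers can compare the evaluation environment with the one documented by the model developers.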

Contributor Information

Lovedeep S Dhingra, Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, 100 Church St S, Ste F250, New Haven, CT, USA; Cardiovascular Data Science (CarDS) Lab, Yale School of Medicine, New Haven, CT, USA.

Philip M Croon, Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, 100 Church St S, Ste F250, New Haven, CT, USA; Cardiovascular Data Science (CarDS) Lab, Yale School of Medicine, New Haven, CT, USA.

Evangelos K Oikonomou, Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, 100 Church St S, Ste F250, New Haven, CT, USA; Cardiovascular Data Science (CarDS) Lab, Yale School of Medicine, New Haven, CT, USA.

Rohan Khera, Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, 100 Church St S, Ste F250, New Haven, CT, USA; Cardiovascular Data Science (CarDS) Lab, Yale School of Medicine, New Haven, CT, USA; Center for Outcomes Research and Evaluation (CORE), Yale New Haven Hospital, New Haven, CT, USA; Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, USA; Section of Health Informatics, Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA.

Author contributions

Lovedeep Singh Dhingra (Conceptualization, Methodology, Writing—original draft [lead]), Philip M. Croon (Conceptualization, Methodology [supporting], Writing—original draft [equal]), Evangelos K. Oikonomou (Conceptualization, Methodology [supporting], Writing—review & editing [equal]), and Rohan Khera (Conceptualization, Project administration, Writing—review & editing [lead], Methodology [supporting])

Funding

The authors acknowledge support from the National Heart, Lung, and Blood Institute (R01HL167858 and K23HL153775 to R.K., and F32HL170592 to E.K.O.), the National Institute on Aging (R01AG089981 to R.K.), the Doris Duke Charitable Foundation (2022060 to R.K.), and the Robert A. Winn Excellence in Clinical Trials Career Development Award (to E.K.O.), during the conduct of the study. The funders had no role in the design and conduct of the study; the collection, management, analysis, and interpretation of the data; the preparation, review, or approval of the manuscript; or the decision to submit the manuscript for publication.

References

  • 1. Babur Guler G, Guler A, Surgit O, Turkmen I, Atmaca S, Sahin H, et al. Evaluation of artificial intelligence-based electrocardiogram analysis tools in patients with hypertrophic cardiomyopathy. Eur Heart J Digit Health 2026;7:ztag026.
  • 2. Sangha V, Mortazavi BJ, Haimovich AD, Ribeiro AH, Brandt CA, Jacoby DL, et al. Automated multilabel diagnosis on electrocardiographic images and signals. Nat Commun 2022;13:1583.
  • 3. Dhingra LS, Aminorroaya A, Sangha V, Pedroso AF, Shankar SV, Coppi A, et al. Ensemble deep learning algorithm for structural heart disease screening using electrocardiographic images: PRESENT SHD. J Am Coll Cardiol 2025;85:1302–1313.
  • 4. Sangha V, Dhingra LS, Aminorroaya A, Croon PM, Sikand NV, Sen S, et al. Identification of hypertrophic cardiomyopathy on electrocardiographic images with deep learning. Nat Cardiovasc Res 2025;4:991–1000.
  • 5. Dhingra LS, Sangha V, Aminorroaya A, Bryde R, Gaballa A, Ali AH, et al. A multicenter evaluation of the impact of therapies on deep learning-based electrocardiographic hypertrophic cardiomyopathy markers. Am J Cardiol 2024;237:35–40.
  • 6. Croon PM, Boonstra MJ, Allaart CP, Arends BKO, Dhingra LS, Huang Y-C, et al. Artificial intelligence-enhanced electrocardiogram models for detection of left ventricular dysfunction: a comparison study. JACC Adv 2026;5:102572.
  • 7. Callahan A, McElfresh D, Banda JM, Bunney G, Char D, Chen J, et al. Standing on FURM ground: a framework for evaluating fair, useful, and reliable AI models in health care systems. NEJM Catal Innov Care Deliv 2024;5:CAT-24.
  • 8. Angus DC, Khera R, Lieu T, Liu V, Ahmad FS, Anderson B, et al. AI, health, and health care today and tomorrow: the JAMA summit report on artificial intelligence. JAMA 2025;334:1650–1664.
  • 9. Dhingra LS, Pedroso AF, Aminorroaya A, Caraballo C, Mahajan S, Bansal B, et al. Follow-up assessment of adherence to methodological standards in national inpatient sample research. JAMA Netw Open 2026;9:e2555753.
  • 10. Khera R, Oikonomou EK, Nadkarni GN, Morley JR, Wiens J, Butte AJ, et al. Transforming cardiovascular care with artificial intelligence: from discovery to practice. J Am Coll Cardiol 2024;84:97–114.

Articles from European Heart Journal. Digital Health are provided here courtesy of Oxford University Press on behalf of the European Society of Cardiology
