We read with concern the study by Babur Guler and colleagues, which evaluated three AI-enhanced electrocardiography (AI-ECG) tools in 681 patients with confirmed hypertrophic cardiomyopathy (HCM).1 These tools comprise a model for detecting underrecognized HCM, PRESENT-SHD for structural heart disease (SHD) screening, and a multilabel rhythm and conduction classifier (ECGDx). These models, developed by our group, are freely available for research use on our lab’s website.2–5
We are committed to open and transparent research and welcome independent external evaluations. We agree with the authors that independent assessment of AI tools in disease-specific populations can inform understanding of a model’s behaviour in specific clinical scenarios, which is essential for the field. However, external validation carries a reciprocal responsibility: models must be represented accurately with respect to their intended use case, validated operating thresholds, and prior evidence base, and must be evaluated using an appropriate study design.
On several of these fronts, the current study falls short in ways that undermine its conclusions. First, and most fundamentally, the study includes no control or reference group. All three tools are discriminative classifiers, developed and validated in mixed case-control populations. Applying them to a cohort composed entirely of confirmed HCM patients makes it mathematically impossible to assess diagnostic performance. The authors observe that ‘when applied to an exclusively HCM population, the tool appears to lose discriminative power, possibly due to the absence of the contrast provided by non-HCM cases in the training data’. This interpretation is fundamentally flawed. The apparent loss of discriminative power is not an empirical finding but a mathematical inevitability of applying a discriminative classifier to a case-only cohort. While the paper repeatedly frames these distributions as ‘modest performance’, sensitivity, specificity, and discrimination cannot be estimated from a case-only sample, and the resulting probability distributions cannot be benchmarked against any meaningful clinical reference.
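The case-only argument above can be illustrated with a minimal sketch. The scores below are invented for illustration and are not data from either study, and the helper names `auroc` and `specificity` are hypothetical. The sketch assumes AUROC is computed as the probability that a randomly chosen case outscores a randomly chosen control, which makes it visible why both quantities have no defined value once the control group is empty.

```python
from typing import Optional, Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> Optional[float]:
    """AUROC as the fraction of case-control pairs in which the case scores higher.

    Ties count as half a win. Returns None when either class is absent,
    because there is then no case-control pair to compare.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        return None
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def specificity(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Optional[float]:
    """Fraction of controls scoring below the threshold; None without controls."""
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not controls:
        return None
    return sum(s < threshold for s in controls) / len(controls)


# Hypothetical model outputs (assumption: illustrative values only).
case_scores = [0.10, 0.22, 0.35, 0.60, 0.80]
control_scores = [0.02, 0.05, 0.12, 0.30]

# Case-only cohort: every label is 1, so no controls exist.
case_only_labels = [1] * len(case_scores)
print(auroc(case_scores, case_only_labels))              # None
print(specificity(case_scores, case_only_labels, 0.15))  # None

# Mixed cohort: the same cases plus controls make both metrics estimable.
mixed_scores = case_scores + control_scores
mixed_labels = [1] * len(case_scores) + [0] * len(control_scores)
print(auroc(mixed_scores, mixed_labels))
print(specificity(mixed_scores, mixed_labels, 0.15))
```

In this sketch the model scores for the cases are identical in both cohorts; only the presence of controls determines whether discrimination can be estimated at all.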
Second, the study does not use the validated operating thresholds from the primary publications. The validated thresholds are 15% for the HCM model and 20% for PRESENT-SHD, each selected to achieve approximately 90% sensitivity in internal validation.3–5 Instead, it emphasizes proportions of patients with HCM or SHD probabilities above 50% and 75%, which represent arbitrary cut-offs.1 As a result, the conclusion that relatively few patients ‘receive a high score’ is driven by threshold choice, and the study thereby conflates a study design decision with model failure.
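The dependence on threshold choice can be sketched in a few lines. The 15% threshold for the HCM model and the 50% and 75% cut-offs are taken from the text above; the scores are invented for illustration and do not represent the registry data, and `fraction_flagged` is a hypothetical helper name.

```python
from typing import Sequence


def fraction_flagged(scores: Sequence[float], threshold: float) -> float:
    """Fraction of patients whose model probability is at or above the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(s >= threshold for s in scores) / len(scores)


# Hypothetical HCM model probabilities for ten confirmed cases
# (assumption: illustrative values only, not study data).
hcm_scores = [0.08, 0.16, 0.21, 0.33, 0.47, 0.52, 0.58, 0.71, 0.77, 0.90]

# 0.15 is the validated HCM threshold cited above; 0.50 and 0.75 are the
# cut-offs emphasized in the evaluated study.
for threshold in (0.15, 0.50, 0.75):
    flagged = fraction_flagged(hcm_scores, threshold)
    print(f"threshold {threshold:.2f}: {flagged:.0%} flagged")
```

With these invented scores, the same model outputs yield 90% flagged at the validated threshold but only 50% and 20% at the higher cut-offs, so the headline proportion reflects the cut-off as much as the model.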
Third, applying screening tools developed in treatment-naive populations to a mixed post-intervention cohort without stratification introduces avoidable confounding. This cohort includes patients who had undergone ventricular myectomy (2.5%), alcohol septal ablation (5.5%), and implantable cardioverter-defibrillator implantation (11%), interventions known from previous work to substantially alter the electrocardiographic HCM signature. Our multicentre evaluation demonstrated that AI-ECG HCM scores do not decrease and may paradoxically rise after myectomy, reflecting potential scarring from the procedure.5 In clinical practice, an HCM screening tool has limited utility in a post-intervention population with an established diagnosis. Although evaluating model outputs in such individuals may still be informative for understanding model behaviour or exploring whether these scores capture features of disease severity, such analyses should not be used to draw conclusions about discriminative performance without appropriate stratification and interpretation.
Fourth, deployment details are insufficiently documented. No ECG images are shown to confirm that pre-processing, including cropping and de-identification, aligned with validated input specifications. Finally, no information is provided on when ECGs were acquired relative to diagnosis or treatment, making pooled results difficult to interpret across what may be vastly different stages of the disease trajectory.
Importantly, the results cited as evidence of underperformance could equally be interpreted as evidence that these models retain meaningful biological signal in a highly complex, disease-enriched cohort. Figure 1 of the manuscript shows that, at prespecified operating thresholds, most individuals in this cohort would in fact have been classified as having HCM. Moreover, even in a population outside the model’s intended screening setting, HCM model outputs correlated significantly with maximal wall thickness (r = 0.30), NT-proBNP (r = 0.41), late gadolinium enhancement, T-wave axis (r = 0.48), and Sokolow-Lyon index (r = 0.39). Apical HCM, a phenotype characterized by marked repolarization abnormalities, yielded the highest model probabilities. Likewise, PRESENT-SHD probabilities correlated inversely with LVEF measured by echocardiography (r = −0.22) and CMR (r = −0.16), as well as with TAPSE (r = −0.24), aligning with the functional parameters a structural heart disease screening model would be expected to capture in a population with established cardiomyopathy. Thus, the models continued to track clinically meaningful gradients of disease expression, despite this setting not being designed to assess diagnostic discrimination. These findings support preservation of biologically relevant signal rather than model failure.
The misinterpretation of publicly available AI-ECG tools, as illustrated in the present study, has implications that extend well beyond any individual model. Open science in cardiovascular AI remains uncommon, and its continued progress depends on a culture of collaborative, methodologically rigorous validation.6–8 When openly shared research tools are applied outside their intended use cases, evaluated without appropriate comparators, and then framed as underperforming despite preserving meaningful biological signal, the consequences extend beyond local misinterpretation: such practices risk discouraging future model sharing. Yet this problem is readily preventable. Just as commercial AI diagnostics typically come with implementation guidance and technical consultation to promote appropriate deployment, openly shared research tools can also be evaluated responsibly.9 In many cases, brief communication with corresponding authors would be sufficient to clarify intended use cases, validated thresholds, and key study design considerations. We raise these points not to restrict access or diminish the investigators’ considerable effort, but to emphasize that this exceptional cohort with evident scientific depth deserved a study design equal to its quality. To support future evaluations and provide a more generalizable framework for the field, we propose a checklist for responsible external validation of AI-ECG tools (Table 1). We welcome independent validation of our tools and would be glad to collaborate on a rigorous analysis of this registry, because fair and methodologically sound evaluation is essential not only for trust in individual models but also for preserving the culture of openness on which progress in cardiovascular AI depends.10
Table 1. Checklist for responsible external validation of artificial intelligence-enhanced electrocardiography tools
| Domain | No. | Checklist Item | Rationale |
|---|---|---|---|
| Study Design | 1a | Match the study population, including appropriate controls or comparators, to the model’s intended use case and clinical setting | Models should be evaluated in light of their clinical applicability and intended use; discriminative classifiers require cases and controls to estimate sensitivity, specificity, and AUROC |
| | 1b | Define whether the evaluation targets screening, diagnosis, severity stratification, or monitoring, and match the study design accordingly | Case-only cohorts can assess calibration and severity gradients, but not diagnostic discrimination |
| Technical Deployment | 2a | Confirm that input data format, resolution, pre-processing, and de-identification match the model’s validated specifications; provide representative input examples | AI models can be sensitive to format, resolution, and pre-processing differences; examples enable readers to assess deployment fidelity |
| | 2b | Use validated operating thresholds from the primary publication; if alternative thresholds are explored, justify them and report performance at the original thresholds | Thresholds are calibrated to specific sensitivity/specificity trade-offs; arbitrary cut-offs misrepresent performance |
| | 2c | Report the hardware, software environment, and library versions used for model inference | Reproducibility requires transparency about the computational environment; version differences can alter outputs |
| Interpretation & Reporting | 3a | Distinguish between expected model behaviour given the study design and actual model failure, particularly in non-standard settings such as case-only or post-intervention cohorts | Probability distributions in non-standard cohorts reflect population composition, not necessarily model deficiency |
| | 3b | Report clinically meaningful associations (e.g. correlation with disease severity markers, treatment response) alongside classification metrics | Biological signal preservation demonstrates model validity even outside standard case-control designs |
| | 3c | Ensure that conclusions about model performance are benchmarked against validated reference standards and operating thresholds | Conclusions drawn without validated benchmarks may misrepresent model capabilities and discourage adoption of effective tools |
| Communication with Developers | 4a | Contact model developers to clarify intended use cases, validated thresholds, known limitations, and deployment specifications | A low-cost step that prevents avoidable methodological errors, especially for openly shared research tools |
| | 4b | Report any discrepancies between the evaluation setting and the model’s documented intended use | Transparency about deviations from intended use enables appropriate interpretation of results |
Contributor Information
Lovedeep S Dhingra, Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, 100 Church St S, Ste F250, New Haven, CT, USA; Cardiovascular Data Science (CarDS) Lab, Yale School of Medicine, New Haven, CT, USA.
Philip M Croon, Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, 100 Church St S, Ste F250, New Haven, CT, USA; Cardiovascular Data Science (CarDS) Lab, Yale School of Medicine, New Haven, CT, USA.
Evangelos K Oikonomou, Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, 100 Church St S, Ste F250, New Haven, CT, USA; Cardiovascular Data Science (CarDS) Lab, Yale School of Medicine, New Haven, CT, USA.
Rohan Khera, Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, 100 Church St S, Ste F250, New Haven, CT, USA; Cardiovascular Data Science (CarDS) Lab, Yale School of Medicine, New Haven, CT, USA; Center for Outcomes Research and Evaluation (CORE), Yale New Haven Hospital, New Haven, CT, USA; Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, USA; Section of Health Informatics, Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA.
Author contributions
Lovedeep Singh Dhingra (Conceptualization, Methodology, Writing—original draft [lead]), Philip M. Croon (Conceptualization, Methodology [supporting], Writing—original draft [equal]), Evangelos K. Oikonomou (Conceptualization, Methodology [supporting], Writing—review & editing [equal]), and Rohan Khera (Conceptualization, Project administration, Writing—review & editing [lead], Methodology [supporting])
Funding
The authors acknowledge support from the National Heart, Lung, and Blood Institute (R01HL167858 and K23HL153775 to R.K., and F32HL170592 to E.K.O.), the National Institute on Aging (R01AG089981 to R.K.), the Doris Duke Charitable Foundation (2022060 to R.K.), and the Robert A. Winn Excellence in Clinical Trials Career Development Award (to E.K.O.), during the conduct of the study. The funders had no role in the design and conduct of the study; the collection, management, analysis, and interpretation of the data; the preparation, review, or approval of the manuscript; or the decision to submit the manuscript for publication.
References
- 1. Babur Guler G, Guler A, Surgit O, Turkmen I, Atmaca S, Sahin H, et al. Evaluation of artificial intelligence-based electrocardiogram analysis tools in patients with hypertrophic cardiomyopathy. Eur Heart J Digit Health 2026;7:ztag026.
- 2. Sangha V, Mortazavi BJ, Haimovich AD, Ribeiro AH, Brandt CA, Jacoby DL, et al. Automated multilabel diagnosis on electrocardiographic images and signals. Nat Commun 2022;13:1583.
- 3. Dhingra LS, Aminorroaya A, Sangha V, Pedroso AF, Shankar SV, Coppi A, et al. Ensemble deep learning algorithm for structural heart disease screening using electrocardiographic images: PRESENT SHD. J Am Coll Cardiol 2025;85:1302–1313.
- 4. Sangha V, Dhingra LS, Aminorroaya A, Croon PM, Sikand NV, Sen S, et al. Identification of hypertrophic cardiomyopathy on electrocardiographic images with deep learning. Nat Cardiovasc Res 2025;4:991–1000.
- 5. Dhingra LS, Sangha V, Aminorroaya A, Bryde R, Gaballa A, Ali AH, et al. A multicenter evaluation of the impact of therapies on deep learning-based electrocardiographic hypertrophic cardiomyopathy markers. Am J Cardiol 2024;237:35–40.
- 6. Croon PM, Boonstra MJ, Allaart CP, Arends BKO, Dhingra LS, Huang Y-C, et al. Artificial intelligence-enhanced electrocardiogram models for detection of left ventricular dysfunction: a comparison study. JACC Adv 2026;5:102572.
- 7. Callahan A, McElfresh D, Banda JM, Bunney G, Char D, Chen J, et al. Standing on FURM ground: a framework for evaluating fair, useful, and reliable AI models in health care systems. NEJM Catal Innov Care Deliv 2024;5:CAT-24.
- 8. Angus DC, Khera R, Lieu T, Liu V, Ahmad FS, Anderson B, et al. AI, health, and health care today and tomorrow: the JAMA summit report on artificial intelligence. JAMA 2025;334:1650–1664.
- 9. Dhingra LS, Pedroso AF, Aminorroaya A, Caraballo C, Mahajan S, Bansal B, et al. Follow-up assessment of adherence to methodological standards in national inpatient sample research. JAMA Netw Open 2026;9:e2555753.
- 10. Khera R, Oikonomou EK, Nadkarni GN, Morley JR, Wiens J, Butte AJ, et al. Transforming cardiovascular care with artificial intelligence: from discovery to practice. J Am Coll Cardiol 2024;84:97–114.
