Open science requires trust and rigour: a framework for responsible evaluation of shared AI-ECG tools

Lovedeep S Dhingra; Philip M Croon; Evangelos K Oikonomou; Rohan Khera

doi:10.1093/ehjdh/ztag057

letter

2026 Apr 15;7(4):ztag057. doi: 10.1093/ehjdh/ztag057


Lovedeep S Dhingra ¹,², Philip M Croon ¹,², Evangelos K Oikonomou ¹,², Rohan Khera ¹,²,³,⁴,⁵,✉

¹ Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, 100 Church St S, Ste F250, New Haven, CT, USA

² Cardiovascular Data Science (CarDS) Lab, Yale School of Medicine, New Haven, CT, USA

³ Center for Outcomes Research and Evaluation (CORE), Yale New Haven Hospital, New Haven, CT, USA

⁴ Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, USA

⁵ Section of Health Informatics, Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA

✉ Corresponding author. Tel: 203 764 5885, Email: rohan.khera@yale.edu, @rohan_khera

Conflict of interest: R.K. is an Associate Editor of JAMA. R.K. is the coinventor of pending US and European Patent Application US20250372251A1/EP4533482A1, ‘Articles and methods for format-independent detection of hidden cardiovascular disease from printed electrocardiographic images using deep learning’ and a co-founder of Ensight-AI. R.K. receives support from the National Institutes of Health (under awards R01AG089981, R01HL167858, and K23HL153775) and the Doris Duke Charitable Foundation (under award 2022060). He receives support from the Blavatnik Foundation through the Blavatnik Fund for Innovation at Yale. He also receives research support, through Yale, from Bristol-Myers Squibb, BridgeBio, and Novo Nordisk. In addition to 63/346,610, R.K. is a coinventor of US Pending Patent Applications WO2023230345A1, US20220336048A1, 63/484,426, 63/508,315, 63/580,137, 63/606,203, 63/619,241, and 63/562,335. R.K. and E.K.O. are co-founders of Evidence2Health, a precision health platform to improve evidence-based cardiovascular care. E.K.O. acknowledges research support through a Robert A. Winn Excellence in Clinical Trials Career Development Award and a Wiesman Award for Excellence in Early-Career ATTR Research from Cornerstone Medical Education (through Yale University). He is a coinventor in patent applications (18/813,882, 17/720,068, 63/508,315, 63/580,137, 63/619,241, 63/562,335) and granted patents (US12067714B2, US11948230B2, licensed through the University of Oxford to Caristo Diagnostics Ltd). He has been a consultant for Caristo Diagnostics Ltd and Ensight-AI Inc, and has received honoraria from Clinical Education Alliance. He also serves as Associate Editor for European Heart Journal. P.M.C. is a co-founder of Ensight-AI, and former owner of DGTL Health B.V, for which he still serves as an advisor.

Roles

Lovedeep S Dhingra: Conceptualization, Methodology, Writing - original draft

Philip M Croon: Conceptualization, Methodology, Writing - original draft

Evangelos K Oikonomou: Conceptualization, Methodology, Writing - review & editing

Rohan Khera: Conceptualization, Methodology, Project administration, Writing - review & editing

PMCID: PMC13080934 PMID: 41994368

We read with concern the study by Babur Guler and colleagues, which evaluated three AI-enhanced electrocardiography (AI-ECG) tools in 681 patients with confirmed hypertrophic cardiomyopathy (HCM).¹ These tools comprise a model for detecting underrecognized HCM, PRESENT-SHD for structural heart disease (SHD) screening, and a multilabel rhythm and conduction classifier (ECGDx). These models, developed by our group, are freely available for research use on our lab’s website.²⁻⁵

We are committed to open and transparent research and welcome independent external evaluations. We agree with the authors that independent assessment of AI tools in disease-specific populations can inform understanding of a model’s behaviour in specific clinical scenarios, which is essential for the field. However, responsible external validation carries a reciprocal responsibility: models must be represented accurately with respect to their intended use case, validated operating thresholds, and prior evidence base, and evaluated using an appropriate study design.

On several of these fronts, the current study falls short in ways that undermine its conclusions. First, and most fundamentally, the study includes no control or reference group. All three tools are discriminative classifiers, developed and validated in mixed case-control populations. Applying them to a cohort composed entirely of confirmed HCM patients makes it mathematically impossible to assess diagnostic performance. The authors observe that ‘when applied to an exclusively HCM population, the tool appears to lose discriminative power, possibly due to the absence of the contrast provided by non-HCM cases in the training data’. This interpretation is fundamentally flawed. The apparent loss of discriminative power is not an empirical finding but a mathematical inevitability of applying a discriminative classifier to a case-only cohort. While the paper repeatedly frames these distributions as ‘modest performance’, sensitivity, specificity, and discrimination cannot be estimated from a case-only sample, and the resulting probability distributions cannot be benchmarked against any meaningful clinical reference.
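The case-only argument can be made concrete with a minimal sketch. It assumes AUROC is computed as the share of case-control pairs in which the case receives the higher score, and that a score at or above the threshold counts as a positive call. The function names and all probabilities are hypothetical and are not taken from either study. With no controls there are no pairs to rank and no true negatives to count, so both quantities are undefined rather than low.

```python
from typing import Optional, Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> Optional[float]:
    """AUROC as the fraction of (case, control) pairs ranked correctly.

    Ties count as half a correct pair. Returns None when either class is
    absent, because there are then no case-control pairs to compare.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        return None
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def specificity(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Optional[float]:
    """Fraction of controls scoring below the threshold; None without controls."""
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not controls:
        return None
    return sum(s < threshold for s in controls) / len(controls)


# Hypothetical model probabilities, for illustration only.
case_only_scores = [0.12, 0.35, 0.61, 0.08, 0.44]
case_only_labels = [1, 1, 1, 1, 1]

# The same cases with four hypothetical controls added.
mixed_scores = case_only_scores + [0.03, 0.10, 0.05, 0.40]
mixed_labels = case_only_labels + [0, 0, 0, 0]

print("case-only AUROC:", auroc(case_only_scores, case_only_labels))
print("case-only specificity:", specificity(case_only_scores, case_only_labels, 0.15))
print("mixed AUROC:", auroc(mixed_scores, mixed_labels))
print("mixed specificity:", specificity(mixed_scores, mixed_labels, 0.15))
```

In this sketch the case-only calls return `None`, while the mixed cohort yields numeric estimates. The scores of the cases are identical in both cohorts; only the presence of a comparator changes what can be estimated.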

Second, the study does not use the validated operating thresholds from the primary publications. The validated thresholds are 15% for the HCM model and 20% for PRESENT-SHD, each selected to achieve approximately 90% sensitivity in internal validation.³⁻⁵ Instead, it emphasizes proportions of patients with HCM or SHD probabilities above 50% and 75%, which represent arbitrary cut-offs.¹ As a result, the conclusion that relatively few patients ‘receive a high score’ is driven by threshold choice, which conflates a study design decision with model failure.
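How strongly the ‘high score’ proportion depends on the cut-off can be illustrated with a short sketch. The ten probabilities below are invented for illustration and are not data from the evaluated cohort. Only the thresholds come from the text above (0.15 for the HCM model, and the 0.50 and 0.75 cut-offs), and treating a score equal to the threshold as flagged is an assumption.

```python
from typing import Dict, Sequence


def fraction_flagged(
    probs: Sequence[float], thresholds: Sequence[float]
) -> Dict[float, float]:
    """Fraction of patients with a probability at or above each threshold."""
    n = len(probs)
    return {t: sum(p >= t for p in probs) / n for t in thresholds}


# Hypothetical HCM-model probabilities for ten confirmed cases (illustration only).
hcm_probs = [0.09, 0.16, 0.18, 0.22, 0.27, 0.31, 0.38, 0.46, 0.58, 0.81]

# 0.15 is the validated HCM-model threshold quoted in the text; 0.50 and 0.75
# are the cut-offs emphasized in the evaluated study.
result = fraction_flagged(hcm_probs, [0.15, 0.50, 0.75])
for threshold, fraction in result.items():
    print(f"threshold {threshold:.2f}: {fraction:.0%} flagged")
```

With these made-up values, nine of ten cases are flagged at 0.15 but only two at 0.50 and one at 0.75, even though the underlying scores never change.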

Third, applying screening tools developed in treatment-naive populations to a mixed post-intervention cohort without stratification introduces avoidable confounding. This cohort includes patients who had undergone ventricular myectomy (2.5%), alcohol septal ablation (5.5%), and implantable cardioverter-defibrillator implantation (11%), representing interventions known from previous work to substantially alter the electrocardiographic HCM signature. Our multicentre evaluation demonstrated that AI-ECG HCM scores do not decrease, and may paradoxically rise, after myectomy, reflecting potential scarring from the procedure.⁵ In clinical practice, there is limited utility in applying an HCM screening tool to a post-intervention population with an established diagnosis. Although evaluating model outputs in such individuals may still be informative for understanding model behaviour or exploring whether these scores capture features of disease severity, such analyses should not be used to draw conclusions about discriminative performance without appropriate stratification and interpretation.

Fourth, deployment details are insufficiently documented. No ECG images are shown to confirm that pre-processing, including cropping and de-identification, aligned with validated input specifications. Finally, no information is provided on when ECGs were acquired relative to diagnosis or treatment, making pooled results difficult to interpret across what may be vastly different stages of the disease trajectory.

Importantly, the results cited as evidence of underperformance could equally be interpreted as evidence that these models retain meaningful biological signal in a highly complex, disease-enriched cohort. Figure 1 of the manuscript shows that, at prespecified operating thresholds, most individuals in this cohort would in fact have been classified as having HCM. Moreover, even in a population outside the model’s intended screening setting, HCM model outputs correlated significantly with maximal wall thickness (r = 0.30), NT-proBNP (r = 0.41), late gadolinium enhancement, T-wave axis (r = 0.48), and Sokolow-Lyon index (r = 0.39). Apical HCM, a phenotype characterized by marked repolarization abnormalities, yielded the highest model probabilities. Likewise, PRESENT-SHD probabilities correlated inversely with LVEF measured by echocardiography (r = −0.22) and CMR (r = −0.16), as well as with TAPSE (r = −0.24), aligning with the functional parameters a structural heart disease screening model would be expected to capture in a population with established cardiomyopathy. Thus, the models continued to track clinically meaningful gradients of disease expression, despite this setting not being designed to assess diagnostic discrimination. These findings support preservation of biologically relevant signal rather than model failure.

The misinterpretation of publicly available AI-ECG tools, as illustrated in the present study, has implications that extend well beyond any individual model. Open science in cardiovascular AI remains uncommon, and its continued progress depends on a culture of collaborative, methodologically rigorous validation.⁶⁻⁸ When openly shared research tools are applied outside their intended use cases, evaluated without appropriate comparators, and then framed as underperforming despite preserving meaningful biological signal, the consequences extend beyond local misinterpretation: such practices risk discouraging future model sharing. Yet this problem is readily preventable. Just as commercial AI diagnostics typically come with implementation guidance and technical consultation to promote appropriate deployment, openly shared research tools can also be evaluated responsibly.⁹ In many cases, brief communication with corresponding authors would be sufficient to clarify intended use cases, validated thresholds, and key study design considerations. We raise these points not to restrict access or diminish the investigators’ considerable effort, but to emphasize that this exceptional cohort with evident scientific depth deserved a study design equal to its quality. To support future evaluations and provide a more generalizable framework for the field, we propose a checklist for responsible external validation of AI-ECG tools (Table 1). We welcome independent validation of our tools and would be glad to collaborate on a rigorous analysis of this registry, because fair and methodologically sound evaluation is essential not only for trust in individual models but also for preserving the culture of openness on which progress in cardiovascular AI depends.¹⁰

Table 1.

Checklist for responsible external validation of artificial intelligence enhanced electrocardiography tools

Domain	No.	Checklist Item	Rationale
Study Design	1a	Match the study population, including appropriate controls or comparators, to the model’s intended use case and clinical setting	Models should be evaluated in light of their clinical applicability and intended use; discriminative classifiers require cases and controls to estimate sensitivity, specificity, and AUROC
Study Design	1b	Define whether the evaluation targets screening, diagnosis, severity stratification, or monitoring, and match the study design accordingly	Case-only cohorts can assess calibration and severity gradients, but not diagnostic discrimination
Technical Deployment	2a	Confirm that input data format, resolution, pre-processing, and de-identification match the model’s validated specifications; provide representative input examples	AI models can be sensitive to format, resolution, and pre-processing differences; examples enable readers to assess deployment fidelity
Technical Deployment	2b	Use validated operating thresholds from the primary publication; if alternative thresholds are explored, justify them and report performance at the original thresholds	Thresholds are calibrated to specific sensitivity/specificity trade-offs; arbitrary cut-offs misrepresent performance
Technical Deployment	2c	Report the hardware, software environment, and library versions used for model inference	Reproducibility requires transparency about the computational environment; version differences can alter outputs
Interpretation & Reporting	3a	Distinguish between expected model behaviour given the study design and actual model failure, particularly in non-standard settings such as case-only or post-intervention cohorts	Probability distributions in non-standard cohorts reflect population composition, not necessarily model deficiency
Interpretation & Reporting	3b	Report clinically meaningful associations (e.g. correlation with disease severity markers, treatment response) alongside classification metrics	Biological signal preservation demonstrates model validity even outside standard case-control designs
Interpretation & Reporting	3c	Ensure that conclusions about model performance are benchmarked against validated reference standards and operating thresholds	Conclusions drawn without validated benchmarks may misrepresent model capabilities and discourage adoption of effective tools
Communication with Developers	4a	Contact model developers to clarify intended use cases, validated thresholds, known limitations, and deployment specifications	A low-cost step that prevents avoidable methodological errors, especially for openly shared research tools
Communication with Developers	4b	Report any discrepancies between the evaluation setting and the model’s documented intended use	Transparency about deviations from intended use enables appropriate interpretation of results
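
For checklist item 2c, one possible way to capture the inference environment in Python is sketched below. The helper name and the package list are placeholders, since the libraries required by any given model are not specified here; evaluators would substitute the packages their deployment actually uses. Hardware details such as GPU model are not covered by this sketch and would need to be recorded separately.

```python
import json
import platform
from importlib import metadata
from typing import Dict, Iterable


def inference_environment(packages: Iterable[str]) -> Dict[str, object]:
    """Collect interpreter, operating system, and package versions for reporting."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the absence explicitly instead of failing.
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


# Placeholder package names; replace with the libraries used for inference.
env = inference_environment(["numpy", "torch", "tensorflow"])
print(json.dumps(env, indent=2))
```

The resulting JSON can be included in supplementary material so that readers can compare the evaluation environment with the one documented by the model developers.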


Contributor Information

Lovedeep S Dhingra, Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, 100 Church St S, Ste F250, New Haven, CT, USA; Cardiovascular Data Science (CarDS) Lab, Yale School of Medicine, New Haven, CT, USA.

Philip M Croon, Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, 100 Church St S, Ste F250, New Haven, CT, USA; Cardiovascular Data Science (CarDS) Lab, Yale School of Medicine, New Haven, CT, USA.

Evangelos K Oikonomou, Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, 100 Church St S, Ste F250, New Haven, CT, USA; Cardiovascular Data Science (CarDS) Lab, Yale School of Medicine, New Haven, CT, USA.

Rohan Khera, Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, 100 Church St S, Ste F250, New Haven, CT, USA; Cardiovascular Data Science (CarDS) Lab, Yale School of Medicine, New Haven, CT, USA; Center for Outcomes Research and Evaluation (CORE), Yale New Haven Hospital, New Haven, CT, USA; Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, USA; Section of Health Informatics, Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA.

Author contributions

Lovedeep Singh Dhingra (Conceptualization, Methodology, Writing—original draft [lead]), Philip M. Croon (Conceptualization, Methodology [supporting], Writing—original draft [equal]), Evangelos K. Oikonomou (Conceptualization, Methodology [supporting], Writing—review & editing [equal]), and Rohan Khera (Conceptualization, Project administration, Writing—review & editing [lead], Methodology [supporting])

Funding

The authors acknowledge support from the National Heart, Lung, and Blood Institute (R01HL167858 and K23HL153775 to R.K., and F32HL170592 to E.K.O.), the National Institute on Aging (R01AG089981 to R.K.), the Doris Duke Charitable Foundation (2022060 to R.K.), and the Robert A. Winn Excellence in Clinical Trials Career Development Award (to E.K.O.), during the conduct of the study. The funders had no role in the design and conduct of the study; the collection, management, analysis, and interpretation of the data; the preparation, review, or approval of the manuscript; or the decision to submit the manuscript for publication.

References

1. Babur Guler G, Guler A, Surgit O, Turkmen I, Atmaca S, Sahin H, et al. Evaluation of artificial intelligence-based electrocardiogram analysis tools in patients with hypertrophic cardiomyopathy. Eur Heart J Digit Health 2026;7:ztag026.
2. Sangha V, Mortazavi BJ, Haimovich AD, Ribeiro AH, Brandt CA, Jacoby DL, et al. Automated multilabel diagnosis on electrocardiographic images and signals. Nat Commun 2022;13:1583.
3. Dhingra LS, Aminorroaya A, Sangha V, Pedroso AF, Shankar SV, Coppi A, et al. Ensemble deep learning algorithm for structural heart disease screening using electrocardiographic images: PRESENT SHD. J Am Coll Cardiol 2025;85:1302–1313.
4. Sangha V, Dhingra LS, Aminorroaya A, Croon PM, Sikand NV, Sen S, et al. Identification of hypertrophic cardiomyopathy on electrocardiographic images with deep learning. Nat Cardiovasc Res 2025;4:991–1000.
5. Dhingra LS, Sangha V, Aminorroaya A, Bryde R, Gaballa A, Ali AH, et al. A multicenter evaluation of the impact of therapies on deep learning-based electrocardiographic hypertrophic cardiomyopathy markers. Am J Cardiol 2024;237:35–40.
6. Croon PM, Boonstra MJ, Allaart CP, Arends BKO, Dhingra LS, Huang Y-C, et al. Artificial intelligence-enhanced electrocardiogram models for detection of left ventricular dysfunction: a comparison study. JACC Adv 2026;5:102572.
7. Callahan A, McElfresh D, Banda JM, Bunney G, Char D, Chen J, et al. Standing on FURM ground: a framework for evaluating fair, useful, and reliable AI models in health care systems. NEJM Catal Innov Care Deliv 2024;5:CAT-24.
8. Angus DC, Khera R, Lieu T, Liu V, Ahmad FS, Anderson B, et al. AI, health, and health care today and tomorrow: the JAMA summit report on artificial intelligence. JAMA 2025;334:1650–1664.
9. Dhingra LS, Pedroso AF, Aminorroaya A, Caraballo C, Mahajan S, Bansal B, et al. Follow-up assessment of adherence to methodological standards in national inpatient sample research. JAMA Netw Open 2026;9:e2555753.
10. Khera R, Oikonomou EK, Nadkarni GN, Morley JR, Wiens J, Butte AJ, et al. Transforming cardiovascular care with artificial intelligence: from discovery to practice. J Am Coll Cardiol 2024;84:97–114.

