Abstract
Artificial intelligence (AI) has been developed for echocardiography1–3, although it has not yet been tested with blinding and randomization. Here we designed a blinded, randomized non-inferiority clinical trial (ClinicalTrials.gov ID: NCT05140642; no outside funding) of AI versus sonographer initial assessment of left ventricular ejection fraction (LVEF) to evaluate the impact of AI in the interpretation workflow. The primary end point was the change in the LVEF between initial AI or sonographer assessment and final cardiologist assessment, evaluated by the proportion of studies with substantial change (more than 5% change). From 3,769 echocardiographic studies screened, 274 studies were excluded owing to poor image quality. The proportion of studies substantially changed was 16.8% in the AI group and 27.2% in the sonographer group (difference of −10.4%, 95% confidence interval: −13.2% to −7.7%, P < 0.001 for non-inferiority, P < 0.001 for superiority). The mean absolute difference between final cardiologist assessment and independent previous cardiologist assessment was 6.29% in the AI group and 7.23% in the sonographer group (difference of −0.96%, 95% confidence interval: −1.34% to −0.54%, P < 0.001 for superiority). The AI-guided workflow saved time for both sonographers and cardiologists, and cardiologists were not able to distinguish between the initial assessments by AI versus the sonographer (blinding index of 0.088). For patients undergoing echocardiographic quantification of cardiac function, initial assessment of LVEF by AI was non-inferior to assessment by sonographers.
Subject terms: Echocardiography, Randomized controlled trials
The impact of artificial intelligence in cardiac function assessment is evaluated by a blinded, randomized non-inferiority trial of artificial intelligence versus sonographer initial assessment of the left ventricular ejection fraction.
Main
Accurate quantification of cardiac function is necessary for disease diagnosis, risk stratification and assessment of treatment response4,5. LVEF, in particular, is routinely used to guide clinical decisions regarding patient appropriateness for a wide range of medical and device therapies as well as interventions including surgeries4,6,7. Despite the importance of LVEF assessment in daily clinical practice and clinical research protocols, conventional approaches to measuring LVEF are well recognized as being subject to heterogeneity and variance given that they rely on manual and subjective human tracings8,9.
Clinical practice guidelines recommend that, when assessing LVEF by cardiac imaging (most commonly echocardiography), measurements be performed repeatedly over multiple cardiac cycles to improve precision and account for arrhythmic or haemodynamic sources of variation10. Unfortunately, such repeated human measurements are rarely performed in practice given the logistical constraints of most clinical imaging laboratories, and a single tracing or a visual estimation of LVEF is often used as a pragmatic alternative11,12. Such an approach is suboptimal for detecting subtle changes in LVEF, which is needed for important therapeutic decisions (for example, eligibility for continued chemotherapy or defibrillator implantation)13.
Building on tremendous progress in the field of AI over the past decade14, numerous algorithms have been developed with the goal of automating assessment of LVEF in real-world patient care settings1,3,15. Although such AI algorithms have demonstrated improved precision on limited retrospective datasets, to date, no cardiovascular AI technology has been validated in a blinded, randomized clinical trial16–19. In addition, human–computer interaction and the effect of AI prompting on clinical interpretations remain underexplored in clinical studies. To address this need, we conducted a blinded, randomized non-inferiority clinical trial to prospectively assess the effect of initial assessment by AI versus conventional initial assessment by a sonographer on the final cardiologist interpretation of LVEF.
Cohort characteristics
We enrolled 3,769 transthoracic echocardiogram studies originally performed at an academic medical centre between 1 June 2019 and 8 August 2019; these studies were prospectively re-evaluated by 25 cardiac sonographers (mean of 14.1 years of practice) and 10 cardiologists (mean of 12.7 years of practice). In total, 3,495 studies from 3,035 patients could be annotated by sonographers using Simpson's method-of-discs calculation of LVEF, and 274 studies were excluded for being of insufficient image quality to contour the left ventricle (Fig. 1). The eligible studies were randomized 1:1 to AI versus sonographer for initial evaluation, with 1,740 studies assigned to the AI group and 1,755 studies assigned to the sonographer group. The baseline characteristics of the two groups were well balanced (Table 1). At a separate workstation and at a different time, cardiologists were presented with the full echocardiogram study, with initial annotations, for the final blinded assessment of LVEF.
Table 1. Baseline characteristics

| Characteristic | Total (n = 3,495) | AI (n = 1,740) | Sonographer (n = 1,755) |
|---|---|---|---|
| Age (years) | 66.3 ± 17.0 | 66.1 ± 16.8 | 66.6 ± 17.1 |
| Sex (n, %) | | | |
| Male | 1,983 (57%) | 982 (56%) | 1,001 (57%) |
| Female | 1,512 (43%) | 758 (44%) | 754 (43%) |
| Race (n, %) | | | |
| Non-Hispanic white | 2,041 (58%) | 1,032 (59%) | 1,009 (57%) |
| Black | 479 (14%) | 230 (13%) | 249 (14%) |
| Hispanic | 405 (12%) | 203 (12%) | 202 (12%) |
| Asian | 273 (8%) | 123 (7%) | 150 (9%) |
| Other | 237 (7%) | 120 (7%) | 117 (7%) |
| Unknown | 38 (1%) | 20 (1%) | 18 (1%) |
| Pacific Islander | 14 (0%) | 8 (0%) | 6 (0%) |
| American Indian | 8 (0%) | 4 (0%) | 4 (0%) |
| Body mass index^a | 26.5 ± 6.3 | 26.6 ± 6.3 | 26.5 ± 6.2 |
| Comorbidities (n, %) | | | |
| Hypertension | 2,019 (58%) | 990 (57%) | 1,029 (59%) |
| Diabetes | 884 (25%) | 441 (25%) | 443 (25%) |
| Coronary artery disease | 1,099 (31%) | 547 (31%) | 552 (31%) |
| Chronic kidney disease | 882 (25%) | 460 (26%) | 422 (24%) |
| Atrial fibrillation | 867 (25%) | 450 (26%) | 417 (24%) |
| Previous stroke | 459 (13%) | 225 (13%) | 234 (13%) |
| Previous clinical EF (%) | 58.1 ± 14.3 | 58.1 ± 14.2 | 58.0 ± 14.4 |
| Method of LVEF evaluation (n, %) | | | |
| Single plane (A4C) | 2,249 (64%) | 1,107 (64%) | 1,142 (65%) |
| Biplane | 1,246 (36%) | 633 (36%) | 613 (35%) |
| Study quality (n, %) | | | |
| Poor | 648 (19%) | 314 (18%) | 334 (19%) |
| Adequate | 1,725 (49%) | 875 (50%) | 850 (48%) |
| Good | 236 (7%) | 114 (7%) | 122 (7%) |
| Not specified | 886 (25%) | 437 (25%) | 449 (26%) |
| Location (n, %) | | | |
| Inpatient | 2,067 (59%) | 1,033 (59%) | 1,034 (59%) |
| Outpatient | 1,428 (41%) | 707 (41%) | 721 (41%) |

A4C, apical-4-chamber; EF, ejection fraction. ^a Body mass index missing in 52 studies.
Assessment of blinding
After completing each study, cardiologists were asked to predict which agent had provided the initial interpretation. Cardiologists correctly predicted the method of initial assessment for 1,130 (32.3%) studies, guessed incorrectly for 845 (24.2%) studies and were unsure whether the initial assessment was provided by AI or a sonographer for 1,520 (43.4%) studies. Bang's blinding index, a metric in which 0 indicates perfect blinding and −1 or 1 indicates complete unblinding, was used to assess the trial20. A blinding index between −0.2 and 0.2 is typically considered good blinding; the index was 0.075 for the sonographer group and 0.088 for the AI group, and the bootstrapped confidence intervals were consistently within −0.2 and 0.2 (P > 0.99).
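Bang's blinding index and its bootstrap confidence interval can be computed directly from such guess counts. The minimal sketch below uses the pooled guesses reported above for illustration; the per-arm counts underlying the reported indices of 0.075 and 0.088 are not broken out in the text.

```python
import numpy as np

def bang_blinding_index(correct: int, incorrect: int, unsure: int) -> float:
    """Bang's blinding index for one trial arm: the proportion of
    correct guesses minus the proportion of incorrect guesses, with
    'unsure' responses kept in the denominator. 0 indicates perfect
    blinding; -1 or 1 indicates complete unblinding."""
    n = correct + incorrect + unsure
    return (correct - incorrect) / n

def bootstrap_ci(correct, incorrect, unsure, n_boot=2000, seed=0):
    """Percentile-bootstrap 95% confidence interval for the index."""
    rng = np.random.default_rng(seed)
    # Encode each guess as +1 (correct), -1 (incorrect) or 0 (unsure);
    # the index is then simply the mean of the encoded responses.
    guesses = np.concatenate([np.ones(correct), -np.ones(incorrect),
                              np.zeros(unsure)])
    boot = [rng.choice(guesses, size=guesses.size).mean()
            for _ in range(n_boot)]
    return np.percentile(boot, [2.5, 97.5])

# Pooled guess counts reported in the text.
print(bang_blinding_index(1130, 845, 1520))  # ~0.082
print(bootstrap_ci(1130, 845, 1520))         # stays within (-0.2, 0.2)
```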
Primary outcome
The primary outcome of substantial change between the initial and final assessments occurred in 292 (16.8%) studies in the AI group compared with 478 (27.2%) studies in the sonographer group (difference of −10.4%, 95% confidence interval: −13.2% to −7.7%, P < 0.001 for non-inferiority, P < 0.001 for superiority) (Table 2). The mean absolute difference between the initial and final assessments of LVEF was 2.79% in the AI group compared with 3.77% in the sonographer group (difference −0.97%, 95% confidence interval: −1.33% to −0.54%, P < 0.001 for superiority; Fig. 2).
Table 2. Primary and secondary outcomes

| Outcome | AI (n = 1,740) | Sonographer (n = 1,755) | Mean difference (95% confidence interval) | P value |
|---|---|---|---|---|
| Primary efficacy outcome: initial versus final assessment | | | | |
| Substantial change | 292 (16.8%) | 478 (27.2%) | −10.5% (−13.2% to −7.7%) | <0.001^a |
| Mean absolute difference | 2.79 ± 5.53 | 3.77 ± 5.22 | −0.97 (−1.31 to −0.61) | <0.001 |
| Key secondary safety outcome: final versus previous cardiologist assessment | | | | |
| Substantial change | 871 (50.1%) | 957 (54.5%) | −4.5% (−7.8% to −1.2%) | 0.008 |
| Mean absolute difference | 6.29 ± 5.94 | 7.23 ± 6.18 | −0.94 (−1.34 to −0.54) | <0.001 |
| Other secondary outcomes | | | | |
| Sonographer time (s), median (IQR) | 0 (0–0) | 119 (77–173) | −131 (−134 to −127) | <0.001 |
| Cardiologist time (s), median (IQR) | 54 (31–95) | 64 (36–108) | −8 (−12 to −4) | <0.001 |
| Any change | 1,100 (63.2%) | 1,218 (69.4%) | −6.2% (−9.3% to −3.1%) | <0.001 |

^a For both non-inferiority and superiority tests; all other tests were for superiority. Fisher's exact test was used for categorical outcomes and a two-sided Student's t-test was used for quantitative outcomes.
Secondary safety outcome
The key secondary safety outcome of substantial difference between the final cardiologist-adjudicated LVEF and the previous clinically reported LVEF occurred in 871 (50.1%) studies in the AI group compared with 957 (54.5%) studies in the sonographer group (difference of −4.5%, 95% confidence interval: −7.8% to −1.2%, P = 0.008). The mean absolute difference between previous and final cardiologist assessments was 6.29% in the AI group compared with 7.23% in the sonographer group (difference of −0.94%, 95% confidence interval: −1.34% to −0.54%, P < 0.001 for superiority; Fig. 2).
Other outcomes and subgroup analyses
The reduction in the primary end point in the AI group was consistent across all major subgroups (Table 3). Between the initial and final assessments, 1,100 (63.2%) studies in the AI group and 1,218 (69.4%) studies in the sonographer group changed by any amount (difference of −6.2%, 95% confidence interval: −9.3% to −3.1%, P < 0.001 for superiority). Sonographers took a median of 119 s (interquartile range (IQR): 77–173 s) to assess and annotate LVEF. Cardiologists took a median of 54 s (IQR: 31–95 s) to adjudicate initial assessments in the AI group and a median of 64 s (IQR: 36–108 s) in the sonographer group. The mean difference in sonographer time between the AI and sonographer groups was −131 s (95% confidence interval: −134 to −127 s, P < 0.001), and the mean difference in cardiologist time was −8 s (95% confidence interval: −12 to −4 s, P < 0.001).
Table 3. Subgroup analyses of the mean absolute difference between initial and final assessments

| Subgroup | AI (n) | AI MAD | Sonographer (n) | Sonographer MAD | Difference (95% confidence interval) |
|---|---|---|---|---|---|
| Method of LVEF evaluation | | | | | |
| Single plane | 1,107 | 3.20 ± 6.41 | 1,142 | 4.38 ± 5.75 | −1.19 (−1.69 to −0.68) |
| Biplane | 633 | 2.09 ± 3.37 | 613 | 2.61 ± 3.77 | −0.52 (−0.92 to −0.11) |
| Race | | | | | |
| White | 1,032 | 2.58 ± 5.36 | 1,009 | 3.71 ± 5.18 | −1.13 (−1.57 to −0.66) |
| Black | 230 | 3.91 ± 7.59 | 249 | 4.00 ± 5.27 | −0.08 (−1.22 to 1.14) |
| Hispanic | 203 | 2.44 ± 4.26 | 202 | 3.39 ± 4.57 | −0.95 (−1.82 to −0.09) |
| Asian | 123 | 3.11 ± 5.44 | 150 | 4.29 ± 6.04 | −1.18 (−2.55 to −0.23) |
| Other | 152 | 2.77 ± 4.11 | 145 | 3.74 ± 5.26 | −0.97 (−2.07 to −0.08) |
| Sex | | | | | |
| Male | 982 | 2.75 ± 5.92 | 1,001 | 3.67 ± 5.18 | −0.92 (−1.40 to −0.42) |
| Female | 758 | 2.85 ± 4.97 | 754 | 3.89 ± 5.26 | −1.04 (−1.56 to −0.52) |
| Image quality | | | | | |
| Poor | 314 | 4.22 ± 7.12 | 334 | 4.27 ± 5.92 | −0.05 (−1.04 to 0.97) |
| Adequate | 875 | 2.45 ± 5.36 | 850 | 3.53 ± 5.01 | −1.08 (−1.56 to −0.58) |
| Good | 114 | 2.01 ± 3.15 | 122 | 3.51 ± 5.11 | −1.51 (−2.62 to −0.45) |
| Not specified | 437 | 2.66 ± 4.82 | 449 | 3.90 ± 5.02 | −1.24 (−1.89 to −0.58) |
| Location | | | | | |
| Inpatient | 1,033 | 3.09 ± 5.59 | 1,034 | 4.01 ± 5.49 | −0.92 (−1.40 to −0.45) |
| Outpatient | 707 | 2.36 ± 5.41 | 721 | 3.42 ± 4.78 | −1.05 (−1.57 to −0.51) |
| Cardiologist prediction of group | | | | | |
| AI | 557 | 3.64 ± 6.42 | 418 | 3.82 ± 5.09 | −0.18 (−0.91 to 0.54) |
| Sonographer | 427 | 3.38 ± 4.95 | 573 | 4.00 ± 4.62 | −0.62 (−1.21 to 0) |
| Uncertain | 756 | 1.85 ± 4.95 | 764 | 3.56 ± 5.68 | −1.72 (−2.26 to −1.17) |
| Correct prediction | 557 | 3.64 ± 6.42 | 573 | 4.00 ± 4.62 | −0.36 (−0.98 to 0.31) |
| Incorrect prediction | 427 | 3.38 ± 4.95 | 418 | 3.82 ± 5.09 | −0.44 (−1.12 to 0.22) |

MAD, mean absolute difference between initial and final LVEF assessments.
We additionally performed a post-hoc assessment of the frequency of changes from initial to final assessment that crossed a clinically meaningful threshold (that is, an LVEF of 35% for consideration of implantable defibrillator therapy). In the AI group, 22 of 1,740 (1.3%) studies crossed the 35% threshold between initial and final cardiologist assessments, compared with 54 of 1,755 (3.1%) studies in the sonographer group (P = 0.0004, Chi-squared test).
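The reported P value can be reproduced from these counts with a standard 2 × 2 Chi-squared test; a minimal sketch (note that SciPy applies Yates' continuity correction to 2 × 2 tables by default):

```python
from scipy.stats import chi2_contingency

# Studies crossing the 35% LVEF threshold between initial and final
# assessment: 22 of 1,740 (AI) versus 54 of 1,755 (sonographer).
table = [[22, 1740 - 22],
         [54, 1755 - 54]]
chi2, p, dof, expected = chi2_contingency(table)  # Yates-corrected 2x2
print(f"chi2 = {chi2:.1f}, P = {p:.4f}")  # P ~ 0.0004
```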
Discussion
In this trial of board-certified cardiologists adjudicating clinical transthoracic echocardiographic exams, AI-guided initial evaluation of LVEF was found to be non-inferior and even superior to sonographer-guided initial evaluation. After blinded review of AI versus sonographer-guided LVEF assessment, cardiologists were less likely to substantially change the LVEF assessment for their final report with initial AI assessment. Furthermore, the AI-guided assessment took less time for cardiologists to overread and was more consistent with cardiologist assessment from the previous clinical report. Although not the first trial of AI technology in clinical cardiology21–23, to our knowledge, this study represents the first blinded implementation of a randomized trial in this space.
In addition to prospectively evaluating the impact of AI in a clinical trial, our study represents the largest test–retest assessment of clinician variability in assessing LVEF to date. The degree of human variability between repeated LVEF assessments in our study is consistent with previous studies11,12,24, and the introduction of AI guidance decreased variance between independent clinician assessments. In this trial, we utilized experienced sonographers as an active comparator versus the AI for the initial assessment of LVEF; however, different levels of experience and types of training could change the relative impact of AI compared with clinician judgement. For both methods of initial assessment, the difference between initial and final assessments was smaller than the difference between final and previous cardiologist assessments, which highlights the anchoring effect of an initial assessment in practice and the importance of blinding for quantifying effect size in clinical trials of diagnostic imaging. In both the anchored outcome (comparison of preliminary assessment with final assessment) and the independent outcome (comparison of final assessment in the trial with historical cardiologist assessment), the AI arm showed less variation and more precision in the assessment of LVEF.
Notwithstanding tremendous interest in AI technologies, there have been few prospective trials evaluating their efficacy and effect on clinician assessments. Important clinical trials of AI technology have already shown the efficacy of AI in cardiology21,25; however, given the difficulty of blinding a diagnostic tool, previous trials have often been open-label, with AI compared against placebo or no diagnostic assistance. Previous work has shown that there can be a Hawthorne effect when studying novel technologies such as AI systems26,27. By introducing blinding with an active comparator arm, studies can better distinguish the effect of the AI technology itself from the impact of being observed or of introducing an intervention. Current FDA-approved technologies for LVEF assessment were not prospectively evaluated with randomization and blinding16. By integrating AI into the reporting software, our study sought to minimize bias in assessing the effect size of the AI intervention.
To enable effective blinding, we implemented a single-cardiac-cycle annotation workflow representative of many real-world high-volume echocardiography laboratories. Despite this framework, there was a small signal that cardiologists were more likely to guess the agent of initial assessment correctly than incorrectly. However, the blinding index remained within the range typically described as good blinding, and regardless of whether the cardiologist thought the initial agent was AI or a sonographer, or was uncertain, the results trended towards improved performance in the AI arm. Our findings of non-inferiority and even superiority of initial AI assessment are encouraging given that AI assessment reduces the time and effort of the tedious manual tracing typically required by routine clinical workflows. Given these promising results, further developments of AI could eventually facilitate additional workflows required for comprehensive cardiac assessments in routine clinical practice and in accordance with guideline recommendations28–31.
Several limitations of our trial should be mentioned. First, our study was single centre, reflecting the demographics and clinical practices of a particular population. Nevertheless, the AI model was trained on images from another centre and the clinical trial was performed as prospective external validation, suggesting generalizability of the AI techniques and workflow. Second, the study was not powered to assess long-term outcomes based on differences in LVEF assessment. Although the results were consistent across subgroups, further analyses are needed to evaluate the impact of video selection, frame selection and intra-provider variability. Third, this trial used previously acquired echocardiogram studies, and although they were prospectively evaluated by sonographers and cardiologists, there can be bias when a sonographer other than the scanning sonographer interprets the images. Last, consistent with findings from most AI studies, we found that model performance scales with the number of training examples. Thus, we anticipate that future studies could improve on the AI performance observed here by implementing models trained on an even greater number of examples derived from a broad and diverse cohort of patients. Of note, this clinical trial utilized an AI model entirely trained at an independent site, representing external validation of the model. Effective deployment of AI models in cardiology clinical practice will require additional regulatory oversight, adoption and appropriate use by clinicians, and functional integration with clinical systems, all of which need to be carefully considered and further studied.
In summary, we found that an AI-guided workflow for the initial assessment of cardiac function in echocardiography was non-inferior and even superior to the initial assessment by the sonographer. Cardiologists required less time, substantially changed the initial assessment less frequently and were more consistent with previous clinical assessments by the cardiologist when using an AI-guided workflow. This finding was consistent across subgroups of different demographic and imaging characteristics. In the context of an ongoing need for precision phenotyping, our trial results suggest that AI tools can improve efficacy as well as efficiency in assessing cardiac function. Next steps include studying the effect of AI guidance on cardiac function assessment across multiple centres.
Methods
Study design and oversight
Cardiologists with board certification in echocardiography were assigned to read independent transthoracic echocardiogram studies randomized to initial assessment by AI versus sonographer. Imaging studies were initially acquired and interpreted clinically by a board-certified cardiologist between 1 June 2019 and 8 August 2019 at Cedars-Sinai Medical Center. Studies were randomly sampled within the time range without regard to patient identity, so that multiple studies from the same patient would be randomized and assessed independently. Sonographers were asked to use their standard clinical practice to annotate the left ventricle for either single-plane or biplane method-of-discs calculation of LVEF10. Studies were excluded from randomization if the sonographer was unable to quantify the LVEF owing to inadequate image quality.
Eligible studies were randomly assigned, in a 1:1 ratio, to initial assessment by AI or sonographer and presented to the cardiologists in the standard clinical reporting workflow and software (Siemens Syngo Dynamics VA40D) for adjusting the LV annotation and calculating EF. Although the AI model could annotate every frame and cardiac cycle, to facilitate blinding, one representative cardiac cycle was annotated and presented to the cardiologist in the AI group. To preserve blinding, the same proportion of single-plane-annotated and biplane-annotated studies was generated for the AI group as for the sonographer group.
The trial was designed as a blinded, randomized non-inferiority trial with a prespecified margin of difference by academic study investigators without industry sponsorship or representation in trial design. Approval by the Cedars-Sinai Medical Center Institutional Review Board was obtained before the start of the study. All reading echocardiographers gave informed consent and were excluded from the data analysis. The last author prepared the first draft of the manuscript. The first and last authors had full access to the data. All the authors vouch for the accuracy and completeness of the data and analyses and for the fidelity of the study to the protocol. This study satisfies the requirements set forth by the CONSORT-AI and SPIRIT-AI reporting guidelines27,31.
Model design and clinical integration
The architecture and design of the AI model have been previously described1. The AI model was trained using 144,184 echocardiogram videos from Stanford Healthcare and was never exposed to videos from the Cedars-Sinai Medical Center. An end-to-end workflow was developed, including view classification, LV annotation and LVEF assessment. The AI model was fully embedded within the clinical reporting system at the start of the trial, without subsequent changes or adjustments. An in-depth technical description can be found in the Supplementary Appendix.
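As an illustration of one step in such a pipeline, the sketch below shows how a single representative cardiac cycle might be selected from a per-frame left ventricular area trace (a stand-in for per-frame segmentation output). This is a hypothetical simplification for illustration, not the trial's actual selection logic, which is described in the Supplementary Appendix.

```python
import numpy as np
from scipy.signal import find_peaks

def select_cycle(lv_area, min_cycle_frames=15):
    """Pick end-diastolic (ED) and end-systolic (ES) frames for one
    cardiac cycle from a per-frame LV area trace: ED at a local maximum
    of area, ES at the minimum before the next ED."""
    area = np.asarray(lv_area, dtype=float)
    ed_peaks, _ = find_peaks(area, distance=min_cycle_frames)
    if len(ed_peaks) < 2:
        raise ValueError("need at least two ED peaks to bound a cycle")
    # Take the first complete cycle; a deployed system could instead
    # pick the cycle with the highest segmentation confidence.
    start, nxt = ed_peaks[0], ed_peaks[1]
    es = start + int(np.argmin(area[start:nxt]))
    return start, es

# Synthetic trace: three beats at ~1 Hz sampled at 50 frames per second.
t = np.linspace(0, 3, 150)
trace = 30 + 10 * np.cos(2 * np.pi * t)  # cm^2, illustrative only
ed_frame, es_frame = select_cycle(trace)
print(ed_frame, es_frame)
```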
For both sonographers and cardiologists, the entire study (most often between 60 and 120 videos) was shown in the standard clinical reporting software. The study was shown without any annotations to the sonographer, who chose the apical-4-chamber and apical-2-chamber videos and traced the endocardium to assess LVEF. For the cardiologist, the study was shown with one set of annotations (provided by either the AI or the sonographer), and the cardiologist could adjust the endocardial tracing to change the reported LVEF (example video 1). The standard method-of-discs evaluation of the left ventricle, either biplane or single plane depending on sonographer input, was used to calculate LVEF.
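For concreteness, the following is a minimal sketch of the method-of-discs calculation: LV volume is approximated as a stack of 20 elliptical (biplane) or circular (single-plane) discs derived from the endocardial tracing, and LVEF follows from the end-diastolic and end-systolic volumes10. The disc diameters, lengths and example tracing below are illustrative, not taken from the trial.

```python
import numpy as np

def simpson_volume(diam_a4c, diam_a2c=None, length=8.0, n_discs=20):
    """LV volume (ml) by the method of discs.

    diam_a4c: per-disc diameters (cm) from the apical-4-chamber tracing.
    diam_a2c: orthogonal diameters from the apical-2-chamber view for
              the biplane method; if None, the single-plane formula
              (circular discs) is used.
    length:   LV long-axis length (cm); each disc has height
              length / n_discs.
    """
    a = np.asarray(diam_a4c, dtype=float)
    b = a if diam_a2c is None else np.asarray(diam_a2c, dtype=float)
    return float(np.pi / 4 * np.sum(a * b) * (length / n_discs))

def lvef(edv, esv):
    """Ejection fraction (%) from end-diastolic/end-systolic volumes."""
    return 100.0 * (edv - esv) / edv

# Illustrative single-plane tracing: 20 discs along an elliptic profile.
discs = 4.5 * np.sin(np.linspace(0.05, np.pi - 0.05, 20))  # cm
edv = simpson_volume(discs, length=8.5)
esv = simpson_volume(0.75 * discs, length=7.5)
print(f"LVEF = {lvef(edv, esv):.0f}%")  # ~50% for this synthetic shape
```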
Outcomes assessment
The primary outcome was the change in LVEF between the initial assessment by AI or sonographer and the final cardiologist assessment. The primary outcome was evaluated both as the proportion of studies with substantial change and the mean absolute difference between initial and final assessments. Substantial change was defined as greater than 5% change in LVEF between initial and final assessments. The analysis was performed as randomized and there was no crossover between the two groups.
The duration of time for contouring and adjustment by the sonographer and cardiologist was tracked and compared between the sonographer and AI arms. To assess blinding, cardiologists were asked to predict, for each study, whether the initial interpretation was provided by AI or a sonographer, or to indicate that they were unable to tell. A key secondary safety end point was the change in the final cardiologist-adjudicated LVEF compared with the previous cardiologist-reported LVEF. An additional secondary end point was the proportion of studies with no change in LVEF between the initial and final interpretations.
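To make these definitions concrete, here is a minimal sketch of the per-study end points and the between-group tests listed in Table 2 (Fisher's exact test for the categorical outcome, a two-sided t-test for the quantitative outcome). The LVEF arrays are simulated for illustration only and do not reproduce the trial data.

```python
import numpy as np
from scipy import stats

def substantial_change(initial, final, threshold=5.0):
    """Flag studies whose LVEF changed by more than `threshold`
    percentage points between initial and final assessment."""
    return np.abs(np.asarray(final) - np.asarray(initial)) > threshold

# Simulated LVEF values (%) for illustration only.
rng = np.random.default_rng(42)
init_ai = rng.normal(58, 14, 1740)
final_ai = init_ai + rng.normal(0, 3, 1740)
init_sono = rng.normal(58, 14, 1755)
final_sono = init_sono + rng.normal(0, 4, 1755)

ai_sub = substantial_change(init_ai, final_ai)
sono_sub = substantial_change(init_sono, final_sono)

# Categorical outcome: Fisher's exact test on the 2x2 table.
table = [[int(ai_sub.sum()), int((~ai_sub).sum())],
         [int(sono_sub.sum()), int((~sono_sub).sum())]]
odds_ratio, p_fisher = stats.fisher_exact(table)

# Quantitative outcome: two-sided t-test on absolute differences.
t_stat, p_ttest = stats.ttest_ind(np.abs(final_ai - init_ai),
                                  np.abs(final_sono - init_sono))
print(p_fisher, p_ttest)
```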
Statistical analysis
The trial was designed to test for non-inferiority, with a secondary objective of testing for superiority with respect to the primary end point. Non-inferiority would be shown if the upper limit of the 95% confidence interval for the between-group difference in the primary end point was less than 5% (within the natural test–retest variability of blinded human assessment of LVEF)8,24. With α = 0.05, a power of 0.9 and estimated occurrence rates of substantial change of 5% in the AI-guided workflow versus 8% in the sonographer-guided workflow, we estimated that 2,834 studies were needed for statistical power, and we prespecified enrolment of 3,500 studies as a buffer for dropout and unforeseen challenges. Non-block randomization was performed once at the beginning of the trial, and the analyses included all imaging studies that underwent randomization (intention-to-treat population). All reported P values for non-inferiority are one-sided, and all reported P values for superiority are two-sided. Bang's blinding index was used to evaluate the quality of blinding during the trial20. Statistical analyses were performed with R 4.1.0 (R Foundation for Statistical Computing).
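The stated sample size can be approximately reproduced with a standard two-proportion calculation. The sketch below is one plausible reconstruction under the assumption of a pooled-variance normal approximation with a two-sided α, which yields roughly 1,417 studies per group; it is not the investigators' actual calculation.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_sample_size(p1, p2, alpha=0.05, power=0.9):
    """Per-group n to detect p1 versus p2 with a two-sided level-alpha
    test, using the pooled-variance normal approximation."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p1 - p2) ** 2

n_per_group = two_proportion_sample_size(0.05, 0.08)
print(round(2 * n_per_group))  # ~2,834 studies in total
```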
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41586-023-05947-3.
Acknowledgements
No external funding was obtained for this study. We thank the Cedars-Sinai Medical Center Enterprise Information Systems Enterprise Imaging team for their support with clinical integration and deployment.
Author contributions
B.H., S.C., J.Y.Z. and D.O. designed the clinical trial, study protocol and implementation of the AI model. B.H., G.D. and M.J. engineered technical design and clinical deployment. A.C.K., J.H.C., N.Y., C.P., T.S., J.E., N.A.B., J.W., K.J. and R.S. performed the blinded review of AI versus sonographer assessments. B.H., S.C., J.Y.Z. and D.O. wrote the manuscript with feedback from all authors.
Peer review information
Nature thanks Jianying Hu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Data availability
The AI model was trained using echocardiogram videos from Stanford Healthcare following Stanford IRB protocol 43721 with waiver of individual consent. A set of de-identified Stanford Healthcare echocardiogram videos is publicly available at EchoNet-Dynamic (https://echonet.github.io/dynamic/). The clinical trial was performed at Cedars-Sinai Medical Center under IRB protocol STUDY1707, and the study protocol, statistical analysis plan and de-identified trial results will be available at https://github.com/echonet/blinded_rct.
Code availability
The code for the AI model is available at GitHub (https://github.com/echonet/dynamic).
Competing interests
Stanford University is in the process of applying for a patent covering video-based deep learning models for assessing cardiac function, which lists B.H., J.Y.Z. and D.O. as inventors1. All other authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Susan Cheng, James Y. Zou, David Ouyang
Contributor Information
Susan Cheng, Email: susan.cheng@cshs.org.
James Y. Zou, Email: jamesz@stanford.edu
David Ouyang, Email: david.ouyang@cshs.org.
Supplementary information
The online version contains supplementary material available at 10.1038/s41586-023-05947-3.
References
1. Ouyang D, et al. Video-based AI for beat-to-beat assessment of cardiac function. Nature. 2020;580:252–256. doi: 10.1038/s41586-020-2145-8.
2. Duffy G, et al. High-throughput precision phenotyping of left ventricular hypertrophy with cardiovascular deep learning. JAMA Cardiol. 2022;7:386–395. doi: 10.1001/jamacardio.2021.6059.
3. Zhang J, et al. Fully automated echocardiogram interpretation in clinical practice. Circulation. 2018;138:1623–1635. doi: 10.1161/CIRCULATIONAHA.118.034338.
4. Heidenreich PA, et al. 2022 AHA/ACC/HFSA Guideline for the Management of Heart Failure: a report of the American College of Cardiology/American Heart Association Joint Committee on Clinical Practice Guidelines. Circulation. 2022;145:e895–e1032. doi: 10.1161/CIR.0000000000001063.
5. Dunlay SM, Roger VL, Redfield MM. Epidemiology of heart failure with preserved ejection fraction. Nat. Rev. Cardiol. 2017;14:591–602. doi: 10.1038/nrcardio.2017.65.
6. Al-Khatib SM, et al. 2017 AHA/ACC/HRS Guideline for Management of Patients with Ventricular Arrhythmias and the Prevention of Sudden Cardiac Death: a report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines and the Heart Rhythm Society. Circulation. 2018;138:e272–e391. doi: 10.1161/CIR.0000000000000549.
7. Wilcox JE, Fang JC, Margulies KB, Mann DL. Heart failure with recovered left ventricular ejection fraction: JACC Scientific Expert Panel. J. Am. Coll. Cardiol. 2020;76:719–734. doi: 10.1016/j.jacc.2020.05.075.
8. Yuan N, et al. Systematic quantification of sources of variation in ejection fraction calculation using deep learning. JACC Cardiovasc. Imaging. 2021;14:2260–2262. doi: 10.1016/j.jcmg.2021.06.018.
9. Pellikka PA, et al. Variability in ejection fraction measured by echocardiography, gated single-photon emission computed tomography, and cardiac magnetic resonance in patients with coronary artery disease and left ventricular dysfunction. JAMA Netw. Open. 2018;1:e181456. doi: 10.1001/jamanetworkopen.2018.1456.
10. Lang RM, et al. Recommendations for cardiac chamber quantification by echocardiography in adults: an update from the American Society of Echocardiography and the European Association of Cardiovascular Imaging. Eur. Heart J. Cardiovasc. Imaging. 2015;16:233–270. doi: 10.1093/ehjci/jev014.
11. Cole GD, et al. Defining the real-world reproducibility of visual grading of left ventricular function and visual estimation of left ventricular ejection fraction: impact of image quality, experience and accreditation. Int. J. Cardiovasc. Imaging. 2015;31:1303–1314. doi: 10.1007/s10554-015-0659-1.
12. Shahgaldi K, Gudmundsson P, Manouras A, Brodin L-A, Winter R. Visually estimated ejection fraction by two dimensional and triplane echocardiography is closely correlated with quantitative ejection fraction by real-time three dimensional echocardiography. Cardiovasc. Ultrasound. 2009;7:41. doi: 10.1186/1476-7120-7-41.
13. Stone JR, Kanneganti R, Abbasi M, Akhtari M. Monitoring for chemotherapy-related cardiotoxicity in the form of left ventricular systolic dysfunction: a review of current recommendations. JCO Oncol. Pract. 2021;17:228–236. doi: 10.1200/OP.20.00924.
14. Chen JH, Asch SM. Machine learning and prediction in medicine—beyond the peak of inflated expectations. N. Engl. J. Med. 2017;376:2507–2509. doi: 10.1056/NEJMp1702071.
15. Huang M-S, Wang C-S, Chiang J-H, Liu P-Y, Tsai W-C. Automated recognition of regional wall motion abnormalities through deep neural network interpretation of transthoracic echocardiography. Circulation. 2020;142:1510–1520. doi: 10.1161/CIRCULATIONAHA.120.047530.
16. Wu E, et al. How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals. Nat. Med. 2021;27:582–584. doi: 10.1038/s41591-021-01312-x.
17. Schiano C, et al. Machine learning and bioinformatics framework integration to potential familial DCM-related markers discovery. Genes. 2021;12:1946. doi: 10.3390/genes12121946.
18. Infante T, et al. Radiogenomics and artificial intelligence approaches applied to cardiac computed tomography angiography and cardiac magnetic resonance for precision medicine in coronary heart disease: a systematic review. Circ. Cardiovasc. Imaging. 2021;14:1133–1146. doi: 10.1161/CIRCIMAGING.121.013025.
19. Bazoukis G, et al. The inclusion of augmented intelligence in medicine: a framework for successful implementation. Cell Rep. Med. 2022;3:100485. doi: 10.1016/j.xcrm.2021.100485.
20. Bang H, Ni L, Davis CE. Assessment of blinding in clinical trials. Control. Clin. Trials. 2004;25:143–156. doi: 10.1016/j.cct.2003.10.016.
21. Yao X, et al. Artificial intelligence-enabled electrocardiograms for identification of patients with low ejection fraction: a pragmatic, randomized clinical trial. Nat. Med. 2021;27:815–819. doi: 10.1038/s41591-021-01335-4.
22. Attia ZI, et al. An artificial intelligence-enabled ECG algorithm for the identification of patients with atrial fibrillation during sinus rhythm: a retrospective analysis of outcome prediction. Lancet. 2019;394:861–867. doi: 10.1016/S0140-6736(19)31721-0.
23. Noseworthy PA, et al. Artificial intelligence-guided screening for atrial fibrillation using electrocardiogram during sinus rhythm: a prospective non-randomised interventional trial. Lancet. 2022;400:1206–1212. doi: 10.1016/S0140-6736(22)01637-3.
24. Farsalinos KE, et al. Head-to-head comparison of global longitudinal strain measurements among nine different vendors: the EACVI/ASE inter-vendor comparison study. J. Am. Soc. Echocardiogr. 2015;28:1171–1181.e2. doi: 10.1016/j.echo.2015.06.011.
25. Persell SD, et al. Effect of home blood pressure monitoring via a smartphone hypertension coaching application or tracking application on adults with uncontrolled hypertension: a randomized clinical trial. JAMA Netw. Open. 2020;3:e200255. doi: 10.1001/jamanetworkopen.2020.0255.
26. Vodrahalli K, Daneshjou R, Gerstenberg T, Zou J. Do humans trust advice more if it comes from AI? In Proc. 2022 AAAI/ACM Conference on AI, Ethics, and Society 763–777 (Association for Computing Machinery, 2022).
27. Liu X, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat. Med. 2020;26:1364–1374. doi: 10.1038/s41591-020-1034-x.
28. Cheitlin MD, et al. ACC/AHA Guidelines for the Clinical Application of Echocardiography: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines (Committee on Clinical Application of Echocardiography), developed in collaboration with the American Society of Echocardiography. Circulation. 1997;95:1686–1744. doi: 10.1161/01.CIR.95.6.1686.
29. Douglas PS, et al. Echocardiographic imaging in clinical trials: American Society of Echocardiography Standards for echocardiography core laboratories: endorsed by the American College of Cardiology Foundation. J. Am. Soc. Echocardiogr. 2009;22:755–765. doi: 10.1016/j.echo.2009.05.020.
30. Douglas PS, et al. 2019 ACC/AHA/ASE Key Data Elements and Definitions for Transthoracic Echocardiography: a report of the American College of Cardiology/American Heart Association Task Force on Clinical Data Standards (Writing Committee to Develop Clinical Data Standards for Transthoracic Echocardiography) and the American Society of Echocardiography. Circ. Cardiovasc. Imaging. 2019;12:e000027. doi: 10.1161/HCI.0000000000000027.
31. CONSORT-AI and SPIRIT-AI Steering Group. Reporting guidelines for clinical trials evaluating artificial intelligence interventions are needed. Nat. Med. 2019;25:1467–1468. doi: 10.1038/s41591-019-0603-3.