JAMA Cardiol. 2021 Aug 4;6(11):1285–1295. doi: 10.1001/jamacardio.2021.2746

Performance of a Convolutional Neural Network and Explainability Technique for 12-Lead Electrocardiogram Interpretation

J Weston Hughes 1, Jeffrey E Olgin 2,3, Robert Avram 2,3, Sean A Abreau 2,3, Taylor Sittler 4, Kaahan Radia 1, Henry Hsia 2, Tomos Walters 2, Byron Lee 2, Joseph E Gonzalez 1, Geoffrey H Tison 1,2,3,5
PMCID: PMC8340011  PMID: 34347007

Key Points

Question

Can readily available electrocardiogram (ECG) data be used to train a high-performing convolutional neural network (CNN) across a large range of 12-lead ECG diagnoses when compared with clinical standards of care?

Findings

In this cross-sectional study of 992 748 ECGs from 365 009 adult patients, a CNN was trained to predict 38 diagnostic classes with strong overall performance. Compared with a consensus committee of cardiac electrophysiologists, the CNN performed comparably to or exceeded cardiologist clinical diagnoses and the MUSE (GE Healthcare) system’s automated ECG diagnosis for most classes.

Meaning

In this cross-sectional study, a CNN trained on readily available ECG data achieved comparable performance to cardiologists and exceeded the performance of MUSE automated analysis for most diagnoses.


This cross-sectional study assesses the ability of a convolutional neural network trained on available electrocardiogram data to interpret electrocardiogram results compared with clinical standards of care.

Abstract

Importance

Millions of clinicians rely daily on automated preliminary electrocardiogram (ECG) interpretation. Critical comparisons of machine learning–based automated analysis against clinically accepted standards of care are lacking.

Objective

To use readily available 12-lead ECG data to train and apply an explainability technique to a convolutional neural network (CNN) that achieves high performance against clinical standards of care.

Design, Setting, and Participants

This cross-sectional study was conducted using data from January 1, 2003, to December 31, 2018. Data were obtained in a commonly available 12-lead ECG format from a single-center tertiary care institution. All patients aged 18 years or older who received ECGs at the University of California, San Francisco, were included, yielding a total of 365 009 patients. Data were analyzed from January 1, 2019, to March 2, 2021.

Exposures

A CNN was trained to predict the presence of 38 diagnostic classes in 5 categories from 12-lead ECG data. A CNN explainability technique called LIME (Local Interpretable Model-Agnostic Explanations) was used to visualize ECG segments contributing to CNN diagnoses.

Main Outcomes and Measures

Area under the receiver operating characteristic curve (AUC), sensitivity, and specificity were calculated for the CNN in the holdout test data set against cardiologist clinical diagnoses. For a second validation, 3 electrophysiologists provided consensus committee diagnoses against which the CNN, cardiologist clinical diagnosis, and MUSE (GE Healthcare) automated analysis performance was compared using the F1 score; AUC, sensitivity, and specificity were also calculated for the CNN against the consensus committee.

Results

A total of 992 748 ECGs from 365 009 adult patients (mean [SD] age, 56.2 [17.6] years; 183 600 women [50.3%]; and 175 277 White patients [48.0%]) were included in the analysis. In 91 440 test data set ECGs, the CNN demonstrated an AUC of at least 0.960 for 32 of 38 classes (84.2%). Against the consensus committee diagnoses, the CNN had higher frequency-weighted mean F1 scores than both cardiologists and MUSE in all 5 categories (CNN frequency-weighted F1 score for rhythm, 0.812; conduction, 0.729; chamber diagnosis, 0.598; infarct, 0.674; and other diagnosis, 0.875). For 32 of 38 classes (84.2%), the CNN had AUCs of at least 0.910 and demonstrated comparable F1 scores and higher sensitivity than cardiologists, except for atrial fibrillation (CNN F1 score, 0.847 vs cardiologist F1 score, 0.881), junctional rhythm (0.526 vs 0.727), premature ventricular complex (0.786 vs 0.800), and Wolff-Parkinson-White (0.800 vs 0.842). Compared with MUSE, the CNN had higher F1 scores for all classes except supraventricular tachycardia (CNN F1 score, 0.696 vs MUSE F1 score, 0.714). The LIME technique highlighted physiologically relevant ECG segments.

Conclusions and Relevance

The results of this cross-sectional study suggest that readily available ECG data can be used to train a CNN algorithm to achieve comparable performance to clinical cardiologists and exceed the performance of MUSE automated analysis for most diagnoses, with some exceptions. The LIME explainability technique applied to CNNs highlights physiologically relevant ECG segments that contribute to the CNN’s diagnoses.

Introduction

With hundreds of millions of examinations performed annually worldwide,1 the electrocardiogram (ECG) is the most common cardiovascular diagnostic procedure. Every day, millions of clinicians rely on algorithms for preliminary interpretation in nearly every clinical ECG workflow.1,2 Most existing commercial ECG analysis systems use traditional rule-based algorithms to apply disease-specific criteria to ECGs—the same criteria that have been codified in guidelines and used by human readers for decades.3 Recent work has shown that machine learning can be effectively applied to ECGs to identify both well-understood4,5 and novel ECG-based diagnoses.6,7,8 Although machine learning may provide a powerful complement to existing automated ECG algorithms, a critical evaluation against current clinical standards of care, especially existing commercial algorithms, is lacking.

Machine learning algorithms, like convolutional neural networks (CNNs), learn from a body of labeled data, providing a valuable data-driven complement to human-defined rules2 that consider only a fraction of the total available ECG data.9 This data-driven algorithmic paradigm enables algorithms to be trained for ECG diagnoses for which ECG rules do not currently exist,6 or for more precise diagnoses than currently possible,8,10 while also enabling continual algorithmic improvement as more data become available11—a powerful advancement that is central to achieving the vision of a learning health system.12 However, machine learning also has trade-offs: such models are often seen as “black boxes” that can be difficult to understand,13 they are “data hungry,”14 and existing ECG diagnostic criteria are not easily incorporated into them.

If we are to harness the potential advantages of applying machine learning to everyday ECG analysis, we first need to critically examine how machine learning algorithms compare against current clinically accepted standards, including commercial algorithms and cardiologist or expert interpretation. Building on a previously established approach,4 we specifically developed a CNN to accept and train on ECG data that are readily available in most institutions: namely, XML-format ECG waveform data and cardiologist-confirmed text diagnosis labels. By training our CNN using commonly available ECG data, we aspired to demonstrate what can be achieved in many institutions and, more importantly, what could be eventually achieved by combining cross-institutional data. We hypothesized that a CNN trained on this readily available 12-lead ECG data could achieve high performance against clinical standards of care. We compared CNN performance against both cardiologist diagnoses and the commercially available ECG system MUSE (GE Healthcare) in 2 validation data sets: (1) a large holdout test data set against cardiologist-confirmed clinical diagnoses and (2) a consensus committee data set of criterion standard diagnoses provided by a committee of board-certified electrophysiologists.

Methods

ECG Data and Study Design Overview

After institutional review board approval and informed consent waiver (owing to no more than minimal risk) from the University of California, San Francisco (UCSF) for this cross-sectional study, we obtained standard 12-lead ECG data from UCSF between January 1, 2003, and December 31, 2018. Clinical ECG data underwent initial analysis by the MUSE software (MAC 5500 HD, versions 5 to 10; Marquette 12SL; GE Healthcare). This initial MUSE interpretation was then presented to UCSF cardiologists who provided a clinical diagnosis by either changing or confirming the MUSE diagnosis. Both unedited MUSE and cardiologist-confirmed clinical diagnosis text were extracted.
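For context on what such an ECG extraction step can look like in practice, the minimal sketch below loads lead waveforms and diagnosis text from a single 12-lead ECG XML export. The element names (LeadData, LeadID, WaveformData, DiagnosisStatement), the base64/16-bit sample encoding, and the overall layout are hypothetical placeholders, not the actual GE MUSE export schema, which varies by software version and is not specified in the article.

```python
import base64
import xml.etree.ElementTree as ET

import numpy as np


def load_ecg_xml(path):
    """Extract lead waveforms and diagnosis text from one 12-lead ECG XML export.

    All element names and the waveform encoding assumed here are illustrative
    placeholders; a real MUSE export uses a vendor-specific schema.
    """
    root = ET.parse(path).getroot()

    leads = {}
    for lead in root.iter("LeadData"):                      # hypothetical element name
        name = lead.findtext("LeadID")                      # e.g., "I", "II", "V1"
        raw = base64.b64decode(lead.findtext("WaveformData"))
        leads[name] = np.frombuffer(raw, dtype=np.int16)    # assumed 16-bit samples

    # Free-text statements (unedited MUSE or cardiologist-confirmed diagnoses).
    diagnosis = " ".join(d.text for d in root.iter("DiagnosisStatement") if d.text)
    return leads, diagnosis
```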

We then developed and trained a CNN for 38 common ECG diagnoses (Table 1). We first validated the CNN against the cardiologist clinical diagnosis in the test data set (Figure 1). In addition, we performed a second validation in a consensus committee data set for which a committee of 3 electrophysiologists (H.H., T.W., and B.L.) provided ECG diagnoses by consensus. Against these criterion standard diagnoses, we then compared the performance of (1) the CNN algorithm, (2) the MUSE algorithm, and (3) the cardiologist clinical diagnosis. This study followed the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) reporting guideline.

Table 1. Performance of the CNN on 38 Diagnostic Classes Compared With Cardiologist Clinical Diagnoses in the Holdout Test Data Set (N = 32 576 Patients; 91 440 ECGs).

Diagnostic class Frequency CNN AUC (95% CI) Specificitya Sensitivityb
Rhythm
Sinus 57 186 0.995 (0.994-0.995) 0.993 0.990
Atrial fibrillation 6572 0.997 (0.997-0.997) 0.994 0.999
Atrial flutter 1406 0.991 (0.990-0.993) 0.980 0.986
Ectopic atrial rhythmc 514 0.949 (0.938-0.961) 0.900 0.899
Atrial tachycardiac 194 0.967 (0.958-0.975) 0.899 0.897
Ventricular tachycardiac 33 0.995 (0.989-0.999) 0.986 1.000
Junctional rhythmc 489 0.979 (0.973-0.984) 0.960 0.953
Supraventricular tachycardiac 308 0.996 (0.994-0.997) 0.994 0.990
Bigeminyc 248 0.989 (0.980-0.995) 0.993 0.976
Premature ventricular complex 3930 0.964 (0.960-0.967) 0.947 0.920
Premature atrial complex 4633 0.977 (0.974-0.979) 0.966 0.968
Ventricular paced 2443 0.997 (0.996-0.998) 0.995 0.998
Atrial pacedd 557 0.993 (0.988-0.997) 0.997 0.984
Rhythm diagnosis averagee NA 0.992 0.988 0.985
Conduction
AV block
1st Degree 5608 0.989 (0.988-0.990) 0.978 0.982
2nd Degree Mobitz 1c 182 0.988 (0.984-0.992) 0.969 0.962
Branch block
Left bundle 2026 0.994 (0.993-0.995) 0.983 0.990
Right bundle 7605 0.994 (0.993-0.994) 0.985 0.991
Left fascicular block
Anterior 3520 0.988 (0.986-0.989) 0.970 0.993
Posteriorc 372 0.987 (0.982-0.991) 0.974 0.995
Bifascicular blockd 1197 0.997 (0.996-0.997) 0.991 0.999
Nonspecific intraventricular conduction delay 1695 0.937 (0.931-0.943) 0.802 0.852
Axis deviation
Right 2768 0.988 (0.987-0.989) 0.976 0.997
Left 5951 0.981 (0.980-0.982) 0.955 0.987
Right superior axisc 308 0.998 (0.997-0.998) 0.995 1.000
Prolonged QT 6700 0.959 (0.958-0.961) 0.894 0.888
Wolff-Parkinson-Whitec 140 0.966 (0.942-0.983) 0.977 0.914
Conduction diagnosis averagee NA 0.981 0.953 0.965
Chamber enlargement
Ventricular hypertrophy
Left 9272 0.984 (0.983-0.985) 0.973 0.845
Right 1145 0.984 (0.981-0.987) 0.965 0.645
Atrial enlargement
Left 5003 0.976 (0.974-0.979) 0.966 0.785
Rightd 665 0.983 (0.978-0.988) 0.980 0.606
Chamber diagnosis averagee NA 0.981 0.971 0.802
Infarct
Anterior infarct 5187 0.969 (0.967-0.971) 0.926 0.972
Septal infarct 4107 0.983 (0.980-0.985) 0.975 0.981
Lateral infarctd 1733 0.985 (0.983-0.986) 0.955 0.973
Inferior infarct 6315 0.975 (0.973-0.976) 0.946 0.963
Posterior infarctc 303 0.937 (0.922-0.951) 0.845 0.838
ST elevationd 843 0.870 (0.857-0.881) 0.604 0.628
Infarct diagnosis averagee NA 0.971 0.930 0.953
Other
Lead misplacementc 516 0.841 (0.816-0.863) 0.366 0.692
Low voltage 3891 0.978 (0.977-0.980) 0.951 0.979
Other diagnosis averagee NA 0.962 0.882 0.945

Abbreviations: AUC, area under the receiver operating characteristic curve; AV, atrioventricular; CNN, convolutional neural network; ECG, electrocardiogram; NA, not applicable.

a

Specificity is reported at sensitivity fixed at 0.9 for each class.

b

Sensitivity is reported at specificity fixed at 0.9 for each class.

c

N < 4000 in the sampled training data set for this class.

d

N < 8000 in the sampled training data set for this class.

e

Frequency-weighted mean.

Figure 1. Diagram of Study Electrocardiogram (ECG) Data Sets.


a. The sampled training data set was randomly sampled from the training data set (eMethods in the Supplement) to address class imbalance. Consensus committee data set individuals were not in other data sets. Blue boxes indicate data sets used for training; yellow boxes indicate data sets used for validation. UCSF indicates University of California, San Francisco.

ECG Diagnosis Ascertainment

The text from each ECG’s cardiologist clinical diagnosis was explored, and we selected 38 of the most common and clinically relevant ECG diagnoses to include in this analysis, spanning arrhythmic, morphologic, conduction, and structural ECG-based diagnoses. Many important diagnoses were not present at high enough frequency in this data set to be included. Text-parsing methods were applied to the diagnosis text (eMethods in the Supplement) to derive binary diagnosis labels denoting the presence of each of the 38 diagnoses anywhere within the ECG. Any number of diagnostic classes may be present in each ECG.
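The study’s actual text-parsing rules are given in the eMethods; purely as an illustration of the general approach, a keyword-based parser that maps free-text diagnosis statements to binary labels might look like the following sketch (the keyword lists are hypothetical and cover only 3 of the 38 classes).

```python
import re

# Hypothetical keyword patterns for 3 of the 38 classes; the study's actual
# parsing rules (including any negation handling) are described in the eMethods.
DIAGNOSIS_PATTERNS = {
    "atrial_fibrillation": [r"atrial fibrillation", r"\bafib\b"],
    "left_bundle_branch_block": [r"left bundle[- ]branch block", r"\blbbb\b"],
    "prolonged_qt": [r"prolonged qt"],
}


def parse_diagnosis_text(text: str) -> dict:
    """Map one ECG's free-text diagnosis statements to binary class labels."""
    text = text.lower()
    return {
        cls: int(any(re.search(p, text) for p in patterns))
        for cls, patterns in DIAGNOSIS_PATTERNS.items()
    }


# Example: an ECG can carry any number of the diagnostic classes at once.
parse_diagnosis_text("Atrial fibrillation with rapid ventricular response. Prolonged QT.")
# {'atrial_fibrillation': 1, 'left_bundle_branch_block': 0, 'prolonged_qt': 1}
```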

Data Sets

The ECG data from January 1, 2003, to December 31, 2017, were split into training, development, and test data sets in a ratio of approximately 8:1:1 (Figure 1). To address the substantial class imbalance between the 38 diagnostic classes, we down-sampled common classes in the training data set to form a sampled training data set (eMethods in the Supplement). We created an additional consensus committee data set (n = 328) by randomly sampling from all available 2018 ECGs until each of the 38 classes (according to the cardiologist clinical diagnosis) had at least 11 examples.
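The exact down-sampling procedure is described in the eMethods; the sketch below shows one simple way such a sampled training set could be formed, keeping every ECG that carries at least one rare class and randomly thinning ECGs whose positive labels are all common. The rarity cutoff and keep rate are illustrative placeholders, not the study’s values.

```python
import random


def downsample_common_classes(ecg_labels, class_counts, rare_cutoff=20_000,
                              common_keep_rate=0.25, seed=0):
    """Illustrative multi-label down-sampling for a 'sampled training data set'.

    ecg_labels:   iterable of (ecg_id, set of positive diagnostic classes)
    class_counts: dict mapping class -> number of positive ECGs in the full training set
    The cutoff and keep rate are hypothetical, not the study's values.
    """
    rng = random.Random(seed)
    sampled = []
    for ecg_id, positives in ecg_labels:
        if any(class_counts[c] < rare_cutoff for c in positives):
            sampled.append(ecg_id)                 # carries a rare class: always keep
        elif rng.random() < common_keep_rate:      # only common (or no) classes: thin
            sampled.append(ecg_id)
    return sampled
```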

The CNN was trained on cardiologist clinical diagnosis labels from the sampled training data set. The development data set was used to refine CNN hyperparameters, and the test data set was used to validate the CNN against the cardiologist clinical diagnoses in the holdout test data set per standard validation practice (Table 1). We additionally performed a second rigorous validation by assembling a consensus committee of 3 board-certified, practicing cardiac electrophysiologists (H.H., T.W., and B.L.) to provide criterion standard diagnoses by consensus. Committee diagnoses were used as the criterion standard against which the CNN algorithm, MUSE algorithm, and cardiologist clinical diagnoses were compared in the consensus committee data set. All data sets had completely distinct, nonoverlapping sets of patients.

Statistical Analysis

We trained a single CNN to predict the presence of each of the 38 diagnostic classes. The model (eFigure 1 in the Supplement) had a 1-dimensional ResNet architecture15 similar to what has been previously described.4 Additional details on CNN structure and evaluation are described in the eMethods in the Supplement. We evaluated our CNN on both the test and consensus committee data sets (Figure 1). The CNN performance was assessed using the area under the receiver operating characteristic curve (AUC).16 The CIs for the AUCs were computed using the bootstrap method. Because the AUC cannot be calculated for MUSE or cardiologist clinical diagnoses, we also calculated F1 scores17 in the consensus committee data set; their primary function herein was to compare, within a given diagnosis, the relative performance of the CNN, MUSE, and cardiologist clinical diagnoses. We reported averaged performance metrics by diagnostic category, weighted by the frequencies of the individual classes in each category. Except where otherwise specified, thresholds for each class were chosen to maximize the F1 score in that data set. To illustrate the association between diagnoses predicted by the CNN, we plotted the co-occurrence of classes for both cardiologist clinical and CNN-predicted diagnoses in the test data set. Data were analyzed from January 1, 2019, to March 2, 2021.
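As a rough sketch of what a 1-dimensional residual network with 38 independent (multi-label) outputs looks like, the PyTorch code below builds a small model of this general kind and runs one training step against binary labels. The layer sizes, kernel widths, block count, and optimizer settings are placeholders; the study’s actual architecture and training details are in eFigure 1 and the eMethods in the Supplement.

```python
import torch
import torch.nn as nn


class ResBlock1d(nn.Module):
    """One 1-D residual block (identity shortcut); widths here are placeholders."""

    def __init__(self, channels, kernel_size=15):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.bn2 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)


class EcgCnn(nn.Module):
    """12-lead ECG waveform in, one logit per diagnostic class out (multi-label)."""

    def __init__(self, n_leads=12, n_classes=38, channels=64, n_blocks=4):
        super().__init__()
        self.stem = nn.Conv1d(n_leads, channels, kernel_size=15, padding=7)
        self.blocks = nn.Sequential(*[ResBlock1d(channels) for _ in range(n_blocks)])
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.head = nn.Linear(channels, n_classes)

    def forward(self, x):                            # x: (batch, 12, samples)
        h = self.blocks(self.stem(x))
        return self.head(self.pool(h).squeeze(-1))   # logits, one per class


model = EcgCnn()
criterion = nn.BCEWithLogitsLoss()                   # independent binary target per class
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One hypothetical training step on a batch of 10-second, 500-Hz ECGs.
waveforms = torch.randn(8, 12, 5000)
labels = torch.randint(0, 2, (8, 38)).float()
loss = criterion(model(waveforms), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```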

Explainability

Understanding what CNNs learn in a data-driven manner from ECG data may provide valuable clinical insights. Therefore, we applied an explainability technique called Local Interpretable Model-Agnostic Explanations (LIME)18 to highlight which ECG segments drive particular CNN diagnoses as learned by the CNN from the training data (eFigure 2 in the Supplement). We present the results of LIME visually by mapping higher-weighted segments over the original ECG signal, with weight corresponding to color intensity. The LIME technique highlights temporal segments of the ECG waveform only, and LIME-highlighted segments are vertically scaled according to the relative ECG voltages within each highlighted segment.
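LIME is model-agnostic: it perturbs the input, observes how the CNN’s output probability changes, and fits a locally weighted linear model whose coefficients serve as segment importances. The function below is a generic re-implementation sketch of that idea for one ECG and one diagnostic class; the segment count, blanking strategy, and kernel width are placeholders, and the study’s exact LIME configuration is in the eMethods.

```python
import numpy as np
from sklearn.linear_model import Ridge


def lime_ecg(predict_fn, ecg, class_idx, n_segments=50, n_samples=500, seed=0):
    """Perturbation-based LIME sketch for one ECG and one diagnostic class.

    predict_fn: maps an array of ECGs (n, leads, samples) to probabilities (n, classes).
    ecg:        one ECG, shape (leads, samples).
    Returns one importance weight per temporal segment; larger weights mark
    segments that push the class probability up.
    """
    rng = np.random.default_rng(seed)
    bounds = np.linspace(0, ecg.shape[1], n_segments + 1).astype(int)

    masks = rng.integers(0, 2, size=(n_samples, n_segments))    # 1 = keep, 0 = blank
    masks[0] = 1                                                 # include the original ECG
    perturbed = np.repeat(ecg[np.newaxis].astype(float), n_samples, axis=0)
    for i in range(n_samples):
        for s in range(n_segments):
            if masks[i, s] == 0:
                perturbed[i, :, bounds[s]:bounds[s + 1]] = 0.0   # blank this segment

    probs = predict_fn(perturbed)[:, class_idx]
    # Weight each perturbation by its similarity to the original ECG
    # (here, simply the fraction of segments kept, passed through an RBF kernel).
    similarity = np.exp(-((1.0 - masks.mean(axis=1)) ** 2) / 0.25)
    local_model = Ridge(alpha=1.0).fit(masks, probs, sample_weight=similarity)
    return local_model.coef_
```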

Results

Study Population

A total of 1 051 114 ECGs were obtained at UCSF between January 1, 2003, and December 31, 2018. After removing the ECGs of patients younger than 18 years at the time of acquisition (n = 58 366), a total of 992 748 ECGs from 365 009 adult patients (mean [SD] age, 56.2 [17.6] years; 183 600 women [50.3%]; and 175 277 White patients [48.0%]) remained (Figure 1). The most common cardiologist clinical diagnoses in the total data set included sinus rhythm (63.0%), left ventricular hypertrophy (9.7%), and right bundle branch block (8.6%) (eTable 1 in the Supplement). The sampled training data set included 170 297 ECGs from 63 360 patients, the development data set included 89 920 ECGs from 32 574 patients, and the test data set included 91 440 ECGs from 32 576 patients (Figure 1). The consensus data set included 328 ECGs obtained in 2018 from 328 patients who were not in other cohorts.

Algorithm Performance

We used the holdout test data set first to provide a large-scale validation (91 440 ECGs) of CNN performance against the clinically accepted standard of cardiologist-confirmed clinical ECG diagnoses (Table 1). The CNN demonstrated high AUCs of at least 0.960 for 32 of 38 diagnostic classes (84.2%). Exceptions included ectopic atrial rhythm (CNN AUC, 0.949; 95% CI, 0.938-0.961), nonspecific intraventricular conduction delay (0.937; 95% CI, 0.931-0.943), prolonged QT (0.959; 95% CI, 0.958-0.961), and posterior infarct (0.937; 95% CI, 0.922-0.951); ST elevation and lead misplacement had AUCs of 0.870 (95% CI, 0.857-0.881) and 0.841 (95% CI, 0.816-0.863), respectively. Sensitivity in Table 1 is shown at a fixed threshold where specificity equals 0.90, and vice versa. At these thresholds (which can be changed depending on the application), sensitivity and specificity were generally quite high for rhythm and conduction diagnostic categories, with lower values for classes such as ST elevation (sensitivity, 0.628; specificity, 0.604) and lead misplacement (sensitivity, 0.692; specificity, 0.366). CNN F1 scores are shown in eTable 2 in the Supplement. We examined the co-occurrence and confusion of diagnoses between cardiologist clinical and CNN-predicted diagnoses to understand the association between the various diagnostic classes predicted by the CNN. Overall, patterns were similar for rhythm classes and other examined classes (Figure 2A and B; eFigures 3 and 4 in the Supplement). eFigure 5 in the Supplement plots the association between class prevalence and CNN performance.
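For reference, the operating points and CIs reported above can be derived from the CNN’s per-class probabilities along the lines sketched below. The threshold convention (sensitivity at the ROC operating point where specificity first reaches 0.90, and vice versa) and the percentile bootstrap are assumptions for illustration; the study’s exact procedures are in the eMethods.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve


def sensitivity_at_specificity(y_true, y_score, target_specificity=0.90):
    """Sensitivity at the ROC operating point where specificity reaches the target."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    ok = (1.0 - fpr) >= target_specificity         # specificity = 1 - false-positive rate
    return float(tpr[ok].max()) if ok.any() else 0.0


def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """AUC with a percentile bootstrap CI, resampling ECGs with replacement."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue                                # resample drew a single class; skip it
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return roc_auc_score(y_true, y_score), (float(lo), float(hi))
```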

Figure 2. Co-occurrence Matrices, Frequency-Weighted Mean F1 Scores, and Sensitivities for the Convolutional Neural Network (CNN).


Co-occurrence matrices for both (A) cardiologist-confirmed and (B) CNN-predicted rhythm diagnoses. Counts of co-occurring diagnosis pairs are shown, with totals on the diagonal. C, Mean F1 scores vs the committee consensus diagnosis. D, Mean sensitivity vs the committee consensus diagnosis. AUC indicates frequency-weighted area under the receiver operating characteristic curve; cardiologist dx, cardiologist clinical diagnosis; MUSE, electrocardiogram interpretation database management system by GE Healthcare; NA, not available.

a. F1 scores averaged by class frequencies.

b. Specificity is fixed at the frequency-weighted average cardiologist clinical diagnosis specificity for each class; sensitivities are reported at this fixed specificity. MUSE sensitivity and specificity are unalterable and therefore are reported in eTable 3 in the Supplement.

c. Sensitivity averaged by class frequencies.

The consensus committee data set provided a second validation of the CNN against the rigorous standard of uniformly applied expert consensus diagnoses (Table 2). Importantly, this criterion standard uniquely enabled a relative comparison between the CNN, clinical cardiologist, and (unedited) MUSE diagnosis using the F1 score. In all 5 diagnostic categories, the CNN had higher frequency-weighted mean F1 scores than both cardiologist clinical and MUSE diagnoses (frequency-weighted F1 score for CNN, cardiologist clinical diagnosis, and MUSE, respectively, for rhythm, 0.812, 0.773, and 0.690; conduction, 0.729, 0.560, and 0.538; chamber, 0.598, 0.420, and 0.424; infarct, 0.674, 0.566, and 0.514; and other diagnoses, 0.875, 0.680, and 0.243) (Figure 2C). The CNN demonstrated AUCs of at least 0.910 for 32 of 38 individual diagnostic classes (84.2%) (Table 2). Exceptions included sinus rhythm (CNN AUC, 0.856; 95% CI, 0.810-0.888), nonspecific intraventricular conduction delay (0.866; 95% CI, 0.771-0.939), prolonged QT (0.860; 95% CI, 0.795-0.918), left atrial enlargement (0.835; 95% CI, 0.760-0.898), lateral infarct (0.891; 95% CI, 0.766-0.995), and ST elevation (0.862; 95% CI, 0.746-0.948). The CNN demonstrated comparable performance to cardiologists, with the same or higher F1 scores for all classes except atrial fibrillation (CNN F1 score, 0.847 vs cardiologist F1 score, 0.881), junctional rhythm (0.526 vs 0.727), premature ventricular complex (0.786 vs 0.800), and Wolff-Parkinson-White (0.800 vs 0.842) (Table 2). Because cardiologist clinical diagnosis sensitivity and specificity are fixed, we fixed CNN specificity at the (high) specificity of cardiologists in order to compare CNN and cardiologist sensitivities (Figure 2D). The CNN had higher frequency-weighted sensitivities than cardiologists for all 5 diagnostic categories and for 31 of 38 individual classes (81.6%) (frequency-weighted sensitivities for CNN and cardiologist clinical diagnosis at a fixed specificity, respectively, for rhythm, 0.728, 0.709; conduction, 0.609, 0.513; chamber, 0.617, 0.457; infarct, 0.749, 0.696; and other diagnoses, 0.938, 0.812) (Table 2).

Table 2. Performance of the CNN, Cardiologist Diagnosis, and MUSE Diagnosisa on 38 Diagnostic Classes Compared Against the Committee Consensus Diagnosis (N = 328).

Diagnostic class Frequency CNN AUC (95% CI) CNN F1 scoreb Cardiologist clinical F1 score MUSE F1 score Cardiologist-fixed specificityc CNN sensitivity Cardiologist clinical sensitivity
Rhythm
Sinus 228 0.856 (0.810-0.888) 0.849 0.818 0.784 0.940 0.750 0.711
Atrial fibrillation 30 0.987 (0.974-0.996) 0.847 0.881 0.833 0.990 0.800 0.867
Atrial flutter 18 0.970 (0.939-0.995) 0.750 0.645 0.333 0.990 0.667 0.556
Ectopic atrial rhythmd 10 0.988 (0.974-0.999) 0.750 0.560 0.444 0.975 0.900 0.700
Atrial tachycardiad 13 0.920 (0.853-0.957) 0.400 0.333 0.133 0.978 0.231 0.308
Ventricular tachycardiad 9 0.997 (0.992-1.000) 0.842 0.615 0.000 1.000 0.556 0.444
Junctional rhythmd 9 0.967 (0.925-0.991) 0.526 0.727 0.300 0.984 0.556 0.889
Supraventricular tachycardiad 8 0.987 (0.970-0.998) 0.696 0.632 0.714 0.984 0.500 0.750
Bigeminyd 10 0.999 (0.997-1.000) 0.952 0.857 0.737 0.994 1.000 0.900
Premature ventricular complex 32 0.942 (0.894-0.985) 0.786 0.800 0.712 0.976 0.719 0.812
Premature atrial complex 28 0.974 (0.954-0.990) 0.759 0.692 0.556 0.980 0.679 0.643
Ventricular paced 13 0.983 (0.955-1.000) 0.917 0.750 0.762 0.994 0.846 0.692
Atrial pacede 14 0.984 (0.956-1.000) 0.846 0.800 0.800 0.997 0.786 0.714
Rhythm diagnosis averagef NA 0.909 0.812 0.773 0.690 0.961 0.728 0.709
Conduction
AV block
1st Degree 47 0.939 (0.906-0.962) 0.679 0.560 0.521 0.975 0.553 0.447
2nd Degree Mobitz 1d 9 0.999 (0.996-1.000) 0.941 0.900 0.625 0.994 0.889 1.000
Branch block
Left bundle 13 0.955 (0.871-1.000) 0.870 0.720 0.692 0.990 0.769 0.692
Right bundle 42 0.994 (0.986-0.999) 0.941 0.833 0.800 0.951 1.000 0.952
Left fascicular block
Anterior 35 0.973 (0.956-0.989) 0.756 0.621 0.491 0.983 0.629 0.514
Posteriord 23 0.971 (0.948-0.988) 0.656 0.529 0.529 0.993 0.435 0.391
Bifascicular blocke 21 0.988 (0.977-0.996) 0.800 0.485 0.485 0.987 0.667 0.381
Nonspecific intraventricular conduction delay 21 0.866 (0.771-0.939) 0.514 0.308 0.293 0.961 0.476 0.286
Axis deviation
Right 38 0.953 (0.928-0.971) 0.667 0.357 0.345 0.972 0.447 0.263
Left 47 0.951 (0.909-0.981) 0.800 0.593 0.674 0.964 0.574 0.511
Right superior axisd 5 0.951 (0.916-0.977) 0.303 0.125 0.125 0.969 0.200 0.200
Prolonged QT 26 0.860 (0.795-0.918) 0.500 0.360 0.415 0.950 0.423 0.346
Wolff-Parkinson-Whited 8 0.992 (0.980-1.000) 0.800 0.842 0.706 0.991 0.750 1.000
Conduction diagnosis averagef NA 0.951 0.729 0.560 0.538 0.971 0.609 0.513
Chamber enlargement
Ventricular hypertrophy
Left 24 0.975 (0.960-0.988) 0.700 0.644 0.645 0.947 0.875 0.792
Right 11 0.982 (0.954-0.998) 0.706 0.348 0.364 0.975 0.818 0.364
Atrial enlargement
Left 37 0.835 (0.760-0.898) 0.432 0.214 0.222 0.955 0.324 0.162
Righte 9 0.960 (0.878-1.000) 0.875 0.762 0.737 0.987 0.889 0.889
Chamber diagnosis averagef NA 0.910 0.598 0.420 0.424 0.959 0.617 0.457
Infarct
Anterior infarct 12 0.919 (0.861-0.970) 0.634 0.514 0.500 0.905 0.783 0.783
Septal infarct 27 0.966 (0.944-0.988) 0.737 0.656 0.613 0.953 0.815 0.741
Lateral infarcte 10 0.891 (0.766-0.995) 0.632 0.303 0.308 0.943 0.700 0.500
Inferior infarct 27 0.962 (0.906-0.987) 0.746 0.708 0.622 0.950 0.889 0.852
Posterior infarctd 4 0.925 (0.807-1.000) 0.667 0.400 0.400 0.975 0.500 0.750
ST elevatione 14 0.862 (0.746-0.948) 0.480 0.400 0.308 0.981 0.429 0.357
Infarct diagnosis averagef NA 0.934 0.674 0.566 0.514 0.950 0.749 0.696
Other
Lead misplacementd 12 0.999 (0.997-1.000) 0.917 0.833 0.250 0.994 0.917 0.833
Low voltage 4 0.985 (0.945-1.000) 0.750 0.222 0.222 0.938 1.000 0.750
Other diagnosis averagef NA 0.995 0.875 0.680 0.243 0.980 0.938 0.812

Abbreviations: AUC, area under the receiver operating characteristic curve; AV, atrioventricular; CNN, convolutional neural network; MUSE, electrocardiogram interpretation database management system by GE Healthcare; NA, not applicable.

a

Total diagnosis N = 948.

b

F1 score is a global metric of algorithm performance complementary to the AUC, which rewards algorithms that maximize positive predictive value and sensitivity simultaneously. It is particularly useful in settings where the frequency of classes is imbalanced. Reported F1 score for the CNN is the maximal F1 score in the consensus committee data set.

c

Specificity is fixed at the cardiologist clinical diagnosis specificity for each class. Convolutional neural network and cardiologist clinical diagnosis sensitivities are reported at this same fixed specificity for each class. MUSE sensitivity and specificities are fixed and are reported separately in eTable 3 in the Supplement, since MUSE specificity cannot be altered to match the cardiologist clinical diagnosis specificities shown here.

d

N < 4000 in the sampled training data set for this class.

e

N < 8000 in the sampled training data set for this class.

f

Frequency-weighted mean.

Comparing the performance of the CNN against the unedited MUSE diagnoses, the CNN had a higher F1 score than MUSE for all classes except supraventricular tachycardia (CNN F1 score, 0.696 vs MUSE F1 score, 0.714) (Table 2). When the cardiologist clinical diagnosis and MUSE diagnosis were compared against the consensus committee, cardiologists had higher F1 scores than MUSE for 30 of 38 classes (78.9%). MUSE automated analysis had higher F1 scores than cardiologists for supraventricular tachycardia (MUSE F1 score, 0.714 vs cardiologist F1 score, 0.632), ventricular paced (0.762 vs 0.750), left axis deviation (0.674 vs 0.593), prolonged QT (0.415 vs 0.360), left ventricular hypertrophy (0.645 vs 0.644), right ventricular hypertrophy (0.364 vs 0.348), left atrial enlargement (0.222 vs 0.214), and lateral infarct (0.308 vs 0.303). The CNN had higher sensitivities than MUSE for most classes at the same fixed MUSE specificity (eTable 3 in the Supplement). eTable 4 in the Supplement shows agreement between consensus committee members and the CNN.

Algorithm Explainability

To mitigate the black box limitation of CNNs, we used the LIME explainability technique to identify ECG segments that the CNN learned as important for each diagnosis. Figure 3 shows examples of applying LIME to ECGs from various classes. The LIME-highlighted ECG segments exhibit similarities between examples in each diagnostic class and potentially illuminate disease-associated cardiac physiology.

Figure 3. Examples of an Artificial Intelligence Explainability Technique Applied to Electrocardiograms (ECGs).


The Local Interpretable Model-Agnostic Explanations (LIME) explainability technique highlights ECG segments important to the convolutional neural network (CNN) for each diagnosis. Segments of greater importance are shown in greater color intensity. For each example, all leads with LIME-highlighted segments are shown, as is the CNN’s confidence score. Many physiologically associated ECG features were highlighted: in Wolff-Parkinson-White, the QRS “delta-wave” of preexcitation; in right ventricular hypertrophy, the R-prime in V1; and in inferior infarcts, the Q-wave in the inferior leads III and aVF. For the unipolar limb leads (H), a indicates augmented; F, foot; L, left arm; R, right arm; and V, vector.

Discussion

In this cross-sectional study, we show that 12-lead ECG data readily available in many institutions can train a CNN algorithm to achieve strong performance across 5 clinically relevant diagnostic ECG categories. The CNN exceeded the performance of a well-studied, widely available commercial ECG system19,20,21 for nearly all examined diagnoses. It additionally achieved performance comparable, by sensitivity, specificity, and F1 score, to that of the expert cardiologists who currently over-read clinical ECGs, offering a compelling alternative that could complement existing rule-based ECG systems for many diagnoses. Results also demonstrate a novel approach to mitigate CNNs’ black box limitation through the application of the LIME explainability technique, which offers transparency into the ECG waveform-associated physiology contributing to each CNN diagnosis.

A growing body of evidence supports the aptitude of machine learning algorithms like CNNs for everyday ECG analysis. Our results corroborate recent studies supporting strong CNN performance for common ECG diagnoses in various settings.4,5,22 This work additionally provides a critical comparison of CNN performance against clinically accepted standards of care in ECG interpretation: a widely used commercial algorithm (MUSE) and cardiologist clinical diagnoses (which provide the final clinical ECG interpretation in most health care workflows). Beyond performance gains, CNNs provide several advantages over algorithms in existing ECG analysis systems. First is the well-documented ability of CNNs to incrementally improve as more training data are made available.11,23,24 Second is the ability to train CNNs to predict new diagnoses beyond what physicians can perform with ECGs, as we and others have demonstrated,6,7,8 raising the possibility to expand the diagnostic utility of ECGs beyond their present scope.25 However, CNNs may also be less ideal for some ECG tasks, such as calculation of interval durations or QRS/P-wave axis, which are inherently rule-based tasks. Therefore, employing a strategic combination of CNNs for certain diagnoses alongside rule-based algorithms for others may best leverage the relative strengths of each algorithmic paradigm.

There are multiple clinical implications of incorporating CNNs into the clinical ECG workflow. High-performing CNNs for certain diagnoses could be integrated directly into ECG analysis systems, such as MUSE, or deployed by health care institutions on digital ECG data before ECGs are incorporated into their electronic record systems (alongside a commercial system’s diagnoses) and accessed by clinicians. Such integration could enable existing ECG systems to be supplemented for diagnoses where CNNs excel. Here, we specifically prioritized developing a CNN pipeline that accepts and trains on digital ECG data “as is” without requiring additional (costly) annotation, which is often prohibitive to perform at scale. This approach enables already-trained CNNs like ours to be further trained using ECG data already available in many institutions, unlocking the enormous potential to use existing data from the hundreds of millions of ECGs in institutions worldwide. Combined with CNNs’ ability to improve as data increase,11,23,24 our training approach enables CNNs to be refined using population or institution-specific data. For example, an institution could use its own ECG archive to retrain a CNN to better suit its population’s unique demographic makeup. In addition, CNNs can uniquely predict novel diagnoses from ECGs not currently possible with commercial systems, such as decreased ejection fraction,6 pulmonary hypertension, or hypertrophic cardiomyopathy.8,10 Existing ECG systems and clinicians stand to benefit from the expanded analytic capabilities of CNNs.

The next necessary step toward clinical application is refinement of CNN performance for specific diagnoses to adequate clinical standards. We trained our CNN for 38 diagnoses, many of which already demonstrated clinically ready performance. However, we also observed unexpectedly lower CNN performance for some diagnoses, such as for sinus rhythm or for atrial fibrillation and premature ventricular complex, in which the cardiologist clinical diagnosis had higher F1 scores than the CNN. It is possible that this outcome may result, in part, from systematic discrepancies in consensus committee diagnoses compared with cardiologist clinical diagnosis labels used for CNN training. Cardiologists may also perform particularly well on visually striking classes, like premature ventricular complex or atrial fibrillation. There are several opportunities by which to refine CNN performance for specific diagnoses, such as an increase in training data, generation of more precise training labels, or the addition of further hyperparameter tuning. Several other considerations are discussed in the eAppendix in the Supplement. Development of diagnosis-specific algorithms may also improve performance for individual diagnoses. For example, a specific CNN could be trained for atrial fibrillation (and/or related diagnoses), whereas another CNN could be trained for ischemia.

In our present medical era in which most medical data are digitized, the barriers to developing algorithms that can truly support a learning health system12 are no longer technical but instead are largely administrative and regulatory. A health system that can constantly learn over time will require data-driven algorithmic paradigms like machine learning, which support iterative improvement, to supplement existing rule-based algorithmic systems. Our results show that CNNs can already outperform existing standard-of-care ECG workflows for certain diagnoses, surpassing traditional algorithms and delivering performance comparable to expert-level manual review. Electrocardiogram analysis systems that consistently achieve expert-level performance could drastically change current clinical workflows26 by allowing delegation of certain ECG diagnoses to algorithms, focusing human expertise on difficult diagnoses where it is needed more. The common practice of having cardiologists confirm all ECGs regardless of diagnosis—which is subject to limitations of reader fatigue26—might be worth revisiting.

It is notable that for approximately 20% of diagnostic classes, the unedited MUSE diagnosis had better agreement with the consensus committee than the over-reading clinical cardiologist who made (erroneous) changes to the unedited MUSE diagnosis. Most of the classes for which MUSE outperformed the cardiologist clinical diagnosis were measurement-based diagnoses, such as prolonged QT, ventricular hypertrophy, or atrial enlargement (Table 2). For these diagnoses, more so than others, measurements or calculations are made on specific ECG features, making it plausible that rule-based algorithms could perform such measurements more objectively and consistently than busy cardiologists tasked to read hundreds of ECGs in a sitting. Because human attention is a limited resource, physician performance is influenced by the gravity of the diagnosis.27 Therefore, for less acute diagnoses, such as mild left atrial enlargement, human readers may pay less attention to borderline examples. Such trade-offs may not be necessary with well-trained algorithms.

Our work also suggests that greater amounts of training data would likely help to achieve the fullest potential of CNN-based ECG analysis. Although our data set had more than 1 million ECGs, many individual classes still had fewer than 4000 examples in the sampled training data set. We were thus limited to training our algorithm on the 38 most commonly available diagnoses—far from a comprehensive set of diagnoses with which we could ever consider replacing the MUSE system. Training similar CNNs for all required clinical diagnoses would likely demand more data than are available at any single institution. This scenario presents a compelling motivation to start working toward overcoming the formidable administrative obstacles that currently limit large-scale, multi-institutional data sharing in medicine. Innovation in areas such as federated learning, data privacy, and related fields28,29 may ultimately enable training medical algorithms across multiple institutions, capturing greater geographic and biologic diversity.

The black box nature inherent to CNNs poses a noteworthy disadvantage in ECG analysis because specific aspects of the tracing represent underlying cardiac physiology, which is critical to clinical decision-making. Our results suggest that LIME can mitigate this issue by providing quantitative interpretability to CNNs (Figure 3). Interpreted alongside the CNN’s confidence score for each diagnosis, LIME annotations could improve clinician confidence in automated analysis while supporting an optimal human-machine interaction that provides greater context to incorporate algorithmic diagnoses into clinical decisions. Furthermore, we believe that the enormous potential of data-driven machine learning in medicine lies, in part, in its ability to help drive discovery from large quantities of data.8 Convolutional neural network explainability techniques such as LIME, combined with large cross-institutional ECG data sets, would greatly facilitate such discovery by identifying disease-associated ECG patterns that are too subtle to be visually recognized or that come from data sets too large for manual review. For example, LIME-highlighted segments in our study that were unexpected may actually represent novel ECG correlates of underlying disease mechanisms yet to be understood, providing a basis for further investigation. Ultimately, such data-driven discovery may have the greatest potential for diseases whose recognized ECG correlates are likely incomplete (like amyloid or hypertrophic cardiomyopathy8,10).

Limitations

This study had some limitations. Our consensus committee data set was relatively small owing to constraints in the time and resources required to provide consensus diagnoses from the 3 electrophysiologists. We did not perform an external validation, raising the possibility of lower CNN performance in data from another institution on account of overfitting or systematic differences in ECGs, such as lead placement, noise, population, or other factors. It is also possible that the parsing strategy we used led to errors in class labels used in training and test data sets. However, this error would bias algorithm results toward the null and would have been revealed during consensus committee validation.

Conclusions

In this cross-sectional study, 12-lead ECG data readily available in many institutions were used to train a CNN algorithm that achieved performance comparable with that of cardiologists and exceeded the performance of a clinically accepted, automated ECG system for most, but not all, of the examined diagnostic classes. Larger data sets may be required to train deep learning–based algorithms across the full range of relevant clinical diagnoses. The LIME explainability technique can highlight physiologically relevant ECG segments that contribute to CNN diagnoses.

Supplement.

eFigure 1. Convolutional Neural Network (CNN) Architecture

eFigure 2. Deployment of an Explainable Convolutional Neural Network for Automated Electrocardiogram Analysis

eFigure 3. Co-Occurrence Matrices in the Test Data Set

eFigure 4. Confusion Matrix for CNN-Predicted Rhythm Diagnosis in the Test Data Set

eFigure 5. CNN Performance by Class and Number of Training Examples

eTable 1. Demographic and Diagnosis Characteristics

eTable 2. F1 Score of the Convolutional Neural Network on 38 Diagnostic Classes Compared Against the Cardiologist Clinical Diagnoses in the Hold-Out Test Data Set (n = 32 576 patients; 91 440 ECGs)

eTable 3. Sensitivity of the CNN, and MUSE Diagnosis Compared Against the Committee Consensus Diagnosis (n = 328)

eTable 4. Agreement Between the Consensus Committee Members (Electrophysiologists) and the CNN in the Consensus Committee Data Set (n = 328)

eMethods

eAppendix

eReferences

References

1. Hongo RH, Goldschlager N. Status of computerized electrocardiography. Cardiol Clin. 2006;24(3):491-504, x. doi:10.1016/j.ccl.2006.03.005
2. Schläpfer J, Wellens HJ. Computer-interpreted electrocardiograms: benefits and limitations. J Am Coll Cardiol. 2017;70(9):1183-1192. doi:10.1016/j.jacc.2017.07.723
3. Blackburn H, Keys A, Simonson E, Rautaharju P, Punsar S. The electrocardiogram in population studies: a classification system. Circulation. 1960;21(6):1160-1175. doi:10.1161/01.CIR.21.6.1160
4. Hannun AY, Rajpurkar P, Haghpanahi M, et al. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nat Med. 2019;25(1):65-69. doi:10.1038/s41591-018-0268-3
5. Ribeiro AH, Ribeiro MH, Paixão GMM, et al. Automatic diagnosis of the 12-lead ECG using a deep neural network. Nat Commun. 2020;11(1):1760. doi:10.1038/s41467-020-15432-4
6. Attia ZI, Kapa S, Lopez-Jimenez F, et al. Screening for cardiac contractile dysfunction using an artificial intelligence–enabled electrocardiogram. Nat Med. 2019;25(1):70-74. doi:10.1038/s41591-018-0240-2
7. Attia ZI, Noseworthy PA, Lopez-Jimenez F, et al. An artificial intelligence–enabled ECG algorithm for the identification of patients with atrial fibrillation during sinus rhythm: a retrospective analysis of outcome prediction. Lancet. 2019;394(10201):861-867. doi:10.1016/S0140-6736(19)31721-0
8. Tison GH, Zhang J, Delling FN, Deo RC. Automated and interpretable patient ECG profiles for disease detection, tracking, and discovery. Circ Cardiovasc Qual Outcomes. 2019;12(9):e005289. doi:10.1161/CIRCOUTCOMES.118.005289
9. Hong S, Wu M, Zhou Y, et al. ENCASE: an ensemble classifier for ECG classification using expert features and deep neural networks. Comput Cardiol. 2017;44:2-5. doi:10.22489/CinC.2017.178-245
10. Ko WY, Siontis KC, Attia ZI, et al. Detection of hypertrophic cardiomyopathy using a convolutional neural network–enabled electrocardiogram. J Am Coll Cardiol. 2020;75(7):722-733. doi:10.1016/j.jacc.2019.12.030
11. Sun C, Shrivastava A, Singh S, Gupta A. Revisiting unreasonable effectiveness of data in deep learning era. In: IEEE International Conference on Computer Vision. 2017:843-852. doi:10.1109/ICCV.2017.97
12. Smith M, Saunders R, Stuckhardt L, McGinnis M. Best Care at Lower Cost: The Path to Continuously Learning Health Care in America. The National Academies Press; 2012. doi:10.17226/13444
13. Samek W, Wiegand T, Müller KR. Explainable artificial intelligence: understanding, visualizing and interpreting deep learning models. August 28, 2017. Accessed June 22, 2021. https://arxiv.org/pdf/1708.08296.pdf
14. Silver D, Huang A, Maddison CJ, et al. Mastering the game of Go with deep neural networks and tree search. Nature. 2016;529(7587):484-489. doi:10.1038/nature16961
15. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition. 2016:770-778. doi:10.1109/CVPR.2016.90
16. Hand DJ, Till RJ. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn. 2001;45:171-186. doi:10.1023/A:1010920819831
17. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):e0118432. doi:10.1371/journal.pone.0118432
18. Ribeiro MT, Singh S, Guestrin C. “Why should I trust you?” Explaining the predictions of any classifier. In: KDD ’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016:1135-1144. doi:10.1145/2939672.2939778
19. Shah AP, Rubin SA. Errors in the computerized electrocardiogram interpretation of cardiac rhythm. J Electrocardiol. 2007;40(5):385-390. doi:10.1016/j.jelectrocard.2007.03.008
20. Guglin ME, Thatai D. Common errors in computer electrocardiogram interpretation. Int J Cardiol. 2006;106(2):232-237. doi:10.1016/j.ijcard.2005.02.007
21. Poon K, Okin PM, Kligfield P. Diagnostic performance of a computer-based ECG rhythm algorithm. J Electrocardiol. 2005;38(3):235-238. doi:10.1016/j.jelectrocard.2005.01.008
22. Kashou AH, Ko WY, Attia ZI, Cohen MS, Friedman PA, Noseworthy PA. A comprehensive artificial intelligence–enabled electrocardiogram interpretation program. Cardiovasc Digit Health J. 2020;1(2):62-70. doi:10.1016/j.cvdhj.2020.08.005
23. Gulshan V, Peng L, Coram M, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016;316(22):2402-2410. doi:10.1001/jama.2016.17216
24. Joulin A, van der Maaten L, Jabri A, Vasilache N. Learning visual features from large weakly supervised data. In: European Conference on Computer Vision. 2016:67-84. doi:10.1007/978-3-319-46478-7_5
25. Tison GH. Finding new meaning in everyday electrocardiograms—leveraging deep learning to expand our diagnostic toolkit. JAMA Cardiol. 2021;6(5):493-494. doi:10.1001/jamacardio.2020.7460
26. Anh D, Krishnan S, Bogun F. Accuracy of electrocardiogram interpretation by cardiologists in the setting of incorrect computer analysis. J Electrocardiol. 2006;39(3):343-345. doi:10.1016/j.jelectrocard.2006.02.002
27. Semigran HL, Levine DM, Nundy S, Mehrotra A. Comparison of physician and computer diagnostic accuracy. JAMA Intern Med. 2016;176(12):1860-1861. doi:10.1001/jamainternmed.2016.6001
28. Brisimi TS, Chen R, Mela T, Olshevsky A, Paschalidis IC, Shi W. Federated learning of predictive models from federated electronic health records. Int J Med Inform. 2018;112:59-67. doi:10.1016/j.ijmedinf.2018.01.007
29. Xu J, Glicksberg BS, Su C, Walker P, Bian J, Wang F. Federated learning for healthcare informatics. August 20, 2020. Accessed June 22, 2021. https://arxiv.org/pdf/1911.06270.pdf
