Journal of the American Medical Informatics Association (JAMIA). 2024 Jan 24;31(4):855–865. doi: 10.1093/jamia/ocae002

Biometric contrastive learning for data-efficient deep learning from electrocardiographic images

Veer Sangha 1,2, Akshay Khunte 3, Gregory Holste 4, Bobak J Mortazavi 5,6, Zhangyang Wang 7, Evangelos K Oikonomou 8, Rohan Khera 9,10,11,
PMCID: PMC10990541  PMID: 38269618

Abstract

Objective

Artificial intelligence (AI) detects heart disease from images of electrocardiograms (ECGs). However, traditional supervised learning is limited by the need for large amounts of labeled data. We report the development of Biometric Contrastive Learning (BCL), a self-supervised pretraining approach for label-efficient deep learning on ECG images.

Materials and Methods

Using pairs of ECGs from 78 288 individuals from Yale (2000-2015), we trained a convolutional neural network to identify temporally separated ECG pairs from the same patient that varied in layout. We fine-tuned BCL-pretrained models to detect atrial fibrillation (AF), gender, and LVEF < 40%, using ECGs from 2015 to 2021. We externally tested the models in cohorts from Germany and the United States. We compared BCL with ImageNet initialization and general-purpose self-supervised contrastive learning for images (SimCLR).

Results

With 100% of the labeled training data, BCL performed similarly to the other approaches for detecting AF/Gender/LVEF < 40%, with AUROCs of 0.98/0.90/0.90 in the held-out test sets, but it consistently outperformed the other methods at smaller proportions of labeled data, matching its full-data performance with only 50% of the data. With 0.1% of the data, BCL achieved AUROCs of 0.88/0.79/0.75, compared with 0.51/0.52/0.60 (ImageNet) and 0.61/0.53/0.49 (SimCLR). In external validation, BCL outperformed the other methods even at 100% labeled training data, with AUROCs of 0.88/0.88 for Gender and LVEF < 40%, compared with 0.83/0.83 (ImageNet) and 0.84/0.83 (SimCLR).

Discussion and Conclusion

A pretraining strategy that leverages biometric signatures of different ECGs from the same patient enhances the efficiency of developing AI models for ECG images. This represents a major advance in detecting disorders from ECG images with limited labeled data.

Background and significance

Electrocardiography is a ubiquitous tool for the diagnosis of cardiovascular diseases. Deep learning has been successfully applied to automate both the detection of disorders that are commonly discernable by physicians from ECGs,1,2 and the detection of disorders that traditionally require more specialized imaging modalities such as echocardiography or cardiac magnetic resonance imaging scans.3–5

While most existing AI tools to automatically analyze ECGs rely on raw signal data, there are significant challenges associated with their clinical implementation. The use of signal-based models in both retrospective and prospective settings requires access to signal repositories, and data are often not stored beyond printed ECG images, particularly in remote settings.6 The widespread adoption of signal-based models is limited not only by the health system-wide investments required to incorporate them into clinical workflow but also by ECG data architectures varying by ECG device vendors,7,8 precluding a one-size-fits-all solution.9,10 These investments may not be available or cost-effective in low-resource settings, and, to date, signal-based models are not even broadly available in higher-resource settings such as the United States.

Deep learning-based analysis done directly on ECG images has the potential to make AI inference directly available to clinicians, who predominantly work with ECG images, either as printouts or digital images.7 We have developed an approach that can interpretably diagnose conduction and rhythm disorders,11 as well as structural disorders12 from any layout of real-world 12-lead ECG images. Our image-based AI-ECG approach is applicable to various clinical settings, different hospitals, and data storage formats, representing an easily accessible and scalable approach to detect underdiagnosed cardiovascular disorders in at-risk populations. The vendor-neutral nature of the technology allows untrained operators to implement screening through either chart review or automated applications to image repositories, making it optimal for large-scale implementation (Figure S1).13

Despite the advantages of an image-based approach, like any AI approach regardless of modality, algorithmic training and development require large, labeled datasets. However, many clinical disorders have low prevalence, with only a few examples in any individual dataset available for developing algorithms for those conditions. This low prevalence of clinical labels is a key challenge for the development and generalizability of supervised learning approaches for ECG image models. For this reason, we developed a self-supervised learning approach to learn representations of ECG images that can serve as initializations for downstream fine-tuning on small, labeled datasets.

Self-supervised learning is designed to reduce the reliance on labeled data to develop models. The approach leverages unlabeled data to pretrain models before downstream fine-tuning on small, labeled datasets. It has been used in natural14 and medical image15,16 tasks and has recently been applied to ECG signals.17–19

Objectives

While self-supervised contrastive learning has been applied to real-world images of objects, there is no existing strategy for learning deeper features from unlabeled 12-lead ECG images. Moreover, applications of self-supervised contrastive learning that use raw ECG voltage data have not been designed to detect deep or hidden features of structural heart disease.

We have developed a few-shot, deep learning model development strategy, biometric contrastive learning (BCL), in which models are first trained to detect homologies between ECG features belonging to the same person, allowing enhanced learning and detection of structural and functional abnormalities of the heart from any ECG image. By training a model to recognize that two distinct ECGs belong to the same person, we obtain an initialization that already encodes the key hidden features that make two ECGs from a single person similar. This pretrained model can then be fine-tuned on a small number of labeled examples of any condition of interest, adapting to enable superior performance. We evaluated the efficacy of our pretrained encoder on three clinical tasks: AF, gender, and left ventricular systolic dysfunction (LVSD) classification.

Materials and methods

The study was reviewed by the Yale Institutional Review Board, which approved the study protocol and waived the need for informed consent as the study represents a secondary analysis of existing data. The data cannot be shared publicly. These methods outline the full study protocol and adhere to the Minimum Information for Medical AI Reporting (MINIMAR) guidelines.20

Data sources for model development

12-lead ECG signal waveform data collected between 2000 and 2021 from all patients undergoing ECGs at Yale New Haven Hospital (YNHH), a large academic medical center, were used for the development and validation of BCL. The socioeconomic status of study individuals was not available. However, as the study population comprises patients seeking care at a large referral hospital that also serves as the safety net hospital for a diverse US city, we expect wide socioeconomic diversity in the development and validation sets.

These ECGs were recorded as standard 12-lead recordings sampled at 500 Hz for 10 seconds. They were recorded on multiple machines, with the majority collected on Philips PageWriter and GE MAC devices. ECGs from 2000 to 2015 were used to develop the pretraining model; no clinical or other labels were used for these ECGs. ECGs collected between 2015 and 2020 were used to train supervised models on three clinical tasks. ECGs from separate patients collected in 2021 were used to evaluate the effectiveness of BCL and the other baseline methods.

Data preprocessing

ECG images were generated in the same manner as in previous studies.11,12 Briefly, ECGs with 10 seconds of continuous recordings across all 12 leads were preprocessed with a one-second median filter, which was subtracted from the original waveform to remove baseline drift, representing the processing steps pursued by ECG machines before generating printed output from collected waveform data.21 ECG signals were transformed into ECG images using the Python library ECG-plot. All images were converted to greyscale and then down-sampled to 300 × 300 pixels, regardless of their original resolution, using the Python Imaging Library (PIL v9.2.0). The process of converting ECG signals to images was independent of model development, ensuring that the model did not learn any aspects of the processing that generated images from the signals. Moreover, in the applications demonstrated previously,11,12 and in this current study, the images themselves do not undergo this preprocessing, with predictions generated exclusively on the images.
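
As an illustration of these steps, the sketch below applies a one-second median-filter baseline correction and the greyscale, 300 × 300 down-sampling with SciPy and PIL; the function names and exact filter handling are our own and not the authors' code.

```python
import numpy as np
from scipy.signal import medfilt
from PIL import Image

FS = 500  # sampling frequency (Hz), per the recording description above

def remove_baseline_drift(lead_signal: np.ndarray, fs: int = FS) -> np.ndarray:
    """Subtract a one-second median filter from the waveform to remove baseline drift."""
    kernel = fs + 1  # medfilt requires an odd kernel size; ~1 second of samples
    baseline = medfilt(lead_signal, kernel_size=kernel)
    return lead_signal - baseline

def prepare_image(path: str, size: int = 300) -> np.ndarray:
    """Convert a plotted ECG image to greyscale and down-sample to size x size pixels."""
    img = Image.open(path).convert("L")   # greyscale
    img = img.resize((size, size))        # 300 x 300 regardless of original resolution
    return np.asarray(img, dtype=np.float32) / 255.0
```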

We created datasets with different plotting schemes for each signal waveform recording to ensure models were adaptable to real-world images, which vary in format and lead layout. Four image formats were generated for use in both BCL model pretraining and classification fine-tuning tasks: (1) the standard printed ECG format in the United States, with four 2.5-second columns and a lead I rhythm strip; (2) a two-rhythm strip format, which added lead II as an additional rhythm strip to the standard format; (3) an alternate format, which consisted of two columns, each with 5 seconds of recording; and (4) a shuffled format, which had precordial leads in the first two columns and limb leads in the third and fourth. All images were rotated by a random angle between −10° and 10° before being input into the model to mimic variations seen in uploaded ECGs and to aid in the prevention of overfitting.
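
A minimal sketch of the random rotation augmentation described above, assuming PIL images; the white fill color for the exposed corners is an assumption, not a detail stated in the paper.

```python
import random
from PIL import Image

def random_rotate(img: Image.Image, max_deg: float = 10.0) -> Image.Image:
    """Rotate the ECG image by a random angle in [-10, 10] degrees before model input."""
    angle = random.uniform(-max_deg, max_deg)
    # fillcolor=255 paints the exposed corners white to mimic paper background
    return img.rotate(angle, resample=Image.BILINEAR, fillcolor=255, expand=False)
```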

Model training overview

BCL uses a convolutional neural network (CNN) backbone to build representations of ECGs specific to individuals. During pretraining, the model learned the elements of an ECG image that are consistent for a person. The model was rewarded for identifying ECG images from the same individual as similar and penalized for identifying ECG images from different individuals as similar (Figure 1).

Figure 1. Overview of biometric contrastive learning (BCL). Abbreviations: BCL = Biometric Contrastive Learning; CNN = convolutional neural network; ECG = electrocardiogram; MLP = multi-layer perceptron.

Positive and negative views

To develop the BCL model, we used ECG images in the four formats described above. Any two ECGs from the same person, in any two formats of image, were treated as a positive pair. Any pair of images from different individuals was treated as a negative pair.

Alternative approaches

In addition to BCL, we tested two traditional approaches for model pretraining. The first was initialization with EfficientNet-B3 model weights pretrained on ImageNet (224 × 224 pixel images). The second was the classic SimCLR14 contrastive pretraining method that has been popular in image processing tasks such as identifying objects in real-world photos. The latter approach, developed by Google, uses cropped and flipped parts of an image to identify the parts derived from the same image. The parameters used for SimCLR are included in the supplement (Table S1).

BCL model pretraining

We used the EfficientNet-B3 CNN architecture for our encoder, following the demonstrated effectiveness of the model on other ECG image classification tasks.11,12 This architecture takes 300 × 300 pixel input images, includes 384 layers, and has approximately 12 million trainable parameters (Figure S2). ECG images were converted to greyscale before being input into the model. Of note, the method is not restricted to this CNN architecture.

Our goal during pretraining was to minimize the contrastive loss function, which depended on the cosine similarity of embeddings of positive pairs in each batch compared to embeddings of negative pairs in the same batch. We used batch sizes of 16, with each ECG in the batch having one positive pair and 14 negative pairs, and accumulated gradients over 16 batches before updating the model.
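
The contrastive objective described here can be written as an NT-Xent-style loss over a batch of positive pairs; the sketch below assumes a temperature hyperparameter, which is not stated in this section (the paper's exact settings are in its supplement).

```python
import torch
import torch.nn.functional as F

def biometric_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent-style loss: z1[i] and z2[i] are projections of two ECG images from the
    same patient (positive pair); every other embedding in the batch is a negative.
    The temperature value is an assumption."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d) unit-norm embeddings
    sim = z @ z.T / temperature                           # pairwise cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)  # i <-> i + n
    return F.cross_entropy(sim, targets)
```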

Our encoder had an output dimension of 1536. We used a 2-layer multilayer perceptron (MLP) in pretraining, which projected the 1536-dimensional output of the encoder into a 128-dimensional space. We used an Adam optimizer with a learning rate of 1 × 10−5 for pretraining and trained the model for 10 epochs.
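
A sketch of how the encoder, projection head, optimizer, and 16-step gradient accumulation might be wired together, using torchvision's EfficientNet-B3 as a stand-in for the authors' implementation; the hidden width of the projection MLP and the `pair_loader` yielding batches of positive-pair images are assumptions, and `biometric_contrastive_loss` refers to the loss sketched above.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b3

encoder = efficientnet_b3(weights=None)
encoder.classifier = nn.Identity()        # expose the 1536-dimensional pooled features

projector = nn.Sequential(                # 2-layer MLP projecting 1536 -> 128 (hidden width assumed)
    nn.Linear(1536, 1536), nn.ReLU(inplace=True), nn.Linear(1536, 128),
)

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(projector.parameters()), lr=1e-5
)

ACCUM_STEPS = 16                          # accumulate gradients over 16 batches before updating
for step, (x1, x2) in enumerate(pair_loader):   # assumed loader of positive-pair image batches
    z1, z2 = projector(encoder(x1)), projector(encoder(x2))
    loss = biometric_contrastive_loss(z1, z2) / ACCUM_STEPS
    loss.backward()
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```

With a batch of 16 ECGs and 16 accumulation steps, each optimizer update reflects 256 images.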

Dataset for BCL and SimCLR pretraining

We pretrained our model using ECGs acquired between 2000 and 2015, without labels, from the Yale New Haven Health System. We excluded patients with ECGs after 2015 to ensure that there was no data leakage between pretraining and model development. For each patient, the pair of ECGs with the smallest time difference between 5 and 1000 days apart was chosen, and the other ECGs were not used. We set the minimum time difference for pairs to 5 days to encourage the model to learn biometric features specific to an individual that might be present across health states, rather than the comparatively simpler task of identifying ECGs taken in a short window. We chose 1000 days as the upper limit for our pairs to create a reasonably sized cohort for pretraining. For each pair of ECGs, six unique image pairs were created, corresponding to the possible combinations of two different image formats out of the four total. During pretraining, we ensured that each batch contained no more than one pair of ECGs from any given patient.
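
The pair selection described above might look like the following pandas sketch; the column names (`patient_id`, `ecg_id`, `acquisition_date`) are assumptions about the underlying table, not the authors' schema.

```python
from itertools import combinations
import pandas as pd

FORMATS = ["standard", "two_rhythm", "alternate", "shuffled"]

def select_pretraining_pairs(ecgs: pd.DataFrame) -> list[dict]:
    """For each patient, keep the ECG pair with the smallest time gap between 5 and
    1000 days, then expand it into the 6 combinations of two distinct image formats."""
    pairs = []
    for patient_id, group in ecgs.groupby("patient_id"):
        group = group.sort_values("acquisition_date")
        best = None
        for (_, a), (_, b) in combinations(group.iterrows(), 2):
            gap = abs((b.acquisition_date - a.acquisition_date).days)
            if 5 <= gap <= 1000 and (best is None or gap < best[2]):
                best = (a.ecg_id, b.ecg_id, gap)
        if best is None:
            continue
        for fmt_a, fmt_b in combinations(FORMATS, 2):   # 6 distinct format combinations
            pairs.append({"patient_id": patient_id,
                          "ecg_a": (best[0], fmt_a), "ecg_b": (best[1], fmt_b)})
    return pairs
```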

Downstream classification task training

We performed downstream fine-tuning and evaluation of our pretrained encoder on three clinical tasks: AF, gender, and LVSD classification. These tasks were chosen because they have different biological bases, and each has a large body of prior work demonstrating its detection from ECGs.

AF is a rhythm disorder characterized by missing P waves and irregular heartbeats. AF is diagnosable by clinicians from ECGs upon manual inspection, and our previous work has demonstrated its diagnosis by AI algorithms for 12-lead signal and image data.1,2,11 Cardiologist-confirmed diagnosis statements accompanying all ECGs in the development cohort were searched for strings referencing AF and its abbreviations to identify ECGs for the task.

In addition to AF, we chose two clinical labels that cannot be inferred from ECGs even by experts. The first is the gender of the patient. Gender is a hidden label that is not discernable on ECGs, but previous work has suggested that it can be detected on ECGs using AI methods.1,11 The second is LVSD, which is defined as LVEF < 40% and is associated with an over 8-fold increased risk of subsequent heart failure and a 2-fold risk of premature death.22 It is also a hidden label that has traditionally been diagnosed using echocardiography. Recent work has shown that LVSD can be detected using 12-lead ECGs.3,12,23 ECGs in the development cohort within 15 days of a TTE were used for this task, and LVEF values from cardiologists' reads of the nearest TTE to each ECG were used.

We used the pretrained encoders described above, followed by two randomly initialized fully connected layers, to predict the labels of interest. We trained our classification models on progressively larger samples of the training data, using 0.1%, 0.5%, 1%, 5%, 20%, 50%, and 100% labeled fractions of the training datasets, to test the effectiveness of the various initialization methods. Additionally, to test the few-shot capability of the model, we trained classification models on datasets containing only 5 randomly selected positive and 5 negative samples (10 collectively) for each condition. We trained with a learning rate of 1 × 10−4 for 40 epochs. We used an Adam optimizer, gradient clipping, and a minibatch size of 64 throughout training, and we stopped training when the validation loss did not improve in three consecutive epochs.
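
A hedged sketch of this fine-tuning loop, reusing the BCL-pretrained encoder; the hidden width of the classification head, the clipping norm, the binary cross-entropy objective, and the `train_loader`/`val_loader`/`evaluate` helpers are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ECGClassifier(nn.Module):
    """Pretrained encoder followed by two randomly initialized fully connected layers."""
    def __init__(self, encoder: nn.Module, hidden_dim: int = 512):  # hidden width assumed
        super().__init__()
        self.encoder = encoder
        self.head = nn.Sequential(nn.Linear(1536, hidden_dim), nn.ReLU(inplace=True),
                                  nn.Linear(hidden_dim, 1))
    def forward(self, x):
        return self.head(self.encoder(x)).squeeze(-1)

model = ECGClassifier(encoder)                 # encoder carries BCL-pretrained weights
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()             # binary objective is an assumption

best_val, patience = float("inf"), 0
for epoch in range(40):
    model.train()
    for x, y in train_loader:                  # minibatches of 64 labeled ECG images
        optimizer.zero_grad()
        loss = criterion(model(x), y.float())
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
        optimizer.step()
    val_loss = evaluate(model, val_loader)     # hypothetical helper returning validation loss
    if val_loss < best_val:
        best_val, patience = val_loss, 0
    else:
        patience += 1
        if patience == 3:                      # stop after 3 epochs without improvement
            break
```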

External validation

We pursued external validation to assess the generalizability of these pretraining techniques to external data sources. In addition to the held-out test set, AF and gender models were validated in the Germany-based PTB-XL dataset, which has previously been described.24 The dataset has 21 837 recordings from 18 885 patients, collected between 1989 and 1996 on Schiller AG devices. LVSD models were validated in a deidentified dataset of inpatient admissions at Lake Regional Hospital (LRH) in Osage Beach, Missouri, which has also been previously described.12 Briefly, the dataset contains 100 ECGs from unique patients, 43 of which were from patients with LVEF < 40% as measured by a TTE within 15 days of the ECG. The ECG images in this sample had a layout similar to the standard ECG format in the training set but had lead II rather than lead I as the rhythm strip. The images were captured from the electronic health records of individuals. These images also contained unique real-world artifacts, including a different background color, a different grid layout over which the waveform data are displayed, and different locations and fonts for the lead labels.

Statistical analysis

Categorical variables were presented as frequencies and percentages, and continuous variables as means and standard deviations or medians and interquartile ranges, as appropriate. Model performance was evaluated in the held-out test set and external ECG image datasets. We used the area under the receiver operating characteristic curve (AUROC) and the area under the precision–recall curve (AUPRC) to measure model discrimination. 95% CIs for AUROC were calculated using DeLong’s algorithm.25 We compared AUROC and AUPRC for BCL versus the other initialization methods across labels by computing the mean gain in each metric: the mean of a given metric for BCL across the three tasks, compared to the mean of that metric for ImageNet and SimCLR initializations across the three tasks. Analytic packages used in model development and statistical analysis are reported in Table S2. All model development and statistical analyses were performed using Python 3.9.5.
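
For illustration, discrimination metrics with percentile-bootstrap confidence intervals can be computed as below; note that the paper reports DeLong confidence intervals for AUROC, so the bootstrap here is a simpler stand-in rather than the authors' method.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def bootstrap_ci(y_true, y_score, metric, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap 95% CI for a discrimination metric."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:    # resample must contain both classes
            continue
        stats.append(metric(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return metric(y_true, y_score), lo, hi

# Example usage (labels and scores are placeholders):
# auroc, lo, hi = bootstrap_ci(y_true, y_score, roc_auc_score)
# auprc, lo, hi = bootstrap_ci(y_true, y_score, average_precision_score)
```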

Results

Study population

ECGs from the Yale New Haven Health System were split into pretraining, development, and held-out test sets. Briefly, our pretraining dataset contained ECGs from 2000 to 2015, our development datasets for the AF, gender, and LVSD models used ECGs from 2015 to 2020, and our held-out test set used ECGs from January to June 2021. We constructed two development datasets: one included all 382 830 ECGs with a corresponding LVEF value from their nearest TTE within 15 days of recording and was used to develop the LVSD models. We then randomly sampled the same number, 382 830, from the total 1 869 582 ECGs recorded during this period to construct a development cohort for the AF and gender models. Similarly, we constructed equal-sized, temporally distinct held-out test sets from ECGs in 2021, each containing 8708 ECGs, limited to one ECG per patient to ensure independence of observations in the assessment of performance metrics. Table 1 describes the patient characteristics in the pretraining, development, and test cohorts. ECGs in the development cohort for all three classification tasks were split into training and validation datasets at the patient level (90%, 10%), stratified by whether a patient had the condition of interest.

Table 1.

Baseline characteristics of the study population. Data presented as median [IQR] for age and number (percent) for other variables.

| Characteristic | Pretraining (2000-2015) | Development (2015-2020), AF & Gender | Development (2015-2020), EF | Test (2021), AF & Gender | Test (2021), EF |
|---|---|---|---|---|---|
| Number of ECGs | 156 576 | 382 830 | 382 830 | 8708 | 8708 |
| Patients | 78 288 | 207 432 | 104 665 | 8708 | 8708 |
| Sex: Female | 78 784 (50.3%) | 189 265 (49.4%) | 173 373 (45.3%) | 4362 (50.1%) | 4331 (49.7%) |
| Sex: Male | 74 146 (47.4%) | 193 565 (50.6%) | 209 415 (54.7%) | 4346 (49.9%) | 4363 (50.1%) |
| Sex: Missing | 3646 (2.4%) | 0 (0.0%) | 42 (0.0%) | 0 (0.0%) | 14 (0.2%) |
| Age (years) | 60 [44-74] | 64 [50-76] | 68 [57-78] | 57 [38-71] | 65 [53-77] |
| Ethnicity, Hispanic | 11 464 (7.3%) | 32 838 (8.6%) | 31 963 (8.3%) | 726 (8.3%) | 798 (9.2%) |
| Race (a): Asian | 1278 (0.8%) | 4107 (1.1%) | 4920 (1.3%) | 123 (1.4%) | 169 (1.9%) |
| Race (a): Black | 19 918 (12.7%) | 47 472 (12.4%) | 54 749 (14.3%) | 633 (7.3%) | 792 (9.1%) |
| Race (a): White | 75 914 (48.5%) | 210 947 (55.1%) | 246 506 (64.4%) | 3872 (44.5%) | 5399 (62.0%) |
| Race (a): Other | 1364 (0.9%) | 3804 (1.0%) | 4398 (1.1%) | 128 (1.5%) | 176 (2.0%) |
| Race (a): Unknown | 46 638 (29.8%) | 83 662 (21.9%) | 40 294 (10.5%) | 3226 (27.0%) | 1374 (15.8%) |
| ECG abnormality: AF | — | 32 144 (8.4%) | — | 432 (5.0%) | — |
| ECG abnormality: EF < 40% | — | — | 57 998 (15.1%) | — | 548 (6.3%) |

(a) White refers to non-Hispanic White.

Abbreviations: AF = atrial fibrillation; ECGs = electrocardiograms; EF = ejection fraction; Other refers to all other races that are not explicitly tracked.

Performance in the held-out test sets

When trained on 100% of the available data, the three strategies (BCL, ImageNet, SimCLR) were comparable for the 3 tasks of identifying gender, AF, and LVSD on both AUROC and AUPRC, the performance metrics we used to evaluate models. However, across the three tasks, BCL demonstrated equivalent performance for both AUROC (Figure 2 and Table 2) and AUPRC (Figure 3 and Table S3) with 50% of the available data as it did with 100%, whereas models initialized with ImageNet weights or pretrained with SimCLR suffered drop-offs in performance.

Figure 2. AUROC curves in held-out test sets. A) Ejection Fraction < 40%; B) Atrial Fibrillation; C) Gender. Abbreviations: AUROC = area under receiver-operating characteristic curve; BCL = Biometric Contrastive Learning.

Table 2.

AUROCs of models with different initializations in the held-out test sets. Values are AUROC (95% CI) by initialization strategy.

| Label | % Data | BCL | ImageNet | SimCLR |
|---|---|---|---|---|
| EF < 40% | FSL | 0.608 (0.583-0.633) | 0.539 (0.514-0.565) | 0.457 (0.433-0.482) |
| EF < 40% | 0.1 | 0.753 (0.731-0.775) | 0.603 (0.579-0.628) | 0.49 (0.464-0.516) |
| EF < 40% | 0.5 | 0.805 (0.785-0.825) | 0.737 (0.716-0.759) | 0.709 (0.687-0.732) |
| EF < 40% | 1 | 0.835 (0.818-0.852) | 0.793 (0.774-0.812) | 0.738 (0.716-0.761) |
| EF < 40% | 5 | 0.88 (0.866-0.894) | 0.845 (0.829-0.861) | 0.826 (0.81-0.843) |
| EF < 40% | 20 | 0.892 (0.879-0.905) | 0.875 (0.861-0.889) | 0.842 (0.826-0.858) |
| EF < 40% | 50 | 0.904 (0.892-0.916) | 0.898 (0.886-0.91) | 0.88 (0.866-0.893) |
| EF < 40% | 100 | 0.903 (0.891-0.915) | 0.901 (0.889-0.914) | 0.897 (0.885-0.91) |
| AF | FSL | 0.82 (0.799-0.84) | 0.495 (0.466-0.524) | 0.386 (0.359-0.412) |
| AF | 0.1 | 0.902 (0.887-0.918) | 0.513 (0.485-0.541) | 0.614 (0.588-0.64) |
| AF | 0.5 | 0.952 (0.942-0.961) | 0.841 (0.82-0.861) | 0.664 (0.639-0.69) |
| AF | 1 | 0.965 (0.956-0.973) | 0.933 (0.921-0.945) | 0.897 (0.886-0.907) |
| AF | 5 | 0.977 (0.97-0.985) | 0.978 (0.972-0.985) | 0.964 (0.957-0.971) |
| AF | 20 | 0.986 (0.98-0.993) | 0.985 (0.979-0.991) | 0.98 (0.972-0.988) |
| AF | 50 | 0.987 (0.98-0.994) | 0.99 (0.985-0.995) | 0.983 (0.975-0.99) |
| AF | 100 | 0.984 (0.976-0.992) | 0.991 (0.987-0.995) | 0.991 (0.986-0.995) |
| Gender | FSL | 0.634 (0.622-0.645) | 0.51 (0.498-0.522) | 0.505 (0.493-0.517) |
| Gender | 0.1 | 0.794 (0.785-0.803) | 0.519 (0.507-0.531) | 0.533 (0.521-0.545) |
| Gender | 0.5 | 0.844 (0.836-0.852) | 0.737 (0.727-0.748) | 0.537 (0.525-0.549) |
| Gender | 1 | 0.853 (0.845-0.86) | 0.758 (0.748-0.768) | 0.763 (0.753-0.773) |
| Gender | 5 | 0.877 (0.87-0.884) | 0.822 (0.814-0.831) | 0.797 (0.788-0.806) |
| Gender | 20 | 0.891 (0.884-0.897) | 0.87 (0.863-0.877) | 0.851 (0.843-0.859) |
| Gender | 50 | 0.9 (0.894-0.906) | 0.885 (0.878-0.891) | 0.856 (0.848-0.863) |
| Gender | 100 | 0.907 (0.901-0.913) | 0.905 (0.899-0.911) | 0.899 (0.892-0.905) |

Abbreviations: AF = atrial fibrillation; BCL = Biometric Contrastive Learning; EF = ejection fraction; FSL = few-shot learning.

Figure 3. AUPRC curves in held-out test sets. A) Ejection Fraction < 40%; B) Atrial Fibrillation; C) Gender. Abbreviations: AUPRC = area under precision-recall curve; BCL = Biometric Contrastive Learning.

As the quantity of available labeled training data progressively decreased, models trained using all three strategies saw lower performance, but BCL consistently outperformed the other methods, with the difference in performance between BCL and the other methods growing as labeled data became scarcer (Figures 2 and 3, Table 2, and Table S3). On models trained with 1% of the available data, AUROC for models trained with BCL on the tasks of detecting LVSD, AF, and gender was 0.84, 0.96, and 0.85, respectively, while AUROC for ImageNet-initialized models was 0.79, 0.93, and 0.76, respectively. AUROC for models trained with SimCLR was 0.74, 0.90, and 0.76, respectively. This corresponded to a mean gain in AUROC of 0.07 (8.7%) between BCL and the other two methods across tested applications (Figure 4A).

Figure 4. A) AUROC and B) AUPRC gains in held-out test sets. Abbreviations: AF = atrial fibrillation; AUROC = area under receiver-operating characteristic curve; AUPRC = area under precision-recall curve; BCL = Biometric Contrastive Learning; EF = ejection fraction.

On models trained with 1% of the data available, AUPRC for models trained with BCL on the tasks of detecting LVSD, AF, and gender was 0.31, 0.66, and 0.85, respectively, while AUPRC for ImageNet initialized models was 0.23, 0.55, and 0.75, respectively, and AUPRC for models trained with SimCLR was 0.19, 0.26, and 0.76, respectively. This corresponded to a mean gain in AUPRC of 0.15 (32.8%) between BCL and the other two methods across tested applications (Figure 4B).

On models trained with 0.1% of the data available, AUROC for models trained with BCL on the tasks of detecting LVSD, AF, and gender was 0.75, 0.90, and 0.79, respectively, and AUPRC was 0.19, 0.48, and 0.79. AUROC for ImageNet-initialized models was 0.60, 0.51, and 0.52, respectively, and AUPRC was 0.09, 0.05, and 0.52. Finally, AUROC for models trained with SimCLR was 0.49, 0.61, and 0.53, respectively, and AUPRC was 0.07, 0.07, and 0.53. This corresponded to a mean gain in AUROC of 0.27 (49.7%) and in AUPRC of 0.27 (119.8%) between BCL and the other two methods across tested applications (Figure 4). At a similar amount of training data to our 0.1% cohort, other previously published self-supervised learning methods for ECG signals resulted in a mean gain in AUROC over random initialization of 0.06 (9.1%) for PCLR (mean 0.69 to mean 0.76) and 0.13 (19%) for 3KG (mean 0.69 to mean 0.83) across tested applications.18,19

Finally, in a few-shot context, with only five positive and five negative samples available for training, only models initialized with BCL were able to discriminate between the two classes across all three conditions, with AUROCs of 0.61, 0.82, and 0.63 for LVSD, AF, and gender, respectively, compared with AUROCs in the 0.39-0.54 range for both other models (Table 2). Models initialized with BCL performed consistently across demographic subgroups in the held-out test set (Tables S5 and S6).

Performance in external validation datasets

The patterns observed in the held-out test set at small fractions of the training dataset available were also observed in the validation datasets. On models trained on 1% of the data available, AUROC for models trained with BCL on the tasks of detecting LVSD, AF, and gender in the two external validation sets was 0.81, 0.97, and 0.80, respectively. AUROC for ImageNet initialized models was 0.70, 0.87, and 0.71, respectively, and AUROC for models trained with SimCLR was 0.66, 0.82, and 0.71, respectively (Figure 5 and Table 3). This corresponded to a mean gain in AUROC of 0.11 (15.3%) and AUPRC of 0.23 (42.8%) across applications between BCL and the other methods (Figure 6). There were similar patterns in AUROC and AUPRC when using other fractions of training data less than the entire dataset (Figure 5 and Figure S3 and Table 3 and Table S4).

Figure 5. AUROC curves in external validation sets. A) Ejection Fraction < 40%; B) Atrial Fibrillation; C) Gender. Abbreviations: AUROC = area under receiver-operating characteristic curve; BCL = Biometric Contrastive Learning.

Table 3.

AUROCs of models with different initializations in the external validation sets. Values are AUROC (95% CI) by initialization strategy.

| Label | % Data | BCL | ImageNet | SimCLR |
|---|---|---|---|---|
| EF < 40% | FSL | 0.638 (0.528-0.749) | 0.489 (0.377-0.6) | 0.488 (0.371-0.605) |
| EF < 40% | 0.1 | 0.732 (0.635-0.829) | 0.515 (0.398-0.633) | 0.537 (0.423-0.65) |
| EF < 40% | 0.5 | 0.81 (0.726-0.894) | 0.658 (0.552-0.764) | 0.573 (0.461-0.685) |
| EF < 40% | 1 | 0.807 (0.723-0.89) | 0.696 (0.59-0.801) | 0.655 (0.549-0.762) |
| EF < 40% | 5 | 0.851 (0.776-0.926) | 0.798 (0.712-0.885) | 0.696 (0.594-0.798) |
| EF < 40% | 20 | 0.878 (0.811-0.944) | 0.771 (0.678-0.864) | 0.753 (0.66-0.846) |
| EF < 40% | 50 | 0.886 (0.822-0.95) | 0.841 (0.764-0.919) | 0.789 (0.702-0.875) |
| EF < 40% | 100 | 0.883 (0.817-0.949) | 0.83 (0.75-0.911) | 0.83 (0.75-0.91) |
| AF | FSL | 0.826 (0.815-0.836) | 0.514 (0.498-0.53) | 0.332 (0.318-0.347) |
| AF | 0.1 | 0.895 (0.886-0.904) | 0.487 (0.471-0.503) | 0.532 (0.516-0.548) |
| AF | 0.5 | 0.96 (0.955-0.965) | 0.75 (0.736-0.763) | 0.686 (0.672-0.701) |
| AF | 1 | 0.971 (0.967-0.975) | 0.871 (0.861-0.881) | 0.825 (0.817-0.833) |
| AF | 5 | 0.981 (0.977-0.984) | 0.955 (0.949-0.96) | 0.948 (0.944-0.952) |
| AF | 20 | 0.983 (0.98-0.987) | 0.981 (0.977-0.985) | 0.982 (0.978-0.985) |
| AF | 50 | 0.985 (0.981-0.988) | 0.985 (0.981-0.989) | 0.982 (0.979-0.986) |
| AF | 100 | 0.982 (0.978-0.986) | 0.986 (0.983-0.989) | 0.987 (0.984-0.99) |
| Gender | FSL | 0.577 (0.569-0.584) | 0.554 (0.546-0.561) | 0.451 (0.444-0.459) |
| Gender | 0.1 | 0.748 (0.741-0.754) | 0.56 (0.553-0.568) | 0.549 (0.541-0.557) |
| Gender | 0.5 | 0.803 (0.797-0.808) | 0.684 (0.677-0.691) | 0.572 (0.564-0.58) |
| Gender | 1 | 0.803 (0.797-0.808) | 0.715 (0.708-0.721) | 0.713 (0.706-0.719) |
| Gender | 5 | 0.844 (0.839-0.849) | 0.762 (0.755-0.768) | 0.7 (0.693-0.707) |
| Gender | 20 | 0.87 (0.865-0.874) | 0.767 (0.76-0.773) | 0.775 (0.769-0.781) |
| Gender | 50 | 0.881 (0.876-0.885) | 0.803 (0.797-0.808) | 0.784 (0.778-0.79) |
| Gender | 100 | 0.877 (0.872-0.881) | 0.835 (0.83-0.84) | 0.838 (0.833-0.843) |

Abbreviations: AF = atrial fibrillation; BCL = Biometric Contrastive Learning; EF = ejection fraction; FSL = few-shot learning.

Figure 6. A) AUROC and B) AUPRC gains in external validation sets. Abbreviations: AF = atrial fibrillation; AUROC = area under receiver-operating characteristic curve; AUPRC = area under precision-recall curve; BCL = Biometric Contrastive Learning; EF = ejection fraction.

On models trained with 100% of data available, BCL performed better in the external validation sets for the two hidden label tasks of LVSD and gender. AUROCs for models trained with BCL were 0.88 and 0.88 for LVSD and gender, respectively, with ImageNet initializations being 0.83 and 0.83 respectively, and with SimCLR being 0.83 and 0.84, respectively (Figure 5 and Table 3). AUPRCs for models trained with BCL were 0.88 and 0.89, respectively, with ImageNet initializations being 0.75 and 0.85, respectively, and with SimCLR being 0.80 and 0.86 respectively (Figure S3). The initialization techniques demonstrated comparable performances for AF classification.

Discussion

Our novel biometric pretraining framework enables label-efficient learning on ECG images. The method focuses on building models to identify shared features of ECGs drawn from the same person at different times and plotted in different layouts and augmentations. This allows us to use unlabeled data to achieve large gains in the model development process for applications with sparsely labeled datasets, such as rare clinical disorders. While the model is not explicitly trained for any clinical diagnosis identification task during the pretraining process, it learns deeper representations from ECGs across different layouts. Using this method, a format-independent model for ECG images can be trained with just a few positive and negative examples.

Biometric contrastive learning (BCL) consistently outperforms two other commonly used initialization methods: ImageNet initialization and pretraining using SimCLR, the standard contrastive pretraining approach for image-based models. We compared the approaches on three tasks with differing biological bases: AF, gender, and LVSD. Evaluation across these three distinct tasks spanning clinical and hidden disorders demonstrated the broad relevance of this pretraining strategy for many discrete classification tasks. The fact that BCL generalized as the best method across all tasks indicates that it can be used across disease domains in the development of ECG models. BCL outperformed the other methods by a larger margin as the quantity of available data decreased. Additionally, BCL performed better for hidden label tasks in our external validation datasets even when trained on 100% of the data. This suggests that the method is effective for rare disorders, when only a few positive examples are available for model development, and that it may learn patterns that generalize better to new data sources even when trained with larger databases.

For ImageNet initializations, performance dropped across the three tasks as the amount of available training data became smaller, with drops in performance becoming larger as the scarcity of data increased. While BCL minimized this drop, SimCLR did not provide a performance boost over supervised pretraining on ImageNet, a large general-purpose dataset, for the three tasks studied, as has previously been reported in other medical image classification tasks.15,26 SimCLR has been effectively applied to medical image tasks that are performed by humans and are resilient to transformations such as rotations, flips, and contortions of the image. ECG classification by humans relies on patterns in the plotted signals that are not as resilient to image contortions and may require a learned representation of patterns across several leads and time points in the image. It is possible that the representations of ECG images learned through the distortions in SimCLR are not relevant to the features of the image that contain information about both clinical labels diagnosable by physicians and hidden labels. On the other hand, it appears that the signatures stored in the BCL model, which learns which features of the image are unique to any given individual, are indeed relevant to features useful for such classification tasks. When compared to self-supervised frameworks developed for ECG signal data, BCL appeared to offer a greater gain in AUROC over baseline initializations at a similar development cohort size. More work is needed to explicitly compare signal-based methods on the same variety of tasks that BCL was tested on.

Our study has certain limitations that merit consideration. First, while the results suggest that BCL is most effective when training models for rare disorders, we did not compare pretraining strategies for a rare disorder, instead choosing to train models on fractions of labeled datasets with more common labels. This approach was essential to identify how little labeled training data each strategy requires. While our work indicates that BCL would be effective when used on rarer disorders, further investigation on both such rare disorders and other disorders not explicitly tested in this analysis is necessary. Second, our models were built on a single CNN architecture, without an explicit evaluation against alternative models and architectures or of whether the pretraining strategy is more suitable for some models than others. While the methods presented here should be generalizable to any encoder, future studies could evaluate these other architectures. Third, our models were developed in a single institution. However, we have a large, diverse population, and the models were explicitly validated at other institutions and data sources. The ECG datasets are similar across sites, and therefore, our findings suggest that the pretraining procedures would generalize to other sites developing models for ECG images. Fourth, while SimCLR is the most popular self-supervised contrastive method, we did not perform extensive hyperparameter optimization or try other methods such as MoCo or SimSiam. Further studies could incorporate methods from these techniques to further improve upon BCL. Fifth, while we attempted to provide comparisons between different models using gains between ROC and PR curves and by providing confidence intervals for AUROC values using DeLong’s method and AUPRC values using bootstrapping, practical implementation of these models in clinical workflows occurs at specific thresholds, at which the magnitude of model performance differences may vary. Nevertheless, the comparison methods we employed indicate whether the performance curves for a pair of models are significantly different and are informative across the spectrum of potential thresholds. Sixth, development and validation of our models were carried out in patients who had already undergone ECGs and/or echocardiography (in the case of patients analyzed for LVSD). Curated retrospective repositories containing these modalities of data are still needed to train algorithms, although BCL reduces the burden of collecting large amounts of labeled data. Furthermore, these cohorts may not be the same as the intended populations for model deployment. While the generalizability of the techniques presented here is promising, future work will focus on deploying them in more diverse scenarios, including prospective screening settings.

Conclusion

We developed a novel pretraining strategy that leverages the biometric signature of different ECGs from the same patient to significantly enhance data efficiency in developing AI-ECG models for ECG images, across several discrete tasks. This approach broadens the applications of image AI-ECG to rare disorders, for which training data is often limited, representing a significant advance in format-independent deep learning for the detection of heart disease directly from ECG images.

Contributor Information

Veer Sangha, Section of Cardiovascular Medicine, Department of Internal Medicine, Yale University, New Haven, CT, 06510, United States; Department of Engineering Science, Oxford University, Oxford, OX1 3PJ, United Kingdom.

Akshay Khunte, Department of Computer Science, Yale University, New Haven, CT, 06511, United States.

Gregory Holste, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, 78712, United States.

Bobak J Mortazavi, Department of Computer Science & Engineering, Texas A&M University, College Station, TX, 77843, United States; Center for Outcomes Research and Evaluation (CORE), Yale New Haven Hospital, New Haven, CT, 06510, United States.

Zhangyang Wang, Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, 78712, United States.

Evangelos K Oikonomou, Section of Cardiovascular Medicine, Department of Internal Medicine, Yale University, New Haven, CT, 06510, United States.

Rohan Khera, Section of Cardiovascular Medicine, Department of Internal Medicine, Yale University, New Haven, CT, 06510, United States; Center for Outcomes Research and Evaluation (CORE), Yale New Haven Hospital, New Haven, CT, 06510, United States; Section of Health Informatics, Department of Biostatistics, Yale School of Public Health, New Haven, CT, 06510, United States.

Author contributions

R.K. conceived the study and accessed the data. V.S. developed the models. V.S. pursued the statistical analysis. V.S. drafted the manuscript. All authors provided feedback regarding the study design and made critical contributions to the writing of the manuscript. R.K. supervised the study, procured funding, and was the guarantor.

Supplementary material

Supplementary material is available at Journal of the American Medical Informatics Association online.

Funding

This study was supported by research funding awarded to Dr Khera by the Yale School of Medicine and grant support from the National Heart, Lung, and Blood Institute of the National Institutes of Health under the award K23HL153775. The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Conflict of interests

Mr Sangha and Dr Khera are the coinventors of U.S. Provisional Patent Application No. 63/346 610, “Articles and methods for format-independent detection of hidden cardiovascular disease from printed electrocardiographic images using deep learning”. They are also cofounders of Ensight-AI. Dr Mortazavi reported receiving grants from the National Institute of Biomedical Imaging and Bioengineering, the National Heart, Lung, and Blood Institute, the US Food and Drug Administration, and the US Department of Defense Advanced Research Projects Agency outside the submitted work; in addition, Dr Mortazavi has a pending patent on predictive models using electronic health records (US20180315507A1). Dr Oikonomou receives support from the National Heart, Lung, and Blood Institute of the National Institutes of Health (under award 1F32HL170592). He is an academic co-founder of Evidence2Health LLC and an ad hoc consultant for Caristo Diagnostics Ltd (all outside the submitted work). Dr Khera receives support from the National Heart, Lung, and Blood Institute of the National Institutes of Health (under award K23HL153775) and the Doris Duke Charitable Foundation (under award 2022060). He receives support from the Blavatnik Foundation through the Blavatnik Fund for Innovation at Yale. He also receives research support, through Yale, from Bristol-Myers Squibb, Novo Nordisk, and BridgeBio. He is an Associate Editor at JAMA. In addition to 63/346 610, Dr Khera is a coinventor of U.S. Provisional Patent Applications 63/177 117, 63/428 569, and 63/484 426. He is also a founder of Evidence2Health LLC, a precision health platform to improve evidence-based cardiovascular care.

Data availability

The data from the Yale New Haven Hospital and Lake Regional Hospital cannot be shared publicly as IRB stipulations do not permit data sharing. PTB-XL ECGs are available online at https://physionet.org/content/ptb-xl/1.0.3/.

References

1. Ribeiro AH, Ribeiro MH, Paixão GMM, et al. Automatic diagnosis of the 12-lead ECG using a deep neural network. Nat Commun. 2020;11(1):1760. doi:10.1038/s41467-020-15432-4
2. Hannun AY, Rajpurkar P, Haghpanahi M, et al. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nat Med. 2019;25(1):65-69. doi:10.1038/s41591-018-0268-3
3. Attia ZI, Kapa S, Lopez-Jimenez F, et al. Screening for cardiac contractile dysfunction using an artificial intelligence-enabled electrocardiogram. Nat Med. 2019;25(1):70-74. doi:10.1038/s41591-018-0240-2
4. Ulloa-Cerna AE, Jing L, Pfeifer JM, et al. rECHOmmend: an ECG-based machine learning approach for identifying patients at increased risk of undiagnosed structural heart disease detectable by echocardiography. Circulation. 2022;146(1):36-47. doi:10.1161/CIRCULATIONAHA.121.057869
5. Kwon J-M, Lee SY, Jeon K-H, et al. Deep learning-based algorithm for detecting aortic stenosis using electrocardiography. J Am Heart Assoc. 2020;9(7):e014717. doi:10.1161/JAHA.119.014717
6. Siontis KC, Noseworthy PA, Attia ZI, Friedman PA. Artificial intelligence-enhanced electrocardiography in cardiovascular disease management. Nat Rev Cardiol. 2021;18(7):465-478. doi:10.1038/s41569-020-00503-2
7. Cuevas-González D, García-Vázquez JP, Bravo-Zanoguera M, et al. ECG standards and formats for interoperability between mHealth and healthcare information systems: a scoping review. Int J Environ Res Public Health. 2022;19(19):11941. doi:10.3390/ijerph191911941
8. Bond RR, Finlay DD, Nugent CD, Moore G. A review of ECG storage formats. Int J Med Inform. 2011;80(10):681-697. doi:10.1016/j.ijmedinf.2011.06.008
9. Lyon A, Minchole A, Martinez JP, Laguna P, Rodriguez B. Computational techniques for ECG analysis and interpretation in light of their contribution to medical advances. J R Soc Interface. 2018;15(138). doi:10.1098/rsif.2017.0821
10. Chung CT, Lee S, King E, et al. Clinical significance, challenges and limitations in using artificial intelligence for electrocardiography-based diagnosis. Int J Arrhythmia. 2022;23(1):24. doi:10.1186/s42444-022-00075-x
11. Sangha V, Mortazavi BJ, Haimovich AD, et al. Automated multilabel diagnosis on electrocardiographic images and signals. Nat Commun. 2022;13(1):1583. doi:10.1038/s41467-022-29153-3
12. Sangha V, Nargesi AA, Dhingra LS, et al. Detection of left ventricular systolic dysfunction from electrocardiographic images. Circulation. 2023;148(9):765-777. doi:10.1161/CIRCULATIONAHA.122.062646
13. Leiner T, Bennink E, Mol CP, Kuijf HJ, Veldhuis WB. Bringing AI to the clinic: blueprint for a vendor-neutral AI deployment infrastructure. Insights Imaging. 2021;12(1):11. doi:10.1186/s13244-020-00931-1
14. Chen T, Kornblith S, Norouzi M, Hinton G. A simple framework for contrastive learning of visual representations. arXiv:2002.05709, 2020. Preprint: not peer reviewed.
15. Azizi S, et al. Big self-supervised models advance medical image classification. arXiv:2101.05224, 2021. Preprint: not peer reviewed.
16. Ciga O, Xu T, Martel AL. Self supervised contrastive learning for digital histopathology. arXiv:2011.13971, 2020. Preprint: not peer reviewed.
17. Kiyasseh D, Zhu T, Clifton DA. CLOCS: contrastive learning of cardiac signals across space, time, and patients. arXiv:2005.13249, 2020. Preprint: not peer reviewed.
18. Gopal B, et al. 3KG: contrastive learning of 12-lead electrocardiograms using physiologically-inspired augmentations. arXiv:2106.04452, 2021. Preprint: not peer reviewed.
19. Diamant N, Reinertsen E, Song S, et al. Patient contrastive learning: a performant, expressive, and practical approach to electrocardiogram modeling. PLoS Comput Biol. 2022;18(2):e1009862. doi:10.1371/journal.pcbi.1009862
20. Hernandez-Boussard T, Bozkurt S, Ioannidis JPA, Shah NH. MINIMAR (MINimum Information for Medical AI Reporting): developing reporting standards for artificial intelligence in health care. J Am Med Inform Assoc. 2020;27(12):2011-2015. doi:10.1093/jamia/ocaa088
21. Serhani MA, El Kassabi HT, Ismail H, Nujum Navaz A. ECG monitoring systems: review, architecture, processes, and key challenges. Sensors (Basel). 2020;20(6):1796. doi:10.3390/s20061796
22. Wang TJ, Evans JC, Benjamin EJ, et al. Natural history of asymptomatic left ventricular systolic dysfunction in the community. Circulation. 2003;108(8):977-982. doi:10.1161/01.CIR.0000085166.44904.79
23. Khunte A, Sangha V, Oikonomou EK, et al. Detection of left ventricular systolic dysfunction from single-lead electrocardiography adapted for portable and wearable devices. NPJ Digit Med. 2023;6(1):124. doi:10.1038/s41746-023-00869-w
24. Wagner P, Strodthoff N, Bousseljot R-D, et al. PTB-XL, a large publicly available electrocardiography dataset. Sci Data. 2020;7(1):154. doi:10.1038/s41597-020-0495-6
25. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837-845.
26. Azizi S, Culp L, Freyberg J, et al. Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging. Nat Biomed Eng. 2023;7(6):756-779. doi:10.1038/s41551-023-01049-7
