European Heart Journal. Digital Health
. 2025 May 21;6(4):619–623. doi: 10.1093/ehjdh/ztaf051

A deep foundation model for electrocardiogram interpretation: enabling rare disease detection through transfer learning

Stephanie M Hu, Joshua P Barrios, Geoffrey H Tison✉
PMCID: PMC12282392  PMID: 40703125

Abstract

In healthcare, scarcity of high-quality human-adjudicated labelled data may limit the potential of deep neural networks (DNNs). Foundation models provide an efficient starting point for deep learning that can facilitate effective DNN training with fewer labelled training examples. In this study, we leveraged cardiologist-confirmed labels from a large dataset of 1.6 million electrocardiograms (ECGs) acquired as part of routine clinical care at UCSF between 1986 and 2019 to pre-train a convolutional DNN to predict 68 common ECG diagnoses. To our knowledge, this model is one of the most comprehensive ECG DNN models to date, demonstrating high performance with a median area under the receiver operating characteristic curve (AUC) of 0.978, median sensitivity of 0.937, and median specificity of 0.923. We then demonstrate the model’s utility as a foundation model by additionally training (fine-tuning) the DNN to detect three novel ECG diagnoses with relatively small datasets: carcinoid syndrome, pericardial constriction, and rheumatic doming of the mitral valve. Fine-tuning training of the foundation model achieved an AUC of 0.772 (95% CI 0.723–0.816) for carcinoid syndrome, 0.883 (95% CI 0.863–0.906) for pericardial constriction, and 0.826 (95% CI 0.802–0.854) for rheumatic doming, compared to 0.492 (95% CI 0.434–0.558), 0.689 (95% CI 0.656–0.720), and 0.701 (95% CI 0.657–0.745), respectively, for DNNs trained from scratch on the same small datasets. Our results demonstrate that the ECG foundation model learned a flexible representation of ECG waveforms and can improve performance of fine-tuned downstream models, particularly in data-limited settings.

Keywords: Electrocardiogram, Artificial intelligence, Neural network, Transfer learning, Arrhythmias, Valvular disease, Pericardial disease

Graphical Abstract

We first pre-trained an electrocardiogram (ECG) deep neural network (DNN) foundation model to detect 68 common ECG diagnoses. Next, using this pre-trained foundation model as a starting point, we further fine-tuned it for three rare-disease, novel ECG tasks, showing that the fine-tuned pre-trained model consistently outperformed the DNN trained from scratch using the same relatively small training datasets. Future work can use the pre-trained ECG DNN foundation model as a starting point to train models for various future tasks.

Introduction

Deep neural networks (DNNs) have been trained to automatically classify a variety of electrocardiogram (ECG) diagnoses, including rhythm abnormalities, left ventricular hypertrophy, and ischaemia.1–4 However, a major challenge for training DNNs, particularly in medicine, is obtaining sufficient high-quality labelled data for less common diagnoses.5 Foundation models are pre-trained DNNs that can be used as starting points for efficient subsequent training (known as fine-tuning) on other tasks.6 A broadly-trained ECG foundation model could be useful to train models for rare medical conditions.

In this study, we leveraged a large dataset of 1.6 million ECGs with cardiologist-confirmed diagnostic labels to first pre-train a DNN foundation model that simultaneously classifies 68 diagnoses from raw ECG waveforms. We aimed for the DNN to learn a flexible representation of 12-lead ECG waveforms that would improve performance for downstream training tasks. To demonstrate its utility as a foundation model, we then fine-tuned this ECG foundation model to detect three relatively rare conditions that are typically diagnosed by echocardiography, not ECG, making them ‘novel’ ECG diagnosis tasks: carcinoid syndrome (CS), pericardial constriction (PC), and rheumatic doming of the mitral valve (RD). For each condition, we compared performance of the fine-tuned foundation models against DNNs trained from scratch using the same data.

Methods

Cohort and dataset creation

For this cross-sectional study, we obtained all ECGs collected as part of routine clinical care at UCSF between September 1986 and December 2019, yielding 1 616 852 ECGs from 408 363 patients aged ≥18 years. Electrocardiogram data underwent initial interpretation by the MUSE software (GE Healthcare); UCSF cardiologists provided the final ECG interpretation by changing or confirming the MUSE diagnosis. We identified the most common cardiologist-confirmed diagnostic labels using keyword-matching against standard MUSE terms. All diagnoses with a frequency of ≥200 examples were selected, resulting in 68 diagnosis labels. Through manual review, we captured common alternative phrasings for all diagnoses and handled negation using basic natural language processing.
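The label-extraction step above can be sketched as keyword matching with clause-level negation handling. This is an illustrative reconstruction, not the study's code: the synonym lists and negation cues here are assumptions, and the real pipeline matched 68 MUSE diagnosis terms.

```python
import re

# Hypothetical negation cues; a cue negates a keyword only within its own clause.
NEGATION_CUES = re.compile(r"\b(no|without|negative for|absent|ruled out)\b[^.;]*$")

# Illustrative synonym map (the study used manually reviewed MUSE phrasings).
DIAGNOSIS_SYNONYMS = {
    "atrial fibrillation": ["atrial fibrillation", "afib", "a-fib"],
}

def extract_labels(interpretation: str) -> set:
    """Return the set of diagnosis labels asserted (not negated) in the text."""
    labels = set()
    # Split into clauses so a negation cue only scopes over its own clause.
    for clause in re.split(r"[.;]", interpretation.lower()):
        for label, synonyms in DIAGNOSIS_SYNONYMS.items():
            for syn in synonyms:
                idx = clause.find(syn)
                if idx == -1:
                    continue
                # Skip if a negation cue precedes the keyword in this clause.
                if NEGATION_CUES.search(clause[:idx]):
                    continue
                labels.add(label)
    return labels
```

Splitting on sentence/clause boundaries before checking for preceding negation is a common minimal alternative to full NLP negation tools such as NegEx.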

Labels for CS, PC, and RD were obtained from echocardiogram studies adjudicated by level three board-certified echocardiographers at UCSF between 2012 and 2022. Electrocardiograms obtained within five years of the echocardiogram were used as case ECGs. For each condition, age/sex-matched control ECGs were chosen from patients without the condition at a 4:1 control:case ratio.
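The 4:1 age/sex-matched control selection can be sketched as below. Field names, the age tolerance, and exact-sex matching are our assumptions; the paper states only that controls were age/sex-matched at a 4:1 ratio.

```python
import random

def match_controls(case, control_pool, ratio=4, age_tolerance=5, rng=None):
    """Pick `ratio` controls with the same sex and age within `age_tolerance` years.

    `case` and each element of `control_pool` are dicts with "sex" and "age" keys
    (an assumed record format, not the study's data model).
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility of the sketch
    eligible = [c for c in control_pool
                if c["sex"] == case["sex"]
                and abs(c["age"] - case["age"]) <= age_tolerance]
    # Sample without replacement; fall back to fewer controls if the pool is small.
    return rng.sample(eligible, min(ratio, len(eligible)))
```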

Model development and validation

For the foundation model, we used a custom implementation of the ResNet architecture described previously.5 Data were split by patient in an 8:1:1 training:development:test ratio, and the development dataset was used for hyperparameter selection based on validation loss. The final hyperparameters were: learning rate 0.0001, number of convolutional blocks 15 (with two convolutional layers per block), dropout 0.2, patience 3, and factor 0.1. Each ECG was normalized using the mean and standard deviation across all leads within that ECG. To address class imbalance, the loss function was weighted by the ratio of negative to positive samples for each class. The DNN was trained for 50 epochs with randomly initialized weights, using the Adam optimizer and a batch size of 256; the best checkpoint was selected based on validation loss.
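Two of the preprocessing details above, per-ECG normalization across all leads and loss weighting by the negative:positive ratio per class, can be sketched in a few lines. Array shapes and function names are our assumptions; the weights would multiply the positive term of a per-class binary cross-entropy loss.

```python
import numpy as np

def normalize_ecg(ecg: np.ndarray) -> np.ndarray:
    """ecg: array of shape (leads, samples); z-score using the mean and
    standard deviation computed across all leads of this one recording."""
    return (ecg - ecg.mean()) / (ecg.std() + 1e-8)  # epsilon guards flat signals

def positive_class_weights(labels: np.ndarray) -> np.ndarray:
    """labels: binary matrix of shape (n_ecgs, n_classes).
    Returns per-class weights = n_negative / n_positive, so rare positives
    contribute proportionally more to the loss."""
    pos = labels.sum(axis=0)
    neg = labels.shape[0] - pos
    return neg / np.maximum(pos, 1)  # avoid division by zero for empty classes
```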

Evaluation metrics included area under the receiver operating curve (AUC), sensitivity, specificity, and F1 score, computed individually for each diagnostic class. Thresholds for each diagnostic class were selected to maximize sensitivity × specificity on the development set, and confidence intervals were calculated using bootstrapping7 over 100 trials.
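The evaluation steps above can be sketched as follows: choose the per-class threshold maximizing sensitivity × specificity on development data, then bootstrap the test-set AUC over 100 resamples (matching the paper). The AUC uses the Mann–Whitney formulation; all names are ours, not the study's.

```python
import numpy as np

def auc(y_true, y_score):
    """Probability that a random positive scores above a random negative."""
    pos, neg = y_score[y_true == 1], y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def best_threshold(y_true, y_score):
    """Threshold maximizing sensitivity x specificity over observed scores."""
    best_t, best_val = 0.5, -1.0
    for t in np.unique(y_score):
        pred = y_score >= t
        sens = (pred & (y_true == 1)).sum() / max((y_true == 1).sum(), 1)
        spec = (~pred & (y_true == 0)).sum() / max((y_true == 0).sum(), 1)
        if sens * spec > best_val:
            best_t, best_val = t, sens * spec
    return best_t

def bootstrap_auc_ci(y_true, y_score, n_trials=100, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC; resamples lacking a class are skipped."""
    rng = np.random.default_rng(seed)
    n, aucs = len(y_true), []
    for _ in range(n_trials):
        idx = rng.integers(0, n, n)  # sample n indices with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # resample has only one class; AUC undefined
        aucs.append(auc(y_true[idx], y_score[idx]))
    return np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
```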

Transfer learning

For each condition, data used for transfer learning were split by patient in an 8:1:1 training:development:test ratio. Patients in the training and development datasets for the foundation model were excluded from the development and test datasets of each downstream task. For fine-tuning, we initialized the foundation model with the pre-trained weights and trained on each downstream task separately for 50 epochs, selecting the best checkpoint, to create one model per task; all DNN layers were kept trainable. For comparison, for each disease we trained a DNN with identical architecture ‘from scratch’ with randomly initialized weights. Hyperparameters were tuned for each model individually on the development dataset by grid search over: learning rate (1e−3, 1e−4, 1e−5, 1e−6), patience (3, 5, 10), and factor (0.1, 0.5).
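The hyperparameter grid above can be enumerated explicitly. Each configuration would drive one fine-tuning (or from-scratch) run, with the winner chosen by development-set loss; the training call itself is omitted here, so this only shows the 4 × 3 × 2 = 24 candidate settings.

```python
import itertools

# Grid as stated in the text; key names are ours.
GRID = {
    "learning_rate": [1e-3, 1e-4, 1e-5, 1e-6],
    "patience": [3, 5, 10],
    "factor": [0.1, 0.5],
}

# Cartesian product of all hyperparameter values -> one dict per configuration.
configs = [dict(zip(GRID, values)) for values in itertools.product(*GRID.values())]
```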

Results

Foundation model baseline performance

The cohort used to develop the foundation model was 47.4% female, and mean age was 57.9 ± 17.7 years. The average number of ECGs per patient was 3.96 ± 7.92. The mean number of diagnoses per ECG was 2.20 ± 1.35. Nineteen diagnoses had fewer than 4000 ECGs with that diagnosis.

The foundation model demonstrated a median AUC of 0.978 across all 68 diagnoses (25th percentile 0.961, 75th percentile 0.992) and a mean AUC of 0.968 ± 0.032 on the held-out test set. The model performed best on rhythm diagnoses (average frequency-weighted AUC 0.993), followed by axis deviation (AUC 0.982) and myocardial ischaemia/injury (AUC 0.970). There was no correlation between diagnosis frequency and AUC performance (r = 0.143, p = 0.243). The model achieved high performance for high-frequency diagnoses (e.g. sinus tachycardia, sinus bradycardia, and atrial fibrillation) as well as for low-frequency diagnoses (e.g. ventricular tachycardia, Brugada syndrome, and trigeminy) and also diagnoses not commonly included in ECG artificial intelligence (AI) models (e.g. ventricular aneurysm, pericardial effusion, and subendocardial injury) (see Supplementary material online). Results are summarized across the eight diagnostic categories in Table 1.

Table 1.

Performance of the deep neural network electrocardiogram foundation model on the test set for 68 electrocardiogram labels summarized across eight electrocardiogram diagnostic categories

Diagnostic ECG category (# diagnoses in category) | Positive examples | AUC | Sensitivity | Specificity | F1 | PPV
Rhythm (20) | 188 958 | 0.993 | 0.970 | 0.969 | 0.872 | 0.831
Conduction (14) | 54 828 | 0.978 | 0.946 | 0.925 | 0.573 | 0.438
Axis deviation (3) | 16 907 | 0.982 | 0.973 | 0.927 | 0.604 | 0.443
Chamber enlargement (6) | 27 393 | 0.934 | 0.878 | 0.842 | 0.437 | 0.310
Myocardial infarction (7) | 29 606 | 0.966 | 0.918 | 0.898 | 0.409 | 0.273
Myocardial ischaemia/injury (5) | 25 433 | 0.970 | 0.941 | 0.896 | 0.453 | 0.304
Nonspecific changes (4) | 8208 | 0.918 | 0.864 | 0.818 | 0.278 | 0.168
Other (9) | 5649 | 0.963 | 0.904 | 0.913 | 0.232 | 0.139

Evaluation metrics are reported as the mean of that metric across the diagnoses within each class. Other class includes digitalis effect, hyperkalaemia, hypokalaemia, pericardial effusion, pericarditis, pulmonary embolism, left ventricular aneurysm, dextrocardia, and lead misplacement.

Transfer learning

To determine the effectiveness of our foundation model in downstream tasks, we fine-tuned the model for three transfer learning tasks in which labelled training data were limited and which are ‘novel’ ECG diagnosis tasks that cardiologists cannot perform through manual ECG interpretation: detection of CS, PC, and RD. There were 39 CS patients (354 ECGs), 105 PC patients (1992 ECGs), and 131 RD patients (1287 ECGs). Deep neural network classifiers trained via fine-tuning the DNN foundation model showed significantly improved performance over DNNs trained ‘from scratch’ with randomly initialized weights (Figure 1). The DNN foundation model fine-tuned to detect CS achieved a median AUC of 0.772 (95% CI 0.723–0.816), compared to 0.492 (95% CI 0.434–0.558) when trained from scratch. Similarly, the foundation model trained to detect PC achieved a median AUC of 0.883 (95% CI 0.863–0.904) and the DNN foundation model trained to detect RD achieved a median AUC of 0.826 (95% CI 0.802–0.854), compared to 0.689 (95% CI 0.656–0.720) and 0.701 (95% CI 0.657–0.745), respectively, when DNNs were trained from scratch.

Figure 1.

(Top) Receiver operating characteristic curves for individual deep neural networks trained either by fine-tuning the pre-trained deep neural network foundation model (purple line) or training a deep neural network from scratch (green line) for the three low data tasks of detecting carcinoid syndrome, pericardial constriction, and rheumatic doming of the mitral valve from electrocardiogram waveforms. (Bottom) Performance metrics for the fine-tuned deep neural network foundation model for carcinoid syndrome, pericardial constriction, and rheumatic doming of the mitral valve.

Discussion

In this study, we demonstrated that a single DNN (the foundation model) can itself be used to identify a broad array of 68 ECG diagnoses, performing similarly to or exceeding previously published models with fewer ECG diagnoses.1,2,4,8 By pre-training on this broad range of ECG diagnoses, our foundation model learned an information-rich representation of ECG features that enabled it to improve performance of downstream models for three novel, data-limited ECG tasks. For the three demonstrated transfer learning tasks of CS, PC, and RD, all of which are outside the domain of the pre-training tasks and all of which are novel ECG biomarker tasks, the foundation model provided statistically significant performance improvements compared to training DNNs from scratch with the same limited training data. These results demonstrate the ability of the ECG foundation model to improve overall performance and training data-efficiency of DNN training, which is particularly helpful for rare conditions.

While prior efforts have shown that pre-training DNNs for ECG prediction can improve performance when training datasets are small,5,9,10 they have been limited to a highly restricted number of diagnoses—most commonly within the diagnostic category of arrhythmias—and have only shown the benefit of pre-training by artificially restricting dataset size to mimic rare diagnoses.9,10 Our work is unique in training an ECG foundation model on a large set of 68 real-world ECG diagnoses and then demonstrating that fine-tuning this model on tasks outside the original domain of pre-training substantially increases performance for three separate ECG diagnoses that are both uncommon and not identifiable through expert ECG interpretation. For all three demonstration tasks, the fine-tuned foundation model achieved substantially higher performance than DNNs trained from scratch, demonstrating the ability of foundation models to augment ECG DNN training in real-world low-data scenarios where enhanced detection of rarely encountered conditions can augment physician sensitivity. We aim to release the weights for our pre-trained foundation model which we hope will help to advance the field and facilitate future ECG DNN transfer learning efforts, particularly for other less common ECG diagnoses.

Our study has several limitations. Though large and obtained over decades, our ECG dataset is derived from a single centre, raising the possibility of performance degradation if used in other centres. External validation studies are needed to verify that our foundation model performs similarly both for fine-tuning of other ECG diagnoses and with ECG data from other populations. We anticipate that public release of our foundation model will readily enable such validation. We acknowledge that the prevalence of our rare diseases used in training is much higher than what would be encountered in practice, resulting from the 4:1 control:case ratio used for training. Though the primary performance metrics of AUC, sensitivity, and specificity are not affected by disease prevalence in the test dataset, F1 score and positive and negative predictive values are affected by disease prevalence, and this would need to be considered during real-world deployment.

Supplementary Material

ztaf051_Supplementary_Data

Contributor Information

Stephanie M Hu, Department of Medicine, Division of Cardiology, University of California, San Francisco, 555 Mission Bay Blvd South, San Francisco, CA 94158, USA.

Joshua P Barrios, Department of Medicine, Division of Cardiology, University of California, San Francisco, 555 Mission Bay Blvd South, San Francisco, CA 94158, USA; Bakar Computational Health Sciences Institute, University of California, San Francisco, 480 16th St, San Francisco, CA 94158, USA.

Geoffrey H Tison, Department of Medicine, Division of Cardiology, University of California, San Francisco, 555 Mission Bay Blvd South, San Francisco, CA 94158, USA; Bakar Computational Health Sciences Institute, University of California, San Francisco, 480 16th St, San Francisco, CA 94158, USA; Center for Biosignal Research, University of California, San Francisco, 555 Mission Bay Blvd South, San Francisco, CA 94158, USA.

Supplementary material

Supplementary material is available at European Heart Journal – Digital Health.

Funding

This work was partially funded by the National Institutes of Health through grants 1R56HL161475 and DP2HL174046; and the University of California Noyce Initiative.

Data availability

The data underlying this article cannot be shared publicly because they derive from clinical care and to protect patient privacy. Reasonable requests for collaboration can be sent to the corresponding author.

Lead author biography


Stephanie M. Hu is a second-year medical student at UCSF with an interest in applying machine learning and other computational methods to improving the diagnosis and management of disease. She has published in various fields including surgical robotics and deep learning models for counterfactual prediction. Prior to medical school, she received her B.S. in Computer Science from MIT and worked as a software engineer in health tech and quantitative finance.

References

1. Hannun AY, Rajpurkar P, Haghpanahi M, Tison GH, Bourn C, Turakhia MP, et al. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nat Med 2019;25:65–69.
2. Ribeiro AH, Ribeiro MH, Paixão GMM, Oliveira DM, Gomes PR, Canazart JA, et al. Automatic diagnosis of the 12-lead ECG using a deep neural network. Nat Commun 2020;11:1760.
3. Gustafsson S, Gedon D, Lampa E, Ribeiro AH, Holzmann MJ, Schön TB, et al. Development and validation of deep learning ECG-based prediction of myocardial infarction in emergency department patients. Sci Rep 2022;12:19615.
4. Hughes JW, Olgin JE, Avram R, Abreau SA, Sittler T, Radia K, et al. Performance of a convolutional neural network and explainability technique for 12-lead electrocardiogram interpretation. JAMA Cardiol 2021;6:1285–1295.
5. Jang JH, Kim TY, Yoon D. Effectiveness of transfer learning for deep learning-based electrocardiogram analysis. Healthc Inform Res 2021;27:19–28.
6. Pan SJ, Yang Q. A survey on transfer learning. IEEE Trans Knowl Data Eng 2010;22:1345–1359.
7. Efron B. Bootstrap methods: another look at the jackknife. Ann Stat 1979;7:1–26.
8. Kashou AH, Ko WY, Attia ZI, Cohen MS, Friedman PA, Noseworthy PA. A comprehensive artificial intelligence-enabled electrocardiogram interpretation program. Cardiovasc Digit Health J 2020;1:62–70.
9. Gopal B, Han R, Raghupathi G, Ng A, Tison G, Rajpurkar P. 3KG: contrastive learning of 12-lead electrocardiograms using physiologically-inspired augmentations. Proc Mach Learn Res 2021;158:156–167.
10. Weimann K, Conrad TOF. Transfer learning for ECG classification. Sci Rep 2021;11:5251.



Articles from European Heart Journal. Digital Health are provided here courtesy of Oxford University Press on behalf of the European Society of Cardiology
