Abstract
Heart failure (HF) poses a significant public health challenge, with a rising global mortality rate. Early detection and prevention of HF could significantly reduce its impact. We introduce a novel methodology for predicting HF risk using 12-lead electrocardiograms (ECGs): a lightweight dual attention ECG network designed to capture the complex ECG features essential for early HF risk prediction, despite the notable imbalance between low- and high-risk groups. This network incorporates a cross-lead attention module and 12 lead-specific temporal attention modules, focusing on cross-lead interactions and each lead’s local dynamics. To further alleviate model overfitting, we leverage a large language model (LLM) with a public ECG-Report dataset for pretraining on an ECG-Report alignment task. The network is then fine-tuned for HF risk prediction using two specific cohorts from the UK Biobank study: patients with hypertension (UKB-HYP) and those who have had a myocardial infarction (UKB-MI). The results reveal that LLM-informed pre-training substantially enhances HF risk prediction in these cohorts. The dual attention design improves not only interpretability but also predictive accuracy, outperforming existing competitive methods with C-index scores of 0.6349 on UKB-HYP and 0.5805 on UKB-MI. This demonstrates our method’s potential in advancing HF risk assessment with complex clinical ECG data.
Index Terms: Large language model, multi-modal learning, heart failure, risk prediction, interpretable artificial intelligence, electrocardiogram
I. Introduction
HEART failure (HF) is a complex cardiovascular syndrome, in which the heart fails to pump sufficient blood to meet the body’s demands. Common causes of HF are cardiac structural and/or functional abnormalities, including heart attack, cardiomyopathy, and high blood pressure. HF is a chronic and progressive disease. In England, admissions due to HF have escalated notably, rising from 65,025 in 2013/14 to 86,474 in 2018/19, a 33% increase, as reported by the British Heart Foundation [1]. It has been found that around 50% of deaths in HF patients present with a sudden and unexpected pattern [2, 3], which places a tremendous burden on patients with HF, their families, and healthcare systems worldwide.
Preventing HF early is crucial to reduce its health and economic impacts, yet HF diagnosis often occurs late, when patients have already developed serious symptoms [4, 5]. A promising strategy for improving HF management is the development of risk prediction models for future HF events. These models generate a risk score for a patient over a specific timeframe, taking into consideration the patient’s specific characteristics. With such a personalized risk assessment, more tailored HF management strategies and/or treatment recommendations can be provided. In this context, the low-cost 12-lead electrocardiogram (ECG), a medical test commonly used in clinical practice, serves as a valuable resource for evaluating a patient’s cardiovascular health and uncovering the risk. Recent studies have already found that several markers detected from clinically acquired 12-lead ECGs are associated with future HF events, such as prolonged QRS duration [6–10], conduction disorders (left and right bundle-branch blocks) [11, 12], etc. A significant limitation of this previous research lies in its reliance on a limited set of biomarkers (e.g., QRS duration), which are identified through predetermined rules based on ECG data. Moreover, much of this research has employed simple linear models for modeling the risk ratio associated with HF. While linear models offer a straightforward and interpretable framework for risk assessment, they may fail to capture the intricate subtleties and nuances embedded within the ECG for early-stage HF risk prediction.
In recent years, deep learning-based approaches with neural networks have shown great capacity to automatically extract features from raw data and utilize them to perform a wide range of tasks. In the field of ECG analysis, deep neural networks have shown competitive performance on a wide range of tasks, including disease classification, waveform prediction, rhythm detection, mortality prediction, and automated report generation, with higher accuracy compared to traditional approaches with handcrafted features [13–15]. Yet, two prevalent limitations of deep learning-based methods are over-fitting and poor interpretability. Over-fitting occurs when a model directly memorizes the training data instead of learning task-relevant features, reducing its ability to perform well on new, unseen data. The large number of parameters and layers in deep neural networks can further exacerbate the risk of overfitting, especially when the available training data is limited. Additionally, the complex architecture of these networks makes it difficult to trace and understand the decision-making process behind their predictions, leading to challenges in interpretability.
In this work, we aim to develop a deep learning-based HF risk prediction model with improved feature learning, higher data efficiency, and explainability. To this end, we design a novel, lightweight ECG dual attention network. This network is capable of capturing intricate cross-lead interactions and local temporal dynamics within each lead. The dual attention mechanism also enables the visualization of lead-wise attention maps and temporal activation across each lead for improved explainability of neural network behaviors. To further alleviate the over-fitting with improved data efficiency and more importantly, to learn clinically useful representations from ECG data for higher precision, we adopt a two-stage training scheme. We first employ a large language model (LLM) to pre-train the network using a large public ECG-Report dataset covering a wide spectrum of diverse diseases and then finetune the network for the HF risk prediction task on two specific cohorts with specific risk factors collected from the UK Biobank study. Specifically, we employ ClinicalBERT [16] to extract text embeddings from ECG report and force the extracted ECG features to be aligned with corresponding text features at the pretraining stage. We hypothesize that such an ECG-text feature alignment learning paradigm can better facilitate the deep neural network to capture clinically useful patterns, which provide a more holistic picture of patients and potentially yield more accurate risk predictions. To the best of our knowledge, this is the first work that applies LLM-informed pre-training to benefit the training of the downstream ECG-based HF risk prediction model.
To summarize, our contribution is two-fold:
1) A novel deep neural network architecture to enhance both the representation learning of ECG features and the model’s interpretability: The proposed network not only yields a quantitative risk score but also offers qualitative, interpretable insights into the neural network’s reasoning through a dual attention mechanism. This unique feature acts as a transparent medium, enabling both clinicians and readers to observe and understand the intricate relationships between different ECG leads as well as the dynamic temporal patterns within each lead, highlighting the particular leads and time segments that hold the greatest importance for reliable risk prediction.
2) An effective model pretraining strategy with LLM: We design an LLM-informed multi-modal pretraining task so that clinical knowledge can be transferred to the downstream risk prediction task. Training a deep risk prediction network is challenging due to a lack of sufficient event data. In this work, we advocate the strategic use of large language models, coupled with structured ECG reports with confidence scores, to guide the pretraining process toward more data-efficient, accurate risk prediction models.
II. Related Work
A. Risk prediction
Risk prediction aims to estimate the chance h(t) that a patient will have a certain event in an infinitesimal time interval given that the event has not occurred before. A common approach is the Cox proportional hazards formulation (CoxPH) [17]. The CoxPH model is a statistical model for predicting the time-to-event outcomes based on the assumption that the hazard function h(t) is proportional to a set of features or variables associated with the subject being studied. Mathematically, it can be defined as: h(t|x) = h0(t) exp(gθ(x)), where h(t|x) models the hazard rate at which events occur at a time t taking the subject information x into account, h0(t) is the baseline hazard function shared by all observations, depending only on the time t, and exp(gθ(x)) is the risk score for observation x. Conventional CoxPH models assume that the input observation is a set of covariates (e.g., sex, age, medical history, smoking) and constrain the function g(·) to the linear form: gθ(x) = θ⊺x = θ1×x1+θ2×x2+· · ·+θp×xp with θ being the weights to the set of p covariates in x : x1, x2, …, xp [17]. In the existing body of literature pertaining to HF risk prediction, the predominant methodologies employ a predefined set of variables to construct a linear model, subsequently isolating variables that exhibit a high correlation for the purpose of risk prediction. This approach includes parameters such as prolonged QRS duration [6–10], various conduction disorders (specifically left and right bundle-branch blocks) [11, 12], elongated QT intervals [7], abnormalities in QRS/T angles and T wave patterns [7, 18–21], and ST-segment depression in the V5 precordial lead [7, 22].
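As a concrete illustration of the linear CoxPH risk score gθ(x) = θ⊺x described above, a minimal sketch follows (the function name, covariate values, and weights are illustrative only, not taken from any cited study):

```python
import numpy as np

def coxph_risk(x, theta, h0_t=1.0):
    """Linear CoxPH hazard h(t|x) = h0(t) * exp(theta^T x) for one subject.

    x: covariate vector (e.g. age, sex, QRS duration), length p
    theta: fitted coefficients for the p covariates
    h0_t: baseline hazard h0(t) at the time of interest
    """
    return h0_t * np.exp(np.dot(theta, x))

# A covariate with weight log(2) doubles the hazard per unit increase.
risk = coxph_risk(np.array([1.0, 5.0]), np.array([np.log(2.0), 0.0]))
```

Here the second covariate has zero weight, so only the first contributes: the hazard ratio relative to baseline is exp(log 2 · 1) = 2.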
More recently, deep learning approaches such as DeepSurv [23] replace the linear function g(·) with a neural network f, a deep architecture parameterized by the network weights θ. This relaxes the risk model to a non-linear form; moreover, the input can be high-dimensional and need not be linearly independent, as the neural network is capable of extracting hierarchical features for risk score estimation in a non-linear fashion. Such an approach has been found to have superior performance over traditional approaches [23] on different risk prediction or survival prediction tasks. For example, researchers have successfully applied neural networks to automatically discover latent features from high-dimensional data, such as whole-slide pathology images [24, 25], 4D cardiac shape with motion information [26], etc., for risk/survival prediction. In this work, we adhere to a similar philosophy and concentrate on creating neural networks designed to autonomously extract features from intricate 12-lead ECG waves, bypassing the reliance on a predefined set of ECG parameters.
B. Large language model for healthcare
Large language models (LLMs) are a class of artificial intelligence (AI) algorithms designed to understand human language. They can answer questions, provide summaries or translations, and create stories or poems [27]. Recent studies have found that LLMs can be effective in guiding representation learning on image data [28], enabling knowledge transfer to several downstream vision tasks such as image classification, object detection, and segmentation. In the medical domain, this kind of multi-modal pretraining has been exploited for a better understanding of imaging data, such as chest X-rays and magnetic resonance images, to benefit downstream disease classification and medical image segmentation tasks [29–33]. Beyond image data, concurrent works have most recently explored the connection between natural language and signals (ECG, EEG) for better disease classification [34–36]. To the best of our knowledge, there is no existing risk prediction work that explores the benefit of combining ECG reports with ECG waves for better representation learning. Compared to disease classification, the task of risk prediction is more challenging due to the lack of event records for training, which amplifies the overfitting issue associated with deep neural networks. Our work provides a promising transfer learning approach to alleviate the need for a large number of event data, by utilizing LLMs and additional public large ECG-report datasets to conduct multi-modal pretraining.
III. Methods
A. Overview
Assume we have a dataset of N triplets {(xi, δi, ti)} recording the HF events in a population. Here, xi ∈ ℝ12×T is a 12-lead ECG signal (I, II, III, aVL, aVR, aVF, V1–V6) with a recording length of T; δi indicates whether the subject has a known date of HF diagnosis; ti is the number of months to the censoring time if there is no reported HF event during the follow-up period (δi = 0, right censored), or the number of months until the patient was diagnosed with HF during the follow-up (δi = 1, uncensored). The objective is to obtain a risk prediction model f parameterized by θ so that it can predict a patient’s risk of HF given the patient’s ECG data. To this end, we design an ECG dual attention network (ECG-DAN), as shown in Fig. 1(a), where the input is a 12-lead ECG median wave recording x, and the output of the network is a single node r, which estimates the risk score r = f(x; θ).
Fig. 1.
(a) Overview of the ECG dual attention encoder-based risk prediction network (ECG-DAN). A 12-lead ECG recording x is sent to the ECG dual attention encoder, which is capable of simultaneously extracting both cross-lead relationships as well as temporal dynamic patterns within each lead, for better feature aggregation. Then, features from two routes are added and then sent to a max-average pooling layer, producing a flattened feature vector zecg. Finally, we employ a multi-layer perceptron (MLP) module to map from a high-dimensional feature space into a risk score (scalar) r for heart failure. (b) Overview of the core attention module used in lead attention and temporal attention modules. See Sec. IIIB for more details.
B. ECG-DAN network
A signature of ECG-DAN is its dual attention module, which is designed to extract morphological and spatial changes and relationships across different leads as well as the temporal dynamics inside each lead, for a more comprehensive understanding of the heart’s electrical activity. We first process each lead signal via a group of 1D convolution layers for noise filtering and feature extraction, which gives Cin-channel features for each lead at each time point. We then employ a set of K residual blocks with K 2× down-sampling layers along the temporal dimension to extract features at different scales. The output for each lead consists of feature maps with Cout channels and a reduced time dimension of T/2^K. The features from all 12 leads, h : {h1, h2, …, h12}, are then concatenated and sent to the Lead Attention (LA) block, facilitating the learning of cross-lead interactions to enhance feature aggregation globally. Concurrently, a 12-lead Temporal Attention (TA) component is employed to capture crucial temporal patterns within each lead hi along the time dimension. Here, for each lead hi, we apply an individual temporal attention module TAi across the time domain separately, given that different leads correspond to different directions of cardiac activation within three-dimensional (3D) space. The outcomes of the two attention modules are added, and the result is aggregated by a concat-pooling module, which concatenates flattened features from max-pooling and average-pooling operations, following common practice from previous ECG analysis work [13]:
zecg = Concat(MaxPool(h̃), AvgPool(h̃)), where h̃ = LA(h) + TA(h) | (1)
With this latent feature zecg, we then send it to a multi-layer perceptron (MLP), which consists of three linear layers: it first reduces the feature dimension by half, then projects the feature into a 3D feature space, and finally regresses it to a risk score r.
Next, we explain the two key modules for improved feature learning and model explainability, the lead attention module and the temporal attention module, in more detail.
1). Lead attention
We apply the multi-head attention-based encoder module introduced in transformers [37] to capture the cross-lead interactions. Specifically, we first reshape the feature matrix into a sequence of 12 tokens (12 being the number of leads), where each lead feature is treated as a token. As shown in Fig. 1(b), a sinusoid positional encoding PE [37] is added to the input feature, h ← LayerNorm(h + PE(h)), to encode the contextual information with layer normalization [38], followed by a multi-head self-attention module to capture cross-lead interactions. At a high level, the inputs to the attention are a key matrix K, a query matrix Q, and a value matrix V, where the key and query matrices are used to compute the weights that re-weight the value matrix. Mathematically, this can be expressed as:
h̃ = ALA V, where ALA = softmax(QK⊺ / √Dk) | (2)
where ALA ∈ ℝ12×12 is the attention weight matrix, and Dk is the feature dimension of the key matrix. We use the same input h for computing K, Q, and V.
In addition to the self-attention, a fully connected feed-forward network (FFN) is applied to refine the re-weighted features with a residual connection:
h′ = LayerNorm(h̃ + FFN(h̃)), where FFN(h̃) = W2 ReLU(W1h̃ + b1) + b2 | (3)
where W1, W2 and b1, b2 are the weights and biases of the two linear layers in the FFN, and LayerNorm is a layer normalization layer [38] for feature normalization. In this way, the output feature is adjusted taking the information from all other leads into account.
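To make the cross-lead attention of Eq. (2) concrete, here is a minimal single-head NumPy sketch over 12 lead tokens (function and weight names are ours; the actual model uses multi-head attention plus the positional encoding, layer normalization, and FFN described above):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lead_attention(h, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over lead tokens.

    h: (12, d) matrix, one d-dimensional feature vector per lead
    Wq, Wk, Wv: (d, d) learned projection matrices
    Returns re-weighted features (12, d) and the (12, 12) lead attention map.
    """
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # A_LA in Eq. (2)
    return A @ V, A

rng = np.random.default_rng(0)
h = rng.standard_normal((12, 8))
W = np.eye(8)  # identity projections, for illustration
out, A = lead_attention(h, W, W, W)
```

Each row of `A` sums to 1, so every lead's output is a convex combination of all 12 lead features; this matrix is exactly what is visualized as the lead-wise attention map for interpretability.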
2). 12-lead temporal attention (TA)
The structure of the temporal attention module is very similar to that of the lead attention. The only difference is that there are now 12 separate attention modules, one per lead, so that each module is adapted to a specific lead and extracts temporal dynamic features locally. Specifically, we split the feature h along the lead dimension, and for each lead feature hi, we first add sinusoid positional encoding PE information along the time domain and then process each lead with a separate temporal attention module, where each time point is treated as a token. In other words, the input becomes a sequence of Cout-dimensional features with a sequence length of T/2^K, and the corresponding temporal attention matrix for lead i becomes ATAi ∈ ℝ(T/2^K)×(T/2^K). Similar to the lead attention module, we reweight and normalize the lead feature hi with the generated temporal attention matrix and then employ an FFN module to refine the features for each lead. The output of the temporal attention is the concatenation of the outputs from the 12 lead-wise temporal attention modules. The whole process can be defined as follows:
TA(h) = Concat(TA1(h1), TA2(h2), …, TA12(h12)) | (4)
C. Training
Deep learning-based risk prediction often struggles with small datasets, particularly when events are rare, as in our case, where HF events are below 5%. To overcome this and prevent overfitting, we adopt a two-stage training approach, see Fig. 2. Initially, we train our network on a large, diverse public ECG dataset; of note, this dataset does not contain any HF records for risk prediction. After that, we initialize the model with the pretrained parameters and conduct fine-tuning for HF risk prediction on particular populations with documented HF from the UK Biobank study [41]. The pretraining incorporates human-verified ECG reports to align features with clinical knowledge, aiming to improve the model’s ability to discern pathological ECG patterns relevant to HF risk.
Fig. 2.
Training overview. Our model is (a) first pretrained on the ECG-Report alignment task and the signal reconstruction task on a large-scale public dataset (PTB-XL [39, 40]), and then (b) finetuned on the heart failure risk prediction task with two specific cohorts from the UK Biobank where the future HF event data is available. Here, in the PTB-XL dataset, each report has been abstracted to a set of SCP codes with SCP-ECG statement description and confidence score (annotated by human experts). We construct a structured report based on the SCP-ECG protocol [39] and then send it to a frozen LLM to extract clinical knowledge for better representation learning guidance. As one ECG may have multiple SCP-code-relevant statements, we extract text features separately and then use confidence-based reweighting to aggregate features for feature summation. See the text below for more details.
1). Large language model informed model pre-training
For pretraining, we employ the PTB-XL dataset [39, 40], a large, expert-verified collection of 21,799 clinical 12-lead ECGs with accompanying text reports. It features annotations by cardiologists according to the SCP-ECG standard and classifies waveforms into five categories: Normal (NORM), Myocardial Infarction (MI), ST/T Change (STTC), Conduction Disturbance (CD), and Hypertrophy, with possible overlap due to concurrent conditions. We follow the dataset creators’ protocol, using folds 1–8 for training and folds 9–10 for validation and testing during pretraining [39].
Extracting latent text code ztext from reports using a large language model: We employ an LLM to extract the knowledge embedded in ECG reports. Specifically, we use a medical-domain language model, BioClinicalBERT [42], which has been trained on a large number of electronic health records from MIMIC-III [43]. We consider two ways of extracting text embeddings ztext:
Latent text code from raw ECG report (raw): For a piece of ECG report y (in English), we simply feed it to the LLM to get ztext = LLM(y), following [32, 35].
Latent text code from structured ECG report weighted by confidence (structured with confidence): The original PTB-XL dataset also provides ECG-SCP codes generated from the ECG reports. Specifically, each report has been abstracted to a set of SCP codes with an SCP-ECG statement description and a confidence score (annotated by human experts). In this case, we build a structured sentence linking the SCP statement category and the SCP description for each SCP code: y(SCP) = ‘{#Statement Category(SCP)}:{#SCP-ECG Statement Description(SCP)}’. For example, as shown in Fig. 2(a), given the SCP code LNGQT, the input becomes: ‘other ST-T descriptive statements: long QT-interval’. We send this type of structured input to the LLM, which gives an embedding zSCP = LLM(y(SCP)) for each SCP code. It is important to note that a single recording may encompass multiple SCP codes. In that case, multiple SCP embeddings are first derived. These embeddings are then aggregated with weights based on the corresponding confidence scores c to obtain the text embedding for the ECG: ztext = (1/csum) Σ cSCP · zSCP, where csum is the sum of the confidence scores over all statements for the input ECG signal.
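The confidence-weighted aggregation of per-SCP-code embeddings can be sketched as follows (the function name is ours; embedding values here are toy placeholders, not LLM outputs):

```python
import numpy as np

def aggregate_scp_embeddings(scp_embeddings, confidences):
    """Confidence-weighted average of per-SCP-code text embeddings.

    scp_embeddings: (num_codes, dim) array, one z_SCP per structured statement
    confidences: (num_codes,) expert-annotated confidence scores c
    Returns a single z_text of shape (dim,) for the whole ECG recording.
    """
    z = np.asarray(scp_embeddings, dtype=float)
    c = np.asarray(confidences, dtype=float)
    # Weight each embedding by its confidence, normalize by c_sum.
    return (c[:, None] * z).sum(axis=0) / c.sum()

# Two statements: the second has 3x the confidence of the first.
z_text = aggregate_scp_embeddings([[1.0, 0.0], [0.0, 1.0]], [1.0, 3.0])
```

With confidences 1 and 3, the weights are 0.25 and 0.75, so the aggregated embedding leans toward the higher-confidence statement.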
2). ECG-report alignment loss
To align the ECG to report, similar to [28], we first project the latent ECG code zecg and the latent text code ztext to eecg, etext with two learnable projection functions px, py, respectively, so that the two embeddings eecg = px(zecg), etext = py(ztext) are of the same dimension, as shown in Fig. 2(a). We use a distance loss 𝒟 to quantify the dissimilarity between the two:
ℒalign = (1/N) Σi 𝒟(eecg(i), etext(i)) | (5)
where 𝒟 is the cosine embedding distance, 𝒟(eecg, etext) = 1 − cos(eecg, etext), a metric commonly used for measuring the distance between two embeddings. The loss is averaged over a batch of N paired embeddings.
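Under these definitions, the batch alignment loss of Eq. (5) can be sketched as (function name is ours):

```python
import numpy as np

def alignment_loss(e_ecg, e_text):
    """Mean cosine embedding distance D = 1 - cos(e_ecg, e_text) over a batch.

    e_ecg, e_text: (N, dim) arrays of paired projected embeddings.
    """
    cos = np.sum(e_ecg * e_text, axis=1) / (
        np.linalg.norm(e_ecg, axis=1) * np.linalg.norm(e_text, axis=1))
    return float(np.mean(1.0 - cos))

# Perfectly aligned pairs give zero loss; orthogonal pairs give 1.
loss_aligned = alignment_loss(np.array([[1.0, 0.0]]), np.array([[2.0, 0.0]]))
```

Note that the cosine is scale-invariant, so only the direction of the two embeddings matters; the loss ranges from 0 (identical directions) to 2 (opposite directions).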
3). Pretraining loss
The total loss for pretraining is a combination of the ECG-report alignment loss and a signal reconstruction loss, defined as:
ℒpretrain = ℒalign + ℒrec, where ℒrec = (1/N) Σi ‖xi − x̂i‖² | (6)
where we compute the mean-squared error between every input signal xi and its reconstruction x̂i in a batch and average them. Adding a signal reconstruction loss is necessary, as it helps uncover latent generic features in the ECG signal and serves as a regularization term to avoid the latent space collapse problem. During pre-training, we only update the parameters in the ECG encoder, ECG decoder, and the two projectors, while keeping the parameters of the language model frozen for training stability and efficiency, as suggested by prior work [29]. Detailed network structures can be found in the Appendix.
4). Finetuning loss
After pre-training, we copy the model weights to initialize the risk prediction model (see Fig. 2(b)) and then finetune the risk prediction network. The finetuning loss is also a multi-task loss, combining the self-supervised signal reconstruction loss with a risk loss [23] that minimizes the average negative log partial likelihood over the set of uncensored patients (δi = 1: developed HF during the follow-up). The risk loss is defined as:
ℒrisk = −(1/nδ=1) Σi:δi=1 ( fθ(xi) − log Σj∈ℛ(ti) exp(fθ(xj)) ) | (7)

where nδ=1 is the number of uncensored subjects in a batch, fθ(x) is a predicted risk score, and ℛ(ti) = {j : tj ≥ ti} is the risk set of subjects still event-free at time ti.
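A minimal sketch of this negative log partial likelihood (names are ours; the actual implementation and tie handling may differ):

```python
import numpy as np

def risk_loss(risk_scores, times, events):
    """Negative log partial likelihood, Cox/DeepSurv style.

    risk_scores: predicted f_theta(x) per subject
    times: months to event or censoring
    events: 1 = uncensored (developed HF), 0 = right censored
    """
    r = np.asarray(risk_scores, dtype=float)
    order = np.argsort(-np.asarray(times, dtype=float))  # descending time
    r = r[order]
    # Running log-sum-exp over descending times gives, at position i,
    # log sum_{j in R(t_i)} exp(r_j) -- the risk-set normalizer.
    log_risk_set = np.logaddexp.accumulate(r)
    uncensored = np.asarray(events)[order] == 1
    return float(-np.mean(r[uncensored] - log_risk_set[uncensored]))

# Two uncensored subjects with equal risk scores.
loss = risk_loss([0.0, 0.0], [2.0, 1.0], [1, 1])
```

Sorting by descending time makes each subject's risk set a prefix of the sorted array, so the inner sum of Eq. (7) becomes a cumulative log-sum-exp, which is both O(n log n) and numerically stable.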
The total loss is then defined as:
ℒfinetune = ℒrisk + α ℒrec | (8)
where α is a trade-off parameter to balance the contribution of two losses.
During model optimization, an issue with ℒrisk is that it can be very sensitive to the number of uncensored subjects nδ=1 (the number of subjects who develop HF in the follow-up) in the training batch. This causes training instability if that number drops to zero or fluctuates sharply between batches. In real-world datasets, this issue is amplified by the high class imbalance between censored and uncensored subjects. To address this problem, we therefore perform a modified version of stochastic gradient descent for model optimization, selecting each batch of n observations with stratified random sampling so that every batch maintains a ratio of censored to uncensored observations comparable to that of the entire training population.
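The stratified batch construction can be sketched as follows (an illustrative implementation, not the authors' code; the function name and even-split policy are our assumptions):

```python
import numpy as np

def stratified_batches(events, batch_size, rng=None):
    """Yield index batches that keep roughly the training-set ratio of
    censored (0) to uncensored (1) subjects in every batch.

    events: 0/1 event indicator per training subject
    batch_size: target number of subjects per batch
    """
    if rng is None:
        rng = np.random.default_rng(0)
    events = np.asarray(events)
    pos = rng.permutation(np.flatnonzero(events == 1))  # uncensored
    neg = rng.permutation(np.flatnonzero(events == 0))  # censored
    n_batches = max(1, len(events) // batch_size)
    # Split each stratum evenly across batches, then merge and shuffle.
    for p, n in zip(np.array_split(pos, n_batches),
                    np.array_split(neg, n_batches)):
        batch = np.concatenate([p, n])
        rng.shuffle(batch)
        yield batch

# 10% event rate, batch size 20 -> each batch carries ~2 events.
batches = list(stratified_batches([1] * 10 + [0] * 90, batch_size=20))
```

Splitting each stratum evenly guarantees that no batch ends up with zero uncensored subjects, which is exactly the failure mode that destabilizes ℒrisk.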
IV. Experiments
A. Study population
In this study, we focus on two populations highly related to disease progression to HF: patients with hypertension (HYP) and patients with MI. Subjects are selected from the UK Biobank (UKB) study, a large-scale biomedical database of approximately 500,000 subjects containing genetic, demographic, and disease information, which is regularly updated with comprehensive follow-up studies. The UKB dataset consists of a large portion of healthy subjects as well as those with a range of cardiovascular and other diseases. To assess the future risk of HF, our analysis is confined to individuals who have both ECG and imaging data available and had not been diagnosed with HF prior to or at the time of the ECG evaluation.
1). UKB-HYP
HF-free subjects with prevalent HYP (HYP before or during the ECG examination) at baseline from the UKB dataset are studied [41]. We identified 11,581 HF-free HYP subjects. Follow-up time was defined as the time from the baseline ECG measurement until a diagnosis of HF, death, or the end of follow-up (January 5, 2023). Most ECG recordings together with images were taken between 2014 and 2021. Records with less than two years of follow-up time were excluded. Among the 11,581 participants, 162 (1.4%) developed HF. The median follow-up time is 56 months (4.7 years), and the maximum follow-up time is 87 months (7.3 years).
2). UKB-MI
HF-free subjects at baseline with prevalent MI records are studied. Similar to the above selection procedure, we identified 800 subjects. Among them, 32 subjects (4%) developed HF during the follow-up period. The median follow-up time is 53 months (4.4 years), and the maximum follow-up time is 83 months (6.9 years). Code lists used for the retrieval of disease information can be found in the Appendix.
B. Implementation Details
For ECG signals, we use 12-lead median waveforms (covering a single beat) with a frequency of 500 Hz as input. Each lead is preprocessed with z-score normalization and then zero-padded to a length of 1024. We employ 5 residual blocks (K = 5, Cin = 4, Cout = 16) to obtain multi-scale features. The dimension of latent ECG feature zecg is 512, and the dimension of projected features eecg, etext is 128. The default convolutional kernel size is 5. Further network details are provided in the Appendix. We use a batch size of 128 (n = 128) for model updates at pre-training, with random lead masking for data augmentation [44]. AdamW optimizer [45] with stratified batches is used for training. For fine-tuning in post-MI risk prediction (UKB-MI), we maintain a batch size of 128. Given the low HF incidence (< 2%) in UKB-HYP, we increase the batch size to 1024 to include enough uncensored subjects for calculating ℒrisk. Batch-wise dropout is used to stabilize fine-tuning [24] as well as avoid overfitting. The loss function’s weighting parameter (α) is set at 0.5. Pre-training and fine-tuning are conducted over 300 and 100 epochs, respectively, to ensure convergence.
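The per-lead preprocessing described above (z-score normalization followed by zero-padding to 1024 samples) can be sketched as follows; the function name is ours, and the epsilon guard is our addition for flat signals:

```python
import numpy as np

def preprocess_lead(lead, target_len=1024):
    """Z-score normalize one median-wave lead and zero-pad it to target_len.

    lead: 1D array of raw samples for a single lead (500 Hz median beat)
    target_len: padded length expected by the network (1024 per the text)
    """
    lead = np.asarray(lead, dtype=float)
    lead = (lead - lead.mean()) / (lead.std() + 1e-8)  # z-score normalize
    pad = target_len - len(lead)
    # Zero-pad at the end (or truncate if the lead is already longer).
    return np.pad(lead, (0, max(pad, 0)))[:target_len]

padded = preprocess_lead(np.arange(600, dtype=float))
```

Applied to each of the 12 leads, this yields the fixed-size 12×1024 input x expected by the encoder.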
C. Evaluation metrics and evaluation method
We report the concordance index (C-index) [46] as the primary evaluation metric. This metric measures the accuracy of the predicted risk ranking based on the ratio of concordant pairs: a concordant pair is one where the predicted ranking of two subjects aligns with their actual event ordering, while a discordant pair is the opposite. Specifically, a pair (i, j) is concordant if ti < tj and the risk scores satisfy ri > rj, or vice versa. A value of 1 denotes perfect prediction, while a value of 0.5 indicates a prediction quality equivalent to random guessing.
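A straightforward O(n²) sketch of the C-index under right censoring, where only pairs anchored by an observed event are comparable (an illustrative implementation, not the paper's evaluation code; ties in risk are counted as half-concordant, a common convention):

```python
import numpy as np

def c_index(times, events, risks):
    """Concordance index: fraction of comparable pairs whose predicted
    risk ordering matches the observed event-time ordering.

    times: event or censoring time per subject
    events: 1 = event observed, 0 = right censored
    risks: predicted risk score per subject (higher = earlier event expected)
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:  # j outlived (or outlasted censoring) i
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

# Risk ordering perfectly matches event ordering -> C-index of 1.
score = c_index([1, 2, 3], [1, 1, 0], [3.0, 2.0, 1.0])
```

The censored third subject still contributes: since their follow-up exceeds both event times, both events can be compared against them.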
Robust evaluation with repeated two-fold stratified cross-validation
Due to the scarcity of uncensored subjects (with HF events), conventional deep learning data splits (e.g., five-fold cross-validation or a 7:1:2 split for training, validation, and testing) would leave too few HF events for reliable C-index evaluation. Hence, we opt for a 1:1 training-to-testing ratio with stratified cross-validation, aligned with patient HF status during follow-up. This approach constitutes a two-fold stratified cross-validation, which also maintains the same proportion of each class as in the original dataset in each split. To further avoid over-fitting, 20% of the training data is randomly allocated as a validation set for hyper-parameter search (e.g., choice of the learning rate) and model selection. Specifically, we split the dataset into two folds, using one for training and validation and the other for testing. This process is then repeated with the roles of the two folds reversed. To ensure that the result is not biased by the splitting, we perform the above procedure five times, each with a different split, across the two datasets (UKB-HYP and UKB-MI). The final model performance is reported as the average (with standard deviation) over these 10 trials.
V. Results
A. LLM-informed pre-training improves the accuracy of the downstream risk prediction
We first compare our proposed pretraining strategy against various pretraining tasks:
No pretraining (represented by a hyphen), serving as a baseline, with networks trained for 400 epochs (equivalent to the sum of pretraining and finetuning epochs);
Pretraining on signal reconstruction only;
Pretraining on signal reconstruction combined with multilabel disease classification (NORM, MI, STTC, CD, hypertrophy), using a cross-entropy loss for classification;
Pretraining on signal reconstruction and ECG-Report alignment using raw text reports;
Our proposed method, pretraining on signal reconstruction and ECG-Report alignment with structured report and confidence information, as detailed in Sec. III-C1.
Results are shown in Table II. Models pretrained using the proposed strategy (signal reconstruction + ECG-report alignment (structured w/ confidence)) consistently obtain high accuracy across the two tasks, achieving average C-indices of 0.6349 (0.0156) on UKB-HYP and 0.5805 (0.0580) on UKB-MI.
Table II.
Comparison of risk prediction performance using different pre-training tasks. All experiments are performed using the same proposed network architecture. Reported values are average C-index over 5 cross-validations using different random splits.
| Pretraining Tasks | Hypertension (UKB-HYP) | Myocardial Infarction (UKB-MI) |
|---|---|---|
| - | 0.6122 (0.0190) | 0.5065 (0.0776) |
| SR | 0.6327 (0.0165) | 0.5069 (0.0770) |
| SR + Classification | 0.6370 (0.0216) | 0.5220 (0.0475) |
| SR + ECG-R Alignment (raw) | 0.6088 (0.0189) | 0.5796 (0.0570) |
| SR + ECG-R Alignment (structured with confidence) | 0.6349 (0.0156) | 0.5805 (0.0580) |
SR: Signal Reconstruction; Classification: ECG Disease Classification; ECG-R: ECG-Report.
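All tables report Harrell's C-index [46], the fraction of comparable subject pairs whose predicted risks are ordered consistently with their observed event times. A minimal reference implementation (didactic only; survival libraries provide optimized versions):

```python
import numpy as np

def c_index(risk, time, event):
    """Harrell's C-index. A pair (i, j) is comparable when the subject with
    the earlier time had an event; it is concordant when that subject also
    received the higher predicted risk. Ties in risk count as 0.5."""
    concordant, comparable = 0.0, 0
    for i in range(len(risk)):
        for j in range(len(risk)):
            if event[i] and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

# perfectly ordered toy example: higher risk -> earlier event
assert c_index(np.array([3., 2., 1.]), np.array([1., 2., 3.]), np.array([1, 1, 0])) == 1.0
```

A value of 0.5 corresponds to random ordering, which is why C-indices only slightly above 0.5 (as on UKB-MI without pretraining) indicate near-chance discrimination.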
B. Comparison study on risk prediction using different network architectures
We further compare model performance using different encoder architectures: 1) the encoder used in a variational autoencoder (VAE)-like network architecture, which has been found effective for ECG signal feature representation learning in previous works [47–50]; 2) XResNet1D [13], the top-performing network architecture on a wide range of ECG analysis tasks, such as ECG disease classification, age regression, and form/rhythm prediction, on the public PTB-XL benchmark dataset [39] and ICBEB2018 dataset [51]. Table III reports their performance when trained without and with the proposed language-model-informed pre-training strategy with structured SCP reports. The proposed ECG attention network has the fewest parameters; yet, models with this architecture and the proposed pre-training strategy obtain the highest C-index scores on both tasks.
Table III.
Comparison of risk prediction performance using different types of deep neural networks. By default, all networks are initialized with weights pre-trained on the proposed ECG-Report alignment and reconstruction tasks (300 epochs) and then fine-tuned on the risk prediction task (100 epochs). For a fair comparison, models without any pre-training are trained for 400 epochs (300+100). To avoid overfitting, the model with the highest C-index score on the validation set is selected as the final model. Reported values are the average C-index over 5 cross-validations using different random splits (10 trials in total).
| Network architectures | # parameters | LLM-informed Pretraining | Hypertension (UKB-HYP) | Myocardial Infarction (UKB-MI) |
|---|---|---|---|---|
| CNN-VAE ([47–50]) | 7.0M | ✗ | 0.6215 (0.0237) | 0.5675 (0.0353) |
| CNN-VAE ([47–50]) | 7.0M | ✓ | 0.6346 (0.0218) | 0.5484 (0.0269) |
| XResNet1D ([13]) | 1.9M | ✗ | 0.5601 (0.0273) | 0.5357 (0.0485) |
| XResNet1D ([13]) | 1.9M | ✓ | 0.6234 (0.0143) | 0.5431 (0.0387) |
| ECG dual attention (The proposed) | 1.4M | ✗ | 0.6122 (0.0190) | 0.5065 (0.0776) |
| ECG dual attention (The proposed) | 1.4M | ✓ | 0.6349 (0.0156) | 0.5805 (0.0580) |
C. Comparison study between traditional ECG parameter-based risk prediction vs ECG dual attention
We further compare our method to the traditional approach based on well-established ECG parameters. Specifically, we collect a set of ECG parameters from the ECG signals, which have been identified in previous studies for incident HF prediction and mortality estimation [6–9, 11, 12, 18–22]. All of these parameters are automatically extracted by the ECG device (using the GE CardioSoft V6), with supporting evidence from previous work: 1) Ventricular rate; 2) Left-axis deviation; 3) Right-axis deviation; 4) Prolonged P-wave duration (> 120 ms) [22]; 5) Prolonged PR interval (> 200 ms) [22]; 6) Prolonged QRS duration (> 100 ms) [22]; 7) Prolonged QT interval (≥ 460 (women)/450 (men) ms using the Framingham formula) [52]; 8) Delayed intrinsicoid deflection (DID time; the maximum value in leads V5 and V6 > 50 ms) [53]; 9) Abnormal P-wave axis (values outside the range of 0° to 75°) [54]; 10) Left ventricular hypertrophy [55]; 11) Abnormal QRS-T angle (> 77° (women)/88° (men)) [22]; 12) Low QRS voltage [56]; 13) ST-T abnormality [56]; 14) Right bundle-branch block [56]; 15) Left bundle-branch block [56]. Following [22], we fit these ECG parameters into the conventional CoxPH linear regression model [17] and use it as a strong competing model for HF risk prediction. Table IV shows the performance of our approach and the traditional ECG parameter-based approach, and Fig. 3 shows Kaplan-Meier plots depicting the survival probability estimates over time, stratified by risk groups defined by each model's predictions. We further plot the features learned in the last hidden layer (∈ ℝ³) of ECGs from the risk prediction branch in the proposed network, together with representative ECG waves with the lowest/highest risk scores, in Fig. 4.
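As a sketch of this baseline, the Cox partial likelihood can be maximized directly with numpy (didactic only; in practice a survival library implementing the CoxPH model of [17] would be used, and the toy data below are synthetic):

```python
import numpy as np

def fit_cox(X, time, event, lr=0.2, n_iter=300):
    """Gradient ascent on the Breslow partial likelihood of a linear
    Cox proportional-hazards model (a didactic stand-in for [17])."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        w = np.exp(eta - eta.max())                  # stabilized relative hazards
        grad = np.zeros_like(beta)
        for i in np.flatnonzero(event):
            at_risk = time >= time[i]                # risk set at the i-th event time
            denom = w[at_risk].sum()
            grad += X[i] - (w[at_risk, None] * X[at_risk]).sum(0) / denom
        beta += lr * grad / max(event.sum(), 1)
    return beta

# synthetic check: a binary marker that shortens survival (e.g. a hypothetical
# prolonged-QRS flag) should receive a positive coefficient
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(150, 1)).astype(float)
t = rng.exponential(1.0 / np.exp(1.5 * x[:, 0]))     # higher hazard when x = 1
beta = fit_cox(x, t, np.ones(150))
assert beta[0] > 0.5
```

The fitted linear risk score `X @ beta` is what enters the C-index comparison against the nonlinear network in Table IV.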
Table IV.
Comparison of risk prediction performance using the traditional ECG parameter-based model and our deep learning model (ECG dual attention). Reported values are the average C-index over five repetitions of two-fold cross-validation using different random splits (10 trials in total).
| Method | Input | Model Type | Hypertension (UKB-HYP) | Myocardial Infarction (UKB-MI) |
|---|---|---|---|---|
| Traditional | 15 ECG parameters | Linear | 0.6149 (0.0125) | 0.5398 (0.0200) |
| ECG dual attention (the proposed) | 12-lead median waves | Nonlinear | 0.6349 (0.0156) | 0.5805 (0.0580) |
Fig. 3.
Kaplan-Meier risk curves for a) the conventional model using a composite of 15 predefined ECG parameters/measurements, and b) the proposed ECG dual attention risk prediction model with the language-informed pretraining. For both models, patients were divided into low- and high-risk groups with a cutoff value referenced from the top 98th percentile (for UKB-HYP) or top 96th percentile (for UKB-MI) of risk scores predicted by the model, reflecting the statistics of the datasets in Table I.
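The stratification behind these curves, a percentile cutoff on the predicted risk followed by a Kaplan-Meier estimate per group, can be sketched as follows (synthetic `risk`/`time` arrays; `km_curve` is an illustrative helper, not the plotting code used for Fig. 3):

```python
import numpy as np

def km_curve(time, event):
    """Kaplan-Meier survival estimate, evaluated just after each event."""
    order = np.argsort(time)
    e = event[order]
    at_risk = len(e) - np.arange(len(e))      # subjects remaining before each time
    factors = np.where(e == 1, 1 - 1 / at_risk, 1.0)
    return time[order][e == 1], np.cumprod(factors)[e == 1]

rng = np.random.default_rng(1)
risk = rng.normal(size=500)                   # model-predicted risk scores
time = rng.exponential(size=500)
event = np.ones(500, dtype=int)               # all events observed, for simplicity

cutoff = np.percentile(risk, 98)              # top-98th-percentile cutoff (UKB-HYP)
high = risk >= cutoff
t_hi, s_hi = km_curve(time[high], event[high])
t_lo, s_lo = km_curve(time[~high], event[~high])
assert np.all(np.diff(s_hi) <= 0)             # survival curves are non-increasing
```

A clear vertical gap between the high- and low-risk curves is what indicates successful stratification.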
Fig. 4.
3D visualization of the last 3-dim hidden feature learned in the risk prediction subnetwork along with the visualization of input ECG waves with lowest predicted risk score (dark purple) and highest predicted risk score (light yellow) on the (a) UKB-HYP and (b) UKB-MI datasets.
D. Dual attention mechanism improves model stability and interpretability
1). Quantitative ablation study
We also evaluate whether the dual attention modules help to enhance the accuracy of risk prediction. We conduct ablation experiments using the same training strategy and the same network, but with the lead and/or the temporal attention module removed. Results are shown in Table V. The proposed model containing both attention modules consistently produces the most stable performance, with the best performance on UKB-HYP and the second-best on UKB-MI, yielding the best average performance across the two datasets.
Table V. Ablation Experiments. Numbers in bold represent the highest values, while numbers with underlines denote the second highest.
| Method | Hypertension (UKB-HYP) | Myocardial Infarction (UKB-MI) |
|---|---|---|
| w/o time attention module | 0.6038 (0.0243) | 0.5855 (0.0353) |
| w/o lead attention module | 0.6018 (0.0418) | 0.4920 (0.0884) |
| w/o lead+time attention module | 0.6167 (0.0105) | 0.5195 (0.0539) |
| The proposed | 0.6349 (0.0156) | 0.5805 (0.0580) |
2). Lead attention matrix visualization
We further visualize the average lead attention matrix at the population level by averaging matrices within different risk groups. For both datasets, patients are divided into low- and high-risk groups with a cutoff value referenced from the top 98th percentile (for UKB-HYP) or top 96th percentile (for UKB-MI) of risk scores predicted by the model, where the threshold is chosen based on the statistics of the datasets. Of note, all risk scores and attention matrices are obtained using the same cross-validation strategy, where the predicted risk of each subject is computed by averaging predictions from models trained on data excluding that subject. Figure 5 illustrates the contribution of each lead to feature re-weighting across all 12 lead features for the different populations. We find similar patterns even though the underlying study cohorts are different.
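The group-averaged maps can be computed directly from the per-subject attention matrices (a schematic sketch with random placeholders standing in for the 12×12 matrices and out-of-fold risk scores):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((1000, 12, 12))      # per-subject 12x12 cross-lead attention matrices
risk = rng.random(1000)             # out-of-fold predicted risk scores

cutoff = np.percentile(risk, 98)    # top-98th-percentile threshold (UKB-HYP)
high = risk >= cutoff
mean_high = A[high].mean(axis=0)    # population-level map, high-risk group
mean_low = A[~high].mean(axis=0)
difference = mean_high - mean_low   # the 'Difference' column of Fig. 5
assert difference.shape == (12, 12)
```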
Fig. 5.
Visualization of lead attention patterns and differences between low-risk and high-risk groups across two different populations: UKB-HYP and UKB-MI.
3). Temporal attention activation map visualization
To visualize the temporal attention matrix at the original ECG time length T, we adapt Grad-CAM for ECG leads to highlight the model's focus. We condense the attention matrix by summing its column values to obtain attention scores across time, and then map them to the ECG input features, creating Grad-CAM maps. These maps, weighted by their attention scores, reveal the network's focus areas on the ECG. More implementation details can be found in the Appendix. The visualization is presented in Fig. 6.
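The condensation step can be sketched as follows (assuming a square low-resolution attention map; `temporal_attention_scores` is an illustrative name, and the Grad-CAM weighting itself is omitted):

```python
import numpy as np

def temporal_attention_scores(attn, T):
    """Condense an LxL temporal attention matrix into per-position scores by
    summing its columns (how much each position is attended to), normalize to
    [0, 1], and upsample to the original signal length T."""
    scores = attn.sum(axis=0)
    scores = (scores - scores.min()) / (np.ptp(scores) + 1e-8)
    grid = np.linspace(0.0, 1.0, len(scores))
    return np.interp(np.linspace(0.0, 1.0, T), grid, scores)

attn = np.random.default_rng(0).random((38, 38))   # low-resolution attention map
heat = temporal_attention_scores(attn, T=600)      # per-sample focus over the ECG
assert heat.shape == (600,)
```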
Fig. 6.
Visualization of cross-lead (a,b) and 12-lead temporal attention maps (d,e) obtained from a high HF-risk ECG with HYP (a,d) and a high HF-risk ECG with MI (b,e). (c) is a schematic standard ECG for illustrative purposes. Source: Wikimedia Commons.
VI. Discussion
Importance of language informed pretraining
In the experiments, we first studied the impact of different pretraining tasks on downstream risk prediction and highlighted the value of LLM-informed pretraining in Table II. In general, pre-training tasks enhance risk prediction performance compared to no pretraining, especially on the smaller dataset (UKB-MI). It is interesting to see that the performance of the deep risk prediction model (average C-index: 0.5065) can be inferior to the traditional approach (average C-index: 0.5398) without proper pretraining; see Table II and Table IV. This indicates that pre-training is important to alleviate model over-fitting. Our study further highlights the critical role of targeted model pretraining in identifying both generic and pathological features for downstream HF risk prediction. Pretraining on structured ECG reports outperforms methods using simple disease labels or unstructured text inputs. We attribute this improved performance to the integration of detailed context from structured reports and supplementary confidence information. Figure 7 shows the UMAP [57] visualization of latent text codes of structured reports zSCP extracted by the LLM, which suggests that the LLM distinguishes between disease-specific embeddings (e.g., separated clusters) and captures the interrelations among various diseases. For instance, the embedding for the normal or healthy category (red dots) is nearer to the ST/T wave change embeddings (blue dots) and notably distant from those for MI (green dots), aligned with the fact that ST/T variations can be non-pathological, unlike the distinct disease pattern of MI.
Fig. 7.
UMAP visualization of latent code embeddings zSCP from the large language model using different structured SCP statements. Different colors represent the categorization of statements by disease label.
ECG dual attention network exhibits high effectiveness of risk stratification
By comparing the results in Table III, it is clear that the proposed approach surpasses leading deep learning approaches in terms of both computational cost and accuracy. Interestingly, we observe that language-informed pretraining, while consistently boosting the performance of the ResNet-based structure and our attention model, does not enhance the CNN-VAE-based model's performance. This phenomenon is probably attributable to the over-parameterization of the CNN-VAE networks (7.0M). In that case, strong regularization that shapes the latent representation toward a Gaussian distribution to ensure feature independence [58] becomes crucial; such regularization might conflict with the language-informed training process.
Results in Table IV also show that our approach surpasses a traditional methodology relying on a predefined set of ECG parameters with a simple linear model, as well as current leading deep learning-based models, in terms of both computational cost and performance. The large overlap between the high-risk and low-risk curves of the MI population (bottom left) when using the traditional approach in Fig. 3 reveals the challenge when the underlying population is limited. By contrast, our method consistently stratifies patients into different risk groups with a clear gap. Figure 4 demonstrates that our network effectively discriminates between low-risk (upper right, purple dots) and high-risk ECGs (bottom left, yellow dots) in the latent space, correlating high-risk ECGs with clinical markers such as prolonged QRS duration and longer QT intervals, in alignment with established studies [6–10].
ECG dual attention network exhibits high interpretability
Furthermore, the proposed dual attention modules offer model interpretability, a feature highly valued in the clinical setting. The dual attention mechanism serves as a window into the network's decision process, allowing clinicians and readers alike to visualize and comprehend the complex interactions between different ECG leads, as shown in Fig. 5. First, we found that the three augmented leads (aVR, aVL, aVF) contributed the least (see blue regions). We believe that this may be because aVR, aVL, and aVF can be derived from leads I, II, and III using Goldberger's equations [59] and thus contain redundant information for feature aggregation and risk prediction. Second, we found that the network typically pays more attention to limb lead I and the precordial leads (V2-V6) (see red regions). In clinical practice, comparing the morphological changes across the precordial leads (V1-V6) can help to identify ST/T wave abnormality, which can be an indicator of future HF [7, 18–21] and sudden cardiac death [60]. On the other hand, upon comparing the attention patterns of high-risk groups to those of low-risk groups, as illustrated in the column titled 'Difference' in Fig. 5, we observed elevated activation values within the high-risk groups. This observation underscores the network's heightened sensitivity in identifying abnormal features.
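The redundancy of the augmented leads follows from Einthoven's and Goldberger's relations, which express III, aVR, aVL, and aVF as fixed linear combinations of leads I and II; a quick numerical check (with random signals standing in for the two limb leads):

```python
import numpy as np

rng = np.random.default_rng(0)
lead_I, lead_II = rng.standard_normal(500), rng.standard_normal(500)

lead_III = lead_II - lead_I            # Einthoven's law
aVR = -(lead_I + lead_II) / 2          # Goldberger's equations [59]
aVL = lead_I - lead_II / 2
aVF = lead_II - lead_I / 2

# the three augmented leads always sum to zero, so they carry no
# information beyond leads I and II
assert np.allclose(aVR + aVL + aVF, 0.0)
```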
For better understanding, we visualize the cross-lead and temporal activation maps of two high-risk cases in Fig. 6. For the temporal maps, we apply Grad-CAM [61] to map the low-resolution temporal attention maps back to the original input signal level (see Appendix for more details). It can be observed that the cross-lead module provides an overview to identify the uniqueness of, or exploit the synergy and interactions among, the various leads, whereas the temporal attention focuses on identifying local areas in each lead important to the prediction. For example, leads III and II stand out in the cross-lead attention maps (a) and (b), respectively, highlighting unique pathological patterns (a prolonged negative QRS in lead III, see (d), or strong abnormal noise in lead II, see the original signals in (e)), in contrast to other leads. Additionally, the orange diagonal clusters in panel (b) showcase good R-wave progression from V2 to V6, see (e). On the other hand, the temporal attention maps in (d) and (e) reveal more intricate details, showing the network's capability to focus on clinically significant features such as P, Q, S, T waves and R peaks, even with strong noise present, and to highlight pathological abnormalities such as abnormal R peaks, T-wave irregularities, and prolonged QRS and QT intervals. This implies that the dual attention network can autonomously discover clinically relevant biomarkers from ECG data without being explicitly taught during its training phase. Future research will aim at collaborating with medical professionals to verify potential novel biomarkers using these visual tools.
Limitations
One limitation of the current work is that it only considers information from ECG signals. In the future, we will consider extending our approach to a broader spectrum of data. The input features could include blood test results, demographics (age, sex, ethnicity) [47], smoking, chronic disease conditions such as diabetes, imaging-derived features [49, 50, 62], as well as genetic information, to create a more holistic and accurate characterization of the patient [8, 9, 63, 64]. Moreover, it would be interesting to increase the inclusivity and diversity of our study by considering a broader population base.
Broader Impact
We believe the proposed LLM-informed pretraining scheme is not limited to the risk prediction task. It also holds the potential to inspire new approaches and applications for ECG signal analysis, such as disease diagnosis and ECG segmentation. The LLM, utilizing textual reports, acts as an additional source of contextual supervision, which guides the ECG network to better understand and extract complex patterns. We would also like to highlight that the public pre-training dataset used, which covers a diverse and comprehensive set of pathologies, is crucial for model generalization. In the future, it would also be interesting to further enhance model interpretability by incorporating the generative capabilities of LLMs, producing structured reports for further explanation.
VII. Conclusion
This paper presents a study with a novel ECG dual attention network for ECG-based HF risk prediction. This network distinguishes itself through its unique blend of being both lightweight and efficient, outperforming existing models in the field. Its standout feature is the ability to generate not just a quantitative risk score but also to provide a qualitative interpretation of its internal processes. This is achieved through the generation of attention visualization maps, which span across and delve within individual leads, offering a granular view of the network’s focus and decision-making process. We hope to contribute to the development of more transparent and interpretable AI-assisted systems, fostering trust and facilitating broader adoption in clinical settings. Additionally, the study presents a multi-modal pretraining approach for risk prediction models, which leverages external public ECG reports and confidence scores from diverse populations, combined with advanced large language models, to address the challenges posed by limited future event data in risk prediction tasks.
Supplementary Material
Table I. Statistics of studied population(s).
| | Characteristics | Total | Had HF during follow-up: Yes | Had HF during follow-up: No |
|---|---|---|---|---|
| UKB-HYP | Age at examination# | 66.31 (45-82) | 69.57 (49-81) | 66.27 (45-82) |
| | Sex (men/women) | 6730/4851 | 109/53 (1.6%/1.1%) | 6621/4798 |
| UKB-MI | Age at examination# | 68.21 (49-81) | 67.28 (50-79) | 68.25 (49-81) |
| | Sex (men/women) | 650/150 | 27/5 (4.3%/3.4%) | 623/145 |
Values represent mean (minimum-maximum).
Acknowledgment
This research has been conducted using the UK Biobank Resource under Application Number ‘40161’.
Biographies
Chen Chen obtained her MSc and Ph.D. degree in Advanced Computing from Imperial College London, in 2016 and 2022, respectively. From 2022, she worked as a research associate at Imperial College London. After that, she joined Oxford BioMedIA group, University of Oxford in 2023. Her research interests include medical image analysis and multi-modal AI.
Lei Li received the PhD degree from the School of Biomedical Engineering, Shanghai Jiao Tong University, in 2021, and the BS degree from the Department of Medical Information Engineering, Sichuan University, in 2016. She is working in the Department of Engineering Science, University of Oxford. Her research interests include cardiac image analysis and multi-modal AI.
Marcel Beetz is currently working towards the PhD degree at the Institute of Biomedical Engineering (IBME), Department of Engineering Science, University of Oxford. His research interests include the development and application of machine learning models for the analysis of multimodal medical data.
Abhirup Banerjee (Member, IEEE) received the BSc (Hons) and Master degrees in Statistics and the PhD degree in Computer Science in 2017 from the Indian Statistical Institute, Kolkata, India. He joined the University of Oxford as Postdoctoral Researcher in August 2017. He received the prestigious Royal Society University Research Fellowship and started as the Faculty Member in the Department of Engineering Science, University of Oxford, UK in October 2022. His research interests include Multimodal Machine Learning, Biomedical Image Analysis, etc.
Ramneek Gupta is the Digital Director, Global Drug Discovery at the Novo Nordisk Research Centre Oxford. He brings experience from previous roles at the Technical University of Denmark and Eli Lilly and Company. He received his Doctor of Philosophy in Bioinformatics from the Technical University of Denmark (1997-2001). His research focuses on using in silico pipelines, novel algorithms, and advanced machine learning to identify novel targets for precision medicine and other research projects.
Vicente Grau is a Professor of Biomedical Image Analysis at the Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford. His research focuses on the development of novel computational methods for the analysis of biomedical images, with a particular interest on the interaction between these and models of cardiovascular function.
Footnotes
In practice, for computational efficiency, following [37], we apply the multi-head attention trick: the queries, keys, and values are first linearly projected h times with different learnable projection matrices to lower dimensions (Dk/h each); the attention matrix is computed and the projected value matrix re-weighted per head; and the h output heads are then concatenated to recover the feature dimension.
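A numpy sketch of this trick (shapes chosen for illustration; in the actual network the projection matrices would be learned):

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, h):
    """Project Q, K, V into h heads of dimension Dk/h, attend per head,
    then concatenate the head outputs to recover the feature dimension."""
    L, Dk = X.shape
    d = Dk // h
    heads = []
    for i in range(h):
        sl = slice(i * d, (i + 1) * d)
        Q, K, V = X @ Wq[:, sl], X @ Wk[:, sl], X @ Wv[:, sl]
        scores = Q @ K.T / np.sqrt(d)                    # scaled dot-product
        A = np.exp(scores - scores.max(axis=-1, keepdims=True))
        A /= A.sum(axis=-1, keepdims=True)               # row-wise softmax
        heads.append(A @ V)                              # re-weighted values
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
X = rng.standard_normal((12, 64))                        # e.g. 12 lead features
Wq, Wk, Wv = (rng.standard_normal((64, 64)) for _ in range(3))
Y = multi_head_attention(X, Wq, Wk, Wv, h=8)
assert Y.shape == (12, 64)
```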
Since the original reports are written in a mixture of German and English, we used the open-source machine translation tool EasyNMT for batch translation following [34], and then used ChatGPT to further refine the failed cases with the prompt: 'translate the ECG report into English: #text'.
Database containing SCP code-statement mappings can be found at https://physionet.org/content/ptb-xl/1.0.1/scp_statements.csv and https://physionet.org/content/ptb-xl/1.0.1/ptbxl_database.csv.
The authors declare no conflict of interest.
Contributor Information
Chen Chen, Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford; Imperial College London; University of Sheffield, Sheffield.
Lei Li, Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford; University of Southampton.
Marcel Beetz, Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford.
Abhirup Banerjee, Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford.
Ramneek Gupta, Novo Nordisk Research Centre Oxford (NNRCO).
Vicente Grau, Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford.
References
- [1].British Heart Foundation. [Accessed on Oct, 2023];Heart failure hospital admissions rise by a third in five years. 2019 [Google Scholar]
- [2].Tomaselli GF, Zipes DP. What causes sudden death in heart failure? Circulation Research. 2004;95(8):754–763. doi: 10.1161/01.RES.0000145047.14691.db. [DOI] [PubMed] [Google Scholar]
- [3].Lane RE, Cowie MR, Chow AW. Prediction and prevention of sudden cardiac death in heart failure. Heart. 2005;91(5):674–680. doi: 10.1136/hrt.2003.025254. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [4].Yancy CW, et al. Clinical presentation, management, and in-hospital outcomes of patients admitted with acute decompensated heart failure with preserved systolic function: A report from the acute decompensated heart failure national registry (ADHERE) database. Journal of the American College of Cardiology. 2006;47(1):76–84. doi: 10.1016/j.jacc.2005.09.022. [DOI] [PubMed] [Google Scholar]
- [5].Thompson BS, Yancy CW. Immediate vs delayed diagnosis of heart failure: Is there a difference in outcomes? results of a harris interactive® patient survey. Journal of Cardiac Failure. 2004;10(4):S125 [Google Scholar]
- [6].da Silva RMFL, et al. P-wave dispersion and left atrial volume index as predictors in heart failure. Arquivos Brasileiros de Cardiologia. 2013;100(1):67–74. doi: 10.1590/s0066-782x2012005000115. [DOI] [PubMed] [Google Scholar]
- [7].Triola B, et al. Electrocardiographic predictors of cardiovascular outcome in women: The national heart, lung, and blood institute-sponsored women’s ischemia syndrome evaluation (wise) study. Journal of the American College of Cardiology. 2005;46:51–56. doi: 10.1016/j.jacc.2004.09.082. [DOI] [PubMed] [Google Scholar]
- [8].Khan SS, et al. 10-year risk equations for incident heart failure in the general population. Journal of the American College of Cardiology. 2019;73(19):2388–2397. doi: 10.1016/j.jacc.2019.02.057. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [9].Khan SS, et al. Development and validation of a Long-Term incident heart failure risk model. Circulation Research. 2022;130(2):200–209. doi: 10.1161/CIRCRESAHA.121.319595. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [10].Ilkhanoff L, et al. Association of QRS duration with left ventricular structure and function and risk of heart failure in middle-aged and older adults: The Multi-Ethnic study of atherosclerosis (MESA) European Journal of Heart Failure. 2012;14(11):1285–1292. doi: 10.1093/eurjhf/hfs112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [11].Zhang Z-M, et al. Ventricular conduction defects and the risk of incident heart failure in the atherosclerosis risk in communities (ARIC) study. Journal of Cardiac Failure. 2015;21(4):307–312. doi: 10.1016/j.cardfail.2015.01.001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [12].Zhang Z-M, et al. Different patterns of bundle-branch blocks and the risk of incident heart failure in the Women’s Health Initiative (WHI) study. Circulation: Heart Failure. 2013;6:655–661. doi: 10.1161/CIRCHEARTFAILURE.113.000217. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [13].Strodthoff N, et al. Deep learning for ECG analysis: Benchmarks and insights from PTB-XL. IEEE Journal of Biomedical and Health Informatics. 2021;25(5):1519–1528. doi: 10.1109/JBHI.2020.3022989. [DOI] [PubMed] [Google Scholar]
- [14].Ardeti VA, et al. An overview on state-of-the-art electrocardiogram signal processing methods: Traditional to AI-based approaches. Expert Systems with Applications. 2023;217:119561 [Google Scholar]
- [15].Hughes JW, et al. A deep learning-based electrocardiogram risk score for long term cardiovascular death and disease. NPJ digital medicine. 2023;6(1):169. doi: 10.1038/s41746-023-00916-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [16].Alsentzer E, et al. Publicly available clinical BERT embeddings; Proceedings of the 2nd Clinical Natural Language Processing Workshop; Minneapolis, Minnesota, USA. 2019. pp. 72–78. [Google Scholar]
- [17].Cox DR. Regression models and life-tables. Journal of the Royal Statistical Society Series B (Methodological) 1972;34(2):187–220. [Google Scholar]
- [18].Rautaharju PM, et al. Electrocardiographic predictors of incident heart failure in men and women free from manifest cardiovascular disease (from the atherosclerosis risk in communities [aric] study) The American Journal of Cardiology. 2013;112:843–849. doi: 10.1016/j.amjcard.2013.05.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [19].Rautaharju PM, et al. Electrocardiographic predictors of incident congestive heart failure and all-cause mortality in postmenopausal women: The women’s health initiative. Circulation. 2006;113:481–489. doi: 10.1161/CIRCULATIONAHA.105.537415. [DOI] [PubMed] [Google Scholar]
- [20].Rautaharju PM, et al. Electrocardiographic predictors of new-onset heart failure in men and in women free of coronary heart disease (from the Atherosclerosis in Communities [ARIC] Study) The American Journal of Cardiology. 2007;100:1437–1441. doi: 10.1016/j.amjcard.2007.06.036. [DOI] [PubMed] [Google Scholar]
- [21].Zhang Z-M, et al. Comparison of the prognostic significance of the electrocardiographic QRS/T angles in predicting incident coronary heart disease and total mortality (from the atherosclerosis risk in communities study) The American journal of cardiology. 2007;100(5):844–849. doi: 10.1016/j.amjcard.2007.03.104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [22].O’Neal WT, et al. Electrocardiographic predictors of heart failure with reduced versus preserved ejection fraction: The Multi-Ethnic study of atherosclerosis. Journal of the American Heart Association. 2017;6(6) doi: 10.1161/JAHA.117.006023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [23].Katzman J, et al. DeepSurv: Personalized treatment recommender system using a cox proportional hazards deep neural network. BMC Medical Research Methodology. 2018 doi: 10.1186/s12874-018-0482-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [24].Jiang S, Suriawinata AA, Hassanpour S. MHAttnSurv: Multi-Head attention for survival prediction using Whole-Slide pathology images. Computers in Biology and Medicine. 2023;158 doi: 10.1016/j.compbiomed.2023.106883. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [25].Li Z, et al. Survival prediction via hierarchical multi-modal Co-Attention transformer: A computational Histology-Radiology solution. IEEE Transactions on Medical Imaging. 2023 doi: 10.1109/TMI.2023.3263010. PP. [DOI] [PubMed] [Google Scholar]
- [26].Bello GA, et al. Deep learning cardiac motion analysis for human survival prediction. Nature Machine Intelligence. 2019;1:95–104. doi: 10.1038/s42256-019-0019-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [27].Singhal K, et al. Large language models encode clinical knowledge. Nature. 2022;620(7972):172–180. doi: 10.1038/s41586-023-06291-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [28].Radford A, et al. Learning transferable visual models from natural language supervision. 2021 [Google Scholar]
- [29].Liu C, et al. M-FLAG: Medical Vision-Language pre-training with frozen language models and latent space geometry optimization. Medical Image Computing and Computer Assisted Intervention. 2023 [Google Scholar]
- [30].Zhang X, et al. Knowledge-enhanced visual-language pre-training on chest radiology images. Nature Communications. 2023;14(1):4542. doi: 10.1038/s41467-023-40260-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [31].Liu J, et al. Clip-driven universal model for organ segmentation and tumor detection; International Conference on Computer Vision; 2023. [Google Scholar]
- [32].Qiu J, et al. Multimodal representation learning of cardiovascular magnetic resonance imaging. 2023 [Google Scholar]
- [33].Turgut Ö, Müller P, Hager P, Shit S, Starck S, Menten MJ, Martens E, Rueckert D. Unlocking the diagnostic potential of ECG through knowledge transfer from cardiac MRI. ArXiv. 2023 doi: 10.1016/j.media.2024.103451. [DOI] [PubMed] [Google Scholar]
- [34].Qiu J, et al. Transfer knowledge from natural language to electrocardiography: Can we detect cardiovascular disease through language models?; Findings of the Association for Computational Linguistics: EACL 2023; Dubrovnik, Croatia. 2023. pp. 442–453. [Google Scholar]
- [35].Liu C, et al. ETP: Learning transferable ECG representations via ECG-Text pre-training. 2023 [Google Scholar]
- [36].Han W, et al. Can brain signals reveal inner alignment with human languages? 2023 [Google Scholar]
- [37].Vaswani A, et al. In: Guyon I, et al., editors. Attention is all you need; Conference on Neural Information Processing Systems; 2017. pp. 5998–6008. [Google Scholar]
- [38].Ba JL, Kiros JR, Hinton GE. Layer normalization. 2016 [Google Scholar]
- [39].Wagner P, et al. PTB-XL, a large publicly available electrocardiography dataset. Scientific Data. 2020;7(1):154. doi: 10.1038/s41597-020-0495-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [40].Strodthoff N, et al. Ptb-xl+, a comprehensive electrocardiographic feature dataset. Scientific Data. 2023;10(1):279. doi: 10.1038/s41597-023-02153-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [41].Sudlow C, et al. UK biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine. 2015;12(3):e1001779. doi: 10.1371/journal.pmed.1001779. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [42].Alsentzer E, et al. Publicly available clinical BERT embeddings; Proceedings of the 2nd Clinical Natural Language Processing Workshop; Minneapolis, Minnesota, USA. 2019. pp. 72–78. [Google Scholar]
- [43].Johnson AEW, et al. MIMIC-III, a freely accessible critical care database. Scientific Data. 2016;3:160035. doi: 10.1038/sdata.2016.35. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [44].Nejedly P, et al. Classification of ecg using ensemble of residual cnns with attention mechanism; 2021 Computing in Cardiology (CinC); 2021. pp. 1–4. [DOI] [PubMed] [Google Scholar]
- [45].Loshchilov I, Hutter F. Decoupled weight decay regularization; ICLR; 2019. [Google Scholar]
- [46].Harrell FE, Jr, et al. Evaluating the yield of medical tests. JAMA: the Journal of the American Medical Association. 1982;247(18):2543–2546. [PubMed] [Google Scholar]
- [47].Sang Y, Beetz M, Grau V. Generation of 12-lead electrocardiogram with subject-specific, image-derived characteristics using a conditional variational autoencoder; International Symposium on Biomedical Imaging; 2022. pp. 1–5. [Google Scholar]
- [48].Beetz M, et al. Combined generation of electrocardiogram and cardiac anatomy models using multi-modal variational autoencoders; International Symposium on Biomedical Imaging; 2022. pp. 1–4. [Google Scholar]
- [49].Beetz M, Banerjee A, Grau V. Multi-domain variational autoencoders for combined modeling of MRI-based biventricular anatomy and ECG-based cardiac electrophysiology. Frontiers in Physiology. 2022;13 doi: 10.3389/fphys.2022.886723. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [50].Li L, et al. Deep computational model for the inference of ventricular activation properties; International Workshop on Statistical Atlases and Computational Models of the Heart; 2022. pp. 369–380. [Google Scholar]
- [51].Liu F, et al. An open access database for evaluating the algorithms of electrocardiogram rhythm and morphology abnormality detection. Journal of Medical Imaging and Health Informatics. 2018;8(7):1368–1373. [Google Scholar]
- [52].Sagie A, et al. An improved method for adjusting the QT interval for heart rate (the Framingham Heart Study) The American Journal of Cardiology. 1992;70(7):797–801. doi: 10.1016/0002-9149(92)90562-d. [DOI] [PubMed] [Google Scholar]
- [53].O’Neal WT, et al. Electrocardiographic time to intrinsicoid deflection and heart failure: The Multi-Ethnic study of atherosclerosis. Clinical Cardiology. 2016;39(9):531–536. doi: 10.1002/clc.22561. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [54].Li Y, Shah AJ, Soliman EZ. Effect of electrocardiographic P-wave axis on mortality. The American Journal of Cardiology. 2014;113(2):372–376. doi: 10.1016/j.amjcard.2013.08.050. [DOI] [PubMed] [Google Scholar]
- [55].Devereux RB, et al. Electrocardiographic detection of left ventricular hypertrophy using echocardiographic determination of left ventricular mass as the reference standard. comparison of standard criteria, computer diagnosis and physician interpretation. Journal of the American College of Cardiology. 1984;3(1):82–87. doi: 10.1016/s0735-1097(84)80433-7. [DOI] [PubMed] [Google Scholar]
- [56].Prineas RJ, Crow RS, Zhang Z-M. The Minnesota Code Manual of Electrocardiographic Findings. 2009 [Google Scholar]
- [57].McInnes L, Healy J, Melville J. UMAP: Uniform manifold approximation and projection for dimension reduction. 2018 [Google Scholar]
- [58].Kingma DP, Welling M. Auto-Encoding variational bayes; International Conference on Learning Representations; 2014. pp. 1–14. [Google Scholar]
- [59].EKG/ECG leads, electrodes, limb leads, chest (precordial) leads. ECG Waves; 2023. [Accessed on: 2023-07-01]. [Online]. Available: https://ecgwaves.com/topic/ekg-ecg-leads-electrodes-systems-limb-chest-precordial/ [Google Scholar]
- [60].Tikkanen JT, et al. Electrocardiographic T wave abnormalities and the risk of sudden cardiac death: The Finnish perspective. Annals of Noninvasive Electrocardiology: the Official Journal of the International Society for Holter and Noninvasive Electrocardiology, Inc. 2015;20(6):526–533. doi: 10.1111/anec.12310. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [61].Selvaraju RR, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization; IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017; 2017. pp. 618–626. [Google Scholar]
- [62].Grafton-Clarke C, et al. Cardiac magnetic resonance left ventricular filling pressure is linked to symptoms, signs and prognosis in heart failure. ESC heart failure. 2023;10(5):3067–3076. doi: 10.1002/ehf2.14499. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [63].Cikes M, et al. Machine learning-based phenogrouping in heart failure to identify responders to cardiac resynchronization therapy. European Journal of Heart Failure. 2019;21(1):74–85. doi: 10.1002/ejhf.1333. [DOI] [PubMed] [Google Scholar]
- [64].Biton S, et al. Atrial fibrillation risk prediction from the 12-lead electrocardiogram using digital biomarkers and deep representation learning. European Heart Journal Digital health. 2021;2(4):576–585. doi: 10.1093/ehjdh/ztab071. [DOI] [PMC free article] [PubMed] [Google Scholar]