Abstract
Aim: Deep learning’s widespread use prompts heightened scrutiny, particularly in the biomedical fields, with a specific focus on model generalizability. This study examines the influence of training data characteristics on the generalization performance of models for cardiac abnormality detection. Materials & methods: Leveraging diverse electrocardiogram datasets, models are trained on subsets with varying characteristics and subsequently compared for performance. Additionally, an attention mechanism is introduced with the aim of improving generalizability. Results: Experiments reveal that a balanced training set comprising just 1% of the large source dataset achieves generalization performance comparable to training on the full dataset, notably in detecting cardiac abnormalities. Conclusion: Balanced training data notably enhances model generalizability, while the integration of the attention mechanism further refines the model’s ability to generalize effectively.
Keywords: attention mechanism, cardiac abnormality detection, dataset characteristics, deep learning, electrocardiogram, generalization
Plain language summary
This study tackles a common problem for deep learning models: they often struggle when faced with new, unfamiliar data that they have not been trained on, a failure known as a performance drop in out-of-distribution generalization. Improving this out-of-distribution performance is the key focus of the research, with the aim of strengthening the models’ ability to handle diverse datasets beyond their training data.
The study examines how the characteristics of the dataset used to train deep learning models affect their ability to detect abnormal heart activities when applied to new, unseen data. Researchers trained these models using various sets of electrocardiogram (ECG) data and then evaluated their performance in identifying abnormalities. They also introduced an attention mechanism to enhance the models’ learning capabilities. The attention mechanism in deep learning is like a spotlight that helps the model focus on important information while ignoring less relevant details.
The findings were particularly noteworthy. Despite being trained on a small, well-balanced subset of a larger dataset, the models excelled in detecting heart abnormalities in new, unfamiliar data. This training method significantly improved the models’ generalization and performance with unseen data. Furthermore, integrating the attention mechanism substantially enhanced the models’ ability to generalize effectively on new information.
Article highlights.
Objective
Investigate the impact of training data characteristics and attention mechanism on deep learning model generalizability in cardiac abnormality detection.
Findings
A balanced dataset (just 1% of the total) improves model performance in generalization tasks, especially in detecting cardiac abnormalities.
The attention mechanism further enhances the model’s capacity to comprehend and utilize out-of-distribution data effectively.
Methodology
Utilized multiple electrocardiogram datasets for the study.
Trained models on subsets with varying characteristics and evaluated performance.
Added attention mechanism to enhance learning capabilities.
Implications
Balanced training data significantly enhances model generalizability.
Attention mechanism improves the model’s ability to generalize on out-of-distribution data.
Limitations & future research
Lack of clinical user information in datasets due to privacy and ethical considerations.
Future research may consider patient-specific models for improved generalization in biomedical machine learning.
Conclusion
Balanced and curated datasets are crucial for training high-performing models in cardiac abnormality detection using deep learning.
Attention mechanisms show promise in enhancing model accuracy and generalization.
Cardiac abnormalities are typically characterized as any deviation or alteration from an individual’s typical heart rate pattern that cannot be justified physiologically. They often indicate an underlying cardiac ailment and early identification can help prevent serious clinical conditions like heart failure or stroke [1]. Multiple clinical methods are currently used to diagnose cardiac abnormalities, including examining the patient’s medical history, physical assessment, and specialized monitoring equipment. The ECG is a widely utilized tool for measuring the heart’s electrical activity and was introduced by Waller in the early 1900s. It is considered essential in evaluating and diagnosing cardiovascular diseases [2]. The ECG offers a straightforward, low-cost, and nonintrusive approach to tracking heart signals.
Physicians frequently encounter challenges in interpreting ECG recordings because of the intricacy, unclear causes of abnormalities, and their clinical associations. Depending solely on visual examination of ECG recordings can result in incorrect diagnosis and categorization, which can have life-threatening outcomes [3]. A high level of expertise is necessary to interpret ECGs manually [4]. Furthermore, there is variation in interpretation among different observers and even for the same observer over different readings, and the task is both monotonous and requires a lot of time even for highly experienced cardiologists [5]. Studies indicate that cardiologists can accurately recognize ECG abnormalities in the range of 53–96%, with as many as 33% of ECG readings containing some level of mistake, and up to 11% of cases leading to incorrect treatment [3].
Recent research has aimed to develop precise deep neural networks (DNNs) capable of accurately classifying cardiac abnormalities from ECG tracings. However, the generalization of these models has been given insufficient attention in some publications. Ribeiro et al. [6] constructed a DNN that was trained on the most extensive 12-lead ECG dataset to date and achieved an impressive F1 score of over 80% and a specificity of more than 99% [6]. Nevertheless, this model has not been externally validated. The Hybrid Deep CNN Model proposed by Ullah [7] and the deep-learning-based framework introduced by Jamil and Rahman [8] have both attained an impressive F1 score of 0.99 and above. However, neither of these models has been cross-validated on other datasets, and their ability to generalize to new data has not been evaluated.
This study delves into the characteristics of the training dataset in terms of generalizability and investigates the impact of attention mechanisms on improving model generalization to new, unseen data. Attention mechanisms have garnered significant recognition for their ability to enhance model performance [9]. Through a series of experiments, we aim to assess how integrating attention mechanisms influences the model’s capability to generalize effectively and also examine the impact of training data characteristics on the generalizability of the trained model. This research contributes to expanding our understanding of dataset-specific characteristics and the role of attention mechanisms in enhancing model generalization.
1. Background
1.1. DNN architecture
This study extends the model introduced by Ribeiro et al. [6], which was originally trained on the Telehealth Network of Minas Gerais (TNMG) dataset to detect six cardiac abnormalities. Specifically, we utilize their unidimensional residual neural network architecture, including a 1D convolutional layer (Conv), followed by batch normalization (BN) and a ReLU activation function. The network then passes through four residual blocks (ResBlk) before exiting with a dense layer, as depicted in Figure 1A.
Figure 1.

Deep neural network and attention deep neural network. (A) Ribeiro et al. [6] developed a comprehensive DNN model for abnormality classification, which utilized multiple residual blocks in a specific configuration. (B) The attention layer added to the base model integrates both ReLU and Softmax mechanisms. It is positioned at the backend of the base architecture to study the effect of attention mechanisms on model performance and generalization.
BN: Batch normalization; Conv: Convolutional layer; DNN: Deep neural network; ReLU: Rectified Linear Unit; ResBlk: Residual Block.
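The stack in Figure 1A (Conv–BN–ReLU stem, four residual blocks, dense output) can be sketched in PyTorch. This is a minimal illustration, not Ribeiro et al.'s exact implementation: the kernel size, filter counts, and per-block downsampling factor used here are assumptions, chosen so that a 4096-sample, 12-lead input maps cleanly to six output logits.

```python
import torch
import torch.nn as nn

class ResBlk(nn.Module):
    """One residual block: Conv-BN-ReLU, Conv-BN, skip connection, ReLU.
    Kernel size 15 and 4x downsampling per block are illustrative choices."""
    def __init__(self, ch_in, ch_out, down=4):
        super().__init__()
        self.conv1 = nn.Conv1d(ch_in, ch_out, kernel_size=15, padding=7)
        self.bn1 = nn.BatchNorm1d(ch_out)
        self.conv2 = nn.Conv1d(ch_out, ch_out, kernel_size=15,
                               stride=down, padding=7)
        self.bn2 = nn.BatchNorm1d(ch_out)
        # Skip path matches the main path's length and channel count
        self.skip = nn.Sequential(nn.MaxPool1d(down),
                                  nn.Conv1d(ch_in, ch_out, kernel_size=1))
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return self.relu(y + self.skip(x))

class ECGNet(nn.Module):
    """Conv-BN-ReLU stem, four residual blocks, dense head (cf. Figure 1A)."""
    def __init__(self, n_leads=12, n_classes=6):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv1d(n_leads, 64, 15, padding=7),
                                  nn.BatchNorm1d(64), nn.ReLU())
        self.blocks = nn.Sequential(ResBlk(64, 128), ResBlk(128, 192),
                                    ResBlk(192, 256), ResBlk(256, 320))
        self.head = nn.Linear(320 * 16, n_classes)  # 4096 / 4^4 = 16 samples

    def forward(self, x):  # x: (batch, 12, 4096)
        y = self.blocks(self.stem(x))
        return self.head(y.flatten(1))
```

With four 4x-downsampling blocks, the 4096-sample input is reduced to 16 time steps before the dense layer, which outputs one logit per abnormality class.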
1.2. Attention layer
Bahdanau et al. [9] proposed an attention mechanism to address the bottleneck problem of fixed-length encoding vectors in sequence-to-sequence models. This problem is especially critical for longer or more complex sequences where the decoder’s access to input information is limited. The attention mechanism enables the decoder to selectively focus on relevant input parts, improving its access to information at each time step. The attention layer proposed in our model can be broken down into three main computational steps:
Alignment scores are computed by the attention mechanism’s alignment model, which takes encoded hidden states and the previous decoder output. This model is represented by Equation 1, typically a feedforward neural network.
$e_{t,i} = s_{t-1} \cdot h_i$ (1)

In the equation, $e_{t,i}$ represents the alignment score between the decoder hidden state at time step t-1 (denoted by $s_{t-1}$) and the encoder hidden state at time step i (denoted by $h_i$). The alignment score is computed as the dot product between these hidden states and measures their relevance or similarity. This alignment score is used in the attention mechanism to determine the decoder’s focus on specific encoder hidden states during sequence generation tasks. The dot product captures the degree of alignment between the two hidden states and is commonly used in attention mechanisms to quantify their compatibility.

To calculate the attention weights (denoted by $a_{t,i}$), as in Equation 2, the attention mechanism applies a softmax function to the alignment scores computed earlier.

$a_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{n} \exp(e_{t,j})}$ (2)

The attention mechanism generates a distinct context vector (denoted by $c_t$) for the decoder at each time step, as in Equation 3, determined by a weighted sum of all encoder hidden states.

$c_t = \sum_{i=1}^{n} a_{t,i} h_i$ (3)
This simplified attention mechanism differs from the advanced self-attention used in the paper ‘Attention is All You Need’ [10]. Notable distinctions include a single attention head, the absence of positional encoding, and a more straightforward architecture. This simplified approach may be better suited for specific tasks within a DNN model.
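The three computational steps above can be condensed into a few lines of NumPy. This is a didactic sketch of the dot-product variant described by Equations 1–3, not the exact layer used in the model.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax
    z = np.exp(x - x.max())
    return z / z.sum()

def attention_step(s_prev, H):
    """One attention step over encoder states.

    s_prev : shape (d,)   previous decoder hidden state s_{t-1}
    H      : shape (n, d) encoder hidden states h_1 ... h_n
    """
    e = H @ s_prev   # Eq. 1: dot-product alignment scores e_{t,i}
    a = softmax(e)   # Eq. 2: attention weights a_{t,i}
    c = a @ H        # Eq. 3: context vector c_t (weighted sum of states)
    return c, a
```

The weights `a` sum to one, so the context vector is a convex combination of the encoder states, dominated by the states most similar to the decoder's previous state.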
2. Materials & methods
Using the state-of-the-art model presented by Ribeiro et al. [6] as the base model, this study aims to evaluate the cross-dataset generalization ability of the model. Additionally, the study investigates how training data characteristics impact the model’s generalization performance. Furthermore, the study proposes an attention mechanism as an effective solution to enhance the generalization ability of machine learning models.
One of the primary focuses of this work is to investigate how dataset characteristics impact the performance of the trained model when applied to generalize beyond the training dataset. To achieve this objective, several other publicly available datasets are utilized in addition to the training dataset. These additional datasets are used to test the out-of-sample generalizability of the trained models.
The training dataset is segmented using various strategies to create distinct training subsets, each employed to train the proposed model architecture. Subsequently, these trained models are tested on the utilized datasets to analyze and comprehend their behavior. This process offers an overview of the model’s performance and its capacity to generalize across different datasets. Additionally, it provides insights into how the characteristics of the training dataset impact the model’s generalizability and performance.
This study focuses on utilizing publicly accessible datasets of ECG data to train and test the proposed models. However, the absence of clinical user information in the data, due to privacy and ethical considerations, is a limitation. Access to such information would significantly improve the model’s explainability and enhance the overall interpretability of the study, providing promising direction for future research.
At the end of our study, we will incorporate an attention mechanism into the model architecture and train the proposed model accordingly. The trained model with the attention mechanism will then be tested on various datasets. We will compare its performance with that of the model without the attention mechanism to gauge the impact of the attention mechanism on generalization ability.
2.1. Datasets
To evaluate the proposed method, this study primarily used the extensive ECG dataset collected by TNMG used by Ribeiro et al. [6]. Additionally, the China Physiological Signal Challenge 2018 (CPSC) dataset [11] and the ECG database from Shaoxing and Ningbo Hospitals (SNH) were employed as validation datasets.
2.1.1. TNMG dataset
The TNMG dataset, described by Ribeiro et al. [6], consists of 2,322,513 labeled short-duration S12L-ECG recordings obtained from 1,676,384 unique patients. Between 2010 and 2016, the TNMG, Brazil, gathered data using either a Tecnologia Eletronica Brasileira TEB ECGPC or a Micromed Biotecnologia ErgoPC 13 tele-electrocardiograph. Cardiologist annotations and the University of Glasgow ECG analysis program were used to label the recordings, resulting in a dataset containing a normal heart rhythm class and six typical clinical abnormality categories. The recordings were sampled at 400 Hz, and zero-padding was used where necessary so that each of the 12 leads produced records of equal length, containing 4096 samples, corresponding to approximately 10 s [6]. The six common clinical abnormality classes in the dataset are summarized in Table 1.
Table 1.
Classifications of abnormality in Telehealth Network Minas Gerais dataset and incidents in the general population.
| Abnormality | Acronyms | Description | Prevalence in 100,000 | Ref. |
|---|---|---|---|---|
| First-degree atrioventricular block | 1dAVb | A PR interval exceeding 200 ms on a surface electrocardiogram indicates first-degree atrioventricular block | 4000 among elders | [12] |
| Atrial fibrillation | AF | Rapid and irregular heartbeat characterize atrial fibrillation, which is the most common type of abnormality | 2000 of the European and North American population | [13] |
| Left bundle branch block | LBBB | Results in a sequence of activation in the right ventricle before the left ventricle which leads to modifications in the left ventricle’s perfusion, mechanics, and workload | 32 among men and 37 among women per year | [14] |
| Right bundle branch block | RBBB | A condition that affects the electrical activity in the ventricles of the heart, causing a delay in the depolarization of the right ventricle due to disrupted transmission of signals in the His–Purkinje system | 200–1300 in general public | [15] |
| Sinus tachycardia | ST | ST is a tachyarrhythmia characterized by a raised resting heart rate and an exaggerated heart rate response to slight exertion or alterations in body posture | 1160 randomly selected individuals | [16] |
| Sinus bradycardia | SB | A condition where the heart’s sinus node fires electrical impulses slower than normal, causing the heart rate to be slower than the usual resting rate | 160 in the elderly population | [17] |
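The fixed-length preprocessing described above (zero-padding each lead so every record holds 4096 samples) can be sketched as follows; symmetric padding placement is an assumption, as the text does not specify where the zeros are inserted.

```python
import numpy as np

def pad_to_length(ecg, target_len=4096):
    """Zero-pad a (leads, samples) ECG array to target_len samples per lead,
    mirroring the fixed 4096-sample TNMG records. Symmetric padding is an
    assumption; records longer than target_len are truncated."""
    n = ecg.shape[1]
    if n >= target_len:
        return ecg[:, :target_len]
    pad = target_len - n
    left = pad // 2
    return np.pad(ecg, ((0, 0), (left, pad - left)))
```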
This study utilized 1,048,575 distinct ECG 12-lead tracings from the TNMG dataset as part of its experimentation. The distribution of the dataset is depicted in Figure 2A.
Figure 2.

Dataset description using charts. (A) The TNMG dataset presents organized data for different age groups and genders. In this visualization, a darker bar represents the prevalence of abnormality. A central doughnut chart provides a comprehensive overview of various abnormalities on the right side. The inner layer of the chart offers a detailed breakdown by gender, while the outer layer provides a nuanced breakdown by age group. (B) The CPSC dataset is visualized similarly to the TNMG dataset, focusing on specific age groups and genders. (C) The Shaoxing and Ningbo hospitals dataset uses the same visual style as the TNMG dataset, incorporating organized data for age groups and genders.
1dAVb: First-degree Atrioventricular block; AF: Atrial fibrillation; CPSC: China Physiological Signal Challenge 2018; F: Female; LBBB: Left bundle branch block; M: Male; RBBB: Right bundle branch block; SB: Sinus bradycardia; ST: Sinus tachycardia; TNMG: Telehealth Network of Minas Gerais.
The TNMG dataset, a comprehensive collection of ECG recordings, offers a valuable resource for investigating the prevalence of ECG abnormalities in the general population. Figure 2A presents a striking proportion of normal ECG records in the TNMG dataset, with over 80% of the total records displaying no labeled abnormality. This finding is consistent with the theoretical expectation of a healthy population, as demonstrated in Table 1. Moreover, the distribution of ECG abnormalities across different age groups in the TNMG data aligns with the known increased prevalence of abnormalities in the elderly population. Notably, the number of patients with abnormalities declines beyond 75, potentially due to lower survival rates among the elderly population with abnormalities. Overall, the present findings highlight the importance of utilizing large and diverse datasets such as TNMG to gain insights into the expected distribution of ECG abnormalities in the general population while also identifying age as a crucial factor in the prevalence of certain cardiac pathology.
The analysis of abnormality classifications within the dataset demonstrates a relatively uniform distribution, with each type of abnormality occurring at comparable frequencies. Furthermore, this consistent distribution pattern is maintained across both genders, with no apparent gender-based variations observed. The age-based breakdown of abnormality classifications is also consistent with the general trend seen in the overall breakdown of age groups within the dataset. Notably, most cases are concentrated within the middle age groups, underscoring the utility of this dataset for developing predictive models for abnormalities. These findings offer valuable insights into the prevalence and distribution of abnormalities in the general population and emphasize the importance of utilizing large and diverse datasets for developing robust predictive models for abnormality detection and classification.
2.1.2. CPSC dataset
The CPSC training dataset is publicly accessible and includes an open-source collection of 6877 12-lead ECG records. The recordings ranged from 6 to 60 s and were sampled at a rate of 500 Hz [11]. The ECGs are divided into eight different types of abnormality, with AF, 1dAVb, left bundle branch block, and right bundle branch block overlapping with the TNMG dataset. Additionally, ST-segment depression, premature atrial contraction, ST-segment elevation, and premature ventricular contraction are also represented.
In order to align the CPSC dataset with the TNMG dataset introduced earlier, patients diagnosed with ST-segment depression, premature atrial contraction, ST-segment elevation, and premature ventricular contraction in their first labeled condition are excluded from our analysis. The ECG tracings are also resampled to 400 Hz to ensure comparability with the TNMG dataset. The breakdown of the processed CPSC dataset is shown in Figure 2B.
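The resampling step (500 Hz down to 400 Hz, to match TNMG) might look as follows. Linear interpolation is used here purely for illustration; a production pipeline would more likely use a polyphase resampler with proper anti-aliasing filtering.

```python
import numpy as np

def resample_ecg(lead, fs_in=500, fs_out=400):
    """Resample a single ECG lead from fs_in to fs_out by linear
    interpolation (illustrative; a polyphase filter would be preferred)."""
    n_out = int(round(len(lead) * fs_out / fs_in))
    t_in = np.arange(len(lead)) / fs_in     # original sample times (s)
    t_out = np.arange(n_out) / fs_out       # target sample times (s)
    return np.interp(t_out, t_in, lead)
```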
The dataset contains a high proportion of patients with abnormality: approximately 75% of the total 4622 unique ECG tracings exhibit one or more types of abnormality. This high proportion is not a result of the reduction in dataset size but rather an inherent characteristic of the original dataset. The CPSC dataset follows general population trends, with a higher frequency of abnormality among elderly patients. However, the distribution of patients across the four classifications of abnormality is uneven; the high proportion of patients with right bundle branch block does not reflect the general population and leaves the dataset imbalanced for training algorithms.
2.1.3. SNH dataset
The SNH dataset, open to the public, contains 45,152 12-lead ECG recordings gathered between 2013 and 2018 [18,19]. These recordings are 10 s long, and a licensed physician has assigned labels to each one indicating its association with one or more of the 63 abnormality types. To conform with the TNMG dataset, only ECG tracings that are categorized as normal or fall under the six classes existing in TNMG have been retained, reducing the total number of recordings to 34,033. Furthermore, all tracings have been resampled from 500 Hz to 400 Hz. The processed SNH dataset is presented in Figure 2C.
The SNH dataset highlights a significant prevalence of abnormalities among patients, with approximately 70% of the population presenting one or more types of abnormality. Upon closer examination of the patient classifications, the dataset reveals a significant skew toward SB, with more than half of the abnormality cases categorized as such. Such an imbalance toward SB in the dataset may pose a challenge when training an abnormality classification model.
2.2. Data subsets
To investigate how training data characteristics affect the model’s performance and generalization ability, this study selected subsets of ECG tracings from the TNMG dataset using different selection criteria. The same model was trained on each subset, and a cross-dataset performance analysis was conducted to examine the model’s performance and generalization capacity. The model is also compared with the model trained in [6].
By using different selection criteria for the training data subsets, this study aims to provide insights into which characteristics of the training data have the most significant impact on the model’s generalization ability. Furthermore, by analyzing the model’s cross-dataset performance, this study can assess the degree to which the model can apply what it learned from the training data to new datasets. The findings from this study can inform future work on developing models with robust generalization capabilities in ECG analysis.
2.3. Attention mechanism
Attention mechanisms have been shown to improve the performance of models [9]; however, this study goes beyond and examines the effect of attention mechanisms on generalization. This study proposes an additional attention layer to be added to the base model, which will also be trained on the four subsets of ECG tracings. The performance of this model with the attention mechanism will be compared with the performance of the base model without the attention mechanism, and their cross-dataset generalization abilities will be evaluated.
The proposed base model with the attention mechanism is illustrated in Figure 1B. By including an attention mechanism, the model can selectively focus on the most informative regions of the ECG tracings, which could potentially improve the model’s ability to generalize to new datasets. The findings from this study could provide insights into the benefits of incorporating attention mechanisms in ECG analysis models and how they impact the models’ generalization capabilities.
2.4. Evaluation metrics
The models in this study are evaluated using the following performance metrics. In binary classification, TP (true positive), TN (true negative), FP (false positive) and FN (false negative) are defined. Specificity, recall, and precision are calculated using Equations 4, 5 & 6, respectively. The F1 score, depicted in Equation 7, represents the harmonic mean of precision and recall, offering a balanced evaluation that accounts for both metrics.
$\text{Specificity} = \frac{TN}{TN + FP}$ (4)

$\text{Recall} = \frac{TP}{TP + FN}$ (5)

$\text{Precision} = \frac{TP}{TP + FP}$ (6)

$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ (7)
Due to the varying degrees of class imbalance observed in all three datasets, where one class significantly outweighs the other, the F1 score emerges as a more appropriate metric. By considering both true positives and false negatives, the F1 score offers better evaluation and sensitivity for imbalanced datasets.
For all of these metrics, a higher score indicates better performance, with the F1 score capturing the model’s overall balanced performance.
These performance metrics align with those utilized in Ribeiro et al.’s [6] work, simplifying the evaluation of results. Furthermore, a uniform threshold value of 0.5 was applied to all performance metrics to maintain consistency in the comparison process.
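Equations 4–7, together with the uniform 0.5 decision threshold, can be implemented directly:

```python
import numpy as np

def binary_metrics(y_true, y_score, threshold=0.5):
    """Specificity, recall, precision and F1 (Equations 4-7), applying the
    uniform 0.5 decision threshold used throughout the comparison."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    specificity = tn / (tn + fp)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"specificity": specificity, "recall": recall,
            "precision": precision, "f1": f1}
```

For multi-label ECG classification, these per-class values would then be averaged across the abnormality classes, as in Tables 3 and 4.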
3. Experiment
3.1. Strategy of subsetting
This study investigates the influence of distinct selection criteria for training data on a proposed model architecture’s performance and generalization capacity. The model will be trained using four subsets of 12-lead ECG data from the TNMG dataset, each consisting of 21,000 traces, with varying selection criteria as presented in Table 2. The study’s main objective is to analyze how specific attributes or characteristics of the training data impact the performance and generalization ability of the model when applied to various datasets.
Table 2.
Subset selection criteria.
| Subset | Selection criteria |
|---|---|
| I | Three thousand recordings from each of the six ECG classification types and 3000 normal readings |
| II | The same selection criteria as approach I, but favor patients with multiple types of abnormality |
| III | Select the 1500 oldest and youngest records for each ECG classification |
| IV | Randomly selected |
As a result of some patients having more than one type of abnormality, there may be repeated selection of patients from the TNMG dataset. The remaining data required to reach 21,000 records is randomly drawn from the dataset. These repeated selections may introduce discrepancies in the overall statistics. The statistics of the subsets are summarized in Figure 3.
Figure 3.

The subset statistics are presented with charts that show different age groups, with separate bars for females on the left and males on the right to enhance visual clarity. Although all subsets contain the same number of patients, it is important to note that some patients may exhibit more than one abnormality, resulting in total numbers exceeding 21,000.
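Subset I's selection logic (a fixed quota per abnormality class plus normal readings, topped up with random draws to reach 21,000 records) can be sketched as below. The `(record_id, label_list)` record representation and the function name are illustrative, not the study's actual code.

```python
import random

def balanced_subset(records, classes, per_class=3000, n_normal=3000,
                    total=21000, seed=0):
    """Sketch of Subset I's selection: a fixed quota per abnormality class
    plus normal readings, topped up with random draws to reach `total`.
    `records` is a list of (record_id, label_list) pairs."""
    rng = random.Random(seed)
    chosen, chosen_ids = [], set()
    for cls in classes:
        pool = [r for r in records if cls in r[1]]
        for r in rng.sample(pool, min(per_class, len(pool))):
            if r[0] not in chosen_ids:  # patients may carry several labels
                chosen.append(r)
                chosen_ids.add(r[0])
    normals = [r for r in records if not r[1]]
    for r in rng.sample(normals, min(n_normal, len(normals))):
        if r[0] not in chosen_ids:
            chosen.append(r)
            chosen_ids.add(r[0])
    # Top up with random draws from the remaining records
    remainder = [r for r in records if r[0] not in chosen_ids]
    k = max(0, min(total - len(chosen), len(remainder)))
    return chosen + rng.sample(remainder, k)
```

Because multi-label patients satisfy several class quotas at once, the deduplication step is what produces the shortfall that the random top-up then fills.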
3.2. Training
The architecture of the base model shown in Figure 1A is trained on each of the subsets, and its performance is evaluated on a test dataset presented in [6]. The performance of these models is compared with that of a model trained in [6], and their cross-dataset generalization ability is assessed by testing the models on the CPSC and SNH datasets.
Similarly, we train the proposed model with the attention mechanism illustrated in Figure 1B on the subsets and select the best-performing model for evaluation of its cross-dataset generalization ability. The performance is compared with that of the base model trained in [6], and its generalization capacity is evaluated.
The hyperparameters previously used in [6] were deployed to train our model. These include 16 for kernel size, 64 for batch size, an initial learning rate of 0.001, the Adam optimizer, 0.8 for dropout rate, and epochs of 200 with a patience value of 9 epochs for early stopping.
By utilizing these hyperparameters, we aim to ensure consistency in our experiments and facilitate direct comparison with prior research. Such an approach enables a more rigorous evaluation of our proposed model architecture’s performance in ECG signal classification tasks.
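Of the hyperparameters listed above, the early-stopping rule (patience of 9 epochs) can be expressed as a small helper; monitoring validation loss, rather than some other metric, is an assumption here.

```python
class EarlyStopping:
    """Stop training when validation loss fails to improve for `patience`
    consecutive epochs, matching the patience-9 setting reported above."""
    def __init__(self, patience=9):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `if stopper.step(val_loss): break` would terminate well before the 200-epoch cap once validation loss plateaus.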
3.3. Model testing
The models were evaluated on the test dataset described in [6], which consists of 827 unique 12-lead tracings of ECG records. Further details regarding the breakdown of the test dataset can be found in Figure 4.
Figure 4.

Telehealth Network Minas Gerais test-set statistics by age groups with separate bars for females and males. The female bars are positioned on the left, and the male bars on the right for visual clarity.
To assess the generalization capacity of the models across different datasets, tests of the models were conducted on the CPSC and SNH datasets. The performance of the models was evaluated using the proposed metrics, and the results were analyzed to determine their generalization performance.
4. Results
The experiment results are divided into two subsections for presentation. In the first subsection, we focus on the findings related to the base model architecture, exploring how its performance varies with the distinctive characteristics of the subsets used in training. In the second subsection, we shift our attention to the proposed model with attention mechanisms, evaluating its performance across the same subsets. By segregating the analysis, we can understand both the base model’s behavior and the potential enhancements brought about by incorporating attention mechanisms.
4.1. Effect of dataset characteristics
The models trained on the subsets are tested on the test set of [6] and are compared with the result of the trained model in [6]. The performance metrics of the models are presented in Table 3.
Table 3.
Analyzing deep neural network performance: evaluating inference results on Telehealth Network Minas Gerais test-set, China Physiological Signal Challenge 2018 and Shaoxing and Ningbo Hospitals datasets for various subsets (I, II, III & IV) versus the full dataset training.
| Subset | I | II | III | IV | Full [6] |
|---|---|---|---|---|---|
| Inference on TNMG test-set |
| Precision | 0.81 | 0.77 | 0.80 | 0.79 | 0.92 |
| Recall | 0.87 | 0.93 | 0.89 | 0.82 | 0.94 |
| Specificity | 0.99 | 0.99 | 0.99 | 0.99 | 1.00 |
| F1 Score | 0.84 | 0.83 | 0.83 | 0.80 | 0.93 |
| Inference on CPSC dataset | |||||
| Precision | 0.93 | 0.88 | 0.90 | 0.80 | 0.91 |
| Recall | 0.84 | 0.84 | 0.86 | 0.76 | 0.84 |
| Specificity | 0.98 | 0.97 | 0.98 | 0.95 | 0.98 |
| F1 Score | 0.88 | 0.86 | 0.88 | 0.77 | 0.87 |
| Inference on SNH dataset | |||||
| Precision | 0.48 | 0.51 | 0.48 | 0.15 | 0.48 |
| Recall | 0.50 | 0.55 | 0.53 | 1.00 | 0.53 |
| Specificity | 0.85 | 0.82 | 0.82 | 0.00 | 0.82 |
| F1 Score | 0.47 | 0.50 | 0.49 | 0.22 | 0.48 |
CPSC: China Physiological Signal Challenge 2018; SNH: Shaoxing and Ningbo Hospitals; TNMG: Telehealth Network Minas Gerais.
The trained models underwent evaluation on the CPSC dataset through model inference, and their performance was compared with that of the model presented in [6]. The results are summarized in Table 3.
The models were inferred on the SNH dataset and their effectiveness were compared with that of the model introduced in [6]. The comparison results are illustrated in Table 3.
Since the F1 score, which integrates both recall and precision, is the primary metric of focus in this study, Table 4 provides a comprehensive overview of the F1 scores for each cardiac abnormality. Furthermore, it offers a detailed breakdown of the performance of the Attention Deep Neural Network for each specific abnormality class.
Table 4.
Examination of F1 inference outcomes across diverse datasets for particular abnormality types, employing models trained on different subsets.
| Subset | I | II | III | IV | Full | I | II | III | IV | Full |
|---|---|---|---|---|---|---|---|---|---|---|
| Abnormalities | DNN inference on TNMG | A-DNN inference on TNMG | ||||||||
| 1dAVb | 0.70 | 0.72 | 0.65 | 0.62 | 0.90 | 0.71 | 0.71 | 0.73 | 0.76 | – |
| RBBB | 0.91 | 0.89 | 0.91 | 0.90 | 0.94 | 0.91 | 0.90 | 0.93 | 0.91 | – |
| LBBB | 0.92 | 0.97 | 0.93 | 0.84 | 1.00 | 0.97 | 0.95 | 0.94 | 0.93 | – |
| SB | 0.75 | 0.82 | 0.80 | 0.82 | 0.88 | 0.90 | 0.85 | 0.88 | 0.86 | – |
| AF | 0.86 | 0.65 | 0.78 | 0.67 | 0.87 | 0.72 | 0.67 | 0.83 | 0.67 | – |
| ST | 0.90 | 0.94 | 0.93 | 0.94 | 0.96 | 0.91 | 0.90 | 0.94 | 0.94 | – |
| Average | 0.84 | 0.83 | 0.83 | 0.8 | 0.93 | 0.85 | 0.83 | 0.88 | 0.84 | – |
| Abnormalities | DNN inference on CPSC | DNN inference on SNH | ||||||||
| 1dAVb | 0.87 | 0.84 | 0.85 | 0.57 | 0.85 | 0.08 | 0.06 | 0.07 | 0.03 | 0.10 |
| RBBB | 0.84 | 0.81 | 0.83 | 0.78 | 0.79 | 0.23 | 0.43 | 0.35 | 0.06 | 0.46 |
| LBBB | 0.89 | 0.90 | 0.89 | 0.87 | 0.89 | 0.46 | 0.55 | 0.52 | 0.03 | 0.51 |
| SB | – | – | – | – | – | 0.74 | 0.72 | 0.71 | 0.70 | 0.71 |
| AF | 0.93 | 0.90 | 0.94 | 0.86 | 0.94 | 0.50 | 0.53 | 0.50 | 0.19 | 0.44 |
| ST | – | – | – | – | – | 0.77 | 0.73 | 0.78 | 0.32 | 0.67 |
| Average | 0.88 | 0.86 | 0.88 | 0.77 | 0.87 | 0.47 | 0.50 | 0.49 | 0.22 | 0.48 |
Boldface indicates the average results of the models on the different datasets.
1dAVb: First-degree atrioventricular block; A-DNN: Attention deep neural network; AF: Atrial fibrillation; CPSC: China Physiological Signal Challenge 2018; DNN: Deep neural network; LBBB: Left bundle branch block; RBBB: Right bundle branch block; SB: Sinus bradycardia; SNH: Shaoxing and Ningbo Hospital; ST: Sinus tachycardia; TNMG: Telehealth Network Minas Gerais.
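For reference, per-class F1 scores of the kind reported in Table 4 can be computed directly from binary predictions. The sketch below uses scikit-learn's `f1_score`; the label names and arrays are illustrative toy data, not values from the study.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical multi-label predictions for three ECG abnormality classes.
# Rows are recordings; columns are classes (e.g., 1dAVb, RBBB, LBBB).
classes = ["1dAVb", "RBBB", "LBBB"]
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 1, 0],
                   [0, 0, 1]])

# Per-class F1: one score per abnormality, as in the rows of Table 4.
per_class = f1_score(y_true, y_pred, average=None)
for name, score in zip(classes, per_class):
    print(f"{name}: {score:.2f}")

# Macro average: unweighted mean over classes, as in the "Average" row.
print(f"Average: {f1_score(y_true, y_pred, average='macro'):.2f}")
```

The macro average weights each abnormality equally regardless of prevalence, which is why it is a natural summary for the imbalanced ECG datasets discussed here.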
4.2. Boosted model with attention mechanism
To assess the impact of the attention mechanism on the generalizability of the model, we trained the proposed model architecture depicted in Figure 1B on the four subsets. Its performance on the test set from [6] is presented in Table 5.
Table 5.
The inference results of the attention deep neural network trained on different subsets (I, II, III & IV) evaluated on the Telehealth Network Minas Gerais test set.
| Subset | I | II | III | IV |
|---|---|---|---|---|
| Precision | 0.86 | 0.82 | 0.84 | 0.88 |
| Recall | 0.85 | 0.84 | 0.92 | 0.83 |
| Specificity | 1.00 | 0.99 | 0.99 | 1.00 |
| F1 score | 0.85 | 0.83 | 0.88 | 0.84 |
The top-performing model, which incorporates the attention mechanism and achieves the highest F1 score (Subset III), was evaluated on both the CPSC and SNH datasets through model inference. Its performance was compared with that of the TNMG model introduced in [6], and the results are presented in Table 6.
Table 6.
Comparative inference evaluation of attention deep neural network performance trained on Telehealth Network Minas Gerais subset III to the original deep neural network trained on the entire Telehealth Network Minas Gerais dataset.
| | A-DNN III | | DNN [6] | |
|---|---|---|---|---|
| | CPSC | SNH | CPSC | SNH |
| Precision | 0.90 | 0.50 | 0.91 | 0.48 |
| Recall | 0.86 | 0.57 | 0.84 | 0.53 |
| Specificity | 0.97 | 0.81 | 0.98 | 0.82 |
| F1 score | 0.88 | 0.52 | 0.87 | 0.48 |
A-DNN: Attention deep neural network trained on subset III; CPSC: China Physiological Signal Challenge 2018; DNN: Deep neural network; SNH: Shaoxing and Ningbo Hospital.
5. Discussion
Based on the experimental results, it is evident that the characteristics of the training data significantly impact model performance, even with the same model structure and the same amount of training data. The models trained on subsets I, II, and III exhibited better performance than the model trained on subset IV, achieving F1 scores of 0.84, 0.83, 0.83, and 0.80, respectively. The improved performance of subsets I, II, and III can be attributed to the balanced nature of their training data, which contained similar numbers of each abnormality class. While a larger dataset can lead to better in-sample performance, this boost may simply reflect the presence of data from every abnormality class. Considering that each subset comprises only around 1% of the original dataset, the models trained on these subsets still demonstrate commendable performance.
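The balanced-subset idea discussed above can be sketched as a simple per-class sampling routine. The function below is a hypothetical illustration, not the study's actual curation code, and assumes for simplicity that each record carries a single abnormality label.

```python
import random
from collections import defaultdict

def balanced_subset(records, labels, per_class, seed=0):
    """Draw an equal number of records from each abnormality class.

    records: list of ECG records (any objects)
    labels:  list of class labels, one per record
    per_class: number of records to sample from each class
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec, lab in zip(records, labels):
        by_class[lab].append(rec)
    subset = []
    for lab, recs in sorted(by_class.items()):
        if len(recs) < per_class:
            raise ValueError(f"class {lab!r} has only {len(recs)} records")
        subset.extend(rng.sample(recs, per_class))
    return subset

# Toy example: a heavily imbalanced label distribution.
labels = ["NORM"] * 90 + ["AF"] * 7 + ["LBBB"] * 3
records = list(range(len(labels)))
subset = balanced_subset(records, labels, per_class=3)
print(len(subset))  # 9 records: 3 per class
```

The cap on `per_class` is set by the rarest class, which is why a balanced subset of a large imbalanced corpus ends up small, on the order of 1% in this study.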
The story changes when it comes to model generalization. As discussed earlier, for inference on the CPSC dataset, the models trained on subsets I, II, and III again outperformed the model trained on subset IV, achieving F1 scores of 0.88, 0.86, 0.88, and 0.77, respectively, demonstrating the significance of a balanced dataset. What is even more surprising is that Ribeiro et al.’s [6] model, which was trained on the entire TNMG dataset and achieved an F1 score of 0.87, did not perform better than the models trained on the smaller subsets I and III, and performed only marginally better than the model trained on subset II. This is particularly interesting given that each subset uses only approximately 1% of the entire TNMG dataset, and it suggests that a well-balanced and properly curated smaller dataset can yield competitive performance compared with larger and more diverse datasets.
For inference on the SNH dataset, performance decreased significantly across all models, consistent with a long-standing issue in biomedical machine learning: models often struggle to generalize when faced with changes in data, equipment, environment, and patient characteristics. To address this problem, some researchers have proposed patient-specific models, and there is a growing body of research in this direction.
Ribeiro et al.’s [6] model achieved a considerably lower F1 score of 0.48 on the SNH dataset, indicating that even a model trained on a large dataset faces generalization challenges in the biomedical sector. The models trained on subsets I, II, and III again outperformed the one trained on subset IV, with F1 scores of 0.47, 0.50, 0.49, and 0.22, respectively, reaffirming the importance of a balanced training dataset. Remarkably, the same pattern emerged when comparing against Ribeiro et al.’s [6] model on the SNH dataset: it once again failed to outperform models trained on only 1% of its data, performing worse than the models trained on subsets II and III and only slightly better than the one trained on subset I. This further confirms the findings from the CPSC inference results and underscores the importance of dataset curation and balance, even when large datasets are available in the biomedical domain.
When considering the attention mechanism employed to enhance the model’s performance, an observation from Table 4 indicates that a significant portion of abnormalities has experienced improvements in their F1 scores. This observation highlights the effectiveness of a straightforward attention mechanism in enhancing the in-sample performance of the model.
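The paper describes its attention mechanism only at the architectural level (Figure 1B). As a rough illustration of the general idea, a common pattern for 1D physiological signals is a learned soft weighting over time steps, sketched below in PyTorch. The layer sizes and the placement of the attention layer are assumptions for the sketch, not the study's exact architecture.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Soft attention over the time axis of a 1D feature map.

    Input:  (batch, channels, time) feature map from a CNN backbone.
    Output: (batch, channels) attention-weighted summary vector.
    """
    def __init__(self, channels):
        super().__init__()
        # One attention score per time step, computed from all channels.
        self.score = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x):
        # x: (B, C, T) -> normalized weights over the T time steps.
        weights = torch.softmax(self.score(x), dim=-1)   # (B, 1, T)
        # Weighted sum over time: the model "attends" to informative beats.
        return (x * weights).sum(dim=-1)                 # (B, C)

# Toy usage: an ECG-like feature map (batch=2, 64 channels, 256 time steps).
feats = torch.randn(2, 64, 256)
attn = TemporalAttention(64)
out = attn(feats)
print(out.shape)  # torch.Size([2, 64])
```

Intuitively, the softmax weights act as the "spotlight" described in the plain language summary: segments of the recording that carry diagnostic information receive larger weights in the pooled representation.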
The final part of the experiment focuses on the impact of the attention mechanism on performance and generalization. The results clearly indicate that the attention mechanism improves the F1 score for almost all subsets, with a particularly significant improvement for subset IV, which initially had the lowest F1 score on the TNMG test set. The models’ performance improved to F1 scores of 0.85, 0.83, 0.88, and 0.84, respectively, as shown in Table 5. This trend also carries over to generalization: as the best-performing attention model across all subsets, the attention model trained on subset III outperformed Ribeiro et al.’s [6] model in inference on both the CPSC and SNH datasets, as illustrated in Table 6. Given that the new model was trained with only 1% of the TNMG dataset, we consider this performance a solid confirmation of our earlier finding on the significance of balanced data. Our findings also confirm the positive impact of the attention mechanism on the model’s accuracy and inference performance.
Looking again at the models trained on the subsets, the performance of the models trained on subsets I, II, and III is neither clearly higher nor lower than one another, indicating that the additional selection criteria used for these subsets, such as age preference and multi-abnormality patients, have no clear impact on performance. Balanced data is therefore the most important factor. This finding also speaks to a common assumption among researchers that training datasets should match the characteristics of the general population. The TNMG dataset, claimed to be the largest ECG dataset, matches the proportions of the various abnormalities in the general population; nonetheless, models trained on it struggle to exhibit the same level of generalization as models trained on a smaller, balanced dataset. Researchers should therefore use balanced ECG datasets for training to achieve the best-performing models.
We considered techniques such as bootstrapping to strengthen the statistical robustness of this study, but opted against including them for several compelling reasons. First, given that we evaluated the model on the entirety of each external dataset to assess its capacity for generalization, bootstrapping is not ideally suited to our context. Furthermore, the datasets used for generalization exhibit conspicuous class imbalances, raising concerns that bootstrapping could yield misleading outcomes.
6. Conclusion
The study highlights the importance of dataset characteristics in influencing model performance and generalization. The researchers observed that models trained on balanced subsets consistently outperformed those trained on unbalanced data, despite the latter’s larger size, emphasizing the critical role of dataset balance.
Additionally, the introduction of the attention mechanism proved beneficial, notably improving model performance across various subsets. This enhancement was particularly significant in subset IV, which initially had lower performance. The positive impact of the attention mechanism extended to external datasets (CPSC and SNH), indicating its efficacy in enhancing model accuracy and generalization.
These findings have implications for researchers and practitioners in the biomedical field, emphasizing the value of using balanced datasets and incorporating attention mechanisms to optimize model performance. By addressing these factors, the research aims to contribute to the ongoing advancements in biomedical machine learning.
Acknowledgments
Z Huang would like to acknowledge the support of the Research Training Program provided by the Australian Government.
Author contributions
Z Huang: writing (lead), conceptualization (equal), data curation (equal), formal analysis (equal), investigation (equal), methodology (equal), software (equal), validation (equal), visualization (lead); S MacLachlan: conceptualization (equal), data curation (equal), formal analysis (equal), investigation (equal), methodology (equal), software (equal), validation (equal); L Yu: writing (support); LFH Contreras: visualization (support); ND Truong: supervision (support); AH Ribeiro: supervision (support); O Kavehei: supervision (lead).
Financial disclosure
The authors have no financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.
Competing interests disclosure
The authors have no competing interests or relevant affiliations with any organization or entity with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.
Writing disclosure
No writing assistance was utilized in the production of this manuscript.
Code availability
You can access the attention model code through this link: https://github.com/NeuroSyd/ECG-Attention. Keep in mind that certain terms, conditions, or usage restrictions may apply to the code.
Data availability statement
This paper employs publicly accessible datasets such as CPSC and SNH. The TNMG dataset is not publicly available, but access can be granted upon request via the data custodian’s approval.
References
Papers of special note have been highlighted as: • of interest; •• of considerable interest
- 1.Nattel S, Andrade J, Macle L, et al. New directions in cardiac arrhythmia management: present challenges and future solutions. Can J Cardiol. 2014;30(12):S420–S430. doi: 10.1016/j.cjca.2014.09.027 [DOI] [PubMed] [Google Scholar]
- 2.AlGhatrif M, Lindsay J. A brief review: history to understand fundamentals of electrocardiography. J Community Hosp Intern Med Perspect. 2012;2(1):14383. doi: 10.3402/jchimp.v2i1.14383 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Breen C, Kelly G, Kernohan W. ECG interpretation skill acquisition: a review of learning, teaching and assessment. J Electrocardiol. 2022;73:125–128. doi: 10.1016/j.jelectrocard.2019.03.010 [DOI] [PubMed] [Google Scholar]
- 4.Bond R, Zhu T, Finlay D, et al. Assessing computerized eye tracking technology for gaining insight into expert interpretation of the 12-lead electrocardiogram: an objective quantitative approach. J Electrocardiol. 2014;47(6):895–906. doi: 10.1016/j.jelectrocard.2014.07.011 [DOI] [PubMed] [Google Scholar]
- 5.Martis RJ, Acharya UR, Adeli H. Current methods in electrocardiogram characterization. Comput Biol Med. 2014;48:133–149. doi: 10.1016/j.compbiomed.2014.02.012 [DOI] [PubMed] [Google Scholar]
- 6.Ribeiro AH, Ribeiro MH, Paixão GM, et al. Automatic diagnosis of the 12-lead ECG using a deep neural network. Nat Commun. 2020;11(1):1760. doi: 10.1038/s41467-020-15432-4 [DOI] [PMC free article] [PubMed] [Google Scholar]; •• A highly advanced model for detecting cardiac abnormalities, leveraging a vast dataset.
- 7.Ullah A, Rehman SU, Tu S, et al. A hybrid deep CNN model for abnormal arrhythmia detection based on cardiac ECG signal. Sensors. 2021;21(3):951. doi: 10.3390/s21030951 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Jamil S, Rahman M. A novel deep-learning-based framework for the classification of cardiac arrhythmia. J Imaging. 2022;8(3):70. doi: 10.3390/jimaging8030070 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:14090473. 2014. [Google Scholar]; • Laid the foundation for modern neural machine translation systems.
- 10.Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;30:30. [Google Scholar]; •• A groundbreaking model in the field of deep learning.
- 11.Liu F, Liu C, Zhao L, et al. An open access database for evaluating the algorithms of electrocardiogram rhythm and morphology abnormality detection. J Med Imaging Health Infor. 2018;8(7):1368–1373. doi: 10.1166/jmihi.2018.2442 [DOI] [Google Scholar]; •• A well-organized dataset for cardiac abnormalities.
- 12.Nikolaidou T, Ghosh JM, Clark AL. Outcomes related to first-degree atrioventricular block and therapeutic implications in patients with heart failure. JACC Clin Electrophysiol. 2016;2(2):181–192. doi: 10.1016/j.jacep.2016.02.012 [DOI] [PubMed] [Google Scholar]
- 13.Wang Z, Chen Z, Wang X, et al. The disease burden of atrial fibrillation in China from a national cross-sectional survey. Am J Cardiol. 2018;122(5):793–798. doi: 10.1016/j.amjcard.2018.05.015 [DOI] [PubMed] [Google Scholar]
- 14.Pérez-Riera AR, Barbosa-Barros R, de Rezende Barbosa MP, et al. Left bundle branch block: epidemiology, etiology, anatomic features, electrovectorcardiography, and classification proposal. Ann Noninvasive Electrocardiol. 2019;24(2):e12572. doi: 10.1111/anec.12572 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Xiong Y, Wang L, Liu W, et al. The prognostic significance of right bundle branch block: a meta-analysis of prospective cohort studies. Clin Cardiol. 2015;38(10):604–613. doi: 10.1002/clc.22454 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Still A-M, Raatikainen P, Ylitalo A, et al. Prevalence, characteristics and natural course of inappropriate sinus tachycardia. EP Europace. 2005;7(2):104–112. doi: 10.1016/j.eupc.2004.12.007 [DOI] [PubMed] [Google Scholar]
- 17.Chiu S-N, Lin L-Y, Wang J-K, et al. Long-term outcomes of pediatric sinus bradycardia. J Pediatr. 2013;163(3):885–889.e1. doi: 10.1016/j.jpeds.2013.03.054 [DOI] [PubMed] [Google Scholar]
- 18.Zheng J, Zhang J, Danioko S, et al. A 12-lead electrocardiogram database for arrhythmia research covering more than 10,000 patients. Sci Data. 2020;7(1):48. doi: 10.1038/s41597-020-0386-x [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Zheng J, Chu H, Struppa D, et al. Optimal multi-stage arrhythmia classification approach. Sci Rep. 2020;10(1):2898. doi: 10.1038/s41598-020-59821-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
