Scientific Reports. 2025 Jul 29;15:27557. doi: 10.1038/s41598-025-12303-0

A dual branch feature extraction network for heart sound signal analysis

Hao Chen 1,2, Wenye Gu 3
PMCID: PMC12307704  PMID: 40731147

Abstract

The analysis of heart sound signals is critical in the early diagnosis of cardiovascular disease. However, the complexity and diversity of these signals pose significant challenges for accurate recognition. In this paper, we propose a novel heart sound dual-branch feature extraction network (HSDFE-Net) to address these challenges. The model first extracts both conventional audio features and bi-spectrum features from preprocessed heart sound signals, enabling comprehensive characterization of their linear spectral and nonlinear properties. Unlike conventional single-branch networks, HSDFE-Net employs a dual-branch architecture where one branch processes bi-spectrum features to capture nonlinear relationships, while the other processes conventional audio features. By fusing these complementary feature sets, the network achieves a more thorough understanding of signal characteristics. Furthermore, a squeeze-and-excitation module is integrated into the conventional audio branch to adaptively emphasize key feature channels, which enhances overall model performance. Experimental results on three public datasets demonstrate that HSDFE-Net achieves accuracies of 99.00%, 99.53%, and 83.33%, validating its effectiveness and robustness in heart sound analysis and providing a promising solution for heart sound recognition.

Keywords: Heart sound signals, HSDFE-Net, Bi-spectrum features, Audio features

Subject terms: Biological techniques, Computational biology and bioinformatics

Introduction

Cardiovascular diseases pose a serious threat to human health, and early detection is critical [1]. Heart sound signals are key indicators of cardiac function and are widely used in the diagnosis of cardiovascular diseases [2]. Accurate analysis of heart sound abnormalities, such as murmurs or extra heart sounds, enables early detection of a wide range of cardiovascular diseases. However, conventional cardiac auscultation methods are highly dependent on the experience of the physician. Moreover, the human ear has limited sensitivity to sound frequencies and cannot detect all subtle abnormalities.

Machine learning techniques have developed rapidly in the field of medical signal processing, demonstrating efficient feature extraction and classification capabilities, especially in electrocardiogram and electroencephalogram classification tasks [3,4]. However, despite remarkable achievements in this field, automatic classification of heart sound signals continues to face challenges. The complexity and diversity of heart sound signals, such as variable waveform patterns, potential murmur interference, and significant individual differences, further increase the difficulty of automatic classification [5].

Currently, heart sound signal classification methods primarily focus on two aspects. The first is based on traditional machine learning methods, which rely on manually crafted features extracted from the signals, such as time-domain and frequency-domain features. However, these methods have difficulty capturing complex patterns and nonlinear relationships in the signal, limiting their classification effectiveness [6]. The second is based on deep learning methods [7]. These methods can automatically extract features and have shown particular advantages in handling the nonlinear characteristics of heart sound signals. However, certain challenges remain when faced with data diversity and high-noise environments. Furthermore, most existing studies rely solely on linear spectral representations, overlooking higher-order characteristics that may carry crucial diagnostic information. Therefore, a more robust feature extraction framework that captures both linear and nonlinear signal properties is essential for improving classification accuracy in real-world conditions.

To address the limitations of existing methods and better capture the complexity of heart sound signals, this paper proposes a heart sound dual-branch feature extraction network (HSDFE-Net). The network integrates both conventional audio features and higher-order spectral features, enabling a more comprehensive representation of heart sound characteristics. The main contributions are as follows:

  1. Comprehensive feature extraction: conventional audio features and bi-spectrum features are extracted from heart sound signals. These features encompass the linear spectral properties of the signal and the nonlinear characteristics obtained through bi-spectrum analysis, aiding in a more accurate description of the complexity of the signal.

  2. A dual-branch feature extraction network is proposed. One branch processes the bi-spectrum features, capturing the nonlinear relationships within the signal, while the other branch processes the conventional audio features. By fusing these two complementary feature sets, the network can more comprehensively understand the signal properties. This method enables the network to effectively handle the diversity of signals, thus further improving its performance.

  3. To enhance the importance of key feature channels, a Squeeze-and-Excitation Network (SENet) is introduced in the branch processing conventional audio features. This module adaptively recalibrates channel-wise feature responses, further improving the overall performance of the model.

The rest of the paper is organized as follows: Sect. 2 presents the related works. The methodology proposed in this paper is described in Sect. 3. Section 4 presents the experiments and discussion. Section 5 gives conclusions.

Related works

Early research on heart sound signal classification primarily relied on traditional machine learning methods. These methods involve extracting time-domain features, frequency-domain features, and other features from the signals. The extracted features are then classified using machine learning methods such as Support Vector Machines (SVM) and Random Forests (RF) [8,9]. For example, See et al. [10] extracted the Shannon entropy and spectral entropy of each frequency band as input to an SVM. Narváez et al. [11] extracted six power features and then compared the classification performance of SVM, K-Nearest Neighbors (KNN), RF, and Multilayer Perceptron (MLP). Yadav et al. [6] proposed a method for the automatic classification of heart sound signals. First, the extracted features were filtered using the Wilcoxon rank-sum test. These features were then fed into various classifiers. Finally, the SVM achieved an accuracy of 97.78%. In general, these methods have produced effective results in classifying heart sound signals. However, they are still inadequate in capturing the complexity and nonlinear relationships of heart sound signals. Specifically, the traditional feature selection process is unable to fully capture the features of heart sound signals, limiting the performance of the model. Table 1 summarizes the key methods discussed in Sect. 2, organized by category, with their representative method, core contribution, and references.

Table 1.

Summary of heart sound classification methods.

Category | Representative method | Core contribution | References
Traditional machine learning | SVM, RF, KNN, MLP | Combined handcrafted features with classical classifiers | 6, 8–11
Deep learning | CNN, one-dimensional CNN, bi-directional CNN | Combined classical feature engineering with CNN architecture optimization | 15–22
Deep learning | Vision Transformer, multi-CNN architecture | Modeled nonlinear and higher-order interactions based on higher-order spectral features | 23–26
Deep learning | Multi-branch CNN, attention-based CNN | Multimodal feature fusion with spatial and channel attention mechanisms to enhance key information | 27, 28, 31–33

In recent years, deep learning methods have been widely used in the task of classifying heart sound signals [12]. Typical deep learning models, such as Convolutional Neural Networks (CNN), possess automatic feature extraction capabilities [13]. Meanwhile, integrating Transformer encoders enables better capture of features in heart sound signals [14]. Li et al. [15] extracted 497 features from the heart sound signals. These features were then fed into a CNN for processing, resulting in good classification performance. Krishnan et al. [16] proposed a network architecture that combines a one-dimensional convolutional neural network with a feedforward neural network for the classification of unsegmented phonocardiogram signals. Oh et al. [17] proposed the WaveNet model for the classification of heart valve disease types. The model consists of a residual block with gated activation. Xiao et al. [18] introduced a bi-directionally connected structure in order to enhance the information flow of the CNN, which resulted in good classification performance. Bao et al. [19] investigated the selection of time-frequency distributions and the design of CNN architectures, providing new insights for the automatic diagnosis of heart sounds. However, the above studies mainly focus on the extraction of time-domain or frequency-domain features, which is still insufficient for capturing the nonlinear features and complex interactions in heart sound signals.

In the field of signal processing, the choice of feature extraction strategy directly affects the classification performance. Deng et al. [20] proposed a heart sound classification method that combines Mel-frequency Cepstral Coefficients (MFCC) features and convolutional recurrent neural networks to address the challenge of describing heart sound features using traditional methods. Kui et al. [21] extracted Mel-frequency spectral coefficients from heart sound signals. These features were then used to train a CNN. Tariq et al. [22] extracted three distinct features from audio samples: Spectrogram, MFCC, and Chromagram. They then built an ensemble of three convolutional neural network models, which effectively classified the signals. However, research has shown that features extracted using higher-order spectral analysis methods in the field of modern digital signal processing are significantly superior to those obtained through lower-order feature extraction methods [23]. For example, Liu et al. [24] proposed an innovative method based on bi-spectrum feature extraction and a Vision Transformer model for heart sound classification. Berraih et al. [25] proposed a bi-spectrum technique based on higher-order statistics for analyzing the complexity and nonlinearity of heart sound signals. They extracted 15 higher-order statistical features from 9 different heart sound signals, and the method demonstrated high potential in distinguishing pathological phonocardiogram signals. Li et al. [26] first extracted higher-order spectral features from heart sound signals and subsequently designed a multi-convolutional neural network model. Finally, by introducing a multilayer dimensionality reduction technique into this model, the performance of heart sound classification was improved.

Traditional frequency-domain features capture the spectral information of a signal, while higher-order features reveal nonlinear interactions in the signal. While the use of low-order or higher-order features alone can provide some signal description, their performance is limited. Combining these two types of features helps to describe the signal more comprehensively. Multi-branch network structures have already demonstrated their advantages in multimodal signal processing tasks. By separately processing inputs from different modalities or feature spaces, this structure can more effectively capture the diversity of signals [27,28]. However, complex feature extraction strategies may introduce a significant amount of redundant information, making it important to explore how to enhance the model's focus on key features. Studies have shown that the introduction of an attention mechanism enhances the focus of the model on key regions [29,30]. Deng et al. [31] proposed a partition attention module that allows the network to automatically focus on important regions of the bi-spectrum. Chowdhury et al. [32] proposed a spectral attention block, which emphasizes certain regions of the spectrogram through a deep attention network, thereby highlighting important features relevant to heart disease detection. Yin et al. [33] addressed the issue of inefficient feature utilization in traditional methods. They incorporated a channel attention module within a multi-scale attention convolutional block to enhance key features. This design addresses the issue of how to effectively extract and use important information in heart sound signals.

In summary, with the shift from traditional machine learning to deep learning, heart sound signal processing has made significant progress in feature extraction and classification performance. The introduction of attention mechanisms, especially the channel attention mechanism, enhances the focus of the model on key regions. Additionally, the combination of conventional features and higher-order features effectively copes with the complexity and nonlinearity of heart sound signals. It provides strong support for the accurate classification of heart sound signals.

Methodology

HSDFE-Net structure

Figure 1 illustrates the structure of HSDFE-Net. The input audio signal undergoes a series of preprocessing steps including filtering, down-sampling, and normalization to ensure clean and consistent data. The preprocessed audio signal is split into two parallel processing branches. The first branch focuses on the extraction of conventional audio features that capture the essential properties of the audio signal. Subsequently, these features are subjected to multiple convolution and pooling operations to extract a more efficient feature representation. To further improve the performance of the model, this branch also introduces an attention mechanism that allows the model to automatically focus on key features. Meanwhile, the second branch focuses on extracting bi-spectrum features from the processed audio signal. The bi-spectrum features can reveal complex hidden structures within audio signals. These bi-spectrum features are also subjected to convolution and pooling operations to extract information that is useful for the classification task. Next, the features extracted from the two branches are fused to form a feature vector containing comprehensive information. The fused feature vectors are regularized through the Dropout layer to reduce the risk of overfitting. Finally, the Fully Connected layer outputs the predicted probability for each category by learning the mapping relationship between feature vectors and category labels.

Fig. 1. HSDFE-Net structure.

Feature selection

To effectively extract key features from audio data, this paper employs two groups of features: conventional audio features and higher-order spectral features, specifically the bi-spectrum. These feature groups provide complementary information to capture both linear and nonlinear characteristics of heart sound signals.

The bi-spectrum is a type of higher-order spectral analysis based on the third-order cumulant of a signal. Unlike traditional second-order spectral methods, which only capture linear or second-order statistical properties, the bi-spectrum can detect nonlinear interactions and phase relationships among frequency components. It is defined as the two-dimensional Fourier transform of the third-order cumulant function, which mathematically characterizes the non-Gaussianity and nonlinearity of signals. There are different types of higher-order spectra, including the bi-spectrum, which captures quadratic phase coupling and nonlinear interactions between frequency components, and the fourth-order spectrum, which extends bi-spectrum analysis to even higher-order interactions.

In heart sound analysis, the bi-spectrum is particularly useful because heart sounds are often generated by complex nonlinear physiological processes. The bi-spectrum can reveal subtle nonlinear features and phase coupling that are not observable through conventional linear spectral methods. These features improve the discrimination and classification of heart sound signals, facilitating better detection of pathological conditions. For a discrete-time signal x(n), its third-order cumulant is defined as Eq. (1).

c₃(τ₁, τ₂) = E[x(n) x(n + τ₁) x(n + τ₂)]    (1)

where E[·] denotes the expectation operation, and τ₁ and τ₂ are the time delays. Next, by performing a two-dimensional Fourier transform on c₃(τ₁, τ₂), the third-order cumulant spectral density (the bi-spectrum) can be obtained, as shown in Eq. (2).

B(ω₁, ω₂) = Σ_{τ₁=−∞}^{+∞} Σ_{τ₂=−∞}^{+∞} c₃(τ₁, τ₂) e^{−j(ω₁τ₁ + ω₂τ₂)}    (2)

where ω₁ and ω₂ are frequency variables.
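The definitions above can be sketched with the direct (FFT-based) bi-spectrum estimator, which averages X(ω₁)X(ω₂)X*(ω₁+ω₂) over signal segments. The paper does not specify its estimation procedure, so the segment length and FFT size below are illustrative assumptions:

```python
import numpy as np

def bispectrum_direct(x, nfft=64, seg_len=64):
    """Direct (FFT-based) bi-spectrum magnitude estimate of a 1-D signal.

    Averages X(w1) * X(w2) * conj(X(w1 + w2)) over non-overlapping segments,
    which estimates the 2-D Fourier transform of the third-order cumulant.
    """
    x = np.asarray(x, dtype=float)
    n_seg = max(len(x) // seg_len, 1)
    acc = np.zeros((nfft, nfft), dtype=complex)
    # Wrap-around index table for the (w1 + w2) frequency term.
    idx = (np.arange(nfft)[:, None] + np.arange(nfft)[None, :]) % nfft
    for s in range(n_seg):
        seg = x[s * seg_len:(s + 1) * seg_len]
        seg = seg - seg.mean()  # remove the mean before cumulant estimation
        X = np.fft.fft(seg, nfft)
        acc += np.outer(X, X) * np.conj(X[idx])
    return np.abs(acc) / n_seg
```

The resulting magnitude map is the two-dimensional feature that the second branch consumes; by construction it is symmetric in (ω₁, ω₂).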

Conventional audio features are a set of time-domain and frequency-domain features used for audio analysis, including zero crossing rate, spectral centroid, spectral bandwidth, spectral Rolloff, MFCC, and chroma features. These features are based on the fundamental properties of the audio signal. They provide detailed information about the energy distribution, spectral characteristics, and pitch variations of the signal.

Bi-spectrum features complement information not captured by conventional audio features by revealing nonlinear relationships, while conventional audio features provide a good description of the underlying properties of the signal. The combination of the two sets of features allows the model to understand the audio signal more comprehensively, resulting in better performance in classification tasks.
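As an illustration of two of the conventional features listed above, here is a minimal numpy sketch of the frame-level zero crossing rate and spectral centroid (simplified relative to typical audio toolboxes, which also apply windowing and hop-based framing):

```python
import numpy as np

def zero_crossing_rate(frame):
    # Fraction of adjacent samples whose signs differ.
    s = np.sign(frame)
    s[s == 0] = 1
    return float(np.mean(s[:-1] != s[1:]))

def spectral_centroid(frame, sr):
    # Magnitude-weighted mean frequency of the frame's spectrum.
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return float(np.sum(freqs * mag) / (np.sum(mag) + 1e-12))
```

For a pure tone, the spectral centroid sits at the tone's frequency, while murmurs and extra heart sounds shift it by adding broadband energy.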

Dual-branch feature extraction network

Figure 2 illustrates the dual-branch feature extraction network structure. The first branch is a conventional audio feature extraction network, in which Block1 includes a convolutional layer, batch normalization, an activation function, and a pooling layer. Assuming that the conventional audio feature is denoted as x, the feature obtained after Block1 is defined as Eq. (3).

x′ = MaxPool(ReLU(BN1d(Conv1d(x))))    (3)

Fig. 2. Dual-branch feature extraction network structure.

where MaxPool represents the max pooling operation, ReLU represents the rectified linear unit, Conv1d represents the one-dimensional convolution operation, and BN1d represents batch normalization for one-dimensional input. The channel values for the three blocks are 32, 64, and 128, respectively. The weights of the features are then re-adjusted in the channel dimension by SENet to make the network focus more on key features. Assume that the feature map obtained after passing through the three Block1 layers is U; the squeeze operation is defined by Eq. (4).

z_c = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j)    (4)

where H represents the height of the feature map, W represents the width of the feature map, z_c is the global average value of channel c, and u_c(i, j) is the value of the input feature map at position (i, j) in channel c. The excitation is composed of two fully connected layers, as defined in Eq. (5).

s = σ(W₂ δ(W₁ z))    (5)

where z is the vector composed of all z_c, W₁ and W₂ are the weight matrices of the two fully connected layers, δ represents the ReLU activation, and σ represents the Sigmoid function. The feature map of each channel is then rescaled by the scaling factor generated by the excitation operation. Finally, the recalibrated feature map on channel c is obtained as defined in Eq. (6).

ũ_c = s_c · u_c    (6)

where s_c is the scaling factor for channel c. The output of the SENet is then reshaped, and a fully connected layer maps the reshaped features to the target dimension, resulting in the output feature F₁.
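The squeeze, excitation, and rescaling steps of Eqs. (4)–(6) can be sketched in numpy as follows; the weight matrices W1 and W2 are placeholders, and the reduction ratio implied by their shapes is an assumed hyperparameter:

```python
import numpy as np

def squeeze_excitation(U, W1, W2):
    """Recalibrate the channels of a (C, H, W) feature map, following Eqs. (4)-(6)."""
    z = U.mean(axis=(1, 2))                     # squeeze: per-channel global average, Eq. (4)
    h = np.maximum(W1 @ z, 0.0)                 # first FC layer + ReLU
    s = 1.0 / (1.0 + np.exp(-(W2 @ h)))         # second FC layer + Sigmoid, Eq. (5)
    return U * s[:, None, None]                 # rescale each channel by s_c, Eq. (6)
```

With zero weights, the sigmoid gates every channel at 0.5, which makes the recalibration easy to check; trained weights instead learn to amplify informative channels and suppress redundant ones.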

The second branch is a bi-spectrum feature extraction network, in which Block2 includes a convolutional layer, batch normalization, an activation function, and a pooling layer. Assuming that the bi-spectrum feature is denoted as y, the feature obtained after Block2 is defined as Eq. (7).

y′ = MaxPool(ReLU(BN2d(Conv2d(y))))    (7)

where Conv2d represents the two-dimensional convolution operation, and BN2d represents batch normalization for two-dimensional input. The channel values for the three blocks are 32, 16, and 8, respectively. The convolutional layers further extract higher-level representations of the bi-spectrum features. After batch normalization, the multidimensional feature map is flattened into a one-dimensional vector for input into the fully connected layer. The fully connected layer then maps the flattened features to the target dimension, resulting in the output feature F₂.

The outputs of Branch 1 and Branch 2 are fused by concatenation to obtain the feature F, defined in Eq. (8).

F = Concat(F₁, F₂)    (8)

A Dropout layer is applied to prevent overfitting, and a fully connected layer with Softmax produces the final output, as shown in Eq. (9).

ŷ = Softmax(W_f · Dropout(F) + b_f)    (9)
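A minimal sketch of the fusion and output stages, assuming concatenation-based fusion and an inference-time pass (so Dropout is inactive); the weight matrix W and bias b are illustrative placeholders for the final fully connected layer:

```python
import numpy as np

def fuse_and_classify(f1, f2, W, b):
    # Concatenate the two branch outputs and map them to class probabilities.
    # Dropout only perturbs activations during training, so it is omitted here.
    f = np.concatenate([f1, f2])
    logits = W @ f + b
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()
```

The output is a probability vector over the heart sound categories, matching the overview in Fig. 1 where the fully connected layer outputs a predicted probability per class.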

Algorithmic process

The process of HSDFE-Net is given by Algorithm 1. First, preprocess the audio data by applying a band-pass filter to retain frequencies between 25 Hz and 400 Hz while removing noise. The filtered audio signal is then resampled to reduce data size and lower the processing load. The audio signal is then normalized to a maximum absolute value of one. In the feature extraction stage, bi-spectrum features and conventional audio features are acquired in parallel. For the bi-spectrum features, first, calculate the bi-spectrum. Then, normalize it to the range of 0 to 255 and reshape it to meet the input requirements of the CNN. On the other hand, the mean and standard deviation of the conventional audio features are computed and integrated to form the final feature vector.
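The preprocessing steps above can be sketched as follows, assuming a 4th-order Butterworth band-pass filter (the paper does not state the filter type or order) and naive decimation for down-sampling (safe here because the preceding band-pass already limits content to 400 Hz):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(audio, sr, target_sr=1000, low=25.0, high=400.0):
    """Band-pass filter, down-sample, and peak-normalize a heart sound recording."""
    # 25-400 Hz band-pass; the 4th-order Butterworth design is an assumed choice.
    b, a = butter(4, [low, high], btype="bandpass", fs=sr)
    filtered = filtfilt(b, a, audio)
    # Naive decimation for illustration; assumes sr is a multiple of target_sr.
    down = filtered[::sr // target_sr]
    # Normalize to a maximum absolute value of one.
    return down / (np.max(np.abs(down)) + 1e-12)
```

A call such as `preprocess(x, 8000)` reproduces the three stages described in the text: filtering, down-sampling to 1000 Hz, and amplitude normalization.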

Next, two CNN branches are defined: one uses a two-dimensional convolutional neural network (2DCNN) to extract information from the bi-spectrum features, while the other uses a one-dimensional convolutional neural network (1DCNN) to extract features from the conventional audio features. To enhance the focus of the model on key features, an attention mechanism is introduced in the 1DCNN. After building the model, training and evaluation are carried out. Finally, accuracy, precision, recall, and F1-score are obtained through evaluation to fully assess the performance of HSDFE-Net.

Algorithm 1. Heart Sound Dual-Branch Feature Extraction Network (HSDFE-Net).

Experiments

Datasets

The Yaseen-PCG dataset [34] is constructed from a publicly available phonocardiogram dataset curated by Yaseen et al., designed for automatic classification of valvular heart diseases. The dataset contains one normal (N) category and four abnormal categories. The four abnormal categories are: Aortic Stenosis (AS), Mitral Stenosis (MS), Mitral Regurgitation (MR), and Mitral Valve Prolapse (MVP). After thorough cross-checks to ensure data quality, recordings with extreme noise or artifacts are excluded. The remaining heart sound signals are resampled at an 8,000 Hz sampling rate, converted to mono-channel format, and standardized to contain three cardiac cycles.

Both the PhysioNet-e and PhysioNet-d datasets [35] are derived from subsets of the PhysioNet Challenge 2016 heart sound database. This large-scale dataset comprises recordings collected in both clinical and non-clinical environments from adult and pediatric subjects, encompassing a normal (N) category and an abnormal (AN) category. Each subject contributes between one and six phonocardiogram recordings, all resampled to 2,000 Hz and stored in single-channel WAV format. The datasets are curated by excluding severely noisy or ambiguous samples, retaining only those with clear signal quality to ensure consistency and reliability for downstream classification tasks. The datasets are divided into training, validation, and test sets in an 8:1:1 ratio. The details of the datasets used in the experiment are shown in Table 2.

Table 2.

Dataset details.

Dataset | Category | Description | Number
Yaseen-PCG | AS | Aortic stenosis | 200
 | MR | Mitral regurgitation | 200
 | MS | Mitral stenosis | 200
 | MVP | Mitral valve prolapse | 200
 | N | Normal | 200
PhysioNet-e | N | Normal | 183
 | AN | Abnormal | 1958
PhysioNet-d | AN | Abnormal | 27
 | N | Normal | 28

The experiments are conducted using Python on Windows 10. The hardware environment includes an NVIDIA RTX 3060 GPU with 12 GB of video memory, and the CUDA version is 11.7. The Adam optimizer is used to optimize the objective; it is widely applied in deep learning tasks due to its good convergence and stability. The learning rate is set to 0.001, a commonly used default value that provides good results without compromising convergence. The training process consists of 200 epochs, and the batch size is set to 16.

Evaluation metrics

To evaluate the performance of HSDFE-Net, accuracy, precision, recall, and F1-score [34] are used. Suppose TP denotes true positives, TN denotes true negatives, FP denotes false positives, and FN denotes false negatives. Accuracy, precision, recall, and F1-score are calculated as shown in Eqs. (10) to (13). In addition, the macro-averaged F1-score (mF1) is calculated by averaging the F1-scores across all classes, providing a balanced evaluation for multi-class classification tasks.

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (10)
Precision = TP / (TP + FP)    (11)
Recall = TP / (TP + FN)    (12)
F1-score = (2 × Precision × Recall) / (Precision + Recall)    (13)
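Eqs. (10)–(13) can be computed directly from the four counts:

```python
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. (10)
    precision = tp / (tp + fp)                          # Eq. (11)
    recall = tp / (tp + fn)                             # Eq. (12)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (13)
    return accuracy, precision, recall, f1
```

For example, with TP = 8, TN = 90, FP = 2, FN = 0, precision is 0.8 and recall is 1.0, so accuracy is dominated by the large TN count while the F1-score reflects the balance between precision and recall.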

Experiments and discussion

HSDFE-Net performance

In the processing of heart sound signals, various noise disturbances are usually present in the original audio data. To extract a purer heart sound signal, it is first necessary to digitally filter the audio file. Filtering is used to improve the signal-to-noise ratio, making subsequent analysis and processing more accurate. After filtering, the processed audio signal is down-sampled to reduce the computational load of the model, lowering the sampling rate to 1000 Hz. Furthermore, because audio files in different datasets may have significant differences in amplitude and time scales, all audio signals are normalized to ensure that the data format is uniform. Figure 3 shows the stages of audio signal processing: (a) original audio signal, (b) filtered audio signal, (c) down-sampled signal, and (d) normalized signal.

Fig. 3. Audio signal preprocessing. (a) original signal. (b) filtered audio signal. (c) down-sampled signal. (d) normalized signal.

Figures 4 (a) and (b) demonstrate the bi-spectrum feature plots of heart sound signals in abnormal and normal states. From the figures, it can be observed that the bi-spectrum plots of abnormal heart sound signals exhibit more dispersed features, indicating that their frequency components are more complex and widely distributed. On the other hand, the bi-spectrum plots of the normal heart sound signal display a more concentrated energy distribution and a more compact spectral pattern. This suggests that heart sound signals of different states have significant feature distribution differences in the frequency domain. By exploiting these differences, the model can distinguish normal and abnormal heart sound signals more efficiently.

Fig. 4. Contour plot of the bi-spectrum features. (a) abnormal heart sounds. (b) normal heart sounds.

Table 3 shows the experimental results of HSDFE-Net on the three datasets. On the Yaseen-PCG dataset, the model shows high performance in classifying different categories of heart sound signals. Specifically, the precision, recall, and F1-score for the AS, MS, and N categories all reach 100%, and its mF1 and accuracy are both 99.00%. On the PhysioNet-e dataset, HSDFE-Net demonstrates excellent classification ability, achieving an accuracy of 99.53%. On the PhysioNet-d dataset, HSDFE-Net achieves an mF1 of 82.86% and an accuracy of 83.33%. These results demonstrate the effectiveness of HSDFE-Net in heart sound signal classification tasks.

Table 3.

Experimental results of HSDFE-Net.

Dataset | Category | Precision (%) | Recall (%) | F1-score (%) | Accuracy (%)
Yaseen-PCG | AS | 100.00 | 100.00 | 100.00 | 99.00
 | MR | 100.00 | 95.00 | 97.44 |
 | MS | 100.00 | 100.00 | 100.00 |
 | MVP | 95.24 | 100.00 | 97.56 |
 | N | 100.00 | 100.00 | 100.00 |
 | Average | 99.05 | 99.00 | 99.00 |
PhysioNet-e | AN | 100.00 | 94.44 | 97.14 | 99.53
 | N | 99.49 | 100.00 | 99.75 |
 | Average | 99.75 | 97.22 | 98.45 |
PhysioNet-d | AN | 75.00 | 100.00 | 85.71 | 83.33
 | N | 100.00 | 66.67 | 80.00 |
 | Average | 87.50 | 83.34 | 82.86 |

Ablation experiments

Ablation experiments reveal the effect of the different modules on the classification performance of heart sound signals. Table 4 shows the results, where 1DCNN denotes the conventional audio feature extraction network and 2DCNN denotes the bi-spectrum feature extraction network. The experiment compares the F1-scores of using 1DCNN alone, 2DCNN alone, and the full HSDFE-Net. On the Yaseen-PCG dataset, there is a significant difference in F1-score across categories when using 1DCNN alone: for example, the F1-score is 95.24% for AS, 87.18% for MR, and 100% for N. When 2DCNN is used alone, the mF1 is slightly higher than that of 1DCNN, at 94.04%. However, when the two modules are combined, the mF1 reaches 99.00%, indicating that the two branches effectively complement each other. On the PhysioNet-e dataset, the F1-score of 2DCNN on the AN category is only 78.79%; when the two modules are combined, it increases dramatically to 97.14%. In summary, HSDFE-Net significantly improves the classification performance of heart sound signals by combining the advantages of 1DCNN and 2DCNN.

Table 4.

Ablation experiment results: F1-score (%).

Dataset | Category | 1DCNN | 2DCNN | HSDFE-Net
Yaseen-PCG | AS | 95.24 | 90.91 | 100.00
 | MR | 87.18 | 91.89 | 97.44
 | MS | 97.56 | 97.44 | 100.00
 | MVP | 89.47 | 94.74 | 97.56
 | N | 100.00 | 95.24 | 100.00
 | Average | 93.89 | 94.04 | 99.00
PhysioNet-e | AN | 99.49 | 78.79 | 97.14
 | N | 94.12 | 98.24 | 99.75
 | Average | 96.81 | 88.52 | 98.45
PhysioNet-d | AN | 66.67 | 66.67 | 85.71
 | N | 66.67 | 66.67 | 80.00
 | Average | 66.67 | 66.67 | 82.86

Table 5 compares the F1-scores of HSDFE-Net on the three datasets with and without the attention mechanism, allowing us to analyze its influence on the classification performance. On the Yaseen-PCG dataset, incorporating the attention mechanism improves most categories to varying degrees, and the mF1 increases from 96.94% to 99.00%. On the PhysioNet-e dataset, the mF1 increases significantly from 92.22% to 98.45%. Notably, the PhysioNet-d dataset shows the most substantial relative gain, with the mF1 jumping from 62.50% to 82.86%. As a result, the attention mechanism has a significant impact on the classification performance of heart sound signals.

Table 5.

Results of ablation experiment with and without attention: F1-score (%).

Dataset | Category | Without attention | HSDFE-Net
Yaseen-PCG | AS | 100.00 | 100.00
 | MR | 91.89 | 97.44
 | MS | 100.00 | 100.00
 | MVP | 95.24 | 97.56
 | N | 97.56 | 100.00
 | Average | 96.94 | 99.00
PhysioNet-e | AN | 85.71 | 97.14
 | N | 98.73 | 99.75
 | Average | 92.22 | 98.45
PhysioNet-d | AN | 50.00 | 85.71
 | N | 75.00 | 80.00
 | Average | 62.50 | 82.86

Figures 5 (a) and (b) show the confusion matrices on the Yaseen-PCG dataset without and with the attention mechanism. The rows of the confusion matrix represent the actual categories and the columns represent the predicted categories. In Fig. 5 (a), three samples are incorrectly classified; in Fig. 5 (b), only one sample is misclassified. This confirms the benefit of incorporating the attention mechanism.

Fig. 5. Confusion matrix for the Yaseen-PCG dataset. (a) Without attention. (b) With attention.

Figure 6 compares the actual and predicted labels on the Yaseen-PCG dataset. After adding the attention mechanism, the predicted-label curve almost coincides with the actual-label curve, with only a few individual sample points deviating. This result is consistent with the preceding confusion-matrix analysis: the attention mechanism markedly reduced classification errors and improved the model's classification accuracy.

Fig. 6. Comparison of actual and predicted labels. (a) Without attention. (b) With attention.

Comparison experiments

Table 6 presents the results of using different audio features as inputs to the 2DCNN. This paper compares three features: MFCC, discrete wavelet transform (DWT), and bi-spectrum. On the Yaseen-PCG dataset, the model achieves its highest classification accuracy, 94.00%, when the bi-spectrum is used. On the PhysioNet-e dataset, the bi-spectrum also performs well: although its accuracy is slightly lower than that of MFCC, its mF1 of 88.52% is more balanced. Overall, the bi-spectrum performs better in both classification tasks, demonstrating its advantage in capturing the complex characteristics of heart sound signals.

Table 6.

Experimental results of selecting different features.

Dataset Feature Accuracy (%) mF1 (%)
Yaseen-PCG MFCC 87.00 86.87
DWT 82.00 81.26
Bi-spectrum 94.00 94.04
PhysioNet-e MFCC 95.35 87.34
DWT 81.86 58.04
Bi-spectrum 93.02 88.52
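As a rough illustration of the bi-spectrum feature compared above, the following sketch estimates the magnitude bi-spectrum of a 1-D signal by the direct FFT-based method with segment averaging. The segment count, FFT size, and absence of windowing are simplifying assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np

def bispectrum(x, nfft=64, nseg=8):
    """Direct (FFT-based) bi-spectrum estimate with segment averaging:
    B(f1, f2) = E[X(f1) * X(f2) * conj(X(f1 + f2))].
    A large bi-spectrum magnitude indicates quadratic phase coupling,
    i.e. nonlinear interactions that power-spectrum features miss."""
    segments = np.array_split(np.asarray(x, dtype=float), nseg)
    acc = np.zeros((nfft // 2, nfft // 2), dtype=complex)
    for seg in segments:
        seg = seg - seg.mean()            # remove per-segment DC offset
        X = np.fft.fft(seg, nfft)
        for f1 in range(nfft // 2):
            for f2 in range(nfft // 2):
                acc[f1, f2] += X[f1] * X[f2] * np.conj(X[(f1 + f2) % nfft])
    return np.abs(acc) / nseg             # averaged magnitude bi-spectrum
```

The resulting 2-D magnitude map is what a 2DCNN branch can consume as an image-like input.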

Figure 7 compares the predicted and actual labels under the different feature choices. Feature choice visibly affects the predictions: the labels predicted with the bi-spectrum align most closely with the actual labels, whereas those predicted with DWT deviate noticeably.

Fig. 7. Comparison results of different feature selections on the Yaseen-PCG dataset. (a) MFCC. (b) DWT. (c) Bi-spectrum.

To ensure meaningful comparisons, we selected four reference methods: RF11, SVM10, MLP11, and CTENN14. RF and SVM are traditional classifiers that depend on hand-crafted features; the former aggregates multiple decision trees, while the latter constructs optimal separation boundaries in feature space. MLP is a basic neural model with limited capacity for temporal features. CTENN, by contrast, combines one-dimensional convolution with Transformer encoders, learning directly from raw heart sound signals without explicit preprocessing. These baselines span conventional and deep learning approaches, offering a diverse benchmark for validating HSDFE-Net. Table 7 shows the comparison results. HSDFE-Net consistently achieves the highest F1-scores on both the Yaseen-PCG and PhysioNet-e datasets. On Yaseen-PCG, it reaches 100% on three of the five categories (AS, MS, N) and exceeds all baselines in the remaining two (MR, MVP), demonstrating superior sensitivity to subtle pathological variations. On PhysioNet-e, it attains 99.75% for the N class and markedly improves the detection of AN cases with an F1-score of 97.14%, outperforming the second-best method (CTENN) by more than 6 percentage points. These results confirm the model's strong class-wise discrimination, especially under challenging conditions such as class imbalance and inter-class similarity.

Table 7.

Comparison of experimental results: F1-score (%).

Dataset Category RF SVM MLP CTENN HSDFE-Net
Yaseen-PCG AS 90.91 90.91 93.02 100.00 100.00
MR 94.74 87.18 90.00 91.89 97.44
MS 100.00 92.68 97.44 97.44 100.00
MVP 84.21 83.33 84.21 93.02 97.56
N 100.00 100.00 100.00 97.56 100.00
Average 93.97 90.82 92.93 95.98 99.00
PhysioNet-e N 98.75 97.96 98.73 99.24 99.75
AN 83.87 78.95 86.49 90.91 97.14
Average 91.31 88.46 92.61 95.08 98.45
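The three traditional baselines (RF, SVM, MLP) can be reproduced in outline with scikit-learn. The synthetic features below are a stand-in for the hand-crafted features used in the cited works, so the resulting scores are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic two-class "heart sound features" standing in for real ones.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

baselines = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(kernel="rbf"),
    "MLP": MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0),
}
# Macro-averaged F1 matches the paper's per-class averaging (mF1).
scores = {name: f1_score(yte, m.fit(Xtr, ytr).predict(Xte), average="macro")
          for name, m in baselines.items()}
```

With real heart sound features substituted for the synthetic ones, this harness yields baseline F1-scores comparable in form to those in Table 7.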

The classification accuracy and mF1 of the compared models are shown in Table 8. On the Yaseen-PCG dataset, HSDFE-Net reaches 99.00% accuracy, which is 5 percentage points higher than RF, 8 higher than SVM, 6 higher than MLP, and 3 higher than CTENN. On the PhysioNet-e dataset, it records 99.53% accuracy and 98.45% mF1, outperforming CTENN by 3.37 percentage points in mF1. Most notably, on the PhysioNet-d dataset, HSDFE-Net achieves 82.86% mF1, surpassing all baselines by a large margin: 49.53 percentage points higher than RF and 16.19 higher than the best baseline (CTENN). The consistently higher mF1 values indicate that HSDFE-Net maintains balanced performance across all classes, further validating its generalization capability and robustness across diverse heart sound classification tasks.

Table 8.

Comparison of experimental results: accuracy (%) and mF1 (%).

Dataset Model Accuracy (%) mF1 (%)
Yaseen-PCG RF 94.00 93.97
SVM 91.00 90.82
MLP 93.00 92.93
CTENN 96.00 95.98
HSDFE-Net 99.00 99.00
PhysioNet-e RF 97.67 91.31
SVM 96.28 88.46
MLP 97.67 92.61
CTENN 98.60 95.08
HSDFE-Net 99.53 98.45
PhysioNet-d RF 33.33 33.33
SVM 66.67 66.67
MLP 50.00 48.57
CTENN 66.67 66.67
HSDFE-Net 83.33 82.86
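For reference, the mF1 reported throughout these tables is the macro-averaged F1-score: the per-class F1 values averaged without class weighting, so that minority classes count equally. A minimal NumPy implementation (our own sketch, not the authors' code) is:

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Macro-averaged F1 (mF1): per-class F1, unweighted mean."""
    classes = np.unique(np.concatenate([y_true, y_pred]))
    f1s = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))
```

Because every class contributes equally, a model that collapses onto the majority class is penalized heavily, which is why mF1 is a stricter summary than accuracy on imbalanced datasets such as PhysioNet-e.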

Figure 8 compares the actual and predicted labels of the three models, RF, SVM, and MLP, on the Yaseen-PCG dataset. Although these models capture the features of the heart sound signal to some extent, a number of sample points are still misclassified. Compared with HSDFE-Net, their prediction accuracy and consistency remain noticeably inferior.

Fig. 8. Comparison of actual and predicted labels for the baseline models. (a) RF. (b) SVM. (c) MLP.

Discussion

This paper proposes a deep learning model for heart sound classification, named HSDFE-Net. Experiments on three public datasets show that HSDFE-Net significantly outperforms the compared models in classification performance.

In the model design, a 1DCNN and a 2DCNN are used to process the two feature types. The 1DCNN extracts conventional audio features, including the zero-crossing rate, spectral centroid, spectral bandwidth, spectral roll-off, MFCC, and chroma features. The 2DCNN processes the bi-spectrum features. The bi-spectrum is a higher-order representation that can capture complex spectral patterns in heart sound signals, and the 2DCNN fully exploits convolution operations to extract spatial features from it, enhancing the model's ability to understand the signal.
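Two of the conventional features named above can be sketched directly in NumPy. These are frame-level textbook definitions; library implementations such as librosa differ in framing and normalization details.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign changes."""
    signs = np.signbit(frame).astype(int)
    return float(np.mean(np.abs(np.diff(signs))))

def spectral_centroid(frame, sr):
    """Magnitude-weighted mean frequency of the frame's spectrum (Hz)."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return float(np.sum(freqs * mag) / (np.sum(mag) + 1e-12))
```

For a pure 100 Hz tone sampled at 1 kHz, the zero-crossing rate is about 0.2 (two crossings per period) and the centroid is about 100 Hz, matching the intuition behind both features.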

The attention mechanism plays a key role in the performance improvement of HSDFE-Net. The ablation experiment shows that incorporating it significantly improves the F1-score across nearly all categories of heart sound signals. Yin et al.33 pointed out that the insufficient feature utilization of traditional methods can be addressed with a channel attention module, and SENet improves the efficiency and accuracy of feature extraction by increasing the channel importance of key features. In contrast, the 2DCNN already extracts spatial features thoroughly from the bi-spectrum via convolution, so adding SENet to that branch would increase computational complexity without a significant gain; the module is therefore applied only in the conventional audio branch.
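The squeeze-and-excitation operation described above reduces, in outline, to three steps: squeeze, excite, rescale. A minimal NumPy sketch for a 1-D feature map follows; the weights W1 and W2 are hypothetical stand-ins for the module's two learned dense layers, and bias terms are omitted for brevity.

```python
import numpy as np

def se_block(feature_map, W1, W2):
    """Squeeze-and-excitation over a (C, L) 1-D feature map (sketch).
    Squeeze: global average pool per channel.
    Excite: bottleneck dense layer (ReLU) then expansion (sigmoid).
    Rescale: multiply each channel by its learned gate in (0, 1)."""
    z = feature_map.mean(axis=1)             # squeeze -> (C,)
    h = np.maximum(0.0, W1 @ z)              # bottleneck, ReLU
    s = 1.0 / (1.0 + np.exp(-(W2 @ h)))      # sigmoid gates -> (C,)
    return feature_map * s[:, None]          # channel-wise rescale
```

Because each gate lies in (0, 1), the block can only attenuate channels, letting training concentrate the signal in the most informative ones.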

In the comparison of feature extraction methods, this paper evaluates three features: MFCC, DWT, and bi-spectrum. The bi-spectrum provides richer higher-order information, enabling it to capture complex patterns in heart sound signals more effectively. In contrast, MFCC and DWT performed relatively weakly on heart sound signals.

HSDFE-Net shows consistent superiority over both traditional machine learning and deep learning methods on all datasets. RF improves accuracy by constructing multiple decision trees and aggregating their predictions, with advantages in handling high-dimensional data and resisting overfitting. SVM performs strongly with small samples and in high-dimensional spaces. However, because the features of heart sound signals exhibit complex nonlinear relationships, both methods show limitations in modeling them. MLP, as a feed-forward neural network, can learn such nonlinear relationships and performs slightly better than SVM in the classification tasks, but its generalization ability remains limited. CTENN integrates one-dimensional convolutional layers with Transformer encoders to extract features automatically from raw signals, representing a strong deep learning baseline. Nevertheless, HSDFE-Net extracts more comprehensive information from heart sound signals and thus shows significant advantages in the classification tasks.

In summary, while HSDFE-Net achieves promising classification performance, this study has several limitations that should be acknowledged. First, the model’s evaluation was conducted on a limited number of datasets, which may not fully represent the variability encountered in clinical practice. Second, the computational complexity of the bi-spectrum feature extraction and the dual-branch network architecture could pose challenges for real-time applications on resource-constrained devices.

Conclusion

In this paper, a dual-branch network for heart sound analysis with attention and multimodal features is proposed. Combining conventional audio features with higher-order features effectively improves the model's ability to describe signal complexity. Specifically, the branch handling bi-spectrum features captures nonlinear relationships in the signal, while the branch handling conventional audio features provides a stable spectral characterization; fusing the two allows the network to understand heart sound signals comprehensively and accurately. Feature extraction is then performed by the 1DCNN and 2DCNN separately, enhancing the ability to analyze features across different dimensions, and the introduction of SENet allows the model to better capture and exploit key information in the signal. The experimental results validate the effectiveness of HSDFE-Net, which achieves accuracies of 99.00%, 99.53%, and 83.33%, with mF1 of 99.00%, 98.45%, and 82.86%, respectively.

Future work will focus on two key improvements to enhance HSDFE-Net’s clinical applicability: (1) extensive validation on larger heart sound datasets to assess generalizability across diverse populations and recording conditions; (2) optimization of the bi-spectrum feature extraction pipeline to improve computational efficiency for real-time applications.

Acknowledgements

This research was funded by Natural Science Research of Jiangsu Higher Education Institutions of China (Grant 23KJD520011) and Nantong Social Livelihood Science and Technology Plan (Directive Project) (Grant MS2024019).

Author contributions

H.C. conceptualized the study, developed the methodology, conducted the experiments, and wrote the main manuscript text. W.G. collected the data, performed the statistical analysis, prepared the figures, contributed to the interpretation of the results, and provided critical revisions. All authors reviewed and approved the final manuscript.

Data availability

Yaseen-PCG dataset: the data that support the findings of this study are openly available at [https://github.com/yaseen21khan/Classification-of-Heart-Sound-Signal-Using-Multiple-Features-]. PhysioNet-e and PhysioNet-d datasets: the data that support the findings of this study are openly available at [https://physionet.org/content/challenge-2016/1.0.0/#files].

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Xu, W. et al. Automatic pediatric congenital heart disease classification based on heart sound signal. Artif. Intell. Med. 126, 102257 (2022).
  • 2. Chen, J., Guo, Z., Xu, X., Jeon, G. & Camacho, D. Artificial intelligence for heart sound classification: A review. Expert Syst. 41(4), e13535 (2024).
  • 3. Ketu, S. & Mishra, P. K. Empirical analysis of machine learning algorithms on imbalance electrocardiogram based arrhythmia dataset for heart disease detection. Arab. J. Sci. Eng. 47(2), 1447–1469 (2022).
  • 4. Alqahtani, A. et al. Classifying electroencephalogram signals using an innovative and effective machine learning method based on chaotic elephant herding optimum. Expert Syst. 42(1), e13383 (2023).
  • 5. Shuvo, S. B. et al. NRC-Net: automated noise robust cardio net for detecting valvular cardiac diseases using optimum transformation method with heart sound signals. Biomed. Signal Process. Control 86, 105272 (2023).
  • 6. Yadav, A., Singh, A., Dutta, M. K. & Travieso, C. M. Machine learning-based classification of cardiac diseases from PCG recorded heart sounds. Neural Comput. Appl. 32(24), 17843–17856 (2020).
  • 7. Ren, Z. et al. A comprehensive survey on heart sound analysis in the deep learning era. IEEE Comput. Intell. Mag. 19(3), 42–57 (2024).
  • 8. Li, J., Ke, L. & Du, Q. Classification of heart sounds based on the wavelet fractal and twin support vector machine. Entropy 21(5), 472 (2019).
  • 9. Esmail, M. Y., Ahmed, D. H. & Eltayeb, M. Classification system for heart sounds based on random forests. J. Clin. Eng. 44(2), 76–80 (2019).
  • 10. See, A. R., Cabili, I. S. & Chen, Y. J. Heart sound classification based on nonlinear time-frequency features. Sens. Mater. 34(1), 217–223 (2022).
  • 11. Narváez, P., Gutierrez, S. & Percybrooks, W. S. Automatic segmentation and classification of heart sounds using modified empirical wavelet transform and power features. Appl. Sci. 10(14), 4791 (2020).
  • 12. Chen, W. et al. Deep learning methods for heart sounds classification: A systematic review. Entropy 23(6), 667 (2021).
  • 13. Bozkurt, B., Germanakis, I. & Stylianou, Y. A study of time-frequency features for CNN-based automatic heart sound classification for pathology detection. Comput. Biol. Med. 100, 132–143 (2018).
  • 14. Cheng, J. & Sun, K. Heart sound classification network based on convolution and transformer. Sensors 23(19), 8168 (2023).
  • 15. Li, F., Tang, H., Shang, S., Mathiak, K. & Cong, F. Classification of heart sounds using convolutional neural network. Appl. Sci. 10(11), 3956 (2020).
  • 16. Krishnan, P. T., Balasubramanian, P. & Umapathy, S. Automated heart sound classification system from unsegmented phonocardiogram (PCG) using deep neural network. Phys. Eng. Sci. Med. 43(2), 505–515 (2020).
  • 17. Oh, S. L. et al. Classification of heart sound signals using a novel deep WaveNet model. Comput. Methods Programs Biomed. 196, 105604 (2020).
  • 18. Xiao, B., Xu, Y., Bi, X., Zhang, J. & Ma, X. Heart sounds classification using a novel 1-D convolutional neural network with extremely low parameter consumption. Neurocomputing 392, 153–159 (2020).
  • 19. Bao, X. et al. Time-frequency distributions of heart sound signals: A comparative study using convolutional neural networks. Biomed. Eng. Adv. 5, 100093 (2023).
  • 20. Deng, M. et al. Heart sound classification based on improved MFCC features and convolutional recurrent neural networks. Neural Netw. 130, 22–32 (2020).
  • 21. Kui, H., Pan, J., Zong, R., Yang, H. & Wang, W. Heart sound classification based on log Mel-frequency spectral coefficients features and convolutional neural networks. Biomed. Signal Process. Control 69, 102893 (2021).
  • 22. Tariq, Z., Shah, S. K. & Lee, Y. Feature-based fusion using CNN for lung and heart sound classification. Sensors 22(4), 1521 (2022).
  • 23. Alquran, H., Alqudah, A. M., Abu-Qasmieh, I., Al-Badarneh, A. & Almashaqbeh, S. ECG classification using higher order spectral estimation and deep learning techniques. Neural Netw. World 29(4), 207–219 (2019).
  • 24. Liu, Z. et al. Heart sound classification based on bi-spectrum features and vision transformer mode. Alexandria Eng. J. 85, 49–59 (2023).
  • 25. Berraih, S. A., Baakek, Y. N. E. & Debbal, S. M. E. A. Pathological discrimination of the phonocardiogram signal using the bispectral technique. Phys. Eng. Sci. Med. 43(4), 1371–1385 (2020).
  • 26. Li, Y. et al. Heart sounds classification based on high-order spectrogram and multi-convolutional neural network after a new screening strategy. Adv. Theory Simul. 7(1), 2300549 (2024).
  • 27. Wang, R. et al. PCTMF-Net: heart sound classification with parallel CNNs-transformer and second-order spectral analysis. Visual Comput. 39(8), 3811–3822 (2023).
  • 28. Ranipa, K., Zhu, W. P. & Swamy, M. N. S. A novel feature-level fusion scheme with multimodal attention CNN for heart sound classification. Comput. Methods Programs Biomed. 248, 108122 (2024).
  • 29. Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7132–7141 (2018).
  • 30. Wang, Q. et al. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11534–11542 (2020).
  • 31. Deng, E., Jia, Y., Zhu, G. & Zhou, E. Heart sound signals classification with image conversion employed. Electronics 13(7), 1179 (2024).
  • 32. Chowdhury, S., Morshed, M. & Fattah, S. A. SpectroCardioNet: An attention-based deep learning network using triple-spectrograms of PCG signal for heart valve disease detection. IEEE Sens. J. 22(23), 22799–22807 (2022).
  • 33. Yin, C. et al. Detection of coronary artery disease based on clinical phonocardiogram and multiscale attention convolutional compression network. IEEE J. Biomed. Health Inform. 28(3), 1353–1362 (2024).
  • 34. Yaseen, Son, G. Y. & Kwon, S. Classification of heart sound signal using multiple features. Appl. Sci. 8(12), 2344 (2018).
  • 35. Liu, C. et al. Classification of heart sound recordings: the PhysioNet Computing in Cardiology Challenge 2016. PhysioNet (2016).



Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
