2022 Nov 7;213:119212. doi: 10.1016/j.eswa.2022.119212

Fruit-CoV: An efficient vision-based framework for speedy detection and diagnosis of SARS-CoV-2 infections through recorded cough sounds

Long H Nguyen a,1, Nhat Truong Pham b,c,⁎⁎,1, Van Huong Do d, Liu Tai Nguyen a, Thanh Tin Nguyen e, Hai Nguyen f, Ngoc Duy Nguyen g, Thanh Thi Nguyen h, Sy Dzung Nguyen i,j, Asim Bhatti g, Chee Peng Lim g
PMCID: PMC9639421  PMID: 36407848

Abstract

COVID-19 is an infectious disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). This deadly virus has spread worldwide, leading to a global pandemic since March 2020. A recent variant of SARS-CoV-2 named Delta is intractably contagious and responsible for more than four million deaths globally. Therefore, developing an efficient self-testing service for SARS-CoV-2 at home is vital. In this study, a two-stage vision-based framework, namely Fruit-CoV, is introduced for detecting SARS-CoV-2 infections through recorded cough sounds. Specifically, audio signals are converted into Log-Mel spectrograms, and the EfficientNet-V2 network is used to extract their visual features in the first stage. In the second stage, 14 convolutional layers extracted from the large-scale Pretrained Audio Neural Networks for audio pattern recognition (PANNs) and the Wavegram-Log-Mel-CNN are employed to aggregate feature representations of the Log-Mel spectrograms and the waveform. Finally, the combined features are used to train a binary classifier. In this study, a dataset provided by the AICovidVN 115M Challenge is employed for evaluation. It includes 7,371 recorded cough sounds collected throughout Vietnam, India, and Switzerland. Experimental results indicate that the proposed model achieves an Area Under the Receiver Operating Characteristic Curve (AUC) score of 92.8% and ranks first on the final leaderboard of the AICovidVN 115M Challenge. Our code is publicly available.

Keywords: Sound classification, COVID-19, Recorded cough sounds, Delta variant, EfficientNet, SARS-CoV-2 infections, Deep learning, Neural network, Machine vision, Remote detection, Speedy detection, PANNs, Log-Mel spectrogram, Self-testing service, Wavegram


1. Introduction

According to the World Health Organization, the COVID-19 pandemic is responsible for more than 4.4 million deaths and nearly 210 million confirmed cases worldwide, as of August 2021. As a result, it has seriously affected the world economy, especially the tourism industry, according to the report by Mishra, Urolagin, Jothi, Neogi, and Nawaz (2021). Therefore, many approaches have been proposed to detect COVID-19 infections and control the pandemic (Imran, et al., 2020, Miao et al., 2022, Mohammed, et al., 2021, Sainz-Pardo and Valero, 2021). Recent advances in deep learning and data analytics provide novel methods that can monitor and track COVID-19 infections (Castorina et al., 2020, Pham, et al., 2020, Ting et al., 2020, Vaishya et al., 2020), estimate potential outbreaks (Giordano, et al., 2020, Heroy, 2020, Peng, et al., 2020), or detect and diagnose COVID-19 by using CT images, cough sounds, and breath sounds (Balaha et al., 2021, Coppock, et al., 2021, Kumar and Alphonse, 2021, Li, et al., 2021, Morís et al., 2021, Nessiem, et al., 2021).

Researchers have analyzed genomic data and protein structures to facilitate the development of vaccines for preventing SARS-CoV-2 (Abdel-Basset, et al., 2020, Abdelmageed, et al., 2020, Banerjee et al., 2020, Bock and Ortea, 2020, Li, et al., 2020, Nguyen, et al., 2021). Currently, there are three main types of SARS-CoV-2 tests: nucleic acid detection tests to detect SARS-CoV-2 viral RNA (ribonucleic acid); rapid antigen tests to detect viral antigen proteins from the SARS-CoV-2 virus; and serology tests to detect IgM and/or IgG antibodies against SARS-CoV-2. These tests are expensive and time-consuming to conduct. Furthermore, they require the physical presence of patients. It is, therefore, crucial to design an inexpensive self-testing service that can be used at home and delivers an instant and accurate outcome. This study investigates such a method for detecting and diagnosing SARS-CoV-2 through cough sounds recorded using smartphones.

In this work, a vision-based model is proposed that first transforms recorded audio signals into Log-Mel spectrograms. Then, a two-stage deep neural network with supervised learning is introduced to determine infected or non-infected subjects. The first stage involves the backbone EfficientNet-V2 network (Tan & Le, 2021) that extracts visual features of Log-Mel spectrograms. In the second stage, a Wavegram-Log-Mel-CNN is employed, followed by 14 convolutional layers extracted from the PANNs (Kong, et al., 2020) to further process feature representations of the Log-Mel spectrograms and waveforms. The features are then aggregated to detect infected cough sounds. Moreover, to improve generalization and to focus on important features, a generalized mean pooling (GeM pooling) layer (Radenović, Tolias, & Chum, 2018) is added after the EfficientNet-V2 (EffNetV2) backbone.

For evaluation, a dataset provided by the AICovidVN 115M Challenge is used to verify the efficiency of the proposed framework. It combines two public datasets from India and Switzerland and a collection of recorded cough sounds in Vietnam. As a result, 4,508 training sounds, 1,230 public test sounds, and 1,627 private test sounds are available. The proposed framework achieves an AUC (area under the Receiver Operating Characteristic Curve) score of 92.8% on the private test set. Moreover, the proposed Fruit-CoV framework has been deployed as a Docker service that can be easily integrated into call centers, smartphone-based applications, and VoIP systems to facilitate SARS-CoV-2 detection. Our code is publicly available at https://git.io/JMYAx.

In summary, this study makes the following contributions:

  • Two different CNN-based networks are proposed to extract features of recorded cough signals: one using Log-Mel spectrograms and one using a Wavegram-Log-Mel-CNN. The extracted features are combined before feeding into Fruit-CoV.

  • A two-stage vision-based framework, namely Fruit-CoV, is designed using EffNetV2 and PANNs to extract high-level representative features from Log-Mel spectrograms and Wavegram-Log-Mel-CNN.

  • A comprehensive evaluation is conducted with data processed according to sampling rate. Specifically, three dominant sampling rates (i.e., 4 kHz, 8 kHz, and 48 kHz) are selected, and three corresponding Fruit-CoV models are employed. A GeM pooling layer is added to the tail of the EffNetV2 backbone. The aggregated features in the second stage are used to classify whether a recorded cough sound is positive or negative.

  • The proposed framework has been deployed as a Docker service that can be easily integrated into existing applications and systems.

The remaining parts of this paper are organized as follows. Related studies are summarized in Section 2. The proposed framework, i.e., Fruit-CoV, is explained in detail in Section 3. In Section 4, the evaluation results are presented. Finally, concluding remarks are presented in Section 5.

2. Related studies

Developing low-cost solutions for detecting SARS-CoV-2 through cough and breath sounds has become an important research topic. In Laguarta, Hueto, and Subirana (2020), transfer learning was used in an Artificial Intelligence (AI)-based diagnostic system using recorded cough sounds. In particular, Mel Frequency Cepstral Coefficients (MFCCs) are extracted from sounds to train a classifier based on three pre-trained ResNet50s (He, Zhang, Ren, & Sun, 2016) modules and a Poisson biomarker layer. The model achieves an accuracy of 95% and 99.9% on groups of 25 subjects with five and three positive cases, respectively.

Brown, et al. (2020) developed a machine learning binary classifier to identify the presence of COVID-19 using cough and breath sounds in a crowdsourced respiratory dataset. Pahar, Klopper, Warren, and Niesler (2021) employed various machine learning and deep learning techniques to detect infected cough sounds, including logistic regression (LR), support vector machine (SVM), multilayer perceptron (MLP), convolutional neural network (CNN), long short-term memory (LSTM), and ResNet50. SMOTE oversampling was applied to the minority class, which accounts for only 7.86% of cough sounds. In Valdés, et al. (2021), both unsupervised and supervised learning models were utilized to analyze different types of cough sounds, including dry, wet, whooping, and COVID-19 coughs. However, the method only uses raw audio signals from a limited dataset. Mohammed, et al. (2021) proposed a new data augmentation technique to enrich a crowdsourced cough dataset based on publicly available datasets (Faezipour and Abuzneid, 2020, Sharma, et al., 2020). An ensemble learning technique consisting of a shallow MLP, a CNN, and pre-trained CNNs was then designed for early detection of COVID-19 status (either positive or negative). However, the accuracy of this method is lower than 90%.

Recently, researchers have examined the use of computer audition (CA) for COVID-19 detection and diagnosis. CA and AI tools were deployed in Schuller, et al. (2021) to evaluate the factor of speech and sound in COVID-19 diagnosis and monitoring. Various speech and sound records have been analyzed, including breathing, dry and wet cough sound through a cold, eating food, and drowsiness. Besides, Qian, Schmitt, et al. (2021) introduced a multi-task speech corpus for COVID-19. Using traditional machine learning methods and deep learning techniques, the corpus samples aid researchers in developing COVID-19-fighting techniques and tracking the spread of COVID-19 using monitoring systems. Similarly, Qian, Schuller, and Yamamoto (2021) used CA for diagnosing COVID-19. In this study, developing a new AI-based CA framework is investigated to detect COVID-19 through recorded cough signals.

Many cough datasets have been created to encourage the research community to combat SARS-CoV-2. Sharma, et al. (2020) created a COVID-19 dataset, namely Coswara. It comprises recorded breathing, coughing, and voice signals of 6,507 clear sounds and 1,117 noisy ones collected from 941 subjects. Additionally, Cohen-McFarlane, Goubran, and Knoefel (2020) created a dataset of 73 samples to contribute toward the availability of COVID-19 data. Another dataset, i.e., COUGHVID, was developed in Orlandic, Teijeiro, and Atienza (2021). It contains approximately 20,000 recorded cough sounds, accompanied by bio-information such as age, gender, geographic location, and COVID-19 positive/negative test outcomes.

In general, many studies tend to employ CNNs for audio, speech, and signal processing by using 2D representations such as MFCC, spectrogram, Mel spectrogram, and other time–frequency structures. Indeed, pre-trained CNN architectures are widely applied to audio classification, sound-event detection, speech emotion recognition, and speech recognition. Many recent studies on detecting and diagnosing COVID-19 are based on audio, speech, and signal processing methods. One of the most common pre-trained models for audio is PANNs (Kong, et al., 2020), which is trained on the large-scale AudioSet dataset (Gemmeke, et al., 2017) and serves different audio pattern recognition tasks. Because AudioSet includes many respiratory sounds such as breath, cough, sneeze, and sniff, PANNs is helpful for extracting features from recorded cough sounds for COVID-19 detection. In this study, the pre-trained CNN14 of PANNs is employed in the second stage of the proposed Fruit-CoV framework to extract high-level feature representations pertaining to Wavegram-Log-Mel-CNN.

Recent advances in AI have accelerated the use of neural networks and deep learning techniques to detect COVID-19 through cough and breath sounds. In Han, et al. (2020), two feature sets extracted from the openSMILE (Eyben, Wöllmer, & Schuller, 2010) toolkit with SVM were used to diagnose COVID-19. To detect COVID-19 through recorded voice sounds, Pinkas, et al. (2020) utilized a three-stage deep learning and machine learning approach. It includes a self-supervised bidirectional transformer encoder, recurrent neural networks, and an ensemble stacking technique with linear SVM. The preliminary findings are helpful in enhancing the recognition of COVID-19. Finally, Verde, et al. (2021) proposed a vocal-fold strategy to accurately identify COVID-19 from speech and voice analysis.

In another approach, researchers have investigated non-iterative techniques such as the extreme learning machine (ELM) and the random vector functional link (RVFL) for diagnosing SARS-CoV-2 infections. For instance, Rajpal, et al. (2022) designed a COV-ELM framework based on ELM to classify samples into three classes: normal, COVID-19, and pneumonia. The COV-ELM includes three stages: (1) a preprocessing and transformation stage; (2) a feature extraction stage; and (3) an ELM classification stage. The ELM-based classifier is better than the conventional gradient-based learning algorithm in terms of convergence speed, generalization, and training time. Hazarika and Gupta (2020) proposed a framework named WCRVFL that combines the advantages of wavelet coupling and RVFL for modeling and forecasting the spread of COVID-19. The WCRVFL was comparable with state-of-the-art support vector regression and conventional RVFL in terms of performance and generalization. Albadr, et al. (2020) proposed an optimization-based ELM classifier for detecting COVID-19 that employs a genetic algorithm to optimize the ELM parameters. In addition, principal component analysis, a popular dimension reduction method, was applied to reduce the dimensionality of the histogram of oriented gradient features. Moreover, many studies have also employed non-iterative techniques for detecting COVID-19 via X-ray images and chest computed tomography (CT) scans (Khan, et al., 2021, Kumar et al., 2022, Murugan and Goel, 2021, Zhang, et al., 2021).

3. Proposed method

3.1. Motivation

Fig. 1 depicts a difference between cough sounds of positive and negative COVID-19 cases. Because both sounds represent dry coughs, it is a challenging task to discriminate them, especially using traditional methods or depending on human factors. One notable observation is that the features extracted from the waveform and spectrogram of a negative COVID-19 cough sound are clear and sparse. In contrast, the features are dense and noisy in the case of positive COVID-19 cough sounds. Therefore, a vision-based deep learning method is a good option for tackling this classification task.

Fig. 1. Visualization of negative COVID-19 (a) and positive COVID-19 (b) samples.

Our study is also inspired by the need for a COVID-19 test method that is convenient, inexpensive, and contactless while providing rapid and accurate results. From this perspective, detecting COVID-19 through cough sounds offers a viable solution. Such a method is flexible, as the service can be provided as a mobile application that can be quickly deployed in any situation. In addition, the method can potentially achieve high recognition accuracy in group testing or group isolation scenarios (Laguarta et al., 2020). Therefore, a novel vision-based framework is proposed to tackle this problem in this study.

3.2. Background

3.2.1. Feature extraction

Two different feature sets of cough sounds are extracted before feeding into the proposed Fruit-CoV framework. The first feature set is based on Log-Mel spectrograms, while the second is generated using the Wavegram-Log-Mel-CNN model (Kong, et al., 2020). The following subsections describe the features in more detail.

Log-Mel spectrogram.

Given a recorded cough sound with a sampling rate of Sr, a period of P is selected. Then, a Short-Time Fourier Transform (STFT) is applied to each period P to generate a spectrogram, using a window size Ws (the number of Fast Fourier Transform (FFT) bins) and a step, or hop length, Hp between two successive windows. Next, a Mel filter bank with M bands is applied to the spectrogram, followed by a logarithm, to obtain the Log-Mel spectrogram. Finally, the Log-Mel spectrogram is employed as the input for the backbone EffNetV2. The parameter settings are presented in Tables 5 and 6.

Table 5.

Parameter settings for transforming sounds into Log-Mel spectrograms in the first stage.

Sr | P (s) | Ws | Hp | M | f_min (Hz) | f_max (Hz)
48 kHz | 15 | 1024 | 320 | 128 | 50 | 24000
8 kHz | 35 | 1024 | 320 | 64 | 50 | 14000
4 kHz | 35 | 1024 | 320 | 256 | 50 | 2000

Table 6.

Parameter settings for transforming sounds into Log-Mel spectrograms in the second stage.

Architecture | Sr | P (s) | Ws | Hp | M | f_min (Hz) | f_max (Hz)
[1] | 48 kHz | 15 | 1024 | 320 | 128 | 50 | 24000
[2] | 32 kHz | 32 | 1024 | 320 | 64 | 50 | 14000

[1] | 8 kHz | 35 | 1024 | 320 | 64 | 50 | 14000
[2] | 8 kHz | 35 | 256 | 80 | 64 | 50 | 4000

[1] | 4 kHz | 35 | 1024 | 320 | 256 | 50 | 2000
[2] | 8 kHz | 35 | 256 | 80 | 64 | 50 | 4000

Note: [1] is the EffNetV2 and [2] is the CNN14.
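The Log-Mel transformation described above can be sketched in plain NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the HTK-style Mel formula, the Hann window, the absence of frame centering, and the natural logarithm are all assumptions, since the text does not specify them. Audio libraries differ in these details, so the values will not match any particular toolkit exactly.

```python
import numpy as np


def hz_to_mel(f):
    """HTK-style Mel scale (an assumed convention, not stated in the paper)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels, f_min, f_max):
    """Triangular Mel filters with shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


def log_mel_spectrogram(wave, sr, n_fft=1024, hop=320, n_mels=128,
                        f_min=50.0, f_max=24000.0, eps=1e-10):
    """Return a (n_mels, n_frames) Log-Mel spectrogram of a 1-D signal."""
    wave = np.asarray(wave, dtype=float)
    if wave.size < n_fft:  # zero-pad clips shorter than one window
        wave = np.pad(wave, (0, n_fft - wave.size))
    n_frames = 1 + (wave.size - n_fft) // hop
    # Index matrix: row i selects samples [i * hop, i * hop + n_fft).
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(n_fft)            # windowed frames (Ws)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # STFT power spectrum
    mel = power @ mel_filterbank(sr, n_fft, n_mels, f_min, f_max).T
    return np.log(mel + eps).T


# Example with the 48 kHz row of Table 5 on a synthetic 1-second tone.
sr = 48000
tone = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
S = log_mel_spectrogram(tone, sr)
print(S.shape)
```

The default arguments mirror the 48 kHz row of Table 5 (Ws = 1024, Hp = 320, M = 128, f_min = 50 Hz, f_max = 24000 Hz); other rows can be tried by passing the corresponding values.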

Wavegram-Log-Mel-CNN.

Proposed in Kong, et al. (2020), Wavegram-Log-Mel-CNN produces a feature map extracted from Log-Mel spectrograms. The features are concatenated with a wavegram extracted by a CNN-based architecture, as shown in Fig. 3. Further details of Wavegram-Log-Mel-CNN are presented in Section 3.2.3.

Fig. 3. The architecture of Wavegram-Log-Mel-CNN.

3.2.2. EfficientNet-V2 architecture

The aim of EfficientNet-V2 (EffNetV2) (Tan & Le, 2021) is to scale deep neural networks efficiently. Specifically, the method scales an existing model along depth, width, and input image resolution to boost the overall performance. ResNet (He et al., 2016), for example, is a well-known architecture that is scaled by depth. However, traditional scaling methods have a problem: scaling saturates beyond a certain level, and the performance starts degrading. EfficientNet (Tan & Le, 2019), known as EfficientNet-V1, strategically scales the depth, width, and image resolution by applying a multi-objective Neural Architecture Search (NAS) that optimizes both accuracy and floating-point operations.

EffNetV2 recognizes inefficiencies in the architecture and scaling strategies of EfficientNet-V1. EffNetV2 extends the original search space by adding accelerator-friendly operations, such as FusedMBConv, as described in Fig. 2. It simplifies the search space by removing unnecessary operations, such as average pooling and max pooling, which NAS never selects. Furthermore, EffNetV2 uses a progressive learning scheme that helps handle small images.

Fig. 2. The MBConv (a) and Fused-MBConv (b) Blocks of the EffNetV2 (Tan & Le, 2021) architecture.

EffNetV2 outperforms previous models in inference speed and network size (up to 6.8 times smaller than version 1). Its training is also up to 11 times faster than that of the original version while maintaining comparable accuracy. The architecture of EffNetV2 is described in Table 1, where Conv denotes a convolution layer, K is the kernel size, SE is the ratio in the SqueezeExcite (SE) module (Hu, Shen, & Sun, 2018), and FC denotes a fully connected layer.

Table 1.

The layers and corresponding parameters of the EffNetV2 (Tan & Le, 2021) architecture.

Stage | Operator | Stride | Number of channels | Number of layers
0 | Conv3 × 3 | 2 | 24 | 1
1 | Fused-MBConv1, K = 3 | 1 | 24 | 2
2 | Fused-MBConv4, K = 3 | 2 | 48 | 4
3 | Fused-MBConv4, K = 3 | 2 | 64 | 4
4 | MBConv4, K = 3, SE 0.25 | 2 | 128 | 6
5 | MBConv6, K = 3, SE 0.25 | 1 | 160 | 9
6 | MBConv6, K = 3, SE 0.25 | 2 | 256 | 15
7 | Conv1 × 1 & Pooling & FC | - | 1280 | 1

3.2.3. PANNs architecture

Several pre-trained models can handle multiple tasks in natural language processing (NLP) and computer vision. However, in audio pattern recognition, a limited number of pre-trained models can process different tasks. In this study, one such model is utilized, namely PANNs (Kong, et al., 2020), to tackle our problem. These networks can be transferred and used in audio-related tasks such as acoustic scene classification, general audio tagging, music classification, and speech emotion classification.

The Wavegram-Log-Mel-CNN architecture, which uses both Log-Mel spectrograms and waveforms as input features, is also employed. The architecture is illustrated in Fig. 3. Specifically, the input of Wavegram-Log-Mel-CNN consists of sound signals in the form of waveforms. The output is a combination of Wavegram features and Log-Mel features. In this study, a pre-trained Wavegram model proposed by Kong, et al. (2020) is used.

In Fig. 3, the first convolution 1D block with a kernel size of K=11 and a stride of S=5 is applied to raw waveforms. This is followed by three pairs of convolution 1D and max pooling 1D. In convolution 1D, two convolutional layers with a dilation rate of 1 and 2 are used. Each max pooling 1D has a stride of 4. Before feeding into the Wavegram, feature maps are reshaped by a set of parameters (N,T,C,F), where N is the batch size, T is the number of frames, C is the number of channels, and F is the number of frequency bins.
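The strides described above determine how many time frames the Wavegram branch produces. The following sketch works through that arithmetic under stated assumptions: a padding of 5 in the first convolution, length-preserving convolutions inside the three blocks, and max pooling with a kernel equal to its stride of 4. None of these are given in the text, and the function names are hypothetical.

```python
def conv1d_out_len(length, kernel, stride, padding=0, dilation=1):
    """Standard output-length formula for a 1-D convolution."""
    return (length + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def wavegram_frames(n_samples):
    """Number of time frames after the Wavegram front end (sketch)."""
    # First block: K = 11 and S = 5 come from the text; padding = 5 is assumed.
    length = conv1d_out_len(n_samples, kernel=11, stride=5, padding=5)
    # Three convolution + max-pooling pairs. The convolutions are assumed to
    # keep the length unchanged; each pooling (stride 4) divides it by 4.
    for _ in range(3):
        length //= 4
    return length


one_second = 32000                  # samples in 1 s at 32 kHz
print(wavegram_frames(one_second))  # frames produced for 1 s of audio
print(5 * 4 ** 3)                   # overall downsampling factor
```

Under these assumptions the overall downsampling factor is 5 × 4 × 4 × 4 = 320, which coincides with the hop length Hp = 320 listed for the 32 kHz CNN14 input in Table 6, so the Wavegram and the Log-Mel spectrogram would have matching time resolution.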

The CNN14 architecture, as illustrated in Fig. 4, is the backbone of Wavegram. CNN14 can capture time–frequency invariant patterns from Wavegram since kernels are convolved along both time and frequency domains in Wavegram.

Fig. 4. The architecture of CNN14.

3.2.4. GeM pooling layer

GeM pooling was proposed by Radenović et al. (2018). Given an input 2D image, the output of a CNN is a 3D tensor of size $K \times W \times H$, denoting the depth, width, and height of the feature maps. Let $X_g$ be the $g$-th $W \times H$ feature map, where $g = 1, 2, \ldots, G$ and $G = K$ is the number of feature maps. The feature vector $fm^{(GeM)}$ obtained by a GeM pooling layer is defined in Eq. (1):

$$fm^{(GeM)} = \left[ f_1^{(GeM)} \; \ldots \; f_g^{(GeM)} \; \ldots \; f_G^{(GeM)} \right] \qquad (1)$$

with each component given by

$$f_g^{(GeM)} = \left( \frac{1}{|X_g|} \sum_{x \in X_g} x^{P_g} \right)^{\frac{1}{P_g}}, \qquad (2)$$

where $P_g$ is a pooling parameter that can be learned or set manually. If $P_g \to \infty$, GeM pooling becomes max pooling, and if $P_g = 1$, it becomes average pooling. In this study, $P_g$ is set to 3. Fig. 5 illustrates the GeM pooling scheme.

Fig. 5. Illustration of a GeM pooling.
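A minimal NumPy sketch of GeM pooling over a stack of feature maps is given below. It uses a single shared exponent for all maps and clamps activations to a small positive value before exponentiation; both choices are assumptions made for illustration, and the function name is hypothetical.

```python
import numpy as np


def gem_pool(feature_maps, p=3.0, eps=1e-6):
    """Generalized mean pooling of a (K, H, W) tensor into a length-K vector.

    A single exponent p is shared by all feature maps (assumption); clamping
    to eps keeps the power well defined for non-positive activations.
    """
    x = np.clip(np.asarray(feature_maps, dtype=float), eps, None)
    k = x.shape[0]
    return np.mean(x.reshape(k, -1) ** p, axis=1) ** (1.0 / p)


x = np.array([[[1.0, 2.0], [3.0, 4.0]]])  # one 2 x 2 feature map
print(gem_pool(x, p=1.0))    # equals the average of the map
print(gem_pool(x, p=3.0))    # the setting used in this study
print(gem_pool(x, p=200.0))  # approaches the maximum of the map
```

With p = 1 the result is the plain mean of each map, and as p grows the result approaches the maximum, which matches the two limiting cases stated above.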

3.3. Fruit-CoV

In this paper, the baseline EfficientNet-V2 in the first stage is leveraged to extract high-level visual representations from Log-Mel spectrograms. In the second stage, a pre-trained CNN with 14 layers from PANNs is exploited to extract additional embedding features from Log-Mel spectrograms. Because PANNs is pre-trained on the large-scale AudioSet dataset, which contains many respiratory audio classes, the extracted features can be used as bio-embedding features. The embedding features are then concatenated to train a binary classifier. Fig. 6 depicts an overview of the proposed scheme.

Fig. 6. The proposed two-stage framework.
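The fusion step, concatenating the two embeddings and feeding them to a binary classifier, can be illustrated with a small NumPy sketch. The embedding sizes (1280 for the first stage, following the last row of Table 1, and 2048 for the second stage), the single linear layer, and the random weights are assumptions for illustration only; the actual classifier head is not specified in the text.

```python
import numpy as np


def fuse_and_score(emb1, emb2, weights, bias):
    """Concatenate two embeddings and apply a linear layer with a sigmoid.

    emb1, emb2: arrays of shape (batch, d1) and (batch, d2).
    Returns the predicted probability of the positive class per sample.
    """
    fused = np.concatenate([emb1, emb2], axis=-1)
    logits = fused @ weights + bias
    return 1.0 / (1.0 + np.exp(-logits))


rng = np.random.default_rng(0)
emb1 = rng.normal(size=(4, 1280))  # hypothetical stage-1 embeddings
emb2 = rng.normal(size=(4, 2048))  # hypothetical stage-2 embeddings
w = rng.normal(scale=0.01, size=1280 + 2048)  # untrained, random weights
probs = fuse_and_score(emb1, emb2, w, 0.0)
print(probs.shape)
```

In a trained system the weights would be learned from labeled cough sounds; here they are random, so the probabilities carry no diagnostic meaning.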

The data samples are collected with three main sampling rates: 4 kHz, 8 kHz, and 48 kHz. The following data processing schemes are applied:

  • Case 1 (4 kHz): Cough sounds are converted into Log-Mel spectrograms with a sampling rate of 4 kHz. The Log-Mel spectrograms are used as inputs for the two stages. In the first stage, a GeM pooling layer is added after EfficientNet-V2 to identify important feature representations. These representations are propagated to a fully connected layer for classification into infected or non-infected categories. In the second stage, sounds are converted into Log-Mel spectrograms using a sampling rate of 8 kHz before being fed into the CNN14. Moreover, to match the embedding features (Embedding 1) extracted from the trained EffNetV2, the features from ‘conv_block6’ of CNN14 are selected, followed by a GeM pooling layer to generate the embedding features, i.e., Embedding 2.

  • Case 2 (8 kHz): The procedure is similar to that of Case 1, except that the Log-Mel spectrograms are computed at a sampling rate of 8 kHz in both stages. Additionally, Embedding 2 is taken from the ‘embedding’ layer of CNN14 without adding the GeM pooling layer.

  • Case 3 (48 kHz): The procedure differs from the first two cases. Specifically, cough sounds are converted into Log-Mel spectrograms with a sampling rate of 48 kHz in the first stage. However, in the second stage, to avoid degrading the sound quality and mismatching the weights of the pre-trained model, sounds at a sampling rate of 32 kHz are processed by the Wavegram-Log-Mel-CNN before CNN14 is applied to generate Embedding 2.

  • For other sampling rates, sounds are converted into Log-Mel spectrograms using the closest sampling rate as listed in case 1, case 2, or case 3.
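The routing rule in the list above can be sketched as a small lookup. The stage-wise sampling rates follow Table 6; measuring closeness by absolute difference in Hz and the function and variable names are assumptions, since the text does not define how the closest rate is determined.

```python
# Supported cases: recorded rate -> (stage-1 rate, stage-2 rate), in Hz.
SUPPORTED_CASES = {
    4000: (4000, 8000),     # Case 1
    8000: (8000, 8000),     # Case 2
    48000: (48000, 32000),  # Case 3
}


def select_case(sample_rate_hz):
    """Pick the supported case whose rate is closest to the input rate.

    Closeness is the absolute difference in Hz (assumption). Returns
    (case rate, stage-1 rate, stage-2 rate).
    """
    case = min(SUPPORTED_CASES, key=lambda sr: abs(sr - sample_rate_hz))
    stage1, stage2 = SUPPORTED_CASES[case]
    return case, stage1, stage2


for rate in (16000, 22050, 44100):
    print(rate, select_case(rate))
```

A different notion of closeness, for example on a logarithmic frequency scale, would route some intermediate rates differently, so this choice should be checked against the released code before reuse.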

Technically, the first stage of the Fruit-CoV framework performs feature representation learning and is trained to extract the bio-embedding efficiently, while the second stage acts as transfer learning, applying the first stage to the downstream task of discriminating COVID-19 positive from COVID-19 negative cases. The PANNs model was pre-trained on the large-scale AudioSet dataset and then transferred to learn the bio-embedding features on the dataset in this study. As a result, the trained model from the first stage and the pre-trained PANNs can be used not only to extract the bio-embedding features on this dataset but also to discriminate COVID-19 positive cough sounds from other dry cough sounds.

4. Performance evaluation

4.1. Dataset

The AICovidVN 115M Challenge is a community project in Vietnam toward finding efficient and effective solutions for detecting COVID-19 based on cough sounds. The winning team is selected to implement an appropriate test service on a broader scale in Vietnam. The dataset contains two classes, i.e., positive and negative labels. As shown in Fig. 7, it comprises 4,508 (61.2%) training samples, 1,230 (16.7%) public test samples and 1,627 (22.1%) private test samples.

Fig. 7. Distribution of the dataset of the AICovidVN 115M Challenge.

Since different devices are utilized to record cough sounds, the dataset includes sounds with different sampling rates. As a result, the problem becomes challenging as Fruit-CoV is required to be designed in such a way that it can adapt to different sampling rates. Specifically, three major sampling rates are employed for use with a deep learning model, i.e., 4 kHz, 8 kHz, and 48 kHz, as mentioned in Section 3.3.

Moreover, information on respiratory diseases in the dataset is restricted for privacy reasons. Hence, only 910 samples in the training set carry respiratory disease information, consisting of 896 negative and 14 positive samples. The dataset is also imbalanced, as presented in Tables 2 and 3. Note that gender information is not considered in this study. Therefore, the proposed Fruit-CoV framework is designed to detect and diagnose SARS-CoV-2 infections using only recorded cough sounds.

Table 2.

Distribution of COVID-19 positive and negative cases in the training set.

Label | Total samples | With noise | Without noise
Negative | 3825 | 426 | 3399
Positive | 679 | 30 | 649

Table 3.

Distribution of gender in the training set.

Gender | Total samples | With noise | Without noise
Male | 3166 | 338 | 2828
Female | 1338 | 118 | 1220

4.2. Hyperparameter and experimental settings

Hyperparameter settings for training Fruit-CoV are presented in Table 4. The SpecAugment (Park, et al., 2019) method is applied only to the EffNetV2 architecture in both stages. Additionally, k-fold cross-validation with k=5 is used to improve generalization. The optimizer is the Cosine-based AngularGrad (CAG) function (Roy, et al., 2021), and the loss function is binary cross-entropy with logits (BCEs). The learning rate scheduler starts after 90% of the epochs in the first stage and is not applied in the second stage.

Table 4.

Hyper-parameter settings of the proposed framework.

Params | Configure/Setup
Learning rate | 1e-3
Learning rate scheduler | Cosine(2e-4)
Batch size | 16
Epochs | 30
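The loss named above, binary cross-entropy with logits, is usually evaluated in a numerically stable form that avoids overflow for large logits. The sketch below shows that standard formulation in NumPy; the function name is hypothetical and the mean reduction over the batch is an assumption.

```python
import numpy as np


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from raw logits.

    Uses the stable identity
        loss = max(x, 0) - x * y + log(1 + exp(-|x|)),
    which equals -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))].
    """
    x = np.asarray(logits, dtype=float)
    y = np.asarray(targets, dtype=float)
    loss = np.maximum(x, 0.0) - x * y + np.log1p(np.exp(-np.abs(x)))
    return float(loss.mean())


print(bce_with_logits([0.0], [1.0]))                 # log(2) for an undecided logit
print(bce_with_logits([1000.0, -1000.0], [1.0, 0.0]))  # confident and correct
```

Working on logits rather than probabilities means the classifier head does not need an explicit sigmoid during training, which is why the sigmoid is applied only at inference time in most implementations.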

For transforming sounds into Log-Mel spectrograms, the parameters defined for the first stage are shown in Table 5, and those for the second stage are shown in Table 6. In both cases, f_min denotes the lowest frequency and f_max denotes the highest frequency in Hz, and P denotes the period in seconds. The optimal parameters for training the models in the first and second stages are presented in Table 7 and Table 8, respectively, where Optim stands for optimizer and Lr stands for learning rate.

Table 7.

Optimal parameters for training models in the first stage.

Case | Loss | Optim | Lr | Stochastic weight averaging
4 kHz | BCEs | CAG | 1e-3 | Start at 90% epochs, Lr = 2e-4
8 kHz | BCEs | CAG | 1e-3 | Start at 90% epochs, Lr = 2e-4
48 kHz | BCEs | CAG | 1e-3 | Start at 90% epochs, Lr = 2e-4

Table 8.

Optimal parameters for training models in the second stage.

Case | Loss | Optim | Lr | Stochastic weight averaging
4 & 8 kHz | BCEs | CAG | 1e-3 | None
8 & 8 kHz | BCEs | CAG | 1e-3 | None
48 & 32 kHz | BCEs | CAG | 1e-3 | None

The hardware used for training includes an Intel Core i7 7700 CPU with 16 GB RAM and distributed data-parallel training with 2 GPUs: NVIDIA 1080ti EVGA SC2 (11 GB VRAM) and NVIDIA 1070 Gigabyte OC (8 GB VRAM). Validation is conducted in two scenarios: (1) using only the public dataset; and (2) using the public dataset and the private test dataset. Experimental results and detailed analysis are presented in Section 4.3.

4.3. Evaluation results and discussion

4.3.1. Public dataset

In this case, k-fold cross-validation is used for evaluation. The experimental results for the public dataset are displayed in Fig. 8, where each curve represents the average over the five folds. In this phase, although the AICovidVN 115M Challenge uses the AUC score as a benchmark to rank submitted results, the ROC curve is additionally used to evaluate the proposed framework. Fig. 8(a) presents the ROC curves of the 4 kHz, 8 kHz, and 48 kHz sampling rates, with AUC scores of 0.9851, 0.9901, and 0.9771, respectively. Fig. 8(b) shows the ROC curve of the combination of the 4 kHz and 48 kHz sampling rates, with an AUC score of 0.9341, while Fig. 8(c) presents the ROC curve of the combination of the 8 kHz and 48 kHz sampling rates, with an AUC score of 0.9676. The results in Fig. 8 indicate that using three sampling rates yields better individual and average results than using two. This finding matters because the data samples are recorded at varying sampling rates. In addition, Precision–Recall (PR) curves are presented in Fig. 9 to interpret the probabilistic predictions under class imbalance.

Fig. 8. Visualization of ROC curves on the public test set.

Fig. 9. Visualization of PR curves on the public test set.
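For reference, the AUC score reported in this section can be computed from labels and predicted scores without drawing the ROC curve, using its interpretation as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below is a generic pairwise implementation with ties counted as one half; it is not the challenge's scoring code, and the function name is hypothetical.

```python
import numpy as np


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 ground-truth values.
    scores: iterable of predicted scores (higher means more positive).
    """
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    diff = pos[:, None] - neg[None, :]  # all positive-negative score gaps
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs ranked correctly
```

The pairwise form needs memory proportional to the number of positive-negative pairs, which is acceptable for a few thousand samples; a rank-based formulation is preferable for much larger sets.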

4.3.2. Public dataset and private test dataset

In this case, the public dataset is used for training Fruit-CoV, which is then evaluated on the private test dataset containing 1,627 samples. Figs. 10 and 11 depict the AUC scores of the first and second stages with different sampling rates, respectively. In Fig. 10, the scores reach 93%, 96%, and 94% in the first stage with sampling rates of 4 kHz, 8 kHz, and 48 kHz, respectively. Similarly, in the second stage (see Fig. 11), the scores reach 95%, 97%, and 99%, respectively. In short, the proposed framework obtains an average AUC score of 92.8% on the private test dataset.

Fig. 10. AUC scores on the first stage with (a) 4; (b) 8; and (c) 48 kHz, respectively, on the private test set.

Fig. 11. AUC scores on the second stage with (a) 4; (b) 8; and (c) 48 kHz, respectively, on the private test set.

4.3.3. Discussion

Compared with existing studies, the proposed method is robust and flexible with different sampling rates. A high average AUC score of 92.8% on the private test data from the AICovidVN 115M Challenge has been achieved with the proposed method. Because this study is dealing with an imbalanced classification task, it is inappropriate to evaluate the model performance in terms of Accuracy, F1 Score, Precision, and Recall. Therefore, using ROC curves and ROC AUC is more suitable for this task with few samples of the minority class. Besides, Precision–Recall curves and Precision–Recall AUC are also provided for the imbalanced classification task.
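The accuracy part of this argument can be illustrated with the class counts from Table 2. The sketch below evaluates a hypothetical classifier that labels every training sample as negative; it is an illustration of the imbalance effect only, not a result from the study.

```python
# Class counts from Table 2 (training set).
n_negative, n_positive = 3825, 679

# A hypothetical classifier that always predicts "negative".
true_negatives = n_negative
true_positives = 0

accuracy = (true_positives + true_negatives) / (n_negative + n_positive)
detected_fraction = true_positives / n_positive

# prints: accuracy = 0.849, detected positives = 0.000
print(f"accuracy = {accuracy:.3f}, detected positives = {detected_fraction:.3f}")
```

Such a classifier reaches roughly 85% accuracy while detecting no infected sample, which is why a threshold-independent ranking measure such as the AUC is more informative for this dataset.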

Unlike previous models that use a combination of cough sounds, speech, and bio-marker information, such as in Laguarta et al. (2020), the proposed method uses only the recorded cough sounds.

The quality of data used in our evaluation is low since the samples have been taken from multiple countries (Vietnam, India, and Switzerland) with a variety of sampling rates using only mobile phones. Therefore, a higher detection rate is possible if the data quality improves, e.g., using sounds recorded with high-quality recording devices.

Table 9 summarizes the comparison of the proposed method with several existing solutions for discriminating between COVID-19 positive and COVID-19 negative. As shown in Table 9, the proposed method is comparable with the latest state-of-the-art solutions that employ cough sounds for detecting COVID-19. Although some approaches report better scores than ours, they have been investigated on either fewer data samples or more balanced datasets. Taking dataset size and class balance into account, the proposed method is also more effective and robust than the latest state-of-the-art methods.

Table 9.

Comparison of the proposed method with several existing approaches.

Solution: Mohammed, et al. (2021)
Feature: Spectrogram; MelSpectrum; Tonal; Raw; MFCCs; Power spectrum; Chroma
Method: Ensemble of CNN classifier
Result: Best AUC score of 77% for discriminating between COVID-19 positive and COVID-19 negative

Solution: Coppock, et al. (2021)
Feature: Spectrogram
Method: COVID-19 Identification ResNet
Result: Best AUC score of 84.6% for discriminating between COVID-19 positive and COVID-19 negative

Solution: Nessiem, et al. (2021)
Feature: Raw audio in combination with different spectrogram variations
Method: Convolutional Neural Networks; Ensemble models; Bayesian Optimization HyperBand
Result: Best AUC score of 80.7% for discriminating between COVID-19 positive and COVID-19 negative

Solution: Melek (2021)
Feature: MFCCs with LOO-CV (Sammut & Webb, 2011) strategy
Method: Support Vector Machine; Linear discriminant analysis; K-nearest neighbors; Partial least squares regression; Sequential Forward Search
Result: Best AUC score of 98.6% for discriminating between COVID-19 positive and COVID-19 negative

Solution: Pahar et al. (2021)
Feature: MFCCs; Log energies; Zero-crossing rate; Kurtosis
Method: Logistic Regression; Support Vector Machine; Multilayer Perceptrons; Convolutional Neural Networks; Long short-term memory; ResNet-50; SMOTE oversampling for the minor class; Sequential Forward Search
Result: Best AUC score of 98% for discriminating between COVID-19 positive and healthy coughs; Best AUC score of 94% for discriminating between COVID-19 positive and COVID-19 negative

Solution: Brown, et al. (2020)
Feature: Handcrafted with PCA = 0.95; VGGish (Hershey, et al., 2017) with PCA = 0.95, and duration, tempo, onset, period
Method: Logistic Regression
Result: Best AUC score of above 80% for discriminating between COVID-19 positive and COVID-19 negative

Solution: Ours
Feature: Log-Mel spectrogram
Method: Wavegram-Log-Mel-CNN; PANNs; EffNetV2
Result: Best AUC score of 92.8% for discriminating between COVID-19 positive and COVID-19 negative

To reduce the complexity in time and space, the number of layers can be limited. However, the main aim is to optimize the model performance within a reasonable training/inference time. Based on Tables 10 and 11 (see Appendix), increasing the number of epochs and layers can improve model performance but also increases the complexity in time and space. Therefore, the numbers of epochs and layers are chosen to give strong performance within a reasonable training time.

Table 10.

Comparison of the proposed framework with different numbers of epochs.

Case Number of epochs AUC score Training time
1 25 97.72% 5.40 h
2 30 97.86% 6.09 h
3 40 97.94% 7.48 h

Table 11.

Comparison of the proposed framework with different numbers of layers.

Case Number of layers AUC score Training time
1 1 97.86% 6.09 h
2 2 97.92% 6.10 h
3 3 98.08% 6.10 h

Moreover, because only cough sounds were recorded, without speech, NLP techniques were not applied. Cough signals can be regarded as abnormal signals, whereas speech can be represented as sequential or time-series signals. It is therefore suitable in this study to convert cough sounds into Mel-spectrogram-based features and apply vision-based techniques, rather than NLP-based ones, to classify samples as COVID-19 positive or COVID-19 negative. An NLP-based technique, namely the long short-term memory (LSTM) network, was also tried in the current study, but the results were unpromising. As a result, the Fruit-CoV framework has been designed on vision-based techniques alone, and a proper architecture for NLP techniques has not been investigated further. In the future, if both cough and speech signals are recorded, NLP techniques can be integrated to improve the accuracy of COVID-19 detection.
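To make the Mel-spectrogram conversion more concrete, the following sketch shows how mel filter center frequencies depend on the sampling rate. It uses the widely used HTK-style mel formula, m = 2595 log10(1 + f/700), which is an assumption here (the source does not state which mel variant was used), and `n_mels=8` is chosen only to keep the printout short; it is not the paper's setting.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (HTK-style formula, assumed for this sketch)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_centers(sample_rate_hz, n_mels):
    """Center frequencies (Hz) of n_mels triangular filters.

    Filters are spaced evenly on the mel scale from 0 Hz up to the Nyquist
    frequency (half the sampling rate). n_mels filters need n_mels + 2 edge
    points; the interior points are the filter centers.
    """
    top_mel = hz_to_mel(sample_rate_hz / 2.0)
    return [mel_to_hz(top_mel * i / (n_mels + 1)) for i in range(1, n_mels + 1)]


# The three sampling rates discussed in this section.
for sample_rate in (4000, 8000, 48000):
    centers = mel_band_centers(sample_rate, n_mels=8)
    print(sample_rate, [round(c) for c in centers])
```

Because the highest representable frequency is half the sampling rate, a 4 kHz recording yields bands only up to 2 kHz, whereas a 48 kHz recording spreads the same number of bands up to 24 kHz. This is one way to see why spectrograms computed at different sampling rates can carry complementary information.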

5. Conclusions

A high-performance vision-based framework named Fruit-CoV is proposed to detect and diagnose SARS-CoV-2 infections conveniently through recorded cough sounds. A hybrid deep neural network built on state-of-the-art architectures, i.e., EfficientNet-V2 and PANNs, has been proposed. An available dataset has been analyzed with an appropriate pre-processing procedure. Experimental results show that the proposed Fruit-CoV framework obtains an AUC score of 92.8%, winning the AICovidVN 115M Challenge. In addition, the proposed framework is comparable with the latest state-of-the-art methods that employ cough sounds to discriminate between positive and negative samples. As summarized in Table 9, Fruit-CoV is more effective, efficient, and robust than the latest state-of-the-art solutions. Moreover, Fruit-CoV can be integrated into a call center, a VoIP system, or a mobile application and deployed as a rapid and convenient self-testing service for COVID-19 detection from home. Fruit-CoV can be further improved by aggregating recorded cough sounds, automating data processing, exploring state-of-the-art techniques, and generating an evaluation benchmark of existing methods. While coughing is the most prominent symptom of COVID-19, other symptoms such as fever, headache, and fatigue can be taken into account for better detection. Further investigation into a framework for COVID-19 detection that combines different data modalities is promising. For example, images and videos taken with smartphone cameras, or data collected by on-board inertial sensors, can be used for fatigue detection. Data obtained from temperature-fingerprint sensors can be used for fever prediction. Camera images and measurements from inertial sensors can be used for headache prediction.
Integrating different data modalities obtained from mobile phones to address different symptoms can further increase the accuracy of COVID-19 detection, offering a rapid, convenient, and inexpensive solution to this important medical problem. Our solution can be important for developing countries with limited healthcare facilities and low medical budgets. Finally, the recorded cough sounds have been collected in multiple countries and regions via multiple devices in different ways, which may limit detection performance for two reasons: (1) cough sounds can vary significantly among countries, regions, ages, and genders; (2) each device has a different sensor for recording sounds, which leads to different sampling rates. Therefore, developing a cloud-based web application with standard recording sensors might be considered in the next stage of the Fruit-CoV framework.

CRediT authorship contribution statement

Long H. Nguyen: Propose ideas, Develop software, Participate in the challenge. Nhat Truong Pham: Propose ideas, Draft the paper, Participate in the challenge. Van Huong Do: Develop software and deploy the service. Liu Tai Nguyen: Develop software and deploy the service. Thanh Tin Nguyen: Revise the paper and support the development. Hai Nguyen: Revise the paper and support the development. Ngoc Duy Nguyen: Supervise the team, Provide the methodology, Revise the paper. Thanh Thi Nguyen: Provide the methodology and revise the paper. Sy Dzung Nguyen: Discussion, Revise the paper, Support the development. Asim Bhatti: Provide the methodology and revise the paper. Chee Peng Lim: Supervise the team, Provide the methodology and revise the paper.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

The authors thank the organizer of the AICovidVN 115M Challenge for providing the recorded cough sound dataset. The authors also thank all members of the FruitLab team for many useful discussions. The authors also would like to thank all Editors and Reviewers for their useful comments and thoughtful suggestions to improve the paper.

Footnotes

The proposed model in this study won the first place on the leaderboard of the AICovidVN 115M Challenge (https://www.covid.aihub.vn/ or https://aihub.vn/competitions/22).

Appendix A. Ablation studies

Several experiments have been conducted with different numbers of epochs and layers to validate the effectiveness of the proposed framework. Tables 10 and 11 present the experimental results with different numbers of epochs and layers (the number of fully connected layers after embedding fusion), respectively. In these experiments on the public dataset, only a sampling rate of 4 kHz has been used for the EffNetV2 in both the first and second stages, and a sampling rate of 8 kHz has been used for the CNN14 in the second stage. k-fold cross-validation with k=5 is also used for evaluation. It should be noted that all results in these ablation studies have been obtained on an NVIDIA RTX A6000 instead of a combination of an NVIDIA 1080ti EVGA SC2 and an NVIDIA 1070 Gigabyte OC, owing to changes in our computing resources. Therefore, these results might differ from the performance evaluation in Section 4. However, all other hyperparameters and experimental settings are the same as in Tables 4 to 8.

The results in Table 10 indicate that different numbers of epochs produce different AUC scores and training times. The AUC score increases only slightly as the number of epochs increases, whereas the training time increases roughly linearly with the number of epochs. Using different numbers of fully connected (FC) layers also yields different AUC scores, but the difference is very slight, as presented in Table 11, where the number of epochs is 30. Case 1 is the default of the proposed method, while Case 2 and Case 3 are constructed by adding one FC layer with 1,024 units or two FC layers with 1,024 and 64 units, respectively. Training times are similar in the three cases because adding one or two FC layers does not change the workload enough to matter when running on the NVIDIA RTX A6000. In general, however, increasing the number of layers might lead to increasing complexity in time and space.
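The linear relationship between epochs and training time can be checked directly from the three rows of Table 10 with an ordinary least-squares line. The sketch below is a worked calculation on those reported numbers only; `least_squares_line` is a helper written for this illustration, and three points are too few to support more than a rough check.

```python
# Rows of Table 10: (number of epochs, training time in hours).
epochs = [25, 30, 40]
hours = [5.40, 6.09, 7.48]


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = least_squares_line(epochs, hours)
residuals = [y - (slope * x + intercept) for x, y in zip(epochs, hours)]

print(f"hours per epoch: {slope:.4f}")
print(f"fixed overhead (hours): {intercept:.4f}")
print("largest residual (hours):", max(abs(r) for r in residuals))
```

On these numbers the fit gives about 0.14 h per epoch plus roughly 1.9 h of fixed cost, with residuals well under 0.01 h, which is consistent with the near-linear growth noted above. Over the same range the AUC in Table 10 rises only from 97.72% to 97.94%.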

Data availability

We have shared the link to our code in the manuscript.

References

  1. Abdel-Basset M., Hawash H., Elhoseny M., Chakrabortty R.K., Ryan M. Deeph-dta: deep learning for predicting drug-target interactions: a case study of covid-19 drug repurposing. IEEE Access. 2020;8:170433–170451. doi: 10.1109/ACCESS.2020.3024238.
  2. Abdelmageed M.I., Abdelmoneim A.H., Mustafa M.I., Elfadol N.M., Murshed N.S., Shantier S.W., et al. Design of a multiepitope-based peptide vaccine against the e protein of human covid-19: an immunoinformatics approach. BioMed Research International. 2020;2020 doi: 10.1155/2020/2683286.
  3. Albadr M.A.A., Tiun S., Ayob M., Al-Dhief F.T., Omar K., Hamzah F.A. Optimised genetic algorithm-extreme learning machine approach for automatic covid-19 detection. PLoS One. 2020;15 doi: 10.1371/journal.pone.0242899.
  4. Balaha H.M., El-Gendy E.M., Saafan M.M. Covh2sd: A covid-19 detection approach based on harris hawks optimization and stacked deep learning. Expert Systems with Applications. 2021;186 doi: 10.1016/j.eswa.2021.115805.
  5. Banerjee A., Santra D., Maiti S. 2020. Energetics based epitope screening in sars cov-2 (covid 19) spike glycoprotein by immuno-informatic analysis aiming to a suitable vaccine development. BioRxiv.
  6. Bock J.O., Ortea I. Re-analysis of sars-cov-2-infected host cell proteomics time-course data by impact pathway analysis and network analysis: a potential link with inflammatory response. Aging (Albany NY) 2020;12(11277) doi: 10.18632/aging.103524.
  7. Brown, C., Chauhan, J., Grammenos, A., Han, J., Hasthanasombat, A., Spathis, D., et al. (2020). Exploring automatic diagnosis of covid-19 from crowdsourced respiratory sound data. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining (pp. 3474–3484).
  8. Castorina P., Iorio A., Lanteri D. Data analysis on coronavirus spreading by macroscopic growth laws. International Journal of Modern Physics C. 2020;31
  9. Cohen-McFarlane M., Goubran R., Knoefel F. Novel coronavirus cough database: Nococoda. IEEE Access. 2020;8:154087–154094. doi: 10.1109/ACCESS.2020.3018028.
  10. Coppock H., Gaskell A., Tzirakis P., Baird A., Jones L., Schuller B. End-to-end convolutional neural network enables covid-19 detection from breath and cough audio: a pilot study. BMJ Innovations. 2021;7 doi: 10.1136/bmjinnov-2021-000668.
  11. Eyben, F., Wöllmer, M., & Schuller, B. (2010). Opensmile: the munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM international conference on multimedia (pp. 1459–1462).
  12. Faezipour M., Abuzneid A. Smartphone-based self-testing of covid-19 using breathing sounds. Telemedicine and E-Health. 2020;26:1202–1205. doi: 10.1089/tmj.2020.0114.
  13. Gemmeke J.F., Ellis D.P., Freedman D., Jansen A., Lawrence W., Moore R.C., et al. 2017 IEEE international conference on acoustics, speech and signal processing. IEEE; 2017. Audio set: An ontology and human-labeled dataset for audio events; pp. 776–780.
  14. Giordano G., Blanchini F., Bruno R., Colaneri P., Di Filippo A., Di Matteo A., et al. Modelling the covid-19 epidemic and implementation of population-wide interventions in italy. Nature Medicine. 2020;26:855–860. doi: 10.1038/s41591-020-0883-7.
  15. Han J., Qian K., Song M., Yang Z., Ren Z., Liu S., et al. 2020. An early study on intelligent analysis of speech under covid-19: severity, sleep quality, fatigue, and anxiety. arXiv preprint arXiv:2005.00096.
  16. Hazarika B.B., Gupta D. Modelling and forecasting of COVID-19 spread using wavelet-coupled random vector functional link networks. Applied Soft Computing. 2020;96 doi: 10.1016/j.asoc.2020.106626.
  17. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).
  18. Heroy S. 2020. Metropolitan-scale covid-19 outbreaks: how similar are they? arXiv preprint arXiv:2004.01248.
  19. Hershey S., Chaudhuri S., Ellis D.P., Gemmeke J.F., Jansen A., Moore R.C., et al. 2017 IEEE international conference on acoustics, speech and signal processing. IEEE; 2017. Cnn architectures for large-scale audio classification; pp. 131–135.
  20. Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7132–7141).
  21. Imran A., Posokhova I., Qureshi H.N., Masood U., Riaz M.S., Ali K., et al. Ai4covid-19: Ai enabled preliminary diagnosis for covid-19 from cough samples via an app. Informatics in Medicine Unlocked. 2020;20 doi: 10.1016/j.imu.2020.100378.
  22. Khan M.A., Majid A., Akram T., Hussain N., Nam Y., Kadry S., et al. Classification of covid-19 ct scans via extreme learning machine. Computers, Materials and Continua. 2021;68
  23. Kong Q., Cao Y., Iqbal T., Wang Y., Wang W., Plumbley M.D. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2020;28:2880–2894.
  24. Kumar L.K., Alphonse P. Automatic diagnosis of covid-19 disease using deep convolutional neural network with multi-feature channel from respiratory sound data: cough, voice, and breath. Alexandria Engineering Journal. 2021
  25. Kumar V., Gupta A., Hazarika B.B., Gupta D. 2022. Automatic diagnosis of covid-19 from chest x-ray images using transfer learning-based deep features and machine learning models.
  26. Laguarta J., Hueto F., Subirana B. Covid-19 artificial intelligence diagnosis using only cough recordings. IEEE Open Journal of Engineering in Medicine and Biology. 2020;1:275–281. doi: 10.1109/OJEMB.2020.3026928.
  27. Li Z., Li X., Huang Y.Y., Wu Y., Zhou L., Liu R., et al. 2020. Fep-based screening prompts drug repositioning against covid-19. BioRxiv.
  28. Li Z., Zhao S., Chen Y., Luo F., Kang Z., Cai S., et al. A deep-learning-based framework for severity assessment of covid-19 with ct images. Expert Systems with Applications. 2021;185 doi: 10.1016/j.eswa.2021.115616.
  29. Melek M. Diagnosis of covid-19 and non-covid-19 patients by classifying only a single cough sound. Neural Computing and Applications. 2021:1–12. doi: 10.1007/s00521-021-06346-3.
  30. Miao L., Last M., Litvak M. Tracking social media during the covid-19 pandemic: The case study of lockdown in new york state. Expert Systems with Applications. 2022;187 doi: 10.1016/j.eswa.2021.115797.
  31. Mishra R.K., Urolagin S., Jothi J.A.A., Neogi A.S., Nawaz N. Deep learning-based sentiment analysis and topic modeling on tourism during covid-19 pandemic. Frontiers of Computer Science. 2021;3 doi: 10.3389/fcomp.2021.775368.
  32. Mohammed E.A., Keyhani M., Sanati-Nezhad A., Hejazi S.H., Far B.H. An ensemble learning approach to digital corona virus preliminary screening from cough sounds. Scientific Reports. 2021;11:1–11. doi: 10.1038/s41598-021-95042-2.
  33. Morís D.I., de Moura Ramos J.J., Buján J.N., Hortas M.O. Data augmentation approaches using cycle-consistent adversarial networks for improving covid-19 screening in portable chest x-ray images. Expert Systems with Applications. 2021;185 doi: 10.1016/j.eswa.2021.115681.
  34. Murugan R., Goel T. E-diconet: Extreme learning machine based classifier for diagnosis of covid-19 using deep convolutional network. Journal of Ambient Intelligence and Humanized Computing. 2021;12:8887–8898. doi: 10.1007/s12652-020-02688-3.
  35. Nessiem M.A., Mohamed M.M., Coppock H., Gaskell A., Schuller B.W. 2021 IEEE 34th international symposium on computer-based medical systems. IEEE; 2021. Detecting covid-19 from breathing and coughing sounds using deep neural networks; pp. 183–188.
  36. Nguyen T.T., Pathirana P.N., Nguyen T., Nguyen Q.V.H., Bhatti A., Nguyen D.C., et al. Genomic mutations and changes in protein secondary structure and solvent accessibility of sars-cov-2 (covid-19 virus) Scientific Reports. 2021;11:1–16. doi: 10.1038/s41598-021-83105-3.
  37. Orlandic L., Teijeiro T., Atienza D. The coughvid crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms. Scientific Data. 2021;8:1–10. doi: 10.1038/s41597-021-00937-4.
  38. Pahar M., Klopper M., Warren R., Niesler T. Covid-19 cough classification using machine learning and global smartphone recordings. Computers in Biology and Medicine. 2021 doi: 10.1016/j.compbiomed.2021.104572.
  39. Park D.S., Chan W., Zhang Y., Chiu C., Zoph B., Cubuk E.D., et al. 2019. Specaugment: A simple data augmentation method for automatic speech recognition.
  40. Peng L., Yang W., Zhang D., Zhuge C., Hong L. 2020. Epidemic analysis of covid-19 in china by dynamical modeling. arXiv preprint arXiv:2002.06563.
  41. Pham Q.V., Nguyen D.C., Huynh-The T., Hwang W.J., Pathirana P.N. Artificial intelligence (ai) and big data for coronavirus (covid-19) pandemic: A survey on the state-of-the-arts. IEEE Access. 2020;8:130820–130839. doi: 10.1109/ACCESS.2020.3009328.
  42. Pinkas G., Karny Y., Malachi A., Barkai G., Bachar G., Aharonson V. Sars-cov-2 detection from voice. IEEE Open Journal of Engineering in Medicine and Biology. 2020;1:268–274. doi: 10.1109/OJEMB.2020.3026468.
  43. Qian K., Schmitt M., Zheng H., Koike T., Han J., Liu J., et al. Computer audition for fighting the sars-cov-2 corona crisis–introducing the multi-task speech corpus for covid-19. IEEE Internet of Things Journal. 2021 doi: 10.1109/JIOT.2021.3067605.
  44. Qian K., Schuller B.W., Yamamoto Y. 2021 IEEE 3rd global conference on life sciences and technologies. IEEE; 2021. Recent advances in computer audition for diagnosing covid-19: An overview; pp. 181–182.
  45. Radenović F., Tolias G., Chum O. Fine-tuning cnn image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2018;41:1655–1668. doi: 10.1109/TPAMI.2018.2846566.
  46. Rajpal S., Agarwal M., Rajpal A., Lakhyani N., Saggar A., Kumar N. COV-ELM classifier: An extreme learning machine based identification of COVID-19 using chest x-ray images. Intelligent Decision Technologies. 2022;16:193–203. doi: 10.3233/IDT-210055.
  47. Roy S., Paoletti M., Haut J., Dubey S., Kar P., Plaza A., et al. 2021. Angulargrad: a new optimization technique for angular convergence of convolutional neural networks. arXiv preprint arXiv:2105.10190.
  48. Sainz-Pardo J.L., Valero J. Covid-19 and other viruses: Holding back its spreading by massive testing. Expert Systems with Applications. 2021;186 doi: 10.1016/j.eswa.2021.115710.
  49. Sammut C., Webb G.I. Springer Science & Business Media; 2011. Encyclopedia of machine learning.
  50. Schuller B.W., Schuller D.M., Qian K., Liu J., Zheng H., Li X. Covid-19 and computer audition: An overview on what speech & sound analysis could contribute in the sars-cov-2 corona crisis. Frontiers in Digital Health. 2021;3(14) doi: 10.3389/fdgth.2021.564906.
  51. Sharma N., Krishnan P., Kumar R., Ramoji S., Chetupalli S.R., Ghosh P.K., et al. 2020. Coswara–a database of breathing, cough, and voice sounds for covid-19 diagnosis. arXiv preprint arXiv:2005.10548.
  52. Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning (pp. 6105–6114).
  53. Tan M., Le Q.V. 2021. Efficientnetv2: smaller models and faster training. arXiv preprint arXiv:2104.00298.
  54. Ting D.S.W., Carin L., Dzau V., Wong T.Y. Digital technology and covid-19. Nature Medicine. 2020;26:459–461. doi: 10.1038/s41591-020-0824-5.
  55. Vaishya R., Javaid M., Khan I.H., Haleem A. Artificial intelligence (ai) applications for covid-19 pandemic. Diabetes & Metabolic Syndrome: Clinical Research & Reviews. 2020;14:337–339. doi: 10.1016/j.dsx.2020.04.012.
  56. Valdés J.J., Xi P., Cohen-McFarlane M., Wallace B., Goubran R., Knoefel F. 2021 IEEE international symposium on medical measurements and applications. IEEE; 2021. Analysis of cough sound measurements including covid-19 positive cases: a machine learning characterization; pp. 1–6.
  57. Verde L., De Pietro G., Ghoneim A., Alrashoud M., Al-Mutib K.N., Sannino G. Exploring the use of artificial intelligence techniques to detect the presence of coronavirus covid-19 through speech and voice analysis. IEEE Access. 2021;9:65750–65757. doi: 10.1109/ACCESS.2021.3075571.
  58. Zhang X., Lu S., Wang S.H., Yu X., Wang S.J., Yao L., et al. Diagnosis of covid-19 pneumonia via a novel deep learning architecture. Journal of Computer Science and Technology. 2021;1 doi: 10.1007/s11390-020-0679-8.


Articles from Expert Systems with Applications are provided here courtesy of Elsevier
