Comput Biol Med. 2023 Jan 5;153:106517. doi: 10.1016/j.compbiomed.2022.106517

Table 4.

Application of speech technology in pathological voice recognition and evaluation (otorhinolaryngology department).

Disease | Data sources | Voice type | Voice feature | Classifier | Effect | Ref.
Vocal fold disorders | 41 HP, 111 Ps | SV /a/ | Jitter, RAP, Shimmer, APQ, MFCC, Harmonics-to-Noise Ratio (HNR), SPI | ANN, GMM, HMM, SVM | Average classification rate in GMM reaches 95.2% | [92]
| KAY database: 53 HP, 94 Ps | SV /a/ | Wavelet-packet coefficients, energy, and entropy, selected by algorithms | SVM, KNN | Best accuracy = 91% | [93]
| MEEI: 53 HP, 657 Ps | SV /a/ | Features based on the phenomena of critical bandwidths | GMM | Best accuracy = 99.72% | [94]
Benign vocal fold lesions | MEEI: 53 HP, 63 Ps; SVD: 869 HP, 108 Ps; Hospital Universitario Príncipe de Asturias (HUPA): 239 HP, 85 Ps; UEX-Voice: 30 HP, 84 Ps | SV /a/ and SS | MFCC, HNR, Energy, Normalized Noise Energy | Random Forest (RF) and multi-condition training | Accuracies: about 95% on MEEI, 78% on HUPA, and 74% on SVD | [95]
Voice disorder | MEEI: 53 HP, 372 Ps; SVD: 685 HP, 685 Ps; VOICED: 58 HP, 150 Ps | SV /a/ | Fundamental Frequency (F0), jitter, shimmer, HNR | Boosted Trees (BT), KNN, SVM, Decision Tree (DT), Naive Bayes (NB) | Best performance achieved by BT (AUC = 0.91) | [96]
| KAY: 213 Ps | SV /a/ | Features extracted through an adaptive wavelet filterbank | SVM | Six types of disorders sorted successfully | [97]
| KAY: 57 HP, 653 Ps; samples from Persian native speakers: 10 HP, 19 Ps | SV /a/ | Same as above | SVM | Accuracy = 100% on both databases | [98]
| 30 HP, 30 Ps | SV /a/ | Daubechies' DWT, LPC | Least-squares SVM | Accuracy > 90% | [97]
| MEEI: 53 HP, 173 Ps | SV /a/ and SS | Linear Prediction Coefficients | GMM | Accuracy = 99.94% (voice disorder), 99.75% (running speech) | [101]
Dysphonia | Corpus Gesproken Nederlands corpus; EST speech database: 16 Ps; CHASING01 speech database: 5 Ps; Flemish COPAS pathological speech corpus: 122 HP, 197 Ps | SV /a/ and SS | Gammatone filterbank features and bottleneck features | Time-frequency CNN | Accuracy ≈ 89% | [144]
| TORGO dataset: 8 HP, 7 Ps | SS | Mel-spectrogram | Transfer-learning-based CNN model | Accuracy = 97.73% | [145]
| UA-Speech: 13 HP, 15 Ps | SS | Time- and frequency-domain glottal features and PCA-based glottal features | Multiclass SVM | Best accuracy ≈ 69% | [146]
Pathological voice | SVD: approximately 400 native German speakers | SV /a/ | Co-occurrence matrix | GMM | Accuracy reaches 99% using voice alone | [102]
| MEEI: 53 HP; SVD: 1500 Ps | SV /a/ | Local binary pattern, MFCC | GMM, extreme learning machine | Best accuracy = 98.1% | [103]
| SVD | SV /a/, /i/, /u/ | Multi-center and multi-threshold based ternary patterns; features selected by Neighborhood Component Analysis | NB, KNN, DT, SVM, bagged tree, linear discriminant | Accuracy = 100% | [108]
| SVD: samples of speakers aged 15–60 years | SV /a/ | Features extracted from spectrograms by CNN | CNN, LSTM | Accuracy reaches 95.65% | [109]
Cyst, polyp, paralysis | SVD: 262 HP, 244 Ps; MEEI: 53 HP, 95 Ps | SV /a/ | Spectrogram | CNN (VGG16 Net and Caffe-Net), SVM | Accuracy = 98.77% on SVD | [105]
| SVD: 686 HP, 1342 Ps | SV /a/, /i/, /u/ and SS | Spectro-temporal representation of the signal | Parallel CNN | Accuracy = 95.5% | [106]
Acute decompensated heart failure | 1484 recordings from 40 patients | SS | Time, frequency resolution, and linear versus perceptual (ear) mode | Similarity calculation and clustering algorithm | 94% of cases are tagged as different from the baseline | [100]
Common vocal diseases | FEMH data: 588 HP; phonotrauma data: 366 HP | SV /a/ | MFCC and medical record features | GMM and DNN, two-stage DNN | Best accuracy = 87.26% | [107]
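Many of the otorhinolaryngology rows above pair frame-level MFCCs from a sustained /a/ with a GMM or SVM back end (e.g., [92,94,103]). The following is a minimal sketch of that pattern, not a reproduction of any cited study; the file lists, 16 kHz sample rate, 13 MFCCs, and 8-component mixtures are illustrative assumptions.

```python
# Minimal sketch of a sustained-vowel MFCC + GMM screen (illustrative only;
# paths, sample rate and model sizes are assumptions, not values from the
# studies cited in the table).
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=16000, n_mfcc=13):
    """Frame-level MFCCs of one sustained-vowel recording, shape (frames, n_mfcc)."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def fit_class_gmm(paths, n_components=8):
    """One GMM per class (healthy or pathological), trained on pooled frames."""
    X = np.vstack([mfcc_frames(p) for p in paths])
    return GaussianMixture(n_components=n_components, covariance_type="diag").fit(X)

def classify(path, gmm_healthy, gmm_pathological):
    """Label a recording by the higher average frame log-likelihood."""
    X = mfcc_frames(path)
    return "pathological" if gmm_pathological.score(X) > gmm_healthy.score(X) else "healthy"

# Usage (hypothetical file lists):
# gmm_h = fit_class_gmm(healthy_wavs); gmm_p = fit_class_gmm(pathological_wavs)
# print(classify("test_vowel.wav", gmm_h, gmm_p))
```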
Application of speech technology in pathological voice recognition and evaluation (neurology department).
Disease | Data sources | Voice type | Voice feature | Classifier | Effect | Ref.
Parkinson's Disease (PD) | UCI Machine Learning Repository: 8 HP, 23 Ps | SV | Features selected by the Relief algorithm | SVM and bacterial foraging algorithm | Best accuracy = 97.42% | [119]
| 98 S | SV /a/, SS | OpenSMILE features, MPEG-7 features, etc. | RF | Best accuracy ≈ 80% | [120]
| UCI Machine Learning Repository; training: 20 HP, 20 Ps; testing: 28 S | SV and SS | Wavelet Packet Transforms, MFCC, and their fusion | HMM, SVM | Best accuracy = 95.16% | [121]
| Group 1: 28 PD Ps; Group 2: 40 PD Ps | SS | Diadochokinetic sequences with repeated [pa], [ta], and [ka] syllables | Ordinal regression models | The [ka] model achieves agreement with human raters' perception | [122]
| Istanbul acoustic dataset (IAD) [123]: 74 HP, 188 Ps; Spanish acoustic dataset (SAD) [124]: 80 HP, 40 Ps | SV /a/ | MFCC, Wavelet and Tunable Q-Factor Wavelet Transform, Jitter, Shimmer, etc. | Three DTs | Best accuracy = 94.12% on IAD and 95% on SAD | [125]
| Training: 392 HP, 106 Ps; testing: 80 HP, 40 Ps | SS | MFCC, Bark-band Energies (BBE), F0, etc. | RF, SVM, LR, Multiple Instance Learning | The best model yielded AUCs of 0.69/0.68/0.63/0.80 for four languages | [126]
| Istanbul acoustic dataset: 74 HP, 188 Ps | SV /a/ | MFCC, Deep Auto-Encoder (DAE), SVM | LR, SVM, KNN, RF, GB, Stochastic Gradient Descent | Accuracy = 95.49% | [127]
| PC-GITA: 50 HP, 50 Ps; SVD: 687 HP, 1355 Ps; vowels dataset: 1676 S | SV | Spectrogram | CNN | Best accuracy = 99% | [128]
Alzheimer's disease (AD) | 50 HP, 20 Ps | SS | Fractal dimension and some features selected by algorithms | MLP, KNN | Best accuracy = 92.43% on AD | [104]
PD, Huntington's disease (HD), or dementia | 8 HP, 7 Ps | SS | Pitch, Gammatone cepstral coefficients, MFCC, wavelet scattering transform | Bi-LSTM | Accuracy = 94.29% | [110]
Dementia | Two corpora recorded at the hospital's memory clinic in Sheffield, UK; corpus 1: 30 Ps; corpus 2: 12 Ps, 24 S | SS | 44 features (20 conversation-analysis based, 12 acoustic, and 12 lexical) | SVM | Accuracy = 90.9% | [111]
| DementiaBank Pitt Corpus [112]: 98 HP, 169 Ps; PROMPT Database [113]: 72 HP, 91 Ps | SS | Combined Low-Level Descriptor (LLD) features extracted by openSMILE [114] | Gated CNN | Accuracy = 73.1% on the Pitt Corpus and 74.1% on PROMPT | [115]
Dysarthria | UA-Speech: 12 HP, 15 CP Ps; MoSpeeDi: 20 HP, 20 Ps; PC-GITA database [116]: 45 HP, 45 PD Ps | SS | Spectro-temporal subspace, MFCC, the frequency-dependent shape parameter | Grassmann Discriminant Analysis | Best accuracy = 96.3% on UA-Speech | [117]
| 65 HP, 65 MS-positive Ps | SS | Seven features including speech duration, vowel-to-recording ratio, etc. | SVM, RF, KNN, MLP, etc. | Accuracy = 82% | [118]
Distinguishing two kinds of dysarthria | 174 HP, 76 Ps | SV and SS | Cepstral peak prominence | Classification and regression tree; RF; Gradient Boosting Machine (GBM); XGBoost | Accuracy = 83% | [155]
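Several of the Parkinson's-disease and dementia rows above feed precomputed acoustic descriptors (openSMILE features, MFCC, jitter/shimmer) into tree ensembles or SVMs (e.g., [120,126]). The sketch below illustrates that tabular workflow with a cross-validated Random Forest; the CSV file name and its "label" column are hypothetical placeholders rather than artifacts of any cited study.

```python
# Minimal sketch: cross-validated Random-Forest screening on precomputed
# acoustic features (e.g., an openSMILE export). The CSV name and layout
# ("label" = 1 patient / 0 healthy, remaining columns = features) are assumed.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("pd_acoustic_features.csv")       # hypothetical feature export
X, y = df.drop(columns=["label"]), df["label"]

clf = RandomForestClassifier(n_estimators=300, random_state=0)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"5-fold ROC AUC: {auc.mean():.2f} +/- {auc.std():.2f}")
```

In practice, folds should be split by speaker rather than by recording, so the same voice never appears in both training and test data.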
Application of speech technology in pathological voice recognition and evaluation (respiratory department).
Disease | Data sources | Voice type | Voice feature | Classifier | Effect | Ref.
COVID-19 | 130 HP, 69 Ps | SV /a/ and cough | Feature sets extracted with openSMILE (open-source software) and a deep CNN, respectively | SVM and RF | Accuracy ≈ 80% | [129]
| Sonda Health COVID-19 2020 (SHC) dataset [130]: 44 HP, 22 Ps | SV and SS | Features (glottal, spectral, prosodic) extracted by the COVAREP speech toolkit | DT | Feature-task combinations reach accuracy > 80% | [131]
| Coswara: 490 HP, 54 Ps | SV /a/, /i/, /o/ | MFCC, Fundamental Frequency (F0), jitter, shimmer, HNR | SVM | Accuracy ≈ 97% | [132]
| DiCOVA Challenge dataset and COUGHVID: training: 772 HP, 50 Ps; validation: 193 HP, 25 Ps; testing: 233 S | Cough | MFCC, Teager Energy Cepstral Coefficients (TECC) | LightGBM | The best result is 76.31% | [133]
| MSC-COVID-19 database: 260 S | SS | Mel spectrogram | SVM & ResNet | Assessing patient status by sound is effective | [134]
| Integrated Portable Medical Assistant collection: 36 S | Cough and speech | Mel spectrogram, Local Ternary Pattern | SVM | Accuracy = 100% | [135]
| COUGHVID: more than 20,000 S; Cambridge dataset [136]: 660 HP, 204 Ps; Coswara: 1785 HP, 346 Ps | Cough | MFCC, spectral features, chroma features | ResNet and DNN | Sensitivity = 93%, specificity = 94% | [137]
| COUGHVID: 1010 Ps; Coswara: 400 Ps; Covid19-Cough: 682 Ps | Cough, breathing cycles, and SS | Mel-spectrograms and cochleagrams, etc. | DCNN, LightGBM | AUC reaches 0.8 | [138]
| Cambridge dataset: 330 HP, 195 Ps; Coswara: 1134 HP, 185 Ps; Virufy: 73 HP, 48 Ps; NoCoCODa: 73 Ps | Cough | Audio features, including MFCC, Mel-scaled spectrogram, etc. | Extremely Randomized Trees, SVM, RF, MLP, KNN, etc. | AUC reaches 0.95 | [139]
| Coswara: 1079 HP, 92 Ps; Sarcos: 26 HP, 18 Ps | Cough | MFCC | LR, KNN, SVM, MLP, CNN, LSTM, ResNet50 | AUC reaches 0.98 | [140]
| Coswara, ComParE dataset, Sarcos dataset | Cough, breathing, sneeze, speech | Bottleneck features | LR, SVM, KNN, MLP | AUC reaches 0.98 | [141]
Chronic Obstructive Pulmonary Disease | 25 HP, 30 Ps | Respiratory sound signals | MFCC, LPC, etc. | SVM, KNN, LR, DT, etc. | Accuracies of SVM and LR are 100% | [142]
| 429 respiratory sound samples | Respiratory sound signals | MFCC; Hilbert-Huang Transform (HHT)-MFCC; HHT-MFCC-Energy | SVM | Accuracy = 97.8% with HHT-MFCC-Energy | [143]
Tuberculosis (TB) | 21 HP, 17 Ps; cough recordings: 748 | Cough | MFCC, log spectral energy | LR | AUC reaches 0.95 | [148]
| 35 HP, 16 Ps; cough recordings: 1358 | Cough | MFCC, log-filterbank energies, zero-crossing rate, kurtosis | LR, KNN, SVM, MLP, CNN | LR outperforms the other four classifiers, achieving an AUC of 0.86 | [147]
| TASK, Sarcos, Brooklyn datasets: 21 HP, 17 Ps; Wallacedene dataset: 16 Ps; Coswara: 1079 HP, 92 Ps; ComParE: 398 HP, 199 Ps | Cough | MFCC | CNN, LSTM, ResNet50 | AUC: ResNet50 = 91.90%, CNN = 88.95%, LSTM = 88.84% | [149]
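Most of the cough-based COVID-19 and TB screens above convert recordings to (log-)Mel spectrograms and train a CNN or ResNet-style classifier (e.g., [137,140,149]). The sketch below shows a minimal version of that pipeline; the 16 kHz sample rate, 64 Mel bands, 128-frame crop, and network depth are assumptions chosen for illustration, not settings reported in the cited papers.

```python
# Minimal sketch: log-Mel spectrogram + small CNN cough screen (illustrative;
# audio settings and architecture are assumptions, not from the cited studies).
import numpy as np
import librosa
import tensorflow as tf

def log_mel(path, sr=16000, n_mels=64, frames=128):
    """Fixed-size log-Mel patch of shape (n_mels, frames, 1) from one recording."""
    y, sr = librosa.load(path, sr=sr)
    m = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
    m = m[:, :frames]                                     # crop long clips
    m = np.pad(m, ((0, 0), (0, frames - m.shape[1])))     # zero-pad short clips
    return m[..., np.newaxis]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 128, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),       # P(positive)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
# model.fit(np.stack([log_mel(p) for p in train_wavs]), train_labels, ...)  # hypothetical data
```

As with the tabular models, train/test splits should be made by subject, since several of the corpora above contain multiple recordings per person.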
Application of speech technology in pathological voice recognition and evaluation (others).
Disease | Data sources | Voice type | Voice feature | Classifier | Effect | Ref.
Juvenile Idiopathic Arthritis | 5 HP, 3 Ps | Knee acoustical signals | Spectral, MFCC, or band power features | Gradient Boosted Trees (GBT), neural network | Accuracy = 92.3% using GBT, 72.9% using the neural network | [150]
Stress | 6 categories of emotions, namely Surprise, Fear, Neutral, Anger, Sad, and Happy | SS (facial expressions, content of speech) | Mel-scaled spectrogram | Multinomial Naïve Bayes, Bi-LSTM, CNN | Assessing students' stress by facial expressions and speech is effective | [86]
Depression and other psychiatric conditions | Group 1: depression (DP), 27 S; Group 2: other psychiatric conditions (OP), 12 S; Group 3: normal controls (NC), 27 S | SS | Features extracted by openSMILE and the Weka program [151] | Five multiclass classifier schemes of scikit-learn | Accuracy = 83.33%, sensitivity = 83.33%, specificity = 91.67% | [152]
Depression | AVEC 2014 dataset: 84 S; TIMIT dataset | SS | TEO-CB-Auto-Env, cepstral, prosodic, spectral, and glottal features, MFCC | Cosine similarity | Accuracy = 90% | [154]

SV = sustained vowel, SS = spontaneous speech, Ps = patients, HP = healthy people, S = subjects.
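Many rows throughout the table rely on the same small set of sustained-vowel perturbation measures: F0, jitter, shimmer, and HNR (e.g., [96], [125], [132]). For reference, these can be computed with Praat through the parselmouth wrapper; the sketch below assumes a single WAV file and a generic 75–600 Hz pitch search range, and the exact analysis settings used in the cited studies may differ.

```python
# Minimal sketch: F0 / jitter / shimmer / HNR for one sustained vowel via
# Praat (parselmouth). Pitch range and thresholds are generic defaults,
# not the settings of any study cited in the table.
import parselmouth
from parselmouth.praat import call

def perturbation_features(path, f0_min=75, f0_max=600):
    snd = parselmouth.Sound(path)
    pitch = snd.to_pitch(pitch_floor=f0_min, pitch_ceiling=f0_max)
    f0_mean = call(pitch, "Get mean", 0, 0, "Hertz")           # mean F0 over the file
    points = call(snd, "To PointProcess (periodic, cc)", f0_min, f0_max)
    jitter = call(points, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
    shimmer = call([snd, points], "Get shimmer (local)",
                   0, 0, 0.0001, 0.02, 1.3, 1.6)
    hnr = call(snd.to_harmonicity_cc(minimum_pitch=f0_min), "Get mean", 0, 0)
    return {"F0": f0_mean, "jitter": jitter, "shimmer": shimmer, "HNR": hnr}

print(perturbation_features("sustained_a.wav"))  # hypothetical recording
```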