Table 4.
Disease | Data sources | Voice type | Voice feature | Classifier | Effect | Ref. |
---|---|---|---|---|---|---|
Vocal fold disorders | 41 HP, 111 Ps | SV/a/ | Jitter, RAP, Shimmer, APQ, MFCC, Harmonic-to-Noise Ratio (HNR), SPI | ANN, GMM, HMM, SVM | Average classification rate of GMM reaches 95.2% | [92] |
| KAY database: 53 HP, 94 Ps | SV/a/ | Wavelet-packet coefficients, energy, and entropy, selected by algorithms | SVM, KNN | Best accuracy = 91% | [93] |
| MEEI: 53 HP, 657 Ps | SV/a/ | Features based on the phenomena of critical bandwidths | GMM | Best accuracy = 99.72% | [94] |
Benign Vocal Fold Lesions | MEEI: 53 HP, 63 Ps; SVD: 869 HP, 108 Ps; Hospital Universitario Príncipe de Asturias (HUPA): 239 HP, 85 Ps; UEX-Voice: 30 HP, 84 Ps | SV/a/ and SS | MFCC, HNR, Energy, Normalized Noise Energy | Random Forest (RF) and Multi-condition Training | Accuracies: about 95% on MEEI, 78% on HUPA, and 74% on SVD | [95] |
Voice disorder | MEEI: 53 HP, 372 Ps; SVD: 685 HP, 685 Ps; VOICED: 58 HP, 150 Ps | SV/a/ | Fundamental Frequency (F0), jitter, shimmer, HNR | Boosted Trees (BT), KNN, SVM, Decision Tree (DT), Naive Bayes (NB) | Best performance achieved by BT (AUC = 0.91) | [96] |
| KAY: 213 Ps | SV/a/ | Features extracted through an adaptive wavelet filterbank | SVM | Successfully classified six types of disorders | [97] |
| KAY: 57 HP, 653 Ps; samples from Persian native speakers: 10 HP, 19 Ps | SV/a/ | Same as above | SVM | Accuracy = 100% on both databases | [98] |
| 30 HP, 30 Ps | SV/a/ | Daubechies' DWT, LPC | Least-squares SVM | Accuracy > 90% | [97] |
| MEEI: 53 HP, 173 Ps | SV/a/ and SS | Linear Prediction Coefficients | GMM | Accuracy = 99.94% (voice disorder) and 99.75% (running speech) | [101] |
Dysphonia | Corpus Gesproken Nederlands corpus; EST speech database: 16 Ps; CHASING01 speech database: 5 Ps; Flemish COPAS pathological speech corpus: 122 HP, 197 Ps | SV/a/ and SS | Gammatone filterbank features and bottleneck features | Time-frequency CNN | Accuracy ≈ 89% | [144] |
| TORGO Dataset: 8 HP, 7 Ps | SS | Mel-spectrogram | Transfer-learning-based CNN model | Accuracy = 97.73% | [145] |
| UA-Speech: 13 HP, 15 Ps | SS | Time- and frequency-domain glottal features and PCA-based glottal features | Multiclass SVM | Best accuracy ≈ 69% | [146] |
Pathological Voice | SVD: approximately 400 native Germans | SV/a/ | Co-Occurrence Matrix | GMM | Accuracy reaches 99% using voice alone | [102] |
| MEEI: 53 HP; SVD: 1500 Ps | SV/a/ | Local binary pattern, MFCC | GMM, extreme learning machine | Best accuracy = 98.1% | [103] |
| SVD | SV/a/,/i/,/u/ | Multi-center and multi-threshold based ternary patterns; features selected by Neighborhood Component Analysis | NB, KNN, DT, SVM, bagged tree, linear discriminant | Accuracy = 100% | [108] |
| SVD: samples of speakers aged 15–60 years | SV/a/ | Features extracted from spectrograms by CNN | CNN, LSTM | Accuracy reaches 95.65% | [109] |
Cyst, polyp, paralysis | SVD: 262 HP, 244 Ps; MEEI: 53 HP, 95 Ps | SV/a/ | Spectrogram | CNN (VGG16 Net and Caffe-Net), SVM | Accuracy = 98.77% on SVD | [105] |
| SVD: 686 HP, 1342 Ps | SV/a/,/i/,/u/ and SS | Spectro-temporal representation of the signal | Parallel CNN | Accuracy = 95.5% | [106] |
Acute decompensated heart failure | 1484 recordings from 40 patients | SS | Time and frequency resolution, linear versus perceptual (ear) mode | Similarity calculation and clustering algorithm | 94% of cases are tagged as different from the baseline | [100] |
Common vocal diseases | FEMH data: 588 HP; Phonotrauma data: 366 HP | SV/a/ | MFCC and medical-record features | GMM and DNN, two-stage DNN | Best accuracy = 87.26% | [107] |
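Jitter and shimmer recur as voice features throughout the table above. As an illustration only (not the implementation of any cited study), the following numpy sketch computes local jitter and shimmer as the mean relative cycle-to-cycle variation of period and peak amplitude on a sustained vowel; the peak-tracking helper `period_marks`, its search-window sizes, and the need for a rough F0 estimate are all assumptions made for this sketch.

```python
import numpy as np

def period_marks(signal, sr, f0_approx):
    """Mark one peak per glottal cycle by searching near each expected period."""
    period = int(sr / f0_approx)
    i = int(np.argmax(signal[:period]))      # peak of the first cycle
    peaks = []
    while i + period + period // 2 < len(signal):
        lo, hi = i + period - period // 4, i + period + period // 4
        i = lo + int(np.argmax(signal[lo:hi]))
        peaks.append(i)
    return np.array(peaks)

def jitter_shimmer(signal, sr, f0_approx):
    """Local jitter and shimmer (%) as mean relative cycle-to-cycle variation."""
    p = period_marks(signal, sr, f0_approx)
    periods = np.diff(p) / sr                # cycle lengths in seconds
    amps = signal[p[1:]]                     # per-cycle peak amplitudes
    jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods) * 100.0
    shimmer = np.mean(np.abs(np.diff(amps))) / np.mean(amps) * 100.0
    return jitter, shimmer
```

On a perfectly periodic synthetic vowel both values are near zero; the elevated values seen in pathological voices are what the classifiers in the table exploit.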
Application of speech technology in pathological voice recognition and evaluation (neurology department) | ||||||
---|---|---|---|---|---|---|
Disease | Data sources | Voice type | Voice feature | Classifier | Effect | Ref. |
Parkinson's Disease (PD) | UCI Machine Learning repository: 8 HP, 23 Ps | SV | Features selected by the Relief algorithm | SVM and bacterial foraging algorithm | Best accuracy = 97.42% | [119] |
| 98 S | SV/a/, SS | OpenSMILE features, MPEG-7 features, etc. | RF | Best accuracy ≈ 80% | [120] |
| UCI Machine Learning repository; Training: 20 HP, 20 Ps; Testing: 28 S | SV and SS | Wavelet Packet Transforms, MFCC, and their fusion | HMM, SVM | Best accuracy = 95.16% | [121] |
| Group 1: 28 PD Ps; Group 2: 40 PD Ps | SS | Diadochokinetic sequences with repeated [pa], [ta], and [ka] syllables | Ordinal regression models | The [ka] model achieves agreement with human raters' perception | [122] |
| Istanbul acoustic dataset (IAD) [123]: 74 HP, 188 Ps; Spanish acoustic dataset (SAD) [124]: 80 HP, 40 Ps | SV/a/ | MFCC, Wavelet and Tunable Q-Factor wavelet transforms, Jitter, Shimmer, etc. | Three DTs | Best accuracy = 94.12% on IAD and 95% on SAD | [125] |
| Training: 392 HP, 106 Ps; Testing: 80 HP, 40 Ps | SS | MFCC, Bark-band Energies (BBE), F0, etc. | RF, SVM, LR, Multiple Instance Learning | The best model yielded AUCs of 0.69/0.68/0.63/0.80 for four languages | [126] |
| Istanbul acoustic dataset: 74 HP, 188 Ps | SV/a/ | MFCC, Deep Auto Encoder (DAE), SVM | LR, SVM, KNN, RF, GB, Stochastic Gradient Descent | Accuracy = 95.49% | [127] |
| PC-GITA: 50 HP, 50 Ps; SVD: 687 HP, 1355 Ps; Vowels dataset: 1676 S | SV | Spectrogram | CNN | Best accuracy = 99% | [128] |
Alzheimer's disease (AD) | 50 HP, 20 Ps | SS | Fractal dimension and some features selected by algorithms | MLP, KNN | Best accuracy = 92.43% on AD | [104] |
PD, Huntington's disease (HD), or dementia | 8 HP, 7 Ps | SS | Pitch, Gammatone cepstral coefficients, MFCC, wavelet scattering transform | Bi-LSTM | Accuracy = 94.29% | [110] |
Dementia | Two corpora recorded at the Hospital's memory clinic in Sheffield, UK; corpus 1: 30 Ps; corpus 2: 12 Ps, 24 S | SS | 44 features (20 conversation-analysis-based, 12 acoustic, and 12 lexical) | SVM | Accuracy = 90.9% | [111] |
| DementiaBank Pitt Corpus [112]: 98 HP, 169 Ps; PROMPT Database [113]: 72 HP, 91 Ps | SS | Combined Low-Level Descriptor (LLD) features extracted by openSMILE [114] | Gated CNN | Accuracy = 73.1% on Pitt Corpus and 74.1% on PROMPT | [115] |
Dysarthria | UA-Speech: 12 HP, 15 CP Ps; MoSpeeDi: 20 HP, 20 Ps; PC-GITA database [116]: 45 HP, 45 PD Ps | SS | Spectro-temporal subspace, MFCC, the frequency-dependent shape parameter | Grassmann Discriminant Analysis | Best accuracy = 96.3% on UA-Speech | [117] |
| 65 HP, 65 MS-positive Ps | SS | Seven features including speech duration, vowel-to-recording ratio, etc. | SVM, RF, KNN, MLP, etc. | Accuracy = 82% | [118] |
Distinguishing two kinds of dysarthria | 174 HP, 76 Ps | SV and SS | Cepstral peak prominence | Classification and Regression Tree (CART), RF, Gradient Boosting Machine (GBM), XGBoost | Accuracy = 83% | [155] |
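The harmonics-to-noise ratio (HNR) appears as a feature in several rows of the tables above. One standard way to estimate it is from the normalized autocorrelation at the lag of the fundamental period (after Boersma's formulation, HNR = 10·log10(r/(1−r))); the numpy sketch below is illustrative only, and the function name, pitch search range, and clipping bounds are assumptions, not code from any cited study.

```python
import numpy as np

def hnr_db(frame, sr, f0_min=75.0, f0_max=500.0):
    """Estimate HNR (dB) from the normalized autocorrelation peak at the
    fundamental period: HNR = 10 * log10(r / (1 - r))."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / ac[0]                               # normalize so r(0) = 1
    lo, hi = int(sr / f0_max), int(sr / f0_min)   # plausible pitch-period lags
    r = ac[lo + int(np.argmax(ac[lo:hi]))]        # r at the fundamental period
    r = float(np.clip(r, 1e-6, 1.0 - 1e-6))
    return 10.0 * np.log10(r / (1.0 - r))
```

A clean sustained vowel yields a high HNR, while added noise (as in breathy or hoarse phonation) drives r at the pitch period down and the ratio with it.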
Application of speech technology in pathological voice recognition and evaluation (respiratory department) | ||||||
---|---|---|---|---|---|---|
Disease | Data sources | Voice type | Voice feature | Classifier | Effect | Ref. |
COVID-19 | 130 HP, 69 Ps | SV/a/ and cough | Feature sets extracted with openSMILE (open-source software) and a deep CNN, respectively | SVM and RF | Accuracy ≈ 80% | [129] |
| Sonda Health COVID-19 2020 (SHC) dataset [130]: 44 HP, 22 Ps | SV and SS | Features (glottal, spectral, prosodic) extracted by the COVAREP speech toolkit | DT | Feature-task combinations with accuracy > 80% | [131] |
| Coswara: 490 HP, 54 Ps | SV/a/,/i/,/o/ | MFCC, Fundamental Frequency (F0), jitter, shimmer, HNR | SVM | Accuracy ≈ 97% | [132] |
| DiCOVA Challenge dataset and COUGHVID; Training: 772 HP, 50 Ps; Validation: 193 HP, 25 Ps; Testing: 233 S | Cough | MFCC, Teager Energy Cepstral Coefficients (TECC) | LightGBM | Best result = 76.31% | [133] |
| MSC-COVID-19 database: 260 S | SS | Mel spectrogram | SVM and ResNet | Assessing patient status by sound is effective | [134] |
| Integrated Portable Medical Assistant collected: 36 S | Cough and speech | Mel spectrogram, Local Ternary Pattern | SVM | Accuracy = 100% | [135] |
| COUGHVID: more than 20,000 S; Cambridge Dataset [136]: 660 HP, 204 Ps; Coswara: 1785 HP, 346 Ps | Cough | MFCC, spectral features, chroma features | ResNet and DNN | Sensitivity = 93%, specificity = 94% | [137] |
| COUGHVID: 1010 Ps; Coswara: 400 Ps; Covid19-Cough: 682 Ps | Cough, breathing cycles, and SS | Mel-spectrograms and cochleagrams, etc. | DCNN, LightGBM | AUC reaches 0.8 | [138] |
| Cambridge dataset: 330 HP, 195 Ps; Coswara: 1134 HP, 185 Ps; Virufy: 73 HP, 48 Ps; NoCoCODa: 73 Ps | Cough | Audio features, including MFCC, Mel-scaled spectrogram, etc. | Extremely Randomized Trees, SVM, RF, MLP, KNN, etc. | AUC reaches 0.95 | [139] |
| Coswara: 1079 HP, 92 Ps; Sarcos: 26 HP, 18 Ps | Cough | MFCC | LR, KNN, SVM, MLP, CNN, LSTM, ResNet50 | AUC reaches 0.98 | [140] |
| Coswara, ComParE dataset, Sarcos dataset | Cough, breathing, sneeze, speech | Bottleneck features | LR, SVM, KNN, MLP | AUC reaches 0.98 | [141] |
Chronic Obstructive Pulmonary Disease | 25 HP, 30 Ps | Respiratory sound signals | MFCC, LPC, etc. | SVM, KNN, LR, DT, etc. | Accuracies of SVM and LR are 100% | [142] |
| 429 respiratory sound samples | Respiratory sound signals | MFCC; Hilbert-Huang Transform (HHT)-MFCC; HHT-MFCC-Energy | SVM | Accuracy = 97.8% with HHT-MFCC-Energy | [143] |
Tuberculosis (TB) | 21 HP, 17 Ps, cough recordings: 748 | Cough | MFCC, Log spectral energy | LR | AUC reaches 0.95 | [148] |
| 35 HP, 16 Ps, cough recordings: 1358 | Cough | MFCC, Log-filterbank energies, zero-crossing rate, kurtosis | LR, KNN, SVM, MLP, CNN | LR outperforms the other four classifiers, achieving an AUC of 0.86 | [147] |
| TASK, Sarcos, Brooklyn datasets: 21 HP, 17 Ps; Wallacedene dataset: 16 Ps; Coswara: 1079 HP, 92 Ps; ComParE: 398 HP, 199 Ps | Cough | MFCC | CNN, LSTM, ResNet50 | ResNet50 AUC = 91.90%; CNN AUC = 88.95%; LSTM AUC = 88.84% | [149] |
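MFCCs are the single most common feature across the tables above. As an illustrative sketch only (the cited studies typically use toolkits such as openSMILE rather than hand-rolled code), the following compact numpy implementation walks the standard pipeline: framing and windowing, power spectrum, triangular mel filterbank, log compression, and a DCT-II over the mel axis. All parameter defaults (26 mel bands, 13 coefficients, 25 ms frames with 10 ms hop at 16 kHz) are assumptions for the sketch.

```python
import numpy as np

def mfcc(signal, sr, n_mels=26, n_mfcc=13, frame_len=400, hop=160):
    """Minimal MFCC: framing -> power spectrum -> mel filterbank -> log -> DCT-II."""
    # Slice the signal into overlapping frames and apply a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, frame_len)) ** 2   # (n_frames, n_bins)

    # Triangular filters with edges evenly spaced on the mel scale
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((frame_len + 1) * pts / sr).astype(int)
    fbank = np.zeros((n_mels, frame_len // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II over the mel axis; keep the first n_mfcc coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return log_mel @ dct.T                                # (n_frames, n_mfcc)
```

The resulting (frames × coefficients) matrix is what the SVM, GMM, and LR classifiers in the tables consume, usually after per-utterance statistics or mean pooling.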
Application of speech technology in pathological voice recognition and evaluation (others) | ||||||
---|---|---|---|---|---|---|
Disease | Data sources | Voice type | Voice feature | Classifier | Effect | Ref. |
Juvenile Idiopathic Arthritis | 5 HP, 3 Ps | Knee acoustic signals | Spectral, MFCC, or band-power features | Gradient Boosted Trees (GBT), neural network | Accuracy = 92.3% using GBT, 72.9% using neural network | [150] |
Stress | 6 categories of emotions, namely Surprise, Fear, Neutral, Anger, Sad, and Happy | SS (facial expressions, content of speech) | Mel-scaled spectrogram | Multinomial Naïve Bayes, Bi-LSTM, CNN | Assessing students' stress by facial expressions and speech is effective | [86] |
Depression and Other Psychiatric Conditions | Group 1: depression (DP), 27 S; Group 2: other psychiatric conditions (OP), 12 S; Group 3: normal controls (NC), 27 S | SS | Features extracted by openSMILE and the Weka program [151] | Five multiclass classifier schemes from scikit-learn | Accuracy = 83.33%, sensitivity = 83.33%, and specificity = 91.67% | [152] |
Depression | AVEC 2014 dataset: 84 S; TIMIT dataset | SS | TEO-CB-Auto-Env, Cepstral, Prosodic, Spectral, and Glottal, MFCC | Cosine similarity | Accuracy = 90% | [154] |
SV = Sustained vowel, SS = Spontaneous speech, Ps = Patients, HP = Healthy People, S = Subjects.