2023 Dec 7;6:228. doi: 10.1038/s41746-023-00959-9

Table 1.

Overview of included studies.

First Author and Publication Year Number of participants with ALS (Number with Bulbar Onset) Length of follow up Frequency of data collection Speech sample acquisition method Device Brand Speech features assessed Additional assessments Mean Total ALS-FRS (R) (Mean Bulbar Sub-score) Results/Conclusions
Agurto, 2019 [28] 42 (34) 11 months

Once weekly

(ALS-FRS/FVC/SVC at start point)

App Help us Answer ALS Pitch variation, prosody features, vowel space, vowel quality and noise measurements, mel frequency cepstral coefficients (MFCC), tremor features including tremor frequency and tremor amplitude, spectral features (including spectral slope, the maximum energy, the frequency where the maximum value is obtained, as well as the median and the energy IQR of the long-term average spectra)

ALS-FRS(R)

FVC

SVC

F: 34.70; M: 40.50 (-)

AR and MFCC highly correlated with initial speech score.

Best speech features obtained from the reading task.

Initial scores unable to determine disease trajectory but extracted features could.

Progression related more to onset type than initial score.

Berry, 2019 [27] 23 (4) 24 weeks

Twice weekly for passage reading.

4 days per week for 3 randomly assigned questions; ALS-FRS(R) in full once weekly

Clinician-administered ALS-FRS(R) at 6, 12 and 24 weeks

App Beiwe Speech and pause variables

ALS-FRS(R)

VC

ALS-CBS

34.00 (-)

The correlation between clinic assessed and app ALS-FRS(R) scores taken at a single time point was high at 0.93.

High correlation for trajectory of decline between clinic assessment and app-based assessment.

An increase in pause duration was demonstrated over time.

Buder, 1996 [31] 1 (1) 2 years 3 assessments completed Audiocassette - FORMOFFA (FOR = FORmants; MO = MOments; FF = Fundamental Frequency; A = Amplitude). None - Reduction in amplitude (flattening) and reduced spectral mean variability (M1).
Cebola, 2023 [53] 40 (-) Single-time point - Smartphone -

Window-based features (temporal, spectral and statistical).

Full signal features (silence features and formant features).

None -

Sample based classification:

Best overall data was from the vowel task.

Best results came from use of the whole feature set.

The best results were achieved with SVM for both datasets.

Patient based classification:

Achieved better results than sample-based with the best results achieved using the complete feature set.

Accuracy and F1 scores also improved.

Accuracy values of over 0.90 using the full speech feature set.

Chiaramonte, 2019 [10] 22 (22) 3 months Monthly Multi-Dimensional Voice Programme (MDVP) Multi-Dimensional Voice Programme (MDVP) F0; long-term control of amplitude (vAM); jitter; shimmer; Noise to harmonic ratio (NHR); Amplitude tremor intensity index (ATRI); Frequency tremor intensity index (FTRI); Voice turbulence index (VTI); Soft phonation index (SPI); Degree of subharmonic components (DSH) and Degree of voice break (DVB).

GIRBAS Scale

Penetration Aspiration Scale

ENT assessment

Spectrogram

Electroglottography

FEES

-

Jitter, shimmer, vF0, ATRI, FTRI and vAM all significantly increased.

NHR significantly reduced.

The scores of the GIRBAS scale and Penetration Aspiration Scale (PAS) were higher at 3 months than at initial assessment, indicating increased difficulty in swallowing.

Both progressive dysphagia and dysarthria are associated with muscle weakness and loss of motor control.

Acoustic analysis should be used in combination with other assessments including swallowing assessment.

Garcia-Gancedo, 2019 [22] 25 (2) 48 weeks

Clinic visit every 12 weeks (speech assessment)

Wore sensor for 3 consecutive days every month

Microphone High fidelity speech capture system Central tendency of F0, jitter, shimmer, speaking rate, average phoneme rate and % pause time

ALS-FRS(R)

FVC

41.60 (-)

All (100%) patients captured the digital speech data successfully at baseline and at week 48.

All collected data was successfully analysed.

Little or no observable change from baseline was seen over time.

Recording of voice samples using this method is feasible and acceptable to patients.

Kelly, 2020 [23] 25 (2) 48 weeks

Clinic visit every 12 weeks (speech assessment)

Wore sensor for 3 consecutive days every month

Microphone High fidelity speech capture system Central tendency of F0, jitter, shimmer, speaking rate, average phoneme rate and % pause time

ALS-FRS(R)

FVC

41.60 (11.40) Four speech endpoints showed between-patient correlation coefficients >0.5 with the bulbar (average phoneme rate and average speaking rate) and respiratory (jitter and shimmer) domains of the ALS-FRS(R) score.
Laganaro, 2021 [32] 20 (-) Single-time point - Microphone -

Intelligibility, articulation, pneumophonatory control.

Voice features: jitter; shimmer; F0, harmonic to noise ratio (HNR), cepstral peak prominence (CPPs); prosody, speaking rate and diadochokinetic rate (DDK).

None -

High correlation between device system score and externally validated perceptual score.

Sensitivity of 83.3% and specificity 95.2%.

System able to identify abnormal speech but not distinguish between pathologies.

Lévêque, 2022 [33] 33 (33) Single-time point - Earset Microphone Focusrite Scarlett (2i4) external audiocard and a professional quality Shure SM35-XLR earset microphone

TSC_MFCCs: Degree of acoustic change across a sequence.

Mean TSC_MFCCs: Average acoustic change across a sequence.

VARCO: Degree of variability of the acoustic change across a sequence.

eventDUR: Differences in articulation rate.

None -

Mean acoustic change was significantly smaller for ALS and primary lateral sclerosis (PLS) compared with spinal and bulbar muscular atrophy (SBMA) (p < .001 and p < .01, respectively) and controls (p < .001).

ALS showed higher VARCO values than control (p < .0001), PLS (p = .04), and SBMA (p = .001).

ALS and PLS speakers are slower than both SBMA and control speakers for all sequences.

Likhachov, 2021 [30] 31 (-) Single-time point - Smartphone ALS Expert Mobile Application for Android.

Jitter. Shimmer. Features based on F0-contour: Pitch period entropy (PPE) and Pathological vibrato index (PVI).

Noise features: Harmonic to noise ratio (HNR) and Glottal-to-noise excitation ratio (GNE).

None -

Classifier using jitter, PPE, PVI and GNE was the most accurate.

Correct identification of ALS possible using only one voice test.

Liscombe, 2021 [35] 50 (-) Single-time point - Microphone - Speech events and silence events measured as: true negative time, true positive time, false negative and false positive. Speaker loudness. ALS-FRS(R) -

More silences were present for the bulbar cohort, most markedly in the SIT task.

Optimal configuration differs between groups.

Liscombe, 2023 [34] 10 (-) Single-time point - Microphone - Speech events and silence events measured as: true negative time, true positive time, false negative and false positive. Speaker loudness. ALS-FRS(R) -

Most dramatic change between control VAD settings and pathological VAD settings was endSilence which doubled.

When VAD settings were optimised for pathological groups compared to control settings, DCF, I% and FN% reduced.

These fell further when settings were further optimised for data in a specific cohort.

Cohort specific 10-fold cross validation tests to assess robustness found little variation from other data presented.

Maffei, 2023 [36] 49 (7) Single-time point - Lapel Microphone Audio-Technica AT831R Perturbation and Noise based measures: local jitter; local shimmer; harmonic to noise ratio (HNR). Cepstral/spectral measures: cepstral peak prominence (CPP); Low High Spectral Ratio and Cepstral Spectral Index of Dysphonia (CSID) None -

Jitter, shimmer, and HNR levels are abnormal in ALS and can discriminate between normal and dysphonic voices.

Cepstral/spectral measures also discriminated the groups with excellent or acceptable diagnostic accuracy, defined as an area under the curve (AUC) > 0.8 and > 0.7, respectively.

Mori, 2004 [37] 4 (-) Single-time point - Microphone Dynamic microphone or electret condenser microphone MI-1233

F0 range and F0 minimum.

F1: First formant frequency.

F2: Second formant frequency

None -

F0 range narrower in dysarthric speakers.

F0 minimum for ALS did not differ significantly from controls.

F1 and F2 vowel spaces were narrower than controls.

Formant frequencies in expected regions.

Naeini, 2022 [38] 243 (-) Single-time point - Microphone - Pause duration; total duration; speech duration; pause events; % pause; mean phrase; coefficient of variation of phrase durations; coefficient of variation of pause durations SIT -

Both MFA and Wav2Vec2 performed well when compared to the Speech and Pause Analysis (SPA) software.

Wav2Vec2 generalized better across clinical severities.

Wav2Vec2 model performed better with most features.

Audio deemed to be ‘good’ had the strongest correlations.

Neumann, 2021 [51] 54 (32) Single-time point - Microphone - Mean F0; jitter; shimmer; harmonic to noise ratio (HNR); cepstral peak prominence (CPP); speaking & articulation duration and rate; percentage pause time (PPT) and cycle-to-cycle temporal variation (cTV) ALS-FRS(R) Bulbar: 33.09 (8.75); pre-bulbar: 36.45 (12.00)

Strong differences between acoustic features for timing measures.

Effect sizes between bulbar and control groups were highest.

Mean F0 showed a significant difference (smaller effect sizes). UAR was 0.63 between control and pre-bulbar and 0.77 between bulbar and pre-bulbar.

Voice quality measures added power to the predictive model for pre-bulbar samples.

Nevler, 2020 [24] 67 (16) Single-time point - - Speech Activity Detector (SAD), developed at the University of Pennsylvania Linguistic Data Consortium. F0, F0 range, mean speech segment duration, total speech duration, pause rate

Edinburgh Cognitive Assessment Scale (ECAS), ALS-FRS(R),

Motor examination, Mini-Mental State Examination (MMSE)

MRI

ALS: 35 (-)

ALS-FTD: 34.2 (-)

The F0 range was restricted in patients with ALS-FTD compared to healthy controls (p = 0.005). There was no significant difference in F0 range between motor ALS and healthy controls (p = 0.15).

Regression analysis showed strong association between F0 range and severity of bulbar impairment. No association was found between F0 range and cognitive impairment using MMSE score as predictor of F0 range (p = 0.34).

Mean speech segment duration was reduced in ALS-FTD compared to controls (p < 0.001) and motor-ALS (p = 0.042).

Cognitive impairment was associated with mean speech segment duration and total speech time.

Pause rate is related to cognitive function.

Exploratory regression analysis revealed a relationship between F0 range, pause rate and total speech duration, and cortical thickness in different areas of the brain.

Norel, 2018 [29] 67 (-) Single-time point - App ALS Mobile Analyzer Mel-frequency cepstral coefficients (MFCC), spectral changes ALS-FRS(R) - 79% accuracy for males and 83% for females. For males a single feature could distinguish controls and patients. Model was tolerant to uncontrolled recording conditions.
Peplinski, 2019 [25] 65 (-) Several months Daily App ALS at Home Components of tremor: dominant tremor frequency, maximum absolute tremor intensity, median absolute tremor intensity, mean absolute tremor intensity, max relative tremor intensity, median relative tremor intensity, mean relative tremor intensity, tremor energy, tremor entropy None -

Discriminative power to separate perceptually rated tremor vs non-tremor.

Unable to distinguish controls from those with ALS without tremor.

Robert, 1999 [39] 63 (40) Single-time point - Digital tape recorder - F0, jitter, intensity, shimmer, number of harmonics in frequency spectral analysis. None -

5 of 8 acoustic features used were present in symptomatic and asymptomatic ALS.

Jitter significantly higher with bulbar symptoms.

Shimmer and CVF were also higher. Number of harmonics was significantly lower in symptomatic ALS.

MPFR was significantly lower in ALS patients.

Rong, 2015 [16] 66 (15) 60 months Aimed every 3 months but varied based on clinic follow-up schedule. Average number of sessions was 7. Microphone Countryman E6 microphone Respiratory subsystem: Pausing patterns and subglottal pressure. Phonatory subsystem: Jitter; shimmer; noise to harmonic ratio (NHR); loudness; and maximum F0. Resonatory subsystem: Nasalance, peak oral pressure and peak nasal airflow. Articulatory measures.

ALS-FRS(R)

SIT

38.00 (10.00) DDK and F0 identified early bulbar decline occurring before SR and SIT decline
Rong, 2016 [40] 66 (15) 1792 days Approximately every 3 months (varied with clinic schedule) Microphone Countryman E6 58 measures across 4 subsystems. Speech

ALS-FRS(R)

%FVC

SIT

38.00 (10.00)

Decline in AMR task performance prior to speech intelligibility decline.

Distinction between fast and slow bulbar progressors.

Rong, 2020 [19] 16 (-) Single-time point - Microphone - Cycle-to-cycle temporal variation (cTV) and syllable rate (sylRate). SIT -

Cycle-to-cycle temporal variation (cTV) showed large increase in early bulbar disease.

Large effect size of cTV between controls and early bulbar disease.

Rowe, 2022 [20] 46 (-) Single-time point - Microphone or App (depending on database used) Professional quality microphones (e.g., AKG C410, Shure SM81 Condenser, Olympus VN-702PC digital recorder) or the Beiwe application

Coordination - relative duration of the silence between two articulatory gestures during each syllable transition (GapSyllProp).

Consistency - across-repetition variability in voice onset time (RepVarVOT).

Speed - second formant slope in the consonant transition of /k/ (F2Slope).

Precision - across-consonant variability in second formant slope in the consonant transitions of /p/, /t/, and /k/ (ConVarF2Slope).

Rate - number of syllables produced per second (RepRate).

None -

Multivariate analysis indicated a different articulatory pattern depending on the diagnosis of the speaker. This was significant for all articulatory components (coordination, speed, precision, rate).

Overall, Pearson correlation revealed only weak to moderate correlations between pairs of acoustic features, both for each individual pathology and for the whole study population. Speed and Precision were most strongly correlated (0.72) in speakers with ALS.

ALS was the only clinical group where multivariate LDAs using receiver-operating characteristic (ROC) curves showed below acceptable values for sensitivity, specificity, and area under the curve (AUC).

The full feature profile performed significantly better than the individual features at classifying the clinical groups.

Rutkove, 2020 [21] 113 (60) 9 months Daily for 90 days then 2x weekly for additional 180 days (ALS-FRS(R) collected weekly) App ALS at home -

ALS-FRS(R)

FVC

36.10 (-)

Patients reported greater sense of control.

Frequent at home data collection successful and would reduce future sample sizes.

Silbergleit, 1997 [41] 20 (-) Single-time point - Headband microphone

Cspeech

CompuAdd computer, model 320/325

IBM ACPA (audio capture and playback adapter) A/D D/A card

Jitter; shimmer; Signal-to-noise ratio (SNR) and Maximum phonation frequency range (MPFR). Hearing screening - Jitter and maximum phonation frequency range (MPFR) showed significant differences between groups. Shimmer and signal-to-noise ratio (SNR) unable to separate groups.
Stegmann, 2020 [26] 65 (12) 9 months Daily speech samples for 3 months then 2x weekly for 6 months (Average every 2.9 days) App ALS at home

AP & SR

Articulatory precision (AP) and speaking rate (SR)

ALS-FRS(R) 37.10 (9.70)

Speaking rate (SR) and articulatory precision (AP) able to detect bulbar involvement early and track progression.

Remote assessment via mobile app possible.

Decline of AP and SR faster in bulbar-onset than non-bulbar onset.

Tanchip, 2022 [42] 145 (33) Single-time point - Microphone Marantz PMD660 compact flash recorder with an accompanying Countryman E6 omnidirectional microphone or an Olympus WS-853 recorder with an accompanying ME52W unidirectional microphone Diadochokinetic rate (DDK); cycle-to-cycle temporal variation (cTV); number of syllables SIT -

The intraclass correlation coefficient (ICC) calculated between syllable counts was 0.99 between both Raters 1 and 2 and Raters 2 and 3, suggesting excellent reliability of the manual procedure.

Generally, there was overall agreement between the manual and algorithmic syllable detection. Disease severity had a significant effect on syllable count agreement (p < 0.001) with all five algorithms overestimating syllable count in the severe stage, and all except the Energy algorithm overestimating in the moderate stage.

For DDK rate and cTV, the Energy algorithm performed best with correlations of over 0.7 with manual analysis.

Tena, 2022 [43] 47 (14) Single-time point - Microphone USB EMITA Streaming GXT 252 microphone and Audacity (open-source application).

Phonatory subsystem features including: absolute jitter; relative jitter; absolute Shimmer; relative Shimmer; mean harmonic-to-noise ratio; pitch (SD), pitch (min), pitch (max), pitch (mean).

Time frequency features including:

Average instantaneous spectral energy, instantaneous frequency peak and spectral information

None -

Differentiation of diagnosis by gender was the most important finding.

The best model was Random Forest (RF).

RF able to distinguish between control group and bulbar ALS patients with an accuracy of 96.1% and 98.1% for males and females respectively.

Different numbers of statistically significant features were identified depending on the cohort and whether participants were male or female.

Tomik, 2015 [44] 17 (17) 12 months Baseline, at 6 months and 12 months Microphone - F0, jitter, shimmer, noise-to-harmonic ratio (NHR), voice range and maximum phonation time (MPT). None -

Jitter was significantly higher for all examinations in women with ALS compared to controls.

Mean shimmer and NHR values were significantly higher in women with ALS.

Mean F0 did not show a reduction in ALS for either sex.

Tomik, 1999 [45] 53 (15) 36 weeks Every 10-12 weeks Microphone Bruel and Kjaer microphone Articulation time, pause duration. None -

Significant differences between the mean distances for all chosen sounds in both ALS groups. Significant increase over time for mean distances for all sounds in both groups.

Different acoustic signature patterns identified for each ALS group with different sounds showing different distance increases.

Vashkevich, 2018 [61] 26 (-) Single-time point - Smartphone (with a headset) - Distance between vowel envelopes, mutual location of formant frequencies, difference in amplitude of the harmonics. Norris scale -

Reduced distance between vowel envelopes in pathology.

Harmdiff showed a good separation between control group and ALS.

HNR did not show distinction.

High accuracy of 88%.

Vashkevich, 2019 [60] 15 (-) Single-time point - Smartphone (with a standard headset) - Distance between spectral envelopes, formant structure of the speech, formant convergence, breathiness. None -

Distance between vowel envelopes, second formant of ‘I’ and second formant convergence of vowels produced good distinction between controls and ALS group.

84.8% accuracy achieved using just second formants of vowel ‘I’.

Vashkevich, 2021 [59] 31 (13) Single-time point - Smartphone (with a standard headset) - Jitter & shimmer features; F0; spectral envelopes; harmonic-to-noise ratio (HNR); Glottal-to-noise excitation ratio (GNE); Mel-frequency cepstral coefficients (MFCC), Phonatory frequency range (PFR), Pitch period entropy (PPE), Pathological vibrato index (PVI) and tremor and harmonics. None -

Pathological vibrato index (PVI) and Mel-frequency cepstral coefficients (MFCC) are most valuable. MFCC is valuable for early diagnosis by distinguishing from controls.

Pathological vibrato index (PVI) is valuable for identifying later changes and progression of disease.

Jitter, shimmer and harmonic-to-noise ratio (HNR) are less useful.

Wang, 2018 [46] 12 (-) Single-time point - Microphone - Jitter, shimmer and Mel-frequency cepstral coefficients (MFCC) SIT - Combining lip, tongue and acoustic data achieved higher accuracy, better correlation and RMSE than acoustic data alone.
Wang, 2016 [47] 11 (-) Single-time point - Microphone - F0 and Mel-frequency cepstral coefficients (MFCC) SIT -

Acoustic data alone produced accuracy above 50%.

Lip, tongue, and acoustic data combined improved accuracy to 80.91%.

Feasible to detect ALS automatically from short speech samples.

Wang, 2016 [48] 9 (-) Single-time point - Microphone - F0 features and harmonic-to-noise ratio (HNR) SIT -

Feasible with only acoustic data.

Adding articulatory data improves model performance.

Weismer, 2001 [49] 10 (-) Single-time point - Microphone - Formant frequency measure including F2 slopes; intelligibility and speaking rate (SR). None -

Total utterance length significantly greater for ALS compared to both PD and controls.

Vowel space and F2 slopes taken from either single word or sentence production highly correlated with single word and scaled sentence intelligibility.

Wisler, 2019 [50] 66 (-) 24 months 4 sessions with an interval of 4-6 months Shure Microflex microphone Shure Microflex microphone Mel-frequency cepstral coefficients (MFCC).

ALS-FRS(R)

SIT

- Best RMSE and correlations when acoustic data is combined with lip and tongue data using SVR model.
Yunusova, 2016 [9] 85 (-) Single-time point - Microphone - Speaking rate (SR), articulatory rate (AR) & pause features.

ALS-FRS(R)

SIT

33.53 (-)

Articulation rate able to distinguish bulbar disease from respiratory disease.

CV phase duration can be used for early detection.

ALS, amyotrophic lateral sclerosis; ALS-FRS(R), amyotrophic lateral sclerosis functional rating scale revised; ALS-CBS, amyotrophic lateral sclerosis cognitive behavioural screen; SIT, Speech Intelligibility Testing; VC, vital capacity; FVC, forced vital capacity; SVC, slow vital capacity; F0, fundamental frequency; vF0, fundamental frequency variation; Jitter, frequency perturbation; Shimmer, amplitude perturbation.
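
Jitter and shimmer recur throughout the table as measures of frequency and amplitude perturbation. As an illustration only, the sketch below computes one common "local" (relative) variant of each: the mean absolute difference between consecutive glottal cycles, divided by the mean value. This is an assumption about the definition; the included studies may have used other variants or tool-specific implementations. The function name `local_perturbation` and the input values are hypothetical and not taken from any study.

```python
def local_perturbation(values):
    """Relative local perturbation of a per-cycle measurement.

    Mean absolute difference between consecutive cycles, divided by the
    mean value across all cycles. Applied to cycle periods this gives a
    local jitter; applied to cycle peak amplitudes, a local shimmer.
    """
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


# Made-up example data: four consecutive glottal cycles.
periods_ms = [5.0, 5.1, 4.9, 5.0]   # cycle durations in milliseconds
amplitudes = [1.0, 0.9, 1.1, 1.0]   # peak amplitudes (arbitrary units)

jitter = local_perturbation(periods_ms)    # frequency perturbation
shimmer = local_perturbation(amplitudes)   # amplitude perturbation
print(f"local jitter:  {jitter:.2%}")
print(f"local shimmer: {shimmer:.2%}")
```

Both values are dimensionless ratios, which is why they are often reported as percentages. In practice the per-cycle periods and amplitudes would first have to be extracted from a sustained vowel recording, a step this sketch does not cover.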