Abstract
Purpose:
Oral diadochokinesis is a useful task in the assessment of speech motor function in the context of neurological disease. Remote collection of speech tasks provides a convenient alternative to in-clinic visits, but scoring these assessments can be a laborious process for clinicians. This work describes Wav2DDK, an automated algorithm for estimating the diadochokinetic (DDK) rate on remotely collected audio from healthy participants and participants with amyotrophic lateral sclerosis (ALS).
Method:
Wav2DDK was developed using a corpus of 970 DDK assessments from healthy and ALS speakers where ground truth DDK rates were provided manually by trained annotators. The clinical utility of the algorithm was demonstrated on a corpus of 7,919 assessments collected longitudinally from 26 healthy controls and 82 ALS speakers. Corpora were collected via the participants' own mobile device, and instructions for speech elicitation were provided via a mobile app. DDK rate was estimated by parsing the character transcript from a deep neural network transformer acoustic model trained on healthy and ALS speech.
Results:
Algorithm-estimated DDK rates are highly accurate, achieving a correlation of .98 with manual annotation and an average error of only 0.071 syllables per second. The rate exactly matched ground truth for 83% of files and was within 0.5 syllables per second for 95% of files. Estimated rates achieve a high test–retest reliability (r = .95) and show good correlation with the revised ALS functional rating scale speech subscore (r = .67).
Conclusion:
We demonstrate a system for automated DDK estimation that increases efficiency of calculation beyond manual annotation. Thorough analytical and clinical validation demonstrates that the algorithm is not only highly accurate, but also provides a convenient, clinically relevant metric for tracking longitudinal decline in ALS, serving to promote participation and diversity of participants in clinical research.
Changes in speech are often harbingers of neurodegenerative diseases such as amyotrophic lateral sclerosis (ALS) and Parkinson's disease (Liss et al., 2009; Mundt et al., 2007; Stegmann et al., 2020; Stipancic et al., 2018). The oral diadochokinesis (DDK) speech elicitation task is a maximum performance task in which a speaker is instructed to produce a sequence of syllables in rapid succession, at maximum speed. This task serves to measure articulatory agility, motor speed, and motor coordination (Kent et al., 2022), and is therefore a fundamental task for assessing patients across a wide range of neurological conditions such as multiple sclerosis (MS; Rozenstoks et al., 2020), tardive dyskinesia (Sinha et al., 2015), traumatic brain injury (Wang et al., 2004), Parkinson's disease (Karlsson et al., 2020; Tjaden & Watling, 2003), and ALS (Rong et al., 2015). Maximum performance tasks can also uncover mild impairment or emerging deficits in speech motor control. DDK rate reductions in ALS have been detected prior to decline in speaking rate or intelligibility in early bulbar onset ALS (Rong et al., 2016), and have been used to stratify subjects into slow and fast bulbar disease progressors (Rong et al., 2015).
Neurological deficits are identified by extracting metrics that capture DDK rate and DDK regularity. While DDK rate measures the number of syllables produced over a constant time, DDK regularity measures deviations in temporal regularity of syllable production. Temporal performance measures such as cycle-to-cycle temporal variability and syllable duration variability have been used in addition to DDK rate to identify developing neurological deficits in spastic-flaccid dysarthria present in early ALS (Rong, 2020), to find manifestations of ataxic dysarthria in MS (Rozenstoks et al., 2020), and to characterize hypokinetic dysarthria present in Parkinson's disease (Karlsson et al., 2020). Regularity and rate are extracted either from alternating motion rate (AMR) tasks, measuring a subject's ability to produce single syllable repetitions (such as “pa,” “ta,” “ka,” “la”), or sequential motion rate (SMR), measuring a subject's ability to produce a sequence of syllables (such as the nonword “pataka,” or words such as “buttercup,” “pattycake,” or “buttercake”).
While the literature demonstrates clear value in estimation of DDK for clinical populations, its use for clinical monitoring is limited due to the need for in-clinic administration and the laborious manual scoring process. In-clinic evaluation of DDK may exclude some neurologically impaired patients who may suffer from impaired mobility, lack of transportation, geographical access, and fatigue (Braley et al., 2021). Additionally, the gold standard for DDK assessment relies on manual labeling of audio at the syllable level, which can be a time intensive process for clinicians with finite resources (Kent, 1996). These factors ultimately limit large-scale patient monitoring.
One way to address this lack of access is via development of a robust, smartphone-based speech elicitation task that (a) provides patients a convenient way to submit DDK speech via their own mobile devices remotely and (b) automatically estimates the DDK from the submitted speech. To this end, an assortment of work has addressed the need for automated DDK algorithms and even remote collection of DDK assessments. A number of automated methods use the spectral envelope of the signal in conjunction with signal processing techniques to estimate both DDK rate and regularity (Orozco-Arroyave et al., 2018; Rong, 2020; Smékal et al., 2013). However, these methods tend to be brittle—especially in the presence of background noise—as they require manual setting of amplitude threshold values for proper syllable detection from the time-domain spectral energy curve, and may be prone to error when evaluated on heavily impaired speakers (Tanchip et al., 2022).
In lieu of techniques solely based on signal processing, advances in automatic speech recognition (ASR) and deep learning provide a convenient end-to-end solution. As modern ASR pipelines are trained on a wide range of speech, they are less susceptible to recording variations and background noise (Li et al., 2014). A hidden Markov model and Gaussian mixture model ASR system was developed by Tao et al. (2016) to automate analysis of AMR and SMR on a corpus of 221 recordings from subjects with traumatic brain injury and Parkinson's disease. Wang et al. (2019) and Rozenstoks et al. (2020) both built convolutional neural networks (CNNs) to identify syllable onset/offset in DDK. Rozenstoks et al. (2020) leveraged a Resnet-101 (He et al., 2016) object detector for both AMR and SMR analysis on in-clinic collected data from both healthy subjects and subjects with MS. They fine-tuned the object detector to find syllable location from spectrogram representations of audio, which were labeled at the phoneme level. Using a holdout set of 144 recordings, they found significant differences in DDK between subjects with no perceptible speech impairment and healthy controls. When compared with CNN architectures, long short-term memory (LSTM) networks have been shown to improve on DDK rate estimation performance (Segal et al., 2022). To the best of our knowledge, no existing DDK analysis methods have incorporated recent advances in transformer networks. Pretrained transformer models have been shown to outperform CNN and LSTM models in ASR and natural language processing tasks in few-shot learning regimes and are thus well matched to clinical settings where data is scant and challenging to source (Bansal et al., 2018; Baevski et al., 2020; Devlin et al., 2018).
Analysis of the literature reveals that there is a need for algorithms that can be fine-tuned with limited data (e.g., based on transformer-style networks) and that are thoroughly validated on large-scale data. Ideally, the algorithms are (a) validated on remotely collected speech, (b) validated on data from clinical populations, and (c) evaluated on clinical data at large scale. Most aforementioned methods do not satisfy these criteria. While Wang et al. (2019) collected speech on iOS devices, the algorithm was evaluated on 81 files collected only from healthy speakers. Other previously referenced methods based on deep learning or signal processing relied on in-clinic DDK assessments as opposed to remote smartphone assessments. In all cases, the corpora were limited to a size of a few hundred recordings at most (Novotny et al., 2020; Orozco-Arroyave et al., 2018; Rong, 2020; Rozenstoks et al., 2020; Smékal et al., 2013; Wang et al., 2019).
In contrast, we present Wav2DDK, an automated DDK rate estimation system developed on a corpus of 970 remotely collected DDK assessments using a Wav2Vec2 (Baevski et al., 2020) transformer ASR model. Large scale clinical validation was performed on a carefully curated corpus of nearly 8,000 smartphone-collected DDK recordings from a longitudinal study of both healthy speakers and ALS patients. These subjects exhibited varied articulatory ability across a range of bulbar function. DDK rates were estimated from an 8-s SMR task (repeated productions of the word “buttercup”) elicited via the SpeechVitals mobile app. There were two key aims of this work. First, we developed an accurate DDK rate estimation system for remotely collected speech. Second, a rigorous validation framework modeled after V3 (Goldsack et al., 2020) was applied to perform data verification, analytical validation of accuracy on healthy and ALS speech, and clinical validation via analysis of algorithm outputs relative to clinical variables in ALS.
Data
Data Collection and Task Design
For all DDK assessments, subjects were directed to “repeat the word ‘buttercup’ as many times as you can in eight seconds.” Directions (see Figure 1b) also instructed speakers to “repeat the word as quickly and as clearly as you can.” Based on initially collected data across varying severities of ALS, an 8-s interval ensured that even those with highly impaired speech were given enough time to produce speech. Because “pataka” may not be pronounced as expected without a demonstrative example, which was not present in our self-administered task, we elicited “buttercup.” Like “pa-ta-ka,” “buttercup” requires alternating places of articulation (bilabial–alveolar–velar). Task instruction and data collection were done via the SpeechVitals app; participants submitted the assessments on their own mobile devices for all the corpora collected in this study.
Figure 1.
Recordings were collected via the SpeechVitals mobile app on patients' own mobile devices. Subjects were instructed to keep the device at a distance of 18 in. for the duration of the recording and were provided instructions on how to perform the diadochokinetic assessments.
First, the app instructed subjects on how to hold their device to ensure consistent recording conditions between sessions (see Figure 1a). They were then presented with instructions on how to complete the DDK task (see Figure 1b). Immediately after participants selected “start recording” on the instruction screen, the app transitioned to the screen in Figure 1c and began recording audio while also displaying a countdown of remaining time in the 8-s interval. Recorded speech was saved as a .wav file sampled at 16 kHz and uploaded to cloud storage.
Description of DDK Corpora
A summary of the datasets used in this study is provided in Table 1. The algorithm was developed using DDK Corpus 1, a dataset of 970 DDK assessments from 269 speakers with ALS. This dataset was used to develop the transcript parsing algorithm for rate estimation. DDK Corpus 2 consisted of 884 DDK assessments (101 healthy, 783 ALS) collected from 128 speakers (26 healthy, 102 ALS). This dataset was a blinded holdout set with no speaker overlap with DDK Corpus 1 and was used for analytical validation to measure the accuracy of the algorithm against ground truth DDK rate. Trained annotators labeled the number of full “buttercup” utterances present in each file while excluding partial utterances and stutters (e.g., “-tercup,” “butter-,” “-cup”). The third DDK corpus consisted of data from the ALS-at-Home study, a bring-your-own-device study. This data was used to demonstrate the clinical relevance of the computed DDK metric.
Table 1.
This table describes the contents and function of the speech corpora used in this study, the amount of collected audio in each corpus, and the number of speakers in each corpus.
Dataset | Function | Description | Number of recordings / duration of audio | Number of speakers
---|---|---|---|---
ALS Sentence Reading | Acoustic model fine-tuning | ALS speakers performing a sentence reading task | 340 recordings (38 min) | 98
DDK Corpus 1 | Development set for automated DDK estimation algorithm | ALS speakers performing an 8-s DDK assessment | 970 recordings (129.3 min) | 269
DDK Corpus 2 | Analytical validation (holdout set) | Healthy controls and ALS speakers performing an 8-s DDK assessment | 884 recordings [101 healthy, 783 ALS] (100.5 min) | 128 (26 healthy, 102 ALS)
ALS-at-Home Corpus | Clinical validation (holdout set) | Longitudinally collected corpus of ALS and healthy speakers performing a DDK task | 7,919 recordings [1,886 healthy, 6,033 ALS] (1,056 min) | 108 (26 healthy, 82 ALS)
Note. DDK = diadochokinetic; ALS = amyotrophic lateral sclerosis.
The ALS-at-Home study collected DDK assessments from participants over the course of 1 year. Institutional review board approval was obtained to collect data for this study (Rutkove et al., 2019). A total of 108 speakers (26 healthy, 82 ALS) provided 7,919 DDK assessments (1,886 healthy, 6,033 ALS) from their mobile device. Each of the 108 subjects provided multiple DDK recordings over the course of the study with an average period of 3.05 days between each collected recording. Out of 82 speakers with ALS, 14 had bulbar-onset ALS, and 68 belonged to the non–bulbar-onset group. Bulbar-onset ALS or non–bulbar-onset ALS diagnosis was provided by clinicians and was based on the first site of symptom onset. Earliest symptoms in bulbar onset participants were associated with bulbar function (speech, salivation, or swallowing). Participants also provided self-reports of the revised ALS functional rating scale (ALSFRS-R; Cedarbaum et al., 1999), which measures 12 aspects of physical function encompassing bulbar function, respiratory function, fine-motor control, and gross motor control. Further details on the ALS-at-Home dataset can be found in Table 2.
Table 2.
Sample description of ALS-at-Home study.
Total number of participants | 108 |
Total number of observations (recordings) | 7,919 (1,886 Healthy, 6,033 ALS) |
Number of healthy/number of ALS speakers | 26 Healthy 82 ALS |
Participants with bulbar onset ALS | 14 participants |
Average length of participant enrollment | 168.9 days |
Average frequency of data collection | Every 3.05 days |
Gender for healthy and ALS participants (% female) | Healthy: 73%; ALS: 37% |
Average (standard deviation) age for healthy and ALS participants | ALS: 59.2 (10.8); Healthy: 51.9 (14.6) |
Note. ALS = amyotrophic lateral sclerosis.
Note that there is substantial overlap between the files in the ALS-at-Home study and DDK Corpus 2 (see Supplemental Material S1). Of the 884 recordings, 754 files were selected at random from 77 speakers in the ALS-at-Home dataset. The remaining 130 out of 884 recordings in DDK Corpus 2 did not come from the ALS-at-Home study and were collected from 61 ALS speakers (from the same study as DDK Corpus 1). Of the 754 files from the ALS-at-Home study, 101 were drawn randomly from the 26 healthy speakers in ALS-at-Home, and 653 of the 754 files were drawn from 41 ALS speakers. Breaking these 653 files down by disease severity—as measured by ALSFRS-R speech subscore—93 files had a subscore of 4, 548 files had a subscore of 3, eight files had a subscore of 2, and four had a subscore of 1. Concretely, an ALSFRS-R speech subscore of 4 indicates normal speech production, an ALSFRS-R of 3 indicates detectable speech disturbance, an ALSFRS-R of 2 indicates that speech is intelligible with repeating, an ALSFRS-R of 1 denotes that speech is intelligible when combined with nonvocal communication, and an ALSFRS-R of 0 indicates loss of useful speech. We direct the reader to Supplemental Material S1 for a detailed description of DDK Corpus 2.
To fine-tune our Wav2Vec2 ASR model, we used the ALS Sentence Reading dataset (Table 1). This dataset contained 38 min of audio consisting of 340 recordings from 98 ALS speakers with varying degrees of motor speech impairment. Speakers performed a sentence reading task. For more details on this corpus, including elicited sentences, we direct the reader to Supplemental Material S3. There was no speaker overlap between the sentence reading corpus and the data used to develop or validate the algorithm. For the initial model (prior to fine-tuning), we used a pretrained version of the Wav2Vec2-large transformer ASR model that was trained on the LibriSpeech corpus (Panayotov et al., 2015), an ASR dataset of healthy speakers performing an audiobook reading task.
Data Exclusion and Preprocessing
The data was manually annotated to remove files with high background noise or files without active speech. A total of 161 files from the ALS-at-Home dataset, or roughly 2% of files, contained either no speech or had unacceptable background noise (such as other speakers or loud home appliances). Files were tagged as high background noise based on a subjective criterion: annotators were instructed to mark a file for exclusion if, even after increasing the playback volume and listening closely, the speech was still unintelligible over the background noise. These noisy files were prefiltered from the dataset and did not count toward the total of 7,919 files reported in Table 1 and Table 2. No files were prefiltered from DDK Corpus 1, DDK Corpus 2, or the ALS Sentence Reading corpus. The model operated directly on the .wav files collected from the participants' devices without any preprocessing.
Method
Wav2DDK Model Fine-Tuning
To adapt the model to dysarthric speech and to acoustic conditions present in smartphone collected audio, we fine-tuned a Wav2Vec2-large model on the remotely collected ALS Sentence Reading corpus. Fine-tuning was performed on the sentence reading corpus for 15 epochs using the AdamW optimizer, with a learning rate of 0.0004, batch size of 32, and weight decay of 0.0005. Further details on the ASR model can be found in Supplemental Material S4.
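The sketch below illustrates a fine-tuning loop of this form using the open-source Hugging Face implementation of Wav2Vec2 and the hyperparameters listed above. It is a minimal illustration under stated assumptions, not the training code used in this study: the public facebook/wav2vec2-large-960h checkpoint stands in for the pretrained model, and sentence_reading_pairs is a hypothetical list of (audio path, transcript) pairs for the ALS Sentence Reading corpus.

```python
import torch
import soundfile as sf
from torch.utils.data import DataLoader
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=5e-4)

def collate(batch):
    """Convert (audio_path, transcript) pairs into padded model inputs and CTC labels."""
    audio = [sf.read(path)[0] for path, _ in batch]
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
    labels = processor.tokenizer([text.upper() for _, text in batch],
                                 return_tensors="pt", padding=True).input_ids
    labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the CTC loss
    return inputs.input_values, labels

# Hypothetical placeholder; in practice, list every (wav path, transcript) pair in the corpus.
sentence_reading_pairs = [("als_sentence_001.wav", "example transcript of the elicited sentence")]
loader = DataLoader(sentence_reading_pairs, batch_size=32, shuffle=True, collate_fn=collate)

model.train()
for epoch in range(15):
    for input_values, labels in loader:
        loss = model(input_values=input_values, labels=labels).loss  # CTC loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```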
A DDK Rate Estimation Algorithm From ASR Transcript Parsing
DDK rate was estimated using a three-step algorithm consisting of ASR transcript generation, transcript parsing, and utterance counting. Figure 2 shows a high-level overview of the algorithm. Each file was transcribed using a fine-tuned version of an audio sequence to character sequence ASR system, Wav2Vec2 (Baevski et al., 2020). This end-to-end ASR system accepted raw audio as input and produced a character-level transcript. Unlike traditional ASR, the pipeline did not use a language model, and therefore was not constrained to transcribe utterances to English words. Table 3 demonstrates how changes in speech due to dysarthria (such as nasalization of bilabial consonants, and distortion of velar and alveolar stop consonants), and pronunciation differences due to accent were captured in the transcript. Nonwords were frequently produced by the model. The algorithm exploited this property by first processing the nonwords in the character-level ASR transcript.
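As a concrete illustration of the transcription step, a minimal inference sketch using the open-source Hugging Face implementation is shown below. The checkpoint path "wav2ddk-finetuned" is a hypothetical location for the fine-tuned acoustic model; greedy argmax decoding of the CTC output yields the character-level transcript, with no language model applied.

```python
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# "wav2ddk-finetuned" is a hypothetical path to the fine-tuned acoustic model.
processor = Wav2Vec2Processor.from_pretrained("wav2ddk-finetuned")
model = Wav2Vec2ForCTC.from_pretrained("wav2ddk-finetuned").eval()

def transcribe(wav_path: str) -> str:
    """Return the character-level transcript for one 16-kHz DDK recording."""
    audio, sampling_rate = sf.read(wav_path)
    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits   # (1, frames, vocabulary size)
    pred_ids = torch.argmax(logits, dim=-1)          # greedy CTC decoding
    return processor.batch_decode(pred_ids)[0]       # e.g., "BUTERCUP BUTERCUP ..."
```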
Figure 2.
Overview of the proposed system for estimating diadochokinetic (DDK) rate. The two main components of this system are a sequence-to-sequence ASR acoustic model for generating transcripts from audio, and a transcript processing step to estimate DDK rate. A count of the character “C” is used in cases with no or mild motor speech impairment. In cases with moderate impairment, “B” can be used instead. For markedly unintelligible speech, a count of the number of vowels suffices.
Table 3.
A comparison of output transcripts from traditional ASR systems and our sequence-to-sequence system for six recordings from a healthy speaker, a speaker with moderate motor speech impairment, and two speakers with severe motor speech impairment. The character level output of our acoustic model successfully reproduces pronunciation distortions present in the original speech.
Sample description | Ground truth buttercup count | Google Cloud ASR transcript | Wav2DDK character-level transcript |
---|---|---|---|
Healthy speaker | 19 | BUTTERCUP BUTTERCUP BUTTERCUP BUTTERCUP BUTTERCUP BUTTERCUP BUTTERCUP BUTTERCUP BUTTERCUP BUTTERCUP | BUTERCUP BUTERCUP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BTERCUP BUTE |
Mild dysarthria | 15 | BUTTERCUP BUTTERCUP BUTTERCUP BUTTERCUP BUTTERCUP BUTTERCUP BUTTERCUP BUTTERCUP BUTTERCUP BUTTERCUP BUTTER UP | PBUTTERGOUP BUTTERGUP PUTTERGUP BUTTERGUP BUTTERGUP BUTTERGUP PUTTERCUP PUTTERGUP BUTTERGUP PUTTERCUP BUTERGUP UTTERGUP BUTTERCUP PUTTERGUP PBUTTERGUP PU |
Moderate dysarthria | 10 | BLUNDER OF BURNER OF BURNER OF BURNER UNDER | BOTOF BATOF BA CAOF BOTOF BANOF BARNOF BANUM MOT OF BANOF BANOF BANOF |
Severe dysarthria | 7 | MINER MINER | MUTTER GOT BUTTER GOT MUTER GOT MUTER OTMUTER GUT MUTER GOT MUTER GOUT |
Severe dysarthria | 5 | MONROE | MINOR OP BINDER OM MINOR OM MINORO MINOR OP |
In the first step, the algorithm passed the raw audio through the ASR model to extract the transcript. For healthy speakers, “buttercup” was correctly transcribed in most cases. Thus, the DDK rate could be estimated from transcripts with minimal processing. Transcripts from speakers with dysarthria or speakers with accents mostly contained nonwords and required processing. In the initial processing step, we counted the number of instances of the letter “c.” The rationale for this is as follows. Speech impairments due to dysarthria impacted many characters in the ASR transcript (see Table 3). Some examples from ALS speakers with spastic-flaccid dysarthria include character sequences such as “buttergup,” “muttergup,” “butterhum,” and so forth. However, the velar consonant /k/ tended to be preserved in transcripts as the letter “c” or “g.” The count of the letter “c” was used as an initial estimate of the number of “buttercup” utterances spoken. DDK rate in syllables per second was computed by multiplying the number of identified "buttercup" utterances by three (“buttercup” contains three syllables) and normalizing by the 8-s length of the audio. Table 4 provides examples of how pronunciation differences manifested as non-words in the transcripts, how transcripts were processed, and examples of character counting.
Table 4.
Examples of the transcript processing algorithm applied to transcribed recordings from six speakers with varying levels of articulatory ability.
Sample description | Ground truth buttercup count | Wav2DDK character-level transcript | Processed transcript | Estimated buttercup count |
---|---|---|---|---|
Healthy speaker | 19 | BUTERCUP BUTERCUP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BUTERCOP BTERCUP BUTE | BUTERCU BUTERCU BUTERCO BUTERCO BUTERCO BUTERCO BUTERCO BUTERCO BUTERCO BUTERCO BUTERCO BUTERCO BUTERCO BUTERCO BUTERCO BUTERCO BUTERCO BUTERCO BTERCU | 19 |
Mild dysarthria | 15 | PBUTTERGOUP BUTTERGUP PUTTERGUP BUTTERGUP BUTTERGUP BUTTERGUP PUTTERCUP PUTTERGUP BUTTERGUP PUTTERCUP BUTERGUP UTTERGUP BUTTERCUP PUTTERGUP PBUTTERGUP PU | BUTERCOU BUTERCU PUTERCU BUTERCU BUTERCU BUTERCU PUTERCU PUTERCU BUTERCU PUTERCU BUTERCU UTERCU BUTERCU PUTERCU BUTERCUP | 15 |
Moderate dysarthria | 10 | BOTOF BATOF BA CAOF BOTOF BANOF BARNOF BANUM MOT OF BANOF BANOF BANO | BOTOF BATOF BA CAOF BOTOF BANOF BARNOF BANU BOT OF BANOF BANOF BANO | 10 |
Severe dysarthria | 7 | MUTTER GOT BUTTER GOT MUTER GOT MUTER OTMUTER GUT MUTER GOT MUTER GOUT | MUTER COT BUTER COT MUTER COT MUTER OTMUTER CUT MUTER COT MUTER COUT | 6 |
Severe dysarthria | 5 | MINOR OP BINDER OM MINOR OM MINORO MINOR OP | BINOR OP BINDER OM MINOR OM BINORO BINOR OP | 5 |
Note. Bolded letters indicate the landmark characters used to detect the count of the number of “buttercup” utterances in service of diadochokinetic (DDK) rate estimation.
This initial estimate was then refined in three ways. First, we substituted “g” with “c” to accommodate transcripts like “buttergup.” Speakers with dysarthria may have trouble controlling voicing and misarticulate the voiceless velar stop, /k/, as /g/ (Antolik & Fougeron, 2013). When the misarticulated DDK utterances are processed by the ASR model, the ASR transcript reflects this misarticulation and resolves each utterance to “buttergup” rather than “buttercup.” Since the algorithm relies on a count of the number of “c” in the ASR transcript as a proxy for the number of repeated DDK utterances, the “g” to “c” character substitution allowed for accurate counting of the number of “buttercup” repetitions in the file and improved the estimate of DDK rate for speakers with mild speech motor impairment. Second, for speakers with severely impaired articulation who were unable to attain velar closure, transcripts did not contain “c” or “g” (e.g., transcripts may contain repetitions of “butterham” or “mutterham”). In these cases, the algorithm used the character “b” as a landmark for calculating “buttercup” count. Nasalized bilabial consonants often occurred in the transcript, so occurrences of “m” were substituted with “b.” Third, for the small number of severely impaired speakers who could not accurately articulate bilabial consonants or velar consonants, characters corresponding to either of these phonemes (“b” and “m” for /b/; “k” and “g” for /k/) were not present in the transcript. A crude estimate of the “buttercup” count was obtained for these speakers by counting the number of vowels present in the transcript and dividing by 3 (as there are three vowels in “buttercup”). Partial utterances that occurred at the beginning or end of the file (e.g., “-tercup” or “butter-”) were removed by ignoring “c” landmarks that were detected within a five-character window of the beginning of the transcript, and ignoring “b” landmarks that occurred within a five-character window of the end of the transcript (see Table 4, row 3).
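The sketch below illustrates the parsing heuristic described above. Because the selection among the “c,” “b,” and vowel landmarks is only partially specified here, the code should be read as an approximation of the stated rules rather than the exact production implementation.

```python
EDGE = 5           # five-character window used to discard partial utterances
VOWELS = "aeiou"

def estimate_ddk_rate(transcript: str, duration_s: float = 8.0) -> float:
    """Estimate DDK rate (syllables/second) from a character-level ASR transcript."""
    t = transcript.lower()
    t = t.replace("g", "c").replace("m", "b")   # voicing and nasalization substitutions

    # Primary landmark: velar "c"; ignore leading partials such as "-tercup".
    count = sum(1 for i, ch in enumerate(t) if ch == "c" and i >= EDGE)
    if count == 0:
        # Fallback landmark: bilabial "b"; ignore trailing partials such as "butter-".
        count = sum(1 for i, ch in enumerate(t) if ch == "b" and i < len(t) - EDGE)
    if count == 0:
        # Last resort: three vowels per "buttercup".
        count = sum(ch in VOWELS for ch in t) // 3

    return 3.0 * count / duration_s             # three syllables per "buttercup"

# Worked example: 19 repetitions counted in an 8-s recording
# gives 3 * 19 / 8 = 7.125 syllables per second.
```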
Statistical Analyses
Statistical analyses were performed in accordance with V3 (Goldsack et al., 2020), a thorough framework for establishing the efficacy of biometric monitoring technologies used in digital medicine. The framework contains three steps: (a) verification, (b) analytical validation, and (c) clinical validation. Verification ensures that sensors are capturing data properly, and that the recorded data is of high quality. We performed data verification by automatically removing files that contained transcripts of zero length. Even after manual exclusion of quiet and noisy files, some recordings in the ALS-at-Home data were found to contain no speech. All of the 183 files with no speech were correctly removed via this method. Listening to recordings with no transcripts revealed that the participant simply chose not to perform the task, as the background environment was clearly audible.
Analytical Validation
For analytical validation, the metric value produced by the algorithm is evaluated against a reference standard for a specified population. We performed analytical validation against ground truth DDK rate from trained manual annotators on DDK Corpus 2, the dataset of 884 DDK assessments with healthy and ALS speakers. Both Kendall τβ rank correlation and Pearson correlation between the automatically estimated rate and the ground truth rate were calculated. To characterize error with respect to ground truth, average error in syllables per second was calculated, and a cumulative distribution plot of the error in DDK rate was generated. The cumulative distribution plot measured the percent of files in the analytical validation dataset that achieved DDK rate error below a given threshold.
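A sketch of these agreement statistics is given below, assuming two aligned arrays of manually annotated and algorithm-estimated DDK rates (in syllables per second) for the validation files; the variable names are illustrative only.

```python
import numpy as np
from scipy import stats

def analytical_validation(rate_true, rate_est):
    """Pearson r, Kendall tau-b, mean absolute error, and cumulative error distribution."""
    rate_true, rate_est = np.asarray(rate_true), np.asarray(rate_est)
    abs_err = np.abs(rate_true - rate_est)
    pearson_r, _ = stats.pearsonr(rate_true, rate_est)
    kendall_tau, _ = stats.kendalltau(rate_true, rate_est)  # tau-b variant handles ties
    mae = abs_err.mean()
    # Fraction of files whose DDK rate error falls at or below each threshold.
    thresholds = np.arange(0.0, 2.05, 0.05)
    cumulative = [(abs_err <= t).mean() for t in thresholds]
    return pearson_r, kendall_tau, mae, thresholds, cumulative
```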
Error as a function of disease severity. In DDK Corpus 2, ground truth DDK rate was manually annotated for 754 files from the ALS-at-Home study. Mean absolute error in DDK rate and Kendall τβ correlation between estimated and annotated rates were computed for healthy files and files associated with ALSFRS-R subscores of 3 and 4. As labels were available for only 12 files total for ALSFRS-R subscores of 0, 1, and 2, calculation of mean absolute error was omitted for these subgroups.
Clinical Validation
For clinical validation of the automated DDK system, we used data from the ALS-at-Home study to affirm that the metric predicted meaningful clinical, biological, physical, or functional states in subjects.
Reliability of DDK rate. As neurological function is expected to be consistent over short-term windows (over a period of a few days), DDK rate is expected to be similarly stable between two successively collected recordings from a participant. Test–retest reliability was measured by computing the Pearson correlation between the estimated DDK rate at time = T and the DDK rate from the next assessment at time = T + 1 (across all values of T). The average time between assessments throughout the course of the ALS-at-Home study across all 108 participants was 3.05 days. Test–retest reliability was calculated across all 7,919 files.
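A sketch of this lag-1 computation is shown below, assuming a pandas DataFrame with one row per recording and hypothetical columns participant_id, date, and ddk_rate.

```python
import pandas as pd
from scipy.stats import pearsonr

def test_retest_reliability(df: pd.DataFrame) -> float:
    """Correlate each assessment's DDK rate with the same participant's next assessment."""
    df = df.sort_values(["participant_id", "date"]).copy()
    # Pair the rate at time = T with the rate at time = T + 1 within each participant.
    df["next_rate"] = df.groupby("participant_id")["ddk_rate"].shift(-1)
    pairs = df.dropna(subset=["next_rate"])
    r, _ = pearsonr(pairs["ddk_rate"], pairs["next_rate"])
    return r
```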
Comparison of DDK rate and gold-standard clinical measures. In addition to DDK assessments, subjects provided self reports of ALSFRS-R functional rating scales. ALSFRS-R, the clinical-standard measure of patient function and disease progression, is used to provide clinical care and is used to evaluate efficacy of prospective therapeutics (Cedarbaum et al., 1999). To validate DDK as a measure of speech function and bulbar impairment, we computed the Pearson correlation between the ALSFRS-R speech subscore and the DDK rate.
DDK deficits in ALS speakers and across bulbar function. Two analyses were conducted to uncover group-level and longitudinal differences in DDK between cohorts. The first analysis sought to understand if speakers with minimal or imperceptible speech impairment had deficits in DDK rate. A subset of 12 out of 68 non–bulbar-onset ALS speakers in the ALS-at-Home study reported normal bulbar function as indicated by ALSFRS-R subscores of 4 each for speech, swallowing, and salivation. These speakers were termed the bulbar non-impaired cohort. The bulbar non-impaired cohort was compared to the healthy cohort using a linear mixed-effects model with random intercepts that was fitted to the DDK rate. The fixed effect in this model was the cohort (healthy or bulbar non-impaired), and participant identity was the random effect. Effect size with respect to the fixed effect (bulbar non-impaired vs. healthy) was used to quantify imperceptible DDK rate deficits in ALS participants who reported otherwise normal bulbar and speech function.
Second, we performed a longitudinal analysis to quantify practice effects by calculating the average per-cohort DDK rate trajectory. Practice effects or familiarization effects are improvements in task performance due to repetition. Practice effects are prominent in repetitive neuropsychological testing—especially in healthy subjects—and have been observed for rate when subjects undergo repeated rounds of DDK testing (Ben-David & Icht, 2018). Ability to accrue practice effects from repeating a task is an indication that an individual has the physiological (and/or cognitive) capacity to improve performance. Three linear mixed-effects growth curve models with random slopes and intercepts were fit to the healthy, non–bulbar-onset ALS, and bulbar-onset ALS groups, respectively, to estimate the average DDK trajectory of each group. Then, the three slopes obtained from the average DDK rate trajectories were compared to quantify the average per-cohort level of DDK task familiarization throughout the course of the study.
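A sketch of the per-cohort growth-curve fit is given below using the statsmodels mixed-effects API, assuming a DataFrame with hypothetical columns ddk_rate, days (days since enrollment), participant_id, and cohort. The random-intercept group comparison described in the previous paragraph follows the same pattern, with cohort entered as the fixed effect instead of days.

```python
import statsmodels.formula.api as smf

def cohort_trajectory_slope(df, cohort_name):
    """Fit a growth curve with per-participant random slopes and intercepts for one cohort."""
    sub = df[df["cohort"] == cohort_name]
    model = smf.mixedlm("ddk_rate ~ days", data=sub,
                        groups=sub["participant_id"], re_formula="~days")
    result = model.fit()
    return result.params["days"]   # average change in syllables/second per day

# ddk_df is a hypothetical DataFrame of all longitudinal DDK estimates; cohort labels are illustrative.
slopes = {cohort: cohort_trajectory_slope(ddk_df, cohort)
          for cohort in ["healthy", "non_bulbar_als", "bulbar_als"]}
```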
Classification using Wav2DDK features. We explored the use of hidden layer features from the Wav2DDK model to predict binary clinical outcomes and to visualize different strata in the data (see Supplemental Material S5). Neural networks learn rich, informative feature representations and can be used as feature extractors for downstream use cases (Notley & Magdon-Ismail, 2018). This approach has been used for a variety of tasks such as second language speech scoring, speaker classification, sentiment analysis, and speech translation (Shah et al., 2021).
From each recording, the model produces a 1,024-dimensional vector for each 25-ms frame of speech. A single 1,024-dimensional vector was obtained for each recording by averaging across per-frame features. We assessed the ability of the extracted features to classify participants across clinical subgroups. Using a logistic regression classifier with 10-fold cross-validation, we computed the area under the curve (AUC) for the receiver operating characteristic (ROC) curve when classifying between healthy versus bulbar-onset ALS, healthy versus non–bulbar-onset ALS, healthy versus all ALS, and non–bulbar-onset versus bulbar-onset ALS. Classification analyses were performed using only the first recording from each speaker.
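The sketch below illustrates both the mean pooling of the hidden-layer features and the cross-validated classification. It assumes the fine-tuned Wav2Vec2 model and processor from the earlier sketches, plus arrays X (one pooled 1,024-dimensional vector per speaker's first recording) and y (binary cohort labels); these variable names are illustrative.

```python
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import roc_auc_score

def mean_pooled_features(model, processor, audio, sampling_rate=16000):
    """Average the transformer's final hidden states over frames into one 1,024-d vector."""
    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values, output_hidden_states=True).hidden_states[-1]
    return hidden.squeeze(0).mean(dim=0).numpy()

def cohort_auc(X, y, seed=0):
    """10-fold cross-validated ROC AUC for a logistic regression classifier."""
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    classifier = LogisticRegression(max_iter=1000)
    probabilities = cross_val_predict(classifier, X, y, cv=folds, method="predict_proba")[:, 1]
    return roc_auc_score(y, probabilities)
```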
Results
Analytical Validation of Wav2DDK
Figure 3a shows that on the holdout test set of 884 recordings, DDK Corpus 2, the algorithm estimated rate had a high correlation (r = .98, Kendall τβ = .97) with the manually annotated DDK rate. Average error was 0.071 syllables per second. To understand performance at varying levels of acceptable error in DDK rate, the cumulative distribution plot in Figure 3b shows that the estimated rate exactly matched the human annotated rate for 83% of files and was within 0.5 syllables per second of the human annotated rate for 96% of the files. When DDK rates were estimated from transcripts generated using the baseline Wav2Vec2 ASR model—without fine tuning on the ALS sentences corpus—the estimated rate matched ground truth for 71% of files and 88% of files were within 0.5 syllables per second of ground truth.
Figure 3.
(a) Ground truth (manual count) versus estimated diadochokinetic (DDK) rate calculated on a holdout set of 884 recordings of healthy and ALS speakers. The algorithm achieved high accuracy (Pearson correlation of .98 and Kendall correlation of .97 between manual and automated DDK rate) and an average error of only 0.071 syllables per second. The plot shows a probability mass function of the ground truth DDK rate versus the estimated DDK rate. The color gradient represents the value of the probability mass function. Red and green regions indicate that a larger fraction of the data lies there than in purple regions. Notice that a majority of the data lie along the y = x line. The two groups of data lying on either side of the y = x line correspond to the algorithm counting one more or one fewer “buttercup” than the ground truth. (b) The cumulative DDK rate error distribution of the algorithm is compared against a version of the algorithm using a baseline Wav2Vec2 model that has not been fine-tuned on the ALS Sentence Reading Corpus.
Error as a function of disease severity. The mean absolute error (MAE) between estimated DDK rate and manually annotated rate was also compared across disease severity (see Table 5). For this comparison, the subset of manually annotated files in DDK Corpus 2 sourced from the ALS-at-Home study was used. The labeled files were from three subgroups: healthy controls, files where subjects reported an ALSFRS-R speech subscore of 4, and files where subjects reported an ALSFRS-R speech subscore of 3. MAE across healthy files, files with a speech subscore of 4, and files with a speech subscore of 3 was 0.12 syllables per second, 0.04 syllables per second, and 0.10 syllables per second respectively.
Table 5.
Accuracy by disease severity.
Subset | Number labeled files | Kendall τβ | DDK rate mean absolute error (std) | Mean estimated DDK rate (syllables per second) | Mean ground truth DDK rate (syllables per second) |
---|---|---|---|---|---|
Healthy | 101 | 0.92 | 0.12 (0.25) | 7.19 | 7.19 |
ALSFRS-R = 4 | 93 | 0.98 | 0.04 (0.12) | 7.18 | 7.18 |
ALSFRS-R = 3 | 548 | 0.97 | 0.10 (0.26) | 5.84 | 5.9 |
Note. Comparative analysis of error was performed for a set of labeled DDK assessments from ALS-at-Home for healthy speakers, assessments with ALSFRS-R = 4, and assessments with ALSFRS-R = 3. High Kendall τβ was observed between estimated rate and ground truth rate for all three cases. The mean absolute errors for each group were similar, as the errors were within one standard deviation of each other. DDK = diadochokinetic.
Clinical Validation
Test–Retest Reliability
The longitudinally collected audio from the ALS-at-Home study was used to assess test–retest reliability. Figure 4 demonstrates that the algorithm achieves a high test–retest reliability of .95 by plotting the DDK rate at consecutive times (time = T vs. time = T + 1) and measuring the correlation between the estimated rates for consecutive recordings. Average time between assessments was 3.05 days. Test–retest reliability was .85 across all healthy speakers and .95 for all ALS speakers. Fine-tuning of the Wav2Vec2 model significantly increased reliability from .63 to .85 for healthy speakers and from .88 to .95 for speakers with ALS (see Table 6).
Figure 4.
Test–retest reliability of the algorithm is measured by plotting a probability mass function of the estimated DDK rate for a participant at time = T on the x-axis against the estimated rate at time = T + 1. The color gradient denotes the probability mass (i.e., a larger portion of the data lies in darker shaded regions). As DDK was longitudinally collected, each (x, y) pair used to generate the probability mass function consists of the estimated rate from an assessment versus the estimated rate at the next assessment (for the same participant). The probability mass function shown is generated using data from all participants. Since short-term day-to-day or week-to-week changes in DDK rates are unlikely, a DDK measurement at time = T should be similar to a measurement at time = T + 1, and most of the data should lie along the y = x line. Indeed, as indicated by the darker shading along y = x, most of the data lies along this line, and we measure a high test–retest reliability of .95.
Table 6.
Analytical and clinical validation metrics for Wav2DDK.
Analytical validation

Dataset | Files | Speakers | Kendall's correlation (no fine-tuning) | Kendall's correlation (fine-tuned model)
---|---|---|---|---
DDK Corpus 2 | 884 (101 healthy, 783 ALS) | 128 (26 healthy, 102 ALS) | 0.91 | 0.97

Clinical validation (ALS-at-Home Corpus)

Cohort | Files | Speakers | Test–retest reliability (no fine-tuning) | Test–retest reliability (fine-tuned model)
---|---|---|---|---
Healthy speakers | 1,886 | 26 | 0.63 | 0.85
ALS speakers | 6,033 | 78 | 0.88 | 0.95
Note. For analytical validation, Kendall τβ correlation of estimated rate with manually annotated rate is calculated. Test–retest reliability is calculated on the longitudinally collected DDK data for both healthy and ALS speakers to establish the clinical repeatability of the measure. DDK = diadochokinetic; ALS = amyotrophic lateral sclerosis.
Correlation With Gold-Standard Clinical Variables
We evaluated the relationship of DDK rate to the ALSFRS-R speech subscore to establish the clinical validity of extracted rates (see Figure 5). A positive correlation of r = .67 was observed between the algorithm estimated DDK rates and the functional rating of speech ability. A marked drop in DDK rate exists between speakers with an ALSFRS-R speech subscore of 4 (normal speech processes) and those with ALSFRS-R of 3 (detectable speech disturbance). Progression from ALSFRS-R of 2 (intelligible with repeating) to an ALSFRS-R of 1 (speech combined with nonvocal communication) or an ALSFRS-R of 0 (loss of useful speech) was accompanied by drastic decline in DDK rate.
Figure 5.
Algorithm estimated diadochokinetic (DDK) rates were correlated with the revised amyotrophic lateral sclerosis functional rating scale (ALSFRS-R) speech subscore (r = 0.67) to demonstrate the clinical relevance of the algorithm estimated rates. The ALSFRS-R is the clinical gold standard for measuring patient outcomes in patients with ALS. The ALSFRS-R speech scores range from 0 (loss of useful speech) to 4 (normal speech process).
DDK Rate Measurements for Uncovering Mild Impairment
Analysis of the intercept of a linear mixed-effects model with participant identity as the random effect and the cohort (bulbar non-impaired or healthy) as a fixed effect was used to compute the effect size across cohorts. DDK rates were 0.3 syllables per second higher (standard error = 0.2 syllables per second, p = .07) for healthy controls than for bulbar non-impaired subjects.
Longitudinal Trajectory of DDK Rate
A linear mixed-effects model with random slopes and intercepts was fitted to each clinical cohort in the ALS-at-Home study to obtain average DDK trajectories for each cohort. Comparison of the slopes revealed that healthy subjects manifested the largest practice effect: performance increased at an average rate of 0.21 syllables per second per month. Non-bulbar ALS subjects improved at a slower rate of 0.135 syllables per second per month (see Figure 6). No familiarization effect was observed for the bulbar cohort, suggesting insufficient physiological reserve to accrue practice benefits. The difference in practice effect slope between the bulbar and non-bulbar groups and the difference in practice effect between the bulbar group and healthy controls were both found to be significant (p < .001 for both comparisons). However, the difference in practice effect between the non-bulbar group and healthy controls was not significantly different from zero (p = .073).
Figure 6.
Estimated DDK rate as a function of number of days since study enrollment. The thin lines show rate trajectories for each participant. Green, blue, and red denote cohort membership of healthy, non-bulbar ALS, and bulbar-onset ALS respectively. The bolded lines show the average trajectory for each cohort and were generated by fitting a linear mixed-effects growth curve model with random slopes and intercepts for each cohort. The healthy control group showed the strongest familiarization effect, with an average DDK rate increase of 0.21 syllables/second per month. Non-bulbar ALS subjects improved at a slower rate of 0.135 syllables/second per month. The bulbar onset group showed no familiarization effect.
Clinical Value of Intermediate Features for Pathological Speech Classification
Feature vectors extracted using Wav2Vec2 were used to train a logistic regression classifier. Instead of all recordings, only features extracted from the first recording from each participant were used to ensure that the samples were independent and identically distributed. Supplemental Material S2 demonstrates the utility of the intermediate features in stratifying the data for the classification task. Classification performance between cohorts was evaluated on the ALS-at-Home dataset. Participants were classified as “bulbar” or “non-bulbar” based on location of symptom onset; the earliest symptoms in bulbar onset participants were associated with bulbar function (speech, salivation, or swallowing). We observed high classification performance between the healthy and bulbar groups (ROC AUC = .99), and good classification performance between the healthy group and the combined bulbar and non-bulbar (all ALS) group (ROC AUC = .78) and between the bulbar and non-bulbar groups (ROC AUC = .81). The lowest AUC of .72 was observed for classifying between healthy and non-bulbar subjects.
Discussion
Sequential and/or AMR tasks (DDKs) are commonly elicited during clinical evaluations of speech to characterize the speed, clarity, and rhythm of the syllables, and normative values have been collected across age, gender, and languages (Ben-David & Icht, 2016; Choe & Han, 1998; Fletcher, 1972; Icht & Ben-David, 2014; Prathanee et al., 2003; Tafiadis et al., 2021, 2022). DDK rates can be measured by using a stopwatch and counting the number of syllables produced during some duration, and then dividing that number by the duration of that speech string. This can be done during a clinical session or on prerecorded audio after the session has ended. The proposed solution, Wav2DDK, eschews the need for in-clinic data collection and manual measurements. Automated and accurate DDK calculation thereby greatly streamlines the clinical workflow for this task.
Word stimuli such as “pattycake,” “buttercup,” or “buttercake” have been found to produce faster syllable rates than nonword stimuli; this effect has been observed across ages and even across languages (Ben-David & Icht, 2016; Tafiadis et al., 2022; Zamani et al., 2017). Icht and Ben-David (2014) conducted a review of DDK norms collected for English, Greek, Portuguese, Hebrew, and Farsi and found the need to set language- and culture-specific norms. Similarly, we argue that by collecting word specific norms for the stimulus “buttercup,” the validity of the DDK can be preserved. Due to similarity in articulatory patterns between “buttercup” and “pataka,” a similar decline in rate is expected with disease progression. Clinical use of this tool will require normative data collection, a process that is ongoing.
Although several methods for automated DDK analysis exist in the literature, these methods can struggle with dysarthric speech. Indeed, Tanchip et al. (2022) estimated DDK rate using a basket of signal processing and energy-based DDK algorithms on a dataset of dysarthric speakers. Across existing methods, they found best-case Kendall τβ correlations of .76, .87, and .7 between manually labeled and algorithm-estimated DDK rates for mild, moderate, and severe dysarthria, respectively. Our method achieved Kendall τβ correlations of .92, .98, and .97 between manual rate and estimated rate for healthy subjects, subjects with ALSFRS-R of 4, and subjects with an ALSFRS-R of 3, respectively. While it is unexpected that the healthy controls had a lower Kendall τβ correlation and higher MAE when compared with the group of speakers with ALSFRS-R speech subscores of 3 or 4, the standard deviation of the MAE for each subgroup (see Table 5) reveals that the MAEs are within 1 SD of each other.
On the holdout set of manually annotated recordings, our transformer-ASR-based algorithm achieved a Pearson correlation of .98 and a Kendall τβ correlation of .97 with respect to manual annotation. Our results are consistent with reported accuracies from other deep-learning-based systems (Rozenstoks et al., 2020; Segal et al., 2022). However, the existing studies use small, in-clinic collected datasets (Segal et al., 2022, used 19 healthy controls and five speakers with Parkinson's disease). Generalization performance for these methods is not well characterized on large-scale data. This is in contrast to our work where we validated the model on a large, remotely collected, longitudinal dataset. Longitudinally collected data allowed us to perform additional clinical validation via evaluation of the test–retest reliability. We observed a test–retest reliability of .85 across all healthy speakers and .95 across all ALS speakers. This higher test–retest reliability observed for ALS speakers compared to healthy controls is a consequence of DDK rate having a much larger dynamic range for ALS speech.
Pretrained transformers have been shown to have improved out-of-distribution generalization (Hendrycks et al., 2020) compared to CNNs and LSTMs. Existing deep learning DDK analysis methods that train networks from scratch (Novotny et al., 2020; Segal et al., 2022) do not leverage recent developments in transformer neural networks for speech tasks (Baevski et al., 2020). The excellent few-shot performance of large pretrained models (Bansal et al., 2018) is a key strength of our method. Baevski et al. (2020) showed that networks pretrained on thousands of hours of audio require less data than networks trained from scratch to achieve similar performance on downstream tasks. Fine-tuning improved model performance for recording conditions present in remotely collected speech and therefore not only increased accuracy in predicting DDK rate, but also improved the test–retest reliability (see Table 6). Thus, we argue that Wav2DDK likely reduces the amount of training data required to be collected to adapt a DDK estimation system to other diseases or populations.
Clinical utility of Wav2DDK. The estimated DDK rate outcome measure and ancillary neural network features demonstrated high diagnostic utility. Hidden layer features extracted from Wav2DDK were able to classify between healthy controls and participants with disease. Additionally, automated processing of frequent at-home assessments allowed for longitudinal analysis of DDK rate trajectories per cohort. The average trajectory of subjects with bulbar-onset ALS showed that they were unable to manifest improvement in DDK rate across the course of the study. However, a practice effect was observed for non–bulbar-onset ALS speakers and healthy controls; the DDK rate for healthy controls increased 56% faster compared to non–bulbar-onset ALS speakers. A rate decline of 0.3 syllables per second was also observed relative to healthy controls in bulbar non-impaired speakers, a subset of non–bulbar-onset ALS participants whose clinical ratings indicated no bulbar impairment. One of the main advantages of automated, remote measurements is the ease with which frequent measurements can be collected. There is promise for frequent speech assessments that measure speaking rate or DDK to detect early changes (Rong et al., 2015; Stegmann et al., 2020). A necessary first step is a reliable method to estimate DDK. Observation of decline in subjects with disease, but no detectable impairment, can serve as an impetus for clinical intervention before onset of further impairment.
Study limitations and future work. Metrics describing temporal regularity extracted from DDK, such as cycle-to-cycle temporal variability, provide high diagnostic accuracy in classification between healthy and ALS groups, often with higher accuracy than syllable rates alone (Rong et al., 2015). Most automatic DDK methods are able to calculate DDK regularity as they identify the temporal position of syllables in the audio, but extracting timing from DDK presents a challenge to our transcript based DDK analysis system. Wav2DDK is based on a Wav2Vec2 ASR model. Wav2Vec2 produces transcripts using a 25-ms analysis window with a 20-ms frame step. Each analysis window is classified into one of the 26 characters in English. At first glance, it seems that timing information on the onset of each “buttercup” utterance could be directly extracted by processing the frames corresponding to each landmark character used for count estimation such as “b” or “c.” However, the model was trained with a connectionist temporal classification loss (Graves et al., 2006) on an ASR task—the training objective maximized the probability of the correct character sequence without regard to timing. Thus, extracting utterance onset and offset using the character sequence is unreliable. Zhu et al. (2022) showed that if phoneme boundaries are provided by manual annotations, Wav2Vec2 could predict phoneme locations with state-of-the-art accuracy in a forced alignment task. In this study, annotators were directed to mark only the total “buttercup” count, but not the locations of the onset and offset of each “buttercup” or the onset and offset of syllables, so the collected data was not conducive to training Wav2DDK to predict syllable location. Future collection of DDKs manually labeled at the phoneme or syllable level in conjunction with model retraining would allow Wav2DDK to generate DDK regularity measures.
Declines in articulatory precision and speaking rate have been detected from remotely collected speech early in the disease trajectory (Stegmann et al., 2020). A Wav2DDK variant capable of identifying phoneme locations could be used to generate articulatory precision scores based on DDK assessments. Though the character substitutions present in the Wav2DDK transcripts mirror the articulatory patterns in dysarthric speakers (see Table 3), a lack of time alignment between transcript and audio, and absence of accurate duration information for each character would result in inaccurate measurements of articulatory precision.
An additional limitation of the study was that duration of active speech was not used when calculating the DDK rate. Syllable counts were normalized by 8 s for all recordings regardless of time of speech onset. It is possible that speech onset time or a delay in speech onset could prove to be a useful feature for measurement of impairment. A future iteration of Wav2DDK could investigate this by using voice activity detection to find the total duration of active speech and normalizing by active speech duration.
Conclusions
In this work, we proposed Wav2DDK, a method for automated labeling of DDK rate from remotely collected DDK assessments. A common problem among features derived from high-dimensional performance-based signals such as speech is that algorithm outputs fail to apply in new situations (Berisha et al., 2021). We guard against failure to generalize in three ways. First, we constrain variability in the signal to be analyzed by using a well-understood task (DDK), already in use in the clinic. Second, we produce an interpretable measure, DDK rate, that has direct clinical meaning. Third, we apply a rigorous framework modeled after V3 (Goldsack et al., 2020) on the largest corpus used to date in automatic DDK analysis studies (to the best of our knowledge). We show that estimated DDK rates are accurate, reliable, and clinically relevant. This process involves using independent datasets for development of the metric and validation of the metric, measuring accuracy of the metric for speakers who exhibit a wide range of DDK rates and articulatory function, measuring repeatability of the metric on clinically relevant populations, demonstrating construct validity through correlation with gold-standard clinical scales, and demonstrating that the metric follows disease progression longitudinally in populations of interest. Taken together, these steps help to ensure that Wav2DDK will provide accurate, clinically useful measurements of articulatory function in other studies.
The algorithm decreases burden on patients and clinicians; patients can provide speech remotely without the need for in-clinic visits, while clinicians no longer need to manually annotate each sample. By parsing the output transcript from an acoustic-to-character sequence ASR model, the proposed approach produces a reliable DDK rate that is shown to have high clinical relevance. Longitudinal trajectories of DDK show that the magnitude of DDK familiarization could be a useful feature for early detection of bulbar onset. High correlation between the acoustic model features and bulbar function suggests their applicability in developing measures for augmenting or supplementing self-reports of ALSFRS-R.
Acknowledgment
This work was partially funded by grants NIH-NIDCD R01DC006859 and NIH-NIDCD R21DC019475 awarded to Julie Liss and Visar Berisha.
References
- Antolik, T. K., & Fougeron, C. (2013). Consonant distortion in dysarthria due to Parkinson's disease, amyotrophic lateral sclerosis and cerebellar ataxia. In F. Bimbot (Ed.), Interspeech 2013 (pp. 2152–2156). International Speech Communication Association. 10.21437/Interspeech.2013-509
- Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449–12460.
- Bansal, S., Kamper, H., Livescu, K., Lopez, A., & Goldwater, S. (2018). Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. arXiv preprint arXiv:1809.01431. 10.21437/Interspeech.2018-1326
- Icht, M., & Ben-David, B. M. (2014). Oral-diadochokinesis rates across languages: English and Hebrew norms. Journal of Communication Disorders, 48, 27–37. 10.1016/j.jcomdis.2014.02.002
- Ben-David, B. M., & Icht, M. (2016). Oral-diadochokinetic rates for Hebrew-speaking healthy aging population: Non-word versus real-word repetition. International Journal of Language & Communication Disorders, 52(3), 301–310. 10.1111/1460-6984.12272
- Ben-David, B. M., & Icht, M. (2018). The effect of practice and visual feedback on oral-diadochokinetic rates for younger and older adults. Language and Speech, 61(1), 113–134. 10.1177/0023830917708808
- Berisha, V., Krantsevich, C., Hahn, P. R., Hahn, S., Dasarathy, G., Turaga, P., & Liss, J. (2021). Digital medicine and the curse of dimensionality. NPJ Digital Medicine, 4(1), 1–8. 10.1038/s41746-021-00521-5
- Braley, M., Pierce, J. S., Saxena, S., De Oliveira, E., Taraboanta, L., Anantha, V., Lakhan, S. E., & Kiran, S. (2021). A virtual, randomized, control trial of a digital therapeutic for speech, language, and cognitive intervention in post-stroke persons with aphasia. Frontiers in Neurology, 12, 626780. 10.3389/fneur.2021.626780
- Cedarbaum, J. M., Stambler, N., Malta, E., Fuller, C., Hilt, D., Thurmond, B., & Nakanishi, A. (1999). The ALSFRS-R: A revised ALS functional rating scale that incorporates assessments of respiratory function. Journal of the Neurological Sciences, 169(1–2), 13–21. 10.1016/S0022-510X(99)00210-5
- Choe, J., & Han, J. S. (1998). Diadochokinetic rate of normal children and adults: A preliminary study. Communication Sciences & Disorders, 3(1), 183–194.
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Fletcher, S. G. (1972). Time-by-count measurement of diadochokinetic syllable rate. Journal of Speech and Hearing Research, 15(4), 763–770. 10.1044/jshr.1504.763
- Goldsack, J. C., Coravos, A., Bakker, J. P., Bent, B., Dowling, A. V., Fitzer-Attas, C., Godfrey, A., Godino, J. G., Gujar, N., Izmailova, E., Manta, C., Peterson, B., Vandendriessche, B., Wood, W. A., Wang, K. W., & Dunn, J. (2020). Verification, analytical validation, and clinical validation (V3): The foundation of determining fit-for-purpose for Biometric Monitoring Technologies (BioMeTs). NPJ Digital Medicine, 3(1), 1–15. 10.1038/s41746-020-0260-4
- Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006, June). Connectionist temporal classification: Labeling unsegmented sequence data with recurrent neural networks. In W. Cohen & A. Moore (Eds.), Proceedings of the 23rd International Conference on Machine Learning (pp. 369–376). Association for Computing Machinery. 10.1145/1143844.1143891
- He, K., Zhang, X., Ren, S., & Sun, J. (2016, June). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778). IEEE. 10.1109/CVPR.2016.90
- Hendrycks, D., Liu, X., Wallace, E., Dziedzic, A., Krishnan, R., & Song, D. (2020). Pretrained transformers improve out-of-distribution robustness. arXiv preprint arXiv:2004.06100. 10.18653/v1/2020.acl-main.244
- Karlsson, F., Schalling, E., Laakso, K., Johansson, K., & Hartelius, L. (2020). Assessment of speech impairment in patients with Parkinson's disease from acoustic quantifications of oral diadochokinetic sequences. The Journal of the Acoustical Society of America, 147(2), 839–851. 10.1121/10.0000581
- Kent, R. D. (1996). Hearing and believing: Some limits to the auditory-perceptual assessment of speech and voice disorders. American Journal of Speech-Language Pathology, 5(3), 7–23. 10.1044/1058-0360.0503.07
- Kent, R. D., Kim, Y., & Chen, L. M. (2022). Oral and laryngeal diadochokinesis across the life span: A scoping review of methods, reference data, and clinical applications. Journal of Speech, Language, and Hearing Research, 65(2), 574–623. 10.1044/2021_JSLHR-21-00396
- Li, J., Deng, L., Gong, Y., & Haeb-Umbach, R. (2014, April). An overview of noise-robust automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4), 745–777. 10.1109/TASLP.2014.2304637
- Liss, J. M., White, L., Mattys, S. L., Lansford, K., Lotto, A. J., Spitzer, S. M., & Caviness, J. N. (2009). Quantifying speech rhythm abnormalities in the dysarthrias. Journal of Speech, Language, and Hearing Research, 52(5), 1334–1352. 10.1044/1092-4388(2009/08-0208)
- Mundt, J. C., Snyder, P. J., Cannizzaro, M. S., Chappie, K., & Geralts, D. S. (2007). Voice acoustic measures of depression severity and treatment response collected via interactive voice response (IVR) technology. Journal of Neurolinguistics, 20(1), 50–64. 10.1016/j.jneuroling.2006.04.001
- Notley, S., & Magdon-Ismail, M. (2018). Examining the use of neural networks for feature extraction: A comparative analysis using deep learning, support vector machines, and k-nearest neighbor classifiers. arXiv preprint arXiv:1805.02294v2.
- Novotny, M., Melechovsky, J., Rozenstoks, K., Tykalova, T., Kryze, P., Kanok, M., Klempir, J., & Rusz, J. (2020). Comparison of automated acoustic methods for oral diadochokinesis assessment in amyotrophic lateral sclerosis. Journal of Speech, Language, and Hearing Research, 63(10), 3453–3460. 10.1044/2020_JSLHR-20-00109
- Orozco-Arroyave, J. R., Vásquez-Correa, J. C., Vargas-Bonilla, J. F., Arora, R., Dehak, N., Nidadavolu, P. S., Christensen, H., Rudzicz, F., Yancheva, M., Chinaei, H., Vann, A., Vogler, N., Bocklet, T., Cernak, M., Hannink, J., & Nöth, E. (2018). NeuroSpeech. SoftwareX, 8, 69–70. 10.1016/j.softx.2017.08.004
- Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015, April). Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5206–5210). IEEE. 10.1109/ICASSP.2015.7178964
- Prathanee, B., Thanaviratananich, S., & Pongjanyakul, A. (2003). Oral diadochokinetic rates for normal Thai children. International Journal of Language & Communication Disorders, 38(4), 417–428. 10.1080/1368282031000154042
- Rozenstoks, K., Novotny, M., Horakova, D., & Rusz, J. (2020, January). Automated assessment of oral diadochokinesis in multiple sclerosis using a neural network approach: Effect of different syllable repetition paradigms. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 28(1), 32–41. 10.1109/TNSRE.2019.2943064
- Rong, P., Yunusova, Y., Wang, J., & Green, J. R. (2015). Predicting early bulbar decline in amyotrophic lateral sclerosis: A speech subsystem approach. Behavioral Neurology, 2015, Article 183027. 10.1155/2015/183027
- Rong, P., Yunusova, Y., Wang, J., Zinman, L., Pattee, G. L., Berry, J. D., Perry, B., & Green, J. R. (2016). Predicting speech intelligibility decline in amyotrophic lateral sclerosis based on the deterioration of individual speech subsystems. PLOS ONE, 11(5), Article e0154971. 10.1371/journal.pone.0154971
- Rong, P. (2020). Automated acoustic analysis of oral diadochokinesis to assess bulbar motor involvement in amyotrophic lateral sclerosis. Journal of Speech, Language, and Hearing Research, 63(1), 59–73. 10.1044/2019_JSLHR-19-00178
- Rutkove, S. B., Qi, K., Shelton, K., Liss, J., Berisha, V., & Shefner, J. M. (2019). ALS longitudinal studies with frequent data collection at home: Study design and baseline data. Amyotrophic Lateral Sclerosis & Frontotemporal Degeneration, 20(1–2), 61–67. 10.1080/21678421.2018.1541095
- Segal, Y., Hitczenko, K., Goldrick, M., Buchwald, A., Roberts, A., & Keshet, J. (2022). DDKtor: Automatic diadochokinetic speech analysis. Interspeech 2022. 10.21437/Interspeech.2022-311
- Shah, J., Singla, Y. K., Chen, C., & Shah, R. R. (2021). What all do audio transformer models hear? Probing acoustic representations for language delivery and its structure. arXiv preprint arXiv:2101.00387. 10.1109/ICDMW58026.2022.00120
- Sinha, P., Vandana, V. P., Lewis, N. V., Jayaram, M., & Enderby, P. (2015). Evaluating the effect of risperidone on speech: A cross-sectional study. Asian Journal of Psychiatry, 15, 51–55. 10.1016/j.ajp.2015.05.005
- Smékal, Z., Mekyska, J., Rektorová, I., & Faúndez-Zanuy, M. (2013). Analysis of neurological disorders based on digital processing of speech and handwritten text. International Symposium on Signals, Circuits and Systems (ISSCS), 1–6. 10.1109/ISSCS.2013.6651178
- Srinivasan, V., Ramalingam, V., & Arulmozhi, P. (2014). Artificial neural network based pathological voice classification using MFCC features. International Journal of Science, Environment and Technology, 3(1), 291–302.
- Stegmann, G. M., Hahn, S., Liss, J., Shefner, J., Rutkove, S., Shelton, K., Duncan, C. J., & Berisha, V. (2020). Early detection and tracking of bulbar changes in ALS via frequent and remote speech analysis. NPJ Digital Medicine, 3(1), 1–5. 10.1038/s41746-020-00335-x
- Stipancic, K. L., Yunusova, Y., Berry, J. D., & Green, J. R. (2018). Minimally detectable change and minimal clinically important difference of a decline in sentence intelligibility and speaking rate for individuals with amyotrophic lateral sclerosis. Journal of Speech, Language, and Hearing Research, 61(11), 2757–2771. 10.1044/2018_JSLHR-S-17-0366
- Tafiadis, D., Zarokanellou, V., Prentza, A., Voniati, L., & Ziavra, N. (2021). Oral diadochokinetic rates for real words and non-words in Greek-speaking children. Open Linguistics, 7(1), 722–738. 10.1515/opli-2020-0178
- Tafiadis, D., Zarokanellou, V., Prentza, A., Voniati, L., & Ziavra, N. (2022). Diadochokinetic rates in healthy young and elderly Greek-speaking adults: The effect of types of stimuli. International Journal of Language & Communication Disorders. 10.1111/1460-6984.12747
- Tanchip, C., Guarin, D. L., McKinlay, S., Barnett, C., Kalra, S., Genge, A., Korngut, L., Green, J. R., Berry, J., Zinman, L., Yadollahi, A., Abrahao, A., & Yunusova, Y. (2022). Validating automatic diadochokinesis analysis methods across dysarthria severity and syllable task in amyotrophic lateral sclerosis. Journal of Speech, Language, and Hearing Research, 65(3), 940–953. 10.1044/2021_JSLHR-21-00503
- Tao, F., Daudet, L., Poellabauer, C., Schneider, S. L., & Busso, C. (2016). A portable automatic PA-TA-KA syllable detection system to derive biomarkers for neurological disorders. In Interspeech 2016 (pp. 362–366). 10.21437/Interspeech.2016-789
- Tjaden, K., & Watling, E. (2003). Characteristics of diadochokinesis in multiple sclerosis and Parkinson's disease. Folia Phoniatrica et Logopaedica, 55(5), 241–259. 10.1159/000072155
- Wang, Y. T., Gao, K., Zhao, Y., Kuruvilla-Dugdale, M., Lever, T. E., & Bunyak, F. (2019). DeepDDK: A deep learning based oral-diadochokinesis analysis software. IEEE-EMBS International Conference on Biomedical and Health Informatics, 2019, 1–4. 10.1109/BHI.2019.8834506
- Wang, Y. T., Kent, R. D., Duffy, J. R., Thomas, J. E., & Weismer, G. (2004). Alternating motion rate as an index of speech motor disorder in traumatic brain injury. Clinical Linguistics & Phonetics, 18(1), 57–84. 10.1080/02699200310001596160
- Zamani, P., Rezai, H., & Garmatani, N. T. (2017). Meaningful words and non-words repetitive articulatory rate (oral diadochokinesis) in Persian speaking children. Journal of Psycholinguistic Research, 46(4), 897–904. 10.1007/s10936-016-9469-4
- Zhu, J., Zhang, C., & Jurgens, D. (2022, May). Phone-to-audio alignment without text: A semi-supervised approach. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 8167–8171). IEEE. 10.1109/ICASSP43922.2022.9746112