Characterizing Dysarthria Diversity for Automatic Speech Recognition: A Tutorial from the Clinical Perspective

Hannah P Rowe; Sarah E Gutz; Marc F Maffei; Katrin Tomanek; Jordan R Green

doi:10.3389/fcomp.2022.770210

Author manuscript; available in PMC: 2023 Oct 19.

Published in final edited form as: Front Comput Sci. 2022 Apr 12;4:770210. doi: 10.3389/fcomp.2022.770210

PMCID: PMC10586392 NIHMSID: NIHMS1889487 PMID: 37860708

Abstract

Despite significant advancements in automatic speech recognition (ASR) technology, even the best performing ASR systems are inadequate for speakers with impaired speech. This inadequacy may be, in part, due to the challenges associated with acquiring a sufficiently diverse training sample of disordered speech. Speakers with dysarthria, which refers to a group of divergent speech disorders secondary to neurologic injury, exhibit highly variable speech patterns both within and across individuals. This diversity is currently poorly characterized and, consequently, difficult to adequately represent in disordered speech ASR corpora. In this paper, we consider the variable expressions of dysarthria within the context of established clinical taxonomies (e.g., Darley, Aronson, and Brown dysarthria subtypes). We also briefly consider past and recent efforts to capture this diversity quantitatively using speech analytics. Understanding dysarthria diversity from the clinical perspective and how this diversity may impact ASR performance could aid in (1) optimizing data collection strategies for minimizing bias; (2) ensuring representative ASR training sets; and (3) improving generalization of ASR across users and performance for difficult-to-recognize speakers. Our overarching goal is to facilitate the development of robust ASR systems for dysarthric speech using clinical knowledge.

Keywords: training corpora, dysarthric speech, automatic speech recognition, acoustic analysis of speech, clinical framework, diversity and inclusion

1. Introduction

Dysarthria, or impaired speech due to motoric deficits, can have a detrimental impact on functional communication, often leading to significantly reduced quality of life (Hartelius et al., 2008). For individuals with speech impairments, automatic speech recognition (ASR) systems can enhance accessibility and interpersonal communication. However, inadequate acoustic models continue to impede the widespread success of ASR for disordered speech (Gupta et al., 2016; Moore et al., 2018). The limits of disordered speech ASR may be, in part, a byproduct of the significant variety of abnormal speech patterns across individuals (Duffy, 2013) and their underrepresentation in training corpora (Gupta et al., 2016). Nevertheless, studies on ASR for dysarthria have rarely considered this diversity (Blaney and Wilson, 2000; Benzeghiba et al., 2007; Gupta et al., 2016; Keshet, 2018; Moore et al., 2018). In this perspective paper, we examine how speech impairment diversity has been characterized based on clinical models and how this diversity may impact ASR performance.

ASR can be broadly classified into speaker-independent (SI), speaker-dependent (SD), and speaker-adaptive (SA) (also known as personalized) systems. SI systems are trained on a large set of speech data and are not adapted to the user’s speech. Thus, although commercially developed SI ASR systems have demonstrated low word error rates (WER) for healthy speakers, these systems perform considerably worse with impaired speech (Moore et al., 2018). Performance often improves, however, when training sets include dysarthric speakers, thereby providing more variability on which to train (Mengistu and Rudzicz, 2011; Mustafa et al., 2014). In contrast to SI systems, SD systems are trained only on the user of the system and therefore can achieve high recognition accuracy, whereas SA systems are trained on a large dataset, such as the ones used for SI systems, but adapt to the user’s speech over time. Previous work has found that SA systems adapted to a user’s own speech perform much better than SD and SI models (Green et al., 2021; Kim et al., 2013; Mengistu and Rudzicz, 2011; Mustafa et al., 2014; Takashima et al., 2020; Xiong et al., 2019). Green et al. (2021), for example, recently demonstrated that the WER for short phrases using end-to-end (e2e) ASR models was 4.6% for SA models and 31% for SI models. However, SA and SD systems both require training data from the speaker, which can be cumbersome for individuals with neurodegenerative diseases who are prone to fatigue.
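Because WER is the yardstick used throughout this comparison, a minimal sketch of how it is commonly computed may be helpful: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the ASR output, divided by the number of reference words. The function name and the example phrases below are illustrative assumptions and are not taken from any of the cited studies.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of four reference words is dropped by the recognizer.
example_wer = word_error_rate("turn on the light", "turn on light")
```

Note that WER can exceed 1.0 when the recognizer inserts many extra words, which is one reason it is reported as an error rate rather than as an accuracy.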

Despite the significant progress in ASR development, even the highest performing out-of-the-box ASR models are inadequate for impaired speakers. Although poor performance has largely been attributed to the shortage of disordered speech training datasets, closing the performance gap is likely to require not only more data but also sufficiently diverse corpora. Indeed, solely adding speakers in an attempt to represent the range of diversity is inefficient, expensive, and possibly unachievable. Ensuring dysarthric speech diversity, however, requires conceptual schemes for identifying salient atypical speech variables and their expressed ranges across individuals. In this paper, we consider several conceptual schemes used by speech-language pathologists to clinically characterize dysarthria diversity, often for the purpose of speech diagnosis. An improved understanding of the diversity inherent to dysarthria and its potential impact on ASR performance could lead to (1) optimized data collection strategies for minimizing bias; (2) sufficiently representative ASR training sets; and (3) more widespread generalization across ASR users and, in turn, stronger performance for difficult-to-recognize speakers. We consider the following questions:

What types of diversity need to be represented in dysarthria ASR training corpora?
What phonemic patterns might impact dysarthria ASR performance?
What can be done to adequately represent the different sources of variability in dysarthria ASR training corpora?

2. Characterizing Dysarthria Diversity

2.1. What types of diversity need to be represented in dysarthria ASR training corpora?

2.1.1. Diversity in speech severity

To date, the most frequently used metric for distinguishing variation in a dysarthria research cohort is overall speech impairment severity (Duffy, 2013). Severity is a multidimensional construct that refers to the speaker’s overall impairment and includes a range of components, including naturalness, intelligibility, and subsystem abnormalities (see Section 2.1.3) (Duffy, 2013). Severity is often indexed by trained listeners, such as speech-language pathologists, who use adjectival descriptors (e.g., mild, moderate, severe, profound). Alternatively, severity can be assessed using human transcription intelligibility, which indicates a listener’s ability to understand the speaker based on the speech signal alone (Yorkston et al., 2007). While a functional metric, intelligibility is just one component of severity and does not necessarily account for all the different fluctuations in speech that are influenced by severity (e.g., changes in voice and resonance), especially for milder speech impairment (Rong et al., 2015).

Including the full range of speech severities in ASR training sets is essential because good recognition accuracy for mild speech is unlikely to generalize to more severely affected speech (Moore et al., 2018). Thus, sufficient representation of speakers with severe dysarthria, in addition to those with mild and moderate impairments, in the training dataset could provide a more sustainable approach for enabling models that generalize to speakers across the severity continuum. Representing diversity only with severity, however, ignores the substantial variety of aberrant speech features that characterize clinically distinct dysarthria variants. Other sources of diversity in dysarthric speech must, therefore, be considered to develop inclusive and sufficiently representative datasets.

2.1.2. Diversity in dysarthria type

One of the most established clinical taxonomies for speech motor disorders was developed over 50 years ago by Darley, Aronson, and Brown (DAB) (Darley et al., 1969). The DAB labeling system distinguishes 38 atypical speech features that are rated on a 7-point scale and groups dysarthria types based on speech feature profiles. While creating the DAB model, the authors stratified dysarthric speakers based on clusters of speech features associated with lesions in specific regions of the central and peripheral nervous systems. These clusters are associated with at least five subtypes of dysarthria: flaccid, spastic, ataxic, hypokinetic, and hyperkinetic (see Figure 1). In many cases, patients exhibit a combination of the five subtypes (i.e., mixed dysarthria) (Darley et al., 1969). In addition to its clinical and neurological implications, the DAB model can serve as a basic heuristic in developing comprehensive and representative ASR corpora.

Figure 1. Breakdown of dysarthria subtypes within a widely used taxonomy of speech motor disorders.

*Note:* ALS = amyotrophic lateral sclerosis; CP = cerebral palsy; AT = ataxia; HD = Huntington’s disease; MS = multiple sclerosis; MSA = multiple systems atrophy; PD = Parkinson’s disease; PSP = progressive supranuclear palsy; TD = tardive dyskinesia; ARTIC = articulation; PHON = phonation; PROS = prosody; RES = resonance; RESP = respiration.

One disadvantage of the taxonomy, however, is that it relies entirely on subjective observations, which requires expert clinical training and may be too coarse and unreliable for capturing the range of diversity in dysarthria (Kent, 1996). To address this limitation, researchers have been exploring the diagnostic utility of a wide variety of speech analytic approaches for identifying variants of disordered speech (Rusz et al., 2018; Rowe et al., 2020) — an effort referred to as quantitative or digital phenotyping.

2.1.3. Diversity in speech subsystems impairment

Regardless of dysarthria subtype, disordered speech is the byproduct of impairments in neural control over one or more of the five speech subsystems (i.e., respiration, phonation, resonance, prosody, and articulation) (see Figure 1). Objective characterizations of dysarthria through quantitative and digital phenotyping have allowed for more precise measures of speech, which has further illuminated the diversity in subsystem functioning. Indeed, deficits in each subsystem can engender specific aberrant speech features, many of which can be detected in the acoustic signal. For example, respiratory deficits in ataxic dysarthria can lead to excessive loudness variations, quantified acoustically using amplitude modulation (Leong et al., 2014); similarly, phonatory deficits in flaccid dysarthria can lead to a breathy vocal quality, quantified acoustically using cepstral peak prominence (Heman-Ackah et al., 2002).

While phonatory, resonatory, respiratory, and prosodic deficits can significantly limit communicative capacity, articulatory subsystem impairments have the greatest impact on speech intelligibility (Lee et al., 2014; Sidtis et al., 2011). Given the strong association between intelligibility and ASR performance (Jacks et al., 2019; McHenry and LaConte, 2010; Tu et al., 2016), it is possible that (1) articulatory motor impairments may be a major contributor to degraded ASR performance and (2) representing the range of articulatory motor impairments seen in dysarthria may maximize ASR accuracy and generalizability.

Considering the potential value of articulatory features and the need for objective and reliable measures of speech function, our group conducted a scoping review of the dysarthria literature to summarize the variety of acoustic techniques used to characterize articulatory impairments in neurodegenerative diseases (Rowe et al., under review). Across the 89 articles that met our inclusion criteria, we identified 24 different articulatory impairment features. To summarize the findings, we stratified the acoustic features into five aspects of articulatory motor control: Coordination, Consistency, Speed, Precision, and Rate (Rowe and Green, 2019). The findings demonstrated variable manifestation of articulatory impairments (1) across diseases (e.g., speakers with ataxia [AT] exhibited greater impairments in features associated with Rate than did speakers with Parkinson’s disease [PD]) and (2) across articulatory components within each disease (e.g., speakers with Huntington’s disease [HD] demonstrated greater impairments in Consistency than in Rate) (see Figure 2) (Rowe et al., under review).

Figure 2. Meta-analysis of the mean effect size (disease group compared to healthy controls) for all acoustic features within each articulatory component.

*Note:* ALS = amyotrophic lateral sclerosis; AT = ataxia; HD = Huntington’s disease; MS = multiple sclerosis; MSA = multiple systems atrophy; PD = Parkinson’s disease; PSP = progressive supranuclear palsy.

2.1.4. Within-speaker variability due to motor disease type, disease progression, fatigue, and medication use

Beyond severity, dysarthria type, and subsystem involvement, there is a significant amount of within-speaker variability that should be considered in ASR corpora development. For example, some dysarthria types, such as ataxic dysarthria, exhibit inconsistent motor patterns of limb and speech muscles (Darley et al., 1969), which can result in significant variability even across repetitions of the same utterance. Furthermore, across all dysarthria types, changes in disease progression, fatigue, and medication use can lead to rapid and transitory fluctuations in speech. Progressive diseases, such as ALS or PD, can result in declines in speech performance over several months or even weeks. Moreover, patient populations who are prone to fatigue may show daily fluctuations in speech patterns (Abraham and Drory, 2012). Lastly, medication use can result in dramatic changes, both positive and negative, in speech output. For example, levodopa has been related to improvements in voice quality, pitch variation, and articulatory function in patients with PD (Wolfe et al., 1975), while antipsychotic medication has been related to excess word stress and increased timing deficits in patients with HD (Rusz et al., 2014). To mitigate the detrimental effects of within-speaker variability on ASR performance, training datasets may need to include multiple instances of the same utterances recorded at different timepoints in individuals experiencing frequent speech changes (due to motor disease type, disease progression, fatigue, and/or medication use).

2.2. What phonemic patterns are present in dysarthric speech and therefore might impact ASR performance?

The influence of dysarthria diversity on phonemes is complex and not fully understood, as phoneme production involves intricate interactions between speech subsystems. However, while modern e2e ASR systems often operate at the word or “word piece” level (Kochenderfer, 2015) and employ a strong language model, sound-level distortions may still have a substantial negative impact on the recognition accuracy. In these cases, it may be necessary to compensate for these acoustic distortions by adjusting the acoustic model or increasing their representation in the training data. Previous research has used methods such as phoneme confusion matrices to identify phonetic error patterns and create pronunciation models. For instance, Caballero-Morales and Trujillo-Romero (2014) examined substitution errors made by an adapted SI ASR system¹ for a speaker with severe dysarthria. They noted that phonemes /r/, /s/, /sh/, and /th/ and phonemes /k/, /m/, and /p/ were consistently substituted by /f/ and /t/, respectively. The authors suggested that an improved system could use these error patterns to estimate /k/, /m/, or /p/ from a recognized /t/ (Caballero-Morales and Trujillo-Romero, 2014). However, most of the dysarthria ASR literature is based on datasets that combine all subtypes of dysarthria and, therefore, do not specify which phonemes are misrecognized for each subtype. We propose that a heterogeneous corpus of disordered speech based on known error patterns may improve phoneme recognition. Below, we describe a subset of such error patterns in individuals with dysarthria. A more detailed and extensive list of these patterns can be found in Duffy (2013).
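The phoneme confusion approach described above can be sketched in a few lines. The example below is a minimal illustration only: it assumes that intended and recognized phonemes have already been aligned into pairs, the function names are hypothetical, and the toy counts are invented to mimic the reported pattern in which /k/, /m/, and /p/ are recognized as /t/. It is not the pronunciation model used by Caballero-Morales and Trujillo-Romero (2014).

```python
from collections import Counter, defaultdict


def build_confusion_counts(aligned_pairs):
    """Count how often each intended phoneme was recognized as each phoneme."""
    counts = defaultdict(Counter)
    for intended, recognized in aligned_pairs:
        counts[intended][recognized] += 1
    return counts


def likely_intended(counts, recognized):
    """Rank intended phonemes by how often they surfaced as `recognized`."""
    candidates = Counter()
    for intended, row in counts.items():
        n = row.get(recognized, 0)
        if n > 0:
            candidates[intended] = n
    return candidates.most_common()


# Invented (intended, recognized) pairs for one hypothetical speaker.
toy_pairs = (
    [("k", "t")] * 3 + [("k", "k")] * 1
    + [("p", "t")] * 2
    + [("m", "t")] * 1
    + [("t", "t")] * 4
    + [("s", "f")] * 2
)

counts = build_confusion_counts(toy_pairs)
# Which intended phonemes could lie behind a recognized /t/?
ranking_for_t = likely_intended(counts, "t")
```

In this toy data, a recognized /t/ most often reflects a true /t/, followed by /k/, /p/, and /m/, which is the kind of evidence a system could use to reweight its hypotheses for a given speaker or subtype.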

Few studies, to our knowledge, have examined the association between dysarthria subtype and ASR phonemic error patterns. Shor et al. (2019) examined the WER of an ASR model fine-tuned to speakers with ALS, a neurodegenerative disease characterized by mixed flaccid-spastic dysarthria. The study found that (1) /p/, /k/, /f/, and /zh/ were among five phonemes that accounted for the highest likelihood of deletion and (2) /m/ and /n/ accounted for 17% of substitution/insertion errors in the ASR response (Shor et al., 2019). The authors’ first finding is consistent with the muscular weakness and low muscle tone characteristic of flaccid dysarthria, which frequently leads to deficits in sounds that require a buildup of pressure (i.e., pressure consonants) (Darley et al., 1969). Furthermore, flaccidity may result in hypernasality due to air escape through the nasal cavity (i.e., velopharyngeal insufficiency). As a result, speakers often incorrectly insert nasal consonants, such as /m/ and /n/, during speech (Duffy, 2013). Shor et al.’s (2019) latter finding is consistent with this evidence and is presumably related to increased hypernasality in speakers with ALS. Increased severity in ALS speakers also affects phonetic features, including stop-nasal (e.g., “no” for “toe”) and glottal-null (e.g., “high” for “eye”) contrasts (Kent et al., 1989). Additionally, abnormal lingual displacement and coupling in ALS have been associated with reduced vowel distinctiveness (Rong et al., 2021).

Spastic dysarthria is characterized by muscle stiffness and rigidity (Darley et al., 1969). Prior work on speakers with cerebral palsy (CP), who often exhibit pure spastic dysarthria, demonstrated that the predominant phonemic errors occurred on fricatives (Platt et al., 1980), suggesting that spasticity impairs oral constriction. A more recent study corroborated this finding by highlighting abnormalities in the fricative /s/ in speakers with CP (Chen and Stevens, 2001). Another etiology of spastic dysarthria, traumatic brain injury, can result in phonetic contrast errors between glottal-null (e.g., “hall”/“all”), voiced-voiceless (e.g., “bit”/“pit”), alveolar-palatal (e.g., “shy”/“sigh”), and nasal-stop (e.g., “meat”/“beat”) sounds (Roy et al., 2001).

Ataxic dysarthria is characterized by muscle weakness and incoordination (Darley et al., 1969). Seminal acoustic work has described the impact of voice onset time (VOT) disturbances on voicing contrasts (Ackermann and Hertrich, 1997). Later research revealed similar findings, demonstrating that VOT abnormalities in speakers with Friedreich’s ataxia (FA) resulted in voicing contrast errors (e.g., /d/ vs. /t/ or /s/ vs. /z/) (Blaney and Hewlett, 2007).

Hypokinetic dysarthria is characterized by reduced range and speed of movement (Darley et al., 1969). Acoustically, speakers with hypokinetic dysarthria secondary to PD tend to replace a stop gap with low-intensity noise due to incomplete plosive closure, a process known as spirantization, which often occurs on voiceless phonemes, such as /p/, /t/, or /k/ (Canter, 1965). Reduced range of motion characteristic of hypokinesia also leads to articulatory undershoot and is reflected in features such as reduced second formant (F2) slope and restricted vowel space (Kim et al., 2009), which can lead to vowel centralization (e.g., /uh/ for /i/).
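One common way to put a number on a restricted vowel space is to measure the area of the polygon formed by the corner vowels in the F1/F2 plane. The sketch below uses the shoelace formula for that area. It is offered as an illustration under stated assumptions rather than as the procedure of the studies cited above: the function name is hypothetical, and the formant values (in Hz) are invented for demonstration and are not measurements from any speaker.

```python
def vowel_space_area(corner_formants):
    """Area (Hz^2) of the polygon whose vertices are (F1, F2) pairs.

    The vertices must be listed in order around the perimeter,
    e.g. /i/, /ae/, /a/, /u/ for a quadrilateral vowel space.
    """
    n = len(corner_formants)
    if n < 3:
        raise ValueError("at least three corner vowels are required")
    twice_area = 0.0
    for k in range(n):
        f1_a, f2_a = corner_formants[k]
        f1_b, f2_b = corner_formants[(k + 1) % n]
        twice_area += f1_a * f2_b - f1_b * f2_a  # shoelace cross term
    return abs(twice_area) / 2.0


# Invented (F1, F2) values in Hz, ordered /i/, /ae/, /a/, /u/.
typical_speaker = [(300, 2300), (700, 1800), (750, 1100), (320, 900)]
centralized_speaker = [(400, 1900), (600, 1650), (620, 1300), (410, 1200)]

typical_area = vowel_space_area(typical_speaker)
centralized_area = vowel_space_area(centralized_speaker)
```

With these invented values the centralized configuration yields a markedly smaller area, which is the acoustic signature of the vowel centralization described in the text.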

Lastly, hyperkinetic dysarthria, which is characterized by excess movement, encompasses a diverse range of speech characteristics (Darley et al., 1969). The phonemic errors in speakers with hyperkinesia are often influenced by the associated movement disorder. For example, hyperkinesia associated with HD may lead to variability in VOT and incomplete closure of pressure consonants, which could result in voicing substitutions (e.g., /t/ for /d/) and manner substitutions (e.g., /z/ for /d/), respectively (Hertrich and Ackermann, 1994). However, hyperkinesia associated with tardive dyskinesia (TD), an antipsychotic medication side effect, may lead to excessive formant fluctuations and distorted vowels during sustained phonation (e.g., sustained /ah/) (Gerratt et al., 1984).

Overall, understanding the phonemic patterns specific to different dysarthria types can provide insight into which words or “word pieces” (e.g., those that include pressure consonants or nasal sounds) may need to be disproportionately represented in the training data.

3. Considerations for Improving ASR Corpora

3.1. What can be done to adequately represent the different sources of diversity in dysarthria ASR training corpora?

Deploying e2e machine learning models may preclude the need to understand the underlying pathophysiological phenomena given a sufficiently diverse training set. However, the current ineffectiveness of ASR approaches for dysarthric speech suggests that the limits of e2e models are presently defined by the lack of training sample diversity, which in this case is the wide variety and variability of dysarthric speech patterns both between and within speakers. Attempting to capture this diversity solely by adding more speakers is a costly endeavor that would likely be insufficient. The domain knowledge that we have discussed in this paper is likely to help optimize participant selection strategies in large cohort studies on disordered speech ASR.

We propose that developing diverse corpora may involve a principled method of creating datasets for highly heterogeneous data, which may best be achieved through a three-pronged approach: (1) clinical phenotyping (i.e., characterizations of speech based on perceptual features); (2) quantitative phenotyping (i.e., characterizations of speech based on objective features); and (3) data-driven clustering (unsupervised groupings of speakers). Clinical phenotyping will require domain experts, such as speech-language pathologists, to guide the inclusion criteria for ensuring adequate representation of atypical speech characteristics (e.g., speech severity, articulatory and phonatory deficits, etc.). Ultimately, with large datasets and validated quantitative measures of speech, data-driven clustering of dysarthric speech characteristics may become feasible. Upon the development of larger and more diverse datasets, quantifying heterogeneity with a diversity metric may be the next step toward ensuring that the training samples are sufficiently diverse. Such a metric will only be possible with a deeper understanding of the potential variables to consider in dysarthria and their impact on speech changes.
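The text does not define the diversity metric, so the following is only one possible instantiation, shown to make the idea concrete: normalized Shannon entropy over a categorical speaker attribute such as dysarthria subtype or severity level. A value of 1.0 means speakers are spread evenly across all categories, and 0.0 means every speaker falls into a single category. The function name, the category list, and the toy corpora are assumptions introduced for this example.

```python
import math
from collections import Counter


def normalized_entropy(labels, categories):
    """Shannon entropy of the label distribution, scaled to the range [0, 1]."""
    if len(categories) < 2:
        raise ValueError("at least two categories are required")
    counts = Counter(labels)
    unknown = set(counts) - set(categories)
    if unknown:
        raise ValueError(f"labels outside the category list: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no speakers provided")

    entropy = 0.0
    for category in categories:
        p = counts[category] / total
        if p > 0:
            entropy -= p * math.log(p)
    # Dividing by the maximum possible entropy makes corpora comparable.
    return entropy / math.log(len(categories))


SUBTYPES = ["flaccid", "spastic", "ataxic", "hypokinetic", "hyperkinetic"]

# Invented corpora: one balanced across subtypes, one dominated by a single subtype.
balanced_corpus = SUBTYPES * 4
skewed_corpus = ["hypokinetic"] * 16 + ["spastic"] * 2 + ["ataxic"] * 2

balanced_score = normalized_entropy(balanced_corpus, SUBTYPES)
skewed_score = normalized_entropy(skewed_corpus, SUBTYPES)
```

A single-attribute score like this ignores interactions between attributes (for example, severity within subtype) and within-speaker variability, so in practice it would more likely serve as one check among several than as a complete measure of corpus diversity.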

Of course, all these approaches will need to be supported by large-scale data collection efforts that will require partnerships with speech-language pathology clinics, private foundations, and medical institutions (Macdonald et al., 2021). This effort will be greatly facilitated by the development of secure but accessible electronic medical record systems and mHealth platforms (i.e., the use of mobile technologies that improve health outcomes), which will, in turn, aid in identifying and collecting speech recordings from individuals with diverse etiologies and speech impairments (Ramanarayanan et al., under review).

4. Conclusions

Increasing ASR accuracy for dysarthric speech will have significant implications for communication and quality of life. This paper outlined the sources of diversity inherent to speech motor disorders, their potential impact on ASR performance, and the importance of their representation in training sets. Representing a wide diversity of dysarthria subtypes in ASR corpora may be an important step for improving disordered speech ASR and is consistent with the call to action in the AI community to reduce bias in the training data by increasing diversity.

Funding

This work was supported by NIH-NIDCD under Grants K24DC016312, F31DC019556, F31DC019016, and F31DC020108.

Footnotes

Conflict of Interest

The fourth author is an employee of Google LLC. The other four authors do not have any commercial or financial relationships that could be construed as a potential conflict of interest.

¹ Note that the adapted ASR system did not employ an e2e deep learning model. Thus, because the system did not possess a strong language model, phoneme confusion could be measured in a meaningful way.

References

Abraham A and Drory VE (2012). Fatigue in motor neuron diseases. Neuromuscul 22(3), S198–S202.
Ackermann H and Hertrich I (1997). Voice onset time in ataxic dysarthria. Brain Lang 56(3), 321–333.
Benzeghiba M, De Mori R, Deroo O, Dupont S, Erbes T, Jouvet D, Fissore L, Laface P, Mertins A, Ris C, Rose R, Tyagi V, and Wellekens C (2007). Automatic speech recognition and speech variability: A review. Speech Commun 49, 763–786.
Blaney B and Hewlett N (2007). Dysarthria and Friedreich’s ataxia: What can intelligibility assessment tell us? Int. J. Lang. Comm. Disord 42(1), 19–37.
Blaney B and Wilson J (2000). Acoustic variability in dysarthria and computer speech recognition. Clin. Linguist. Phon 14(4), 307–327.
Caballero-Morales SO and Trujillo-Romero F (2014). Evolutionary approach for integration of multiple pronunciation patterns for enhancement of dysarthric speech recognition. Expert Syst. Appl 41, 841–852.
Canter GJ (1965). Speech characteristics of patients with Parkinson’s disease: Articulation, diadochokinesis, and overall speech adequacy. J. Speech Hear. Disord 30(3), 217–224.
Chen H and Stevens KN (2001). An acoustical study of the fricative /s/ in the speech of individuals with dysarthria. J. Speech, Lang. Hear. Res 44(6), 1300–1314.
Darley FL, Aronson AE, and Brown JR (1969). Differential diagnostic patterns of dysarthria. J. Speech, Lang. Hear. Res 12, 246–269.
Duffy JR (2013). Motor Speech Disorders: Substrates, Differential Diagnosis, and Management (3rd Edition). Saint Louis, Missouri: Elsevier Mosby.
Gerratt BR, Goetz CG, and Fisher HB (1984). Speech abnormalities in tardive dyskinesia. Arch. Neurol 41, 273–276.
Green JR, MacDonald B, Jiang P-P, Cattiau J, Heywood R, Cave R, Seaver K, Ladewig M, Tobin J, Brenner M, Nelson PQ, and Tomanek K (2021). Automatic Speech Recognition of Disordered Speech: Personalized models outperforming human listeners on short phrases. Proc. INTERSPEECH, 1–5.
Gupta R, Chaspari T, Kim J, Kumar N, Bone D, and Narayanan S (2016). Pathological speech processing: State-of-the-art, current challenges, and future directions. Proc. IEEE Int. Conf. Acoust. Speech Signal Process, 6470–6474.
Hartelius L, Elmberg M, Holm R, Loovberg AS, and Nikolaidis S (2008). Living with dysarthria: Evaluation of a self-report questionnaire. Folia Phoniatr. Logop 60(1), 11–19.
Heman-Ackah YD, Michael DD, and Goding GS (2002). The relationship between cepstral peak prominence and selected parameters of dysphonia. J. Voice 16(1), 20–27.
Hertrich I and Ackermann H (1994). Acoustic analysis of speech timing in Huntington’s disease. Brain Lang 47, 182–196.
Jacks A, Haley KL, Bishop G, and Harmon TG (2019). Automated speech recognition in adult stroke survivors: Comparing human and computer transcriptions. Folia Phoniatr. Logop 71(5–6), 286–296.
Kent RD (1996). Hearing and believing: Some limits to the auditory perceptual assessment of speech and voice disorders. Am. J. Speech Lang. Pathol 5, 7–23.
Kent RD, Weismer G, Kent JF, and Rosenberg JC (1989). Toward phonetic intelligibility testing in dysarthria. J. Speech Hear. Disord 54(4), 482–499.
Keshet J (2018). Automatic speech recognition: A primer for speech language pathology researchers. Int. J. Speech-Lang. Pathol 20(6), 599–609.
Kim MJ, Yoo J, and Kim H (2013). Dysarthric speech recognition using dysarthria-severity-dependent and speaker-adaptive models. Proc. INTERSPEECH, 3622–3626.
Kim Y, Weismer G, Kent RD, and Duffy JR (2009). Statistical models of F2 slope in relation to severity of dysarthria. Folia Phoniatr. Logop 61(6), 329–335.
Kochenderfer MJ (2015). Decision Making Under Uncertainty. Cambridge, Massachusetts: The MIT Press.
Lee J, Hustad KC, and Weismer G (2014). Predicting speech intelligibility with a multiple speech subsystems approach in children with cerebral palsy. J. Speech, Lang. Hear. Res 57, 1666–1678.
Leong V, Stone M, Turner R, and Goswami U (2014). A role for amplitude modulation phase relationships in speech rhythm perception. J. Acoust. Soc. Am 136(1), 366–381.
Macdonald RL, et al. (2021). Disordered speech data collection: Lessons learned at 1 million utterances from Project Euphonia. Proc. INTERSPEECH, 1–5.
McHenry MA and LaConte SM (2010). Computer speech recognition as an objective measure of intelligibility. J. Med. Speech. Lang. Pathol 18(4), 99–103.
Mengistu KT and Rudzicz F (2011). Adapting acoustic and lexical models to dysarthric speech. Proc. IEEE Int. Conf. Acoust. Speech Signal Process, 4924–4927.
Moore M, Venkateswara H, and Panchanathan S (2018). Whistle blowing ASRs: Evaluating the need for more inclusive speech recognition systems. Proc. INTERSPEECH, 466–470.
Mustafa MB, Salim SS, Mohamed N, Al-Qatab B, and Siong CE (2014). Severity-based adaptation with limited data for ASR to aid dysarthric speakers. PLoS ONE 9(1), 1–11.
Platt LJ, Andrews G, and Howie PM (1980). Dysarthria of adult cerebral palsy: II. Phonemic analysis of articulation errors. J. Speech, Lang. Hear. Res 23, 51–55.
Ramanarayanan V, Lammert AC, Rowe HP, Quatieri TF, and Green JR (under review). Speech as a biomarker: Opportunities, interpretability, and challenges.
Rong P, Yunusova Y, Wang J, and Green JR (2015). Predicting early bulbar decline in amyotrophic lateral sclerosis: A speech subsystem approach. Behav. Neurol, 1–11.
Rong P, Usler E, Rowe LM, Allison K, Woo J, El Fakhri G, and Green JR (2021). Speech intelligibility loss due to amyotrophic lateral sclerosis: The effect of tongue movement reduction on vowel and consonant acoustic features. Clin. Linguist. Phon 11, 1–22.
Rowe HP and Green JR (2019). Profiling speech motor impairments in persons with amyotrophic lateral sclerosis: An acoustic-based approach. Proc. INTERSPEECH, 4509–4513.
Rowe HP, Gutz SE, Maffei M, and Green JR (2020). Acoustic based articulatory phenotypes of amyotrophic lateral sclerosis and Parkinson’s disease: Towards an interpretable, hypothesis-driven framework of motor control. Proc. INTERSPEECH, 4816–4820.
Rowe HP, Shellikeri S, Yunusova Y, Chenausky K, and Green JR (under review). Quantifying articulatory impairments in neurodegenerative motor diseases: A scoping review and meta-analysis of hypothesis-driven acoustic features.
Roy N, Leeper HA, Blomgren M, and Cameron RM (2001). A description of phonetic, acoustic, and physiological changes associated with improved intelligibility in a speaker with spastic dysarthria. Am. J. Speech Lang. Pathol 10, 274–288.
Rusz J, Klempir J, Tykalova T, Barborova E, Cmejla R, Ruzicka E, and Roth J (2014). Characteristics and occurrence of speech impairment in Huntington’s disease: Possible influence of antipsychotic medication. J. Neural Transm 121, 1529–1539.
Rusz J, Benova B, Ruzickova H, Novotny M, Tykalova T, Hlavnicka J, Uher T, Vaneckova M, Andelova M, Novotna K, Kadrozkova L, and Horakova D (2018). Characteristics of motor speech phenotypes in multiple sclerosis. Mult. Scler. Relat 19, 62–69.
Shor J, Emanuel D, Lang O, Tuval O, Brenner M, Cattiau J, Vieira F, McNally M, Charbonneau T, Nollstadt M, Hassidim A, and Matias Y (2019). Personalizing ASR for dysarthric and accented speech with limited data. arXiv:1907.13511, 1–5.
Sidtis JJ, Ahn JS, Gomez C, and Sidtis D (2011). Speech characteristics associated with three phenotypes of ataxia. J. Commun. Disord 44(4), 478–492.
Takashima R, Takiguchi T, and Ariki Y (2020). Two-step acoustic model adaptation for dysarthric speech recognition. Proc. IEEE Int. Conf. Acoust. Speech Signal Process, 6104–6108.
Tu M, Wisler A, Berisha V, and Liss JM (2016). The relationship between perceptual disturbances in dysarthric speech and automatic speech recognition performance. J. Acoust. Soc. Am 140(5), 416–422.
Yorkston KM, Beukelman DR, Hakel M, and Dorsey M (2007). Speech intelligibility test for windows [Measurement instrument]. Nebraska: Institute for Rehabilitation Science and Engineering at Madonna Rehabilitation Hospital.
Wolfe VI, Garvin JS, Bacon M, and Waldrop W (1975). Speech changes in Parkinson’s disease during treatment with L-dopa. J. Commun. Disord 8, 271–279. [DOI] [PubMed] [Google Scholar]
Xiong F, Barker J, and Christensen H (2019). Phonetic analysis of dysarthric speech tempo and applications to robust personalized dysarthric speech recognition. Proc. IEEE Int. Conf. Acoust. Speech Signal Process, 5836–5840. [Google Scholar]
