. Author manuscript; available in PMC: 2021 Feb 17.
Published in final edited form as: Annu Int Conf IEEE Eng Med Biol Soc. 2010;2010:5201–5204. doi: 10.1109/IEMBS.2010.5626104

Predicting Severity of Parkinson’s Disease from Speech

Meysam Asgari 1, Izhak Shafran 1
PMCID: PMC7889280  NIHMSID: NIHMS1670849  PMID: 21095825

Abstract

Parkinson’s disease is known to cause mild to profound communication impairments depending on the stage of progression of the disease. There is growing interest in home-based assessment tools for measuring the severity of Parkinson’s disease, and speech is an appealing source of evidence. This paper reports tasks to elicit a versatile sample of voice production, algorithms to extract useful information from speech, and models to predict the severity of the disease. Apart from standard features from the time domain (e.g., energy, speaking rate), spectral domain (e.g., pitch, spectral entropy) and cepstral domain (e.g., mel-frequency warped cepstral coefficients), we also estimate harmonic-to-noise ratio, shimmer and jitter using our recently developed algorithms. In a preliminary study, we evaluate the proposed paradigm on data collected through 2 clinics from 82 subjects in 116 assessment sessions. Our results show that the information extracted from speech, elicited through 3 tasks, can predict the severity of the disease to within a mean absolute error of 5.7 with respect to the clinical assessment using the Unified Parkinson’s Disease Rating Scale; the range of the target motor sub-scale is 0 to 108. Our analysis shows that elicitation of speech through a less constrained task provides useful information not captured in the widely employed phonation task. While still preliminary, our results demonstrate that the proposed computational approach has promising real-world applications such as home-based assessment or telemonitoring of Parkinson’s disease.

I. Introduction

Speech is a complex task which requires the sequential and parallel control of multiple systems and mechanisms in a highly refined and specific manner. Respiration, or breathing, drives air through the constriction in the larynx to create a sound source for speech. The resulting vocal note is modified by velo-pharyngeal and oral-labial articulators to produce what we hear as speech sounds. All these elements work as a finely balanced and integrated system. Parkinson’s disease (PD), a degenerative disorder of the central nervous system, is characterized by muscle rigidity, tremor, and a loss and slowing of physical movement. Not surprisingly, PD can affect all of the components of speech production - breathing, laryngeal function, articulator movement - as well as their coordination for smooth and fluent speech. The resulting dysarthric speech often exhibits monotonous pitch, slurring, reduced stress, inappropriate pauses, variable speech rate, short rushes of speech, harsh voice, imprecise consonant production and breathy voice, among other symptoms [1].

Currently, the severity of Parkinson’s disease is assessed clinically using a widely accepted metric, the Unified Parkinson’s Disease Rating Scale (UPDRS), which ranges from 0 to 176, with 0 corresponding to a healthy state and 176 to a severe affliction [2]. The assessment is time-consuming and is performed by trained medical personnel, which becomes burdensome when, for example, frequent re-assessment is required to fine-tune the dosage of Levodopa for controlling the progression of the disease or to monitor the effectiveness of other forms of intervention. Not surprisingly, there has been growing interest in creating tools and methods for alternative home-based assessments of this disease. Easier methods of assessment could potentially become an important screening tool for a wider population, as PD is the second most common neurodegenerative disease in the US after Alzheimer’s disease. Since speech can be easily collected remotely across large distances using commonplace hand-held devices, it is an appealing source of evidence for telemonitoring PD.

Most clinical ratings of speech pathologies such as hypokinetic dysarthria in PD are based on the perceptions of trained clinicians. Clinicians typically use 20 acoustic dimensions for differential diagnosis of dysarthria [3]. One approach for automating assessment is to quantify these 20 dimensions using speech processing. For example, Guerra and Lovey estimated linear regressions from quantities extracted from speech to match many of the dimensions of clinical assessment, and then combined them into one model for classifying dysarthria [4]. While such an approach allows comparison of the acoustic dimensions between the clinical and automated methods, any non-linear interaction between the acoustic measurements is missed and hence the predictive model is suboptimal. Moreover, only 12 of these 20 dimensions allow a relatively straightforward mapping. Instead, we adopt a machine learning approach, where we define a large number of features that can be reliably extracted from speech and let the learning algorithm pick out the features that are most useful in predicting the clinical rating.

II. Experimental Paradigm

Corpus:

Empirical evaluations reported in this paper were performed on data collected from 82 subjects, including 21 controls, in 116 assessment sessions. The data was collected through 2 clinics, namely OHSU and the Parkinson’s Institute, using a portable device designed specifically for the task and described further in [5]. Subjects were asked to perform several tasks designed to exercise different aspects of speech and non-speech motor control. The tasks were administered on a portable computer under the supervision of a clinician who was familiar with the computerized tests. As a clinical reference, the severity of each subject’s condition was measured by clinicians using the Unified Parkinson’s Disease Rating Scale (UPDRS), the current gold standard. In this study, we focus on the motor sub-scale of the UPDRS (mUPDRS), which spans from 0 for a healthy individual to 108 for extreme disability. The mean mUPDRS in the corpus is about 17.51, the standard deviation is about 11.29, and the mean absolute error (MAE) with respect to the mean is about 9.0.

Elicitation Tasks:

Speech was elicited from subjects under 3 different conditions to obtain evidence of hypokinetic dysarthria.

  • Sustained phonation: Subjects were instructed to phonate the vowel /a/ in a clear and steady voice for as long as possible. Voiced fundamental frequency (pitch), as well as measures of sequential cycle-to-cycle frequency (jitter) and amplitude (shimmer) variations, are particularly powerful parameters for assessing the functional and anatomical status of the larynx.

  • Diadochokinetic (DDK) task: The DDK task is often used as a clinical test to assess the functional capacities of the articulatory system. It involves production of syllable sequences containing consonant-vowel combinations with bilabial, alveolar, and velar places of articulation. Specifically, the subjects were asked to repeat the sequence of syllables /pa/, /ta/ and /ka/ continuously for about 10 seconds as fast and as clearly as they possibly could, in two separate recordings. The task allows examination of articulatory precision, as well as the ability to rapidly alternate articulators between segments. Accelerated or decelerated speech tempos, important cues for PD, have been observed in this task [6].

  • Reading task: Subjects were asked to read three passages that are often employed in speech pathology and are referred to as “The Rainbow Passage”, “The North Wind and The Sun”, and “The Grandfather Passage”. The reading task imposes additional cognitive processing during speech production and allows measurement of vocal intensity, voice quality, and speaking rate, including the number of pauses, length of pauses, length of phrases, duration of spoken syllables, voice onset time (VOT), sentence duration, and others.

III. Speech Processing

Criteria used by clinicians to rate hypokinetic dysarthria are often difficult to quantify. As mentioned earlier, we sidestep the difficult task of quantifying perceptual cues, and instead focus on extracting a large number of surface features. Information that can be extracted from the speech signal can be grouped into the frequency domain (e.g., pitch and spectral properties), the temporal domain (e.g., energy and duration), and the cepstral domain (e.g., spectral envelope). Note that the focus of this work is on surface features; as such, we do not utilize information related to phone- or word-identities, since that requires a speech recognizer optimized for this task.

As in most speech processing systems, we extract 25-millisecond frames using a Hanning window at a rate of 100 frames per second before computing the following features.
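Before listing the features, the framing step just described can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the 16 kHz sampling rate and the use of NumPy are assumptions.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25.0, rate=100):
    """Slice a signal into 25 ms Hanning-windowed frames at 100 frames/s."""
    frame_len = int(fs * frame_ms / 1000)   # samples per frame (400 at 16 kHz)
    hop = fs // rate                        # samples between frame starts (10 ms)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames

# One second of noise as a stand-in for speech; at 16 kHz this yields
# 400-sample frames with a 160-sample hop.
fs = 16000
x = np.random.randn(fs)
frames = frame_signal(x, fs)
```

All frame-level features below (pitch, entropy, cepstra, energy) would then be computed per row of `frames`.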

  1. Pitch: One of the key features in the frequency domain is pitch, which can be extracted using a standard pitch tracking algorithm. Briefly, the algorithm consists of two stages - a pitch detection algorithm that selects and scores local pitch candidates, and a post-processing algorithm such as Viterbi decoding that removes unlikely candidates using smoothness constraints. The local pitch candidates are typically selected using a normalized cross-correlation function (NCCF). The frames are classified into voiced and unvoiced types using the NCCF. Jitter measures the cycle-to-cycle frequency variation and is computed using first-order and second-order time derivatives of pitch.

  2. Spectral Entropy: Properties of the spectrum serve as a useful proxy for cues related to voicing and quality. Spectral entropy can be used to characterize “speechiness” of the signal and has been widely employed to discriminate speech from noise. As such, we compute the entropy of the log power spectrum for each frame, where the log domain was chosen to mirror perception.

  3. Cepstral Coefficients: The shape of the spectral envelope is captured by cepstral coefficients. Thirteen cepstral coefficients of each frame were augmented with their first- and second-order time derivatives.

  4. Segmental Duration and Frequency: In the time domain, apart from the energy at each frame, we compute the number and duration of voiced and unvoiced segments, which provide useful cues about speaking rate.

  5. Harmonicity: Laryngologists often rate the degree of hoarseness (or harshness) to assess the functioning of the larynx. Spectrograms and perceptual studies reveal that this perceived abnormality of the voice is related to loss of harmonic components [7]. As the degree of perceived hoarseness increases, more noise appears to replace the harmonic structure and as a result harmonic to noise ratio (HNR) decreases. Irregular vocal fold vibration causes random modulation of the source signal and affects the amplitude (shimmer) distribution of harmonics throughout the spectrum and its time period (jitter). In addition, the ratio of energy between first and second harmonics (H1/H2) for each voiced frame has been found to be useful for characterizing breathy voice resulting from incomplete closure of vocal folds.

    Using a harmonic model of voiced signal, we recently developed a regularized maximum likelihood estimation algorithm for computing the four quantities related to harmonic content [8]. Exploiting its harmonic structure, we estimate a sparse model of voiced speech which is then used to isolate the harmonic content in the recorded speech. The model allows the amplitude of the harmonics to vary smoothly over the duration of the frame and thus it is able to follow perturbations associated with shimmer and jitter. Apart from allowing us to compute HNR and H1/H2, by isolating the voiced speech, we simplify the estimation of shimmer and jitter.
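The core idea of estimating HNR by fitting a harmonic model can be sketched as follows. This simplified version fits fixed harmonic amplitudes to one voiced frame by ordinary least squares; the actual algorithm in [8] is a regularized maximum-likelihood estimator that also lets amplitudes vary within the frame, so the function and parameters here are illustrative assumptions.

```python
import numpy as np

def hnr_harmonic_fit(frame, fs, f0, n_harm=10):
    """Estimate HNR (dB) by least-squares fitting a fixed-amplitude harmonic
    model at multiples of f0; the residual serves as the noise estimate."""
    t = np.arange(len(frame)) / fs
    # Design matrix of cosine/sine pairs at harmonics of f0.
    cols = []
    for k in range(1, n_harm + 1):
        cols.append(np.cos(2 * np.pi * k * f0 * t))
        cols.append(np.sin(2 * np.pi * k * f0 * t))
    H = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(H, frame, rcond=None)
    harmonic = H @ coef           # reconstructed harmonic content
    noise = frame - harmonic      # residual treated as noise
    return 10 * np.log10(np.sum(harmonic ** 2) / np.sum(noise ** 2))

# A clean 200 Hz two-harmonic tone plus weak noise should yield a high HNR.
fs, f0 = 16000, 200.0
t = np.arange(400) / fs
frame = (np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)
         + 0.01 * np.random.randn(400))
hnr = hnr_harmonic_fit(frame, fs, f0)
```

H1/H2 would follow directly from the fitted coefficients of the first and second harmonic pairs.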

The features computed at the frame level need to be summarized into a global feature vector of fixed dimension for each subject before we can apply models for predicting clinical ratings. Features extracted from voiced regions tend to differ in nature from those extracted from unvoiced regions. These differences were preserved by summarizing features in voiced and unvoiced regions separately. Each feature was summarized across all frames from the voiced (unvoiced) segments in terms of standard distribution statistics such as mean, median, variance, minimum and maximum. Speech pathologists often plot and examine the interaction between quantities such as pitch and energy to fully understand the capacity of speech production [9]. We capture such interactions by computing the covariance matrix (upper triangular elements) of frame-level feature vectors over voiced (unvoiced) segments. The segment-level duration statistics, including mean, median, variance, minimum and maximum, were computed for both voiced and unvoiced regions. The summary features were concatenated into a global feature vector for each subject. It has been suggested that many speech features are better represented in the log domain, so we also performed experiments augmenting the global feature vector with its mirror in the log domain. The resulting features were computed separately for the three elicitation tasks (phonation, DDK and reading) and concatenated into one vector, up to 17K long, for each subject.
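The summarization step can be sketched as follows for one set of frames; this simplified version shows only the distribution statistics and the upper-triangular covariance terms, not the full voiced/unvoiced split or the concatenation across tasks.

```python
import numpy as np

def summarize_frames(feats):
    """Collapse a (n_frames, n_feats) matrix of frame-level features into one
    fixed-length vector: per-feature distribution statistics plus the
    upper-triangular elements of the covariance matrix."""
    stats = np.concatenate([feats.mean(0), np.median(feats, 0),
                            feats.var(0), feats.min(0), feats.max(0)])
    cov = np.cov(feats, rowvar=False)
    iu = np.triu_indices(cov.shape[0])   # upper triangle incl. the diagonal
    return np.concatenate([stats, cov[iu]])

# e.g. 3 frame-level features over 50 voiced frames:
# 5 statistics x 3 features + 6 covariance entries = 21 values.
vec = summarize_frames(np.random.randn(50, 3))
```

The covariance entries are what capture interactions such as pitch-versus-energy mentioned above.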

IV. Regression Model

The clinical rating of the severity of Parkinson’s disease, as measured by the motor sub-scale of UPDRS (mUPDRS), was predicted from the extracted speech features using several regression models estimated by support vector machines. Epsilon-SVR and nu-SVR were employed with several kernel functions, including polynomial, radial basis function and sigmoid kernels [10], [11]. The models were evaluated using 20-fold cross-validation and the results were measured using mean absolute error (MAE). Not all the features extracted from speech are expected to be useful, and in fact many are likely to be noisy. We apply a standard feature selection algorithm and evaluate several models using cross-validation to pick the one with optimal performance. One weakness of most feature selection algorithms is that they compute the utility of each element separately rather than over subsets. To understand the contribution of the different features, we introduced them incrementally; their performance is reported in Table I. The first regression model was estimated with the frequency-domain, temporal-domain and cepstral-domain features described in (1–3) under Section III. Subsequently, log-space features, segmental durations and harmonicity were introduced.

TABLE I.

Mean absolute error (MAE) measured on a 20-fold cross-validation for predicting severity of Parkinson’s disease (mUPDRS) from speech.

Speech Features # Features MAE
(a) Baseline 7K 6.14
(b) (a) + log-space 14K 6.06
(c) (b) + duration 14K 5.85
(d) (c) + HNR + H1/H2 15K 5.81
(e) (d) + jitter + shimmer 15K 5.66

The baseline system contains features related to pitch, spectral entropy and cepstral coefficients, in all about 7K features per subject. From among these features, automatic feature selection picks about 800 features to predict the UPDRS scores with an MAE of about 6.14 and a standard deviation of about 2.63. Recall that guessing the mean UPDRS score on this data incurs an MAE of about 9.0. As a check for overfitting, we shuffled the labels, selected features and then learned the regression using the same algorithms. The resulting models performed significantly worse, at about 8.5 MAE. To put the reported results in perspective, studies show that clinicians do not agree with each other completely, attaining a correlation of about 0.82 and committing an error of about 2 points. The improvement in prediction with the baseline model is statistically significant. The mapping of features into the log space provides a small and consistent gain, but not as large as the ones reported in [12], whose experimental setup (utterance-level rather than subject-level train/test split), number of subjects (only 42), features and models are significantly different from ours. The frequency and duration of voiced segments proved to be useful cues in predicting mUPDRS, as expected from clinical observations [6]. Finally, the HNR and the ratio of energy in the first to second harmonic, estimated using the algorithm proposed in this paper, provide further improvement in predicting mUPDRS. The gains from harmonicity are consistent with previous studies on classification of dysarthria [4]. Among all combinations of features listed in the table, the size of the optimal feature set was about 550 features for model (e). The best performance was consistently obtained with epsilon-SVR using 3rd-degree polynomial kernel functions.
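The evaluation protocol, epsilon-SVR with a 3rd-degree polynomial kernel scored by MAE under 20-fold cross-validation, can be sketched with scikit-learn. The paper does not specify its software; the synthetic data, dimensions and hyperparameters below are illustrative assumptions only.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict, KFold

# Synthetic stand-in: 116 "sessions" with 40 features and a noisy target
# that depends linearly on the first 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(116, 40))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=116)

# Epsilon-SVR with a 3rd-degree polynomial kernel (coef0=1 so the kernel
# includes lower-degree terms), evaluated by 20-fold CV and MAE.
model = make_pipeline(StandardScaler(),
                      SVR(kernel="poly", degree=3, coef0=1.0, C=1.0))
pred = cross_val_predict(model, X, y,
                         cv=KFold(n_splits=20, shuffle=True, random_state=0))
mae = np.mean(np.abs(pred - y))
```

A mean-guessing baseline (MAE of about 9.0 on the real corpus) is the natural point of comparison, as in the paragraph above.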

In order to gauge the relative importance of the three elicitation tasks and the extracted features, we selected the most useful features using the elastic net, which provides a compromise between the ridge-regression penalty and the lasso penalty [13]. Among the top 43 features that performed distinctly better than the rest, the number of features from the different sources of information is summarized in Figure 1. The numbers of features selected from the three elicitation tasks, namely the phonation task, the DDK task and the reading task, were found to be 4, 18 and 21 respectively. Surprisingly, a majority of the features are from the reading task, a task that was omitted in many previous studies (e.g., [12], [4]). As expected, the features extracted from the voiced segments (34 vs. 8) are more useful than those from the unvoiced segments. Duration computed in the reading task is useful, presumably because systematic variations are better captured in the less constrained task. The features from the log space are slightly more numerous than others (23 vs. 18), and when they are chosen they are almost always covariance features. Interestingly, the number of covariance features, which capture the interaction between, for example, pitch and loudness, is disproportionately higher than that of singleton features (35 vs. 8).
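Elastic-net selection can be sketched with scikit-learn on synthetic stand-in data; the `l1_ratio` parameter trades off the lasso (sparse) and ridge (grouped) penalties of [13], and features with nonzero coefficients are the "selected" ones. The dimensions and tuning below are assumptions, not the paper's actual setup.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Stand-in data: 116 sessions, 200 features, of which only the first 10
# (with coefficients well away from zero) drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(116, 200))
y = X[:, :10] @ rng.uniform(1.0, 2.0, size=10) + rng.normal(scale=0.1, size=116)

# Cross-validated elastic net, halfway between lasso and ridge.
enet = ElasticNetCV(l1_ratio=0.5, cv=5, max_iter=5000, random_state=0).fit(X, y)
selected = np.flatnonzero(enet.coef_)   # indices of selected features
```

Ranking features by coefficient magnitude would give the kind of "top 43" list discussed above.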

Fig. 1.

Fig. 1.

The type of features among the top 43 features that performed distinctly better than the rest

V. Summary

This paper describes an experimental paradigm for the elicitation of speech, methods for extracting useful features from speech, and models for predicting the severity of Parkinson’s disease as reflected in the motor sub-scale of UPDRS, and highlights useful features. Clinicians rely on perceptual voice qualities such as hoarseness for assessment. We formulate an algorithm to estimate the harmonic-to-noise ratio, the ratio of energy in the first to second harmonic, and shimmer, which are closely related to hoarseness. Exploiting the harmonics in voiced signals, our proposed algorithm frames the task as a maximum likelihood estimation problem, and its utility is demonstrated in predictive models. Apart from the frequency-domain, time-domain and cepstral-domain features used in previous studies of speech pathology, features that capture their pairwise interactions are shown to be useful. Among all features, a disproportionately large number from the reading task, a task often ignored, contribute towards predicting the severity of Parkinson’s disease. Segmental duration or speaking-rate related features are more meaningful when extracted from the reading task, which is less constrained than the phonation and DDK tasks. Finally, the epsilon support vector machine with a polynomial kernel of degree 3 is found to be most effective, operating at about 5.7 mean absolute error as measured on a 20-fold cross-validation over 116 assessments.

Recall that this study examines only surface features; cues related to phone- or word-identities could provide additional useful information. Similarly, information from unvoiced segments has not been fully exploited so far. Cues from formant trajectories could also be useful in quantifying the versatility of speech production, especially the function of muscles involved in shaping the oral cavity. In this preliminary study, we have not screened subjects for anatomical lesions on the vocal folds, which could potentially mislead the speech analysis. Finally, the observed residual errors may also be due to problems with the reliability of mUPDRS (e.g., as reported in studies on 400 PD subjects [14]), and to the lack of adequate attention to the assessment of speech disorders in the UPDRS.

VI. Acknowledgment

This work was supported by Kinetics Foundation and NSF Award IIS 0905095. We would like to thank L. Holmstrom, K. Kubota and J. McNames for facilitating the study, designing the data collection and making the data available to us. We are extremely grateful to our clinical collaborators M. Aminoff, C. Cristine, J. Tetrud, G. Liang, F. Horak, S. Gunzler, and B. Marks for performing the clinical assessments and collecting the speech data from the subjects.

References

  • [1].Darley FL, et al. , “Differential diagnostic patterns of dysarthria,” J Speech Hear Res, vol. 12, no. 2, pp. 246–69, June 1969. [DOI] [PubMed] [Google Scholar]
  • [2].Movement Disorder Society Task Force on Rating Scales for Parkinson’s Disease, “The Unified Parkinson’s Disease Rating Scale (UPDRS): status and recommendations,” Mov Disord, vol. 18, no. 7, pp. 738–50, July 2003. [DOI] [PubMed] [Google Scholar]
  • [3].Darley FL, et al. , Motor Speech Disorders. Saunders Company, 1975. [Google Scholar]
  • [4].Guerra EC and Lovey DF, “A modern approach to dysarthria classification,” in Proceedings of the IEEE Conference on Engineering in Medicine and Biology Society (EMBS), 2003, pp. 2257–2260. [Google Scholar]
  • [5].Goetz CG, et al. , “Testing objective measures of motor impairment in early parkinson’s disease: Feasibility study of an at-home testing device,” Mov Disord, vol. 24, no. 4, pp. 551–6, March 2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [6].Skodda S, et al. , “Instability of syllable repetition as a model for impaired motor processing: is parkinson’s disease a ”rhythm disorder”?” J Neural Transm, March 2010. [DOI] [PubMed] [Google Scholar]
  • [7].Yumoto E, et al. , “Harmonics-to-noise ratio as an index of the degree of hoarseness,” J Acoust Soc Am, vol. 71, no. 6, pp. 1544–9, June 1982. [DOI] [PubMed] [Google Scholar]
  • [8].Asgari M and Shafran I, “Extracting cues from speech for predicting severity of parkinson’s disease,” in Machine Learning for Signal Processing, 2010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [9].Titze IR, Principles of Voice Production. Prentice Hall, 1994. [Google Scholar]
  • [10].Schölkopf B, et al. , “New support vector algorithms,” Neural Computation, vol. 12, no. 5, pp. 1207–1245, 2000. [DOI] [PubMed] [Google Scholar]
  • [11].Schölkopf B, et al. , “Estimating the support of a high-dimensional distribution,” Neural Computation, vol. 13, no. 7, pp. 1443–1471, 2001. [DOI] [PubMed] [Google Scholar]
  • [12].Tsanas A, et al. , “Enhanced classical dysphonia measures and sparse regression for telemonitoring of Parkinson’s disease progression,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2010, pp. 594–597. [Google Scholar]
  • [13].Zou H and Hastie T, “Regularization and variable selection via the elastic net,” Journal Of The Royal Statistical Society Series B, vol. 67, no. 2, pp. 301–320, 2005. [Online]. Available: http://ideas.repec.org/a/bla/jorssb/v67y2005i2p301-320.html [Google Scholar]
  • [14].Siderowf A, et al. , “Test-retest reliability of the unified parkinson’s disease rating scale in patients with early parkinson’s disease: results from a multicenter clinical trial,” Mov Disord, vol. 17, no. 4, pp. 758–63, July 2002. [DOI] [PubMed] [Google Scholar]