Journal of Speech, Language, and Hearing Research (JSLHR). 2021 Mar 1;64(6 Suppl):2276–2286. doi: 10.1044/2020_JSLHR-20-00288

The Effects of Symptom Onset Location on Automatic Amyotrophic Lateral Sclerosis Detection Using the Correlation Structure of Articulatory Movements

Alan Wisler a, Kristin Teplansky a, Daragh Heitzman b, Jun Wang a,c
PMCID: PMC8740667  PMID: 33647219

Abstract

Purpose

Kinematic measurements of speech have demonstrated some success in automatic detection of early symptoms of amyotrophic lateral sclerosis (ALS). In this study, we examined how the region of symptom onset (bulbar vs. spinal) affects the ability of data-driven models to detect ALS.

Method

We used a correlation structure of articulatory movements combined with a machine learning model (i.e., an artificial neural network) to detect differences between people with ALS and healthy controls. The performance of this system was evaluated separately for participants with bulbar onset and spinal onset to examine how region of onset affects classification performance. We then performed a regression analysis to examine how different severity measures and region of onset affect model performance.

Results

The proposed model was significantly more accurate in classifying the bulbar-onset participants, achieving an area under the curve of 0.809, compared with 0.674 for spinal-onset participants. The regression analysis, however, found that differences in classifier performance across participants were better explained by their speech performance (intelligible speaking rate); no significant differences were observed based on region of onset once intelligible speaking rate was accounted for.

Conclusions

Although we found a significant difference in the model's ability to detect ALS depending on the region of onset, this disparity can be primarily explained by observable differences in speech motor symptoms. Thus, when the severity of speech symptoms (e.g., intelligible speaking rate) was accounted for, symptom onset location did not affect the proposed computational model's ability to detect ALS.


Amyotrophic lateral sclerosis (ALS) is a fatal neurodegenerative disease that inhibits the brain's ability to control muscle movements (Kiernan et al., 2011). The loss of motor control is initially localized in either the limbs or the bulbar region; however, as the disease progresses, the symptoms spread to involve muscle groups in multiple anatomical regions. In approximately 70% of ALS cases, patients are diagnosed with spinal onset and experience initial symptoms in the limbs, which impair common daily activities such as walking or writing; speech is often initially preserved for a variable period of time. In approximately 25% of cases, patients are diagnosed with bulbar onset, where initial symptoms primarily affect tasks such as speaking and swallowing (Kiernan et al., 2011). In most cases, ALS results in death by respiratory paralysis within 5 years of disease onset. Currently, a universal biomarker or diagnostic test for identifying patients with ALS does not exist. Instead, diagnosis typically relies on clinical assessment, electromyography, and laboratory testing that exclude other diseases (Brown & Al-Chalabi, 2017). The ability to identify and treat symptoms early in ALS is crucial to the success of therapeutic intervention, which can both extend the patient's life and improve their remaining quality of life (Kiernan et al., 2011).

As speech data can be collected inexpensively and noninvasively, it is an ideal source of information for generating diagnostic assessments. As a result, acoustic analyses have been used for a variety of neurological diseases. The efficacy of speech-based diagnostic models has been explored in a range of conditions, including Parkinson's disease (Benba et al., 2015; Orozco-Arroyave et al., 2016; Williamson et al., 2015), cognitive impairment (Garrard et al., 2014; Orimaye et al., 2017; Roark et al., 2011; Yu et al., 2014), and depression (Cummins et al., 2011; Sturim et al., 2011; Williamson et al., 2014). Generally speaking, these models collect some type of sensor data (acoustic, kinematic, etc.), apply signal processing to extract salient information from the raw signals, then use machine learning to map from the extracted features to clinically relevant information (e.g., whether or not a person has a given disease). One approach for extracting salient information from speech acoustic or kinematic data that has shown promise across a range of different pathologies is to analyze how different characteristic measures of speech (such as a localized acoustic feature or the position of a certain articulator) are correlated with one another across time. Prior studies have consistently observed reduced complexity in the structure of these correlations for individuals with communication disorders relative to healthy controls (Williamson et al., 2013, 2015; Yu et al., 2014). This field of research has grown rapidly in recent years and yielded important insights into the feasibility of using speech signal data in clinical practice.

Particularly for ALS, researchers have also used acoustic analysis to predict the progression of ALS and to detect symptoms of ALS at an earlier stage than currently available measures (Bandini, Green, Taati, et al., 2018; Norel et al., 2018; Wang, Kothalkar, Cao, & Heitzman, 2016). A study by An et al. (2018) used convolutional neural networks to detect the subtle changes in articulatory behavior that occur in highly intelligible patients with ALS, achieving an accuracy of 80.9%. More recently, researchers have used video-based signals of speech and nonspeech facial movements to distinguish people with ALS from healthy controls with an accuracy of up to 88.9% (Bandini, Green, Taati, et al., 2018). While these results are promising, prior studies have not investigated the role of onset region in the performance of these methods. These studies either did not report the region of onset for their participants (Norel et al., 2018; Wang, Kothalkar, Cao, & Heitzman, 2016) or used a data set that includes both bulbar- and spinal-onset participants (Bandini, Green, Taati, et al., 2018). Shellikeri et al. (2016) reported the region of onset but focused their analysis on bulbar-onset participants only. Thus, the extent to which the site of disease onset influences the overall accuracy of current speech-based models remains unclear. To improve the capabilities of speech-based diagnostics for early detection, a better understanding of the effects of symptom onset location is needed.

Recent work suggests that articulatory measures are more sensitive than acoustic measures in capturing speech decline due to ALS (Green et al., 2013; Wang et al., 2018). Kinematic sensors can be used to track an individual's tongue and lip movements during speech production tasks and effectively capture changes in speech motor control. Movement parameters such as duration, speed, range, and stability measures are commonly used to provide a global understanding of articulatory behavior (Green et al., 2013; Kuruvilla-Dugdale & Chuquilin-Arista, 2017; Kuruvilla-Dugdale & Mefferd, 2017; Lee & Bell, 2018; Rong et al., 2018; Shellikeri et al., 2016; Yunusova et al., 2008). Based on prior research on ALS speech, at a certain point in disease progression, articulatory movements become longer in duration and slower in speed than those of healthy controls (Shellikeri et al., 2016; Yunusova et al., 2008). Speakers with ALS may also use a smaller tongue movement range and more stable movement patterns (Mefferd et al., 2014; Shellikeri et al., 2016). Bandini, Green, Wang, et al. (2018) used fundamental movement parameters along with higher derivatives of position, such as velocity, acceleration, and jerk, to classify subgroups (i.e., presymptomatic and symptomatic) of speakers with ALS using a support vector machine with a quadratic kernel. The authors reported 87% accuracy in distinguishing phases of bulbar decline. Existing research thus suggests that kinematic measures are sensitive to changes in articulatory behavior due to ALS; however, few studies have been able to capture changes in speakers with ALS who perceptually sound healthy (i.e., highly intelligible) from kinematic measures only (without using acoustic data). Furthermore, the participant groups used in these studies were composed either partially or entirely of individuals with bulbar onset. There is little evidence in the literature that kinematic measures will provide a sensitive marker of disease progression for individuals with spinal-onset ALS.

In this article, we investigated whether the correlation structure of articulatory movements derived from individuals with ALS during speech production can be used to predict their disease state. A machine learning model (an artificial neural network [ANN]) was used as a classifier to discriminate between participants with ALS and healthy controls. We examined the ability of the proposed model to identify speech patterns unique to ALS and how performance differs depending on where the patient's first symptoms occur (i.e., bulbar or spinal onset). We hypothesized that the proposed model would be more accurate in classifying bulbar-onset participants than spinal-onset participants, as bulbar-onset participants generally exhibit speech symptoms earlier. In addition, we conducted a regression analysis aimed at predicting the model's classification accuracy for the different participants based on different clinical severity measures, including each participant's intelligible speaking rate (ISR) and their Amyotrophic Lateral Sclerosis Functional Rating Scale–Revised (ALSFRS-R) score (Cedarbaum et al., 1999). The goals of this analysis were to determine which clinical measures are most predictive of classifier performance and to examine whether differences in performance across the bulbar- and spinal-onset groups can be explained by clinical measures of disease severity. Our expectation was that differences in classifier performance across the two groups (bulbar vs. spinal onset) could be explained by clinical severity measures, and as a result, region of onset would not be a significant factor in determining classification performance in the multivariate model.

Method

Data

This study included 41 participants diagnosed with ALS (23 men, 18 women) and 24 healthy controls. All participants signed a consent form prior to data collection. The study was approved by the institutional review boards of the University of Texas at Austin (UT IRB 2019-12-0006) and the University of Texas at Dallas (the last author's previous institution; UTD IRB14-03). Certified neurologists confirmed the diagnosis of ALS (31 participants were diagnosed with spinal onset and 10 with bulbar onset). The average age at the first visit was 58.5 years. All participants were native speakers of English and exhibited normal hearing. A hearing screening was performed at the first and last data collection sessions: each participant was screened at 25 dB HL at 1000, 2000, and 4000 Hz and scored as a pass if they responded to all frequencies in one or both ears. The data collection sessions were scheduled every 4–6 months, depending on the site of onset (i.e., bulbar or spinal). As the goal of this project was early detection, only data from the first of these sessions were analyzed in this article.

At each session, self-reported data were gathered on the ability to complete activities of daily living using the ALSFRS-R (Cedarbaum et al., 1999). The ALSFRS-R is composed of 12 questions, each of which measures a unique aspect of motor function (such as speaking, walking, or breathing) along a 5-point ordinal scale (4 = normal function and 0 = minimal capabilities). The sum across these 12 areas is referred to as the total score (0–48) and provides an estimate of the overall motoric capabilities of a patient with ALS. Additionally, the ALSFRS-R Bulbar subscore (0–12) measures motoric capabilities on tasks related to bulbar motor function (speaking, swallowing, and salivating). The Speech Intelligibility Test (SIT) software (Dorsey et al., 2007) was used to obtain measures of speech performance. The SIT software randomly generates sentences that increase in length from 5 to 15 words. The SIT sentences were audio-recorded and transcribed by a licensed speech-language pathologist, and the SIT software computed sentence-level speech intelligibility (the percentage of words understood by a listener) and speaking rate (the number of words produced per minute). To derive the ISR, we multiplied speech intelligibility by speaking rate; this measure provides an index of the number of intelligible words per minute. A detailed summary of the demographic information and clinical characteristics for each group of participants is included in Table 1.

The participants were asked to produce 20 predefined phrases, such as “Call me back when you can” and “I need some assistance,” at their habitual loudness and speaking rate; that is, they were instructed to produce speech as they normally would in daily conversation. Each phrase was produced up to 4 times. The phrases were selected because they are commonly used in daily life; for the full list, please see Wisler et al. (2019).
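To make the ISR derivation concrete, consider hypothetical values (not drawn from the study data): a speaker with 92% sentence intelligibility and a speaking rate of 150 words per minute would have an ISR of 0.92 × 150 = 138 intelligible words per minute.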

Table 1.

Summary of the demographic breakdowns of each participant group, including averages and standard errors.

Attribute | Healthy controls | ALS: spinal onset | ALS: bulbar onset
Sex | 11 M/13 F | 19 M/12 F | 4 M/6 F
Age (years) | 60.5 (± 1.89) | 57.9 (± 1.75) | 54.2 (± 6.59)
Intelligible speaking rate (words/min) | 176.7 (± 4.91) | 142.5 (± 6.21) | 99.2 (± 20.94)
ALSFRS-R (total) | n/a | 35.5 (± 1.40) | 39.4 (± 1.52)
ALSFRS-R (Bulbar subscore) | n/a | 10.3 (± 0.35) | 8.2 (± 0.61)

Note. ALS = amyotrophic lateral sclerosis; M = male; F = female; ALSFRS-R = Amyotrophic Lateral Sclerosis Functional Rating Scale–Revised.

Motion Tracking Device

To collect synchronized acoustic and kinematic data, the NDI wave system (Northern Digital, Inc.) and a Shure Microflex microphone were used. The microphone was positioned approximately 15 cm away from the speaker's mouth. The sampling rate of the acoustic signal was 22 kHz. The Wave system has a sampling rate of 100 Hz and a spatial precision of approximately 0.5 mm (Berry, 2011). To derive kinematic data, an optimal sensor setup was used (Wang, Samal, et al., 2016). A lightweight helmet with a reference sensor at the forehead was worn to isolate articulatory movements from head movements. Dental glue (GluStitch) and tape were used to attach four small sensors to the tongue tip, tongue back, upper lip, and lower lip. Each sensor captures movements along the y dimension (superior–inferior movements), x dimension (lateral movements), and z dimension (anterior–posterior movements). Figure 1 provides a picture of the NDI wave system connected to a participant for recording, along with a diagram illustrating the placement of the four sensors.

Figure 1.

Illustrations of the NDI wave system used to collect the kinematic movement data from the articulators during speech production. (A) A picture of the system when connected to a participant for data recording. (B) A diagram illustrating the placement of the sensors on the tongue and lips. TT = tongue tip; TB = tongue-back; UL = upper lip; LL = lower lip.

Data Preprocessing

The first step in the data preparation procedure was to normalize the kinematic data by subtracting the mean from each kinematic signal. The signals were then smoothed using a 20-Hz low-pass Butterworth filter. To condense the three-dimensional kinematic trajectory of each articulator into a one-dimensional signal, we applied principal component analysis (PCA). PCA is an effective way of reducing the dimensionality of kinematic signals because it identifies the dimension of movement that captures the maximum variance in the articulatory motion (Mefferd, 2016). The resulting four-dimensional kinematic signal (one dimension per articulator) was then used for the correlation structure analysis. All of the data preprocessing procedures, as well as the signal processing, machine learning, and statistical analyses introduced in the following sections, were conducted in MATLAB.
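The analyses themselves were implemented in MATLAB; the following Python sketch only illustrates the preprocessing steps described above (mean removal, 20-Hz low-pass Butterworth filtering, and PCA reduction to one dimension) under stated assumptions: a 100-Hz sampling rate (the Wave system's rate), an N × 3 array of x, y, z positions per sensor, and an arbitrarily chosen filter order, which the article does not specify.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_articulator(xyz, fs=100.0, cutoff=20.0):
    """Reduce one articulator's 3-D trajectory (an N x 3 array of x, y, z
    positions sampled at `fs` Hz) to a 1-D movement signal: mean removal,
    20-Hz low-pass Butterworth filtering, and projection onto the first
    principal component (the direction of maximum movement variance)."""
    # 1. Subtract the mean from each spatial dimension.
    centered = xyz - xyz.mean(axis=0)

    # 2. Zero-phase low-pass Butterworth filter at the 20-Hz cutoff.
    #    (The filter order is not stated in the article; 4 is assumed.)
    b, a = butter(4, cutoff / (fs / 2.0), btype="low")
    smoothed = filtfilt(b, a, centered, axis=0)

    # 3. Project onto the first principal component of the smoothed data.
    _, _, vh = np.linalg.svd(smoothed, full_matrices=False)
    return smoothed @ vh[0]  # 1-D signal of length N
```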

Correlation Structure Analysis

The spatiotemporal correlation structure approach used in this article was first introduced for seizure detection based on electroencephalogram (EEG) signals (Williamson et al., 2012) and has since been applied to detect changes in articulatory coordination resulting from different neurological conditions, such as Parkinson's disease or cognitive impairment (Helfer et al., 2014; Williamson et al., 2013, 2015; Yu et al., 2014). The broad applicability of this approach stems from the fact that it can be applied to virtually any multichannel signal; it has previously been applied to EEG signals, acoustic feature signals, and articulation signals. Although characteristics of the data being analyzed may alter some specific details of how the approach is implemented, the overall approach remains largely consistent across modalities. For example, because we analyzed short speech samples (approximately 2–3 s in duration) rather than EEG recordings that span several hours, we could not use the same frame sizes as those used in Williamson et al. (2012). However, despite differences in the studied populations and the data modalities, there were consistent patterns in our observations of the correlation structure, specifically that data drawn from a pathological state showed greater energy concentration in the initial set of eigenvalues.

To calculate the correlation structure, we first divided each of the four articulation signals (tongue tip, tongue back, upper lip, and lower lip) into a set of overlapping $n_s$-length frames. While it has been common in previous research to make these frames a fixed duration, doing so for this project was not ideal due to the variability in phrase durations. Because of this, we instead set the frame length to be equal to half the duration of the phrase and considered all frames in the speech signal. Each frame was then normalized using z-scoring. We refer to the array representing the normalized signal data for sensor $s$ at time index $i$ as $x_i^s$. The correlation matrix for sensor pair $s_1$, $s_2$ was calculated by

$$
R^{s_1 s_2} =
\begin{bmatrix}
x_1^{s_1} x_1^{s_2} & x_1^{s_1} x_2^{s_2} & \cdots & x_1^{s_1} x_{n_s}^{s_2} \\
x_2^{s_1} x_1^{s_2} & x_2^{s_1} x_2^{s_2} & \cdots & x_2^{s_1} x_{n_s}^{s_2} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n_s}^{s_1} x_1^{s_2} & x_{n_s}^{s_1} x_2^{s_2} & \cdots & x_{n_s}^{s_1} x_{n_s}^{s_2}
\end{bmatrix}
\tag{1}
$$

The correlation matrices for each pair of signals were then concatenated into a single space delay correlation matrix in the following manner:

$$
R =
\begin{bmatrix}
R^{TT\,TT} & R^{TT\,TB} & R^{TT\,UL} & R^{TT\,LL} \\
R^{TB\,TT} & R^{TB\,TB} & R^{TB\,UL} & R^{TB\,LL} \\
R^{UL\,TT} & R^{UL\,TB} & R^{UL\,UL} & R^{UL\,LL} \\
R^{LL\,TT} & R^{LL\,TB} & R^{LL\,UL} & R^{LL\,LL}
\end{bmatrix}
\tag{2}
$$

We then calculated the eigendecomposition of this correlation matrix and used the rank-ordered eigenvalues as features for our classification analysis. This provides a concise low-dimensional representation of the overall complexity of the correlation structure. In theory, a simpler correlation structure may be accurately represented by only a few eigenvectors, leading to higher magnitudes in the initial set of eigenvalues followed by a steep drop-off. In contrast, more complex correlation structures will exhibit less amplitude in the initial set of eigenvalues and a more gradual drop-off. To better understand how this data representation might look for the phrase-level speech kinematic data analyzed in this article, Figure 2 presents sample visualizations of the correlation matrices of a healthy control, a participant with spinal ALS, and a participant with bulbar ALS. Based on the eigenvalue plots presented in Figure 2D, the speech sample drawn from the healthy speaker exhibits a more complex correlation structure than the participants with ALS, as illustrated by the slower decay in the eigenvalues. Similarly, the more gradual decay in eigenvalues for the spinal-onset participant relative to the bulbar-onset participant indicates a more complex correlation structure (closer to that of the healthy control).
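As with the preprocessing, the published analysis was carried out in MATLAB; the Python sketch below shows one way to assemble the space delay correlation matrix of Equations 1 and 2 and extract its rank-ordered eigenvalues. The frame hop of one sample and the 1/num_frames normalization are assumptions not specified in the article.

```python
import numpy as np

def correlation_features(signals, sensors=("TT", "TB", "UL", "LL")):
    """Build the space delay correlation matrix of Equations 1 and 2 for one
    phrase and return its rank-ordered eigenvalues (largest first).

    `signals` maps each sensor name to its 1-D kinematic signal for the
    phrase; all four signals are assumed to have the same length N."""
    N = len(signals[sensors[0]])
    n_s = N // 2  # frame length set to half the phrase duration

    # Collect all overlapping n_s-length frames (a hop of one sample is
    # assumed) and z-score each frame.
    framed = {}
    for s in sensors:
        frames = np.stack([signals[s][i:i + n_s] for i in range(N - n_s + 1)])
        frames = (frames - frames.mean(axis=1, keepdims=True)) / frames.std(axis=1, keepdims=True)
        framed[s] = frames  # shape: (num_frames, n_s)

    # Pairwise correlation blocks R^{s1 s2} (Equation 1): entry (i, j) is the
    # correlation between delay i of sensor s1 and delay j of sensor s2,
    # accumulated across frames (the 1/num_frames scaling is an assumption).
    num_frames = framed[sensors[0]].shape[0]
    blocks = [[framed[s1].T @ framed[s2] / num_frames for s2 in sensors] for s1 in sensors]

    # Concatenate the blocks into the full space delay matrix (Equation 2)
    # and take its eigendecomposition; the matrix is symmetric by construction.
    R = np.block(blocks)
    eigvals = np.linalg.eigvalsh(R)
    return eigvals[::-1]  # largest eigenvalue first
```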

Figure 2.

Heat maps of the space delay correlation matrices for (A) a sample healthy control, (B) a sample participant with spinal-onset amyotrophic lateral sclerosis (ALS), and (C) a sample participant with bulbar-onset ALS. (D) A comparison of the eigenspectra of the three correlation matrices. The faster drop-off in eigenvalues observed for the participants with ALS indicates a less complex correlation structure. TT = tongue tip; TB = tongue-back; UL = upper lip; LL = lower lip.

Machine Learning Analysis

The goal of this analysis was to develop a data-driven model to map the measures extracted from the previously described correlation structure analysis onto predictions about whether or not the kinematic motions of a given phrase of speech were produced by someone with ALS. To accomplish this, we utilized a feedforward ANN containing a single hidden layer with six artificial neurons. A diagram of the proposed model is displayed in Figure 3A. ANNs are a type of statistical learning model that is loosely based on the structure of the human brain. As illustrated in Figure 3B, every artificial neuron in the hidden layer learns weights to create a linear combination of the input features and then passes the resulting combination through a nonlinear activation function. The two neurons in the output layer then use the same process to map the outputs of the hidden layer neurons into predictions as to whether or not a given feature vector is likely to have come from one of the participants with ALS. The appeal of ANNs stems from the fact that, despite being made up of relatively simple building blocks (a single neuron differs only slightly from a standard linear model), they can be designed to address virtually any statistical learning challenge with the appropriate architecture.
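As an illustration of the neuron computation described here and in Figure 3B, the following Python sketch implements a forward pass through a network of this shape (six hidden neurons, two output neurons). The tanh activation and softmax-style output normalization are assumptions; the article does not state which activation functions were used.

```python
import numpy as np

def ann_forward(x, W1, b1, W2, b2):
    """Forward pass through a network with the shape described in Figure 3:
    six hidden neurons and two output neurons. Each neuron multiplies its
    inputs by learned weights, sums them, and applies a nonlinearity.

    x: feature vector (rank-ordered eigenvalues); W1: (6, d) hidden weights,
    b1: (6,) biases; W2: (2, 6) output weights, b2: (2,) biases."""
    h = np.tanh(W1 @ x + b1)   # hidden layer: weighted sum + nonlinearity (tanh assumed)
    z = W2 @ h + b2            # output layer: weighted sum of hidden outputs
    e = np.exp(z - z.max())    # softmax-style normalization (assumed)
    return e / e.sum()         # [P(control), P(ALS)]
```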

Figure 3.

(A) The architecture of the artificial neural network used in this article to map from the eigenvalues of the correlation structure to predictions of a participant's disease state (healthy vs. amyotrophic lateral sclerosis). (B) The structure of the eight artificial neurons in this network (six in the hidden layer, two in the output layer). Each neuron's inputs are multiplied by unique weights and then summed, as in a standard linear model. This sum is then passed through a nonlinear activation function.

The ANN was trained to learn the mapping between the input eigenvalue features and the relevant class label (healthy control vs. ALS) using scaled conjugate gradient back-propagation. To evaluate the out-of-sample performance of the ANN, we used leave-one-participant-out cross-validation. Thus, at each stage of the cross-validation loop, the classifier was trained on 64 of the participants, and predictions were then generated for each speech sample from the single held-out participant. The initial outputs generated by the ANN are not binary variables but, instead, are continuous values that can be considered estimates of the posterior likelihood of each class. To convert these estimates into binary predictions, we selected a threshold value and assigned any estimated probabilities (of class “ALS”) in excess of that threshold to class “ALS” and any probabilities below that threshold to class “control.” The choice of threshold is important, as higher thresholds will yield higher specificity and lower sensitivity, and lower thresholds will yield higher sensitivity and lower specificity. To visualize this relationship, we generated receiver operating characteristic (ROC) curves that display the sensitivity–specificity trade-off across a large range of possible thresholds.
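The network training and cross-validation were done in MATLAB with scaled conjugate gradient back-propagation; scikit-learn does not provide that optimizer, so the sketch below substitutes L-BFGS and should be read as an illustration of the leave-one-participant-out procedure rather than a reproduction of the article's implementation. The arrays X (eigenvalue features per phrase), y (0 = control, 1 = ALS), and groups (participant IDs) are assumed inputs.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import roc_auc_score

def loso_posteriors(X, y, groups):
    """Leave-one-participant-out cross-validation: for each participant, train
    on all phrases from the remaining participants and estimate P(ALS) for
    every phrase of the held-out participant."""
    scores = np.zeros(len(y), dtype=float)
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        # One hidden layer with six neurons, as in the article; the lbfgs
        # solver stands in for scaled conjugate gradient training.
        net = MLPClassifier(hidden_layer_sizes=(6,), solver="lbfgs", max_iter=2000)
        net.fit(X[train_idx], y[train_idx])
        scores[test_idx] = net.predict_proba(X[test_idx])[:, 1]
    return scores

# Example use (the threshold choice is described in the Results section):
# scores = loso_posteriors(X, y, groups)
# preds = (scores >= 41 / 65).astype(int)  # threshold set from the class ratio
# auc = roc_auc_score(y, scores)
```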

Using the out-of-sample predictions generated by the ANN, we conducted a statistical analysis to examine how region of onset and other factors affected the accuracy of the proposed model in detecting ALS among different participants. We first directly compared the two groups using a one-tailed two-sample t test to determine whether the model's performance was significantly higher for bulbar-onset participants than for spinal-onset participants. We then conducted a multivariate regression analysis that used region of onset along with three different measures of disease severity (ISR, ALSFRS-R total score, and ALSFRS-R Bulbar subscore) as independent variables to predict classification accuracy. The goal of this analysis was to determine whether the observed differences in classifier performance across the two groups could be explained by measurable differences in speech severity.
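For the statistical comparisons described above, a minimal Python sketch using SciPy and statsmodels might look as follows (the article's analyses were run in MATLAB); the data frame and its column names are hypothetical stand-ins for the per-participant accuracies and clinical measures.

```python
from scipy.stats import ttest_ind
import statsmodels.formula.api as smf

def onset_and_severity_analysis(df):
    """`df` is a pandas DataFrame with one row per participant with ALS,
    holding their cross-validated classification accuracy and clinical
    measures. Column names (accuracy, onset, isr, alsfrs_total,
    alsfrs_bulbar) are illustrative placeholders."""
    # One-tailed two-sample t test: is accuracy higher for bulbar onset?
    bulbar = df.loc[df["onset"] == "bulbar", "accuracy"]
    spinal = df.loc[df["onset"] == "spinal", "accuracy"]
    t, p_two_sided = ttest_ind(bulbar, spinal)
    p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2

    # Multivariate regression of accuracy on severity measures and onset region.
    model = smf.ols(
        "accuracy ~ isr + alsfrs_total + alsfrs_bulbar + C(onset)", data=df
    ).fit()
    return p_one_sided, model
```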

Results

The ROC curves displayed in Figure 4 illustrate the trade-off between sensitivity and specificity for different threshold values. A threshold of zero will yield perfect sensitivity and zero specificity for any model (represented by the top right corner of the plot). A threshold of one will yield perfect specificity and zero sensitivity for any model (represented by the bottom left corner of the plot). Thresholds between these extremes will yield some balance of the two metrics, and better models will yield curves that come closer to the top-left corner of the plot (which represents perfect sensitivity and specificity) and have more area underneath them. The area under an ROC curve therefore provides a threshold-independent summary of the model's overall performance.

Figure 4.

Receiver operating characteristic curves for the proposed artificial neural network classification model, broken down by the participant's region of onset. (A) The results for the baseline model, which generates predictions using only a single phrase. (B) The results when aggregate predictions are generated by averaging the posterior predictions across five different phrases.

As the primary goal of this experiment was to understand how the performance of the proposed model differs based on region of onset, separate ROC curves are displayed for the detection performance in the spinal onset and bulbar onset groups. In addition to the single-phrase predictions generated by the baseline model, we also considered the possibility of aggregating predictions across multiple phrases. By averaging the estimated probabilities across multiple phrases, it was possible to generate more accurate predictions than would be attainable from one phrase alone. The ROC curve for predictions aggregated across five phrases (selected at random) is displayed in Figure 4B. From these ROC curves, we see that both the baseline and aggregated predictions are noticeably more accurate in detecting the bulbar-onset participants, regardless of the chosen threshold.
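As a sketch of this aggregation step (an illustration, not the article's code), the per-phrase posteriors produced by the classifier can be averaged over a few randomly chosen phrases from the same participant before thresholding:

```python
import numpy as np

def aggregate_posteriors(scores, phrase_participant_ids, n_phrases=5, seed=0):
    """Average the per-phrase P(ALS) estimates over `n_phrases` randomly
    selected phrases from each participant, giving one aggregated score per
    participant that can then be thresholded as before."""
    rng = np.random.default_rng(seed)
    ids = np.asarray(phrase_participant_ids)
    aggregated = {}
    for pid in np.unique(ids):
        idx = np.flatnonzero(ids == pid)
        chosen = rng.choice(idx, size=min(n_phrases, idx.size), replace=False)
        aggregated[pid] = float(scores[chosen].mean())
    return aggregated
```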

For further examination of the classifier performance, we set the model threshold based on the class ratio observed in our data set (T = 41/65 = 0.631). While 0.5 is typically used as the default threshold, doing so in this case would result in a model that is biased toward being overly sensitive due to the higher representation of participants with ALS relative to healthy controls in the training data. Therefore, setting the threshold based on this disparity helped to offset this bias. As a result, the model achieved a sensitivity of .5318 and a specificity of .8125 for phrase-level predictions. To assess how the model's performance differs based on the participant's region of onset, we examined the accuracy that the model achieved for each of the 41 participants with ALS. The accuracy for each participant was calculated by dividing the number of correctly identified speech samples for that participant by their total number of speech samples (80 for most participants). The average accuracy was 70.0% (SE = 10.9%) for bulbar-onset participants and only 47.5% (SE = 10.5%) for spinal-onset participants. Using a one-tailed two-sample t test, this difference was found to be statistically significant at the 0.05 level (p = .0367).

To better understand the factors that contributed to the classification performance of the proposed model, we conducted a linear regression analysis to evaluate the relationship between three different measures of severity (ISR, ALSFRS-R total score, and ALSFRS-R Bulbar subscore), region of onset, and classifier performance. This model helps elucidate whether the observed differences in classifier performance can be explained by a measurable difference in the patients' severity between the two groups. The results of this regression analysis are displayed in Table 2. Of the four considered input variables, ISR was the only variable with a statistically significant effect in this model. To better understand the relationship between classifier accuracy and ISR, a scatter plot illustrating the relationship between these two variables is presented in Figure 5. Each dot in Figure 5 represents the ISR and classifier accuracy for one recording session (approximately 80 phrase-level predictions). The dark line represents the line of best fit (ordinary least squares) for the relationship between ISR and accuracy, and the dotted lines represent the 95% confidence interval for this line. Although ISR is the only variable with a significant effect in the multivariate model, that does not mean that the other variables are not predictive of classifier accuracy. When considered in isolation, the ALSFRS-R Bulbar subscore is significantly negatively correlated with classification accuracy (p < .001); however, because much of the relevant information it carries is accounted for by ISR, it did not have a significant effect in the multivariate model. Similarly, despite observing significant differences in classification performance between the bulbar- and spinal-onset groups, region of onset did not have a significant effect on classifier performance in the multivariate analysis. This suggests that onset location did not affect the performance of the proposed system beyond measurable population-level differences in speech symptom severity.

Table 2.

Regression results for the relationship between patients' clinical information (intelligible speaking rate, Amyotrophic Lateral Sclerosis Functional Rating Scale [ALSFRS-R] scores, region of onset) and the accuracy of the proposed diagnostic model in detecting amyotrophic lateral sclerosis (ALS).

Factor | Coefficient | SE | 95% CI | p
Intercept (baseline) | 118.8 | 25.822 | [68.2, 169.4] | < .001
Intelligible speaking rate | −0.442 | 0.107 | [−0.65, −0.23] | < .001
ALSFRS-R (total) | 0.803 | 0.572 | [−0.32, 1.92] | .169
ALSFRS-R (Bulbar subscore) | −4.469 | 2.40 | [−9.17, 0.24] | .071
Region of onset | 9.124 | 10.754 | [−11.95, 30.20] | .402

Note. The intercept provides the baseline classification accuracy expected if all clinical measures were zero (and the participant has bulbar-onset ALS). The coefficient for each variable then indicates the expected increase (or decrease) in classifier accuracy corresponding to a unit increase in that variable. Region of onset is binary (either bulbar or spinal). SE = standard error; CI = confidence interval.

Figure 5.

Scatter plot of participant-level classification accuracy versus intelligible speaking rate for the participants included in this study. The included line of best fit helps to illustrate the inverse relationship between these two variables.

Discussion

This article investigated the efficacy of using the spatiotemporal correlation structure of kinematic articulation movements during speech production for detecting ALS. We also investigated how characteristics of the disease, particularly the region of onset, affected the ability of the proposed system to detect the disease in specific individuals. In addressing our first question, the proposed model attained a sensitivity of .5318, a specificity of .8125, and a balanced accuracy of 67.21%. Although this detection performance is significantly higher than chance, and thus indicates that the proposed correlation structure analysis is able to capture some useful information about the changes in articulatory coordination resulting from ALS, this system falls slightly short of the performance level of two previously published systems that perform detection based on audio or visual data. Bandini, Green, Taati, et al. (2018) reported a 74.3% accuracy using logistic regression to classify ALS based on motion trajectories extracted from recorded video data. An et al. (2018) achieved 76.2% accuracy in detecting a highly intelligible set of participants based on the spectrogram of phrase-level audio data. As we found in our experiment, both of these studies showed that significant improvements in classification performance could be attained by aggregating predictions across multiple samples. The highest detection performance reported in the literature came from using deep learning for multimodal classification based on a combination of acoustic and kinematic data to achieve an accuracy of 96.53% (Wang, Kothalkar, Kim, et al., 2016). Although it is difficult to compare the performance of disparate models that are evaluated on different participant groups, research in this area has consistently highlighted the benefits of aggregating information from different recording tasks and different data modalities in developing predictive models.

There are several potential ways that the performance of the proposed system could be improved. Although the sensors used in this article were chosen based on evidence from prior research showing the tongue and lip as the most sensitive markers for bulbar dysfunction (DePaul & Brooks, 1993; Green et al., 2013; Mefferd et al., 2012), jaw kinematics have shown sensitivity in capturing articulatory changes related to ALS (Bandini, Green, Wang, et al., 2018; Rong et al., 2015). Thus, the inclusion of jaw kinematics could improve the sensitivity of the proposed model. As previously stated, the inclusion of audio data could help capture important characteristics of speech related to phonation that are difficult to assess from kinematic motion data alone. As we demonstrated in this article, aggregating predictions across a larger group of speech kinematic data can also help overcome errors that are localized to specific stimuli and improve model performance. Although there are certainly ways to improve the performance of the proposed model, approaching this challenge in this difficult setting (single-phrase predictions based on only kinematic motion data) creates a good framework for assessing the model errors and determining what clinical factors they may be correlated with.

Prior research in this area has not investigated the role that region of onset plays in the ability of speech-based diagnostic models to detect ALS. We hypothesized that, because bulbar-onset patients experience speech symptoms earlier and generally exhibit more severe speech symptoms over the course of the disease, the proposed system would detect them more accurately than spinal-onset participants. This hypothesis was borne out by the classification results, as there was a significantly higher probability (70.0% vs. 47.5%) of the proposed system successfully detecting ALS in speech kinematic data from bulbar-onset participants than from spinal-onset participants. This means that spinal-onset participants were misidentified at an approximately 50% higher rate than the bulbar-onset participants in our data set. Although the spinal-onset participants clearly presented a greater challenge for the proposed system, this does not indicate that speech-based systems are not useful in this population. The ROC curves in Figure 4 show that the proposed system achieves significantly higher than chance-level performance in both patient groups. Instead, this indicates that systems based only on speech are likely to identify spinal-onset participants less accurately and later in the progression of the disease than their bulbar-onset counterparts. As a result, robust detection methods will likely need to employ a combination of speech and nonspeech data modalities (such as measurements of fine/gross motor function) in their assessments.

The final question that this article sought to address was whether any observed difference in performance based on region of onset could be explained by other clinical severity measures. Answering this question helps determine the degree to which region of onset needs to be considered in the evaluation of future speech-based detection systems. If inherent differences exist between the two groups (differences that cannot be explained by other measures of the severity of the patient's symptoms), then the relative number of bulbar- and spinal-onset patients included in the data is important in evaluating a model's efficacy. However, if these differences can be explained by clinical measures of severity, such as the ALSFRS-R, then understanding the severity level of the participant group used to assess a given model is sufficient for understanding the diagnostic challenge that the group presents. When accounting for four different patient factors (region of onset, ISR, ALSFRS-R total score, and ALSFRS-R Bulbar subscore), the only variable that the linear regression identified as a significant predictor of model accuracy was the participant's ISR. The coefficient assigned to ISR suggests that, for every 2.26-words-per-minute increase in a participant's ISR, their speech stimuli become 1% less likely to be correctly identified as belonging to someone with ALS. Overall, this linear regression model was able to account for 54.5% of the variance in the model's performance.

Based on our findings, region of onset does not appear to have a significant effect on the detection performance of the proposed system when controlling for the severity of a patient's speech symptoms. Although spinal-onset patients may not present a diagnostic challenge inherently different from that of bulbar-onset patients, the delay in the presentation of their speech symptoms (da Costa Franceschini & Mourao, 2015; Kühnlein et al., 2008) could still limit the efficacy of speech-based methods for early detection. As such, any diagnostic models that make decisions using speech data alone will likely hold some implicit bias against this population. Thus, to ensure models are not biased against any particular group of patients, future research in this area should incorporate data from a range of motoric tasks in generating diagnostic assessments, similar to how the ALSFRS-R assesses motor function across 12 distinct areas. Additionally, current research has generally relied on data that were collected after the individual was diagnosed. As such, the degree to which the behavioral patterns that previously proposed speech-based diagnostic systems rely on exist in a prediagnosis population needs to be further studied.

Conclusions

This article investigated automatic ALS detection using the spatiotemporal correlation structure of the kinematic movements of articulators during speech production and a machine learning model (ANN). To our knowledge, this work constituted the first effort toward ALS detection based on kinematic motion data alone. The proposed system achieved a 67.21% balanced accuracy in classifying between participants with ALS and healthy controls based on phrase-level speech kinematic data. In studying the efficacy of the proposed system across different participants with varying presentations of ALS, we found that the proposed approach was more reliable in detecting bulbar-onset participants than those who experienced their first symptoms elsewhere. Though bulbar-onset participants were easier to identify, a follow-up regression analysis showed that this difference can be explained by measurable differences in ISR between the two patient groups. From this, we concluded that, although there may be no inherent difference in the diagnostic challenge presented by these two groups, spinal-onset participants may be more difficult to detect in early ALS than the bulbar-onset group due to a delay in the onset of speech symptoms. As a result, robust detection of ALS during the early stages of the disease will likely require the incorporation of nonspeech data modalities into the classification process.

Acknowledgments

This work was in part supported by the National Institutes of Health through Grants R01DC013547 (PI: Green), R01DC016621 (PI: Wang), and R03DC013990 (PI: Wang). A portion of the data and analysis included here was presented as a poster at the Signal Analytics for Motor Speech Workshop With the Conference on Motor Speech in Santa Barbara, California, February 20–23, 2020. We thank Jordan R. Green, Thomas F. Campbell, Yana Yunusova, Jennifer McGlothlin, Brian D. Richburg, Beiming Cao, Carolina Uzquiano, Alyssa Shrode, Brittany Shrode, and Bjorn Bleta for their support and the volunteering participants.


References

  1. An, K. , Kim, M. , Teplansky, K. , Green, J. R. , Campbell, T. F. , Yunusova, Y. , Heitzman, D. , & Wang, J. (2018). Automatic early detection of amyotrophic lateral sclerosis from intelligible speech using convolutional neural networks. Proceedings of Interspeech 2018, 1913–1917. https://doi.org/10.21437/Interspeech.2018-2496 [Google Scholar]
  2. Bandini, A. , Green, J. R. , Taati, B. , Orlandi, S. , Zinman, L. , & Yunusova, Y. (2018). Automatic detection of amyotrophic lateral sclerosis (ALS) from video-based analysis of facial movements: Speech and non-speech tasks. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018) (pp. 150–157). IEEE. https://doi.org/10.1109/FG.2018.00031 [Google Scholar]
  3. Bandini, A. , Green, J. R. , Wang, J. , Campbell, T. F. , Zinman, L. , & Yunusova, Y. (2018). Kinematic features of jaw and lips distinguish symptomatic from presymptomatic stages of bulbar decline in amyotrophic lateral sclerosis. Journal of Speech, Language, and Hearing Research, 61(5), 1118–1129. https://doi.org/10.1044/2018_JSLHR-S-17-0262 [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Benba, A. , Jilbab, A. , & Hammouch, A. (2015). Detecting patients with Parkinson's disease using Mel frequency cepstral coefficients and support vector machines. International Journal on Electrical Engineering and Informatics, 7(2), 297. https://doi.org/10.15676/ijeei.2015.7.2.10 [Google Scholar]
  5. Berry, J. J. (2011). Accuracy of the NDI wave speech research system. Journal of Speech, Language, and Hearing Research. https://doi.org/10.1044/1092-4388(2011/10-0226) [DOI] [PubMed] [Google Scholar]
  6. Brown, R. H. , & Al-Chalabi, A. (2017). Amyotrophic lateral sclerosis. New England Journal of Medicine, 377(2), 162–172. https://doi.org/10.1056/NEJMra1603471 [DOI] [PubMed] [Google Scholar]
  7. Cedarbaum, J. M. , Stambler, N. , Malta, E. , Fuller, C. , Hilt, D. , Thurmond, B. , & Nakanishi, A. (1999). The ALSFRS-R: A revised ALS functional rating scale that incorporates assessments of respiratory function. Journal of the Neurological Sciences, 169(1–2), 13–21. https://doi.org/10.1016/S0022-510X(99)00210-5 [DOI] [PubMed] [Google Scholar]
  8. Cummins, N. , Epps, J. , Breakspear, M. , & Goecke, R. (2011). An investigation of depressed speech detection: Features and normalization. In Cosi P., De Mori R., Di Fabbrizio G., & Pieraccini R. (Eds.), Twelfth Annual Conference of the International Speech Communication Association (pp. 2997–3000). International Speech Communication Association.
  9. da Costa Franceschini, A. , & Mourao, L. F. (2015). Dysarthria and dysphagia in amyotrophic lateral sclerosis with spinal onset: A study of quality of life related to swallowing. NeuroRehabilitation, 36(1), 127–134. https://doi.org/10.3233/NRE-141200 [DOI] [PubMed] [Google Scholar]
  10. DePaul, R. , & Brooks, B. R. (1993). Multiple orofacial indices in amyotrophic lateral sclerosis. Journal of Speech and Hearing Research, 36(6), 1158–1167. https://doi.org/10.1044/jshr.3606.1158 [DOI] [PubMed] [Google Scholar]
  11. Dorsey, M. , Yorkston, K. , Beukelman, D. , & Hakel, M. (2007). Speech Intelligibility Test for Windows. Institute for Rehabilitation Science and Engineering at Madonna. [Google Scholar]
  12. Garrard, P. , Rentoumi, V. , Gesierich, B. , Miller, B. , & Gorno-Tempini, M. L. (2014). Machine learning approaches to diagnosis and laterality effects in semantic dementia discourse. Cortex, 55, 122–129. https://doi.org/10.1016/j.cortex.2013.05.008 [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Green, J. R. , Yunusova, Y. , Kuruvilla, M. S. , Wang, J. , Pattee, G. L. , Synhorst, L. , Zinman, L. , & Berry, J. D. (2013). Bulbar and speech motor assessment in ALS: Challenges and future directions. Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration, 14(7–8), 494–500. https://doi.org/10.3109/21678421.2013.817585 [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Helfer, B. S. , Quatieri, T. F. , Williamson, J. R. , Keyes, L. , Evans, B. , Greene, W. N. , Vian, T. , Lacirignola, J. , Shenk, T. , Talavage, T. Palmer, J. , & Heaton, K. (2014). Articulatory dynamics and coordination in classifying cognitive change with preclinical mTBI. In Li H., Meng H., Ma B., Chng E. S., & L. Xie, (Eds.), Fifteenth Annual Conference of the International Speech Communication Association (pp. 485–489). International Speech Communication Association.
  15. Kiernan, M. C. , Vucic, S. , Cheah, B. C. , Turner, M. R. , Eisen, A. , Hardiman, O. , Burrell, J. R. , & Zoing, M. C. (2011). Amyotrophic lateral sclerosis. The Lancet, 377(9769), 942–955. https://doi.org/10.1016/S0140-6736(10)61156-7 [DOI] [PubMed] [Google Scholar]
  16. Kuruvilla-Dugdale, M. , & Chuquilin-Arista, M. (2017). An investigation of clear speech effects on articulatory kinematics in talkers with ALS. Clinical Linguistics & Phonetics, 31(10), 725–742. https://doi.org/10.1080/02699206.2017.1318173 [DOI] [PubMed] [Google Scholar]
  17. Kuruvilla-Dugdale, M. , & Mefferd, A. (2017). Spatiotemporal movement variability in ALS: Speaking rate effects on tongue, lower lip, and jaw motor control. Journal of Communication Disorders, 67, 22–34. https://doi.org/10.1016/j.jcomdis.2017.05.002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Kühnlein, P. , Gdynia, H.-J. , Sperfeld, A.-D. , Lindner-Pfleghar, B. , Ludolph, A. C. , Prosiegel, M. , & Riecker, A. (2008). Diagnosis and treatment of bulbar symptoms in amyotrophic lateral sclerosis. Nature Clinical Practice Neurology, 4(7), 366–374. https://doi.org/10.1038/ncpneuro0853 [DOI] [PubMed] [Google Scholar]
  19. Lee, J. , & Bell, M. (2018). Articulatory range of movement in individuals with dysarthria secondary to amyotrophic lateral sclerosis. American Journal of Speech-Language Pathology, 27(3), 996–1009. https://doi.org/10.1044/2018_AJSLP-17-0064 [DOI] [PubMed] [Google Scholar]
  20. Mefferd, A. S. (2016). Associations between tongue movement pattern consistency and formant movement pattern consistency in response to speech behavioral modifications. The Journal of the Acoustical Society of America, 140(5), 3728–3737. https://doi.org/10.1121/1.4967446 [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Mefferd, A. S. , Green, J. R. , & Pattee, G. (2012). A novel fixed-target task to determine articulatory speed constraints in persons with amyotrophic lateral sclerosis. Journal of Communication Disorders, 45(1), 35–45. https://doi.org/10.1016/j.jcomdis.2011.09.002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Mefferd, A. S. , Pattee, G. L. , & Green, J. R. (2014). Speaking rate effects on articulatory pattern consistency in talkers with mild ALS. Clinical Linguistics & Phonetics, 28(11), 799–811. https://doi.org/10.3109/02699206.2014.908239 [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Norel, R. , Pietrowicz, M. , Agurto, C. , Rishoni, S. , Cecchi, G. , Israel, P. , & Hasharon, R. (2018). Detection of amyotrophic lateral sclerosis (ALS) via acoustic analysis. Proceedings of Interspeech 2018 (pp. 377–381). https://doi.org/10.21437/Interspeech.2018-2389
  24. Orimaye, S. O. , Wong, J. S. M. , Golden, K. J. , Wong, C. P. , & Soyiri, I. N. (2017). Predicting probable Alzheimer's disease using linguistic deficits and biomarkers. BMC Bioinformatics, 18(1), 1–13. https://doi.org/10.1186/s12859-016-1456-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Orozco-Arroyave, J. R. , Hönig, F. , Arias-Londoño, J. D. , Vargas-Bonilla, J. F. , Daqrouq, K. , Skodda, S. , Rusz, J. , & Nöth, E. (2016). Automatic detection of Parkinson's disease in running speech spoken in three different languages. The Journal of the Acoustical Society of America, 139(1), 481–500. https://doi.org/10.1121/1.4939739 [DOI] [PubMed] [Google Scholar]
  26. Roark, B. , Mitchell, M. , Hosom, J.-P. , Hollingshead, K. , & Kaye, J. (2011). Spoken language derived measures for detecting mild cognitive impairment. IEEE Transactions on Audio, Speech, and Language Processing, 19(7), 2081–2090. https://doi.org/10.1109/TASL.2011.2112351 [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Rong, P. , Yunusova, Y. , Richburg, B. , & Green, J. R. (2018). Automatic extraction of abnormal lip movement features from the alternating motion rate task in amyotrophic lateral sclerosis. International Journal of Speech-Language Pathology, 20(6), 610–623. https://doi.org/10.1080/17549507.2018.1485739 [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Rong, P. , Yunusova, Y. , Wang, J. , & Green, J. R. (2015). Predicting early bulbar decline in amyotrophic lateral sclerosis: A speech subsystem approach. Behavioural Neurology, 2015. https://doi.org/10.1155/2015/183027 [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Shellikeri, S. , Green, J. R. , Kulkarni, M. , Rong, P. , Martino, R. , Zinman, L. , & Yunusova, Y. (2016). Speech movement measures as markers of bulbar disease in amyotrophic lateral sclerosis. Journal of Speech, Language, and Hearing Research, 59(5), 887–899. https://doi.org/10.1044/2016_JSLHR-S-15-0238 [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Sturim, D. , Torres-Carrasquillo, P. A. , Quatieri, T. F. , Malyska, N. , & McCree, A. (2011). Automatic detection of depression in speech using Gaussian mixture modeling with factor analysis. In Cosi P., De Mori R., Di Fabbrizio G., & Pieraccini R. (Eds.), Twelfth Annual Conference of the International Speech Communication Association (pp. 2981–2984). International Speech Communication Association.
  31. Wang, J. , Kothalkar, P. V. , Cao, B. , & Heitzman, D. (2016). Towards automatic detection of amyotrophic lateral sclerosis from speech acoustic and articulatory samples. Proceedings of Interspeech 2016, 1195–1199. https://doi.org/10.21437/Interspeech.2016-1542 [Google Scholar]
  32. Wang, J. , Kothalkar, P. V. , Kim, M. , Bandini, A. , Cao, B. , Yunusova, Y. , Campbell, T. F. , Heitzman, D. , & Green, J. R. (2018). Automatic prediction of intelligible speaking rate for individuals with ALS from speech acoustic and articulatory samples. International Journal of Speech-Language Pathology, 20(6), 669–679. https://doi.org/10.1080/17549507.2018.1508499 [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Wang, J. , Kothalkar, P. V. , Kim, M. , Yunusova, Y. , Campbell, T. F. , Heitzman, D. , & Green, J. R. (2016). Predicting intelligible speaking rate in individuals with amyotrophic lateral sclerosis from a small number of speech acoustic and articulatory samples. Workshop on Speech and Language Processing for Assistive Technologies, 2016, 91. https://doi.org/10.21437/SLPAT.2016-16 [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Wang, J. , Samal, A. , Rong, P. , & Green, J. R. (2016). An optimal set of flesh points on tongue and lips for speech-movement classification. Journal of Speech, Language, and Hearing Research, 59(1), 15–26. https://doi.org/10.1044/2015_JSLHR-S-14-0112 [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Williamson, J. R. , Bliss, D. W. , Browne, D. W. , & Narayanan, J. T. (2012). Seizure prediction using EEG spatiotemporal correlation structure. Epilepsy and Behavior, 25(2), 230–238. https://doi.org/10.1016/j.yebeh.2012.07.007 [DOI] [PubMed] [Google Scholar]
  36. Williamson, J. R. , Quatieri, T. F. , Helfer, B. S. , Ciccarelli, G. , & Mehta, D. D. (2014). Vocal and facial biomarkers of depression based on motor incoordination and timing. Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, 65–72. https://doi.org/10.1145/2661806.2661809 [Google Scholar]
  37. Williamson, J. R. , Quatieri, T. F. , Helfer, B. S. , Perricone, J. , Ghosh, S. S. , Ciccarelli, G. , & Mehta, D. D. (2015). Segment-dependent dynamics in predicting Parkinson's disease. Proceedings of Interspeech 2015, 518–522. [Google Scholar]
  38. Williamson, J. R., Quatieri, T. F., Helfer, B. S., Horwitz, R., & Yu, B. (2013). Vocal biomarkers of depression based on motor incoordination. In Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge (pp. 41–48). https://doi.org/10.1145/2512530.2512531
  39. Wisler, A. , Teplansky, K. , Green, J. R. , Yunusova, Y. , Campbell, T. , Heitzman, D. , & Wang, J. (2019). Speech-based estimation of bulbar regression in amyotrophic lateral sclerosis. Proceedings of the Eighth Workshop on Speech and Language Processing for Assistive Technologies, 24–31. https://doi.org/10.18653/v1/W19-1704 [Google Scholar]
  40. Yu, B. , Quatieri, T. F. , Williamson, J. R. , & Mundt, J. C. (2014). Prediction of cognitive performance in an animal fluency task based on rate and articulatory markers. In Proceedings of Interspeech 2014.
  41. Yunusova, Y. , Weismer, G. , Westbury, J. R. , & Lindstrom, M. J. (2008). Articulatory movements during vowels in speakers with dysarthria and healthy controls. Journal of Speech, Language, and Hearing Research. https://doi.org/10.1044/1092-4388(2008/043) [DOI] [PubMed] [Google Scholar]
