American Journal of Speech-Language Pathology
. 2020 Apr 24;29(2):873–882. doi: 10.1044/2019_AJSLP-19-00162

Using Crowdsourced Listeners' Ratings to Measure Speech Changes in Hypokinetic Dysarthria: A Proof-of-Concept Study

Christopher Nightingale a, Michelle Swartz b, Lorraine Olson Ramig c,d,e,f, Tara McAllister a
PMCID: PMC7842862  PMID: 32331503

Abstract

Purpose

Interventions for speech disorders aim to produce changes that are not only acoustically measurable or perceptible to trained professionals but are also apparent to naive listeners. Due to challenges associated with obtaining ratings from suitably large listener samples, however, few studies currently evaluate speech interventions by this criterion. Online crowdsourcing technologies could enhance the measurement of intervention effects by making it easier to obtain real-world listeners' ratings.

Method

Stimuli, drawn from a published study by Sapir et al. (“Effects of intensive voice treatment (Lee Silverman Voice Treatment [LSVT]) on vowel articulation in dysarthric individuals with idiopathic Parkinson disease: Acoustic and perceptual findings” in Journal of Speech, Language, and Hearing Research, 50(4), 2007), were words produced by individuals who received intensive treatment (LSVT LOUD) for hypokinetic dysarthria secondary to Parkinson's disease. Thirty-six online naive listeners heard randomly ordered pairs of words elicited pre- and posttreatment and reported which they perceived as “more clearly articulated.”

Results

Mixed-effects logistic regression indicated that words elicited posttreatment were significantly more likely to be rated “more clear.” Across individuals, acoustically measured magnitude of change was significantly correlated with pre–post difference in listener ratings.

Conclusions

These results partly replicate the findings of Sapir et al. (2007) and demonstrate that their acoustically measured changes are detectable by everyday listeners. This supports the viability of using crowdsourcing to obtain more functionally relevant measures of change in clinical speech samples.

Supplemental Material

https://doi.org/10.23641/asha.12170112


Parkinson's disease (PD) is a progressive neurodegenerative disease in which pathology of the basal ganglia control circuit affects various aspects of motor control. Individuals with PD may experience speech deficits that make up a perceptually distinct motor speech disorder termed “hypokinetic dysarthria” (Duffy, 2013). Manifestations are evident in many subsystems of speech production, but most readily affect voice, articulation, and prosody. Affected individuals' speech is commonly characterized by reductions in vocal loudness, prosodic pitch inflection, and range of articulatory movements, as well as dysfluencies; the overall impact is a reduction in speech intelligibility (Sapir et al., 2007).

Lee Silverman Voice Treatment (LSVT LOUD®) is an intensive 4-week regimen that trains individuals to recalibrate their vocal effort in order to produce speech that is louder and more clearly audible (Ramig et al., 1988). The design of the treatment program is based on principles of motor learning and neuroplasticity that inform clinical practices common to physical therapy and neurology (Fox et al., 2002; Ramig et al., 2018). Since LSVT's inception, numerous studies have documented positive changes in the voice and speech characteristics of individuals who complete the program (Fox et al., 2002; Ramig et al., 1995; Ramig, Sapir, Fox, et al., 2001; Sapir et al., 2002, 2007). Three randomized controlled trials (Ramig et al., 2018; Ramig, Sapir, Countryman, et al., 2001; Ramig, Sapir, Fox, et al., 2001) reported significant acoustic changes in the speech of individuals with PD following LSVT LOUD treatment. Studies using perceptual measures of change also found significant improvement in measures including intelligibility, vocal quality, clarity, and loudness (Moya-Galé et al., 2018; Sapir et al., 2002, 2007).

One limitation of the studies described above is their reliance on acoustic measures and/or trained listeners' ratings to demonstrate significant speech improvements following LSVT LOUD treatment. These methods may be sensitive to subtle changes that would not necessarily be apparent to everyday listeners who are naive to phonetic training and with whom the speaker interacts on a daily basis. Thus, the real-world impact of the statistically significant changes measured in these studies remains uncertain. One alternative is to ask naive listeners to evaluate speech samples from individuals with PD elicited before and after treatment. However, without standard training, naive listeners may use idiosyncratic strategies to arrive at ratings of speech samples, which raises concerns about the validity and reliability of their ratings. This heterogeneity makes it necessary to pool responses across a large number of listeners in order to draw reliable conclusions. Because it has historically been challenging to obtain ratings from a sufficiently sized sample of naive listeners, most researchers have favored the use of trained listeners and/or acoustic measures (noteworthy counterexamples include Sussman & Tjaden, 2012, which recruited a sample of 52 naive listeners in the laboratory setting, and Tjaden et al., 2014, which used 100 naive listeners). This research note explores crowdsourcing as one possible tool to obtain perceptual ratings from a large number of naive listeners, with the goal of making it more convenient to obtain functionally relevant assessments of the impact of speech interventions including LSVT LOUD.

Crowdsourcing in Behavioral Research

Crowdsourcing involves the online recruitment of members of the general population to solve problems of different levels of complexity. Although a single individual recruited online is unlikely to demonstrate expert-level performance on a given task, by aggregating responses over numerous such individuals, it is possible to arrive at outcomes that converge with expert responses (Ipeirotis et al., 2014). Research over the past decade has established crowdsourcing as a valued tool in fields such as linguistics (Gibson et al., 2011; Sprouse, 2011), behavioral psychology (Goodman et al., 2013; Paolacci et al., 2010), and neuroscience (Chartier et al., 2018; Long et al., 2016). More recent extensions to speech pathology have shown agreement between expert and crowdsourced listeners' ratings of the speech of children with residual rhotic errors (McAllister Byun et al., 2015, 2016) and established the validity of a paradigm to familiarize naive listeners with the speech patterns of adults with dysarthria (Lansford et al., 2016). To date, the Amazon Mechanical Turk (AMT) crowdsourcing platform has been most commonly used in academic research, and it constitutes the focus of this study. However, alternative platforms such as Prolific Academic (Palan & Schitter, 2018) are gaining in popularity for research use.

Critics of crowdsourcing have raised justifiable concerns about the limited control researchers have over the setting in which the experiment takes place and the equipment that is used. Data collected via crowdsourcing tend to be “noisier” than lab-based data, but because crowdsourcing allows the researcher to complete experimental tasks in a fraction of the time required in a lab (Crump et al., 2013; Sescleifer et al., 2018; Sprouse, 2011), it is possible to overcome this drawback by recruiting a larger sample size. Multiple published studies have offered empirical validation of data collected through AMT, either by replicating classic findings (Crump et al., 2013; Horton et al., 2011) or by directly comparing lab samples to results obtained through AMT (Lansford et al., 2016; Paolacci et al., 2010). A recent systematic review of studies of crowdsourcing for perceptual rating of speech (Sescleifer et al., 2018) found that “lay ratings are highly concordant with expert opinion, validating crowdsourcing as a reliable methodology”; across studies, the authors calculated a mean correlation coefficient of .81 between crowdsourced and expert listener ratings. On the other hand, the systematic review returned only eight studies (of which only four were published in peer-reviewed journals), suggesting that there is considerable need for further research on this topic.

It is important to note that not all studies using crowdsourcing have called for a sample size that exceeds what would be feasible to collect in the lab setting. 1 For instance, McAllister Byun et al. (2015) concluded that crowdsourced listeners' ratings converge with an “industry standard” level of reliability when pooled across just nine listeners. In such cases, the advantage derived from crowdsourcing may emerge in a cumulative fashion as ratings are collected over a large body of data. In one single-case experimental study of 12 participants (Preston et al., 2019), over 14,000 unique utterances needed to be evaluated to track changes in /r/ production accuracy over the course of treatment; the full data set was posted to AMT, and crowdsourced listeners' ratings were collected until each token had received nine unique ratings, following the recommendation from McAllister Byun et al. (2015). A total of 51 listeners contributed ratings toward the completion of this task. Thus, even if the number of listeners targeted per token is modest, a large data set may require a sizable sample of raters, creating a situation where the efficiency advantage of crowdsourcing can be valuable.

This proof-of-concept study tested whether crowdsourced listeners' ratings would replicate Sapir et al.'s (2007) finding of a statistically significant degree of improvement in the speech of a group of participants with PD following LSVT LOUD treatment. Because our primary goal was to evaluate whether changes over the course of treatment would be apparent to an everyday “person off the street,” no efforts were made to train listeners to a standard criterion or select those listeners who showed the greatest aptitude for the task. (However, such measures can be put into place to optimize the quality of ratings obtained through crowdsourcing; we return to this point in detail in the Discussion section.) This study additionally examined the correlation across participants between acoustically measured magnitude of change (pre–post difference in the ratio of second formant frequencies in /i/ and /u/ [F2i/F2u]) and pre–post difference in crowdsourced listeners' ratings. If there is convergence between results derived using these two methods, this would provide initial support for the viability of crowdsourced ratings while also offering a partial replication of Sapir et al.'s (2007) finding in the context of real-world listeners.

Study for Replication

In their original study, Sapir et al. (2007) recruited 29 individuals with PD and randomly assigned 14 of them, seven men and seven women, to the LSVT LOUD treatment group. The remaining 15 participants, eight men and seven women, were assigned to the nontreatment group. Assignment was stratified on severity of PD and severity of speech disorder. In addition, 14 individuals who were neurologically healthy, seven men and seven women, were recruited. Results from the nontreatment group and neurologically healthy group were not reanalyzed for the purpose of this proof-of-concept study, which is why the present findings can offer at best a partial replication of the original result. On the other hand, an analysis involving only the experimental group is sufficient for our primary goal of evaluating whether crowdsourced listeners' ratings can be used to detect an acoustically measurable effect, independent of whether that effect was caused by treatment or external factors. The mean age of the treatment group was 68 years (SD = 6 years), with a mean time since diagnosis of 9.08 years (SD = 6.97 years). The mean stage of disease was 1.33 (SD = 1.63; Hoehn & Yahr, 1967). Ninety percent of participants were judged to have mild-to-moderate hypokinetic dysarthria, while 10% were judged to have severe dysarthria. Participants' cognition was informally assessed by a speech-language pathologist, and all participants were judged to have grossly intact cognitive functioning. All participants were taking anti-Parkinson's medication and underwent a laryngoscopic examination that revealed no laryngeal tissue pathology.

All members of the treatment group received treatment following the standard LSVT LOUD protocol, which consisted of 50- to 60-min sessions taking place four times per week over 4 weeks. Speech samples were collected in three baseline sessions before the beginning of treatment (PRE), and again in two sessions after the end of the treatment period (POST). Three target phrases were selected for analysis from within a larger elicitation protocol: “The potato stew is in the pot,” “Buy Bobby a puppy,” and “The blue spot is on the key,” which were read aloud three times each. As their primary acoustic measure, Sapir et al. (2007) reported the ratio of second formant (F2) frequencies in the vowel /i/ in “key” and /u/ in “stew” (F2i/F2u ratio). Because F2 reflects the position of the tongue along a front–back dimension, and /i/ and /u/ represent extreme front and back points in the English vowel quadrilateral, the F2i/F2u ratio provides an index of the magnitude of an individual's tongue movements during speech. It has been used successfully in other studies of dysarthric speech including Rosen et al. (2008) and Yunusova et al. (2005). Sapir et al. (2007) found a significant increase in F2i/F2u ratio that was specific to the group that received LSVT LOUD treatment. Individuals with PD in the no-treatment group, as well as healthy control participants, did not show a significant change over the same time period. Given that hypokinetic dysarthria is associated with reduced magnitude of lingual movements, a change to a larger F2i/F2u ratio over the course of treatment is suggestive of improved lingual mobility.
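As a toy numerical illustration of this measure (hypothetical formant values, not data from Sapir et al., 2007): a wider front–back range of tongue movement raises F2 in /i/ and lowers it in /u/, which increases the ratio. A minimal sketch in R:

```r
# Hypothetical F2 values (Hz) for the vowels in "key" and "stew"; illustrative only.
f2_i <- c(pre = 1900, post = 2150)  # F2 of /i/ rises as the tongue moves further forward
f2_u <- c(pre = 1300, post = 1050)  # F2 of /u/ falls as the tongue moves further back
f2_i / f2_u                         # F2i/F2u ratio increases from about 1.46 to about 2.05
```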

Sapir et al. (2007) also collected perceptual ratings of vowels extracted from the second utterance of each targeted phrase in the first PRE and POST recordings. In a computerized procedure, a mixed sample of speech-language pathologists and graduate students in speech-language pathology 2 was presented with pairs of the same vowel produced by the same participant at PRE- and POST-treatment time points. Listeners used a visual analog scale (VAS) to rate the second vowel in each pair as “better than,” “same as,” or “worse than” the first vowel presented. For the group that received LSVT LOUD treatment, 78.8% of POST-training vowels were judged as better than their PRE-treatment counterparts, compared with 34.1% of POST vowels for the untreated control group. Pearson product–moment analysis revealed a significant correlation between changes in perception of vowel goodness ratings and F2i/F2u ratio (r = .40 for /i/ and r = .80 for /u/). Thus, both acoustic measures and trained listeners' ratings of isolated vowels converged on a finding of significant improvement from PRE- to POST-treatment time points. However, it was unknown whether naive listeners' ratings would also support this finding of significant change; this study aimed to test this hypothesis.

Method

Twenty-eight recordings from the treatment group from Sapir et al. (2007) were reanalyzed for this study, representing the first PRE recording and the first POST recording for each of the 14 participants in the group. 3 Using Praat acoustic analysis software (Boersma & Weenink, 2010), the target words “stew” and “key” were extracted from the first two utterances of the corresponding phrases. 4 In the original study, trained listeners rated only the isolated vowel sound extracted from each word. For this study, the entire word was presented because it was judged that naive listeners might have trouble understanding the metalinguistically more demanding task of rating isolated vowels. A 0.25-s silent interval was inserted before and after each word to simulate the effect of the word being produced in isolation. All recordings were then normalized to a standard root-mean-square amplitude of 70 dB. Words were paired in order of elicitation (e.g., the first utterance of “key” from the PRE recording was paired with the first utterance of “key” from the POST recording for the same participant), yielding a total of 56 pairs.
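The padding and amplitude-normalization steps can be scripted in several ways; the following is a minimal sketch in R using the tuneR package, with placeholder file names, a mono-recording assumption, and an arbitrary common RMS target standing in for the actual preparation pipeline (which used Praat's dB scale).

```r
# Sketch of per-token preparation: read a word token, scale it to a common RMS
# amplitude, and pad it with 0.25 s of silence. File names and the RMS target
# are placeholders, not the authors' actual values or script.
library(tuneR)

prepare_token <- function(in_path, out_path, pad_s = 0.25, target_rms = 0.1) {
  w <- readWave(in_path)                       # assumes a mono recording
  x <- w@left / (2^(w@bit - 1))                # samples as floating point in [-1, 1]
  x <- x * (target_rms / sqrt(mean(x^2)))      # normalize to a common RMS amplitude
  pad <- rep(0, round(pad_s * w@samp.rate))    # 0.25 s of silence
  x <- c(pad, x, pad)
  out <- Wave(left = round(x * (2^15 - 1)), samp.rate = w@samp.rate, bit = 16)
  writeWave(out, out_path)
}

prepare_token("P071_key_PRE_1.wav", "P071_key_PRE_1_norm.wav")
```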

Speech samples were presented for rating using Experigen, an online platform for stimulus presentation and response recording (Becker & Levine, 2013). The order in which stimuli were presented was randomized both within and across word pairs. Raters could listen to each sound file in a pair up to three times. They were instructed to select the word that was “more clearly articulated” (see Figure 1). 5 Before rating the stimuli, the raters participated in a familiarization phase composed of 10 sample pairs that were hand-selected to feature a relatively unambiguous contrast in clarity. Raters received feedback on the accuracy of their response to each familiarization trial; however, these trials were not used to include or exclude raters. Recordings used for familiarization were not used in subsequent test trials.

Figure 1. Example of the rating interface viewed by crowdsourced participants on Amazon Mechanical Turk.

The protocol for collection of online ratings was approved as exempt from review by the institutional review board at New York University. Thirty-six raters (M age = 34.5 years, SD = 11 years) were recruited using AMT. The size of the sample of listeners (n = 36) was targeted based on the size of the expert listener group in McAllister Byun et al. (2015). Raters were required to have United States–based IP addresses and to have an acceptance rate of at least 95% across tasks previously completed on AMT. 6 They were also required to self-report as monolingual English speakers without a history of speech, language, or hearing impairment and to report no previous training in speech-language pathology or linguistics. Finally, raters were instructed that the use of headphones was required and that their volume should be turned up to at least 70% of their computer's maximum volume. However, no independent verification of these rater-reported details was possible; we return to this limitation in the Discussion section. Participants were compensated $1.25 for completing the experiment, which was estimated to take approximately 10 min. Responses from the target sample size of 36 raters were collected in approximately 7 hr.

For analysis, data were pooled across all words and listeners. Words were coded as “1” or “0” (more vs. less clearly articulated), which served as the dependent variable in a logistic mixed model with fixed effects including PRE- versus POST-treatment time of elicitation, vowel target (/i/ vs. /u/), and the interaction between these factors. Random intercepts for speaker were included to adjust for heterogeneity across participants, with a random slope for time of elicitation to capture the fact that participants might not all show the same direction or magnitude of change from PRE to POST treatment. A random intercept for rater was also included to adjust for individual differences in the criteria used by raters to classify utterances as more versus less clearly articulated.
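A minimal sketch of this model specification, assuming the pooled responses are in a data frame with hypothetical column names (more_clear, time, word, speaker, rater) and using the lme4 package (the authors' actual data and code are available on the Open Science Framework, as noted below):

```r
# Mixed-effects logistic regression: is the rating "more clearly articulated" (1/0)
# predicted by time of elicitation (PRE vs. POST), target word (/i/ vs. /u/), and
# their interaction? Random effects follow the structure described in the text.
# Column names are hypothetical placeholders.
library(lme4)

m <- glmer(
  more_clear ~ time * word +     # fixed effects: time, word, and their interaction
    (1 + time | speaker) +       # by-speaker random intercept and slope for time
    (1 | rater),                 # by-rater random intercept
  data = ratings,
  family = binomial
)
summary(m)
```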

A second analysis tested whether there was a correlation across participants between crowdsourced listeners' perceptual ratings and acoustic measures of change. To assess this question, the PRE- to POST-treatment difference in perceptual rating for each individual was quantified as the percentage of recordings rated more clearly articulated at POST treatment minus the percentage rated more clearly articulated at PRE treatment. Acoustic change for each individual was quantified as the change in F2i/F2u ratio from PRE to POST treatment. Pearson r was used to evaluate the strength of the correlation between these two measures. All analyses were carried out in the R software environment (R Core Team, 2015). Complete data and code used to generate the results reported below are available in Supplemental Materials S1, S2, and S3 and on the Open Science Framework at https://osf.io/9sy7j/.
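The participant-level correlation could be computed as in the following sketch, assuming a hypothetical data frame by_speaker with one row per participant:

```r
# Pearson correlation between each participant's change in percent "more clearly
# articulated" ratings (POST minus PRE) and change in F2i/F2u ratio.
# Column names are hypothetical placeholders.
cor.test(by_speaker$rating_change_pct, by_speaker$f2_ratio_change,
         method = "pearson")
```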

Results

Descriptively, the utterance elicited in the POST-treatment condition was selected as more clearly articulated in 1,292/2,014 trials (64.15%), whereas the utterance elicited in the PRE-treatment condition was selected as more clearly articulated in 722/2,014 trials (35.85%). The significance of this difference was evaluated using the logistic mixed-effects model described above. There was a significant main effect of PRE- versus POST-treatment time point (β = 0.88, SE = 0.33, p = .01), with the positive direction of the coefficient indicating that utterances elicited at the POST time point were more likely to be rated more clearly articulated than utterances elicited PRE treatment. There was no significant main effect of word (β = −0.00, SE = 0.07, p = 1.00), but there was a significant interaction between time and word (β = 0.21, SE = 0.10, p = .04). This interaction can be visualized in Figure 2, which plots the proportion of recordings rated more clearly articulated at the PRE- and POST-treatment time point for both words. Although both words showed a similar magnitude of change in median rating, there is a greater degree of overlap in pre- versus posttreatment ratings for “key” than for “stew.” The interpretation of this interaction will not be pursued in greater detail in the context of this short report. However, one possible explanation is suggested by discussion in Sapir et al. (2007), where it was pointed out that changes in articulatory mobility over the course of treatment could affect the lips and tongue differentially. It is possible that articulator-specific effects of treatment could account for differing behavior of targets containing rounded versus unrounded vowels.

Figure 2. Boxplots showing percentage of “more clearly articulated” ratings for tokens elicited before the beginning of treatment (PRE) and after the end of treatment (POST), separated by word. Middle bars represent medians, boxes represent the interquartile range (25th–75th percentiles), and whiskers extend to the most extreme nonoutlier data point.

Figure 3 shows the percentage of recordings rated more clearly articulated at PRE versus POST treatment for each individual participant. Although acoustic measures of change are not numerically represented in this figure, individuals are ordered by magnitude of change in F2i/F2u ratio, from smallest to largest. Across participants, there was a significant correlation between acoustic and perceptual measures of change, r(12) = .62, p = .02, that was moderate in magnitude. This correlation indicates that greater increases in perceptually rated clarity tended to be associated with greater increases in F2i/F2u ratio. However, Figure 3 also shows cases of dissociation between perceptual and acoustic measures of change. For instance, out of the four individuals who exhibited the greatest acoustic change from PRE to POST treatment, three (P071, P265, and P760) showed a correspondingly robust improvement in perceptually rated clarity, but the fourth (P067) showed a slight decrease in perceptually rated clarity from PRE to POST treatment. Conversely, P795 showed a sizable increase in perceptually rated clarity despite little change in F2i/F2u ratio. These instances of dissociation speak to the fact that a single acoustic measure cannot capture the full range of factors that contribute to listeners' perceptual ratings of clarity. In future studies, it may be possible to achieve stronger agreement across acoustic and perceptual domains by following best practices for acoustic measurement, such as using Bark-transformed frequencies and taking measurements at a flexible rather than a fixed time point (Fletcher et al., 2017).

Figure 3. Bar plots showing percentage of “more clearly articulated” ratings for tokens elicited before the beginning of treatment (PRE) and after the end of the treatment period (POST) for each participant. Participants have been rank-ordered by the magnitude of change in the ratio of second formant frequencies in /i/ and /u/ [F2i/F2u], from lowest to highest.

As a measure of interrater reliability, the proportion of raters who agreed with the modal rating for each token was calculated. On average, agreement across raters was 75.7%. A one-proportion z test indicated that agreement across raters was significantly greater than chance (50%), χ2 = 1061.7, p < .0001.
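A minimal sketch of this agreement test, assuming a hypothetical logical vector agree with one element per rating (TRUE when the rating matches the modal rating for its token):

```r
# One-proportion test of rater-modal agreement against chance (50%);
# prop.test reports a chi-squared statistic, as in the text.
prop.test(sum(agree), length(agree), p = 0.5)
```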

Discussion

In Sapir et al. (2007), both an acoustic measure (F2i/F2u ratio) and trained listeners' perceptual ratings of isolated vowel sounds showed improvement from PRE to POST training with LSVT LOUD; the original study also demonstrated that no such change was present in a group of individuals who did not receive such treatment. However, Sapir et al. (2007) did not address the question of whether the improvements observed in connection with treatment would be perceptually apparent to everyday listeners without any special training. This study used crowdsourcing to obtain naive listeners' ratings of the relative clarity of words produced PRE and POST training in the group that received LSVT LOUD treatment. The crowdsourced ratings replicated Sapir and colleagues' finding that words elicited at the POST time point were significantly more likely to be rated more clearly articulated than words elicited PRE training. Extending the original results, the present findings demonstrated that the change in treated participants' speech clarity was perceptually apparent even to an unselected sample of listeners with no previous training in speech pathology or linguistics. The current study additionally showed a significant correlation between acoustic and perceptual measures of change from PRE to POST treatment; the magnitude of the correlation was similar to that reported by Sapir et al. (2007). Interestingly, though, there were also cases of dissociation between acoustic and perceptual measures of change, suggesting that these two measures can provide complementary information and thus should both be collected whenever possible.

It is also interesting to note that significant differences in perceptually rated clarity were obtained even though the audio samples were normalized for intensity prior to rating (an action primarily intended for the comfort of online listeners, who tend to object to unpredictable changes in intensity from token to token). Because vocal volume is explicitly targeted in LSVT LOUD treatment, differences in intensity can be expected to make a significant contribution to perceptual changes from pre- to posttreatment. In this study, even though intensity was eliminated as a cue, listeners were able to differentiate PRE- versus POST-training tokens based on spectral differences alone. This further testifies to the robustness of change over the course of treatment documented in Sapir et al. (2007).

Overall, this study adds to a growing evidence base (Lansford et al., 2016; McAllister Byun et al., 2015, 2016; Sescleifer et al., 2018) indicating that online crowdsourcing can validly be applied to clinical speech samples. Moreover, these ratings are extremely quick and convenient to obtain, with ratings from the present sample of 36 unique listeners collected in roughly 7 hr. Clinical researchers may thus wish to consider online crowdsourcing as an efficient means to obtain naturalistic judgments of treatment efficacy. To facilitate uptake of crowdsourcing among researchers in speech-language pathology, complete JavaScript code used to run the present experiment on the Experigen platform is available on the Open Science Framework at https://osf.io/z2j65/.

Finally, it is important to note that the addition of crowdsourcing to the research toolbox is not merely a question of convenience, but can also confer benefits in the area of scientific rigor and reproducibility. As pointed out by McAllister Byun et al. (2015), the methodological challenges associated with obtaining ratings from a large sample of outside listeners may lead researchers to make methodological compromises that can introduce bias into their results, such as using study team members as data raters. By increasing the efficiency with which researchers can access blinded raters, crowdsourcing can cut down on this source of bias. In addition, when paired with open data–sharing practices, crowdsourcing could promote replication of research results (Crump et al., 2013). If the data that form the basis for a given finding are made public, a third party can re-implement the process of obtaining crowdsourced ratings of those stimuli and provide an independent evaluation to confirm or dispute the conclusions drawn.

Finding the Experts in the Crowd

Although this study reproduced Sapir et al.'s finding that significantly more POST tokens were selected as “more clear” than PRE tokens, the magnitude of the effect differed from that reported in the original study. Specifically, the trained listeners in the original study classified POST tokens as better exemplars than their PRE counterparts in 78.8% of responses, whereas in this study, the relevant percentage was 64.2%. It is not surprising that our unselected sample of naive listeners did not perform identically to trained listeners; in fact, we explicitly hypothesized that naive listeners would be less sensitive to subtle differences in speech production than trained raters. Our goal in this study was to demonstrate that the treatment effect reported in Sapir et al. (2007) would be perceptually apparent even to everyday listeners with no special selection or training. Such a sample can be valuable to document the robustness and real-world relevance of a treatment effect obtained under somewhat idealized circumstances.

Of course, at other times, researchers will instead be interested in obtaining ratings that converge as closely as possible with those of trained experts. Such outcomes can be achieved through crowdsourcing if steps are taken to assess rater performance and include data only from high-performing listeners (Harel et al., 2017). A common practice is to identify a set of “gold-coded” tokens with a predetermined correct response and score listeners' accuracy in rating these items, then include only those raters who meet a set accuracy threshold. Alternatively, researchers can track intrarater reliability across repeated presentations of a subset of stimuli, because methodological research has shown that this measure of rating consistency is strongly correlated with accuracy in rating gold-coded speech samples (Harel et al., 2017). In either case, the task used to evaluate listener performance can be administered as a pretest to determine eligibility to participate in a rating study, or “catch trials” can be interspersed throughout the main task and used to exclude low-performing raters in a post hoc fashion.
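As an illustration of the post hoc screening approach, the following sketch retains only raters who meet an accuracy criterion on gold-coded catch trials; the column names and the 80% threshold are illustrative assumptions, not values prescribed by the studies cited above.

```r
# Screen raters by accuracy on gold-coded catch trials; the 0.80 threshold and
# the column names (rater, response, gold) are hypothetical.
library(dplyr)

keep_raters <- catch_trials |>
  group_by(rater) |>
  summarise(accuracy = mean(response == gold)) |>
  filter(accuracy >= 0.80) |>
  pull(rater)

screened <- filter(ratings, rater %in% keep_raters)  # retained ratings for analysis
```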

Similarly, in this study, online participants were required to report various pieces of information (e.g., native language, linguistic training, and current use of headphones), but we did not implement any measures to independently verify these rater-reported details. In future research aimed more narrowly at replicating the performance of trained listeners in a laboratory setting, it may be appropriate to develop preliminary tasks designed to verify the participants' responses to these questions. For example, participants might be required to listen to and type a word whose recording volume has been set to fall at the threshold of audibility when presented over headphones on a typical computer at 75% of playback volume. By blocking participants who do not complete this task successfully, we could indirectly confirm that listeners are indeed wearing headphones set to an appropriate volume.

Limitations

Even with measures in place to select skilled and attentive raters, there are limitations to the type of information that we can reasonably expect to obtain from crowdsourced listeners. Although listeners with no special training are generally able to render gestalt judgments about the clarity, accuracy, or naturalness of speech productions, they are unlikely to be able to provide ratings for more specific constructs such as nasality or vocal roughness. However, previous research has suggested that even trained listeners show poor reliability in judging such specific constructs, leading to the recommendation that more global measures be used (Kreiman et al., 1993). There is ample previous literature documenting the utility of global estimates of speech severity (Sussman & Tjaden, 2012; Weismer et al., 2001) or naturalness (Whitehill et al., 2004) to characterize degree of impairment in dysarthric speech.

Of specific relevance for this study of hypokinetic dysarthria, crowdsourcing is not optimal for collecting measures of absolute vocal loudness (McNaney et al., 2016), because different computer systems will have different playback volumes. (Although listeners could still judge the relative intensity of tokens presented side by side, this runs the risk that listeners could experience discomfort in connection with very loud volumes or unexpected changes in intensity.) Crowdsourcing may also be suboptimal for measures that are highly dependent on information in a specific range of the frequency spectrum (e.g., high-frequency noise in the context of judgments of breathiness) since the varying headphones used by crowdsourced listeners will differ in frequency response. Thus, although crowdsourcing can be valuable in many contexts, users need to be aware of limitations to the scope in which it should be applied. Adding crowdsourcing to a toolbox that also includes trained listener ratings and acoustic measures could empower clinical researchers to select the method(s) that optimally align with the goals of a given study.

A second limitation of the present research pertains to the scope of the study in its capacity as a partial replication of Sapir et al. (2007). The primary goal of this study was methodological: to document the viability of using crowdsourced data collection to evaluate clinical speech samples. Our secondary clinical goal was to replicate the Sapir et al. (2007) finding regarding the efficacy of LSVT LOUD treatment for hypokinetic dysarthria. We did partly replicate their findings by reproducing the significant effect of treatment that they documented in their experimental group. However, our online rating tasks did not include data from two other groups in the original study, namely, healthy adults and adults with hypokinetic dysarthria who did not receive LSVT LOUD treatment. The clinical impact of the present findings would be considerably reinforced if, in addition to replicating the effect of treatment reported by Sapir et al. (2007), we also replicated the null results obtained for the other two groups. At the same time, we did observe a significant correlation between acoustic and perceptual measures of change over the course of training, and there is no particular reason to posit that the relationship between acoustic and perceptual measures should differ across groups. Thus, it is reasonable to hypothesize that crowdsourced ratings would support the original finding of no significant change in F2i/F2u ratios from PRE to POST for the two nontreatment groups in Sapir et al. (2007). Having made an initial demonstration of the viability of crowdsourced data collection, our future work will obtain ratings for both treated and untreated groups in order to be able to demonstrate intervention effects in a controlled manner.

Lastly, the strength of this study as a replication of Sapir et al. (2007) is further limited by several points of divergence in methodology between the original study and the current follow-up. In the original study, raters used a visual analog scale to compare tokens in a gradient fashion, whereas this study used a forced-choice task. This was intended to simplify the task for untrained raters, but future research should examine agreement across trained and untrained listeners using visual analog scaling. It is particularly noteworthy that raters in this study did not have the option of indicating if they judged both tokens to be equal in clarity; it would be advisable to make this option available in future studies. Similarly, while the original study obtained ratings of isolated vowel sounds, in the current study, entire words were presented in an effort to make the task less metalinguistically demanding. Collecting ratings at the word level tends to limit the strength of the correlation between perceptual and acoustic measures, since the listeners' ratings may reflect changes in consonant production that are not reflected in acoustic measures derived from vowels. Nevertheless, the moderate correlation between perceptual ratings and acoustically measured change in this study (r = .62) was similar in magnitude to that observed for isolated vowels by Sapir et al. (r = .60 when averaging across /i/ and /u/ vowels).

Conclusions

This study used online crowdsourcing to collect 36 naive listeners' ratings of words produced before and after LSVT LOUD treatment for hypokinetic dysarthria (Sapir et al., 2007). Our results replicated one component of the original findings by demonstrating that participants made significant gains in speech clarity from PRE- to POST-treatment phases of the study. Furthermore, we extended the original finding of an effect of LSVT LOUD treatment to a more naturalistic context by demonstrating that these gains are evident not only in acoustic measures and trained listeners' ratings but also in judgments from an unselected sample of everyday listeners who lack clinical or phonetic training. We also found that measures of the magnitude of progress over the course of treatment derived from crowdsourced listeners' ratings were significantly correlated with acoustic measures of change over the same interval. This study was intended only as a proof-of-concept demonstration, and as such, it has various limitations that can be improved on in future research. However, the present findings do provide a preliminary indication that crowdsourcing can be considered a valid means to evaluate speech samples collected from clinical populations.

Supplementary Material

Supplemental Material S1. Participant questionnaire regarding current technology use.
Supplemental Material S2. Verbal instructions provided at time of testing.
Supplemental Material S3. Verbal instructions provided at time of testing.

Acknowledgments

The current study was supported by National Institutes of Health Grant R41-DC016778 (PI: McAllister); data collection for the original study (Sapir et al., 2007) was supported by National Institutes of Health Grant R01-DC01150 (PI: Ramig). The authors gratefully acknowledge Shimon Sapir, Jennifer L. Spielman, Brad H. Story, and Cynthia Fox for generously agreeing to share the materials from their original study. They would also like to thank Daphna Harel for assistance with statistical analysis, Daniel Szeredi for coding support, and all participants, both in Sapir et al. (2007) and on Amazon Mechanical Turk, for their time and effort.

Funding Statement

The current study was supported by National Institutes of Health Grant R41-DC016778 (PI: McAllister); data collection for the original study (Sapir et al., 2007) was supported by National Institutes of Health Grant R01-DC01150 (PI: Ramig).

Footnotes

1

In fact, the task that forms the focus of this study could be achieved with relative ease in the laboratory setting. The scope of the project was deliberately kept small for the purpose of this proof-of-concept investigation, with plans to scale up only after the viability of the method is demonstrated.

2

It is worth noting that, while all of the raters in Sapir et al. (2007) had received phonetic and clinical training, the inclusion of graduate student listeners limits the level of expertise that the raters can be claimed to have. A recent study (Smith et al., 2019) suggested that students in early stages of training in speech-language pathology produce speech ratings comparable to those of untrained listeners. It will be of interest for future research to document agreement between naive listeners, like those of this study, and trained listeners with differing degrees of experience.

3

As part of the original study, participants agreed to de-identified sharing of study data, including audio recordings. A letter from the Office of Regulatory Compliance at the original institution was obtained confirming that this was a Health Insurance Portability and Accountability Act–compliant reuse of the data.

4

The word “Bobby” was also included in the perceptual rating task in the original Sapir et al. (2007) study. In this study, we were specifically interested in comparing listener ratings against F2i/F2u ratio, which Sapir et al. (2007) identified as their acoustic measure of primary interest. Because the target vowel in “Bobby” does not contribute to this measure, it was excluded from data collection for this study.

5

A forced-choice task was used instead of the VAS rating from Sapir et al. (2007) in an effort to limit task complexity for untrained listeners. The paradigm from the original study requires the rater to form an impression of the clarity of the first token and then use that as the standard against which the second is rated, which is potentially more demanding than selecting one of two choices as more clear. However, the VAS from Sapir et al. (2007) included an option to rate both tokens as equally clear, which was not an option in the present forced-choice task; we return to this limitation in the Discussion section.

6

Individuals who post tasks to AMT have the right to reject completed work if they deem it unsatisfactory. The rate of acceptance or rejection of tasks completed by a given worker is made available through the AMT interface and represents an important source of information when attempting to identify trustworthy raters.

References

  1. Becker M., & Levine J. (2013). Experigen—An online experiment platform. https://github.com/tlozoot/experigen
  2. Boersma P., & Weenink D. (2010). Praat: Doing phonetics by computer. http://www.praat.org/
  3. Chartier J., Anumanchipalli G., Johnson K., & Chang E. F. (2018). Encoding of articulatory kinematic trajectories in human speech sensorimotor cortex. Neuron, 98(5), 1042–1054. https://doi.org/10.1016/j.neuron.2018.04.031 [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Crump M. J., McDonnell J. V., & Gureckis T. M. (2013). Evaluating Amazon's Mechanical Turk as a tool for experimental behavioral research. PLOS ONE, 8(3). https://doi.org/10.1371/journal.pone.0057410 [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Duffy J. (2013). Motor speech disorders: Substrates, differential diagnosis, and management (3rd ed.). Elsevier Mosby. [Google Scholar]
  6. Fletcher A., McAuliffe M., Lansford K., & Liss J. M. (2017). Assessing vowel centralization in dysarthria: A comparison of methods. Journal of Speech, Language, and Hearing Research, 60(2), 341–354. https://doi.org/10.1044/2016_JSLHR-S-15-0355 [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Fox C. M., Morrison C. E., Ramig L. O., & Sapir S. (2002). Current perspectives on the Lee Silverman Voice Treatment (LSVT) for individuals with idiopathic Parkinson disease. American Journal of Speech-Language Pathology, 11(2), 111–123. https://doi.org/10.1044/1058-0360(2002/012) [Google Scholar]
  8. Gibson E., Piantadosi S., & Fedorenko K. (2011). Using Mechanical Turk to obtain and analyze English acceptability judgments. Language and Linguistics Compass, 5(8), 509–524. https://doi.org/10.1111/j.1749-818X.2011.00295.x [Google Scholar]
  9. Goodman J. K., Cryder C. E., & Cheema A. (2013). Data collection in a flat world: The strengths and weaknesses of Mechanical Turk samples. Journal of Behavioral Decision Making, 26(3), 213–224. https://doi.org/10.1002/bdm.1753 [Google Scholar]
  10. Harel D., Hitchcock E. R., Szeredi D., Ortiz J., & McAllister Byun T. (2017). Finding the experts in the crowd: Validity and reliability of crowdsourced measures of children's gradient speech contrasts. Clinical Linguistics & Phonetics, 31(1), 104–117. https://doi.org/10.3109/02699206.2016.1174306 [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Hoehn M. M., & Yahr M. D. (1967). Parkinsonism: Onset, progression and mortality. Neurology, 17(5), 427–442. https://doi.org/10.1212/wnl.17.5.427 [DOI] [PubMed] [Google Scholar]
  12. Horton J. J., Rand D. G., & Zeckhauser R. J. (2011). The online laboratory: Conducting experiments in a real labor market. Experimental Economics, 14(3), 399–425. https://doi.org/10.1007/s10683-011-9273-9 [Google Scholar]
  13. Ipeirotis P. G., Provost F., Sheng V. S., & Wang J. (2014). Repeated labeling using multiple noisy labelers. Data Mining and Knowledge Discovery, 28(2), 402–441. https://doi.org/10.1007/s10618-013-0306-1 [Google Scholar]
  14. Kreiman J., Gerratt B. R., Kempster G. B., Erman A., & Berke G. S. (1993). Perceptual evaluation of voice quality: Review, tutorial, and a framework for future research. Journal of Speech and Hearing Research, 36(1), 21–40. https://doi.org/10.1044/jshr.3601.21 [DOI] [PubMed] [Google Scholar]
  15. Lansford K. L., Borrie S. A., & Bystricky L. (2016). Use of crowdsourcing to assess the ecological validity of perceptual-training paradigms in dysarthria. American Journal of Speech-Language Pathology, 25(2), 233–239. https://doi.org/10.1044/2015_AJSLP-15-0059 [DOI] [PubMed] [Google Scholar]
  16. Long M. A., Katlowitz K. A., Svirsky M. A., Clary R. C., McAllister Byun T., Majaj N., Oya H., Howard M. A., & Greenlee J. D. W. (2016). Functional segregation of cortical regions underlying speech timing and articulation. Neuron, 89(6), 1187–1193. https://doi.org/10.1016/j.neuron.2016.01.032 [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. McAllister Byun T., Halpin P. F., & Szeredi D. (2015). Online crowdsourcing for efficient rating of speech: A validation study. Journal of Communication Disorders, 53, 70–83. https://doi.org/10.1016/j.jcomdis.2014.11.003 [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. McAllister Byun T., Harel D., Halpin P. F., & Szeredi D. (2016). Deriving gradient measures of child speech from crowdsourced ratings. Journal of Communication Disorders, 64, 91–102. https://doi.org/10.1016/j.jcomdis.2016.07.001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. McNaney R., Othman M., Richardson D., Dunphy P., Amaral T., Miller N., Stringer H., Olivier P., & Vines J. (2016). Speeching: Mobile crowdsourced speech assessment to support self-monitoring and management for people with Parkinson's. In Kaye J., Druin A., Lampe C., Morris D., & Hourcade J. P. (Eds.), Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (pp. 4464–4476). Association for Computing Machinery. https://doi.org/10.1145/2858036.2858321 [Google Scholar]
  20. Moya-Galé G., Goudarzi A., Bayés À., McAuliffe M., Bulté B., & Levy E. S. (2018). The effects of intensive speech treatment on conversational intelligibility in Spanish speakers with Parkinson's disease. American Journal of Speech-Language Pathology, 27(1), 154–165. https://doi.org/10.1044/2017_AJSLP-17-0032 [DOI] [PubMed] [Google Scholar]
  21. Palan S., & Schitter C. (2018). Prolific.ac—A subject pool for online experiments. Journal of Behavioral and Experimental Finance, 17, 22–27. https://doi.org/10.1016/j.jbef.2017.12.004 [Google Scholar]
  22. Paolacci G., Chandler J., & Ipeirotis P. (2010). Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5), 411–419. [Google Scholar]
  23. Preston J. L., McAllister T., Phillips E., Boyce S., Tiede M., Kim J. S., & Whalen D. H. (2019). Remediating residual rhotic errors with traditional and ultrasound-enhanced treatment: A single-case experimental study. American Journal of Speech-Language Pathology, 28(3), 1167–1183. https://doi.org/10.1044/2019_AJSLP-18-0261 [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. R Core Team. (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing; https://www.R-project.org/ [Google Scholar]
  25. Ramig L. O., Countryman S., Thompson L. L., & Horii Y. (1995). Comparison of two forms of intensive speech treatment for Parkinson disease. Journal of Speech and Hearing Research, 38(6), 1232–1251. https://doi.org/10.1044/jshr.3806.1232 [DOI] [PubMed] [Google Scholar]
  26. Ramig L. O., Halpern A., Spielman J., Fox C., & Freeman K. (2018). Speech treatment in Parkinson's disease: Randomized controlled trial (RCT). Movement Disorders, 33(11), 1777–1791. https://doi.org/10.1002/mds.27460 [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Ramig L. O., Mead C., Scherer R., Horii Y., Larson K., & Kohler D. (1988). Voice therapy and Parkinson's disease: A longitudinal study of efficacy. Paper presented at the Clinical Dysarthria Conference, San Diego, CA, United States. [Google Scholar]
  28. Ramig L. O., Sapir S., Countryman S., Pawlas A. A., O'Brien C., Hoehn M., & Thompson L. L. (2001). Intensive voice treatment (LSVT®) for patients with Parkinson's disease: A 2 year follow up. Journal of Neurology, Neurosurgery & Psychiatry, 71(4), 493–498. https://doi.org/10.1136/jnnp.71.4.493 [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Ramig L. O., Sapir S., Fox C., & Countryman S. (2001). Changes in vocal loudness following intensive voice treatment (LSVT®) in individuals with Parkinson's disease: A comparison with untreated patients and normal age-matched controls. Movement Disorders, 16(1), 79–83. https://doi.org/10.1002/1531-8257(200101)16:1<79::AID-MDS1013>3.0.CO;2-H [DOI] [PubMed] [Google Scholar]
  30. Rosen K. M., Goozée J. V., & Murdoch B. E. (2008). Examining the effects of multiple sclerosis on speech production: Does phonetic structure matter? Journal of Communication Disorders, 41(1), 49–69. https://doi.org/10.1016/j.jcomdis.2007.03.009 [DOI] [PubMed] [Google Scholar]
  31. Sapir S., Ramig L. O., Hoyt P., Countryman S., O'Brien C., & Hoehn M. (2002). Speech loudness and quality 12 months after intensive voice treatment (LSVT) for Parkinson's disease: A comparison with an alternative speech treatment. Folia Phoniatrica et Logopaedica, 54(6), 296–303. https://doi.org/10.1159/000066148 [DOI] [PubMed] [Google Scholar]
  32. Sapir S., Spielman J. L., Ramig L. O., Story B. H., & Fox C. (2007). Effects of intensive voice treatment (the Lee Silverman Voice Treatment [LSVT]) on vowel articulation in dysarthric individuals with idiopathic Parkinson disease: Acoustic and perceptual findings. Journal of Speech, Language, and Hearing Research, 50(4), 899–912. https://doi.org/10.1044/1092-4388(2007/064) [DOI] [PubMed] [Google Scholar]
  33. Sescleifer A., Francoisse C., & Lin A. (2018). Systematic review: Online crowdsourcing to assess perceptual speech outcomes. Journal of Surgical Research, 232, 351–364. https://doi.org/10.1016/j.jss.2018.06.032 [DOI] [PubMed] [Google Scholar]
  34. Smith C. H., Patel S., Woolley R. L., Brady M. C., Rick C. E., Halfpenny R., Rontiris A., Knox-Smith L., Dowling F., Clarke C. E., Au P., Ives N., Wheatley K., & Sackley C. M. (2019). Rating the intelligibility of dysarthric speech amongst people with Parkinson's disease: A comparison of trained and untrained listeners. Clinical Linguistics & Phonetics, 33(10–11), 1063–1070. https://doi.org/10.1080/02699206.2019.1604806 [DOI] [PubMed] [Google Scholar]
  35. Sprouse J. (2011). A validation of Amazon Mechanical Turk for the collection of acceptability judgments in linguistic theory. Behavior Research Methods, 43(1), 155–167. https://doi.org/10.3758/s13428-010-0039-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Sussman J. E., & Tjaden K. (2012). Perceptual measures of speech from individuals with Parkinson's disease and multiple sclerosis: Intelligibility and beyond. Journal of Speech, Language, and Hearing Research, 55(4), 1208–1219. https://doi.org/10.1044/1092-4388(2011/11-0048) [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Tjaden K., Sussman J. E., & Wilding G. E. (2014). Impact of clear, loud, and slow speech on scaled intelligibility and speech severity in Parkinson's disease and multiple sclerosis. Journal of Speech, Language, and Hearing Research, 57(3), 779–792. https://doi.org/10.1044/2014_JSLHR-S-12-0372 [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Weismer G., Jeng J. Y., Laures J. S., Kent R. D., & Kent J. F. (2001). Acoustic and intelligibility characteristics of sentence production in neurogenic speech disorders. Folia Phoniatrica et Logopaedica, 53(1), 1–18. https://doi.org/10.1159/000052649 [DOI] [PubMed] [Google Scholar]
  39. Whitehill T. L., Ciocca V., & Yiu E. M.-L. (2004). Perceptual and acoustic predictors of intelligibility and acceptability in Cantonese speakers with dysarthria. Journal of Medical Speech-Language Pathology, 12(4), 229–234. [Google Scholar]
  40. Yunusova Y., Weismer G., Kent R. D., & Rusche N. M. (2005). Breath-group intelligibility in dysarthria: Characteristics and underlying correlates. Journal of Speech, Language, and Hearing Research, 48(6), 1294–1310. https://doi.org/10.1044/1092-4388(2005/090) [DOI] [PubMed] [Google Scholar]
