Journal of Speech, Language, and Hearing Research. 2025 Dec 5;69(1):108–122. doi: 10.1044/2025_JSLHR-25-00109

Listener Ratings of Stuttering: Evaluating Two Auditory–Perceptual Scales

Allison Johnson, Anja Bullen, Sima Sokolov, Tanya Eadie, Melissa Kokaly, Gabriel J Cler
PMCID: PMC12806039  PMID: 41348924

Abstract

Purpose:

Stuttering is a motor speech difference characterized by a disruption in the fluency, timing, and rhythm of speech. There is a lack of agreement on how to reliably and efficiently assess stuttering in research and the clinic. This study aims to compare the validity and reliability of two types of auditory–perceptual scales, direct magnitude estimation and equal-appearing interval scales, for assessing stuttering in adult speakers.

Method:

Two experiments compared unfamiliar listener ratings of speech samples from adults who stutter. Raters used one of two different rating scales to determine the construct validity and reliability of scaling procedures for capturing stuttering. The two experiments varied by the number and duration of samples (set number of syllables vs. set duration) and by the training given to participants (defining stuttering severity vs. allowing participants to define severity themselves).

Results:

Both experiments demonstrated the appropriateness of both scales for rating stuttering.

Conclusions:

Contrary to earlier studies, our findings indicated that a 7-point equal-appearing interval scale validly captured unfamiliar listeners' perception of stuttering severity. Future study is needed to determine the number of raters needed to provide stable ratings as well as the utility of average or single ratings to capture clinically relevant change.


Stuttering is a motor speech difference characterized by a disruption in the fluency, timing, and rhythm of speech. Stuttering presents as repetitions of sounds, syllables, or words; prolongations of sounds; or prolonged pauses often accompanied by tension (Wingate, 1964). Persistent developmental stuttering affects around 1% of the general adult population (Craig et al., 2002; Yairi & Ambrose, 2013) and can affect quality of life in a variety of domains (Blood & Blood, 2016; Carter et al., 2017; Croft & Byrd, 2020).

One barrier in stuttering research is the lack of reliable and efficient measures of stuttering severity that translate to clinical settings. Auditory–perceptual judgments of speech are one type of measure commonly used to describe and classify communication disorders (Kent, 1996) and are often used as part of a multidimensional protocol to assess stuttering severity. Auditory–perceptual measures may be used by clinicians, people who stutter and their families, as well as unfamiliar listeners. Unfamiliar listeners (also called naive or inexperienced listeners) may be solicited to act as a proxy for the general public who interact with people who stutter in daily life (Tuthill, 1940; Van Borsel & Eeckhout, 2008).

The purpose of this study is to compare the validity and reliability of two auditory–perceptual scales: a simple equal-appearing interval (EAI) scale that is already used in stuttering (O'Brian et al., 2020) and direct magnitude estimation (DME), a scale that is less commonly used in stuttering but is necessary to accurately capture other speech and voice domains. The findings of this study will have implications for the methodologies used to evaluate stuttering severity in the speech of adults who stutter in both research and clinical contexts.

Measurement of Stuttering Severity

Measurement of the physical characteristics of stuttering has utility in research and may be one component of clinical practice, particularly when a client and a clinician have selected a clinical target of increasing fluency. However, there has been considerable debate among researchers and clinicians regarding appropriate techniques for measuring stuttering severity (Brundage et al., 2021; Yaruss, 1997). It is also important to note that measuring stuttering severity does not equate to measuring the quality of life or personal perceptions of people who stutter; it is critical that separate evidence-based measures be used to address those aspects effectively in assessment and therapy (Gerlach et al., 2021; Manning & Gayle Beck, 2013).

Difficulties with accuracy and reliability (not to mention ease) of measuring stuttering severity have been long noted in the field (e.g., D. Lewis & Sherman, 1951; Naylor, 1953). Some studies use stuttering frequency as a proxy for severity, in which they report the number of syllables that contain stuttering per 100 syllables (i.e., percent stuttered syllables [%SS]). The reliability of such counts is low, even within and between experts (see Cordes & Ingham, 1994, for a review). Even when experts were asked only to denote whether a stuttering moment occurred in a given short sample, agreement (seven of eight experts judging the sample as stuttered or as fluent) was achieved on only 77% of the samples (Ingham et al., 1993).
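As a minimal illustration of the frequency measure described above, %SS can be computed as the number of stuttered syllables divided by the total number of syllables, multiplied by 100. The Python sketch below uses a hypothetical function name and hypothetical counts; it is not the scoring procedure of any cited study.

```python
def percent_stuttered_syllables(stuttered: int, total: int) -> float:
    """Return %SS: stuttered syllables per 100 syllables spoken."""
    if total <= 0:
        raise ValueError("total syllable count must be positive")
    if not 0 <= stuttered <= total:
        raise ValueError("stuttered count must lie between 0 and total")
    return 100.0 * stuttered / total


# Hypothetical counts: 11 stuttered syllables in a 110-syllable reading.
print(percent_stuttered_syllables(11, 110))
```

Note that the count of stuttered syllables is itself a perceptual judgment, so the arithmetic is only as reliable as the identification of stuttering moments.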

Even if such counts were reliable and accurate, however, the perception of stuttering severity is not equivalent to stuttering frequency. Rather, it is multidimensional (O. Bloodstein et al., 2021), much like the perception of voice quality (Barsties & De Bodt, 2015) or dysarthria (Stipancic et al., 2024). For example, one study indicated discrepancies between %SS and (clinician-judged) global ratings of stuttering severity, particularly in samples with many stuttering moments (which increase %SS but may not increase ratings if the moments are very short) or with fewer stuttering moments that involve fixed postures (lower %SS but higher severity; O'Brian, Packman, Onslow, & O'Brian, 2004). Beyond stuttering frequency, listener perceptions of severity could be impacted by a variety of factors, including the type of stuttering event (e.g., prolongation vs. repetition), the duration or severity of each stuttering event, overall speech rate, naturalness, verbal concomitants (such as restarting a phrase, pullouts), perceived speaker effort, and the secondary behaviors or physical concomitants that can accompany stuttering. The reliability of measuring each of those separately is likely low. For example, although noting the type of disfluency is considered a necessity in the clinic, its reliability is also very low, around 50% (Cordes, 2000). The relationships between these factors are complex (e.g., naturalness may be reduced following speech therapy as fluency-shaping techniques are employed; naturalness in posttreatment speech is related to speech rate; Kalinowski et al., 1994; Metz et al., 1990).

The most common stuttering severity measure, the Stuttering Severity Instrument–Fourth Edition (SSI-4; Riley, 2009), combines several of these factors (frequency, duration, and concomitant or associated behaviors). Much like measurements of these factors individually, the SSI-4 has low reliability (Davidow, 2021) and is time-consuming. Across rating scales and factors, a significant amount of research has investigated the impact of experience or training on listener ratings. An early study noted that both experts and people who stutter counted more “borderline” disfluencies (e.g., “slight hesitations, minor breaks in rhythm, pauses in extemporaneous speech”) than unfamiliar (“disinterested”) listeners (Tuthill, 1940). Tuthill then discussed whether those borderline disfluencies should be counted if they do not notably disrupt the speaker's message. Other research has indicated that for simple ratings, unfamiliar listeners and experts are equivalently reliable (e.g., Amir et al., 2018). A number of training systems have been developed and evaluated (e.g., Bainbridge et al., 2015) for speech pathology trainees or other unfamiliar listeners in order to increase both accuracy and reliability in other severity rating systems (e.g., SSI-4 or counting disfluencies).

Movement Toward Global Perceptual Rating Scales

Given the clinical utility of real-time stuttering severity measurement for decision making (Yaruss, 1998), combined with the time-consuming nature and low reliability of other measures, there has been a move toward using simple 5-, 7-, or 9-point EAI scales of stuttering severity as an outcome measure, rated by a clinician, the parent of a child who stutters, or self-reported ratings from the person who stutters (O'Brian et al., 2020; Onslow et al., 2018). These global rating scales have been shown to correlate with other measures of stuttering, such as the SSI (K. E. Lewis, 1995), %SS (O'Brian, Packman, Onslow, & O'Brian, 2004), and speech rate (Young, 1961). Importantly, though, the scale type used to capture perceptual dimensions of stuttering must also be considered when assessing the validity of the rating tool. In particular, early studies suggest that these EAI scales are inappropriate for rating stuttering severity (Berry & Silverman, 1972; Schiavetti et al., 1983). Since there is no current consensus on accurate and reliable measurement of stuttering severity, this study is focused on testing the reliability and validity of two different types of auditory–perceptual methodologies for measuring (global) stuttering severity.

EAI Scaling

EAI scales are familiar to raters and popular in the clinic and research. An EAI scale has discrete numbers (e.g., 1, 2, 3, 4, and 5 for a 5-point scale, where the endpoints are defined as least severe and most severe). Some researchers have noted that EAI ratings correlate with disfluency counts (e.g., percentage of syllables stuttered) and are quick and easy to perform (O'Brian et al., 2020; Onslow et al., 2018). The advantages of EAI scales are that they are simple to use, require no equipment, can be used in isolation, and appear to need little or no training (O'Brian, Packman, & Onslow, 2004). Additionally, clients can use them for self-report of stuttering severity outside the clinic, which recognizes the situational and temporal variability that can be very common with stuttering (Tichenor & Yaruss, 2021). This can allow clients and clinicians to communicate more easily and effectively about stuttering severity (O'Brian, Packman, & Onslow, 2004).

However, it has been known for 50 years that EAI scaling is inappropriate for measuring a variety of percepts in the communication fields (Stevens, 1975). These percepts include loudness (Stevens, 1975), nasality (Zraick & Liss, 2000), roughness (Toner & Emanuel, 1989), and overall severity in both Parkinson's disease (Stipancic et al., 2024) and dysphonia (Eadie & Doyle, 2002). EAI scales are inappropriate when perception is not equal across the whole scale. This kind of percept, called a “prothetic” construct, cannot be scaled appropriately by EAI because listeners divide some intervals into smaller subunits than others (Stevens, 1975). In other words, the (perceptual) distances between points on the scale are not, in fact, equal. EAI scales may also not capture the full range of perception (i.e., ceiling and floor effects).

Early work demonstrated that interval widths presented in EAI scales of stuttering severity are not equal, because the interval points at the lower end are perceived at about half the width of the interval points at the upper end (Berry & Silverman, 1972). Thus, findings from prior research suggest that EAI scaling might not be appropriate for rating a construct such as stuttering severity. This is not to say that EAI scales cannot be “reliable” for such percepts; for example, D. Lewis and Sherman (1951) showed high reliability of an EAI scale. Berry and Silverman (1972) took the D. Lewis and Sherman scale and showed that while it was reliable, it was not accurately capturing the percept of stuttering severity. When percepts are not fully captured by EAI, another type of scaling should be used.

DME Scaling

DME is a scaling procedure that has been suggested to be appropriate for all percepts (e.g., Martin, 1965). In a common DME procedure, participants are presented with a speech sample (here: reference sample; sometimes referred to as the modulus; Stevens, 1975) and told that the reference should be assigned a value of, for example, 100. Listeners then rate all subsequent stimuli relative to the magnitude of the reference: If the sample to be rated is twice as severe as the reference, it should be assigned 200. If it is half as severe as the reference, it should be assigned 50.
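The ratio logic of the DME procedure described above can be sketched in a few lines of Python. The function name and the default modulus of 100 are assumptions for illustration only.

```python
def severity_ratio(rating: float, modulus: float = 100.0) -> float:
    """Express a DME rating as a multiple of the reference sample's severity."""
    if modulus <= 0:
        raise ValueError("the modulus must be positive")
    return rating / modulus


# A rating of 200 means "twice as severe as the reference"; 50 means "half as severe".
for rating in (200, 100, 50):
    print(rating, severity_ratio(rating))
```

Because every rating is interpreted as a ratio to the reference, DME values are meaningful only relative to the particular reference sample that was used.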

One advantage of DME scales is that they avoid the systematic bias of EAI scales, whose intervals may not be perceptually equal (Berry & Silverman, 1972; Engen, 1971; Stevens, 1975). Another advantage of DME scales is that the resultant data are continuous rather than categorical, which means parametric statistical methods are appropriate. On the other hand, DME is unlikely to be implemented in a clinical setting as it requires more explanation, the ratings themselves take more time, and most importantly, the measure is always proportional to something else (i.e., either compared to the first speaker in a series or a reference sample). As such, DME is primarily used as a research tool, particularly in other areas of communication disorders (Eadie & Doyle, 2002; Zraick & Liss, 2000).

Determining Scale Appropriateness for Measuring a Construct

Stevens (1975) investigated how different percepts are perceived and concluded that there are two kinds of perceptual continua: those that can be divided into equal intervals (metathetic) and those that cannot (prothetic). If percepts can be divided into equal intervals, then ratings using equal interval scales and DME will have a direct, linear relationship. If percepts are not divided into equal intervals, then the relationship between EAI and DME ratings will be curvilinear. Importantly, all percepts can be validly rated with DME; only those with equal perceptual intervals can be rated validly with EAI scales (Stevens, 1975).

Several studies have used DME to measure stuttering severity, including an early paper by Martin (1965). To determine the appropriateness of EAI (an easier scaling method), though, one must compare ratings of the same samples collected with both EAI and DME scales (Stevens, 1975). Three previous studies have used both DME and EAI scales for perceived stuttering severity. McColl and Fucci (2006) gathered DME and EAI ratings of stuttering severity from samples from one typically fluent speaker who injected linguistically unnecessary pauses (suggested to be a proxy for stuttering). They found that both DME and EAI ratings correlated highly with the rate of (experimentally generated) pauses. Their results might suggest that given the similarly strong relationship to an objective correlate of speech disfluency (pauses), either type of scale might be used for capturing stuttering severity. However, they did not explicitly study the relationship between DME and EAI scales, so the results do not directly answer the question of the nature of stuttering severity as a percept.

Berry and Silverman (1972) compared DME ratings provided by naive listeners to samples chosen to represent each point on a 1–9 EAI scale in another study (D. Lewis & Sherman, 1951). They found that there was a nonlinear relationship between the EAI and DME ratings of the same samples, with the lower three intervals perceived about half as widely as the upper six intervals. These results are consistent with a prothetic continuum, indicating that stuttering severity would only be accurately captured by DME scales.

Finally, and most relevant to the current study, Schiavetti et al. (1983) investigated the relationship between DME and EAI ratings of stuttering severity. In that study, samples from 20 adult speakers who stutter were rated by three groups of 15 listeners using a 7-point EAI scale or two different types of DME. The results showed that DME values were related to the interval scale values in a curvilinear fashion, consistent with a prothetic continuum. In other words, their results indicated that stuttering severity should only be measured using DME, and not EAI. One potential difficulty with this study was the reliance on one data point, which was the most severe speaker in their data set (see Schiavetti et al., 1983, Figure 1). This one speaker could have influenced the relationship between ratings on the different types of scales, as the only speaker within a given range. As a result, we sought to replicate that study using a new set of speakers and listeners, while controlling for other variables, such as the provided definition of stuttering severity.

Figure 1.

Four screenshots of user interfaces. The first screenshot is for Experiment 1 – EAI. A 7-point scale is provided for rating the speech sample in terms of the stuttering severity. The second screenshot is for Experiment 1 – DME. The user listens to audio clips and rates the severity of a speech sample. A text box is provided for this purpose. The third screenshot is for Experiment 2 – EAI. The user plays a sample and rates the severity of the sample on a 7-point scale. The fourth screenshot is for Experiment 2 – DME. The user plays a sample and a reference and rates the sample relative to the reference. A text box is provided for this purpose.

User interfaces for providing ratings. Top row shows Experiment 1 (in person), whereas bottom row shows Experiment 2 (online). Left column shows EAI scales, whereas right column shows DME scaling procedures. EAI = equal-appearing interval; DME = direct magnitude estimation.

Research Aims

This study aimed to determine validity and reliability of EAI and DME for perceptual rating of stuttering severity. We conducted two experiments with differing methods to support the external validity of our results. The two experiments varied by the number and duration of samples (set number of syllables vs. set duration) and by the training given to participants (defining stuttering severity vs. allowing participants to define severity themselves). Based on previous research, we hypothesized that there would be a nonlinear relationship between EAI and DME ratings, suggesting that DME methods are required to rate stuttering severity appropriately. Finally, to demonstrate construct validity of listener ratings, we evaluated the relationships between listener ratings and other measures, including the SSI-4 (modified for audio-only samples), percent syllables stuttered, and speech rate.

Method

Two studies were completed in which listeners rated stuttering severity of audio samples from adults who stutter. In each study, raters were assigned to use EAI or DME scales. The sample length, number, and selection varied between the studies (see below).

Participants

Sixty individuals who were inexperienced with stuttering served as listeners in this study; 30 listeners participated in Experiment 1 (Exp1) and an additional 30 listeners participated in Experiment 2 (Exp2). Inexperienced listeners were selected because they are representative of unfamiliar communication partners.

All listeners were native English speakers (reported acquiring English before the age of 2 years) between the ages of 18 and 45 years. Listeners reported no history of childhood and/or current speech/language/reading/hearing concerns or the presence of any neurological diagnoses. To be included, listeners reported no formal coursework or familiarity with stuttering. Within each experiment, listeners were assigned to rating conditions so that each of the groups had a similar mean age and number of listeners who identified as men, women, and nonbinary/genderqueer people. Listeners were recruited via flyers in the community and on job boards. Participants provided written informed consent according to the protocol approved by the University of Washington Institutional Review Board (STUDY00013472) and were compensated for their time.

Exp1

In Exp1, 15 participants completed the DME protocol; they were aged 18–26 years (M = 20.9 years) and consisted of nine women, four men, and two nonbinary people. Fifteen participants completed the EAI protocol, and they were aged 18–27 years (M = 20.4 years) and consisted of 11 women and four men.

Exp2

In Exp2, 15 participants completed the DME protocol and were aged 18–35 years (M = 22.5 years) and consisted of 11 women, three men, and one genderqueer person. Fifteen participants completed the EAI protocol and were aged 18–36 years (M = 21.5 years) and consisted of six women, five men, three nonbinary people, and one genderqueer person.

Samples

Speech samples from teenage and adult speakers who stutter (50 speakers aged 15–62 years) were acquired from corpora of recordings from FluencyBank (Bernstein Ratner & MacWhinney, 2018) and University College London Archive of Stuttered Speech (UCLASS; Howell et al., 2009). The samples were collected on typical clinical equipment (video cameras). Samples were downloaded as videos from FluencyBank in .mp4 format and as audio from UCLASS in .wav format. Audio was extracted from videos with custom MATLAB software using the audioread function. In the original files, audio was compressed in .mp4 format with the Advanced Audio Coding codec and recorded at a high-quality sampling rate (44.1 or 48 kHz). Stimuli were extracted and cropped with 0.25-s Hann windows on each end to avoid abrupt onset and offset. Cropped stimuli were peak amplitude normalized and saved as .wav files (Exp1) or .mp3 files (Exp2). Exported stimuli had 16 bits per sample, with a 44.1- or 48-kHz sampling rate as determined by the original quality. Samples were audio-only to simplify ratings; previous studies show that severity ratings are similar with and without video (Martin & Haroldson, 1992; Vogel et al., 2015; Williams et al., 1963). Samples were of speakers who stutter reading a passage. Reading (vs. conversational) samples were chosen to limit the effect of circumlocution on the amount and type of stuttering and the possible effects of topic or language on ratings.
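The onset/offset ramping and peak normalization described above can be sketched as follows. The stimuli were prepared with custom MATLAB software that is not reproduced here; this Python version is an illustration only, and it assumes that "Hann windows on each end" means half-Hann fade-in and fade-out ramps. All function names are hypothetical.

```python
import math


def apply_hann_ramps(samples, sample_rate, ramp_s=0.25):
    """Fade a signal in and out with half-Hann ramps of ramp_s seconds (assumed reading)."""
    n_ramp = int(round(ramp_s * sample_rate))
    if 2 * n_ramp > len(samples):
        raise ValueError("signal is shorter than the two ramps combined")
    out = list(samples)
    for i in range(n_ramp):
        # Rising half of a Hann window: 0 at the edge, approaching 1 at the end of the ramp.
        gain = 0.5 * (1.0 - math.cos(math.pi * i / n_ramp))
        out[i] *= gain        # fade in
        out[-1 - i] *= gain   # fade out (mirror image)
    return out


def peak_normalize(samples, peak=1.0):
    """Scale a signal so that its largest absolute value equals peak."""
    max_abs = max(abs(s) for s in samples)
    if max_abs == 0:
        return list(samples)  # silent signal: nothing to scale
    return [s * peak / max_abs for s in samples]


# Hypothetical 1-s constant signal at a 1-kHz sampling rate.
stimulus = peak_normalize(apply_hann_ramps([0.5] * 1000, 1000))
print(stimulus[0], stimulus[500], stimulus[-1])
```

Ramping before normalizing means the peak is measured on the signal as it will actually be played.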

Although it was not our aim to directly compare the EAI/DME scales to other measures (percent syllables stuttered; SSI), it was necessary to have a broad range of samples for listeners to rate, and therefore, samples were analyzed by two trained graduate student clinicians experienced in clinical assessment of stuttering using the SSI-4. For each sample, %SS and the average length of the three longest stutters were calculated, as were the SSI-4 standard scores to which those measures are converted (task score, duration score). To demonstrate the range of samples, both percent syllables stuttered and audio-only SSI scores (sum of the task score plus stuttering duration score; no physical concomitant score) are shown in Table 1.

Table 1.

Stimulus characteristics of reading samples.

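| Speaker | Which exp | Sex | Age | Accent | %SS (Exp1) | %SS (Exp2) | Audio-only SSI (Exp1) | Audio-only SSI (Exp2) |
|---|---|---|---|---|---|---|---|---|
| 16fa | 1, 2 | F | 16 | U.S. English | 8.3 | 6.1 | 16 | 13.5 |
| 16fb | 1, 2 | F | 16 | U.S. English | 10.1 | 8.3 | 16 | 14.5 |
| 16fc | 2 | F | 16 | U.S. English | | 7.5 | | 15.5 |
| 16m | 1, 2 | M | 16 | U.S. English | 6.0 | 3.0 | 13 | 12.5 |
| 17f | 1, 2 | F | 17 | U.S. English | 1.8 | 0.5 | 7.5 | 8.5 |
| 24fa | 1, 2 | F | 24 | U.S. English | 3.7 | 0.8 | 13 | 8 |
| 24fc | 1, 2 | F | 24 | Nonnative English | 11.9 | 10.5 | 18.5 | 16 |
| 24ma | 1, 2 | M | 24 | U.S. English | 2.7 | 0.5 | 8 | 5 |
| 24mb | 1, 2 | M | 24 | Eastern U.S. | 0.9 | 0.4 | 8 | 6 |
| 25m | 1, 2 | M | 25 | U.S. English | 15.6 | 20.6 | 16.5 | 18 |
| 26f | 1, 2 | F | 26 | U.S. English | 0.9 | 0.0 | 7 | 4 |
| 27f | 1, 2 | F | 27 | U.S. English | 1.8 | 0.0 | 9.5 | 4 |
| 27mb | 2 | M | 27 | U.S. English | | 2.6 | | 12 |
| 29ma | 1, 2 | M | 29 | U.S. English | 0.0 | 0.0 | 0 | 0 |
| 29mb | 1, 2 | M | 29 | U.S. English | 1.8 | 0.0 | 6.5 | 5.5 |
| 29mc | 2 | M | 29 | U.S. English | | 5.3 | | 15 |
| 32m | 2 | M | 32 | U.S. English | | 2.0 | | 12 |
| 33m | 1, 2 | M | 33 | U.S. English | 6.9 | 5.1 | 20 | 14.5 |
| 34m | 2 | M | 34 | U.S. English | | 10.8 | | 17 |
| 35mb | 1, 2 | M | 35 | U.S. English | 11.0 | 6.4 | 16.5 | 16 |
| 39f | 2 | F | 39 | U.S. English | | 10.7 | | 15 |
| 41f | 1, 2 | F | 41 | U.S. English | 7.3 | 2.9 | 19 | 18.5 |
| 42m | 1, 2 | M | 42 | U.S. English | 11.4 | 4.5 | 15 | 13.5 |
| 46ma | 1, 2 | M | 46 | U.S. English | 0.9 | 0.0 | 2 | 0 |
| 50fa | 2 | F | 50 | U.S. English | | 0.8 | | 9 |
| 50fb | 1, 2 | F | 50 | U.S. English | 6.0 | 2.1 | 19 | 16.5 |
| 54f | 2 | F | 54 | U.S. English | | 5.5 | | 11 |
| 57m | 1, 2 | M | 57 | U.S. English | 2.3 | 0.5 | 6.5 | 5 |
| 60m | 2 | M | 60 | U.S. English | | 0.5 | | 9 |
| 61m | 2 | M | 61 | Eastern U.S. | | 0.0 | | 0 |
| 62f | 2 | F | 62 | U.S. English | | 0.0 | | 4 |
| 62m | 2 | M | 62 | U.S. English | | 0.0 | | 5 |
| 503 | 2 | M | adult | U.S. English | | 0.0 | | 5.5 |
| 507 | 2 | M | adult | U.S. English | | 1.0 | | 7.5 |
| 508 | 2 | M | adult | U.S. English | | 0.0 | | 0 |
| 509 | 2 | F | adult | U.S. English | | 0.8 | | 7.5 |
| F_0101 | 2 | F | 17 | British English | | 13.7 | | 14 |
| F_0818 | 2 | F | 16 | British English | | 5.6 | | 14.5 |
| M_0061 | 2 | M | 18 | British English | | 2.1 | | 16 |
| M_0065 | 2 | M | 20 | British English | | 2.1 | | 12 |
| M_0078 | 2 | M | 17 | British English | | 6.2 | | 17.5 |
| M_0104 | 2 | M | 15 | British English | | 6.2 | | 19 |
| M_0253 | 2 | M | 15 | British English | | 1.1 | | 10 |
| M_0545 | 2 | M | 15 | British English | | 0.6 | | 10 |
| M_0760 | 2 | M | 15 | Eastern U.S. | | 1.6 | | 11 |
| M_0874 | 2 | M | 15 | British English | | 1.8 | | 9.5 |
| M_0876 | 2 | M | 15 | British English | | 2.0 | | 11 |
| M_0880 | 2 | M | 15 | British English | | 1.3 | | 13 |
| M_0999 | 2 | M | 17 | British English | | 3.0 | | 11 |
| M_1011 | 2 | M | 15 | British English | | 0.0 | | 0 |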

Note. English language experience was not available from FluencyBank or UCLASS; accentedness ratings are from one native U.S. English speaker (Midwestern) and are for reference. Ages were not available for three FluencyBank speakers. Note that when participants were in both studies, samples were of different portions of the reading. Audio-only SSI scores are the sum of the reading task score plus stuttering duration score; no physical concomitant score. Scores (%SS and audio-only SSI) are averaged across two trained raters. SSI = Stuttering Severity Instrument; F = female; M = male; UCLASS = University College London Archive of Stuttered Speech; %SS = percent stuttered syllables.

Trained graduate students were chosen to analyze the samples because this is common practice in research and the clinic, both as part of student training and to reduce the burden on expert clinicians. There is evidence that students regularly underidentify stuttering moments (e.g., Brundage et al., 2006). Therefore, 20% of the samples in Exp2 were analyzed by an expert clinician specializing in fluency with 25 years of experience. The correlation with the averaged student rating was high (Spearman's rho on audio-only SSI: .927). The expert clinician spent 6 hr carefully analyzing 12 samples, for an average of 30 min per 30-s sample. Due to the high correlation and the large burden on the expert clinician, the student ratings are used in the remainder of this study; however, the main results of the study do not rely on these ratings.

Exp1

Twenty samples were selected from the FluencyBank samples to reflect a range of stuttering severity, from very mild to severe. The %SS ranged from 0% to 15.6% (M = 5.6%). The audio-only SSI score (no physical concomitant rating) ranged from 0 to 20 (M = 11.9). These ranges indicate very mild to severe stuttering in the speech samples.

Samples for Exp1 were the same center portion of the Friuli reading passage from the SSI-4 of around 110 syllables (“Occupying the extreme northeast corner of Italy, Friuli's scenery ranges from rugged coastline along the eastern border to placid plains in the west and the majestic Alps in the north, where Italy butts up against Austria. Directly to the south is Venice, just a little more than an hour and a half away. Though off the beaten tourist track, Friuli is hard in the path of history.”). Though it is recommended to use a sample of 150 syllables for the SSI (Riley, 2009), a portion of 110 syllables (67 words) was chosen to optimize the total listening time and reduce likelihood of poor reliability due to listener fatigue. The start and end of the portion of the passage were manually selected. Readings varied from 24.5 to 69.5 s (M = 43.5) in length, with a range of 58–164 words per minute (M = 105.9).
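Because every Exp1 sample contained the same 67 words, speech rate follows directly from reading duration. The short Python sketch below (hypothetical function name) shows the conversion for the shortest and longest readings reported above.

```python
def words_per_minute(words: int, duration_s: float) -> float:
    """Convert a word count and a duration in seconds to words per minute."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return 60.0 * words / duration_s


PASSAGE_WORDS = 67  # word count of the Exp1 passage portion, as stated in the text

fastest = words_per_minute(PASSAGE_WORDS, 24.5)  # shortest reading
slowest = words_per_minute(PASSAGE_WORDS, 69.5)  # longest reading
print(round(fastest), round(slowest))  # prints "164 58"
```

These values match the reported range of 58–164 words per minute. The reported mean rate (105.9) is an average of per-sample rates, so it cannot be recovered from the mean duration alone.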

Exp2

All 50 available samples from UCLASS and FluencyBank of speakers aged 15–62 years were used as stimuli in the second experiment. The first experiment used the same subsample of reading, such that the samples were all different durations: Those with more stuttering, slower rate, and/or longer duration of stuttering moments were longer. In Exp2, we wished to remove that particular confound and instead chose 30-s samples randomly. Readings varied from 25 to 165 syllables (M = 89.5) but were all 30 s long. Rate of speech ranged from 42 to 182 words per minute (M = 114).

Samples were rated by two graduate student clinicians experienced in clinical assessment of stuttering using the SSI-4. The averaged percent syllables stuttered along with an audio-only SSI (reading task score plus stuttering duration score; no physical concomitant score) are reported in Table 1. The reading %SS ranged from 0% to 20.6% (M = 3.3%). The audio-only SSI ranged from 0 to 19. These scores indicate a range of very mild to severe stuttering across the samples.

Experimental Protocol

Samples were pseudorandomized in order. Listeners were allowed to play each speech sample as many times as they wished at a comfortable volume. Each listener was assigned to one of the two rating conditions such that demographics were roughly equivalent between conditions.

Exp1

The 30 listeners in Exp1 participated in person and passed a pure-tone hearing screening at 25 dB hearing level for pure tones at 1000, 2000, 4000, and 8000 Hz unilaterally. They attended a 30- to 60-min session at the Quantitative Imaging for Learning, Language, & Speech Lab at the University of Washington. Every listener completed a brief training at the start of the session. They read definitions of stuttering and listened to examples of varying levels of severity. The definition was as follows: “Stuttering is a speech disorder that involves disruptions in the flow of speech. People who stutter know what they want to say but have difficulty saying it. For example, you may hear a speaker: have trouble starting a sound (e.g., ‘I want *cake’ * = a pause with tension), extend a sound (e.g., ‘I wwwwant cake’), or repeat sounds or syllables (e.g., ‘d d d dog’ or ‘ba ba ba baby’).” Unused samples from FluencyBank were chosen as training stimuli to represent a range of stuttering severity.

Exp2

The 30 listeners in Exp2 participated online using the platform Zoom. They attended a 60- to 90-min session, monitored by a research assistant who was available to answer questions. In contrast to Exp1, no definitions of stuttering were provided. Listeners instead were asked to use their own judgment about what “severity” meant. They did listen to two samples prior to the rating session as exposure to stuttered speech. If the participants asked more about the definition of “severity,” they were provided with continued reassurance to rate on their own judgment.

Rating Scales

Listeners were then provided with task-specific training on use of the given rating scale. All ratings were automatically collected by custom survey software (Exp1: MATLAB, Exp2: PsychoPy hosted on Pavlovia.org).

EAI Scale

A 7-point EAI scale was used (see Figure 1). Listeners listened to each speech sample and were instructed to rate the severity of stuttering on a scale where “1” equals no stuttering and “7” equals most severe stuttering. Listeners could listen to each speech sample as many times as needed.

DME Scale

For the DME condition, listeners were instructed to listen to the reference stimulus (modulus) first and rate each speech sample in relation to the reference, which was assigned a value of 100. The reference was chosen to represent mild–moderate stuttering severity, with a %SS of 3.4. Listeners could listen to the reference at any time throughout the task but were required to play it at the beginning of the task and then after every three samples to maintain consistency. Participants listened to each speech sample and assigned a rating based on its comparison with the reference. They were instructed that if they perceived a speech sample to be twice as severe as the reference, they should rate the sample 200. If they perceived it to be half as severe, they should rate it 50. They were told that they could assign any number to the sample. Raters listened to the reference and the sample as many times as they wished.

Summary of Differences From Exp1 to Exp2

As described above, the experimental protocol in Exp2 differed from that of Exp1 in the following ways: All the participants completed the study online while being monitored by a research assistant and did not complete a hearing screening, whereas Exp1 was conducted in person and included a hearing screening. There were 50 samples rather than 20, and in order to calculate intrarater reliability, each listener rated 50 unique samples and 10 repeated samples (20%). Each sample was 30 s in length, taken from various portions of the reading sample, with a variable number of syllables. Samples in Exp1 were primarily from speakers of American English, with one speaker who appeared to speak English as an additional language. Exp2 included additional British English speakers. Unlike in Exp1, the participants were not told that the samples were of people who stutter. They did not read a definition of stuttering and instead were instructed to rate the samples on their own judgment of “severity.”

Statistical Analyses

Data cleaning and statistical analyses were performed in R (Version 4.3.2). To determine which of the two rating methods resulted in higher reliability, interrater reliability was computed for both experiments, whereas intrarater reliability was calculated only for Exp2, which included repeated samples. Pearson's correlations were used for intrarater reliability. For interrater reliability, intraclass correlation coefficients (ICCs) were calculated as two-way mixed, average-measures, consistency ICCs, that is, ICC(3,k) (Shrout & Fleiss, 1979).
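As a rough illustration of the interrater statistic named above, an average-measures consistency ICC(3,k) can be obtained from the mean squares of a two-way (samples by raters) ANOVA as (MS_samples − MS_error) / MS_samples. The Python sketch below is not the authors' R code: the function name `icc3k` and the small rating table are assumptions made for illustration, and the table is not study data.

```python
def icc3k(ratings):
    """Average-measures consistency ICC(3,k).

    ratings: list of rows, one row per speech sample, one column per rater.
    """
    n = len(ratings)       # number of rated samples
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: samples (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    # Consistency definition: rater mean differences (ss_cols) are ignored.
    return (ms_rows - ms_error) / ms_rows


# Illustrative table: 6 samples rated by 4 raters (hypothetical values).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc3k(example), 3))
```

Because the consistency form discards differences between rater means, raters who use systematically higher or lower numbers can still yield a high coefficient as long as they order the samples similarly.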

To determine the appropriateness of EAI ratings for stuttering severity, EAI mean ratings were plotted as a function of the DME geometric mean ratings according to the procedures of Stevens (1975) and Schiavetti et al. (1983). Mean ratings were plotted to determine whether the values of each scale are related linearly or curvilinearly, the latter of which would indicate that EAI scales do not capture the ends of the continuum well and should not be used (Schiavetti et al., 1983). Scatter plots were visually inspected to determine whether the relationship between EAI versus DME was linear or curvilinear. A linear regression and second- and third-order polynomial regressions were calculated to determine which model accounted for more variance. Then, an analysis of variance (ANOVA) compared models to determine which resulted in a better fit.
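The comparison described above can be sketched in a few steps: summarize each sample's DME ratings with a geometric mean, fit polynomial models of increasing order to mean EAI as a function of DME, and compare nested models with an F statistic. The Python code below is only a minimal sketch under stated assumptions; the function names (`geometric_mean`, `polyfit_rss`, `nested_f`) and all numbers are hypothetical, and the authors' actual analysis was run in R.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive DME ratings, computed via logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def polyfit_rss(x, y, degree):
    """Least-squares polynomial fit; returns (coefficients, residual SS)."""
    p = degree + 1
    # Augmented normal equations (X'X | X'y) for columns x**0 .. x**degree.
    a = [[sum(xi ** (r + c) for xi in x) for c in range(p)]
         + [sum(yi * xi ** r for xi, yi in zip(x, y))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(p):
        pivot = max(range(i, p), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(p):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [u - f * v for u, v in zip(a[r], a[i])]
    coef = [a[i][p] / a[i][i] for i in range(p)]
    rss = sum((yi - sum(b * xi ** j for j, b in enumerate(coef))) ** 2
              for xi, yi in zip(x, y))
    return coef, rss


def nested_f(x, y, small_degree, large_degree):
    """F statistic for the extra terms of the larger nested polynomial."""
    _, rss_small = polyfit_rss(x, y, small_degree)
    _, rss_large = polyfit_rss(x, y, large_degree)
    df_num = large_degree - small_degree
    df_den = len(x) - (large_degree + 1)
    return ((rss_small - rss_large) / df_num) / (rss_large / df_den)


# Hypothetical per-sample data (not study data): three DME ratings per sample
# and the corresponding mean EAI rating.
dme = [geometric_mean(r) for r in ([40, 60, 50], [90, 110, 100],
                                   [150, 220, 180], [260, 300, 340],
                                   [30, 45, 35], [120, 140, 160])]
eai = [1.8, 2.7, 4.1, 5.9, 1.5, 3.3]
f_linear_vs_quadratic = nested_f(dme, eai, 1, 2)
```

A small F statistic for the added quadratic term means the curvilinear model explains little beyond the linear one. Turning F into a p value requires the F distribution with (df_num, df_den) degrees of freedom, which is left out here to keep the sketch dependency-free.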

To demonstrate the external validity of both scales, EAI and DME ratings from Exp2 were plotted against other measures of stuttering severity. Samples were rated by two trained graduate student clinicians experienced in clinical assessment of stuttering using the SSI-4. As the listeners only heard the audio, the experienced raters in our study did not rate physical concomitants. Therefore, the three measures plotted here are percent syllables stuttered, speech rate (number of syllables counted by the raters in each 30-s sample), and a modified SSI-4: the %SS and the duration of the three longest disfluencies converted to standard scores and added together. This represents an audio-only SSI, which does not include a physical concomitant rating. The strength of these relationships was tested with nonparametric Spearman's rank coefficients.
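Spearman's rank coefficient, used for the relationships described above, is the Pearson correlation computed on ranks, with tied values sharing the average of their rank positions. Ties matter for measures such as %SS, where several samples can share the same value. The following Python sketch assumes hypothetical function names (`average_ranks`, `spearman_rho`) and made-up numbers; it illustrates the statistic and is not the authors' analysis code.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values (not study data): %SS and mean EAI rating per sample.
percent_ss = [0.0, 0.0, 1.5, 3.4, 7.2, 12.0]
mean_eai = [1.1, 1.6, 2.4, 3.0, 4.8, 6.2]
rho = spearman_rho(percent_ss, mean_eai)
```

Because the coefficient depends only on rank order, it is unaffected by whether the underlying relationship is linear or curved, provided it is monotonic.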

Results

Thirty participants completed Exp1, and 30 completed Exp2. Listeners rated samples using one of two scales: EAI or DME. Scales are compared in terms of interrater reliability, intrarater reliability (Exp2 only), and construct validity by examining the relationships between EAI and DME. To provide an empirical comparison of results, we also extracted data from Schiavetti et al. (1983) and evaluated them with identical statistical methods.

Interrater Reliability

Average-measures ICCs were computed to assess interrater reliability and to allow comparison with Schiavetti et al. (1983). All values indicated excellent reliability: for Exp1, EAI = .99 and DME = .94; for Exp2, EAI = .98 and DME = .92.

Intrarater Reliability

Pearson's correlations were computed to measure intrarater reliability in Exp2 (10 samples repeated; 20%). EAI reliability ranged from .84 to .97 (M = .92). DME had one outlier with low reliability, for a range of .39 to 1.00 (M = .88).
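To make the intrarater computation concrete, the sketch below correlates each rater's first and repeated ratings of the same samples and then reports the range and mean across raters. It is written in Python under stated assumptions: the names `pearson_r` and `intrarater_summary`, the rater IDs, the ratings, and the use of a simple arithmetic mean of the coefficients are all hypothetical and are not taken from the authors' R analysis.

```python
import math


def pearson_r(first, second):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(first)
    mean_a = sum(first) / n
    mean_b = sum(second) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(first, second))
    var_a = sum((a - mean_a) ** 2 for a in first)
    var_b = sum((b - mean_b) ** 2 for b in second)
    return cov / math.sqrt(var_a * var_b)


def intrarater_summary(ratings_by_rater):
    """Map rater ID -> (first_pass, repeat_pass); return r per rater and summary."""
    r_values = {rater: pearson_r(a, b)
                for rater, (a, b) in ratings_by_rater.items()}
    values = list(r_values.values())
    return r_values, min(values), max(values), sum(values) / len(values)


# Hypothetical EAI ratings for two raters on 5 repeated samples (not study data).
demo = {
    "rater_01": ([1, 3, 4, 6, 7], [1, 3, 5, 6, 7]),
    "rater_02": ([2, 2, 5, 6, 6], [1, 3, 4, 6, 7]),
}
per_rater, lowest, highest, mean_r = intrarater_summary(demo)
```

Reporting the range alongside the mean, as in the text above, makes a single inconsistent rater visible rather than hiding it in the average.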

Model of Best Fit—EAI Versus DME

Figure 2 shows the EAI data plotted against the DME data for each experiment. Visual inspection of results from both Exp1 and Exp2 suggests a linear relationship between the two rating scales. For Exp1, the linear model had an R² = .9702. The curvilinear model accounted for nearly the same amount of variance (R² = .9686). An ANOVA comparing these two models showed that the curvilinear model did not result in a significantly better fit (F = 0.0955, p = .761).

Figure 2.

Three panels plot mean EAI rating (y-axis) against mean DME rating (x-axis); each has a dotted vertical line (at DME = 100 for Experiments 1 and 2 and at DME = 10 for Schiavetti et al., 1983). Fitted models from the legends are as follows. Experiment 1: linear, EAI = 1.07 + 0.016 × DME; curvilinear, EAI = 1.11 + 0.015 × DME + 0.000003 × DME². Experiment 2: linear, EAI = 0.56 + 0.022 × DME; curvilinear, EAI = 0.73 + 0.019 × DME + 0.00001 × DME². Schiavetti et al. (1983), fits with the outlier: linear, EAI = 1.69 + 0.078 × DME; curvilinear, EAI = 1.14 + 0.15 × DME − 0.001 × DME². Schiavetti et al. (1983), fits without the outlier: linear, EAI = 1.27 + 0.12 × DME; curvilinear, EAI = 1.08 + 0.17 × DME − 0.002 × DME².

Comparison between mean DME ratings (x-axis) and EAI ratings (y-axis). Top left shows results from Experiment 1, and top right shows results from Experiment 2. Bottom panel shows data extracted from Schiavetti et al. (1983). Linear fits are shown in gray, whereas curvilinear (polynomial) fits are shown in green. Bottom panel also includes linear and polynomial fits of data without the most severe sample (black and dark green). All model equations are shown in legends. EAI = equal-appearing interval; DME = direct magnitude estimation.

Similar results were found in Exp2, in which the linear model had an R² = .9082 and the curvilinear model had an R² = .9075. An ANOVA comparing these two models showed that the curvilinear model did not result in a significantly better fit (F = 0.6607, p = .42). In other words, for both Exp1 and Exp2, the linear models best fit the data, suggesting that either type of scale could be used to validly measure stuttering severity as perceived by listeners in this study.

To verify our methods and compare our results with those previously reported by Schiavetti et al. (1983), we extracted the data from Figure 1 in Schiavetti et al. (1983) with Plotdigitizer Online App (2024). The same methods as above were run with and without the outlier from their study (see bottom of Figure 2 for plot of extracted data). With the outlier (as in the original paper), the linear model resulted in R² = .8391, while the curvilinear model had an R² = .9583. In this case, the curvilinear model did provide a significantly better fit (F = 52.5, p < .0001). When the most severe speaker (perhaps an outlier) was removed, however, the two models accounted for similar amounts of variance: R² = .925 for the linear model and R² = .9368 for the curvilinear model. An ANOVA indicated that the improvement in fit for the curvilinear model fell just short of significance (F = 4.17, p = .06).

Correlations Between Severity Ratings and Sample Characteristics

Relationships between severity ratings and other measures of sample characteristics are shown in Figure 3. Figures 3A and 3B show listener ratings as predicted by SSI-4 (modified) ratings. Spearman's rank correlations between SSI-4 (modified) measures and listener severity ratings were very strong for both the EAI and DME scales: ρ = .92 and ρ = .89, respectively (both p < .0001). Figures 3C and 3D show listener severity ratings versus the percentage of syllables stuttered. These also have very strong correlations, with ρ = .91 for each scale (both p < .0001). Finally, Figures 3E and 3F show listener ratings of severity against speech rate (number of syllables produced in 30 s). These have negative linear relationships, with very strong correlations of ρ = −.83 between speech rate and severity ratings using the EAI scale and ρ = −.84 between speech rate and perceived severity using the DME scale (both p < .0001).

Figure 3.

Six scatterplots. The left column (A, C, E) plots mean EAI rating (y-axis, 1 to 7), and the right column (B, D, F) plots mean DME rating (y-axis, 0 to 250). Plots A and B: SSI-4 without concomitant rating (x-axis, 0 to 15); positive correlations, ρ = .92 (EAI) and ρ = .89 (DME). Plots C and D: stuttering rate in percent syllables stuttered (x-axis, 0 to 20); positive correlations, ρ = .91 (EAI) and ρ = .91 (DME). Plots E and F: speech rate in syllables in 30 seconds (x-axis, 30 to 120); negative correlations, ρ = −.83 (EAI) and ρ = −.84 (DME).

Relationships between mean listener ratings and audio-only SSI-4 (top), stuttering rate (percent syllables stuttered; center), and speech rate (syllables in 30 s; bottom) in Experiment 2. Left column shows relationships to EAI ratings; right column shows relationships to DME ratings. SSI-4 = Stuttering Severity Instrument–Fourth Edition; EAI = equal-appearing interval; DME = direct magnitude estimation.

Discussion

The current study examined the validity and reliability of EAI and DME scaling for perceptual rating of stuttering severity to determine appropriate rating procedures in research and clinical contexts. Experiments 1 and 2 compared EAI and DME ratings of samples of stuttering; between the experiments, the samples differed by the number of speakers and speaker characteristics (e.g., accent), the instructions about stuttering severity given to listeners, and the portions of the audio sample selected (matched for syllables vs. matched for duration). Results suggested that linear models consistently provided the best fit for the data comparing EAI versus DME ratings. Both the current studies and the Schiavetti et al. (1983) study resulted in high interrater reliability. Intrarater reliability was also strong (> .75) for all but one DME rater. Taken together, these suggest that either EAI or DME is acceptable for rating stuttering severity; we conclude that the EAI method is preferable given its speed and ease of use.

Relationship to Previous Work

Earlier research indicated that stuttering severity could only be appropriately rated with DME scaling (Schiavetti et al., 1983). Exp1 was designed as a standalone experiment that was expected to replicate the Schiavetti et al. (1983) findings and serve as the basis for ongoing scale development. When the results did not replicate, a variety of confounds were raised by the study team. For example, it has been noted that speech rate is a metathetic percept, at least in fluent speech of people who stutter (Metz et al., 1990). We were concerned that listeners were biased toward attending to speech rate over stuttering severity due to the specific definition of stuttering provided (“Stuttering is a speech disorder that involves disruptions in the flow of speech … ”), or due to the inherent relationship between speech rate and sample length: Exp1 had samples that were the same length in syllables but differed in the overall duration of the sample. Finally, we had failed to include enough repeated samples to calculate intrarater reliability. Therefore, Exp2 was designed to alter these particular confounds by removing the stuttering definition and by making all samples the same duration (30 s; although then they had different numbers of syllables). In terms of having instructions at all (rather than concerns about any specific given definition), there is evidence that the presence of instructions affects the number of stuttering moments identified (Martin & Haroldson, 1981), but it was as yet undetermined if the presence of instructions would affect the relationship between DME and EAI ratings. Our results suggest that rating scales were equally affected by instructions such that the relationship between them stayed the same. Given the replication of our findings across different samples and experimental conditions (i.e., a linear relationship between DME and EAI), we are confident in these results.

There are several possible remaining explanations for the discrepancies between our findings and those of Schiavetti and colleagues. The foremost difference may be one of sample distribution. Our samples had a uniform spread of severity according to our raters: EAI scores ranged from 1 to 6.8 in Exp1 and from 1.07 to 6.4 in Exp2 (see y-axis of Figure 2). In this range, 25% of samples in Exp1 and 10% of samples in Exp2 had a rating above 5.2. In the Schiavetti et al. study, 19 of 20 of their samples were fairly evenly spaced below 5.2, and then one sample was 6.7; therefore, we surmise that this one sample might have been perceived more like an outlier to listeners, rather than as one of many more severe samples, as in the current study. Accordingly, when we extracted the data from Schiavetti et al. and ran the models without that one very severe sample, we found that a curvilinear model did not fit the data significantly better than a linear model; that is, without that sample, Schiavetti et al. would have concluded that EAI captured the data as well as the DME procedures. Yet, without that sample, their range would have been limited, rendering it difficult to assess the relationship across the severity continuum.

Other aspects of our study design may also help explain why our results differ from those of Schiavetti et al. (1983). For example, Schiavetti and colleagues recruited listeners who were enrolled in an undergraduate course in stuttering, while the listeners in the present set of studies were naive and had no education in stuttering. Although one study has shown that naive listeners rate stuttering similarly to other groups including speech-language pathology students, speech-language pathologists, and stuttering specialists (Amir et al., 2018), other studies have found conflicting results (Cordes & Ingham, 1994; Kully & Boberg, 1988). This difference in rater experience could help explain some of the differences between our results and those of Schiavetti et al. (1983).

Ratings Compared to Other Measurements

Correlations between the average of 15 untrained raters' perceptions of severity using both EAI and DME scales and SSI-4 scores (modified for audio-only; does not include physical concomitants), %SS, and speech rate were all very strong (|ρ| ≥ .83), which aligns with previous literature (K. E. Lewis, 1995; O'Brian, Packman, Onslow, & O'Brian, 2004; Young, 1961). These results suggest that the listeners were capturing important dimensions of stuttering severity with much faster, simpler ratings that do not require specialized training, demonstrating possible clinical and research outcome benefits of such measures. These strong relationships add weight to our conclusion that EAI and DME both demonstrate construct validity.

Listener ratings may be more sensitive than stuttering counts or the SSI. For example, ratings are nonlinearly (logarithmically) related to the modified SSI-4 (see Figures 3A and 3B), such that the less severe samples reach a floor on the SSI but are still differentiated by listeners. Similarly, the ratings are nonlinearly (exponentially) related to the percent syllables stuttered (see Figures 3C and 3D). Notably, while five of 50 samples in Exp2 had a rating of 0%SS and therefore also 0 SSI, listeners still rated these samples between 1.06 and 2.33 on the EAI scale. This indicates that factors of speech other than number/duration of stuttering moments may be influencing listeners' perceptions of severity. For example, the final measure, speech rate, shows a linear relationship with listener ratings, suggesting that perhaps speech rate itself is salient to the listeners. The relationship of speech rate and stuttering severity has been the subject of a long line of research (Aron, 1967; O. N. Bloodstein, 1944; Young, 1961).

The correlation with speech rate is of interest here because speech rate has been shown to relate strongly to speech naturalness in people who stutter following a residential speech therapy program (Metz et al., 1990). In that study, samples from people who stutter that were deemed fluent were rated with EAI and DME. They found that EAI and DME ratings of speech naturalness showed a linear relationship and thus concluded that speech naturalness is a metathetic percept that could be scaled by either method (Metz et al., 1990). Speech rate was shown to be predictive of speech naturalness in these samples. In the current study, speech rate was also correlated with listener ratings; it is possible that listeners are primarily cued into rate, a possibly metathetic percept (at least of fluent speech, per Metz et al., 1990). However, speech rate was also well correlated with SSI-4 (modified) and %SS, which is expected: Samples with more stuttering also had fewer syllables completed in 30 s. Future studies could explicitly measure naturalness and severity to determine the relationships between these percepts.

Role of Global Severity Rating Scales in the Clinic

In clinical settings, adults who stutter often provide subjective accounts of their stuttering experiences, detailing frequency, intensity, and variability across different speaking situations. These self-reports offer valuable insights into the perceived impact of stuttering on daily communication. Clinicians often aim to quantify these subjective impressions through the SSI-4, while also noting informal observations including whether the client maintains their intended words and navigates stuttering events efficiently or exhibits avoidance or delay behaviors indicative of aversive reactions. Although the SSI-4 offers a structured way to assess stuttering severity, it is rarely repeated during therapy due to its time demands, unreliability, and limited ability to reflect moment-to-moment variability (Davidow, 2021). Quick, reliable measures that capture not only physical features of stuttering, such as frequency and duration, but also naturalness—used alongside clinical observation and client input—could enhance therapy by tracking change over time and deepening mutual understanding between clinician and client.

Limitations

Given the results of the current study, it appears that researchers may indeed use mean EAI or geometric mean DME measures to scale stuttering severity. However, in research studies, we are inherently measuring the reliability of averaged estimates. We did not ascertain the reliability or validity of single estimates (by clinicians, people who stutter, or naive listeners) and therefore cannot recommend individual ratings as reliable or accurate. Additional studies would be needed to determine the utility of single ratings for treatment tracking, or the number of raters needed to provide a stable rating (e.g., Abur et al., 2019). These types of studies would be important for establishing optimal numbers of raters when designing treatment studies.

We also used audio-only samples, both as a model of a phone call with an unfamiliar conversation partner and because earlier studies suggest that ratings are similar between audio and video samples (Martin & Haroldson, 1992; Vogel et al., 2015; Williams et al., 1963). As more conversations in the workplace and family life occur over videochat, future research may wish to evaluate ratings of video samples, which would also include physical concomitants. In addition, our samples were of readings; conversational samples from the same participants exist, and follow-up studies should evaluate rating differences between reading and conversational samples.

Finally, severity measures themselves may not capture meaningful information for a given clinical or research purpose and certainly do not capture vital information about the experience of stuttering. Patient-reported outcomes should always be used in complement to listener ratings.

Conclusions

This study provides evidence that both DME and EAI scales are appropriate for assessing stuttering severity in adult speakers, as rated by unfamiliar listeners. This challenges an earlier finding that suggested that only DME was appropriate for ratings of severity. However, further research is necessary to determine the optimal number of raters needed for stable measurement and whether and how these ratings could complement current clinical protocols.

Data Availability Statement

The data set generated and analyzed during the current study is available from the corresponding author on reasonable request.

Acknowledgments

Portions of this project were supported through pilot funding to G.J.C. from the Eunice Kennedy Shriver National Institute of Child Health & Human Development of the National Institutes of Health under Award Number P2CHD101899. FluencyBank curation was supported by NIH NIDCD R01-DC015494. The authors wish to thank other members of the Quantitative Imaging for Learning, Language, & Speech Lab who recruited, screened, and ran participants on the protocol. This included Sophia Banel, Jiahe Zhang, Iris Mendoza Luna, and Haley Kindelberger.

References

  1. Abur, D., Enos, N. M., & Stepp, C. E. (2019). Visual analog scale ratings and orthographic transcription measures of sentence intelligibility in Parkinson's disease with variable listener exposure. American Journal of Speech-Language Pathology, 28(3), 1222–1232. https://doi.org/10.1044/2019_AJSLP-18-0275
  2. Amir, O., Shapira, Y., Mick, L., & Yaruss, J. S. (2018). The Speech Efficiency Score (SES): A time-domain measure of speech fluency. Journal of Fluency Disorders, 58, 61–69. https://doi.org/10.1016/j.jfludis.2018.08.001
  3. Aron, M. L. (1967). The relationships between measurements of stuttering behaviour. South African Journal of Communication Disorders, 14(1), 15–34. https://doi.org/10.4102/sajcd.v14i1.441
  4. Bainbridge, L. A., Stavros, C., Ebrahimian, M., Wang, Y., & Ingham, R. J. (2015). The efficacy of stuttering measurement training: Evaluating two training programs. Journal of Speech, Language, and Hearing Research, 58(2), 278–286. https://doi.org/10.1044/2015_JSLHR-S-14-0200
  5. Barsties, B., & De Bodt, M. (2015). Assessment of voice quality: Current state-of-the-art. Auris Nasus Larynx, 42(3), 183–188. https://doi.org/10.1016/j.anl.2014.11.001
  6. Bernstein Ratner, N., & MacWhinney, B. (2018). FluencyBank: A new resource for fluency research and practice. Journal of Fluency Disorders, 56, 69–80. https://doi.org/10.1016/j.jfludis.2018.03.002
  7. Berry, R. C., & Silverman, F. H. (1972). Equality of intervals on the Lewis–Sherman scale of stuttering severity. Journal of Speech and Hearing Research, 15(1), 185–188. https://doi.org/10.1044/JSHR.1501.185
  8. Blood, G. W., & Blood, I. M. (2016). Long-term consequences of childhood bullying in adults who stutter: Social anxiety, fear of negative evaluation, self-esteem, and satisfaction with life. Journal of Fluency Disorders, 50, 72–84. https://doi.org/10.1016/j.jfludis.2016.10.002
  9. Bloodstein, O., Ratner, N. B., & Brundage, S. B. (2021). A handbook on stuttering (7th ed.). Plural. https://www.pluralpublishing.com/publications/a-handbook-on-stuttering-seventh-edition
  10. Bloodstein, O. N. (1944). Studies in the psychology of stuttering: XIX. The relationship between oral reading rate and severity of stuttering. Journal of Speech Disorders, 9, 161–173. https://doi.org/10.1044/jshd.0902.161
  11. Brundage, S. B., Bothe, A. K., Lengeling, A. N., & Evans, J. J. (2006). Comparing judgments of stuttering made by students, clinicians, and highly experienced judges. Journal of Fluency Disorders, 31(4), 271–283. https://doi.org/10.1016/j.jfludis.2006.07.002
  12. Brundage, S. B., Ratner, N. B., Boyle, M. P., Eggers, K., Everard, R., Franken, M.-C., Kefalianos, E., Marcotte, A. K., Millard, S., Packman, A., Vanryckeghem, M., & Yaruss, J. S. (2021). Consensus guidelines for the assessments of individuals who stutter across the lifespan. American Journal of Speech-Language Pathology, 30(6), 2379–2393. https://doi.org/10.1044/2021_AJSLP-21-00107
  13. Carter, A., Breen, L., Yaruss, J. S., & Beilby, J. (2017). Self-efficacy and quality of life in adults who stutter. Journal of Fluency Disorders, 54, 14–23. https://doi.org/10.1016/j.jfludis.2017.09.004
  14. Cordes, A. K. (2000). Individual and consensus judgments of disfluency types in the speech of persons who stutter. Journal of Speech, Language, and Hearing Research, 43(4), 951–964. https://doi.org/10.1044/jslhr.4304.951
  15. Cordes, A. K., & Ingham, R. J. (1994). The reliability of observational data: II. Issues in the identification and measurement of stuttering events. Journal of Speech and Hearing Research, 37(2), 279–294. https://doi.org/10.1044/jshr.3702.279
  16. Craig, A., Hancock, K., Tran, Y., Craig, M., & Peters, K. (2002). Epidemiology of stuttering in the community across the entire life span. Journal of Speech, Language, and Hearing Research, 45(6), 1097–1105. https://doi.org/10.1044/1092-4388(2002/088)
  17. Croft, R. L., & Byrd, C. T. (2020). Self-compassion and quality of life in adults who stutter. American Journal of Speech-Language Pathology, 29(4), 2097–2108. https://doi.org/10.1044/2020_AJSLP-20-00055
  18. Davidow, J. H. (2021). Reliability and similarity of the Stuttering Severity Instrument–Fourth Edition and a global severity rating scale. Speech, Language and Hearing, 24(1), 20–27. https://doi.org/10.1080/2050571X.2020.1730545
  19. Eadie, T. L., & Doyle, P. C. (2002). Direct magnitude estimation and interval scaling of pleasantness and severity in dysphonic and normal speakers. The Journal of the Acoustical Society of America, 112(6), 3014–3021. https://doi.org/10.1121/1.1518983
  20. Engen, T. (1971). Psychophysics: II. Scaling methods. In J. W. Kling & L. A. Riggs (Eds.), Woodworth & Schlosberg's experimental psychology. Holt, Rinehart and Winston.
  21. Gerlach, H., Chaudoir, S. R., & Zebrowski, P. M. (2021). Relationships between stigma-identity constructs and psychological health outcomes among adults who stutter. Journal of Fluency Disorders, 70, Article 105842. https://doi.org/10.1016/j.jfludis.2021.105842
  22. Howell, P., Davis, S., & Bartrip, J. (2009). The University College London Archive of Stuttered Speech (UCLASS). Journal of Speech, Language, and Hearing Research, 52(2), 556–569. https://doi.org/10.1044/1092-4388(07-0129)
  23. Ingham, R. J., Cordes, A. K., & Gow, M. L. (1993). Time-interval measurement of stuttering: Modifying interjudge agreement. Journal of Speech and Hearing Research, 36(3), 503–515. https://doi.org/10.1044/jshr.3603.503
  24. Kalinowski, J., Noble, S., Armson, J., & Stuart, A. (1994). Pretreatment and posttreatment speech naturalness ratings of adults with mild and severe stuttering. American Journal of Speech-Language Pathology, 3(2), 61–66. https://doi.org/10.1044/1058-0360.0302.61
  25. Kent, R. D. (1996). Hearing and believing: Some limits to the auditory–perceptual assessment of speech and voice disorders. American Journal of Speech-Language Pathology, 5(3), 7–23. https://doi.org/10.1044/1058-0360.0503.07
  26. Kully, D., & Boberg, E. (1988). An investigation of interclinic agreement in the identification of fluent and stuttered syllables. Journal of Fluency Disorders, 13(5), 309–318. https://doi.org/10.1016/0094-730X(88)90001-0
  27. Lewis, D., & Sherman, D. (1951). Measuring the severity of stuttering. Journal of Speech and Hearing Disorders, 16(4), 320–326. https://doi.org/10.1044/jshd.1604.320
  28. Lewis, K. E. (1995). Do SSI-3 scores adequately reflect observations of stuttering behaviors? American Journal of Speech-Language Pathology, 4(4), 46–59. https://doi.org/10.1044/1058-0360.0404.46
  29. Manning, W., & Gayle Beck, J. (2013). The role of psychological processes in estimates of stuttering severity. Journal of Fluency Disorders, 38(4), 356–367. https://doi.org/10.1016/j.jfludis.2013.08.002
  30. Martin, R. R. (1965). Direct magnitude–estimation judgments of stuttering severity using audible and audible-visible speech samples. Speech Monographs, 32(2), 169–177. https://doi.org/10.1080/03637756509375449
  31. Martin, R. R., & Haroldson, S. K. (1981). Stuttering identification: Standard definition and moment of stuttering. Journal of Speech and Hearing Research, 24(1), 59–63. https://doi.org/10.1044/jshr.2401.59
  32. Martin, R. R., & Haroldson, S. K. (1992). Stuttering and speech naturalness: Audio and audiovisual judgments. Journal of Speech and Hearing Research, 35(3), 521–528. https://doi.org/10.1044/jshr.3503.521
  33. McColl, D., & Fucci, D. (2006). Measurement of speech disfluency through magnitude estimation and interval scaling. Perceptual and Motor Skills, 102(2), 454–460. https://doi.org/10.2466/pms.102.2.454-460
  34. Metz, D. E., Schiavetti, N., & Sacco, P. R. (1990). Acoustic and psychophysical dimensions of the perceived speech naturalness of nonstutterers and posttreatment stutterers. Journal of Speech and Hearing Disorders, 55(3), 516–525. https://doi.org/10.1044/jshd.5503.516
  35. Naylor, R. V. (1953). A comparative study of methods of estimating the severity of stuttering. Journal of Speech and Hearing Disorders, 18(1), 30–37. https://doi.org/10.1044/jshd.1801.30
  36. O'Brian, S., Heard, R., Onslow, M., Packman, A., Lowe, R., & Menzies, R. G. (2020). Clinical trials of adult stuttering treatment: Comparison of percentage syllables stuttered with self-reported stuttering severity as primary outcomes. Journal of Speech, Language, and Hearing Research, 63(5), 1387–1394. https://doi.org/10.1044/2020_JSLHR-19-00142
  37. O'Brian, S., Packman, A., & Onslow, M. (2004). Self-rating of stuttering severity as a clinical tool. American Journal of Speech-Language Pathology, 13(3), 219–226. https://doi.org/10.1044/1058-0360(2004/023)
  38. O'Brian, S., Packman, A., Onslow, M., & O'Brian, N. (2004). Measurement of stuttering in adults: Comparison of stuttering-rate and severity-scaling methods. Journal of Speech, Language, and Hearing Research, 47(5), 1081–1087. https://doi.org/10.1044/1092-4388(2004/080)
  39. Onslow, M., Jones, M., O'Brian, S., Packman, A., Menzies, R., Lowe, R., Arnott, S., Bridgman, K., de Sonneville, C., & Franken, M. C. (2018). Comparison of percentage of syllables stuttered with parent-reported severity ratings as a primary outcome measure in clinical trials of early stuttering treatment. Journal of Speech, Language, and Hearing Research, 61(4), 811–819. 10.1044/2017_JSLHR-S-16-0448 [DOI] [Google Scholar]
  40. Plotdigitizer Online App. (2024). [Computer software]. PORBITAL. https://plotdigitizer.com/app
  41. Riley, G. D. (2009). Stuttering Severity Instrument–Fourth Edition (SSI-4) [Computer software]. http://www.therapro.com/Browse-Category/Fluency-Speech-Mechanisms/Stuttering-Severity-Instrument-Fourth-Edition-SSI-4.html
  42. Schiavetti, N., Sacco, P. R., Metz, D. E., & Sitler, R. W. (1983). Direct magnitude estimation and interval scaling of stuttering severity. Journal of Speech and Hearing Research, 26(4), 568–573. 10.1044/jshr.2604.568 [DOI] [PubMed] [Google Scholar]
  43. Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428. 10.1037/0033-2909.86.2.420 [DOI] [PubMed] [Google Scholar]
  44. Stevens, S. S. (1975). Psychophysics: Introduction to its perceptual, neural, and social prospects. Wiley. [Google Scholar]
  45. Stipancic, K. L., Whelan, B.-M., Laur, L., Zhao, Y., Rohl, A., Choi, I., & Kuruvilla-Dugdale, M. (2024). Tipping the scales: Indiscriminate use of interval scales to rate diverse dysarthric features. Journal of Speech, Language, and Hearing Research, 67(10), 3673–3685. 10.1044/2024_JSLHR-23-00785 [DOI] [Google Scholar]
  46. Tichenor, S. E., & Yaruss, J. S. (2021). Variability of stuttering: Behavior and impact. American Journal of Speech-Language Pathology, 30(1), 75–88. 10.1044/2020_AJSLP-20-00112 [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Toner, M. A., & Emanuel, F. W. (1989). Direct magnitude estimation and equal appearing interval scaling of vowel roughness. Journal of Speech and Hearing Research, 32(1), 78–82. 10.1044/jshr.3201.78 [DOI] [PubMed] [Google Scholar]
  48. Tuthill, C. (1940). A quantitative study of extensional meaning with special reference to stuttering. Journal of Speech Disorders, 5(2), 189–191. 10.1044/jshd.0502.189 [DOI] [Google Scholar]
  49. Van Borsel, J., & Eeckhout, H. (2008). The speech naturalness of people who stutter speaking under delayed auditory feedback as perceived by different groups of listeners. Journal of Fluency Disorders, 33(3), 241–251. 10.1016/j.jfludis.2008.06.004 [DOI] [PubMed] [Google Scholar]
  50. Vogel, A. P., Block, S., Kefalianos, E., Onslow, M., Eadie, P., Barth, B., Conway, L., Mundt, J. C., & Reilly, S. (2015). Feasibility of automated speech sample collection with stuttering children using interactive voice response (IVR) technology. International Journal of Speech-Language Pathology, 17(2), 115–120. 10.3109/17549507.2014.923511 [DOI] [PubMed] [Google Scholar]
  51. Williams, D. E., Wark, M., & Minifie, F. D. (1963). Ratings of stuttering by audio, visual, and audiovisual cues. Journal of Speech and Hearing Research, 6(1), 91–100. 10.1044/jshr.0601.91 [DOI] [PubMed] [Google Scholar]
  52. Wingate, M. E. (1964). A standard definition of stuttering. Journal of Speech and Hearing Disorders, 29(4), 484–489. 10.1044/jshd.2904.484 [DOI] [PubMed] [Google Scholar]
  53. Yairi, E., & Ambrose, N. (2013). Epidemiology of stuttering: 21st century advances. Journal of Fluency Disorders, 38(2), 66–87. 10.1016/j.jfludis.2012.11.002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Yaruss, J. S. (1997). Clinical measurement of stuttering behaviors. Contemporary Issues in Communication Science and Disorders, 24(Spring), 27–38. 10.1044/cicsd_24_S_27 [DOI] [Google Scholar]
  55. Yaruss, J. S. (1998). Real-time analysis of speech fluency: Procedures and reliability training. American Journal of Speech-Language Pathology, 7(2), 25–37. 10.1044/1058-0360.0702.25 [DOI] [Google Scholar]
  56. Young, M. A. (1961). Predicting ratings of stuttering severity. Journal of Speech and Hearing Disorders. Monograph Supplement, 7, 31–54. [Google Scholar]
  57. Zraick, R. I., & Liss, J. M. (2000). A comparison of equal-appearing interval scaling and direct magnitude estimation of nasal voice quality. Journal of Speech, Language, and Hearing Research, 43(4), 979–988. 10.1044/jslhr.4304.979 [DOI] [Google Scholar]

Data Availability Statement

The data set generated and analyzed during the current study is available from the corresponding author on reasonable request.


Articles from Journal of Speech, Language, and Hearing Research : JSLHR are provided here courtesy of American Speech-Language-Hearing Association
