Author manuscript; available in PMC: 2024 Mar 1.
Published in final edited form as: Perspect ASHA Spec Interest Groups. 2022 Jul 22;7(4):1275–1283. doi: 10.1044/2022_persp-21-00332

Graduate Student Clinicians’ Perceptions of Child Speech Sound Errors

Seyoung Jung a,b, Linye Jing a, Maria Grigos a
PMCID: PMC10907014  NIHMSID: NIHMS1967881  PMID: 38433852

Abstract

Purpose:

Speech-language pathologists (SLPs) rely on auditory perception to form judgments about child speech. This task can be challenging for graduate student clinicians, who must often judge children’s speech errors despite limited clinical experience. This study examined how consistently graduate student clinicians used a 3-point perceptual rating scale to judge child speech.

Method:

Twenty-four graduate student clinicians rated single words produced by children with typically developing speech and language skills and children with speech sound disorders. All participants rated the productions using a 3-point scale, where “2” was an accurate production, “1” was a close approximation, and “0” was an inaccurate production. Ratings were solely based on the auditory signal. These ratings were compared to a consensus rating formed by two experienced SLPs.

Results:

Graduate student clinicians reached substantial agreement with the expert SLP rating. They reached the highest percentage agreement when rating accurate productions, and the lowest agreement when rating inaccurate productions.

Conclusions:

Graduate student clinicians reached substantial agreement with the expert SLP rating when judging child speech using a 3-point scale, provided that detailed descriptions of each rating category were available. These results are consistent with previous findings on the role that clinical experience plays in speech error perception tasks and highlight the need for additional listening training in speech-language pathology graduate programs.


Auditory perception is an essential tool used by speech-language pathologists (SLPs) in judging children’s speech errors. It can take years of training, however, for clinicians to adequately refine their listening skills to accurately and consistently identify fine details in errored speech. Previous studies have shown that clinicians with more experience perform better in judging speech errors than inexperienced listeners (Munson et al., 2012; Wolfe et al., 2003). This presents a challenge as many graduate students with limited clinical experience often need to judge children’s speech using their auditory perception as part of supervised assessments and interventions early in their graduate program. Thus, there is a need to better understand how reliable graduate students are in judging speech errors in children. The primary purpose of this study was to examine how consistently graduate student clinicians used a 3-point perceptual rating scale to judge child speech.

Researchers have used varied approaches to study the influence of clinical experience on perceptual judgments of speech by comparing listeners with some experience to listeners with little or no experience (Klein et al., 2012; Munson et al., 2012; Wolfe et al., 2003). Munson et al. (2012) compared experienced SLPs who had an average of 13 years of clinical experience to students with no background in speech-language pathology. In the experiment, both groups judged phonetic details in children’s speech using a visual analog scale (VAS). The participants listened to children’s productions of /s/, /θ/, /t/, /k/, /d/, and /g/ in consonant–vowel (CV) word forms and identified how far the productions were from the two ideal targets (i.e., /s/–/θ/, /t/–/k/, /d/–/g/) displayed on each end of the VAS. The results showed that the group of experienced SLPs had higher interrater reliability when rating all sets of stimuli as compared to the student listeners. Ratings by the SLPs were also more highly correlated with acoustic characteristics of the stimuli than those from the students. These results demonstrated that the experienced listeners were more accurate and consistent in identifying phonetic details in child speech than the group with no background in speech-language pathology (Munson et al., 2012).

To capture how student perceptions may vary, Wolfe et al. (2003) compared three groups of speech-language pathology graduate students with varied degrees of clinical experience in judging children’s /r/ and /w/ productions. Groups 1 and 2 each had clinical experience, although they differed in the populations they were exposed to. In particular, Group 1 had experience with remediation of /r/ errors, whereas Group 2 did not. Group 3 had no clinical experience. The students were presented with stimuli that varied in spectral and temporal parameters of /r/ and /w/ (in word-initial position of CVC stimuli) and were instructed to judge whether the phonemes were closer to /r/ or /w/ using a binary rating format. The results showed that the group with /r/ treatment experience (Group 1) had stronger phonetic perception for /r/ and /w/ phonemes and was more sensitive to the acoustic cues for /r/ and /w/ than the other groups of students. The group with /r/ experience also demonstrated greater consistency in their sound identification compared to the other two groups (Wolfe et al., 2003), suggesting that students with more experience were more accurate in forming judgments based on their auditory perception.

The binary ratings used in many studies do not offer much flexibility for examining intermediary changes between correct and incorrect sound productions. By employing a 3-point rating scale, Klein et al. (2012) studied the judgments formed by speech-language pathology graduate student clinicians in comparison to practicing clinicians and attempted to capture whether listeners use an intermediary category. Using the 3-point scale, each listener independently judged each production as a correct production, a close approximation, or an incorrect production. The close approximation category reflected an incorrect production that was in the direction of the target. The graduate student clinicians did not consistently match the expert SLP ratings. Percent agreement among the graduate students varied by rating category and was 79.38% for correct productions, 71.31% for close approximations, and 60.82% for incorrect productions (Klein et al., 2012). The observation that the graduate student clinicians had greater difficulty distinguishing distortions from more involved errors suggests that varied degrees of /r/ misarticulation may be challenging for students to distinguish. The extent to which this finding generalizes to other speech sound errors is not known.

In contrast to assessing individual sound production, several small treatment studies involving children with childhood apraxia of speech (CAS) used a 3-point perceptual rating scale to measure whole word accuracy (Maas et al., 2012; Maas & Farinella, 2012; Strand et al., 2006). Within these works, the scale was used to monitor treatment progress using probe testing, which involved judging target words elicited in a random order either before or after a treatment session. Although this method does not capture the fine details yielded from transcription, it quickly offers information about performance accuracy and is believed to better illustrate acquisition compared to words produced within a practice session (Strand, 2020). The 3-point scale used in these studies included categories for accurate productions, close approximations, and inaccurate productions. Compared to binary ratings often used within the context of intervention (i.e., correct vs. incorrect), the 3-point scale offered the possibility of capturing productions that moved closer to the correct target, which can aid in clinical decision making.

Several studies reported interrater reliability between listeners using the 3-point rating scale to measure whole word accuracy (Maas et al., 2012; Maas & Farinella, 2012; Strand et al., 2006). In Strand et al. (2006), three experienced SLPs judged words produced by four children with CAS, with reliability between the three clinicians ranging from 70% to 100%. Similar results were reported in Maas et al. (2012), where two experienced SLPs who used the same 3-point rating scale had interrater reliability ranging from 71% to 94% when judging words produced by four children with CAS. Jing and Grigos (2021) explored the reliability of the 3-point rating scale in greater depth, with 30 experienced SLPs judging speech produced by typically developing (TD) children and children diagnosed with speech sound disorders (SSDs). Analyses were performed using Kappa statistics and interpreted using Landis and Koch (1977). The majority of SLPs (87%) reached substantial interrater reliability when compared to a consensus judgment formed by two SLPs with expertise in CAS. Accurate productions had the highest interrater reliability, followed by productions with extensive errors; productions with minor errors had the lowest agreement. Taken together, these studies illustrate that while the 3-point rating scale does not replace transcription, it can be a reliable and efficient tool for experienced SLPs to use when judging speech errors in children. It is unclear whether inexperienced listeners, such as graduate student clinicians, would yield similar reliability results.

An important variable that can influence performance on any rating task is the detail and clarity of the rating categories provided to the listeners. For instance, the graduate student clinicians in Klein et al. (2012) were given only a simple description of the rating categories. They were instructed to assign an intermediate rating to productions that were close to the target and an inaccurate rating to productions with a phonemic substitution or a severe distortion. The authors attributed the moderate reliability (61% agreement), in part, to difficulty distinguishing between varying degrees of distortion: the listeners did not consistently identify severe distortions as inaccurate and instead identified most distortions as intermediate (Klein et al., 2012). To alleviate such challenges, graduate students can be provided with a more detailed description of the rating categories, as has been done in previous studies (Jing & Grigos, 2021; Maas et al., 2012; Strand et al., 2006). This included instructing listeners to assign an intermediate rating to productions that included a consonant error with a change in only one distinctive feature or only one mild vowel distortion (Strand et al., 2006), providing examples of errors using narrow transcription (Maas et al., 2012), and offering training on using the rating scale prior to the experiment (Jing & Grigos, 2021). A similar approach is employed in the current work to explore whether graduate students can achieve high levels of reliability when using a 3-point rating scale with a more specific description of rating categories.

The primary objectives of this study were to (a) examine how consistent graduate student clinicians are in using a 3-point scale to judge speech errors produced by children and (b) explore whether consistency between listeners differed across ratings assigned to accurate productions, close approximations, and inaccurate productions. We captured ratings across a range of speech sound errors, in contrast to past work that focused on a narrow set of error patterns (Klein et al., 2012; Munson et al., 2012; Wolfe et al., 2003). The rationale for this approach is that children with SSD produce a wide variety of errors, which can be challenging to judge perceptually, particularly for graduate student clinicians with limited experience. Because clinicians rely heavily on their auditory perception, there is a need to better understand how graduate student clinicians develop these skills. We predicted that, on average, graduate students would reach moderate to substantial agreement, given previous findings of moderate reliability among graduate student ratings when specific rating criteria were not provided (Klein et al., 2012). We also anticipated that graduate students would have the highest agreement for accurate productions and the lowest agreement for close approximations.

Method

Participants

Listeners

This study included 24 female graduate student listeners between the ages of 22 and 55 years (M = 29.46, SD = 7.28). All graduate students were enrolled in a master’s program in Communicative Sciences and Disorders and had completed one course in phonetics and a graduate-level course on speech sound disorders in children. The latter course incorporated eight listening labs that focused on refining perceptual judgments. The students did not have any clinical experience involving direct client contact. All students reported having no history of hearing or speech and language impairment and spoke English as their first language. Informed consent was obtained from each participant.

Speakers

A total of 21 children participated as speakers. They were divided into three groups: children diagnosed with CAS, children diagnosed with other types of SSD, and typically developing children (CAS = 7, SSD = 7, TD = 7). The mean age (and standard deviation) of each group was 52.3 months (9.6) for the CAS group, 55.4 months (11.3) for the SSD group, and 54.6 months (10.1) for the TD group. All children were reported to be monolingual English speakers. These speakers were a subset from a previous study on speech production involving children with CAS and other types of SSD (Grigos et al., 2015). All children were diagnosed with CAS or SSD by two American Speech-Language-Hearing Association (ASHA)–certified SLPs with expertise in pediatric motor speech disorders. Detailed information on the diagnostic procedure and speech characteristics of the participants can be found in Grigos et al. (2015).

Rating Scale

This study used a 3-point scale adapted from past research investigating children diagnosed with CAS (Maas et al., 2012; Strand et al., 2006). A detailed description of the scale is displayed in Table 1.

Table 1.

Description of the 3-point rating scale (Maas et al., 2012; Strand et al., 2006).

Rating Definition Criteria
2 Accurate productions
  • These productions have NO error

1 Close approximations
  • These productions have ONLY ONE of the following errors
    • Mild vowel distortion: sounds like a bad example of the target vowel, or between the target and another vowel, but is NOT a vowel substitution
    • Mild consonant error: has a change in ONLY ONE feature that affects the place, manner, or voicing of the consonant
    • Excessive lengthening: a consonant or a vowel is lengthened excessively
    • Prosodic error in bisyllabic words: inaccurate stress, equal stress, or word segmentation in bisyllabic words
0 Inaccurate productions
  • These productions have any error that does not qualify for a 1 and/or have multiple errors. This includes but is not limited to:
    • Vowel substitution: sounds like a good exemplar of a different vowel
    • Consonant substitution: more than one feature is different from the target
    • Omission of sound/syllable
    • Addition of sound/syllable
    • Multiple errors: more than one sound error and/or a sound error plus a prosodic error

Stimuli

Two sets of stimuli were used for this study. Expert ratings for all stimuli were established by two experienced SLPs based on the 3-point rating scale. The two SLPs transcribed all stimuli using narrow transcription and conducted an error analysis based on the transcription (see Consensus Rating for details of this procedure). The first stimulus set was used in the training portion of the task. The target words included out, eat, hi, go, mop, hot, bed, daddy, happy, bunny, paperclip, bobbypin, and masterpiece. These target words elicited a wide range of error types, similar to those represented in the experimental stimuli. There were 36 training words evenly divided across the three levels of the rating scale (“2” rating = 12 words; “1” rating = 12 words; “0” rating = 12 words). The errored productions were also distributed across the error types listed in the description of the rating scale in Table 1. Productions with “1” and “0” ratings included error types that were similar to those in productions used in the experiment. All stimuli used in the training portion were produced by two children: one child with typically developing speech and language skills (age 12 years) and one child with an articulation disorder (age 10 years). These stimuli were produced in isolation and differed from those used in the experimental task to prevent any practice effect from the training session prior to the experiment.

The second stimulus set was used in the experimental task. The target words, drawn from Grigos et al. (2015), were Bob, Pop, Babybop, and Puppypop. The stimuli included CVC and CVCVCVC word shapes and consisted of bilabial phonemes so that lip and jaw movement could be visualized using facial tracking technology, which was a focus of Grigos et al. (2015). Even though the stimuli consisted of simple word shapes and bilabial phonemes, the productions included a wide range of errors.

The original set of stimuli included three productions of each target word (Bob, Pop, Babybop, Puppypop) from each child, which yielded 252 productions (21 children × 4 target words × 3 productions). Stimuli that were not suitable for the online listening task (e.g., background noise, low volume, overtalk) were excluded, reducing the set to 128 productions. These 128 stimuli were approximately balanced across the three ratings (“2” rating = 43 words; “1” rating = 45 words; “0” rating = 40 words). The distribution of all stimuli by speaker group is displayed in Table 2. All recordings were completed in a sound-attenuated booth in the Department of Communicative Sciences and Disorders at New York University using a digital minidisc recorder (SonyHHB50).

Table 2.

Distribution of training and experiment stimuli by children and group.

Task         Speaker group   “2”-rated stimuli   “1”-rated stimuli   “0”-rated stimuli   Grand total
Training     SSD             8                   8                   7                   23
             TD              4                   4                   5                   13
             Total           12                  12                  12                  36
Experiment   CAS             12                  21                  19                  52
             SSD             15                  7                   14                  36
             TD              16                  17                  7                   40
             Total           43                  45                  40                  128

Note. SSD = speech sound disorder; TD = typically developing; CAS = childhood apraxia of speech.

Consensus Rating

To establish an expert rating, two ASHA-certified SLPs specializing in pediatric speech disorders independently rated all stimuli using the 3-point scale: 0 = inaccurate, 1 = close approximation, 2 = accurate. The two SLPs had an average of 15 years of clinical experience working with pediatric motor speech disorders. Neither SLP had previously listened to the tokens, and both were unaware of identifying information about the children. They did not have access to acoustic information or transcriptions while completing the ratings. To avoid reliance on visual information to form judgments, the SLPs rated the tokens solely based on the auditory signal. Point-to-point agreement between the two SLP raters was 85.95%. The two raters then jointly reexamined any tokens on which they disagreed, and a consensus was reached for 100% of the disputed items.
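As a rough illustration of this reliability check, the sketch below (Python; not the authors’ code) computes point-to-point agreement as the proportion of tokens on which two raters assigned the same rating; the rating vectors are hypothetical.

```python
def point_to_point_agreement(ratings_a, ratings_b):
    """Percentage of tokens on which two raters assigned the same rating."""
    assert len(ratings_a) == len(ratings_b), "raters must score the same tokens"
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)

# Hypothetical ratings on the 3-point scale (2 = accurate, 1 = close approximation, 0 = inaccurate)
slp1 = [2, 1, 0, 2, 1, 0, 2, 2]
slp2 = [2, 1, 1, 2, 1, 0, 2, 0]
print(f"{point_to_point_agreement(slp1, slp2):.2f}%")  # 75.00% for this toy example
```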

Procedure

This study was approved by the New York University Committee on Activities Involving Human Subjects. All graduate student clinicians completed a consent form and a questionnaire online. The questionnaire was used to gain information about each student’s academic and clinical background and history of speech, language, or hearing impairment, as well as neurological disorders. Once the participants completed the questionnaire, they were directed to a website to begin the rating experiment. First, they were instructed to review a detailed description of the 3-point rating scale (see Table 1) and to keep a copy of the rating scale as a reference while they completed the training session and the experiment. They were also instructed to complete the training and the experiment in a quiet room while wearing headphones. Access to both tasks was provided through links made available by the experimenters. The delivery of the experiment and the collection of participants’ responses were handled by experiment presentation software written in JavaScript, Experigen (Becker & Levine, 2010), which was the same platform used in Jing and Grigos (2021).

The training session was presented in three separate blocks of 12 words. The target word was displayed on the screen along with a play button to play the audio stimulus. Participants listened to and rated the words, which were randomized by speaker and rating of the production. Participants were able to play each word as many times as needed. They completed the ratings based on the auditory signal without any visual information. They rated each word by choosing 0, 1, or 2 displayed on the screen and confirmed their answer by clicking “confirm.” After each response, feedback with a rationale for the correct answer was provided. The explanation was provided on every trial, even if the participant’s rating was accurate. Participants were required to achieve 80% or above on a training block to advance to the experiment. Participants who did not reach 80% reliability on the first training block completed the second training block and, if necessary, the third training block. All participants achieved 80% reliability by the third training block. This approach was used to ensure that all participants completing the experiment understood the rating scale and instructions correctly. All participants completed the experiment in one sitting. As in the training session, all stimuli presented in the experiment were randomized by speaker and rating. The participants rated the stimuli using the same steps as in the training session, but no feedback or rationale for the correct answer was provided.
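The training logic described above can be summarized in code form. The Python sketch below is purely illustrative (the actual task ran in Experigen/JavaScript, and the function names here are hypothetical stand-ins); it only mirrors the randomization, per-trial feedback, and 80% advancement criterion.

```python
import random

PASS_CRITERION = 0.80   # proportion of ratings that must match the expert rating
MAX_BLOCKS = 3          # training blocks of 12 words each

def run_training_block(block_stimuli, expert_ratings, collect_rating, show_feedback):
    """Present one block in random order and return the proportion of correct ratings.

    collect_rating and show_feedback stand in for the interface calls that gather
    a listener's 0/1/2 response and display the rationale after each trial.
    """
    stimuli = block_stimuli[:]
    random.shuffle(stimuli)                           # randomized by speaker and rating
    correct = 0
    for token in stimuli:
        response = collect_rating(token)              # listener chooses 0, 1, or 2
        if response == expert_ratings[token]:
            correct += 1
        show_feedback(token, expert_ratings[token])   # rationale shown on every trial
    return correct / len(stimuli)

def run_training(blocks, expert_ratings, collect_rating, show_feedback):
    """Advance to the experiment once any block reaches the 80% criterion."""
    for block in blocks[:MAX_BLOCKS]:
        score = run_training_block(block, expert_ratings, collect_rating, show_feedback)
        if score >= PASS_CRITERION:
            return True
    return False   # in the study, all participants passed by the third block
```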

Statistical Analysis

Interrater agreement between each graduate student clinician and the expert SLP rating was measured using the Kappa statistic. Kappa values with quadratic weighting were used to compare the variability between pairs of ratings (e.g., reliability between “0” and “1” ratings) to the overall variability of the data set (Vanbelle, 2016). Each participant’s Kappa value was examined across the rating categories, and the values were interpreted based on guidelines by Landis and Koch (1977), where > .81 = excellent agreement, .61–.80 = substantial agreement, .41–.60 = moderate agreement, and < .41 = fair to poor agreement.
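A minimal sketch of this step is shown below, assuming the analysis were run in Python with scikit-learn rather than the authors’ own software; the rating vectors are hypothetical, and only the quadratic weighting and the Landis and Koch (1977) bands come from the text.

```python
from sklearn.metrics import cohen_kappa_score

def interpret_kappa(k):
    """Landis and Koch (1977) bands as applied in this study."""
    if k > 0.80:
        return "excellent"
    if k > 0.60:
        return "substantial"
    if k > 0.40:
        return "moderate"
    return "fair to poor"

# Hypothetical ratings on the 3-point scale for a handful of tokens
expert  = [2, 1, 0, 2, 1, 0, 2, 1, 0, 2]
student = [2, 1, 1, 2, 2, 0, 2, 1, 0, 1]

kappa = cohen_kappa_score(student, expert, weights="quadratic")
print(f"Quadratically weighted kappa = {kappa:.2f} ({interpret_kappa(kappa)})")
```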

Percentage agreement between the graduate students and the expert SLP rating for each rating category was examined by first calculating each student’s percentage agreement with the expert SLP rating for each token. The average percentage agreement across the graduate students was then calculated to determine a mean percentage agreement for each rating category. A one-way analysis of variance (ANOVA) and post hoc comparisons using Tukey’s honestly significant difference test were conducted to compare graduate students’ percentage agreement across the three rating categories.
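The sketch below illustrates, under the same assumptions as above (Python; not the authors’ code), how agreement values grouped by expert rating category could be compared with a one-way ANOVA followed by Tukey’s HSD. The agreement arrays are hypothetical; treating the token as the unit of analysis is an inference from the degrees of freedom reported in the Results.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical per-token agreement values (proportion of students matching the
# expert rating on each token), grouped by the expert rating category.
agree_0 = np.array([0.50, 0.42, 0.71, 0.58, 0.46])   # tokens rated "0" by the experts
agree_1 = np.array([0.54, 0.63, 0.50, 0.58, 0.54])   # tokens rated "1" by the experts
agree_2 = np.array([0.83, 0.92, 0.79, 0.75, 0.88])   # tokens rated "2" by the experts

# One-way ANOVA across the three rating categories
f_stat, p_val = f_oneway(agree_0, agree_1, agree_2)
print(f"F = {f_stat:.2f}, p = {p_val:.4f}")

# Tukey's honestly significant difference test for pairwise comparisons
scores = np.concatenate([agree_0, agree_1, agree_2])
groups = ["0"] * len(agree_0) + ["1"] * len(agree_1) + ["2"] * len(agree_2)
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```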

Results

Interrater agreement between each graduate student and the expert SLP rating was calculated across all rating categories. Table 3 displays the Kappa value, standard error, and 95% confidence interval for each graduate student clinician. The agreement level across the graduate students varied, ranging from .51 to .73 (moderate to substantial agreement). On average, graduate students showed a mean Kappa value of .62 (SD = 0.06), reaching substantial agreement with the expert SLP rating.

Table 3.

Reliability of each graduate student’s rating with experienced speech-language pathologist rating.

Participant Kappa SE 95% CI
P1 .66 0.07 [0.53, 0.80]
P2 .62 0.07 [0.48, 0.76]
P3 .55 0.07 [0.42, 0.68]
P4 .62 0.06 [0.49, 0.74]
P5 .52 0.07 [0.38, 0.65]
P6 .67 0.06 [0.55, 0.79]
P7 .66 0.06 [0.54, 0.78]
P8 .62 0.06 [0.49, 0.74]
P9 .58 0.07 [0.45, 0.72]
P10 .71 0.06 [0.58, 0.83]
P11 .64 0.06 [0.52, 0.76]
P12 .60 0.07 [0.47, 0.73]
P13 .60 0.06 [0.49, 0.72]
P14 .53 0.07 [0.40, 0.66]
P15 .51 0.08 [0.35, 0.66]
P16 .68 0.07 [0.55, 0.82]
P17 .73 0.06 [0.62, 0.85]
P18 .58 0.07 [0.43, 0.72]
P19 .71 0.06 [0.60, 0.82]
P20 .66 0.06 [0.54, 0.79]
P21 .69 0.06 [0.58, 0.80]
P22 .53 0.07 [0.40, 0.66]
P23 .65 0.07 [0.51, 0.79]
P24 .66 0.07 [0.53, 0.79]

Note. SE = standard error; CI = confidence interval.

Percentage agreement for each rating category was calculated, and these scores were averaged across the graduate students to yield a mean percentage agreement score for each rating category. On average, the graduate students agreed with the expert SLPs’ rating for 54.27% of inaccurate productions (“0” ratings), 55.74% of close approximations (“1” ratings), and 80.71% of accurate productions (“2” ratings). Agreement ranged from 4.17% to 100% for inaccurate productions, 4.17% to 91.67% for close approximations, and 29.17% to 100% for accurate productions. The graduate students reached the highest percentage agreement with the expert rating for accurate productions and the lowest percentage agreement for inaccurate productions. Table 4 displays the mean percentage agreement (and standard deviation) of graduate students’ ratings with experienced SLPs’ ratings across the rating categories.

Table 4.

Mean percentage agreement (standard deviation) of graduate student clinicians and expert speech-language pathologist rating.

Expert rating   Graduate student rating “0”   Graduate student rating “1”   Graduate student rating “2”
0               54.27% (29.06)                33.39% (22.06)                10.58% (17.23)
1               11.02% (16.13)                55.74% (24.74)                33.24% (29.31)
2               2.52% (5.48)                  16.77% (13.79)                80.71% (16.90)

A one-way ANOVA revealed that the percentage agreement between the graduate student clinicians and expert SLPs was significantly different across the rating categories, F(2, 125) = 14.95, p < .001. A post hoc Tukey test revealed significantly higher percentage agreement for “2” rated words than “0” rated words, p < .001, and for “2” rated words than “1” rated words, p < .001, but no significant difference between “1” rated words and “0” rated words, p = .97. ANOVA and post hoc results are provided in Tables 5 and 6.

Table 5.

One-way analysis of variance of graduate student clinicians’ percentage agreement.

Source of variation SS df MS F p value F crit
Between groups 1.761 2 0.881 14.953 < .001 3.068689
Within groups 7.361 125 0.059
Total 9.122 127

Note. SS = sum of squares; MS = mean squares; F crit = F critical value.

Table 6.

Multiple comparisons of percentage agreement for three rating categories (0, 1, 2).

(I) consensus   (J) consensus   Mean difference (I−J)   SE      Sig.    95% CI lower bound   95% CI upper bound
0               1               0.00296                 0.052   0.998   −0.1221              0.1280
0               2               −0.24676*               0.053   0.000   −0.3732              −0.1203
1               0               −0.00296                0.052   0.998   −0.1280              0.1221
1               2               −0.24972*               0.051   0.000   −0.3725              −0.1270
2               0               0.24676*                0.053   0.000   0.1203               0.3732
2               1               0.24972*                0.051   0.000   0.1270               0.3725

Note. CI = confidence interval; SE = standard error; Sig = significance.

* The mean difference is significant at the .05 level.

Discussion

This study examined how accurately and consistently graduate student clinicians judge children’s speech errors using a 3-point rating scale. We also explored whether the graduate students’ percentage agreement with the expert SLP rating varied between rating categories (i.e., accurate productions, close approximations, and inaccurate productions). Graduate students reached substantial agreement with the expert rating on average and demonstrated the highest percentage agreement when rating accurate productions and the lowest agreement when rating inaccurate productions. As predicted, the level of agreement between the graduate students and the expert rating ranged from moderate to substantial.

This work has important clinical implications for graduate students in communicative sciences and disorders in that it shows that the 3-point rating scale is a moderately reliable tool for graduate students to use. A 3-point rating scale quickly yields information on articulatory accuracy at a whole-word level. Hence, it can provide an opportunity for graduate student clinicians to efficiently quantify a child’s progress before and after treatment. Alternative measures such as narrow transcription and percentage of consonants correct provide specific information that can be used to make more detailed clinical decisions. While a 3-point rating scale does not replace these measures, it is a time-efficient way for graduate student clinicians to reliably assess child speech in a clinical setting.

One of the aims of this study was to examine whether using this scale with more detailed criteria to define each rating category would improve graduate student clinicians’ accuracy, as has been done with experienced SLPs (Maas et al., 2012; Strand et al., 2006). In Klein et al. (2012), graduate students used a 3-point scale to judge children’s /r/ productions. Their Kappa values ranged from .26 to .66 (poor to substantial agreement) and averaged .57 (moderate agreement). In the current study, graduate students used the scale with specific criteria and showed higher Kappa values, which ranged from .51 to .73 (moderate to substantial agreement) and averaged .62 (substantial agreement). These results suggest that one way for clinicians with limited experience to improve their accuracy and consistency in judging speech productions is to have specific criteria that aid listeners in distinguishing between the rating categories. This is important because clinicians must accurately and consistently judge their clients’ speech to reliably assess progress during treatment.

Although graduate students reached substantial agreement on average, they did not perform as well as experienced SLPs in previous studies that used a similar 3-point scale to judge a range of errors produced by children with CAS (Maas et al., 2012; Strand et al., 2006). Recall that interrater reliability ranged from 70% to 100% in Strand et al. (2006) and from 71% to 94% in Maas et al. (2012). This suggests that clinicians with more experience perform better in perceptually judging speech accuracy than those with less experience, which is consistent with findings of past studies (Munson et al., 2012; Wolfe et al., 2003).

Another aim of this study was to examine whether consistency between listeners differed across ratings assigned to accurate productions, close approximations, and inaccurate productions. Although graduate students achieved their highest agreement when rating accurate productions (80.71%), they reached only 54.27% agreement when rating inaccurate productions and 55.74% when rating close approximations. Similarly, Jing and Grigos (2021) reported that a group of experienced SLPs reached their highest agreement (79.2%) when rating accurate productions but lower agreement when rating inaccurate productions (66.63%) and close approximations (58.22%). These results support our prediction that it is more difficult for listeners to perceptually judge speech accuracy within inaccurate productions than in accurate productions and are consistent with past research (Klein et al., 2012; Sharf & Ohde, 1983). The current work differed from these studies, however, in that listeners were provided with specific criteria to define each rating category and were tasked with rating a range of speech errors, not only those involving consonantal and vocalic /r/.

The wide range of speech errors that graduate students were tasked to judge in this study is one possible reason for their lower percentage agreement when rating inaccurate productions and close approximations, compared to accurate productions. These errors are challenging to judge perceptually, especially for individuals with limited clinical experience. One example is differentiating between a vowel distortion and a vowel substitution. Students were instructed to give a close approximation rating to productions with a mild vowel distortion and an inaccurate rating to productions with a vowel substitution. Making these distinctions and forming consistent judgments require training that helps listeners tune into fine details of the vowel production. The same could be said about the perception of consonant errors. A close approximation rating was given to productions with a consonant error involving a change in only one distinctive feature, and an inaccurate rating was assigned to productions that involved a change in more than one distinctive feature. While the graduate students all completed coursework that included instruction on distinctive features, their ability to apply that information within a listening task likely requires more practice.

Since the accuracy and consistency of clinical judgments are expected to improve with greater exposure to disordered speech, the findings of this study highlight the need for graduate students to receive perceptual training as part of their graduate coursework to refine their clinical judgments. Future research should explore methods that can further improve graduate students’ perception of errored speech. One possible direction is to examine whether particular error types within inaccurate productions and close approximations were especially challenging for the graduate students to judge. Jing and Grigos (2021) explored whether error type influences listeners’ accuracy in judging child speech and showed that experienced SLPs reached their highest agreement when rating voicing errors and lowest agreement when rating vowel distortions. Such an error type analysis has not been conducted with graduate students and would illuminate whether clinical training should place greater emphasis on certain error patterns over others.

There are several limitations of this work. First, all participants completed both the training and the experimental task remotely through an online platform. Even though they were instructed to complete the tasks in a quiet place and use headphones, the environment in which they completed the tasks and the quality of headphones they used were not controlled by the experimenters. As a result, it is possible that the participants did not comply with the instructions to wear headphones and that the auditory signals varied in quality, which may have influenced their ratings. Second, although all participants reported having normal hearing, no hearing screening was conducted by the experimenters to confirm this information.

Another limitation of the study was that the stimuli were comprised of simple word shapes and bilabial phonemes. Although these stimuli included a wide range of errors, they are not representative of the range of error types that may be commonly seen in children with CAS and other types of SSD. Further research involving stimuli that include a wider range of error types and more complex word shapes is needed to examine the reliability of graduate students using this scale to rate child speech.

Additionally, all listeners in this study rated the stimuli solely based on the auditory signal without any visual information. This decision was made to avoid reliance on visual information in forming auditory-perceptual judgments. While visual information can aid in ratings, it can also mislead listeners. For instance, a listener who observed visible labial closure for a plosive may believe that they heard a word-final plosive, even if the plosive was not released. To avoid such situations, this study involved auditory-perceptual ratings without access to visual information. In a real-world clinical setting, however, clinicians have access to both visual and auditory information. They often rely on visual information as they make auditory-perceptual judgments, which may influence the accuracy of their judgments. To explore this possibility, future research should examine whether ratings would differ when visual information is provided to the listeners.

Acknowledgments

This research was partially supported by National Institute on Deafness and Other Communication Disorders Grants R03DC009079 and R01DC018581 awarded to Maria I. Grigos and by the NYU Steinhardt Department of Communicative Sciences and Disorders Award to Honors Students awarded to Seyoung Jung. The authors acknowledge the graduate students who participated in this study. Special appreciation is extended to Susannah Levi for her useful comments, guidance, and support, as well as to members of the New York University Motor Speech Lab and Honors Research Seminar for their support throughout the project.

Footnotes

Disclosure: The authors have declared that no competing financial or nonfinancial interests existed at the time of publication.

Data Availability Statement

The data sets generated during and/or analyzed during this study are available from the corresponding author on reasonable request.

References

  1. Becker M, & Levine J (2010). Experigen: An online experiment platform. https://github.com/tlozoot/experigen
  2. Grigos MI, Moss A, & Lu Y (2015). Oral articulatory control in childhood apraxia of speech. Journal of Speech, Language, and Hearing Research, 58(4), 1103–1118. 10.1044/2015_JSLHR-S-13-0221
  3. Jing L, & Grigos MI (2021). Speech-language pathologists’ ratings of speech accuracy in children with speech sound disorders. American Journal of Speech-Language Pathology, 31(1), 419–430. 10.1044/2021_AJSLP-20-00381
  4. Klein HB, Grigos MI, McAllister Byun T, & Davidson L (2012). The relationship between inexperienced listeners’ perceptions and acoustic correlates of children’s /r/ productions. Clinical Linguistics & Phonetics, 26(7), 628–645. 10.3109/02699206.2012.682695
  5. Landis JR, & Koch GG (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. 10.2307/2529310
  6. Maas E, Butalla CE, & Farinella KA (2012). Feedback frequency in treatment for childhood apraxia of speech. American Journal of Speech-Language Pathology, 21(3), 239–257. 10.1044/1058-0360(2012/11-0119)
  7. Maas E, & Farinella KA (2012). Random versus blocked practice in treatment for childhood apraxia of speech. Journal of Speech, Language, and Hearing Research, 55(2), 561–578. 10.1044/1092-4388(2011/11-0120)
  8. Munson B, Johnson JM, & Edwards J (2012). The role of experience in the perception of phonetic detail in children’s speech: A comparison between speech-language pathologists and clinically untrained listeners. American Journal of Speech-Language Pathology, 21(2), 124–139. 10.1044/1058-0360(2011/11-0009)
  9. Sharf DJ, & Ohde RN (1983). Perception of distorted “R” sounds in the synthesized speech of children and adults. Journal of Speech and Hearing Research, 26(4), 516–524. 10.1044/jshr.2604.516
  10. Strand EA (2020). Dynamic temporal and tactile cueing: A treatment strategy for childhood apraxia of speech. American Journal of Speech-Language Pathology, 29(1), 30–48. 10.1044/2019_AJSLP-19-0005
  11. Strand EA, Stoeckel R, & Baas B (2006). Treatment of severe childhood apraxia of speech: A treatment efficacy study. Journal of Medical Speech-Language Pathology, 14(4), 297–307.
  12. Vanbelle S (2016). A new interpretation of the weighted kappa coefficients. Psychometrika, 81(2), 399–410. 10.1007/s11336-014-9439-4
  13. Wolfe V, Martin D, Borton T, & Youngblood HC (2003). The effect of clinical experience on cue trading for the /r-w/ contrast. American Journal of Speech-Language Pathology, 12(2), 221–228. 10.1044/1058-0360(2003/068)
